It strikes me that trying to find the "true capability" of an LLM independent of a benchmarked assessment is akin to finding the "intrinsic value" of a good independent of its price...which is to say, challenging!
Agreed that it's hard (perhaps impossible!).
One approach is to look at other criteria of interest (like the Schaeffer paper I cited at the end: https://arxiv.org/pdf/2502.18339) and see whether and how those correlate with the more static "capability" benchmarks. Do they correlate at all? If so, is the relationship linear, etc.? That's pretty much the only practical advice I can think of here.
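To make that concrete, here's a minimal sketch in Python of the kind of check I have in mind (the scores below are made up, not from the paper): line the models up, score each on the static benchmark and on the other criterion, and compare a linear measure of association (Pearson) with a rank-based one (Spearman).

```python
# Minimal sketch of the correlation check described above.
# The per-model scores are invented, purely for illustration.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model scores, in the same model order for both lists.
static_benchmark = [0.45, 0.62, 0.71, 0.83, 0.90]  # e.g., a fixed "capability" benchmark
other_criterion = [0.30, 0.55, 0.58, 0.80, 0.84]   # e.g., a downstream criterion you actually care about

r, p_linear = pearsonr(static_benchmark, other_criterion)    # linear association
rho, p_rank = spearmanr(static_benchmark, other_criterion)   # monotonic (rank) association

print(f"Pearson r = {r:.2f} (p = {p_linear:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rank:.3f})")

# If Spearman is high but Pearson is noticeably lower, the two measures move
# together but not linearly; if both are near zero, the benchmark may not be
# tracking the criterion at all.
```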
More generally, though, my point is just that we fundamentally don't know what exactly we're assessing—not just in terms of construct validity broadly but in terms of the specific numbers we're giving systems ("This one got 90%, this one got 45%", etc.). As I think we've discussed before, this is of course not just an issue with LLM benchmarks but an issue with measurement more generally.