It strikes me that trying to find the "true capability" of an LLM independent of a benchmarked assessment is akin to finding the "intrinsic value" of a good independent of its price...which is to say, challenging!
Agreed that it's hard (perhaps impossible!).
One approach is to look at other criteria of interest (like the Schaeffer paper I cited at the end: https://arxiv.org/pdf/2502.18339) and see whether and how those correlate with the more static "capability" benchmarks. Do they correlate at all? If so, is the relationship linear, etc.? That's pretty much the only practical advice I can think of here.
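To make that concrete, here's a minimal sketch in Python of the kind of check I have in mind (the scores below are made up, not from the paper): line the models up, score each on the static benchmark and on the other criterion, and compare a linear measure of association (Pearson) with a rank-based one (Spearman).

```python
# Minimal sketch of the correlation check described above.
# The per-model scores are invented, purely for illustration.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model scores, in the same model order for both lists.
static_benchmark = [0.45, 0.62, 0.71, 0.83, 0.90]  # e.g., a fixed "capability" benchmark
other_criterion = [0.30, 0.55, 0.58, 0.80, 0.84]   # e.g., a downstream criterion you actually care about

r, p_linear = pearsonr(static_benchmark, other_criterion)    # linear association
rho, p_rank = spearmanr(static_benchmark, other_criterion)   # monotonic (rank) association

print(f"Pearson r = {r:.2f} (p = {p_linear:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rank:.3f})")

# If Spearman is high but Pearson is noticeably lower, the two measures move
# together but not linearly; if both are near zero, the benchmark may not be
# tracking the criterion at all.
```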
More generally, though, my point is just that we fundamentally don't know what exactly we're assessing—not just in terms of construct validity broadly but in terms of the specific numbers we're giving systems ("This one got 90%, this one got 45%", etc.). As I think we've discussed before, this is of course not just an issue with LLM benchmarks but an issue with measurement more generally.