An overlooked problem with LLM benchmarks
Proxies for a capability may not map linearly onto the capability itself.
Figuring out what large language models (LLMs) can and can’t do is really important. It matters for pricing the use of a model, or at least it should.1 It also matters theoretically—an LLM’s performance on a task can inform debates in Cognitive Science about where certain human capacities come from. And perhaps most importantly, it also matters for determining whether it’s even safe to deploy a model in the first place, which is why much of the work on evaluating LLM capabilities is done by AI safety researchers. Unfortunately, accurately assessing a model’s capabilities is very hard!
Much has already been written on this topic, including by me, and I don’t want to rehash it all here. If you’re looking for a solid critique of current benchmarking practices, I think this 2021 paper by Deborah Raji and co-authors is a great place to start. I also recommend this 2024 post by Tim Lee on how measuring intelligence isn’t easy. The through-line of much of this work is that many benchmarks suffer from a construct validity problem: it’s not necessarily clear that the tasks used to assess an LLM are good operationalizations of the capabilities they were designed to assess. Various solutions have been proposed, including the use of private benchmarks and more emphasis on tasks that require interactive problem-solving (like ClaudePlaysPokemon).
In this post, however, I want to address a different problem with benchmarks that tends to go overlooked: the scores obtained from a benchmark may not map linearly onto the underlying capability, such that the same difference between two scores (e.g., the gap between 45 and 55 vs. the gap between 85 and 95) may not mean the same thing. It’s a measurement problem, and while it sounds simple, it has deep implications for how we assess and forecast LLM progress.
Put a number on it
Most of the capabilities LLM researchers are interested in are pretty abstract. We can refer to constructs like “Theory of Mind”, “reasoning”, or even “general intelligence”—but those constructs don’t map neatly onto observable reality. This makes it hard to talk concretely about a given model’s “reasoning” ability, and even harder to compare models in terms of this ability.
In principle, a benchmark makes those things easier by giving each model a score. The idea is that then we can directly compare models in terms of their scores, e.g., “Model X achieved an 85, while Model Y achieved a 90—therefore Model Y is better”. We can even compare models to human performance on that benchmark, allowing us to say things like: “Model Y has now surpassed the human average on this task”. We can also use those scores to make forecasts about future changes in model performance: “Model X achieved an 85 with this scale and this training data, and we anticipate that Model Y will achieve a 90 with that scale and that training data.”
There are a bunch of assumptions here. One is that the construct in question can be represented by a scalar value in the first place.2 Another is that human performance can be distilled into a single number and that this number is meaningful, whether it’s the mean, the median, or the maximum of a distribution. And the third assumption is that any given score—whether it’s 55 or 95—maps onto the underlying construct in roughly the same way. That is, scoring a 65 is approximately “10 points better” than scoring a 55, and scoring a 75 is approximately “10 points better” than scoring a 65. You might even further assume that ratios on this scale are meaningful, e.g., a 50 is “twice as good” as a 25.
Any of those assumptions might be false, but it’s the third one that (in my experience) is most frequently overlooked in discussions of model capabilities.
Six scenarios
To illustrate the problem here, it’s useful to consider a few different scenarios where this key assumption may or may not hold.
Let’s say we have a scale that measures model abilities from 1-100 (“Measured Capability”). Of course, we’re interested in a model’s actual capability (“True Capability”), and this scale is just a proxy for that; for the sake of this illustration, let’s say that True Capabilities also occupy a scale from 1-100.3 The capability itself doesn’t matter here: you could call it Theory of Mind, reasoning, general intelligence—whatever seems most compelling.
Crucially, there are a number of ways in which Measured Capability might map onto True Capability.
One possibility is that there’s simply no meaningful relationship. In this Null Scenario, LLMs vary both in their measured and actual capabilities, but the former just doesn’t really carry any information about the latter. A model might score a 95 on our scale but actually be much worse than a model that scores a 45. That’s a tough situation to be in, as it means the scale is effectively useless—so making decisions based on that scale would be a bad move.
Another possibility is that the relationship is basically linear with a little bit of noise. The Linear Scenario is what we want, and it’s also implicitly (or explicitly) what’s typically assumed about a scale. In this world, it’s meaningful to say that scoring a 65 is better than scoring a 55. And further, the same interval means the same thing: a model that scores a 65 is 10 points better than a model that scores a 55, which in turn is 10 points better than a model that scores a 45. This is the ideal world for making decisions about whether (for example) to release a model based on its capabilities, as well as for making predictions about how model capabilities will change in the future.
In my experience, the Null and Linear scenarios are the ones typically assumed in discourse on capabilities assessments: either the benchmark is meaningful (and linearly related to the capability) or it means nothing at all. But there are many other possibilities!

In the Categorical Scenario, the true capability is basically binary: you either have “it” or you don’t. And the scale does tell us something about that capability—scores above a 50 (say) suggest you’ve got it, and scores below a 50 suggest you don’t. But it’s also not linear: in this scenario, the difference between a 55 and a 45 is extremely important, while the difference between a 65 and a 55 (or even a 95 and a 55!) means nothing at all.
The other three scenarios all involve positive, monotonic relationships between the scale and the true capability—but again, changes in the scale don’t map linearly onto the underlying capability. And these differences aren’t just theoretical. If you’re trying to figure out which “AI Timeline” we’re in, the actual shape of the function matters! For example, in the Logarithmic Scenario, further improvements in measured performance correspond to diminishing returns in actual capabilities.4 The Sigmoid Scenario looks a bit more like the Categorical Scenario, with two “plateaus” in actual capabilities and a rapid rise in between. And finally, the Exponential Scenario depicts a case where further improvements in measured performance actually correspond to increasing returns in actual capabilities.
Put more simply: in some cases (e.g., the Logarithmic Scenario), our benchmark might lead us to overestimate further improvements, while in others (e.g., the Exponential Scenario), it might lead us to underestimate them. That’s a really fundamental difference, and it’s also deeper than the more obvious distinction between a benchmark that maps linearly onto the capability and a benchmark that tells us nothing at all.
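To make that contrast concrete, here’s a minimal sketch (in Python) of how the same 10-point gain in Measured Capability could translate into very different gains in True Capability under each scenario. The functional forms are toy assumptions invented for this illustration, not estimates of any real benchmark’s behavior.

```python
import numpy as np

# Hypothetical mappings from Measured Capability (1-100) to True Capability (1-100).
# These exact curves are made up purely for illustration.
def linear(m):
    return float(m)

def categorical(m):
    return 100.0 if m >= 50 else 1.0  # you either have "it" or you don't

def logarithmic(m):
    return 1 + 99 * np.log(m) / np.log(100)  # diminishing returns

def sigmoid(m):
    return 1 + 99 / (1 + np.exp(-0.15 * (m - 50)))  # two plateaus, rapid rise in between

def exponential(m):
    return 1 + 99 * (np.exp(m / 20) - np.exp(1 / 20)) / (np.exp(5) - np.exp(1 / 20))  # increasing returns

scenarios = {"Linear": linear, "Categorical": categorical,
             "Logarithmic": logarithmic, "Sigmoid": sigmoid, "Exponential": exponential}

# The same 10-point jump in measured score, at different points on the scale...
intervals = [(45, 55), (55, 65), (85, 95)]
for name, f in scenarios.items():
    gains = ", ".join(f"{lo}->{hi}: +{f(hi) - f(lo):.1f}" for lo, hi in intervals)
    print(f"{name:12s} {gains}")
```

With these made-up curves, the jump from 85 to 95 buys much less true capability than the jump from 45 to 55 under the Logarithmic mapping, far more under the Exponential mapping, and nothing at all under the Categorical mapping (where only the jump that crosses 50 matters).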
Where do we go from here?
As I mentioned earlier, there’s been lots of discussion about how hard it is to measure things like “reasoning ability” or “general intelligence” in LLMs—and whether that’s even a coherent goal. But I haven’t seen this particular problem articulated in quite this way with respect to assessing LLM capabilities.5 It really does strike me as a challenging and important problem, particularly when it comes to forecasting AI timelines.
The issue is that I’m not really sure where we go from here.
If we knew a model’s true capabilities, we could simply calculate how variance in the benchmark mapped onto variance in the true capability. But we don’t know a model’s actual capabilities—that’s why we need the benchmarks!6
So we’re left with a difficult puzzle. I can think of only two admittedly unsatisfying recommendations. The first is the milquetoast recommendation that typically accompanies proclamations of measurement deficiencies in a field: to “be more careful about one’s claims”. It’s clearly true, and it’s something everyone should keep in mind, but it’s pretty vague and it’s also the kind of thing that’s easily ignored.
The second is something we should be doing anyway, which is empirically validating benchmarks. In Psychology and other disciplines, our psychometric instruments are validated by comparing them to other related measures (convergent validity) or to outcomes we think they should correlate with (predictive validity). A recent preprint did this with LLM benchmarks by correlating performance on static evaluations with human judgments about which LLM output they preferred. The good news is that the benchmarks were pretty strongly correlated with human preferences:
Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences.
Of course, this is only one outcome measure, and it’s not necessarily the one we might care about the most—especially if what we’re concerned with is something like the safety of deploying an LLM in the wild. For that, we’ll probably need more ecologically valid tasks that more closely index what we hope to measure with benchmarks, which in turn would allow us to determine which benchmarks predict performance on those more ecologically valid tasks. Most relevant to this post: there are more ways that these outcomes could relate to each other than “linearly” and “not at all”. Researchers ought to consider the variety of relationships that could exist between their measurement instrument and the construct in question, and do their best to actually identify the shape of that function through empirical validation of their instrument.
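As a sketch of what that empirical validation could look like: given benchmark scores and scores on some more ecologically valid outcome for the same set of models, we could fit a few candidate shapes and see which one the data prefer. Everything below is synthetic and hypothetical—the data, the variable names, and the candidate curves are placeholders, not references to any real evaluation.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic stand-ins: benchmark scores and scores on a more ecologically valid
# outcome measure for the same set of models.
rng = np.random.default_rng(0)
benchmark = np.sort(rng.uniform(20, 95, size=40))
outcome = 100 / (1 + np.exp(-0.12 * (benchmark - 60)))   # pretend the true shape is sigmoidal
outcome = outcome + rng.normal(0, 3, size=benchmark.size)  # plus measurement noise

# Candidate shapes for the benchmark -> outcome relationship, with rough starting values.
candidates = {
    "linear":      (lambda x, a, b: a * x + b,                         [1.0, 0.0]),
    "logarithmic": (lambda x, a, b: a * np.log(x) + b,                 [1.0, 0.0]),
    "sigmoid":     (lambda x, a, b, c: a / (1 + np.exp(-b * (x - c))), [100.0, 0.1, 60.0]),
}

def aic(y, y_hat, k):
    """Akaike information criterion under a Gaussian error model (lower is better)."""
    n = y.size
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

for name, (f, p0) in candidates.items():
    params, _ = curve_fit(f, benchmark, outcome, p0=p0, maxfev=10000)
    print(f"{name:12s} AIC = {aic(outcome, f(benchmark, *params), len(p0)):.1f}")
```

In this toy setup the sigmoid wins (because that’s how the data were generated), but the point is the workflow: if one shape clearly fits better than the others, that tells you how to read further gains on the benchmark.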
1. In principle, it makes sense that companies would charge more money for the use of more “capable” models (assuming we had reliable ways to assess those capabilities).
2. Or, more generously, by a vector of values (e.g., a multi-dimensional construct).
3. As we’ll see below, they may not always occupy the full scale.
4. I’ve normalized the y-axis to be between 1-100 here, but you could imagine the curve plateauing at a lower level.
5. If any readers know of examples, I’d love to see them! The closest I’ve seen actually concerns the debate around IQ testing for humans and whether variance above a certain score fails to provide meaningful information about the “intelligence” construct itself (see this thread for more details); that, of course, rests on the deeper assumption that this is something that can be measured at all and that IQ tests are an appropriate way to do it.
6. It strikes me that trying to find the “true capability” of an LLM independent of a benchmarked assessment is akin to finding the “intrinsic value” of a good independent of its price...which is to say, challenging!