This year, I taught an introductory Research Methods class for the first time. I’ve taught related courses and topics (e.g., statistical modeling) but Research Methods starts from the ground floor. The high-level goal is to teach students how to be better producers and consumers of research—and that involves building up the foundations of what constitutes good empirical practices. In particular: what makes a scientific claim valid?
The tl;dr of the course is that any given claim—whether it’s a claim about frequency, association, or causation—can be evaluated with respect to four distinct validities:
internal validity: have you eliminated potential confounds?
external validity: do these results generalize?
statistical validity: do the numbers support the claim?
construct validity: are you measuring what you think you’re measuring?
During the course, I mentioned that construct validity is my favorite validity. The reason is that it’s vital for every type of claim. No matter what you’re studying, you’ve got to think carefully about how to operationalize the thing you’re ultimately interested in (i.e., the “construct”).
Construct validity is also one of the easiest validities to forget. In our rush to connect empirical data with a big-picture question, we sometimes forget that what we’re measuring is not identical to the underlying construct.
Why is construct validity so challenging?
Construct validity rears its head whenever the thing we want to study can’t really be measured directly. A “construct” is essentially an abstraction: an idealized concept that we think might be “real” and important in some sense, but which can only be measured via proxies.
This is especially pernicious in the social and behavioral sciences, where researchers are often interested in constructs that we have everyday words for, but which are also so big and all-encompassing that it’s hard to agree on how to assess them.
“Happiness” is the paragon example of such a construct. It’s clearly important: psychologists, sociologists, and economists alike might all agree that being happy is an important component—perhaps the most important component—of a good life. Accordingly, researchers might wish to spend their time identifying the social, psychological, and economic correlates of happiness. We might want to answer questions like:
How happy are people on average?
Is personal wealth linked to happiness?
Does learning a new language as an adult make you happier?
Answering these questions is hard for a number of reasons. Foremost among them is that it’s unclear how, exactly, we ought to measure “happiness”. This is equally true of all sorts of concepts we have words for but are hard to nail down empirically: “empathy”, “intelligence”, “athleticism”, “stress”, and so on. In each case, researchers have to make decisions about how to operationalize the construct in question.
Subjective well-being: a detour and example
One way to measure things is simply to ask people directly.
For example, researchers have devised a number of surveys that include questions about life satisfaction, work satisfaction, fulfillment, and emotional health. These are measures of subjective well-being: when asked, what do people say about how they feel about their lives? This approach allows researchers to put a number on happiness, which in turn allows for comparison across individuals (e.g., in terms of their socioeconomic status) or even countries (e.g., the World Happiness Report).
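To make this concrete, here’s a minimal sketch of what “putting a number on happiness” usually amounts to: averaging a respondent’s answers to a few Likert-style items. The items, responses, and scale below are invented purely for illustration and aren’t drawn from any real survey.

```python
# Minimal sketch: score one respondent on a short, invented well-being survey.
items = [
    "I am satisfied with my life overall.",
    "My life is close to what I want it to be.",
    "I feel fulfilled by my work and relationships.",
]

# Hypothetical responses on a 1-7 scale (1 = strongly disagree, 7 = strongly agree).
responses = {items[0]: 5, items[1]: 6, items[2]: 4}

# The "measure" of the construct is just the average of the item scores.
well_being_score = sum(responses.values()) / len(responses)
print(f"Subjective well-being score: {well_being_score:.2f} out of 7")
```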
Of course, well-intentioned researchers face a number of challenging decisions about how to design and interpret these surveys, including (but not limited to):
Which questions should be chosen for the survey? This decision involves—either implicitly or explicitly—a theoretical commitment to the underlying dimensions of subjective well-being.
How should those questions be worded? This decision involves thinking through (and ideally testing) the various interpretations of a given question and ensuring that the most likely interpretation by survey-takers is aligned with the researcher’s intended interpretation.
Should the questions involve a “forced-choice” or a “free response” format? The former allows for easier quantification of results, but closes the researcher off to certain response options and discoveries; the latter has the opposite problem.
How can I encourage individuals to answer honestly and accurately? People may not always say how they feel—either because they’re embarrassed (perhaps admitting unhappiness is stigmatized) or because they don’t have keen insight into their emotional health; the researcher’s task is thus to identify questions that avoid social desirability bias and don’t require exceptional introspective abilities.
After all this work, other researchers might still (understandably) object that this scale is not the same thing as happiness. And they’d be right: concretely, the scale is measuring what people say in response to specifically worded questions about how they feel about their own lives.
But I also want to be clear that this issue is not limited to self-report measures. I sometimes see researchers dismiss results out of hand because “that’s just self-report”, and I think that’s misguided. It’s true that a self-report measure (e.g., subjective well-being) stands in for, and is not identical to, the construct of interest (e.g., happiness). But the same critique can and should be brought against any measure of any theoretical construct. And assuming a researcher wants to measure the thing they’re interested in, they’re going to have to make some difficult decisions regardless of whether they use a self-report scale. Saying “that’s just self-report” elides the more difficult but important work of determining and articulating whether and why the measure is a good one.
Pretty much all operationalizations are by definition proxies—the important question is what makes a proxy a good proxy.
What makes a construct valid?
The unfortunate answer is that this is very hard and we can never be 100% confident we’re right.
Broadly, however, there are a few prerequisites to establishing the validity of a measure and of the construct it stands in for.
Criterion #1: Reliability
The first is called reliability: how consistent are the results of a given measure? Depending on the nature of the measure and the underlying construct, reliability can be assessed across time (e.g., test-retest reliability), across individual researchers (e.g., inter-rater reliability), or even across different components of the measure itself (e.g., internal reliability). Put simply: we need to know we can trust our measuring devices.
This is obvious for measures of physical attributes like weight. If my bathroom scale produces a wildly different weight every time I step on it, it’s not a very reliable instrument: it has low test-retest reliability.
Similarly, if researchers believe a given psychological trait is relatively stable, then the same instrument for measuring that trait should produce similar results for the same person across periods of time. Considerable research has been devoted to exactly this issue, e.g., for measures of intelligence or personality.
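As a minimal sketch of how test-retest reliability is often quantified: administer the same instrument to the same people twice and correlate the two sets of scores. The numbers below are invented for illustration.

```python
import numpy as np

# Hypothetical scores for ten people on the same instrument,
# administered two weeks apart (invented numbers, not real data).
time_1 = np.array([12, 18, 9, 22, 15, 30, 7, 25, 17, 11])
time_2 = np.array([13, 17, 10, 24, 14, 29, 9, 23, 18, 12])

# Test-retest reliability is commonly summarized as the correlation
# between the two administrations (values near 1 = highly consistent).
reliability = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest reliability (Pearson r): {reliability:.2f}")
```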
Reliability is important for establishing construct validity. But a reliable measure is not necessarily valid. To give an absurd example: a test of color perception may exhibit high test-retest reliability—the same individual might get the same score across days or even years—but that doesn’t make it a good operationalization of something like stress.
That is, reliability is necessary, but not sufficient.
Criterion #2: Face validity
I hesitated to put face validity—subjectively, whether a measure plausibly relates to the construct of interest—in here because it’s not really an empirical measure.
But face validity often comes up in discussions of validity, even when people aren’t aware that face validity is what they’re discussing—so I think it’s useful to know the word for it. In the example above, “color perception” fails the face validity test as a measure of stress. There’s just not really a plausible sense in which our concept of “stress” can or should be measured by how well someone identifies different colors.
If the example seems strange, that’s because it is: there’s just no theoretical link between the construct and the measure.
In contrast, heart rate does seem like a plausible measure of stress. That doesn’t necessarily make it so—it still has to meet the other criteria—but intuitively, it seems like stress should correlate with heart rate, on average.
Criterion #3: Convergent validity
If a construct is valid, then different measures of that construct should correlate or “converge”. This is called convergent validity.
For example, Beck’s Depression Inventory (BDI) is a self-report scale designed to measure the risk of clinical depression. If the BDI is a good measure, then it should correlate with independent measures of clinical depression for those individuals—like a professional psychiatrist’s diagnosis (or lack thereof).
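As a minimal sketch of what checking that convergence might look like: correlate the self-report scores with the independent diagnoses. The data below are invented, and the point-biserial correlation is just one reasonable choice when one measure is continuous and the other is binary.

```python
from scipy.stats import pointbiserialr

# Hypothetical data: self-report scores on a depression inventory and an
# independent clinician diagnosis (1 = diagnosed, 0 = not). Invented numbers.
scale_scores = [5, 28, 12, 35, 9, 22, 40, 3, 31, 15]
diagnoses = [0, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# Convergent validity: the self-report measure should track an
# independent measure of the same construct.
r, p = pointbiserialr(diagnoses, scale_scores)
print(f"Point-biserial correlation: r = {r:.2f}, p = {p:.3f}")
```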
In my own research, convergent validity crops up as a topic of heated debate for constructs like Theory of Mind. Theory of Mind refers to the ability to represent and reason about the mental and emotional states of other agents. There are a number of important research questions involving Theory of Mind: whether animals have it, whether Large Language Models have it, when children develop it, and more. But there are also a bunch of different measures for Theory of Mind. To name a few: the “false belief task”, the “reading the mind in the eyes task”, the “short story task”, the “strange stories task”. Unfortunately, an individual’s performance on one measure doesn’t always correlate with their performance on other measures. This has led to calls to abandon the construct altogether (see this paper for a nice review of the debate).
Criterion #4: Criterion validity
A good measure should also predict other concrete behavioral outcomes that, intuitively, follow from the construct of interest. This is called criterion validity (or “predictive validity”).
For example, if we’re interested in health, one proxy might be asking people how frequently they engage in regular exercise. And one way to validate that measure is to ask whether self-reported exercise frequency predicts other measures of health, like life expectancy. If it does, then perhaps researchers can rely on self-reported exercise frequency as a proxy for at least one dimension of health (namely, how long people live). And of course, life expectancy is itself an (imperfect) operationalization of health: as concepts like healthspan make clear, health is about more than just how many years you live.
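As a rough sketch, checking criterion validity often boils down to asking whether the proxy predicts the outcome. With entirely invented retrospective data, that check might look like this:

```python
from scipy.stats import linregress

# Hypothetical retrospective data: self-reported exercise sessions per week
# (collected decades ago) and eventual lifespan in years. Invented numbers.
exercise_per_week = [0, 0, 1, 2, 2, 3, 4, 5, 6, 7]
lifespan_years = [71, 74, 76, 75, 79, 80, 82, 81, 84, 86]

# Criterion (predictive) validity: does the proxy predict an outcome
# that should follow from the construct of interest?
fit = linregress(exercise_per_week, lifespan_years)
print(f"slope = {fit.slope:.2f} years per weekly session, r = {fit.rvalue:.2f}")
```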
The long and winding road towards construct validity
Meeting these criteria is hard.
And of course, no measure will be perfect. A measure might exhibit some variance across tests, or it might exhibit only moderate correlations with behavioral outcomes of interest. In these cases, researchers have to decide whether the perfect is the enemy of the good or whether to abandon the measure.
From the outside, it’s tempting to construe these debates as petty squabbles. “Why have researchers spent the last 30 years arguing about this or that measure of extroversion/intelligence/stress/empathy?”
But those long arguments are exactly what happens when you care about measuring what you think you’re measuring—and if you’re interested in measuring something, you don’t avoid the problem just by ignoring the debates.
Construct validity beyond the ivory tower
At this point, some readers might be thinking something like: all this stuff about construct validity sounds like a real headache for academic psychology researchers—good thing I don’t have to deal with that.
But there’s a good chance you’re wrong. The problem of construct validity afflicts other research fields (e.g., medicine, health, economics) and also the world beyond the ivory tower.
Hiring
Hiring is a hard problem. Ultimately, you’ve got to predict whether a candidate for a position will do a good job on the basis of very limited information.
One approach is to use a test that measures qualities you think will make someone successful in the role. If you think the role in question demands a certain level of intelligence, then you can try to measure the applicant’s intelligence; if you think it demands leadership qualities, then you can try to measure whether someone is a good leader; if you think being a team player is important, then you can try to measure the applicant’s agreeableness or empathy. But in each of these cases, you’re at the mercy of at least two factors: 1) whether the measure is, in fact, a good measure of the quality in question; 2) whether that quality does, in fact, lead to success in that role. Both are issues of construct validity.
You can also try to measure how someone would perform more directly. For example, if the role involves programming in Python, then the interview could involve live coding—or at least asking the candidate to walk through how they would break down a programming problem. Of course, performance in this (somewhat artificial) setting may not predict performance “on the job”. This means you might select for people who perform well on these demo problems—which might also mean people who’ve prepared by reading about past technical interviews—and select against people who might otherwise be conscientious and effective workers. For a hiring manager, the question is whether, on average, such a test consistently selects for higher-quality candidates. Again, this is an issue of construct validity: specifically, criterion validity.
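If you did want to check this empirically, one (hypothetical) approach is to compare later job performance for candidates above and below the test’s cutoff. The scores, ratings, and threshold below are all invented for illustration.

```python
import numpy as np

# Hypothetical data for candidates who were hired regardless of the test:
# live-coding interview score and a later on-the-job performance rating (1-5).
coding_score = np.array([45, 52, 58, 61, 67, 72, 80, 88])
performance = np.array([3.4, 2.9, 3.1, 3.6, 3.3, 3.9, 4.1, 3.8])

# Does selecting on the test (say, score >= 65) select for higher-quality
# hires, on average? That's the criterion-validity question.
passed = coding_score >= 65
print(f"Mean performance, passed test: {performance[passed].mean():.2f}")
print(f"Mean performance, failed test: {performance[~passed].mean():.2f}")
```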
Importantly, you don’t sidestep the issue by eschewing these empirical measures and going with your subjective impression of a candidate during their interview. Ignoring the numbers doesn’t necessarily get you closer to the candidate’s “real” quality. Depending on your personal biases, you might end up selecting for candidates that seem more confident, candidates that tell an interesting story or joke, or even simply candidates that look more like you. Or you can be more intentional about this process and evaluate how they respond to specific questions (Tyler Cowen and Daniel Gross discuss interview tactics in their book Talent). But those questions may or may not be good measures of whether a candidate will succeed in the role.
Assessing on-the-job performance
Many of the same issues apply to assessing how someone is performing once they already have the job. That doesn’t mean it’s impossible—but it is hard, and opinions will necessarily differ on the validity of any given strategy.
For example, how should we quantify the quality of a software engineer’s contributions? Lines of code is one coarse measure, but may not correlate with the elegance of their contributions. Number of bug-fixes? Number of commits? Inverse number of bugs introduced? And of course, software engineers do a lot more than write code. They review code that others have written, and even more importantly, they make higher-level decisions about which platform to use, about the architecture of a planned feature, and so on. Which of these is most important, and how should we measure it? If we measure all of these factors, how should they be aggregated into a “composite” measure and how should each dimension be weighted?
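To make the aggregation problem concrete, here’s a minimal sketch of a weighted composite. Every dimension, score, and weight below is invented; choosing them is precisely the construct-validity problem.

```python
# Hypothetical proxies for "engineering performance", each normalized to 0-1.
scores = {
    "code_review_quality": 0.8,
    "bugs_fixed": 0.6,
    "design_contributions": 0.9,
    "bugs_introduced_inverse": 0.7,
}

# Invented weights: the weighting itself encodes a theory of what the
# construct "good engineering" actually is.
weights = {
    "code_review_quality": 0.3,
    "bugs_fixed": 0.2,
    "design_contributions": 0.4,
    "bugs_introduced_inverse": 0.1,
}

composite = sum(scores[k] * weights[k] for k in scores)
print(f"Composite performance score: {composite:.2f}")
```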
This is not unique to programmers. How should we assess the performance of a manager or a politician or the owner of a café? I think the difficulty of this task is in part why market mechanisms are appealing to so many people: whoever or whatever succeeds in a market that allows the free exchange of goods and services is, virtually by definition, whoever or whatever is “best” in that market.
To make it more personal: how do we evaluate tenure-track (or already tenured) professors? Concretely, tenure in academia largely depends on a few factors: research output, teaching quality, and service. For research faculty, the first of these is generally viewed as the most important. But what is “research output”? Ten low-quality papers in a pay-to-play journal are clearly less impressive than a single paper in a high-tier journal; thus, tenure review committees typically look at the impact factor (i.e., average #citations per paper per year) of the journals in which someone has published. Yet this, too, is imperfect, and conflates different modes by which a given researcher could contribute to their field. It may underweight the value of careful replications, and overweight original contributions of immediate interest to the field; it may also underweight the value of a researcher working in an obscure corner of the field whose work may eventually inspire a paradigm shift. None of these contributions are really commensurate, and yet the realities of promotion demand that we treat them as such.
Notably, all of these issues relate to “Goodhart’s law”: the notion that as soon as we begin to use a measure as a target or metric, it ceases to be a good measure.
Choosing where to eat (and what to buy, etc.)
If you’re anything like me, an occasionally self-defeating tendency towards over-optimization has likely led you to spend too long on a ratings aggregator like Yelp or Google, determining which restaurant would be the “best” place to go that evening, which product you should buy, or which movie you should watch.
Having options is mostly good: I don’t have much sympathy for the complaint that it’s somehow worse than having fewer options. But navigating a vast decision space under uncertainty brings challenges of its own. One solution is to try to reduce the uncertainty of these decisions by relying on what other people say about a given experience. How many stars does this restaurant have on Yelp? How many people say they enjoyed this product? How many students, for that matter, recommended this professor?
In each case, online reviews are a proxy for the quality of the experience you’re making a decision about. The hope is that “number of stars” or “% yes responses” (the measures) have some correlation with underlying quality (the construct). In some cases, it may. In other cases, it may not.
Alternatively, you might decide to only eat at restaurants recommended by someone you know and trust. Again, this doesn’t really avoid the issue of construct validity; there’s still the question of why you trust certain people and not others to recommend a restaurant. I’m not suggesting you keep an explicit tally of which people’s recommendations have led to positive or negative experiences—a measure of each friend’s criterion validity, in essence—but that’s partly because you probably already have that tally somewhere in your mind.
Evaluating Artificial Intelligence
A big question these days is what, exactly, novel AI systems like ChatGPT can and can’t do.
Readers have likely seen reports of ChatGPT passing the bar, passing medical licensing exams, and acing the SAT. That’s not to mention the remarkable improvements AI systems have made in recent years on various benchmarks—i.e., field-internal assessments of a system’s capabilities. These results have inspired important debates, e.g., on whether these AI systems will replace human workers in certain jobs.
Zooming out a little, it’s easy to see that many of these debates center around construct validity. One person sees ChatGPT passing a given test and interprets that as evidence of more general capabilities; another person sees the same result and argues that the test is unrepresentative of the general capabilities it’s been designed to measure. Which person is right?
There’s not really an easy answer here. Tests or “metrics” are by definition operationalizations of some construct. This is true for assessments of human capabilities, and it’s equally true for assessments of machine capabilities. Each test is meant to “stand in” for some broader ability, but the challenge is that it’s hard to know how successfully it does this—not least because that broader ability (e.g., “intelligence”) is not necessarily clearly-defined, particularly in lay discourse.
Even worse, we may not be carving up the conceptual space of “abilities” in the correct way. One of the best papers on this issue is called “AI and the Everything in the Whole Wide World Benchmark”. I recommend reading the whole thing, but here’s a choice excerpt from the beginning:
In the 1974 Sesame Street children’s storybook Grover and the Everything in the Whole Wide World Museum [Stiles and Wilcox, 1974], the Muppet monster Grover visits a museum claiming to showcase “everything in the whole wide world”. Example objects representing certain categories fill each room. Several categories are arbitrary and subjective, including showrooms for “Things You Find On a Wall” and “The Things that Can Tickle You Room”. Some are oddly specific, such as “The Carrot Room”, while others unhelpfully vague like “The Tall Hall”. When he thinks that he has seen all that is there, Grover comes to a door that is labeled “Everything Else”. He opens the door, only to find himself in the outside world.
As a children’s story, Grover’s described situation is meant to be absurd. However, in this paper, we discuss how a similar faulty logic is inherent to recent trends in artificial intelligence (AI)— and specifically machine learning (ML) — evaluation, where many popular benchmarks rely on the same false assumptions inherent to the ridiculous “Everything in the Whole Wide World Museum” that Grover visits. In particular, we argue that benchmarks presented as measurements of progress towards general ability within vague tasks such as “visual understanding” or “language understanding” are as ineffective as the finite museum is at representing “everything in the whole wide world,” and for similar reasons — being inherently specific, finite and contextual.
Put simply: the world is complicated and it’s impossible to know whether we’ve developed the right taxonomy. A similar point is made in The Analytical Language of John Wilkins by Jorge Luis Borges, which takes the difficulty of developing a sensible classification scheme to the extreme:
(a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in the present classification, (i) those that tremble as if they are mad, (j) innumerable ones, (k) those drawn with a very fine camelhair brush, (l) others, (m) those that have just broken a flower vase, (n) those that look like flies from a long way off.
What’s the solution?
The authors of the paper suggest taking a different approach to benchmarking—namely, identifying tests that elucidate mechanism rather than overall performance:
The effective development of benchmarks is critical to progress in machine learning, but what makes a benchmark effective is not the strength of its arbitrary and false claim to “generality” but its effectiveness in how it helps us understand as researchers how certain systems work— and how they don’t.
As I’ve written before, I completely agree. I think that making progress on AI evaluation is going to require some form of LLM-ology (or what’s also being called mechanistic interpretability).
But I also think the lessons of construct validity—some of which are explored in this post—can and should be explicitly deployed towards evaluating AI systems. To determine whether a metric is good or bad, we can ask whether it’s reliable, whether it’s valid on its face, whether it correlates with other measures of the same ability, and whether it predicts downstream performance once a system is deployed (either in production or in a sandbox).
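As a minimal sketch of what that might look like in practice (with entirely invented numbers): treat each benchmark as a measure, then check whether it converges with other benchmarks that claim to measure the same ability and whether it predicts downstream task success.

```python
import numpy as np

# Hypothetical scores for five models on two benchmarks that claim to measure
# the same underlying ability, plus a downstream task success rate in a
# sandboxed deployment. All numbers are invented for illustration.
benchmark_a = np.array([62, 71, 75, 83, 90])
benchmark_b = np.array([58, 70, 69, 85, 88])
deployed_success = np.array([0.41, 0.55, 0.50, 0.72, 0.80])

# Convergent validity: do the two benchmarks rank systems similarly?
r_ab = np.corrcoef(benchmark_a, benchmark_b)[0, 1]
print(f"Benchmark A vs. benchmark B: r = {r_ab:.2f}")

# Criterion validity: does the benchmark predict downstream performance?
r_ad = np.corrcoef(benchmark_a, deployed_success)[0, 1]
print(f"Benchmark A vs. deployed success: r = {r_ad:.2f}")
```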
Measurement all the way down
Measurement is hard. We’re at the mercy of our senses and the fallible instruments we construct to aid them. These tools do not measure the world as it is but rather the world through a kind of selective filter.
It’s easy to forget this. Language sometimes gives us the illusion of specificity, of mutual understanding. But just because two people use the same word for something does not mean: 1) they mean the same thing with that word; or 2) either of them really understand what that word means, when it comes down to it.
These problems are revealed as soon as you try to start measuring things we talk about in everyday discourse (“happiness”, “empathy”, “job performance”). Unfortunately, finding examples of bad metrics is easier than finding examples of good ones. At that point, it’s tempting to throw our hands up and eschew measurement altogether.
To be honest, I have a lot of sympathy for this view. An over-reliance on measurement—and in particular, quantification—does often lead us astray.
But I also think it’s important to be clear-eyed about the alternative. Eschewing formal measurement doesn’t let us off the hook. Unless you’re a philosophical idealist, you believe in capital-R Reason, or you think that ideas can be divinely inspired, it seems uncontroversial to state that the vast majority of our knowledge comes from empirical observation of the world, or from representations that are somehow transduced from these empirical observations (e.g., as conveyed in language). We’re thrust into life with a particular array of sensors and a particular neural structure with which to process the data those sensors acquire—a particular Umwelt—and we do the best we can to learn about the real structure of the world with those limited affordances. Humans are aided in this process by our sociality, our ability to communicate complex ideas, and our ability to create novel tools that measure the things we can’t perceive directly. Critically, each stage of this knowledge acquisition cycle involves something like measurement: a (lossy) representation of the world that nevertheless facilitates our ability to perform actions and achieve our goals.
In other words, we’re stuck with measurement of some kind, whether we like it or not—so we should do it well.