Who's afraid of the null hypothesis?
A useful experiment should have an interesting null hypothesis.
Cognitive scientists routinely run experiments in the lab on samples of human participants. As any experimental researcher would readily admit, virtually all lab experiments are utterly unlike the “real world”. They’re (intentionally) stripped of all context except for the key variables the researcher is hoping to manipulate. For instance, in a lexical decision task, a participant might view a series of strings on a computer screen and indicate which are valid words; this is not, of course, something that one is typically asked to do outside of the lab—but the researcher might be interested in whether participants respond faster to words that are more frequent (or shorter, or more concrete, and so on).
Someone concerned about this fact might argue that these studies lack “ecological validity”, i.e., the conclusions reached by the study may not generalize to the real world. Is this true, and does it matter? It depends crucially on what experiments are for in the broader context of Cognitive Science.
What are experiments for?
Cognitive scientists are not the only scientists who run experiments. The experimental method is widely used across scientific disciplines as a tool for isolating how different variables interact and, ideally, determining causal relationships between those variables.[1] The “real world” is full of noise and confounds: threats to the internal validity of a claim. An experiment, whether it’s in physics, chemistry, biology, or psychology, sacrifices some amount of external validity (generalizability and applicability) for the sake of tighter control.
Earlier this year, the psychologist Paul Bloom wrote a good article about this very point. In it, he argued that the concern over whether a study “works” outside the lab is sometimes misguided, since the point of experiments in Cognitive Science is not merely to identify useful, practical interventions—at least in the ideal case, the goal is to inform mechanistic theories of how the mind works. He wrote:
Laboratory studies are one way (not the only way) that science comes to know things. By establishing “perfectly controlled conditions” for our experiments, we’re doing what every other scientist does—testing our theories by focusing on specific contrasts.
I’ve heard versions of this argument from other experimentalists as well, ranging from those studying human behavior to those investigating the neural underpinnings of spatial navigation in rodents. The goal of an experiment is to isolate specific variables and understand how they relate to each other. That understanding can in turn be brought to bear on the construction (and modification or even falsification) of scientific theories. For instance, the relative speed with which humans respond to abstract vs. concrete words might inform theories about how we represent the meaning of words in a “mental lexicon”.
One way to think about this is that an experiment represents a kind of “micro-world”: a simulation of certain relevant parameters of reality that allows us to determine, under controlled conditions, how those parameters interact. Observing an “effect” in an experiment—e.g., a non-zero difference between experimental conditions—can be helpful in at least two ways. First, it serves as an empirical existence proof that there exists some set of conditions under which parameter A (e.g., word concreteness or frequency) influences parameter B (e.g., response time). And second, it can provide an estimate of the “effect size”: does “A” influence “B” a little or a lot?
The second question is often more interesting, but also hard to answer reliably via experimentation, since the effect size in the lab may not actually be representative of the effect size outside the lab; that’s a case where ecological validity really does matter. But the first question is also an important one for triangulating the conditions under which two or more variables causally interact. It might sound like I’m underselling experiments by calling them existence proofs, but a well-designed existence proof can be really helpful for informing theory! How helpful depends, in many cases, on how surprising or informative it is to know that “A” can influence “B”.
When are existence proofs useful?
An existence proof is, intuitively, a demonstration that something exists.
Existence proofs are most helpful for disproving universal claims. For instance, the claim that “All swans are white” can be disproven by demonstrating the existence of a single black swan. This is a form of deductive reasoning: we can falsify the first claim (“All swans are white”) by finding evidence contrary to the implications of that claim (“There exists at least one black swan”).
Now, it’s still entirely possible that most swans are white. From another perspective, identifying a single black swan may not do much to change our expectations of observing white swans in general. Conversely, if we observe numerous black swans, we might gradually update our beliefs about the relative preponderance of white and black swans. This is, essentially, a form of inductive reasoning[2], in which evidence is used to support probabilistic inferences about the state of the world. Nonetheless, it remains true that the universal claim has been falsified by a single counterexample.
Which kind of reasoning do scientists actually use? And what is the role of experiments in that reasoning? Much confusion and debate, in my view, can be traced to differences in how people might answer these questions. It’s made more complicated by the fact that there are normative answers (e.g., Karl Popper’s falsificationism) and descriptive answers (e.g., maybe scientists actually do something more like inductive reasoning under certain paradigmatic assumptions). To make matters worse, the statistical techniques many scientists rely on for drawing inferences about experimental results are grounded in a kind of falsificationism (disconfirming the null hypothesis), but experiments are not always designed with explicit falsification in mind. All of this complicates the actual utility of an experiment.
Below, I’ll work through each of these issues in turn.
Falsify, not verify?
Karl Popper famously argued in support of developing falsifiable theories. Science, he argued (roughly), should proceed not by finding evidence in support of our theories, but by finding evidence that allows us to eliminate theories—letting stand the theories that have not (yet) been explicitly falsified.
A core motivation here is that it’s very hard (arguably impossible) to prove a scientific theory true using empirical evidence. This is a specific form of the more general problem of induction: even if all available evidence is consistent with a proposition (e.g., “The sun will rise again tomorrow”), it is entirely possible that future evidence will come to light disproving that proposition. Indeed, even appealing to the fact that induction has generally held up and been helpful in the past is itself a form of inductive reasoning, and there’s no way to know with certainty that it won’t fail in the future.
Popper’s solution to this was to argue in favor of falsification, which (as noted above) operates as a kind of deductive reasoning. A good scientific theory is one which can be rigorously tested and falsified; accordingly, theories are not “proven right” but rather “not yet proven wrong”. In principle, this allows us to distinguish scientific from non-scientific claims (Popper was very concerned with the question of demarcation): if a theory can’t be falsified, it’s not a scientific theory. It also provides a clear mechanism for making progress: science proceeds by pruning away falsified theories. When it comes to any particular theory, we ought to try to prove it wrong, not prove it right.
This brings me to null hypothesis significance testing, or “NHST”.
NHST is kind of weird
With some exceptions, NHST is the conceptual foundation of statistical analysis in much of Psychology (and indeed, much of experimental science). There is a litany of “hypothesis tests” (t-tests, chi-squared tests, and many more, a number of which can actually be represented in terms of linear regression), but all of them operate under the assumption that theoretical inferences are driven by our decision to either reject or fail to reject the so-called “null hypothesis”.
Consider the schema of a simple experiment with two conditions: “A” and “B”. Participants are randomly assigned to conditions, the intervention is applied, and some dependent variable (DV) is measured. Typically, the “null hypothesis” in such an experiment is that there is no real difference between the conditions, i.e., that the true difference in the DV is zero. Of course, sampling error means that there’ll likely be some marginal difference across conditions even if the intervention had no “true” effect. How do we distinguish random noise from a real effect?
This is where statistical tests come in. We can compare the observed effect to the distribution of effects we’d expect under the null hypothesis. This distribution can either be derived empirically (e.g., through permutation testing) or assumed (as in a t-test). In either case, we can calculate the probability, under that distribution, of an effect at least as extreme as the one we observed; we call this probability the p-value.

Intuitively, a larger effect size is less likely to have occurred under the null distribution.[3] Thus, if the p-value is sufficiently small (say, p < 0.05), we might choose to provisionally “reject” the null hypothesis. That is, we would be unlikely to observe an effect this large if the null hypothesis were true. If the p-value doesn’t cross some predefined threshold for “statistical significance”, we might instead “fail to reject” the null hypothesis. This is what I mean when I write that NHST is deeply rooted in the logic of falsification.
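To make the mechanics concrete, here is a minimal sketch of a permutation test in Python. The response times are invented for illustration, and `permutation_p_value` is a hypothetical helper rather than a function from any library. It shuffles the condition labels to build the null distribution empirically, then counts how often a shuffled difference is at least as extreme as the observed one.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null, condition labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical response times in milliseconds (made-up data).
condition_a = [512, 498, 530, 475, 501, 489, 520, 495]
condition_b = [540, 565, 552, 530, 571, 548, 559, 537]

observed, p = permutation_p_value(condition_a, condition_b)
print(f"observed difference = {observed:.2f} ms, p = {p:.4f}")
```

With a threshold such as 0.05 fixed in advance, a p-value below it would lead us to provisionally reject the null; otherwise we would fail to reject it.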
If this all seems a bit strange to you—we’re trying to reject the null hypothesis rather than prove the alternative hypothesis?—you’re not alone. NHST is not intuitive, and even practiced scholars (including myself) are not always careful with how they describe statistical results. But technically, the point of NHST is to provide a kind of conceptual framework for deciding when to reject or fail to reject a null hypothesis.
The problem arises when the null hypothesis simply isn’t very interesting.
Who’s afraid of the null hypothesis?
Earlier, I argued that one goal of an experiment is to provide an empirical existence proof that there exists at least some set of conditions under which (say) “X” affects “Y”. Using the language of falsification and NHST, we might say that the null hypothesis of many experiments is that “X” does not affect “Y”; thus, if “X” is empirically related to “Y” more than we’d expect under the null hypothesis, we can reject (falsify) the null.
I also argued that the extent to which this matters depends on how surprising or informative it is to learn that “X” might affect “Y” in at least some conditions. Put another way: how strongly held was the null hypothesis in the first place?
It’s easy to imagine null hypotheses that could be easily disproven, and that no one would find interesting to disprove. For instance, suppose the null hypothesis is something like: “An anvil and a feather are equally likely to break a glass cup”. We could design a study to test this null hypothesis. First, we might select or “sample” 100 glass cups of identical make and size. Then, we might randomly assign 50 of these cups to the Anvil condition and 50 of them to the Feather condition. Finally, we would measure how many cups broke when we dropped an anvil on them and how many broke when we dropped a feather on them. If the difference between conditions (Anvil - Feather) is larger than we’d expect under the null hypothesis, we can successfully reject the null.
Any reasonable observer would (correctly) ask: what would be the point of such an experiment? Surely we already know that anvils are heavier than feathers, and thus more likely to break a glass cup! No one would be surprised by finding that this is the case.
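As a rough illustration of how easy this null is to reject, here is a sketch of a one-sided Fisher's exact test on made-up results. It assumes every cup in the Anvil condition breaks and none in the Feather condition does. The function name and the counts are hypothetical.

```python
from math import comb


def fisher_one_sided_p(broken_a, n_a, broken_b, n_b):
    """Probability of at least `broken_a` breaks in group A under the null,
    holding the total number of broken cups fixed (hypergeometric tail)."""
    total_broken = broken_a + broken_b
    total = n_a + n_b
    denom = comb(total, n_a)
    p = 0.0
    for k in range(broken_a, min(n_a, total_broken) + 1):
        # Ways to place k broken and (n_a - k) intact cups in group A.
        p += comb(total_broken, k) * comb(total - total_broken, n_a - k) / denom
    return p


# Hypothetical outcome: 50/50 cups break under the anvil, 0/50 under the feather.
p_value = fisher_one_sided_p(broken_a=50, n_a=50, broken_b=0, n_b=50)
print(f"p = {p_value:.3g}")
```

Under these assumed counts the p-value is one over the number of ways to choose 50 cups from 100, which is astronomically small, and yet nobody's beliefs change. A tiny p-value tells us the null is implausible, not that rejecting it was informative.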
I don’t think most experiments test something as obvious as the relative heaviness of anvils and feathers. But I do think many experiments (including some I’ve run) are designed without a particularly interesting null hypothesis in mind. Instead, they’re designed to find corroborative evidence for a theory. That is, they identify the predictions of a theory (“X affects Y”) and design an experiment to test those predictions. The problem in such a scenario is that if we think of experiments as existence proofs (as I’ve argued we can), then perhaps it’s not so surprising that a researcher with considerable degrees of freedom in how they design their study can construct some scenario under which their theory’s predictions (“X affects Y”) are corroborated. If the null hypothesis corresponded to a different, competing theory (as I discuss below), that would be a really interesting result. But if the null hypothesis is a theory no one believes (simply the negation of the theory under investigation), then it’s more of a demonstration that the researcher’s intuitions are borne out with empirical evidence in at least one case. That’s not nothing, but it’s epistemologically fraught to lean entirely on such a methodological paradigm in constructing theories about the world.
What should researchers do?
Where does this leave us?
One obvious (but challenging) solution is to actually design experiments with an interesting null hypothesis in mind and to try to disconfirm that theory: essentially what Popper suggested. This is hard work: it requires carefully enumerating the predictions of psychological theories and figuring out not only what would be consistent with a given theory but also what would be inconsistent with the theory. Then, once you’ve figured this out, you try to design an experiment in which the theory predicts no difference in response to some experimental manipulation; in such a design, finding a significant difference (that is, rejecting the null hypothesis) would equate to a kind of falsification of that theory, requiring some kind of modification (assuming you’ve operationalized your constructs appropriately).
In the ideal case, researchers might even design experiments that pit multiple theories against each other, such that the results actually help adjudicate between competing theories. This is what biophysicist John Platt called “strong inference”. It’s a really effective approach because it helps avoid the inevitable (and very human) urge to “stack the deck” experimentally in favor of finding some experimental effect. Here, the goal is to design a study such that any result would be informative: Result A disconfirms Theory B in favor of Theory A, while Result B disconfirms Theory A in favor of Theory B.
I want to emphasize that this is really hard work. It’s especially hard when different psychological theories don’t make precise predictions; in some cases, it’s difficult to determine whether different theories make competing predictions at all! That’s not a reason not to do it—I absolutely think researchers should strive to design experiments that actually adjudicate between competing theories. But some verbal theories simply aren’t sufficiently developed for this kind of quantitative comparison, so comparing them requires making some strong (often contestable) assumptions on the part of the researcher.
Do experiments still have a role in these fuzzier contexts, absent strong inference? I think the case is less clear, but there are a few possible affirmative responses I can imagine someone making.
The first (and weakest) is the view I pointed out in the previous section: even if experiments are designed to corroborate a theory (rather than disconfirm it), and even if the results are not particularly surprising, there is some information provided above and beyond the researcher’s intuitions—i.e., it is an empirical demonstration that some set of experimental conditions can be devised to produce an effect.
The second is to suggest that the goal of science isn’t always investigating whether X affects Y; it’s to determine the size of this effect or relationship (sometimes called parameter estimation). Here, the role of experiments might be to help estimate an effect size under certain controlled conditions. As I argued earlier, such a finding may not always be reliable or useful given the lack of ecological validity in most experiments: the extent to which “X affects Y” in the lab may not tell us much about the extent to which “X affects Y” in the real world. But this approach can be more useful if multiple studies are conducted, and their results are integrated with studies conducted using more naturalistic, ecologically valid data—providing something like convergent evidence across multiple methodological approaches.
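For a sense of what parameter estimation can look like in practice, here is a small sketch that reports a standardized effect size (Cohen's d) with a percentile bootstrap interval instead of a reject-or-not decision. The data are invented and the helper names are hypothetical. The interval describes uncertainty under these lab conditions only; it says nothing about how well the estimate transfers outside the lab.

```python
import random
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for Cohen's d (resampling within groups)."""
    rng = random.Random(seed)
    estimates = sorted(
        cohens_d(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lower = estimates[int((1 - level) / 2 * n_boot)]
    upper = estimates[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical response times in milliseconds (made-up data).
condition_a = [512, 498, 530, 475, 501, 489, 520, 495]
condition_b = [540, 565, 552, 530, 571, 548, 559, 537]

d = cohens_d(condition_a, condition_b)
ci_low, ci_high = bootstrap_ci(condition_a, condition_b)
print(f"d = {d:.2f}, 95% bootstrap interval = [{ci_low:.2f}, {ci_high:.2f}]")
```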
Here, a reader might also suggest that the problem lies in the binary logic of NHST; perhaps scientists should adopt a more “Bayesian” approach instead, in which no single experiment is decisive but each experiment’s results might “update” our beliefs in one direction or another. I think Bayesian statistics are great, but I’m not actually confident the problem here is about which statistical paradigm we use. In my view, the problem is more about how we go about selecting theories to test and designing experiments to test them, not how we analyze the data from those experiments. I think the results of the Anvil vs. Feather experiment would be uninteresting regardless of which statistical paradigm one used to analyze them.
With that said, one insight that the Bayesian paradigm does attempt to capture is the notion of the “weight of evidence” and cumulatively building up beliefs for or against a theory. To me, the best version of this approach looks something like finding convergent evidence across multiple studies (as I argued above)—what I think of as a process of triangulation. Finding a relationship between X and Y across multiple experiments and multiple naturalistic studies might increase our confidence that there is, in fact, a relationship between X and Y, even if none of those studies were designed with falsification in mind.
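One simple way to picture this kind of cumulative updating is a conjugate normal model, in which each study contributes an effect estimate and a standard error. This is a textbook simplification under strong assumptions (normality, known standard errors, independent studies measuring the same underlying effect), and every number below is hypothetical.

```python
def update_normal(prior_mean, prior_var, estimate, std_error):
    """Conjugate normal update, treating the study's standard error as known."""
    prior_precision = 1 / prior_var
    data_precision = 1 / std_error ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, post_var


# Hypothetical (effect estimate, standard error) pairs from a mix of
# lab experiments and naturalistic studies.
studies = [(0.40, 0.20), (0.25, 0.15), (0.10, 0.10), (0.30, 0.25)]

# Assumed prior: centred on "no effect", with wide uncertainty.
belief_mean, belief_var = 0.0, 1.0
history = [(belief_mean, belief_var)]
for estimate, std_error in studies:
    belief_mean, belief_var = update_normal(belief_mean, belief_var, estimate, std_error)
    history.append((belief_mean, belief_var))
    print(f"after estimate {estimate:+.2f}: mean = {belief_mean:.3f}, sd = {belief_var ** 0.5:.3f}")
```

No single study is decisive here; each one nudges the belief and narrows its spread. Note that the arithmetic is indifferent to whether any of the studies had an interesting null hypothesis, which is the design question this post is about.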
This is, ultimately, a kind of inductive reasoning, and it comes with well-known philosophical problems. The kind of knowledge obtained through this process is inherently uncertain and unstable, but that may simply be the reality of the epistemological situation we find ourselves in. I do think, however, that we need to be honest about when we’re actually doing falsification and when we're doing something more like this process of fuzzy triangulation.
[1] Notably, this is why randomized controlled trials (a kind of experiment) are generally considered the “gold standard” of evidence in medicine.
[2] Not the same thing as a proof by induction, which is actually a form of deductive reasoning!
[3] Generally this is also affected by things like sample size, which (along with the sample variance) determines the variance of the null distribution.

Wo ho! Thanks for the great essay. It's actually even more complicated than you say.
Popper's thinking was a giant leap forward, but even Popper was wrong. Science can neither prove nor disprove (falsify) a hypothesis. There is no proof in science. Take your black swan example. You search everywhere and you find a bird that looks like a black swan. Does that disprove the hypothesis that there are no black swans? No. Because in order to disprove it, you have to prove that this bird that looks like a swan really is one. You're right back where you started: instead of trying to prove that there are no black swans, you now have to prove that there is no difference between this bird and white swans, except that it is black. In short, in order to disprove a scientific statement, you have to prove that the observations against it are correct. Popper was hoist by his own petard.
These points about scientific certainty are not just angels dancing on a pin. They are at the core of science and, therefore, should be at the core of AI science. Thank you so much for bringing them up.
Here are some key points:
* Scientific proof is impossible; therefore, proof that a system has achieved artificial intelligence is impossible.
* The logic of scientific discovery is still critical to advancing science even if you cannot prove it to be correct. You do not have to be certain to get value.
* Intelligence is a scientific conjecture and needs to be treated as such.
* Consider alternative explanations. Intelligence is a cause, and you cannot infer the cause from the effect. Other causes (stochastic parroting) may be responsible, and it takes careful experimentation to tease them apart.
* Think critically.
* Don't put all of your eggs in the benchmark basket. You cannot prove that they are valid.
* Finally, the question of whether a machine is intelligent is a theoretical statement about the cause of an observation, it is not a definitional or engineering achievement.
Great piece. As regards AI, I think that the conceptual construct of "ecological validity" does more harm than good. It's not that AI's ecologies are perfectly deterministic, but that there's a big difference between the determinism of a digital ecology and an ecology that exceeds digital substrates. This difference is so significant that we might need a new conceptual construct for AI -- one that approximates what we mean by ecological validity but accounts for the fact that its “real world” is always a micro-world. Maybe it's just a language issue, but given the preponderance of slippery writing and research in this field, it deserves some attention.
Regarding your last point: I wonder if a lot of researchers would admit that what they call falsification is the same thing as fuzzy triangulation, but that the semantic distinction isn't really a big deal. I would argue that the difference matters when it comes to communicating these concepts to newcomers. In my experience, a lot of people instinctively withdraw from the study of scientific methodology because it confounds their intuitions and can seem contradictory ("the point of science is that certainty is provisional, but falsification tells us that we can certainly rule out the null.")
In fact, if I can indulge in some folk theorizing, I think that a key difference between science-y people and non-science-y people is that the former intuitively grasp that "precise" terminology is never really that precise -- they don't need words to closely align with meanings, so they're less likely to shut down in the face of ambiguities and "precise" conceptualizations that seem to require excessive qualification in order to make sense (like the norm of seeking to reject the null rather than confirm the alternative.)
Anyway, thanks for all the time that went into this. I wish I'd had it to share with students back when I was teaching courses on scientific thinking.