The Lady Tasting Tea & Statistical
A common situation that arises in inquiry labs involves determining
whether the difference between two groups, for example a treatment
and a control group, is statistically significant. But statistical methods
can seem daunting, with their P-values and null hypotheses. And for
that matter, what does “statistical significance” actually mean? Many
instructors and students struggle to understand these concepts.
A story from the history of statistics about a lady tasting tea should
make significance testing and related concepts more accessible.
The idea of a test of significance was conceived by Ronald
Fisher (1890–1962). He played a major role in developing experimental designs and statistical methods that helped to revolutionize
the practice of science in the twentieth century (Salsburg, 2001). In
his book The Design of Experiments (1971; first published in 1935),
Fisher introduced the concept of a test of significance by recounting the following story. One summer afternoon in the late 1920s,
Fisher and several colleagues were having tea. When Fisher handed
Lady Muriel Bristol a cup, she declined because Fisher had poured
the tea into the cup first. Lady Bristol declared that tea tasted different depending on whether the milk was poured into the tea or the
tea poured into the milk. Fisher and the other scientists were skeptical and began to discuss how they could test Lady Bristol’s claim.
The scientists decided to arrange eight cups, four with the milk
poured into the tea and four with the tea poured into the milk. The
cups were presented to Lady Bristol for tasting one at a time in random
order, and she was told that she had to identify the four milk-first
cups. In his book, Fisher explains how to determine the probabilities
associated with having the lady evaluate eight cups of tea. Figure 2
shows the probabilities he calculated for each number of milk-first
cups the lady could potentially identify correctly (Fisher, 1971;
Gorroochurn, 2012). With eight cups of tea presented in random order,
the probability of the lady correctly identifying all four milk-first cups
by guessing is 1 in 70, or 0.014. Assuming that she is unable to
distinguish the cups, the most likely outcome will be that she guesses two
out of four cups correctly. In repeated experiments, this will happen
by chance 51.4% of the time. If she cannot really tell the difference,
the probability of Fisher being fooled by a random occurrence where
the lady happens to guess all four of the cups correctly is 0.014, or
1.4%. So if Fisher puts her to the test and she evaluates all the cups
correctly – an outcome of the experiment that is unlikely to have
occurred by chance – he can tentatively conclude that she can tell
which liquid was poured first.
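The probabilities quoted above can be reproduced by simple counting. The following is a minimal sketch in Python, not Fisher's own presentation: it assumes the lady always labels exactly four of the eight cups as milk-first and that, under pure guessing, every choice of four cups is equally likely. The function name guessing_outcomes and its parameters are illustrative only.

```python
from math import comb


def guessing_outcomes(cups=8, milk_first=4):
    """Count the equally likely guesses that give each number of correct picks.

    Assumes the taster labels exactly `milk_first` cups as milk-first and
    cannot tell the cups apart (an illustrative model of pure guessing).
    """
    # Total number of ways to choose which cups to call milk-first.
    total = comb(cups, milk_first)
    counts = {}
    for k in range(milk_first + 1):
        # k picks come from the true milk-first cups; the remaining
        # picks are mistakenly taken from the tea-first cups.
        counts[k] = comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
    return counts, total


counts, total = guessing_outcomes()
for k, ways in counts.items():
    print(f"{k} correct: {ways}/{total} = {ways / total:.3f}")
```

With the default arguments, the sketch gives 1/70 (0.014) for four correct cups and 36/70 (0.514) for two correct cups, matching the values in the text.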
While Fisher described the experimental design for the lady
tasting tea, he did not tell us the outcome. However, Salsburg
(2001) has it from a reliable source that the lady correctly identified
all the cups. It is important to understand that by setting up this
experiment and having her demonstrate her talent on one occasion,
Fisher has not proved that the lady can make the necessary distinction.
If she can’t tell the difference, there is still a 1.4% chance that she
would identify all four cups correctly by guessing alone, an unlikely
but possible outcome. However, getting an unlikely result in
an experiment like this is good evidence that Fisher’s initial
assumption, that she cannot tell the difference, may not be true.
So, we can be reasonably safe in rejecting the idea that her success
is just by chance and conclude, based on the result of this experiment,
that she can tell the difference.
Fisher termed outcomes of well-designed experiments that are
unlikely to have occurred by chance statistically significant outcomes,
and a test designed to demonstrate this is a test of significance. Methods
of inference commonly used in science to support or reject claims
based on data, like the chi-square test or the independent-samples
t-test, are also tests of significance (Moore et al., 2009).
One key point to remember from the story of Fisher and the
lady is that before performing an experiment, we should consider
Figure 1. Sources of variation in data (Wild, 2006).
Figure 2. Possible outcomes of the “lady tasting tea”