What is it?
P-value is short for probability value. But the probability of what? The p-value is the probability of obtaining an effect at least as extreme as the one you found in your sample - assuming the null hypothesis is true.
No, seriously, what is it?
If that definition were an accessible and intelligible explanation, statistics would not be considered a difficult topic, nor would the meaning and usefulness of p-values be a contentious issue in the scientific community.[1] While there are many (many!) attempts to address this issue, what better way to start a blog on methods and statistics than by taking a shot at one of the cruxes of science-making? In any case, instead of writing about definitions, perhaps it is more useful to illustrate the process in which p-values matter, so we are better placed to understand what they are and, most importantly, what they are not.
P-values are meaningful under the frequentist approach to probability (which is just one perspective under the larger umbrella of probability theory, alongside, e.g., Bayesian and likelihood approaches). Simply put, frequentists view the probability P of an uncertain event x as the relative frequency of that event over previous observations. For example, in a set of random experiments whose only possible outcomes are x, y, and z, the frequency with which x occurred, relative to all observed outcomes (x, y, and z), is a measure of the probability of the event x. If you run the experiment ad infinitum, that is. The rationale behind frequentist thinking is that as the number of trials approaches infinity, the relative frequency converges to the true probability.
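To make the convergence idea concrete, here is a minimal sketch in R (a hypothetical coin-toss example with a known true probability of 0.5, not data from this post):

```r
# Minimal sketch: the relative frequency of an event approaches its true
# probability as the number of trials grows (here, a fair coin with p = 0.5).
set.seed(42)
n_trials <- 10000
tosses <- rbinom(n_trials, size = 1, prob = 0.5)    # 1 = event observed, 0 = not
running_freq <- cumsum(tosses) / seq_len(n_trials)  # relative frequency after each trial
running_freq[c(10, 100, 1000, 10000)]               # drifts toward 0.5 as trials accumulate
```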
Illustration
Let's proceed with an illustration. Say you are a researcher interested in the birth rate of baby girls, and you would like to know whether there are more baby girls being born than baby boys. Here, there are only two possible outcomes: to observe or not to observe the event of interest (i.e., a baby girl being born). To investigate that research question, you start by logging the gender of every baby born in the nearest hospital for a full 24 hours. Then, as shown above, you estimate the probability of your event of interest, P(Girls), which is the ratio of the number of baby girls born to the total number of observed births (boys and girls). You look at your records and see that you observed 88 births in total: 40 baby boys and 48 baby girls. The estimated probability of a baby girl being born, according to your data, is therefore P(Girls) = 0.5455. These results could be interesting to policy makers, practitioners, and scholars because your data seem to suggest there are more baby girls being born than baby boys.
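In R, the point estimate from the 24-hour log is nothing more than a ratio; a quick sketch:

```r
# Estimated probability of a baby girl being born, from the 24-hour log.
girls  <- 48
boys   <- 40
births <- girls + boys      # 88 observed births
p_girls <- girls / births   # estimated P(Girls)
round(p_girls, 4)           # 0.5455
```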
But before we run to the world and tell it about this puzzling truth, perhaps we should consider the role of different philosophical approaches to the scientific method and how these translate into two ways of dealing with uncertainty and its statistical operationalization.
The role of Uncertainty
Now, you have to find a way to show the policy makers and the scientific community that your finding is "true". But in actuality, you can only show that your results are somewhat 'likely' to be true. Indeed, it is in estimating how likely these results are to reflect the true probability that the scientific method and statistics intermingle. To start, note that you observed only 88 trials (births), not an infinite number of trials, nor even a large-enough sample. This means that, statistically (from a frequentist perspective), your estimate did not have the necessary number of trials to converge to the true probability. Another way to look at this is to think about the lack of precision of your estimate. For example, you did not survey the whole population of hospitals and the babies being born therein, but just a sample of it. In fact, it is even more restricted: you surveyed only 24 hours' worth of data out of the roughly 8,760 hours in a year, in only one hospital. Given that the true probability is unknown, the limitations of your study could influence the accuracy of your estimate and bias it away from the true population parameter. So the best you can do is to estimate it and assess the degree to which it is an accurate estimate.
The role of Probability distributions
So how do you quantify the strength of your findings? Scientists often resort to a specific method known as statistical inference, which is based on the idea that it is possible - with a degree of confidence - to generalize results from a sample to the population. To give you a more exact explanation, to infer statistically is to follow a process of deducing properties of an underlying probability distribution via the analysis of data. Note the term underlying probability distribution. Researchers using empirical and quantitative methods rely heavily on the assumption that the studied phenomenon follows the same pattern as a known probability distribution, whose characteristics and properties are well understood. Hence, the theoretical and the empirical distributions are compared so as to make inferences. In your case, you are studying the relative frequency of baby girls versus baby boys. You do your googling and find out that since your variable of interest arises from a dichotomous and largely random process (i.e., there are only two possible outcomes at each event/trial and we assume the outcome to be random), human births can be understood in statistics as ensuing from a Binomial distribution. Another example of the same stochastic process is a coin toss (heads vs. tails), where each toss is known as a Bernoulli trial. In any case, the count of one outcome across repeated random, dichotomous trials follows, in the long run, a Binomial distribution. Now that the probability distribution you assume to underlie human births has been identified, you can compare it to your data.
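As a small sketch of what that assumption buys you: once each birth is treated as an independent Bernoulli trial, the count of girls in n births follows a Binomial(n, p) distribution, and R can give you the probability of any count directly:

```r
# Under the Binomial assumption, each possible count of girls in 88 births
# has a known probability (here under the "as many girls as boys" scenario, p = 0.5).
n <- 88
p <- 0.5
round(dbinom(40:48, size = n, prob = p), 4)  # probability of exactly 40, 41, ..., 48 girls
sum(dbinom(0:n, size = n, prob = p))         # the probabilities of all possible counts sum to 1
```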
The role of Hypothesis testing
If you subscribe to the scientific method, in order to perform this comparison you need a hypothesis - a testable and falsifiable one. In your case, you would like to test whether you have enough evidence to say 'there are more baby girls being born than baby boys'. Note that a proportion of 0.5 indicates that there are as many baby girls being born as baby boys. That being said, a hypothesis could be "the proportion of baby girls being born is larger than 0.5", or equivalently, "the probability of a baby girl being born is larger than 0.5." Either way, you want to compare the estimate you drew from your data, P(Girls) = 0.5455, to the value presumed to hold in the population: P(Girls) = P(Boys) = 0.5. This is testable and falsifiable because we presume to know the properties of the stochastic process underlying our data.[2]
The role of confidence LEVEL
The last thing you need before you can assess how likely your estimate is, is to set your preferred level of confidence. One important reason why you need a level of confidence is sampling variation, which affects your population estimate P(Girls). Suppose the hospital where you collected data found your idea very interesting and decided to continue logging the gender of every birth for the next 99 days. After this period, you are shown the results of the hospital's research below. The picture shows the fluctuation of daily estimates of the probability of baby girls being born. On the y-axis, the number of times a given ratio was found is displayed. On the x-axis, the relative frequencies (or ratios, or probabilities) are represented. As before, each count per bar represents an individual day's worth of collected data. For example, the first dark blue bar (on the extreme left) means that there was one day on which the proportion of baby girls was estimated to be 0.43. On that day, there were more boys than girls born. The last bar (on the extreme right) was a day on which baby girls outnumbered baby boys by roughly two to one. What we learn from this graph is that, had you chosen another day to conduct your original survey, you would likely have found a different estimate of the population parameter of interest, P(Girls). Due to this fluctuation, it is good scientific practice to accompany your population estimate with a confidence interval. This confidence interval, which is calculated from the sample, is constructed so that X% of intervals built this way would contain the true probability of baby girls being born - the true population parameter P(Girls).
Obviously, you want to be as confident as possible, right? Yes! But when it comes to choosing a confidence level for your estimate, there are advantages and drawbacks at every level of confidence. In general, the more confident you want to be, the larger the confidence interval. And vice-versa. Talk about being caught between a rock and a hard place.
The table on the right shows exactly this for your P(Girls) estimate. If you choose a confidence level of 90%, this means that 90% of the time (9 out of 10) that you run your random experiment, the true probability of baby girls being born would be contained in the estimated interval. If, instead, you choose a confidence level of 95%, then 19 times out of 20 (95%) that you run your random experiment, the true value of P(Girls) will be contained in the confidence interval. And so on.
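That "19 times out of 20" reading can be checked by simulation. The sketch below assumes, for simplicity, 88 births per simulated day and a known true probability, and uses the normal-approximation interval discussed in the next paragraph:

```r
# Repeated-sampling interpretation of a 95% confidence interval: across many
# simulated experiments, roughly 95% of the intervals should contain the truth.
set.seed(1)
true_p <- 0.5      # assumed known truth, for the simulation only
n      <- 88       # births per simulated day (an assumption for illustration)
reps   <- 10000
covered <- replicate(reps, {
  p_hat <- rbinom(1, n, true_p) / n
  se    <- sqrt(p_hat * (1 - p_hat) / n)
  (p_hat - 1.96 * se) <= true_p && true_p <= (p_hat + 1.96 * se)
})
mean(covered)  # close to 0.95 (not exact, because the interval is approximate)
```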
But how does one calculate the confidence intervals? Again, some googling or a textbook may be necessary. But since you are in a hurry, having lost so much time reading this post, you decide to go for the fastest option: Wikipedia. There, on the page for the Binomial distribution, you learn that the approximate[3] confidence interval for a proportion estimate, p̂ ± z√(p̂(1−p̂)/n), is a function of the estimated probability (p hat), the number of trials (n), and z. Two aspects are clear from the formula. First, the higher the n, the smaller the confidence interval (given that n appears in the denominator). Second, z acts as a multiplier: the larger it is, the wider the confidence interval.
z is a value from the standard Normal distribution (mean = 0, standard deviation = 1) corresponding to the desired confidence level (e.g., 90%, 95%, or 99%). In fact, the confidence level is the area of the standard normal distribution delimited by -z and z. For example, for the standard normal distribution, the area between -1.96 and 1.96 is equal to 95% of the total area under the curve. Analogously, the area between -2.576 and 2.576 is equal to 99% of the total area of the distribution. If you are in doubt as to why we use the normal distribution to obtain a z value when constructing a confidence interval for outcomes ensuing from a binomial distribution, it is because for large enough n (in the long run), and for p hat (the estimated proportion) not too close to zero or one, the distribution of p hat converges to a normal distribution. These confidence levels are intrinsically related to p-values.[5]
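Putting the pieces together, here is a short sketch of that approximate (Wald) interval for your estimate at a few common confidence levels (the exact values in the post's table may differ slightly, depending on rounding or on the interval formula used):

```r
# Normal-approximation (Wald) interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
p_hat <- 48 / 88
n     <- 88
level <- c(0.90, 0.95, 0.99)
z     <- qnorm(1 - (1 - level) / 2)     # 1.645, 1.960, 2.576
se    <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
data.frame(level = level,
           lower = round(p_hat - z * se, 3),
           upper = round(p_hat + z * se, 3))
```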
Are we there yet?
Yes! Let's put your hypothesis to the test. As mentioned above, your hypothesis is that the ratio of baby girls being born is larger than 0.5. So, one way to test your hypothesis is to pitch it against the confidence interval you estimated. The other way is explained in detail below. You are saying the ratio of baby girls is larger than 0.50, correct? The implication is that 0.50 should not be one of the values comprising your confidence interval - in the long run. If 0.50 is contained in the confidence interval, it would mean the true probability of baby girls being born could very well be 0.5 at the given level of confidence (90%, 95%, 99%, etc.). Ergo, the data collected would not support your hypothesis that there are more baby girls being born than baby boys. Let's check back with the table above, where we estimated four confidence intervals, one for each confidence level. At every considered confidence level, the confidence interval includes 0.50. That is to say, while your estimate - or sample estimate - is 0.5455, we cannot discard the possibility that the true probability of baby girls being born is 0.50. Note that we still don't know what the 'truth' is. And if we want to get closer to it, we should continue to repeat the experiment.
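To make that check explicit, here is a small sketch that rebuilds the intervals and asks whether 0.50 falls inside each one (three common levels are used here; the post's table reportedly has four):

```r
# Does the null value 0.50 fall inside the interval at each confidence level?
p_hat <- 48 / 88
n     <- 88
for (cl in c(0.90, 0.95, 0.99)) {
  z      <- qnorm(1 - (1 - cl) / 2)
  se     <- sqrt(p_hat * (1 - p_hat) / n)
  inside <- (p_hat - z * se) <= 0.50 && 0.50 <= (p_hat + z * se)
  cat(sprintf("%.0f%% interval contains 0.50: %s\n", 100 * cl, inside))
}
```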
However, there is another frequentist interpretation wherein the p-value - and not so much an estimate with a confidence interval - is the star of the show.
The role of philosophical approaches to statistics
So far we have interpreted frequentist inference as a method with which one can obtain a point estimate (e.g., a probability, a mean difference, a count, etc.) with an accompanying measure of uncertainty (a confidence interval based on the confidence level). And this is a proper way to think of statistical inference within the frequentist paradigm. However, this approach lacks practicality for researchers seeking *an objective answer*, a decision for a given problem. Think of a pharmaceutical company testing whether a drug helps decrease the mortality of a new disease, or a materials company testing the grade of concrete and steel for construction in seismic zones. In these circumstances, null hypothesis significance testing (NHST) is a particularly useful method for statistical inference.
On the philosophical sphere, one important difference between the two frequentist camps (probabilistic vs. hypothesis-testing frequentists) ensues from the different answers given to the question: where should Statistics, or statistical inference, lead? Should it lead researchers to a conclusion/decision (even if based on probabilities), or should it lead to an estimate/probability statement with an associated confidence interval? Above, we explored the basis of the latter approach. Now we will focus on the former.
Null hypothesis significance testing (NHST)
NHST is an amalgamation of the work of Fisher and of Neyman-Pearson. And while these viewpoints differ - by quite a bit - they are unified in providing a framework for hypothesis testing.[4] The underlying reason for 'testing' is that scientists want to generalize the results ensuing from a study with a sample to the population in a way that yields a yes/no decision. As seen above, however, such generalizations may rest on biased estimates, due to chance alone but also to a variety of other factors. NHST is a method designed to quantify the validity of a given generalization from sample to population. In other words, to infer statistically will always involve probabilities - not surety - and thus error is inevitable.
Due to this, a framework was developed to assess the extent to which a given decision abides by probabilistic principles and minimizes error. In the standard approach, the relation between two variables (say X and Y) is examined. A hypothesis about the relation between the variables is proposed (say X > Y), and it is compared to an alternative proposing no relationship between the two (implying X = Y). This no-relation hypothesis is called the null hypothesis, or H0, and it is by judging whether the null is true or false that scientists weigh the likelihood of their own hypothesis being true. The reasoning behind relying on the null hypothesis is best described by one of the most important 20th-century philosophers of science, Karl Popper: the claim "all swans are white" cannot be proved true by any number of observations of white swans, as we may have failed to spot a black swan somewhere; however, it can be shown to be false by a single sighting of a black swan. Scientific theories of this universal form can therefore never be conclusively verified, though it may be possible to falsify them. The point being, it is impossible to prove a hypothesis by testing every possible outcome, but it is possible to disprove it; so, instead, science advances through disproof. And given that taking decisions based on probabilities will always give rise to the possibility of error, when comparing two hypotheses, four outcomes should be considered, based on whether the null hypothesis is true or false and on whether the decision to reject it (or fail to) was correct or wrong.
A similar reasoning explains the legal principle of presumption of innocence, in which you are innocent until proven guilty. Let's think of possible scenarios in a criminal trial. If the defendant is innocent and the court/jury acquits, then the decision and the truth match. This is the correct inference, and it is called a True Negative. "True" refers to the match between decision and truth, while "negative" has to do with a failure to reject the null. Similarly, if the defendant is indeed guilty and the court/jury decides to convict, the inference is again correct and it is termed a True Positive, where "positive" stands for rejecting the null. Statistics, and NHST, mostly concern themselves with the remaining options: when the defendant is convicted but didn't commit the crime (False Positive), and when the defendant is acquitted but did commit the crime (False Negative). These are termed Type I and Type II errors, respectively, and they are key concepts introduced by Neyman and Pearson to the Fisherian perspective.
The Type I error rate is also known as the significance level, or α level. It is the probability of rejecting the null hypothesis given that it is true. The α value is related to the confidence level explained above via the relation CL = 1 - α. Meaning that if the researcher wants to adopt a confidence level of 95%, then she or he is automatically adopting an α level of 5%. As for the testing part, the way hypothesis testing is done is relatively simple: you assess the probability of the sample data as a realization of the null hypothesis. In other words, one compares the observed data to the data one would expect if the null hypothesis were true - in both cases assuming the phenomenon follows a given stochastic process (i.e., data points are observed according to a known distribution). This is done by calculating the probability that the sample data could have been generated under the null hypothesis. Then, given a significance level, a result is deemed 'statistically significant' if the observed data would be an *unlikely* realization of the null hypothesis. How unlikely depends on the chosen significance level. Makes sense?
Let's take your research example one last time. Since we have a binary variable, it is easy to demonstrate the calculation of the p-value. Your sample consists of 88 observations, in which 48 baby girls and 40 baby boys were born, and the estimated probability is P(Girls) = 0.5455. You are interested in showing that there are more baby girls being born than baby boys. When thinking about testable hypotheses, the null hypothesis would be no difference between the proportions of baby girls and boys, mathematically represented by P(Girls) = 0.50. The alternative would then be P(Girls) > 0.50. On the philosophical sphere, we learned that you cannot show that P(Girls) > 0.50, but you can show that P(Girls) = 0.50 is so *unlikely*, given your data, that you would consider P(Girls) > 0.50 to be the more likely scenario, until there is further evidence. This is where the key to understanding inferences lies. And this is what the p-value does: it tells you how likely - or unlikely - it is that your sample data are a realization of the null hypothesis; the decision to reject then rests on the chosen significance level. So, a null hypothesis is rejected if the observed data would be an unlikely realization of it. This is why the definition of the p-value contains "assuming the null hypothesis is true": the p-value is the probability of the sample data being an outcome of the data-generating process described by the null.
Bear with me through the calculations. What is the probability of observing 48 baby girls in 88 trials, assuming the true probability is 0.5? To calculate this, we need to know two things. First, in how many ways can 88 births yield a total of 48 baby girls? The answer is the number of combinations of 88 Bernoulli trials whose sum totals 48: roughly 1.831258e25 ways, as shown below. Second, we need the probability of any one specific sequence of exactly 48 baby girls and 40 baby boys in 88 births, P(48 Girls) x P(40 Boys), which is very low, at 3.231174e-27. We multiply these two quantities to find the probability of exactly 48 baby girls in 88 births, assuming the probability is 0.5, which is almost 0.06.
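These are ordinary binomial ingredients, so they can be checked in a couple of lines of R:

```r
# Probability of exactly 48 girls in 88 births, assuming P(Girls) = 0.5.
choose(88, 48)                     # number of orderings with exactly 48 girls, ~1.83e+25
0.5^48 * 0.5^40                    # probability of any one specific ordering, ~3.23e-27
choose(88, 48) * 0.5^88            # their product, ~0.059
dbinom(48, size = 88, prob = 0.5)  # the same value straight from the binomial pmf
```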
That is to say, if you were to collect data again, assuming P(Girls) = P(Boys) = 0.50 - the null hypothesis - there is about a 6% chance that you would observe exactly 48 girls in 88 births (a ratio of 0.5455). By now, I hope the meaning of the "assuming the null hypothesis is true" part of the p-value definition is clear: we calculate the likelihood of a given configuration of results using the parameters set forth by the null hypothesis. That settled, we can now break down the "at least as extreme" part of the definition.
P-values give you a cumulative probability, rather than the probability of a single outcome. That is to say, p-values assess the probability of observing a given value (in your case, 48 out of 88) together with all other possible values that are more extreme. In your case, the p-value gives you the probability of observing 48 or more baby girls being born, so we also need to calculate the probability of observing 49, 50, 51, ..., 86, 87, 88 baby girls in 88 trials, assuming that P(Girls) = P(Boys) = 0.50. Just as an illustration, we also show the calculations for 49 baby girls.
As you can see from the table below, the probabilities decrease, as it becomes less and less likely that you would observe an increasingly larger proportion of baby girls being born if 0.50 is the true probability. So the cumulative probability of observing 48 baby girls or more, assuming P(Girls) = P(Boys) = 0.50, is 0.223. This is the big moment you were waiting for: 0.223 is the p-value when testing whether you can reject the null hypothesis (P(Girls) = 0.50) in favor of the alternative (P(Girls) > 0.50). It indicates that it is quite likely (almost 1/4 of the time) to observe a data set with 48 or more baby girls out of 88 births, assuming P(Girls) is indeed 0.50. As for the 'testing' part, it requires nothing more than comparing the obtained p-value with the criterion for significance. Since most scientists use an α level of 0.05, and the p-value of 0.223 is larger than that α level, we do not have enough evidence to reject H0. So, pending future studies, we can only consider that the probabilities of baby girls and boys being born are - roughly - the same.
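The same cumulative calculation, and the corresponding exact test, in R:

```r
# One-sided p-value: probability of 48 or more girls in 88 births under the null.
sum(dbinom(48:88, size = 88, prob = 0.5))             # ~0.223, adding up the tail
pbinom(47, size = 88, prob = 0.5, lower.tail = FALSE) # same value via the CDF
binom.test(48, 88, p = 0.5, alternative = "greater")$p.value  # exact binomial test
```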
Other researchers, however, could object that your alternative hypothesis is tendentious - in the sense that, with one day's worth of data, you should perhaps consider a broader alternative hypothesis: one that does not specify a direction, one that does not consider only half of the possible outcomes. Indeed, when setting the alternative as P(Girls) > 0.50, you are not considering P(Girls) < 0.50. And while your estimate indicates a larger proportion of baby girls, you shouldn't assume that your sample's suggestion reflects the truth. So, perhaps more appropriately, you would like to also consider P(Girls) < 0.50. In that case, your null hypothesis is P(Girls) = 0.50 and your alternative is P(Girls) ≠ 0.50, which covers both directions. What, then, do we need to do to calculate the p-value for this non-directional hypothesis? We can either repeat the process for 40, 39, 38, ..., 0 baby girls being born, or multiply the calculated p-value by 2; both yield the same result here. Note that you should start at 40, not 48 or 44. This is because you observed 4 more girls than the 44 you would expect if P(Girls) = 0.50, so you calculate the cumulative probability of a deviation of 4 from the expectation in the other direction: 44 - 4 = 40. The p-value is 0.223 * 2 = 0.446. The increase in the p-value shows that, by considering both directions, it is even more common (twice as common) to observe a deviation of at least 4 births from the expected value of 44, assuming P(Girls) = 0.50 in 88 trials.
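And the non-directional version - both tails, or equivalently (because the null of 0.5 makes the distribution symmetric) twice the one-sided value:

```r
# Two-sided p-value: 48 or more girls OR 40 or fewer girls, under the null.
p_upper <- pbinom(47, size = 88, prob = 0.5, lower.tail = FALSE)  # P(X >= 48)
p_lower <- pbinom(40, size = 88, prob = 0.5)                      # P(X <= 40)
p_upper + p_lower                                                 # ~0.446 = 2 * 0.223
binom.test(48, 88, p = 0.5, alternative = "two.sided")$p.value    # matches here, by symmetry
```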
The role of Power
Now you say "OK, great. But I was confident there were more baby girls than baby boys being born. Before I am completely convinced I was wrong, is there perhaps something I may have missed precluding me from arriving at the correct conclusion?" And despite the fact that the answer to this sort of question in Science is always "yes" (i.e., in the best case scenario, there are always improvements to be made), there is one key aspect of hypothesis testing we have not yet addressed: Type II error. This occurs when there is in fact a true difference (in your case, in the proportion of ratios) but hypothesis testing fails to reject the null hypothesis. This is termed a false negative. It is when one is guilty of a crime, but the court/jury acquits the defendant. In many ways, Type II error is the other side of Type I error, where you false conclude there is a difference, when it is not (i.e., false positive). Both are said to be "False" because it is the wrong decision. Thus, ideally, we want to avoid both, and always find effects when they exist, and fail to find them when they don't. Curiously, Type II error rate is deemed the lesser evil (in comparison to Type I error) by the scientific community. Perhaps because the harm done when incurring in this type of error is the maintenance of the status quo. Personally, I have my doubts about this, and the reproducubility crises in science can better showcase this point.
But more to the point, the Type II error rate, or β, depends on three components: the magnitude of the effect, the sample size (or sampling error), and the significance level used. The rationale is the following. Assuming a constant significance level, if the magnitude of the effect is small, a larger sample size is necessary to detect "a signal" (or reject the null). If the magnitude of the effect is large, then the sample size can be smaller and still avoid a Type II error. But if you are somewhat acquainted with the scientific method or hypothesis testing, β is hardly ever mentioned. Instead, researchers tend to speak of 1 - β, which is known as power - as in, power to detect a signal. Power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. Importantly, power analysis can be used not only to calculate β (the probability of a false negative) and power (1 - β), but also the minimum sample size n required so that one is reasonably likely to detect an effect. So, you can use this knowledge to better understand your study's results. By using the formulas in the top-right corner, you would find that the power (1 - β) of your study is 0.214 if you were only interested in showing P(Girls) > P(Boys), or 0.137 for P(Girls) ≠ P(Boys). This means that, at best, you had a 0.214 probability of rejecting the null when the alternative is in fact true. Put differently, the false negative rate (Type II error) of the non-directional test is about 86%. From these calculations, it becomes clear that the sample size (N = 88) is too small to identify such a small difference in proportions - from P(Girls) = 0.50 to P(Girls) = 0.5455, whose nominal effect size is 0.091. Indeed, using the last formula above to calculate the minimal sample size necessary to detect a significant difference of the surveyed magnitude, the results show a whopping 948 births are necessary to be 80% sure that you are able to detect a meaningful difference when there is one. 80% is usually the power sought in academia, but different fields use different power thresholds (think of medicine, where doctors may prefer a false positive, which requires additional confirmatory tests, over a false negative, which sends a sick patient home). If you would like to be 90% sure, your study would need 1261 observed births, 1560 for a 95% true positive rate, and 2205 for 99%. In the long run, that is. Also bear in mind that these numbers relate to testing the non-directional hypothesis, that is, considering both P(Girls) > P(Boys) and P(Girls) < P(Boys). The required sample size would be slightly smaller if you only considered P(Girls) > P(Boys).
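The post does not show the formulas in that corner, but one way to obtain figures in this ballpark - an assumption on my part - is a one-sample proportion power analysis with Cohen's h as the effect size, for example via the pwr package:

```r
# Hedged sketch: power and required sample size for a one-sample test of a
# proportion against 0.50, using Cohen's h (my reconstruction, not necessarily
# the exact formulas used in the post).
library(pwr)

h <- ES.h(48/88, 0.5)  # Cohen's h for 0.5455 vs 0.50, ~0.091

pwr.p.test(h = h, n = 88, sig.level = 0.05, alternative = "greater")$power    # ~0.21
pwr.p.test(h = h, n = 88, sig.level = 0.05, alternative = "two.sided")$power  # ~0.14

# Births needed for 80% power with the non-directional test:
pwr.p.test(h = h, power = 0.80, sig.level = 0.05, alternative = "two.sided")$n  # roughly 950
```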
To think about it in another way, imagine 99 other independent researchers were to replicate your study with the same protocol, without pooling their data together. And let's assume you were right, and that the true probability of girls being born, P(Girls), is 0.55. Then, by cataloging all births for each day, we can plot a summary of these 100 studies, as displayed on the right. Each point represents the ratio/proportion found on one specific day, while the line represents its confidence interval. Colored in red are the days on which the researcher would conclude that P(Girls) > P(Boys), and in black those on which there is not enough evidence to reject P(Girls) = P(Boys). There are 17 instances in which the confidence interval does not include 0.50, and 83 instances in which it does. Note that these numbers are slightly different from those given above by the power and Type II error rates. This is because those always refer to "the long run": if these 100 replications were themselves repeated over and over, then on average researchers would identify a significant difference (a p-value lower than 0.05, or a confidence interval not including 0.5) about 14% of the time. And yes, p-values and confidence intervals are equivalent at the same significance level. I personally prefer the latter because it has one important advantage: in comparison to p-values, confidence intervals reflect the results at the level at which the data were measured. Another interesting implication of the plot above is that if you were to rely on one day's worth of data, on most days you would draw the 'wrong' conclusion. For this reason, power analysis is an integral part of study design. Ideally, these analyses should be done prior to data collection.[6]
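A sketch of that replication thought experiment, assuming for simplicity a fixed 88 births per day (the real daily totals would of course vary):

```r
# 100 independent "days", each analysed with an exact binomial test against 0.50,
# when the true probability of a girl is 0.55.
set.seed(123)
days     <- 100
true_p   <- 0.55
n_births <- 88   # assumed constant per day, for illustration only
significant <- replicate(days, {
  girls <- rbinom(1, n_births, true_p)
  binom.test(girls, n_births, p = 0.5, alternative = "two.sided")$p.value < 0.05
})
sum(significant)  # days that would reject P(Girls) = 0.50; typically in the low-to-high teens
```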
To sum up
P-values only make sense under the umbrella of frequentist statistics. This sub-field of statistics has two main camps: one holds that statistics should be about probabilistic estimates accompanied by confidence intervals, while the other recognizes the importance and/or necessity of providing objective decisions based on data. NHST is used to determine whether there is enough evidence in a sample of data to infer that a certain condition holds for the entire population. In this process, p-values play an important role; however, it is also fundamental to conduct an a priori power analysis so as to ensure the study is properly designed for the investigation at hand. Science progresses through disproof. Einstein reportedly said, 'A thousand scientists can't prove me right, but one can prove me wrong.' We can't prove a hypothesis true, but we can prove its falsehood.[7] [8]
Footnotes
[1] Particularly in the social sciences.
[2] Bayesians would likely disagree that this rationale is useful and argue that testing against a null of no effect (H0) yields very little information. Instead, Bayesians use Bayes' theorem to update the probability of a hypothesis as more evidence or information becomes available.
[3] The approximate formula for the confidence interval of proportions should only be used when the population size is at least 20 times larger than the sample size.
[4] Confidence level: γ = (1 − α). This is the probability of not rejecting the null hypothesis given that it is true. Confidence levels and confidence intervals were introduced by Neyman in 1937.
[5] In practice, this is closer to Fisher's ideas than to Neyman-Pearson's.
[6] In the future, I will include a meta-analysis of these studies to showcase the importance of cumulative and open-science practices.
[7] In the future, I will include a thought experiment showing how p-values are counter-intuitive and completely backward from what a scientist generally wants.
[8] Soon I will include the R code and a link to an R Markdown document with all of these calculations on GitHub.