P-Value Explained: What It Means and How to Calculate It
The p-value is one of the most frequently cited, and most frequently misunderstood, concepts in statistics. It appears in virtually every scientific paper that uses statistical testing, yet many people struggle to explain what it actually means. This guide provides a clear, thorough explanation of p-values: what they are, what they are not, how to calculate them, and how to interpret them correctly.
We cover significance levels, one-tailed versus two-tailed tests, worked examples using z-scores and t-scores, and the most common misconceptions that lead to incorrect conclusions. By the end, you will be able to confidently interpret p-values in any statistical context.
What Is a P-Value?
A p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
Let us break that definition down carefully. In hypothesis testing, you start with a null hypothesis (H₀), which typically states that there is no effect, no difference, or no relationship. You then collect data and compute a test statistic (such as a z-score, t-score, or chi-square value). The p-value answers this question: if the null hypothesis were true, how likely would it be to see results this extreme or more extreme, purely by chance?
A small p-value (close to 0) means the observed data would be very unlikely under the null hypothesis. This gives you evidence against the null hypothesis. A large p-value (close to 1) means the data is consistent with what you would expect under the null hypothesis.
What a P-Value Is NOT
Before going further, it is essential to address the most dangerous misconception.
A p-value is NOT the probability that the null hypothesis is true.
A p-value of 0.03 does not mean there is a 3% chance that the null hypothesis is true. The p-value is calculated under the assumption that the null hypothesis is already true. It tells you about the probability of the data given the hypothesis, not the probability of the hypothesis given the data. This distinction is fundamental and is the source of countless errors in scientific interpretation.
Similarly, a p-value is not the probability that the result happened by chance. It is not the probability of making an error. And a p-value of 0.05 does not mean the result is "95% certain." These are all common but incorrect interpretations.
Significance Levels
A significance level (denoted α) is a threshold chosen before the test is conducted. It defines how small the p-value must be for you to reject the null hypothesis. The most common significance levels are:
- α = 0.05 (5%): The standard threshold in most fields. If the p-value is below 0.05, the result is considered "statistically significant."
- α = 0.01 (1%): A more stringent threshold, used when you want stronger evidence before rejecting the null hypothesis.
- α = 0.10 (10%): A more lenient threshold, sometimes used in exploratory research or when the cost of missing a real effect is high.
The significance level also represents the Type I error rate: the probability of rejecting the null hypothesis when it is actually true (a false positive). Setting α = 0.05 means you accept a 5% risk of incorrectly rejecting a true null hypothesis.
One-Tailed vs Two-Tailed Tests
The choice between a one-tailed and two-tailed test affects how the p-value is calculated.
Two-Tailed Test
A two-tailed test checks for a difference in either direction. The null hypothesis is that there is no difference (e.g., H₀: μ = μ₀), and the alternative hypothesis is that the mean differs from the hypothesised value in either direction (H₁: μ ≠ μ₀).
The p-value is the probability of observing a test statistic as extreme as (or more extreme than) the observed value in either tail of the distribution. For a symmetric distribution like the normal or t-distribution, this means doubling the one-tail probability.
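The doubling step can be seen directly with a short computation. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the z-score of 1.96 is just a familiar illustrative value, not from a specific test.

```python
from statistics import NormalDist  # standard library, Python 3.8+

# One-tail probability for an observed z-score of 1.96
z = 1.96
one_tail = 1 - NormalDist().cdf(z)   # P(Z >= 1.96), the upper-tail area

# Two-tailed p-value: double the one-tail probability,
# since a symmetric distribution puts equal area in each tail
two_tailed_p = 2 * one_tail

print(round(one_tail, 3))      # 0.025
print(round(two_tailed_p, 3))  # 0.05
```

This is why 1.96 is the familiar critical value for a two-tailed test at the 5% level: each tail holds 2.5% of the probability.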
One-Tailed Test
A one-tailed test checks for a difference in a specific direction only. For example, H₁: μ > μ₀ (right-tailed) or H₁: μ < μ₀ (left-tailed). The p-value only considers one tail of the distribution.
One-tailed tests are more powerful for detecting effects in the specified direction, but they cannot detect effects in the opposite direction. You should only use a one-tailed test when you have a strong prior reason to expect the effect in one direction and would not care about a result in the other direction.
Worked Example 1: Z-Test for a Population Mean
A factory claims its light bulbs last an average of 1,000 hours. You test a random sample of 36 bulbs and find a sample mean of 985 hours. The population standard deviation is known to be 60 hours. Is there evidence that the true mean differs from 1,000?
Step 1: State the Hypotheses
H₀: μ = 1000 hours (the claim is accurate)
H₁: μ ≠ 1000 hours (the true mean differs from the claim)
This is a two-tailed test.
Step 2: Calculate the Z-Score
z = (x̄ − μ₀) / (σ/√n) = (985 − 1000) / (60/√36) = −15 / 10 = −1.5
Step 3: Find the P-Value
For a two-tailed test, we need the probability of |Z| ≥ 1.5. Using a standard normal table or calculator:
P(Z ≤ −1.5) = 0.0668
Since this is two-tailed, multiply by 2:
p = 2 × 0.0668 = 0.1336
Step 4: Interpret
The p-value is 0.1336, which is greater than α = 0.05. We fail to reject the null hypothesis. There is not sufficient evidence to conclude that the mean bulb lifespan differs from 1,000 hours. The observed difference of 15 hours could reasonably be due to sampling variability.
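The four steps above can be checked numerically. This sketch uses only the standard library; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

# Worked Example 1: two-tailed z-test for the bulb lifespan claim
mu0, xbar, sigma, n = 1000, 985, 60, 36

# Step 2: the z-score, (985 - 1000) / (60 / 6) = -1.5
z = (xbar - mu0) / (sigma / sqrt(n))

# Step 3: two-tailed p-value, probability of a result at least
# this extreme in either tail
p = 2 * NormalDist().cdf(-abs(z))

print(round(z, 2))  # -1.5
print(round(p, 4))  # 0.1336
```

Since 0.1336 > 0.05, the code reaches the same conclusion as the table lookup: fail to reject H₀.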
Try it yourself
Use our P-Value Calculator to convert any z-score or t-score into a p-value instantly. You can also compute z-scores with our Z-Score Calculator.
Worked Example 2: T-Test for a Small Sample
A teacher believes a new study method improves test scores. She tests it on 12 students and records their improvement (post-test minus pre-test score). The results show a mean improvement of 4.2 points with a sample standard deviation of 6.1 points. Is there evidence that the method improves scores?
Step 1: State the Hypotheses
H₀: μ = 0 (the method produces no improvement)
H₁: μ > 0 (the method produces a positive improvement)
This is a one-tailed (right-tailed) test because we are specifically looking for improvement.
Step 2: Calculate the T-Score
t = (x̄ − 0) / (s/√n) = 4.2 / (6.1/√12) ≈ 4.2 / 1.761 ≈ 2.385
Step 3: Find the P-Value
With df = n − 1 = 11 and a one-tailed test, we look up the probability of T ≥ 2.385 in the t-distribution with 11 degrees of freedom:
p ≈ 0.018
Step 4: Interpret
The p-value is approximately 0.018, which is less than α = 0.05. We reject the null hypothesis. There is statistically significant evidence that the new study method produces a positive improvement in test scores.
Note: if we had used a two-tailed test instead, the p-value would be 2 × 0.018 = 0.036, which would still be significant at the 0.05 level. However, the one-tailed test is appropriate here because we had a directional hypothesis from the start.
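The t-test above can be reproduced in a few lines. The standard library has no t-distribution, so this sketch assumes SciPy is available for the tail probability; the variable names are our own.

```python
from math import sqrt
from scipy import stats  # assumed available; provides the t-distribution

# Worked Example 2: one-sample, right-tailed t-test on score improvements
xbar, s, n = 4.2, 6.1, 12
df = n - 1  # 11 degrees of freedom

# Step 2: the t-score, 4.2 / (6.1 / sqrt(12))
t = (xbar - 0) / (s / sqrt(n))

# Step 3: right-tail probability P(T >= t) via the survival function
p = stats.t.sf(t, df)

print(round(t, 3))  # 2.385
print(round(p, 3))  # about 0.018
```

In practice you would more often call `scipy.stats.ttest_1samp` on the raw improvement scores; the manual version is shown here to mirror the worked steps.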
Try it yourself
Run your own t-test with our T-Test Calculator to get the t-statistic, degrees of freedom, and p-value for your data.
Worked Example 3: Two-Sample Z-Test
A company wants to know whether its new website design leads to longer visit durations. They run an A/B test with 50 visitors on the old design (Group A) and 50 on the new design (Group B). Assume the population standard deviation for visit duration is 2.5 minutes for both groups.
- Group A (old): 50 visitors, sample mean visit duration x̄_A
- Group B (new): 50 visitors, sample mean visit duration x̄_B, with an observed difference of x̄_B − x̄_A = 0.8 minutes
Step 1: State the Hypotheses
H₀: μ_B = μ_A (the new design makes no difference)
H₁: μ_B > μ_A (the new design increases visit duration)
This is a one-tailed (right-tailed) test.
Step 2: Calculate the Z-Score
z = (x̄_B − x̄_A) / (σ √(1/n_A + 1/n_B)) = 0.8 / (2.5 × √(1/50 + 1/50)) = 0.8 / 0.5 = 1.6
Step 3: Find the P-Value
For a one-tailed test (right-tailed):
p = P(Z ≥ 1.6) = 0.0548
Step 4: Interpret
The p-value is 0.0548, which is just above the 0.05 threshold. At the conventional level, we fail to reject the null hypothesis. The observed difference of 0.8 minutes, while suggestive, does not reach statistical significance. This does not prove there is no difference. It means the evidence is not strong enough to conclude there is one, given the sample sizes and variability.
This example illustrates an important point: a p-value just above 0.05 should not be dismissed as "no effect." It suggests the data is borderline. Collecting more data or reducing variability might yield a clearer result.
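The borderline result above can be verified with the standard library. This is a sketch of the two-sample z-test; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

# Worked Example 3: two-sample z-test for the A/B test
diff = 0.8        # observed difference in mean visit duration (new - old), minutes
sigma = 2.5       # known population standard deviation, both groups
n_a = n_b = 50

# Standard error of the difference between two independent sample means
se = sigma * sqrt(1 / n_a + 1 / n_b)   # = 0.5

z = diff / se                           # = 1.6
p = 1 - NormalDist().cdf(z)             # right-tailed p-value

print(round(z, 2))  # 1.6
print(round(p, 4))  # 0.0548
```

Note how close p = 0.0548 sits to the 0.05 threshold: a slightly larger sample, or slightly less variability, could tip the result either way.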
The Relationship Between P-Values and Confidence Intervals
P-values and confidence intervals are two sides of the same coin. If you construct a 95% confidence interval for a parameter and the hypothesised value falls outside the interval, the two-tailed p-value will be less than 0.05. If the hypothesised value is inside the interval, the p-value will be greater than 0.05.
Confidence intervals have the advantage of showing both the direction and magnitude of an effect, which a p-value alone cannot do. Many statisticians recommend reporting confidence intervals alongside or instead of p-values for this reason.
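The duality can be demonstrated with the bulb data from Worked Example 1: the hypothesised mean of 1,000 falls inside the 95% confidence interval exactly when the two-tailed p-value exceeds 0.05. A standard-library sketch:

```python
from math import sqrt
from statistics import NormalDist

# Data from Worked Example 1
xbar, sigma, n, mu0 = 985, 60, 36, 1000
se = sigma / sqrt(n)

# 95% confidence interval for the mean
z_crit = NormalDist().inv_cdf(0.975)   # about 1.96
ci = (xbar - z_crit * se, xbar + z_crit * se)

# Two-tailed p-value against mu0
z = (xbar - mu0) / se
p = 2 * NormalDist().cdf(-abs(z))

inside = ci[0] <= mu0 <= ci[1]
print(inside, p > 0.05)   # both True: mu0 inside the interval <=> p > 0.05
```

Here the interval is roughly (965.4, 1004.6), so 1,000 is inside it, and correspondingly p = 0.1336 > 0.05.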
Common Misconceptions
- "P-value = probability the null hypothesis is true." Incorrect. The p-value assumes the null hypothesis is true and measures the probability of the observed data (or more extreme data) under that assumption.
- "A non-significant p-value means there is no effect." Incorrect. Failure to reject the null hypothesis is not the same as proving it true. The study may have been underpowered (too few observations) to detect a real effect.
- "P = 0.05 is a magic boundary." The choice of 0.05 as a threshold is a convention, not a law of nature. A p-value of 0.049 is not meaningfully different from 0.051. Treat p-values as a continuous measure of evidence, not a binary pass/fail.
- "A small p-value means a large effect." Incorrect. With a very large sample, even a tiny, practically meaningless difference can produce a very small p-value. Statistical significance does not imply practical significance. Always consider effect size alongside the p-value.
- "If I test enough things, I will find significance." This is the multiple testing problem. If you perform 20 independent tests at , you expect one to be significant by chance alone, even if there is no real effect. Methods like the Bonferroni correction or false discovery rate control address this issue.
Effect Size: What P-Values Cannot Tell You
A p-value tells you whether an effect is statistically distinguishable from zero. It does not tell you how large or important the effect is. For that, you need an effect size measure.
Common effect size measures include Cohen's d (the difference between means divided by the pooled standard deviation), the correlation coefficient r, and odds ratios. Reporting effect sizes alongside p-values gives a much more complete picture of your findings.
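Cohen's d, mentioned above, is straightforward to compute. The group statistics below are hypothetical numbers chosen purely for illustration:

```python
from math import sqrt

# Cohen's d for two independent groups (hypothetical data)
mean1, mean2 = 78.0, 74.0   # group means
s1, s2 = 8.0, 10.0          # group sample standard deviations
n1 = n2 = 30                # group sizes

# Pooled standard deviation: weighted average of the two variances
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Cohen's d: difference between means in units of pooled SD
d = (mean1 - mean2) / sp
print(round(d, 2))   # about 0.44
```

By Cohen's rough benchmarks (0.2 small, 0.5 medium, 0.8 large), this would be a small-to-medium effect, information no p-value can convey on its own.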
A Note on P-Hacking
P-hacking refers to practices that manipulate the analysis to achieve a p-value below 0.05. This includes selectively reporting results, adding or removing data points after seeing the results, trying multiple statistical tests and only reporting the one that works, or repeatedly checking the p-value as data accumulates and stopping when it crosses 0.05.
These practices inflate the false positive rate far beyond the nominal 5%. To guard against p-hacking, pre-register your analysis plan, report all tests performed (not just significant ones), and use appropriate corrections for multiple comparisons.
Try it yourself
Use our P-Value Calculator to find p-values from z-scores, t-scores, chi-square values, or F-statistics. Pair it with our Z-Score Calculator or T-Test Calculator for a complete hypothesis testing workflow.
Frequently Asked Questions
What does a p-value of 0.05 mean?
A p-value of 0.05 means that if the null hypothesis were true, there would be a 5% chance of observing data as extreme as (or more extreme than) what you actually observed. At the conventional significance level of 0.05, this is right on the boundary: you would just barely reject the null hypothesis. It does not mean there is a 5% chance the null hypothesis is true, or a 95% chance the alternative is true.
Can a p-value be exactly 0?
In theory, a p-value can be extremely small but never truly zero, since there is always some non-zero probability of any outcome under a continuous distribution. In practice, software may report p-values as 0.000 or "p < 0.001" when the value is too small to display. This means the observed result is extremely unlikely under the null hypothesis, but it does not mean it is impossible.
What is the difference between a p-value and a significance level?
The significance level (α) is chosen before the test and represents the threshold for rejecting the null hypothesis. The p-value is calculated from the data and compared against α. If p ≤ α, you reject the null hypothesis. The significance level is a decision rule; the p-value is a summary of the evidence from the data.
Should I use 0.05 or 0.01 as my significance level?
It depends on the consequences of errors. If a false positive would be costly or dangerous (such as approving an ineffective medical treatment), use a stricter threshold like 0.01 or even 0.001. If missing a real effect would be more costly than a false alarm, a more lenient threshold like 0.10 may be appropriate. The choice should be justified by the context, not by convention alone.
Why do some researchers want to replace p-values?
Criticisms of p-values centre on their widespread misinterpretation and misuse. Some researchers advocate for Bayesian approaches that directly estimate the probability of hypotheses, while others promote reporting effect sizes and confidence intervals as primary results. The American Statistical Association published a statement in 2016 emphasising that p-values should not be used as the sole basis for scientific conclusions and that the 0.05 threshold should not be treated as a rigid boundary. The goal is not to eliminate p-values but to use them correctly alongside other measures of evidence.
Related Articles
Probability Distributions Explained: Normal, Binomial, and More
Understand the most important probability distributions: normal, binomial, Poisson, and uniform. Learn when to use each and how to calculate probabilities.
Chi-Square Test Explained: When and How to Use It
Learn how to perform chi-square goodness of fit and independence tests. Covers the formula, degrees of freedom, p-value interpretation, and worked examples.
ANOVA Explained: One-Way Analysis of Variance Guide
Learn how to perform one-way ANOVA to compare means across multiple groups. Covers the F-statistic, SS/MS calculations, assumptions, and post-hoc tests.