Understanding Correlation and Regression: A Practical Guide
Correlation and regression are two of the most widely used statistical techniques. Correlation measures the strength and direction of the linear relationship between two variables, while regression finds the best-fitting line to predict one variable from another. Together, they form the backbone of data analysis in science, business, economics, and social research.
This guide covers the Pearson correlation coefficient, interpreting correlation values, simple linear regression, the least squares method, coefficient of determination, and common pitfalls. We work through full numerical examples so you can apply these techniques with confidence.
What Is Correlation?
Correlation quantifies how closely two variables move together. The most common measure is the Pearson correlation coefficient $r$, which ranges from $-1$ to $+1$.
- $r = +1$: Perfect positive linear relationship (as one variable increases, so does the other)
- $r = -1$: Perfect negative linear relationship (as one increases, the other decreases)
- $r = 0$: No linear relationship
The Pearson Correlation Formula
For paired data $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

This can also be written in terms of deviations from the means:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
Example 1: Computing the Correlation Coefficient
Consider the following data on hours studied ($x$) and exam score ($y$):

| Hours ($x$) | Score ($y$) |
|---|---|
| 2 | 55 |
| 4 | 62 |
| 5 | 70 |
| 7 | 78 |
| 10 | 90 |
First, compute the needed sums ($n = 5$):

$$\sum x = 28, \quad \sum y = 355, \quad \sum xy = 2154, \quad \sum x^2 = 194, \quad \sum y^2 = 25953$$

Now apply the formula:

$$r = \frac{5(2154) - (28)(355)}{\sqrt{[5(194) - 28^2][5(25953) - 355^2]}} = \frac{830}{\sqrt{186 \times 3740}} \approx \frac{830}{834.05} \approx 0.995$$

An $r$ value of 0.995 indicates a very strong positive linear relationship between hours studied and exam score.
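The calculation in Example 1 can be verified with a short script; this is a minimal sketch that recomputes the running sums and applies the Pearson formula directly.

```python
import math

# Data from Example 1: hours studied vs exam score.
x = [2, 4, 5, 7, 10]
y = [55, 62, 70, 78, 90]
n = len(x)

sum_x = sum(x)                                # 28
sum_y = sum(y)                                # 355
sum_xy = sum(a * b for a, b in zip(x, y))     # 2154
sum_x2 = sum(a * a for a in x)                # 194
sum_y2 = sum(b * b for b in y)                # 25953

num = n * sum_xy - sum_x * sum_y
den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den
print(round(r, 3))  # 0.995
```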
Interpreting Correlation Values
While the exact thresholds vary by field, a common guideline is:
- $|r| < 0.3$: Weak correlation
- $0.3 \le |r| < 0.7$: Moderate correlation
- $|r| \ge 0.7$: Strong correlation
Always remember: correlation does not imply causation. Two variables can be strongly correlated without one causing the other. A third variable (a confounding variable) might drive both, or the relationship might be coincidental.
Simple Linear Regression
While correlation measures the strength of a relationship, regression finds the equation of the line that best fits the data. The simple linear regression model is:

$$\hat{y} = a + bx$$

where $b$ is the slope (how much $\hat{y}$ changes per unit increase in $x$) and $a$ is the y-intercept (the predicted value of $y$ when $x = 0$).
The Least Squares Method
The “best fit” is defined as the line that minimises the sum of squared residuals (the vertical distances between data points and the line). The formulas for the slope and intercept are:

$$b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}, \qquad a = \bar{y} - b\bar{x}$$
Example 2: Finding the Regression Line
Using the same data from Example 1, find the regression equation.
The slope:

$$b = \frac{5(2154) - (28)(355)}{5(194) - 28^2} = \frac{830}{186} \approx 4.462$$

The means: $\bar{x} = 28/5 = 5.6$ and $\bar{y} = 355/5 = 71$.

The intercept:

$$a = 71 - 4.462 \times 5.6 \approx 46.01$$

The regression equation is:

$$\hat{y} = 46.01 + 4.462x$$

Interpretation: for each additional hour of study, the predicted exam score increases by about 4.46 points.
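The slope and intercept from Example 2 can be checked numerically; this is a minimal sketch using the standard least-squares formulas.

```python
# Least-squares fit for the Example 1 data: hours studied vs exam score.
x = [2, 4, 5, 7, 10]
y = [55, 62, 70, 78, 90]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * (sum_x / n)                               # intercept: y-bar - b * x-bar
print(f"slope = {b:.3f}, intercept = {a:.2f}")  # slope = 4.462, intercept = 46.01
```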
Example 3: Making Predictions
Using our regression equation, predict the exam score for a student who studies 6 hours:

$$\hat{y} = 46.01 + 4.462(6) = 46.01 + 26.77 \approx 72.8$$
The predicted score is approximately 73. This is interpolation (within the range of our data), which is generally reliable. Extrapolating far beyond the data range (e.g. predicting for 50 hours) would be unreliable.
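One way to respect the interpolation caveat in code is to guard the prediction function; this sketch uses the rounded coefficients from Example 2, with range limits taken from the observed data (2 to 10 hours).

```python
def predict_score(hours, lo=2, hi=10):
    """Predict an exam score from the line y-hat = 46.01 + 4.462x.

    Refuses to extrapolate outside the observed range of hours,
    where the fitted line is unreliable.
    """
    if not lo <= hours <= hi:
        raise ValueError(
            f"{hours} h lies outside the observed range [{lo}, {hi}]; "
            "extrapolation is unreliable"
        )
    return 46.01 + 4.462 * hours

print(round(predict_score(6)))  # 73
```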
Coefficient of Determination (R-Squared)
The coefficient of determination, $r^2$, tells you what proportion of the variation in $y$ is explained by the linear relationship with $x$. For simple linear regression it is the square of the Pearson correlation coefficient, and it can also be computed as:

$$r^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
For our example, $r^2 = (0.995)^2 \approx 0.990$. This means about 99% of the variation in exam scores is explained by the number of hours studied; only about 1% is due to other factors.
An $r^2$ close to 1 indicates an excellent fit; close to 0 indicates the model explains very little of the variation.
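The explained-variation interpretation can be checked directly; this sketch refits the line from Example 2 and computes $r^2$ as one minus the ratio of unexplained to total variation.

```python
x = [2, 4, 5, 7, 10]
y = [55, 62, 70, 78, 90]
n = len(x)
mean_y = sum(y) / n

b = 830 / 186                        # slope from the worked example
a = mean_y - b * (sum(x) / n)        # intercept: y-bar - b * x-bar
y_hat = [a + b * xi for xi in x]     # predicted scores

ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.99
```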
Spearman Rank Correlation
The Pearson coefficient assumes a linear relationship. When the relationship is monotonic but not necessarily linear, or when the data contains outliers, the Spearman rank correlation is more appropriate. It applies the Pearson formula to the ranked data:

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the ranks of each pair of observations. This formula assumes no tied ranks; when ties exist, use the general Pearson formula on the ranks.
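A minimal pure-Python sketch of the shortcut formula. It assumes no tied values, and `ranks` is a small helper defined here, not a library function.

```python
def ranks(values):
    """Rank positions 1..n for each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)), no tied ranks."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Monotonic but nonlinear data: Spearman reports a perfect relationship.
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # 1.0
```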
Residual Analysis
A residual is the difference between the observed value and the predicted value: $e_i = y_i - \hat{y}_i$. Examining residuals helps validate the regression model:
- Residuals should be randomly scattered around zero.
- If residuals show a pattern (e.g. a curve), the linear model may be inappropriate.
- If residuals fan out (increase in spread), heteroscedasticity is present.
- Large individual residuals may indicate outliers.
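The checks above can be started numerically; this sketch lists the residuals for the Example 2 fit and confirms they sum to zero, as least-squares residuals always must.

```python
x = [2, 4, 5, 7, 10]
y = [55, 62, 70, 78, 90]
b = 830 / 186            # slope from the worked example
a = 71 - b * 5.6         # intercept: y-bar - b * x-bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
for xi, e in zip(x, residuals):
    print(f"x = {xi:2}  residual = {e:+.2f}")

# For a least-squares fit the residuals sum to (numerically) zero, so a
# clearly non-zero sum signals a computation error, not a bad model.
print(abs(sum(residuals)) < 1e-9)  # True
```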
Common Mistakes
- Assuming causation from correlation. A strong correlation between ice cream sales and drowning rates does not mean ice cream causes drowning. Both increase in summer.
- Ignoring outliers. A single extreme point can dramatically change the correlation coefficient and regression line. Always plot your data.
- Extrapolating beyond the data range. The regression line is only reliable within the range of observed values.
- Using $r^2$ alone. A high $r^2$ does not guarantee the model is appropriate. Always check residual plots.
- Confusing $r$ and $r^2$. An $r$ of 0.7 sounds strong, but $r^2 = 0.49$ means only 49% of the variation is explained.
Multiple Regression
Simple linear regression uses one predictor variable. In practice, outcomes are influenced by many factors. Multiple regression extends the model:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

Each coefficient $b_i$ represents the effect of $x_i$ on $y$, holding all other predictors constant. While the calculations become more complex (requiring matrix algebra), the underlying principle is the same: minimise the sum of squared residuals.
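As an illustration of the same least-squares principle with two predictors, this sketch builds the normal equations $(X^\top X)\mathbf{b} = X^\top \mathbf{y}$ and solves them with plain Gaussian elimination. The tiny dataset is hypothetical, constructed so that $y = 1 + 2x_1 + 3x_2$ holds exactly, which means the recovered coefficients are known in advance.

```python
def solve(A, v):
    """Solve the linear system A b = v by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]                        # pivot for stability
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        b[r] = (M[r][n] - sum(M[r][k] * b[k] for k in range(r + 1, n))) / M[r][r]
    return b

def multiple_regression(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(row) for row in X]            # prepend an intercept column
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical data built to satisfy y = 1 + 2*x1 + 3*x2 exactly.
X = [(0, 0), (1, 0), (0, 1), (1, 1)]
y = [1, 3, 4, 6]
print(multiple_regression(X, y))  # [1.0, 2.0, 3.0]
```

In production code a library routine (e.g. a QR-based solver) is preferred, since the normal equations can be numerically ill-conditioned, but the principle is the same.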
Try It Yourself
Compute correlation coefficients and explore relationships in your data with our free Correlation Calculator. Enter paired data values and get the Pearson correlation, regression equation, and $r^2$ value instantly. For related statistical analysis, explore our Standard Deviation Calculator and Probability Calculator.