Linear Regression: Finding the Line of Best Fit
Linear regression is the workhorse of statistical modelling. It fits a straight line to a set of data points in a way that minimises the total squared vertical distance between the observed values and the line. Whether you are predicting house prices, analysing experimental results, or exploring the relationship between study hours and exam scores, linear regression is almost always the first tool you reach for. This guide covers the method from the ground up: formulas, interpretation, worked examples, and the assumptions you need to check.
What Is Linear Regression?
Linear regression models the relationship between a dependent variable $y$ and one or more independent variables $x$. In simple (one-variable) linear regression, the model takes the form:

$$\hat{y} = b_0 + b_1 x$$

where $b_0$ is the y-intercept (the predicted value of $y$ when $x = 0$) and $b_1$ is the slope (the change in $\hat{y}$ for each one-unit increase in $x$). The goal is to find the values of $b_0$ and $b_1$ that best fit the data.
The Least Squares Method
The most common approach to fitting the line is ordinary least squares (OLS). For each data point $(x_i, y_i)$, the residual $e_i$ is the difference between the observed value and the predicted value:

$$e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$$

OLS chooses $b_0$ and $b_1$ to minimise the sum of squared residuals:

$$\text{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2$$

Setting the partial derivatives of $\text{SSE}$ with respect to $b_0$ and $b_1$ equal to zero and solving gives the closed-form formulas.
Slope and Intercept Formulas
The slope is:

$$b_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$

An equivalent formulation that many find more intuitive uses deviations from the means:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

The intercept is:

$$b_0 = \bar{y} - b_1 \bar{x}$$

This guarantees that the regression line always passes through the point $(\bar{x}, \bar{y})$.
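These formulas translate directly into code. Below is a minimal Python sketch using the deviation-form slope formula; the data is made up purely for illustration:

```python
# Minimal OLS fit for simple linear regression, implemented directly
# from the deviation-form slope formula and the intercept formula.

def fit_line(xs, ys):
    """Return (b0, b1) minimising the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of deviation products over sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: forces the line through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Perfectly linear toy data recovers intercept 1 and slope 2 exactly.
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Because the data here lies exactly on a line, the residuals are all zero and the fit is exact; real data will scatter around the fitted line.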
R-Squared: Measuring Goodness of Fit
The coefficient of determination, $R^2$, tells you what proportion of the total variability in $y$ is explained by the linear relationship with $x$:

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \qquad \text{SST} = \sum (y_i - \bar{y})^2$$

- $R^2 = 1$ means the model explains all variability (a perfect fit).
- $R^2 = 0$ means the model explains none of the variability (the line is no better than simply using the mean of $y$).
- In practice, $R^2$ falls somewhere in between. An $R^2$ of 0.85 means the model accounts for 85% of the variability in $y$.

For simple linear regression, $R^2$ equals the square of the Pearson correlation coefficient $r$.
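This equivalence is easy to check numerically: computing $1 - \text{SSE}/\text{SST}$ and squaring the Pearson $r$ give the same number for a simple linear fit. A short sketch with made-up data:

```python
import math

# Toy data (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)          # SST
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Fit the line, then measure the proportion of variance explained.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy                        # 1 - SSE/SST

# The squared Pearson correlation gives the same value.
r = sxy / math.sqrt(sxx * syy)
```

For this (nearly linear) toy data both computations agree to machine precision.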
Understanding Residuals
Residuals are the vertical distances between each data point and the regression line. Analysing residuals is the primary way to check whether a linear model is appropriate:
- Random scatter: If the residuals show no pattern when plotted against $x$ or the fitted values, the linear model is a reasonable fit.
- Curved pattern: If the residuals form a U-shape or other systematic pattern, a linear model may not be appropriate and a polynomial or transformation might be needed.
- Funnel shape: If the spread of residuals increases or decreases with $x$, this indicates heteroscedasticity (non-constant variance).
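The curved-pattern case can be seen numerically as well as visually. In this sketch, deliberately quadratic toy data is fitted with a straight line, and the residuals come out positive at both ends and negative in the middle:

```python
# Fit a line to deliberately nonlinear data and inspect the residuals.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]          # y = x^2: a curve, not a line

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
# residuals == [2, -1, -2, -1, 2]: a clear U-shape. They sum to zero
# (as OLS residuals always do when the model includes an intercept),
# so it is the pattern, not the average, that reveals the lack of fit.
```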
Assumptions of Linear Regression
For OLS estimates to be valid and for inference (hypothesis tests, confidence intervals on coefficients) to be reliable, four key assumptions must hold:
- Linearity. The relationship between $x$ and $y$ is linear. Check with a scatter plot and residual plot.
- Independence. The residuals are independent of each other. This is violated when data are collected over time and exhibit autocorrelation.
- Normality. The residuals are approximately normally distributed. This matters most for small samples; with large samples, the Central Limit Theorem helps. Check with a histogram or Q-Q plot of residuals.
- Homoscedasticity. The variance of the residuals is constant across all values of $x$. If the spread fans out (or narrows), standard errors and p-values may be unreliable.
These are often remembered by the acronym LINE (Linearity, Independence, Normality, Equal variance).
Worked Example: Computing $b_0$ and $b_1$ by Hand with 6 Data Points
Suppose you have the following data on hours studied ($x$) and exam score ($y$):

| $x$ (hours) | $y$ (score) |
|---|---|
| 1 | 52 |
| 2 | 58 |
| 3 | 65 |
| 4 | 68 |
| 5 | 74 |
| 6 | 81 |
Step 1. Compute the needed sums.

$$n = 6, \quad \sum x_i = 21, \quad \sum y_i = 398, \quad \sum x_i y_i = 1491, \quad \sum x_i^2 = 91$$

Step 2. Compute the means.

$$\bar{x} = \frac{21}{6} = 3.5, \qquad \bar{y} = \frac{398}{6} \approx 66.33$$

Step 3. Compute the slope.

$$b_1 = \frac{6(1491) - (21)(398)}{6(91) - 21^2} = \frac{8946 - 8358}{546 - 441} = \frac{588}{105} = 5.6$$

Step 4. Compute the intercept.

$$b_0 = \bar{y} - b_1 \bar{x} = 66.33 - 5.6(3.5) \approx 46.73$$

Step 5. Write the regression equation.

$$\hat{y} = 46.73 + 5.6x$$
Interpretation: for each additional hour of study, the predicted exam score increases by 5.6 points. A student who studies 0 hours is predicted to score about 46.7 (though extrapolating to 0 hours may not be meaningful; see below).
Step 6. Compute $R^2$.

First, the predicted values and residuals:

| $x$ | $y$ | $\hat{y}$ | $e = y - \hat{y}$ | $e^2$ |
|---|---|---|---|---|
| 1 | 52 | 52.33 | -0.33 | 0.11 |
| 2 | 58 | 57.93 | 0.07 | 0.00 |
| 3 | 65 | 63.53 | 1.47 | 2.16 |
| 4 | 68 | 69.13 | -1.13 | 1.28 |
| 5 | 74 | 74.73 | -0.73 | 0.53 |
| 6 | 81 | 80.33 | 0.67 | 0.45 |

Summing the last column gives $\text{SSE} \approx 4.53$. For $\text{SST}$, compute the deviations from the mean:

$$\text{SST} = \sum (y_i - \bar{y})^2 \approx 553.33$$

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = 1 - \frac{4.53}{553.33} \approx 0.992$$
An of 0.992 means that 99.2% of the variability in exam scores is explained by hours studied. This is an exceptionally strong linear relationship.
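The hand computation above is easy to double-check in a few lines of Python:

```python
# Re-check the worked example: hours studied vs exam score.
xs = [1, 2, 3, 4, 5, 6]
ys = [52, 58, 65, 68, 74, 81]
n = len(xs)

sx, sy = sum(xs), sum(ys)                        # 21, 398
sxy = sum(x * y for x, y in zip(xs, ys))         # 1491
sxx = sum(x * x for x in xs)                     # 91

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope: 5.6
b0 = sy / n - b1 * sx / n                        # intercept: ~46.73

y_bar = sy / n
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - y_bar) ** 2 for y in ys)
r2 = 1 - sse / sst                               # ~0.992
```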
Try it yourself
Use our Correlation Calculator to find $r$ and $R^2$ for your own data sets instantly.
Prediction vs Extrapolation
Once you have a regression equation, you can use it to predict $\hat{y}$ for a given value of $x$. However, there is an important distinction:
- Interpolation is prediction within the range of your data. If your data covers $x = 1$ to $x = 6$, predicting at $x = 3.5$ is interpolation. This is generally reliable.
- Extrapolation is prediction outside the range of your data. Predicting at $x = 0$ or $x = 20$ is extrapolation. This is risky because the linear relationship may not hold beyond the observed range. For instance, plugging 20 hours of study into $\hat{y} = 46.73 + 5.6x$ predicts a score of about 158.7 (which exceeds the maximum possible score).
Always be cautious when extrapolating. State the range of your data and note that predictions outside it carry additional uncertainty.
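To make the distinction concrete, here is the study-time model from the worked example used for one interpolation and one extrapolation (the cap of 100 for a maximum score is an assumption for illustration):

```python
# Prediction with the fitted model y-hat = 46.73 + 5.6x.
def predict(hours):
    return 46.73 + 5.6 * hours

inside = predict(3.5)     # interpolation: 3.5 lies within the 1-6 range
outside = predict(20.0)   # extrapolation: far outside the data
# inside is ~66.3, a plausible score; outside is ~158.7, impossible
# if the exam is scored out of 100 -- the linear trend cannot continue.
```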
Worked Example 2: Interpreting a Weak Relationship
Consider 5 data points measuring daily temperature ($x$, in degrees Celsius) and number of ice cream cones sold ($y$):

| $x$ (°C) | $y$ (cones sold) |
|---|---|
| 20 | 100 |
| 22 | 130 |
| 25 | 115 |
| 28 | 150 |
| 30 | 160 |
Computing the sums: $n = 5$, $\sum x_i = 125$, $\sum y_i = 655$, $\sum x_i y_i = 16735$, $\sum x_i^2 = 3193$, $\sum y_i^2 = 88225$. The slope is:

$$b_1 = \frac{5(16735) - (125)(655)}{5(3193) - 125^2} = \frac{1800}{340} \approx 5.29$$

For every 1 degree Celsius increase in temperature, about 5.3 more ice cream cones are sold. Computing $R^2$ for this data yields approximately 0.79, meaning temperature explains about 79% of the variation in sales. While strong, the remaining 21% could be due to day of the week, promotions, or other factors.
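These figures can be verified with the same formulas used in the first worked example:

```python
# Re-check the temperature / ice-cream example.
xs = [20, 22, 25, 28, 30]
ys = [100, 130, 115, 150, 160]
n = len(xs)

sx, sy = sum(xs), sum(ys)                  # 125, 655
sxy = sum(x * y for x, y in zip(xs, ys))   # 16735
sxx = sum(x * x for x in xs)               # 3193
syy = sum(y * y for y in ys)               # 88225

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # 1800/340 ~= 5.29
r2 = (n * sxy - sx * sy) ** 2 / ((n * sxx - sx ** 2) * (n * syy - sy ** 2))
# r2 ~= 0.79: temperature explains roughly 79% of the variation in sales.
```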
Worked Example 3: Using the Slope to Compare Groups
A teacher collects data from two classes on hours of homework and test scores. Class A has a regression slope of 4.2 and Class B has a slope of 6.8. Both have similar $R^2$ values around 0.85.
The steeper slope for Class B means each additional hour of homework is associated with a larger gain in test scores (6.8 points vs 4.2 points). This does not necessarily mean homework causes the improvement (correlation is not causation). Note also that a steeper slope is not the same as a better fit: with similar $R^2$ values, both relationships are about equally tight, but the payoff per hour appears larger in Class B. Possible explanations include differences in homework difficulty, teaching style, or student background.
Try it yourself
Compute the slope and intercept for your own data with our Slope Calculator, or explore how the correlation coefficient $r$ relates to $R^2$ with our Correlation Calculator.
When Linear Regression Is Not Appropriate
Linear regression assumes a linear relationship, but the real world is not always linear. Watch for these warning signs:
- Curved scatter plot. If the data clearly follows a curve, a polynomial regression, logarithmic model, or exponential model may fit better.
- Outliers. A single extreme point can dramatically shift the regression line. Always inspect your data visually.
- Categorical predictors. If $x$ is categorical (e.g. treatment A vs treatment B), you need dummy variables or a different approach.
- Non-constant variance. If the residuals fan out, weighted least squares or a transformation (like taking logarithms) may be needed.
Correlation Does Not Imply Causation
A strong linear relationship ($|r|$ close to 1) tells you that two variables move together. It does not tell you that one causes the other. Ice cream sales and drowning deaths are both correlated with temperature, but neither causes the other. Always consider confounding variables and whether the study design (e.g. a randomised experiment) supports a causal claim.
Frequently Asked Questions
What is the difference between $r$ and $R^2$?

The Pearson correlation coefficient $r$ measures the strength and direction of a linear relationship. It ranges from $-1$ to $+1$. $R^2$ is the square of $r$ (in simple linear regression) and represents the proportion of variance explained. It ranges from 0 to 1 and has no sign, so you cannot tell the direction from $R^2$ alone.
How many data points do I need for linear regression?
Technically, you need at least 2 points to fit a line, but 2 points will always give a perfect fit with no residual information. In practice, you should have considerably more data points than parameters. A common guideline for simple linear regression is at least 20 to 30 observations, though even 10 can be informative if the relationship is strong and the assumptions are met.
Can I use linear regression with more than one independent variable?
Yes. Multiple linear regression extends the model to $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$. The formulas are expressed in matrix notation, but the idea is the same: minimise the sum of squared residuals. $R^2$ in this context is the proportion of variance explained by all predictors together, and adjusted $R^2$ penalises for adding unnecessary variables.
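A minimal sketch of the matrix route, using NumPy's least-squares solver on synthetic data constructed so the true coefficients are known:

```python
import numpy as np

# Two predictors; the response is built as y = 1 + 2*x1 + 3*x2 with no
# noise, so least squares should recover the coefficients exactly.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1 + 2 * x1 + 3 * x2

# Design matrix: a column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef is approximately [1.0, 2.0, 3.0]: intercept, b1, b2.
```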
What does a negative slope mean?
A negative slope means that as $x$ increases, $y$ tends to decrease. For example, if you regress fuel efficiency on vehicle weight, you would expect a negative slope because heavier vehicles generally use more fuel (lower efficiency).
How do I know if my regression model is statistically significant?
Run a hypothesis test on the slope. The null hypothesis is $H_0: \beta_1 = 0$ (no linear relationship). The test statistic is $t = b_1 / \text{SE}(b_1)$, which follows a t-distribution with $n - 2$ degrees of freedom. If the p-value is below your significance level (commonly 0.05), you reject the null and conclude there is a statistically significant linear relationship.
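Applied to the study-hours data from the first worked example, this test gives a t statistic of roughly 22. The sketch below uses the standard error formula $\text{SE}(b_1) = s/\sqrt{\sum(x_i - \bar{x})^2}$ with $s = \sqrt{\text{SSE}/(n-2)}$; the final p-value lookup, which would normally use a t table or scipy.stats, is omitted to keep the example dependency-free:

```python
import math

# t test for the slope on the study-hours data (n - 2 = 4 df).
xs = [1, 2, 3, 4, 5, 6]
ys = [52, 58, 65, 68, 74, 81]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))        # residual standard error
se_b1 = s / math.sqrt(sxx)          # standard error of the slope
t = b1 / se_b1
# t is ~22, far beyond the 5% two-sided critical value of ~2.78 for
# 4 degrees of freedom, so the slope is highly significant.
```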
Try it yourself
Explore linear relationships in your data with our Correlation Calculator and Slope Calculator.
Related Articles
Understanding Correlation and Regression: A Practical Guide
Learn the Pearson correlation coefficient, simple linear regression, the least squares method, and R-squared. Full numerical examples and common pitfalls explained.
How to Solve Systems of Equations: 3 Methods with Examples
Master three methods for solving systems of linear equations: substitution, elimination, and matrices. Step-by-step worked examples for each approach.
How to Find the Slope of a Line: Formula, Methods, and Examples
Learn how to find the slope of a line using the slope formula, rise over run, and slope-intercept form. Covers parallel and perpendicular slopes with worked examples.