Regression
Scatterplots
A scatterplot displays the relationship between two quantitative variables measured on the same individuals. The explanatory variable ($x$) is plotted on the horizontal axis and the response variable ($y$) on the vertical axis.
Interpreting Scatterplots
When describing a scatterplot, address:
- Direction: Positive association (both increase), negative association (one increases while the other decreases), or no association
- Form: Linear, curved, or no pattern
- Strength: How closely the points follow a pattern (strong, moderate, weak)
- Outliers: Points that fall far from the overall pattern
Explanatory vs Response Variables
- Explanatory variable (independent variable, predictor): The variable that may explain or predict changes in the response variable
- Response variable (dependent variable): The variable that is being predicted or explained
Not all relationships imply causation, even when the explanatory variable precedes the response variable in time.
Correlation
The correlation coefficient ($r$) measures the strength and direction of the linear relationship between two quantitative variables.
Properties of $r$
- $r = 1$: Perfect positive linear relationship
- $r = -1$: Perfect negative linear relationship
- $r = 0$: No linear relationship (but there may be a nonlinear relationship)
- $r$ is not affected by changes in the centre (adding a constant) or scale (multiplying by a positive constant) of either variable
- $r$ is not resistant to outliers
- $r$ only measures linear association; it does not capture curved relationships
- The correlation does not depend on which variable is called $x$ and which is called $y$
- Units of measurement do not affect $r$
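Calculating $r$
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$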
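As a rough illustration, the correlation can be computed as the sum of products of standardized values divided by $n - 1$. The sketch below uses only the Python standard library; the function name `correlation` and the `hours`/`score` data are hypothetical, chosen for illustration only.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

# Hypothetical data, for illustration only
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

r = correlation(hours, score)

# Changing centre and scale (hours -> minutes, plus an offset) should leave r unchanged
r_rescaled = correlation([60 * h + 15 for h in hours], score)

# Swapping the roles of x and y should also leave r unchanged
r_swapped = correlation(score, hours)
```

Comparing `r`, `r_rescaled`, and `r_swapped` is one way to check the invariance properties listed above on a concrete data set.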
Least-Squares Regression
The least-squares regression line (LSRL) minimises the sum of the squared vertical distances (residuals) between the observed data points and the line.
Equation of the LSRL
$$\hat{y} = a + bx$$
Where:
- $b = r \dfrac{s_y}{s_x}$ (slope)
- $a = \bar{y} - b\bar{x}$ (y-intercept)
Interpreting the Slope and y-Intercept
- Slope ($b$): For each one-unit increase in $x$, $y$ is predicted to increase (or decrease) by $b$ units, on average. Always include “when $x$ increases by 1 unit” and “predicted $y$ changes by $b$.”
- y-Intercept ($a$): When $x = 0$, the predicted value of $y$ is $a$. This interpretation is only meaningful if $x = 0$ is within the range of the data.
Prediction
The LSRL gives the predicted value $\hat{y}$ for a given $x$. Predictions are most reliable for values of $x$ within the range of the original data (extrapolation beyond the data is risky).
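A minimal sketch of fitting and using the line, assuming the form $\hat{y} = a + bx$ with slope $b = r \, s_y / s_x$ and intercept $a = \bar{y} - b\bar{x}$. The helper names `fit_lsrl` and `predict` and the data are hypothetical; `predict` also reports whether the requested $x$ lies outside the observed range, so extrapolation is at least visible to the caller.

```python
from statistics import mean, stdev

def fit_lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x via b = r*s_y/s_x and a = y-bar - b*x-bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

def predict(a, b, x, x_min, x_max):
    """Return (y-hat, is_extrapolation); the flag is True when x is outside the data."""
    return a + b * x, not (x_min <= x <= x_max)

# Hypothetical data, for illustration only
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

a, b = fit_lsrl(hours, score)
y_hat_in, flag_in = predict(a, b, 3.5, min(hours), max(hours))   # inside the data range
y_hat_out, flag_out = predict(a, b, 12, min(hours), max(hours))  # extrapolation
```

For this data the fitted slope is 6 and the intercept is 46, so each additional hour is associated with a predicted increase of 6 points; the prediction at 12 hours is flagged because it falls well outside the observed 1 to 5 hours.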
Residuals
A residual is the difference between the observed value and the predicted value:
$$\text{residual} = y - \hat{y}$$
- A positive residual: the point is above the regression line (actual $>$ predicted)
- A negative residual: the point is below the line (actual $<$ predicted)
- The mean of the residuals is always $0$ (a consequence of the least-squares fit with an intercept)
- The sum of squared residuals ($\sum (y_i - \hat{y}_i)^2$) is minimised by the LSRL
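The residual properties above can be checked numerically. The following sketch (hypothetical helper `lsrl` and illustrative data) computes the residuals as observed minus predicted, then compares the sum of squared residuals of the least-squares line with that of a slightly different line.

```python
from statistics import mean

def lsrl(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def sum_sq_resid(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

a, b = lsrl(hours, score)
residuals = [y - (a + b * x) for x, y in zip(hours, score)]  # observed - predicted

sse_lsrl = sum_sq_resid(a, b, hours, score)
sse_other = sum_sq_resid(a, b + 0.5, hours, score)  # a nearby, non-least-squares line
```

Here the residuals sum to zero (up to rounding), and `sse_lsrl` is smaller than `sse_other`, as expected for any competing line.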
Residual Plots
A residual plot displays residuals ($y - \hat{y}$) on the vertical axis and the explanatory variable ($x$) or the predicted values ($\hat{y}$) on the horizontal axis.
Interpretation:
- If the regression line is a good model, the residual plot shows a random scatter of points with no pattern around the horizontal axis (residual $= 0$)
- A curved pattern in the residual plot indicates a nonlinear relationship
- A fan shape (increasing spread) indicates non-constant variance (heteroscedasticity)
- Outliers in the residual plot are points with large residuals
Coefficient of Determination ($r^2$)
$r^2$ represents the proportion of the variation in $y$ that is accounted for by the linear relationship with $x$.
- $r^2 = 0$: The LSRL explains none of the variation in $y$
- $r^2 = 1$: The LSRL explains all of the variation in $y$
- $r^2 = 0.85$: 85% of the variation in $y$ is explained by the linear regression on $x$
Interpretation
“Approximately 85% of the variation in [response variable] can be accounted for by the linear relationship with [explanatory variable].”
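One way to see why $r^2$ is read as "proportion of variation accounted for" is to compute it as $1 - \text{SSE}/\text{SST}$, where SSE is the sum of squared residuals and SST is the total squared variation of $y$ about its mean, and compare it with the square of the correlation. This is a sketch with a hypothetical function name and illustrative data.

```python
from math import sqrt
from statistics import mean

def r_squared_and_r(xs, ys):
    """Return (1 - SSE/SST, r) so the two can be compared."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)         # total variation in y
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    r = s_xy / sqrt(s_xx * sst)
    return 1 - sse / sst, r

# Hypothetical data, for illustration only
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

r2, r = r_squared_and_r(hours, score)
```

For this data `r2` is about 0.96 and matches `r ** 2`, so roughly 96% of the variation in score is accounted for by the linear relationship with hours.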
Outliers and Influential Points
Outliers
An outlier is a point with a large residual: it lies far from the regression line in the $y$-direction.
Influential Points
An influential point is one that, if removed, would substantially change the slope and/or y-intercept of the regression line. Influential points typically have extreme $x$-values (leverage points), even if their residual is not large.
An influential point may or may not be an outlier. Always check the effect of removing a point on the regression equation.
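The effect of removing a point can be checked directly by refitting. In this sketch (hypothetical data and helper name), one point has an extreme $x$-value; the line is fitted with and without it and the slopes are compared.

```python
from statistics import mean

def lsrl(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data: five points on a clear trend plus one high-leverage point
xs = [1, 2, 3, 4, 5, 15]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 8.0]

a_with, b_with = lsrl(xs, ys)                 # fit using all six points
a_without, b_without = lsrl(xs[:-1], ys[:-1])  # fit with the extreme-x point removed

# Residuals from the full fit: the extreme-x point pulls the line toward itself,
# so its own residual need not be the largest
residuals = [y - (a_with + b_with * x) for x, y in zip(xs, ys)]
leverage_point_residual = residuals[-1]
largest_other_residual = max(abs(e) for e in residuals[:-1])
```

With these numbers the slope drops from about 1.99 to about 0.31 when the extreme point is included, yet that point's residual is smaller in magnitude than some of the others. This is why a residual plot alone may not reveal an influential point.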
Transformations
When the relationship between $x$ and $y$ is not linear, a transformation of one or both variables may produce a linear relationship.
Common Transformations
- Logarithmic: $\log y$ or $\ln y$ (useful for exponential growth/decay)
- Square root: $\sqrt{y}$ (useful for count data with increasing variance)
- Reciprocal: $1/y$ (useful for hyperbolic relationships)
- Power: $y^p$ for some power $p$
Strategy
- Make a scatterplot of $y$ vs $x$
- If nonlinear, try transformations of $x$, $y$, or both
- Re-examine the scatterplot and residual plot after transformation
- Choose the transformation that produces the most linear pattern with random residuals
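The strategy above can be sketched for the exponential case: fit a line to the raw data, fit a line to $\ln y$ vs $x$, and compare how well each linear model fits. The data below are hypothetical and exactly exponential, so the log-transformed fit is perfect; real data would show a less clear-cut improvement. The helper `fit_and_r2` is an illustrative name.

```python
import math
from statistics import mean

def fit_and_r2(xs, ys):
    """Return (a, b, r2) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, 1 - sse / sst

# Hypothetical data following y = 3 * 2^x, for illustration only
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]

a_raw, b_raw, r2_raw = fit_and_r2(xs, ys)                         # line on raw y
a_log, b_log, r2_log = fit_and_r2(xs, [math.log(y) for y in ys])  # line on ln(y)

# Predictions from the transformed model must be back-transformed to the y scale
y_hat_at_3 = math.exp(a_log + b_log * 3)
```

On the log scale the slope estimates $\ln 2$ and the intercept estimates $\ln 3$, so back-transforming with `math.exp` recovers the original growth curve. A residual plot for each fit would still be needed to confirm that the transformed relationship looks linear.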
Out-of-Context Extrapolation
The LSRL should only be used to make predictions for values of $x$ within the range of the observed data. Extrapolating beyond this range is unreliable because the linear pattern may not hold.
Correlation vs Causation
A strong correlation between two variables does not imply that one causes the other. The relationship may be due to:
- Confounding variables: A third variable related to both $x$ and $y$
- Common response: Both variables respond to a third variable
- Coincidence: Random chance producing a strong correlation in a particular sample
Establishing causation requires a well-designed experiment with random assignment, not just observational data showing correlation.
Inference for Regression Slope
On the AP exam, you may be asked to test whether the slope $\beta$ of the population regression line differs significantly from zero:
$$H_0: \beta = 0 \qquad H_a: \beta \neq 0$$
$$t = \frac{b}{SE_b}, \qquad df = n - 2$$
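A minimal sketch of the test statistic for $H_0: \beta = 0$, assuming the usual $t = b / SE_b$ with $n - 2$ degrees of freedom, where $SE_b = s / \sqrt{\sum (x_i - \bar{x})^2}$ and $s = \sqrt{\text{SSE}/(n-2)}$ is the standard deviation of the residuals. The function name and data are hypothetical. The p-value is not computed here; it would come from a $t$ distribution with $n - 2$ degrees of freedom (from a table, calculator, or statistics package).

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(xs, ys):
    """Return (t, df) for testing whether the population slope is zero."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points so that df = n - 2 is positive")
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))   # standard deviation of the residuals
    se_b = s / sqrt(s_xx)     # standard error of the slope; zero for a perfect fit
    return b / se_b, n - 2

# Hypothetical data, for illustration only
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

t, df = slope_t_statistic(hours, score)
```

Note that this sketch does not check the conditions for inference (linearity, independence, normality of residuals, equal variance), which should be verified before the test is interpreted.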
Common Pitfalls
- Confusing correlation with causation
- Using the regression line to extrapolate far beyond the data
- Interpreting $r^2$ as a percentage “explained” without proper context
- Forgetting to check residual plots for nonlinearity
- Confusing outliers (large residuals) with influential points (high leverage)
- Using correlation when the relationship is clearly nonlinear