Influential Points in Regression
Sometimes in regression analysis, a few data points have disproportionate effects on the slope of the regression equation. In this lesson, we describe how to identify those influential points.
Data points that diverge markedly from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier.
- It could have an extreme X value compared to other data points.
- It could have an extreme Y value compared to other data points.
- It could have extreme X and Y values.
- It might be distant from the rest of the data, even without extreme X or Y values.
Each type of outlier is depicted graphically in the scatterplots below.
[Figure: four scatterplots, titled "Extreme X value", "Extreme Y value", "Extreme X and Y", and "Distant data point"]
An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.
This type of analysis is illustrated below. The scatterplots are identical, except that one plot includes an outlier. When the outlier is present, the slope is flatter (-3.32 with the outlier vs. -4.10 without it); so this outlier would be considered an influential point.
Without the outlier
Regression equation: ŷ = 104.78 - 4.10x
Coefficient of determination: R² = 0.94
With the outlier
Regression equation: ŷ = 97.51 - 3.32x
Coefficient of determination: R² = 0.55
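The with-and-without comparison described above can be sketched in a few lines of Python. This is only an illustration: the data values, the suspected outlier, and the helper name fit_line are all invented here and are not the data behind the lesson's charts. It assumes NumPy is available and uses np.polyfit to compute the least-squares line.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit; highest power first
    return intercept, slope

# Hypothetical data, invented for illustration (not the lesson's data set).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([98.0, 95.0, 93.0, 88.0, 85.0, 81.0, 77.0, 74.0])

# A suspected outlier with an extreme X value (also hypothetical).
x_out, y_out = 20.0, 70.0

# Fit the line twice: once without the outlier, once with it appended.
b0_without, b1_without = fit_line(x, y)
b0_with, b1_with = fit_line(np.append(x, x_out), np.append(y, y_out))

print(f"without outlier: y-hat = {b0_without:.2f} + ({b1_without:.2f})x")
print(f"with outlier:    y-hat = {b0_with:.2f} + ({b1_with:.2f})x")
print(f"change in slope: {b1_with - b1_without:.2f}")
```

If the two slopes differ by an amount that matters for the problem at hand, the point is a candidate influential point. How large a change counts as "great" is a judgment call that the lesson leaves to the analyst.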
The charts below compare regression statistics for another data set with and without an outlier. Here, one chart has a single outlier, located at the high end of the X axis (where x = 24). As a result of that single outlier, the slope of the regression line changes greatly, from -2.5 to -1.6; so the outlier would be considered an influential point.
Without the outlier
Regression equation: ŷ = 92.54 - 2.5x
Slope: b1 = -2.5
Coefficient of determination: R² = 0.46
With the outlier
Regression equation: ŷ = 87.59 - 1.6x
Slope: b1 = -1.6
Coefficient of determination: R² = 0.52
Sometimes, an influential point will cause the coefficient of determination to be bigger; sometimes, smaller. In the first example above, the coefficient of determination is smaller when the influential point is present (0.55 with the point vs. 0.94 without it). In the second example, it is bigger (0.52 with the point vs. 0.46 without it).
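To see both directions in a small experiment, the sketch below computes the coefficient of determination as 1 - SS_res / SS_tot for two made-up data sets, each with and without one added point. The data, the added points, and the helper name r_squared are assumptions chosen for illustration; they are not the lesson's examples.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for a simple least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Case A (hypothetical): a weak pattern, plus a far-off point that extends the trend.
y_weak = np.array([2.0, 1.0, 3.0, 2.0, 3.0])
r2_a_without = r_squared(x, y_weak)
r2_a_with = r_squared(np.append(x, 20.0), np.append(y_weak, 12.0))

# Case B (hypothetical): a perfectly linear pattern, plus a point that breaks the trend.
y_line = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
r2_b_without = r_squared(x, y_line)
r2_b_with = r_squared(np.append(x, 6.0), np.append(y_line, 2.0))

print(f"Case A: R^2 = {r2_a_without:.2f} without, {r2_a_with:.2f} with the added point")
print(f"Case B: R^2 = {r2_b_without:.2f} without, {r2_b_with:.2f} with the added point")
```

In Case A the added point raises the coefficient of determination; in Case B it lowers it. In both cases the slope also changes, so a higher coefficient of determination is not evidence that the added point is harmless.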
If your data set includes an influential point, here are some things to consider.
- An influential point may represent bad data, possibly the result of measurement error. If possible, check the validity of the data point.
- Compare the decisions that would be made based on regression equations defined with and without the influential point. If the equations lead to contrary decisions, use caution.
Test Your Understanding
In the context of regression analysis, which of the following statements are true?
I. When the data set includes an influential point, the data set is nonlinear.
II. Influential points always reduce the coefficient of determination.
III. All outliers are influential data points.
(A) I only
(B) II only
(C) III only
(D) All of the above
(E) None of the above
The correct answer is (E). Data sets with influential points can be linear or nonlinear. Influential points do not always reduce the coefficient of determination. In this lesson, we went over an example in which an influential point increased the coefficient of determination. With respect to regression, outliers are influential only if they have a big effect on the regression equation. Sometimes, outliers do not have big effects. For example, when the data set is very large, a single outlier may not have a big effect on the regression equation.
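The last remark, that a single outlier matters less in a very large data set, can be checked with a rough sketch. Everything here is an assumption made for illustration: the points lie exactly on the line y = 3 + 2x, the outlier is placed at (10, 0), and the helper name slope_change is invented.

```python
import numpy as np

def slope_change(n):
    """Change in the fitted slope when one outlier is added to n on-line points."""
    x = np.linspace(0.0, 10.0, n)
    y = 3.0 + 2.0 * x                       # hypothetical data, exactly linear
    slope_without = np.polyfit(x, y, 1)[0]
    # Add one outlier at (10, 0), far below the line's value of 23 at x = 10.
    slope_with = np.polyfit(np.append(x, 10.0), np.append(y, 0.0), 1)[0]
    return slope_with - slope_without

change_small = slope_change(10)        # small data set
change_large = slope_change(10_000)    # very large data set

print(f"n = 10:     slope changes by {change_small:.4f}")
print(f"n = 10000:  slope changes by {change_large:.4f}")
```

The same outlier shifts the slope noticeably when there are only 10 other points and hardly at all when there are 10,000, which is why statement III is false.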