
Introduction to Multiple Regression

Simple linear regression is a technique for predicting the value of a dependent variable, based on the value of a single independent variable. Sometimes, you only need one relevant independent variable to make an accurate prediction.

Often, however, the prediction is better when you use two or more independent variables. Multiple regression is a technique for predicting the value of a dependent variable, based on the values of two or more independent variables.

The Regression Equation

This is a tutorial about linear regression, so our focus is on linear relationships between variables. The regression equation that expresses the linear relationship between a single dependent variable and one or more independent variables is:

ŷ = b0 + b1x1 + b2x2 + … + bkxk

In this equation, ŷ is the predicted value of the dependent variable. Values of the k independent variables are denoted by x1, x2, x3, … , xk.

And finally, we have the b's - b0, b1, b2, … , bk. The b's are constants, called regression coefficients. Values are assigned to the b's based on the principle of least squares.
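
To make this concrete, here is a minimal Python sketch that evaluates the regression equation for one prediction. The coefficient and predictor values are hypothetical, chosen only for illustration; in practice, the b's come from fitting the model to data.

# Evaluate y-hat = b0 + b1*x1 + ... + bk*xk.
# All numeric values below are made up for illustration.

def predict(b, x):
    """Return y-hat given coefficients b = [b0, b1, ..., bk]
    and predictor values x = [x1, ..., xk]."""
    y_hat = b[0]                     # start with the intercept, b0
    for bj, xj in zip(b[1:], x):     # add each bj * xj term
        y_hat += bj * xj
    return y_hat

b = [2.0, 0.5, -1.25]   # hypothetical b0, b1, b2
x = [4.0, 3.0]          # hypothetical x1, x2
print(predict(b, x))    # 2.0 + 0.5*4.0 + (-1.25)*3.0 = 0.25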

What is the Principle of Least Squares?

In multiple regression, the deviation of the actual value for a dependent variable from its predicted value is called the residual. The residual (e) for a single observation i is:

ei = yi - ŷi = yi - ( b0 + b1x1i + b2x2i + … + bkxki )

Assume that the set of data consists of n observations. The principle of least squares requires that the sum of squared residuals for all n observations be minimized. That is, we want the following value to be as small as possible:

Σ [ yi - ( b0 + b1x1i + b2x2i + … + bkxki ) ]²

Regression analysis requires that the values of b0, b1, … , bk be defined to minimize the sum of the squared residuals. When we assign values to regression coefficients in this way, we are following the principle of least squares.
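
To see the principle in action, here is a minimal Python sketch (with made-up data, and a single independent variable to keep it short) that computes the sum of squared residuals for two candidate sets of coefficients. The coefficients that track the data more closely produce the smaller sum; least squares chooses the coefficients that make this sum as small as possible.

# Sum of squared residuals: Σ (yi - ŷi)².
# The data and candidate coefficients are made up for illustration.

def sum_squared_residuals(b0, b1, xs, ys):
    """Return Σ (yi - (b0 + b1*xi))² over all observations."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x

print(sum_squared_residuals(0.0, 2.0, xs, ys))  # near the trend: about 0.10
print(sum_squared_residuals(1.0, 0.5, xs, ys))  # poor fit: about 40.70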

Normal Equations for Simple Regression

Finding the right values for regression coefficients (i.e., values that satisfy a least squares criterion) involves solving a set of linear equations. These equations can be derived using calculus, and they are called normal equations.

To illustrate the use of normal equations, let's look at simple linear regression - regression with one dependent variable (y) and one independent variable (x). With simple linear regression, the regression equation is:

ŷ = b0 + b1x

The normal equations for simple linear regression are:

Σ yi = nb0 + b1( Σxi )

Σ xiyi = b0( Σxi ) + b1( Σxi² )

Here, we have two equations with two unknowns. The unknowns are the regression coefficients b0 and b1. Using ordinary algebra, we can solve for b0 and b1. The result is:

b1 = Σ [ (xi - x̄)(yi - ȳ) ] / Σ [ (xi - x̄)² ]

b0 = ȳ - b1x̄

where x̄ is the mean of the x scores, and ȳ is the mean of the y scores. Note that these are the same equations that we presented in a previous lesson, when we introduced the topic of simple linear regression.
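
As a quick check, here is a Python sketch (again with made-up data) that computes b1 and b0 from the formulas above and then verifies that the result satisfies both normal equations.

# Least squares coefficients for simple linear regression,
# computed from made-up data using the formulas above.

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]   # roughly y = 2x

n = len(xs)
x_bar = sum(xs) / n               # mean of the x scores
y_bar = sum(ys) / n               # mean of the y scores

# b1 = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)²
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

print(b0, b1)   # about 0.09 and 1.97 for this data

# The solution satisfies both normal equations (to rounding error):
#   Σ yi   = n*b0      + b1*(Σ xi)
#   Σ xiyi = b0*(Σ xi) + b1*(Σ xi²)
assert abs(sum(ys) - (n * b0 + b1 * sum(xs))) < 1e-9
assert abs(sum(x * y for x, y in zip(xs, ys))
           - (b0 * sum(xs) + b1 * sum(x * x for x in xs))) < 1e-9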

The use of normal equations to assign values to regression coefficients becomes more complicated when there are two or more independent variables. We'll tackle that challenge in the next lesson.

Test Your Understanding

Problem 1

Which of the following statements are true?

I. A regression equation with k independent variables has k regression coefficients.
II. Regression coefficients (b0, b1, b2, etc.) are variables in the regression equation.
III. The principle of least squares calls for minimizing the sum of the squared residuals.

(A) I only.
(B) II only.
(C) III only.
(D) All of the above.
(E) None of the above.

Solution

The correct answer is (C). The principle of least squares defines regression coefficients that minimize the sum of the squared residuals. A regression equation with k independent variables has k + 1 regression coefficients. For example, if there were two independent variables, there would be three regression coefficients - b0, b1, and b2. And finally, regression coefficients are constants - not variables.