Linear Regression
Discussion
the concepts
Keywords: linear relationship, linearly related, linear regression, line of best fit, best fit line, least squares fit, coefficient of determination, coefficient of correlation, … ?
When two quantities are directly proportional or directly related…
y ∝ x
…their ratio is a constant.
y / x = a constant
When two quantities are linearly related, they are not quite directly proportional. It's not their values that are proportional, but the changes in their values.
∆y ∝ ∆x
The ratio of these quantities is a constant that should be familiar to you. On a graph of a straight line this ratio is known as the slope.
∆y / ∆x = slope
The symbol for this ratio is the letter m, probably because m is the first letter in the word slope.
∆y / ∆x = m
A graph of a direct relationship is a straight line that runs through the origin. When x is zero, y is zero. A graph of a linear relationship is a straight line that may or may not run through the origin. When x is zero, y could be zero or it could be something else. A linear relationship is one that is partly direct and partly constant. That constant is known as the y intercept, which is indicated using the brilliantly chosen letter b. Altogether as an equation…
y = mx + b
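Here's a quick contrast of the two ideas as a Python sketch (not part of the original discussion; the function names and the numbers 2.5 and 4.0 are invented purely for illustration).

```python
# Direct vs. linear: a direct relationship passes through the origin,
# a linear relationship may not.

def direct(x, constant=2.5):
    """y is directly proportional to x, so y/x is always the same constant."""
    return constant * x

def linear(x, m=2.5, b=4.0):
    """y is linearly related to x, so the ratio of changes ∆y/∆x is constant."""
    return m * x + b

for x in (0, 1, 2, 3):
    print(x, direct(x), linear(x))   # at x = 0 the linear value is b, not 0
```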
the mathematics
y = mx + b
Given a pile of n points of two-dimensional data…
x1, x2, x3, … xn
y1, y2, y3, … yn
Find an equation for the line of best fit.
y = mx + b
We are shooting for a minimal amount of error in the residuals, the distances from the data points to the line measured in the y direction. If we minimize the sum of the squares of the residuals (identified here with the symbol R²), the method is called a least squares fit. It is probably the most common way to compute a best fit line.
R² = ∑ (∆yi)²

R² = ∑ [(mxi + b) − yi]²

where the sums run from i = 1 to n.
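Before minimizing anything, it may help to see R² as something you can actually compute. Here's a sketch in Python; the data points and the two trial lines are invented just to have numbers to try.

```python
# A sketch of the sum of squared residuals for a candidate line y = m·x + b.

def sum_squared_residuals(m, b, xs, ys):
    """R² = ∑[(m·xi + b) − yi]², the error measured in the y direction."""
    return sum(((m * xi + b) - yi) ** 2 for xi, yi in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]          # hypothetical data
ys = [2.2, 3.9, 6.1, 8.0]
print(sum_squared_residuals(2.0, 0.0, xs, ys))  # one trial line
print(sum_squared_residuals(1.9, 0.2, xs, ys))  # a slightly different trial line
```

The least squares fit is the (m, b) pair that makes this number as small as possible.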
This expression has its minimum where the partial derivatives with respect to m and b are both zero. (The limits on the summations will be omitted out of laziness from now on.)
∂R²/∂m = 2 ∑ {[(mxi + b) − yi] xi} = 0

∂R²/∂b = 2 ∑ [(mxi + b) − yi] = 0
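If you'd rather have a computer do the calculus, here's one way it could go with SymPy (a third-party symbolic math library that I'm assuming is installed). The tiny data set is made up; the point is just that setting both partial derivatives to zero and solving gives the best fit slope and intercept.

```python
# A sketch: minimize R² symbolically by setting both partial derivatives to zero.
import sympy as sp

m, b = sp.symbols('m b')
xs = [1, 2, 3]                      # hypothetical data
ys = [2.1, 3.9, 6.2]

R2 = sum(((m * x + b) - y) ** 2 for x, y in zip(xs, ys))   # sum of squared residuals
solution = sp.solve([sp.diff(R2, m), sp.diff(R2, b)], [m, b])
print(solution)                     # the (m, b) pair that minimizes R²
```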
After a bit of algebra, you get these equations…
m = [n ∑(xiyi) − ∑xi ∑yi] / [n ∑(xi²) − (∑xi)²]

b = [∑(xi²) ∑yi − ∑xi ∑(xiyi)] / [n ∑(xi²) − (∑xi)²]
Where…
n = number of data points
∑xi = sum of the x values
∑yi = sum of the y values
∑(xi²) = sum of the x² values
∑(xiyi) = sum of the xy products
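In code, those formulas might look something like this. It's just a sketch; the function and variable names are my own choices, not anything standard.

```python
# A sketch of the closed-form least squares formulas above.

def least_squares_fit(xs, ys):
    """Return the slope m and intercept b of the line of best fit."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator
    return m, b
```

Call it with two equal-length lists of numbers and it hands back (m, b).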
m and b sample calculation
Here's a small data set. Determine its line of best fit the hard way. Pretend you don't have a calculator or spreadsheet app that can figure this out in one step. Pretend you had to actually use the equations given in this section.
x | y |
---|---|
10 | 08.04 |
08 | 06.95 |
13 | 07.58 |
09 | 08.81 |
11 | 08.33 |
14 | 09.96 |
06 | 07.24 |
04 | 04.26 |
12 | 10.84 |
07 | 04.82 |
05 | 05.68 |
Add columns for x² and xy and compute all those values.
x | y | x² | xy |
---|---|---|---|
10 | 08.04 | 100 | 080.40 |
08 | 06.95 | 064 | 055.60 |
13 | 07.58 | 169 | 098.54 |
09 | 08.81 | 081 | 079.29 |
11 | 08.33 | 121 | 091.63 |
14 | 09.96 | 196 | 139.44 |
06 | 07.24 | 036 | 043.44 |
04 | 04.26 | 016 | 017.04 |
12 | 10.84 | 144 | 130.08 |
07 | 04.82 | 049 | 033.74 |
05 | 05.68 | 025 | 028.40 |
Total up each column.
x | y | x² | xy |
---|---|---|---|
10 | 08.04 | 100 | 080.40 |
08 | 06.95 | 064 | 055.60 |
13 | 07.58 | 169 | 098.54 |
09 | 08.81 | 081 | 079.29 |
11 | 08.33 | 121 | 091.63 |
14 | 09.96 | 196 | 139.44 |
06 | 07.24 | 036 | 043.44 |
04 | 04.26 | 016 | 017.04 |
12 | 10.84 | 144 | 130.08 |
07 | 04.82 | 049 | 033.74 |
05 | 05.68 | 025 | 028.40 |
∑x | ∑y | ∑x² | ∑xy |
99 | 82.51 | 1001 | 797.60 |
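If you'd rather trust a computer with the bookkeeping, the totals can be reproduced in a few lines of Python. The lists below are just the table transcribed.

```python
# Reproduce the column totals from the raw data.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# rounding hides floating point noise in the decimal totals
print(n, sum_x, round(sum_y, 2), sum_x2, round(sum_xy, 2))   # 11 99 82.51 1001 797.6
```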
Substitute numbers into equations and be done: first the slope m…
m = [n ∑xy − ∑x ∑y] / [n ∑x² − (∑x)²]
m = [(11)(797.60) − (99)(82.51)] / [(11)(1001) − (99)²]
m = (8773.60 − 8168.49) / (11011 − 9801)
m = 605.11 / 1210
m ≈ 0.500
and then the y intercept b.
b = [∑x² ∑y − ∑x ∑xy] / [n ∑x² − (∑x)²]
b = [(1001)(82.51) − (99)(797.60)] / [(11)(1001) − (99)²]
b = (82592.51 − 78962.40) / (11011 − 9801)
b = 3630.11 / 1210
b ≈ 3.000

So the line of best fit is approximately y = 0.500x + 3.000.
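The same plug-and-chug step, done in Python with the totals from the table (a sketch, nothing more).

```python
# Plug the column totals into the slope and intercept formulas.
n, sum_x, sum_y, sum_x2, sum_xy = 11, 99, 82.51, 1001, 797.60

denominator = n * sum_x2 - sum_x ** 2                    # 1210
m = (n * sum_xy - sum_x * sum_y) / denominator           # ≈ 0.500
b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator      # ≈ 3.000
print(round(m, 3), round(b, 3))                          # 0.5 3.0
```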
the coefficients of determination and correlation
How good is the line of best fit? Are some bests better than others? Here's one way to decide. Swap the explanatory and response variables.
x = m′y + b′
The slope of this new linear equation is just the old one with all the x's replaced by y's and vice versa. (Note that, because multiplication is commutative, the numerator hasn't really changed.)
m′ = [n ∑(yixi) − ∑yi ∑xi] / [n ∑(yi²) − (∑yi)²]
Now, multiply this new slope by the old slope. Don't ask why, just do it.
m m′ = {[n ∑(xiyi) − ∑xi ∑yi] / [n ∑(xi²) − (∑xi)²]} × {[n ∑(yixi) − ∑yi ∑xi] / [n ∑(yi²) − (∑yi)²]}
This product is known as the coefficient of determination
r² = [n ∑(xiyi) − ∑xi ∑yi]² / {[n ∑(xi²) − (∑xi)²] [n ∑(yi²) − (∑yi)²]}
and its square root is called the coefficient of correlation.
r = [n ∑(xiyi) − ∑xi ∑yi] / {√[n ∑(xi²) − (∑xi)²] √[n ∑(yi²) − (∑yi)²]}
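Here's the same correlation formula as one possible Python function, again just a sketch with names chosen by me.

```python
# A sketch of the coefficient of correlation formula above.
from math import sqrt

def correlation_coefficient(xs, ys):
    """Return r for the paired data in xs and ys."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator
```

Squaring the result gives the coefficient of determination r².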
r² and r sample calculation
Continue using the sample data set. Add a column for y² and determine its sum.
x | y | x² | xy | y² |
---|---|---|---|---|
10 | 08.04 | 100 | 080.40 | 064.6416 |
08 | 06.95 | 064 | 055.60 | 048.3025 |
13 | 07.58 | 169 | 098.54 | 057.4564 |
09 | 08.81 | 081 | 079.29 | 077.6161 |
11 | 08.33 | 121 | 091.63 | 069.3889 |
14 | 09.96 | 196 | 139.44 | 099.2016 |
06 | 07.24 | 036 | 043.44 | 052.4176 |
04 | 04.26 | 016 | 017.04 | 018.1476 |
12 | 10.84 | 144 | 130.08 | 117.5056 |
07 | 04.82 | 049 | 033.74 | 023.2324 |
05 | 05.68 | 025 | 028.40 | 032.2624 |
∑x | ∑y | ∑x² | ∑xy | ∑y² |
99 | 82.51 | 1001 | 797.60 | 660.1727 |
Substitute and calculate to get r², the coefficient of determination.
r² = [n ∑xy − ∑x ∑y]² / {[n ∑x² − (∑x)²][n ∑y² − (∑y)²]}
r² = [(11)(797.60) − (99)(82.51)]² / {[(11)(1001) − (99)²][(11)(660.1727) − (82.51)²]}
r² = (605.11)² / [(1210)(453.9996)]
r² = 366158.11 / 549339.52
r² ≈ 0.667
Take the root of this to get r, the coefficient of correlation. Use the positive root since the line of best fit has a positive slope.
r = +√r²
r = +√(0.667)
r = +0.816
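One last cross-check, using nothing but Python's standard library (this assumes Python 3.10 or later, where statistics.linear_regression and statistics.correlation were added).

```python
# Cross-check the hand calculation in one step.
from statistics import correlation, linear_regression

xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

slope, intercept = linear_regression(xs, ys)
r = correlation(xs, ys)
print(round(slope, 3), round(intercept, 3), round(r, 3))   # 0.5 3.0 0.816
```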