The Physics
Hypertextbook
Opus in profectus

# Linear Regression

## Discussion

### the concepts

Keywords: linear relationship, linearly related, linear regression, line of best fit, best fit line, least squares fit, coefficient of determination, coefficient of correlation, … ?

When two quantities are directly proportional or directly related…

y ∝ x

…their ratio is a constant.

$$\frac{y}{x} = \text{a constant}$$

When two quantities are linearly related, they are not quite directly proportional. It's not their values that are proportional, but the changes in their values.

∆y ∝ ∆x

The ratio of these quantities is a constant that should be familiar to you. On a graph of a straight line this ratio is known as the slope.

$$\frac{\Delta y}{\Delta x} = \text{slope}$$

The symbol for this ratio is the letter m, probably because m is the first letter in the word slope.

$$\frac{\Delta y}{\Delta x} = m$$

A graph of a direct relationship is a straight line that runs through the origin. When x is zero y is zero. A graph of a linear relationship is a straight line that may or may not run through the origin. When x is zero, y could be zero or it could be something else. A linear relationship is one that is partly direct and partly constant. That constant is known as the y intercept, which is indicated using the brilliantly chosen letter b. Altogether as an equation…

y = mx + b
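The difference between a direct and a linear relationship can be checked numerically. The following is a minimal sketch using made-up values (m = 2, b = 1, and four x values chosen only for illustration): the ratio y/x changes from point to point, while the ratio ∆y/∆x stays fixed at the slope.

```python
# Hypothetical line with slope m = 2 and y intercept b = 1 (values chosen for illustration).
m, b = 2.0, 1.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [m * x + b for x in xs]

# The ratio y/x differs at every point, so y is not directly proportional to x.
ratios = [y / x for x, y in zip(xs, ys)]

# The ratio of changes, delta y over delta x, is the same between every pair of neighbors.
slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]

print(ratios)
print(slopes)
```

Setting b to zero in this sketch makes every entry of `ratios` equal to m, which is the direct relationship described earlier.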

### the mathematics

#### y = mx + b

Given a pile of n points of 2 dimensional data…

$$x_1, x_2, x_3, \ldots, x_n$$

$$y_1, y_2, y_3, \ldots, y_n$$

Find an equation for the line of best fit.

y = mx + b

We are shooting for a minimal amount of error in the residuals, the distances from the data points to the line measured in the y direction. If we look at the sum of the squares of the residuals (identified using the symbol R²), the method is called a least squares fit, and it is probably the most common way to compute a best fit line.

$$R^2 = \sum_{i=1}^{n} (\Delta y_i)^2$$

$$R^2 = \sum_{i=1}^{n} \left[(mx_i + b) - y_i\right]^2$$

This expression has its minimum where the partial derivatives with respect to m and b are both zero. (The limits on the summations will be omitted out of laziness from now on.)

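$$\frac{\partial R^2}{\partial m} = 2 \sum \left\{\left[(mx_i + b) - y_i\right] x_i\right\} = 0$$

$$\frac{\partial R^2}{\partial b} = 2 \sum \left[(mx_i + b) - y_i\right] = 0$$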
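One way to organize the algebra (this intermediate step is a reconstruction, not part of the original text) is to divide each condition by 2 and distribute the sums, using the fact that summing the constant b over n points gives nb.

$$
\begin{aligned}
m \sum (x_i^2) + b \sum x_i &= \sum (x_i y_i) \\
m \sum x_i + n b &= \sum y_i
\end{aligned}
$$

This is a pair of simultaneous linear equations in the two unknowns m and b, often called the normal equations.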

After a bit of algebra, you get these equations…

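$$m = \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{n \sum (x_i^2) - \left(\sum x_i\right)^2}$$

$$b = \frac{\sum (x_i^2) \sum y_i - \sum x_i \sum (x_i y_i)}{n \sum (x_i^2) - \left(\sum x_i\right)^2}$$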

Where…

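$$
\begin{aligned}
n &= \text{number of data points} \\
\sum x_i &= \text{sum of the } x \text{ values} \\
\sum y_i &= \text{sum of the } y \text{ values} \\
\sum (x_i^2) &= \text{sum of the } x^2 \text{ values} \\
\sum (x_i y_i) &= \text{sum of the } xy \text{ products}
\end{aligned}
$$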
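The two formulas translate almost directly into code. The following is a minimal sketch in plain Python; the function name `best_fit_line` and the test points are made up for illustration, and the input checks are an added assumption rather than something the text requires.

```python
def best_fit_line(xs, ys):
    """Return (m, b) for the least squares line y = mx + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two points, with one y for every x")

    # The four sums that appear in the formulas for m and b.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Both formulas share the same denominator.
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical, so the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator
    return m, b


# Made-up points that lie exactly on y = 2x + 1.
m, b = best_fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)  # prints 2.0 1.0
```

Because the test points sit exactly on a line, the fit recovers that line's slope and intercept with no residual error.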

#### m and b sample calculation

Here's a small data set. Determine its line of best fit the hard way. Pretend you don't have a calculator or spreadsheet app that can figure this out in one step. Pretend you had to actually use the equations given in this section.

Sample data set

| x | y |
|---|---|
| 10 | 8.04 |
| 8 | 6.95 |
| 13 | 7.58 |
| 9 | 8.81 |
| 11 | 8.33 |
| 14 | 9.96 |
| 6 | 7.24 |
| 4 | 4.26 |
| 12 | 10.84 |
| 7 | 4.82 |
| 5 | 5.68 |

Add columns for x² and xy and compute all those values.

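Sample data set

| x | y | x² | xy |
|---|---|---|---|
| 10 | 8.04 | 100 | 80.40 |
| 8 | 6.95 | 64 | 55.60 |
| 13 | 7.58 | 169 | 98.54 |
| 9 | 8.81 | 81 | 79.29 |
| 11 | 8.33 | 121 | 91.63 |
| 14 | 9.96 | 196 | 139.44 |
| 6 | 7.24 | 36 | 43.44 |
| 4 | 4.26 | 16 | 17.04 |
| 12 | 10.84 | 144 | 130.08 |
| 7 | 4.82 | 49 | 33.74 |
| 5 | 5.68 | 25 | 28.40 |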

Total up each column.

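Sample data set

| x | y | x² | xy |
|---|---|---|---|
| 10 | 8.04 | 100 | 80.40 |
| 8 | 6.95 | 64 | 55.60 |
| 13 | 7.58 | 169 | 98.54 |
| 9 | 8.81 | 81 | 79.29 |
| 11 | 8.33 | 121 | 91.63 |
| 14 | 9.96 | 196 | 139.44 |
| 6 | 7.24 | 36 | 43.44 |
| 4 | 4.26 | 16 | 17.04 |
| 12 | 10.84 | 144 | 130.08 |
| 7 | 4.82 | 49 | 33.74 |
| 5 | 5.68 | 25 | 28.40 |
| ∑x = 99 | ∑y = 82.51 | ∑x² = 1001 | ∑xy = 797.60 |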

Substitute numbers into equations and be done: first the slope m

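$$
\begin{aligned}
m &= \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{n \sum (x_i^2) - \left(\sum x_i\right)^2} \\[1ex]
m &= \frac{(11)(797.60) - (99)(82.51)}{(11)(1001) - (99)^2} \\[1ex]
m &= 0.500
\end{aligned}
$$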

and then the intercept b.

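$$
\begin{aligned}
b &= \frac{\sum (x_i^2) \sum y_i - \sum x_i \sum (x_i y_i)}{n \sum (x_i^2) - \left(\sum x_i\right)^2} \\[1ex]
b &= \frac{(1001)(82.51) - (99)(797.60)}{(11)(1001) - (99)^2} \\[1ex]
b &= 3.00
\end{aligned}
$$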
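The hand calculation can be double-checked with a few lines of code. The following is a sketch in plain Python that rebuilds the column totals from the sample data set and then applies the same two formulas; the variable names are chosen here for illustration.

```python
# The sample data set from this section.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Column totals, matching the bottom row of the table.
n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Slope and intercept share the same denominator.
denominator = n * sum_x2 - sum_x ** 2
m = (n * sum_xy - sum_x * sum_y) / denominator
b = (sum_x2 * sum_y - sum_x * sum_xy) / denominator

print(f"m = {m:.3f}")  # prints m = 0.500
print(f"b = {b:.2f}")  # prints b = 3.00
```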

#### the coefficients of determination and correlation

How good is the line of best fit? Are some bests better than others? Here's one way to decide. Swap the explanatory and response variables.

x = m'y + b'

The slope of this new linear equation is just the old one with all the x's replaced by y's and vice versa. (Note that, because multiplication is commutative, the numerator hasn't really changed.)

$$m' = \frac{n \sum (y_i x_i) - \sum y_i \sum x_i}{n \sum (y_i^2) - \left(\sum y_i\right)^2}$$

Now, multiply this new slope by the old slope. Don't ask why, just do it.

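$$m\,m' = \left(\frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{n \sum (x_i^2) - \left(\sum x_i\right)^2}\right) \left(\frac{n \sum (y_i x_i) - \sum y_i \sum x_i}{n \sum (y_i^2) - \left(\sum y_i\right)^2}\right)$$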

This product is known as the coefficient of determination

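$$r^2 = \frac{\left(n \sum (x_i y_i) - \sum x_i \sum y_i\right)^2}{\left(n \sum (x_i^2) - \left(\sum x_i\right)^2\right) \left(n \sum (y_i^2) - \left(\sum y_i\right)^2\right)}$$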

and its square root is called the coefficient of correlation.

$$r = \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{\sqrt{n \sum (x_i^2) - \left(\sum x_i\right)^2}\,\sqrt{n \sum (y_i^2) - \left(\sum y_i\right)^2}}$$
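The product-of-slopes recipe can be sketched in code as follows. This is an illustration in plain Python, not a reference implementation; the function name `correlation` and the test points are made up, and the sketch assumes the x values are not all identical and the y values are not all identical, so neither denominator is zero.

```python
import math


def correlation(xs, ys):
    """Return (r_squared, r) computed from the two regression slopes."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Both slopes share this numerator because multiplication is commutative.
    numerator = n * sum_xy - sum_x * sum_y
    m = numerator / (n * sum_x2 - sum_x ** 2)        # slope of y = mx + b
    m_prime = numerator / (n * sum_y2 - sum_y ** 2)  # slope of x = m'y + b'

    r_squared = m * m_prime
    # r takes the sign of the shared numerator, which is also the sign of the slope.
    r = math.copysign(math.sqrt(r_squared), numerator)
    return r_squared, r


# Made-up points that lie exactly on the falling line y = 6 - 2x.
r_squared, r = correlation([0, 1, 2, 3], [6, 4, 2, 0])
print(r_squared, r)  # prints 1.0 -1.0
```

Points that fall exactly on a line give r² = 1, and the negative slope shows up as a negative r.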

#### r² and r sample calculation

Continue using the sample data set. Add a column for y² and determine its sum.

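Sample data set

| x | y | x² | xy | y² |
|---|---|---|---|---|
| 10 | 8.04 | 100 | 80.40 | 64.6416 |
| 8 | 6.95 | 64 | 55.60 | 48.3025 |
| 13 | 7.58 | 169 | 98.54 | 57.4564 |
| 9 | 8.81 | 81 | 79.29 | 77.6161 |
| 11 | 8.33 | 121 | 91.63 | 69.3889 |
| 14 | 9.96 | 196 | 139.44 | 99.2016 |
| 6 | 7.24 | 36 | 43.44 | 52.4176 |
| 4 | 4.26 | 16 | 17.04 | 18.1476 |
| 12 | 10.84 | 144 | 130.08 | 117.5056 |
| 7 | 4.82 | 49 | 33.74 | 23.2324 |
| 5 | 5.68 | 25 | 28.40 | 32.2624 |
| ∑x = 99 | ∑y = 82.51 | ∑x² = 1001 | ∑xy = 797.60 | ∑y² = 660.1727 |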

Substitute and calculate to get r², the coefficient of determination.

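$$
\begin{aligned}
r^2 &= \frac{\left(n \sum (x_i y_i) - \sum x_i \sum y_i\right)^2}{\left(n \sum (x_i^2) - \left(\sum x_i\right)^2\right) \left(n \sum (y_i^2) - \left(\sum y_i\right)^2\right)} \\[1ex]
r^2 &= \frac{\left[(11)(797.60) - (99)(82.51)\right]^2}{\left[(11)(1001) - (99)^2\right] \left[(11)(660.1727) - (82.51)^2\right]} \\[1ex]
r^2 &= 0.667
\end{aligned}
$$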

Take the root of this to get r, the coefficient of correlation. Use the positive root since the line of best fit has a positive slope.

$$
\begin{aligned}
r &= +\sqrt{r^2} \\
r &= +\sqrt{0.667} \\
r &= +0.816
\end{aligned}
$$
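As a check on the arithmetic, the same substitution can be done in code starting from the column totals in the table. This is a minimal sketch in plain Python; the variable names are chosen here for illustration.

```python
import math

# Column totals taken from the bottom row of the sample data table.
n = 11
sum_x, sum_y = 99, 82.51
sum_x2, sum_xy, sum_y2 = 1001, 797.60, 660.1727

numerator = n * sum_xy - sum_x * sum_y
x_part = n * sum_x2 - sum_x ** 2
y_part = n * sum_y2 - sum_y ** 2

r_squared = numerator ** 2 / (x_part * y_part)
r = math.sqrt(r_squared)  # positive root, since the line of best fit has a positive slope

print(f"r^2 = {r_squared:.3f}")  # prints r^2 = 0.667
print(f"r = {r:.3f}")            # prints r = 0.816
```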