The Physics
Hypertextbook
Opus in profectus

Linear Regression


Discussion

the concepts

Keywords: linear relationship, linearly related, linear regression, line of best fit, best fit line, least squares fit, coefficient of determination, coefficient of correlation, … ?

When two quantities are directly proportional or directly related…

y ∝ x

…their ratio is a constant.

$$\frac{y}{x} = \text{a constant}$$

When two quantities are linearly related, they are not quite directly proportional. It's not their values that are proportional, but the changes in their values.

∆y ∝ ∆x

The ratio of these quantities is a constant that should be familiar to you. On a graph of a straight line this ratio is known as the slope.

$$\frac{\Delta y}{\Delta x} = \text{slope}$$

The symbol for this ratio is the letter m, probably because m is the first letter in the word slope.

$$\frac{\Delta y}{\Delta x} = m$$

A graph of a direct relationship is a straight line that runs through the origin. When x is zero y is zero. A graph of a linear relationship is a straight line that may or may not run through the origin. When x is zero, y could be zero or it could be something else. A linear relationship is one that is partly direct and partly constant. That constant is known as the y intercept, which is indicated using the brilliantly chosen letter b. Altogether as an equation…

y = mx + b
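The difference between a direct and a linear relationship can be checked numerically. The following is a minimal sketch; the slope, intercept, and x values are made up for illustration and are not from this section.

```python
# Hypothetical slope, intercept, and x values, chosen only for illustration.
m, b = 2.0, 3.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [m * x + b for x in xs]

# Ratio of the values themselves: not constant when b is not zero.
ratios = [y / x for x, y in zip(xs, ys)]

# Ratio of the changes between neighboring points: always the slope m.
slopes = [
    (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    for i in range(len(xs) - 1)
]

print(ratios)
print(slopes)
```

With b set to zero, the two lists would agree and the relationship would be a direct one.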

the mathematics

y = mx + b

Given a pile of n points of 2 dimensional data…

$$x_1, x_2, x_3, \ldots, x_n$$

$$y_1, y_2, y_3, \ldots, y_n$$

Find an equation for the line of best fit.

y = mx + b

We are shooting for a minimal amount of error in the residuals: the distances from the data points to the line, measured in the y direction. If we look at the sum of the squares of the residuals (identified using the symbol R²), the method is called a least squares fit, and it is probably the most common way to compute a best fit line.

$$R^2 = \sum_{i=1}^{n} (\Delta y_i)^2$$

$$R^2 = \sum_{i=1}^{n} \left[(mx_i + b) - y_i\right]^2$$
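As a rough sketch, the sum of squared residuals can be written as a short function. The function name and the three test points below are hypothetical, chosen only to show that a line passing through every point gives zero and any other line gives something larger.

```python
def sum_squared_residuals(xs, ys, m, b):
    """Return the sum of [(m*x + b) - y]**2 over all data points."""
    return sum(((m * x + b) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

perfect = sum_squared_residuals(xs, ys, m=2.0, b=1.0)  # line through every point
shifted = sum_squared_residuals(xs, ys, m=2.0, b=0.0)  # same slope, wrong intercept

print(perfect, shifted)
```

The least squares fit is the choice of m and b that makes this number as small as possible.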

This expression has its minimum where the partial derivatives with respect to m and b are both zero. (The limits on the summations will be omitted out of laziness from now on.)

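$$\frac{\partial R^2}{\partial m} = 2 \sum \left\{\left[(mx_i + b) - y_i\right] x_i\right\} = 0$$

$$\frac{\partial R^2}{\partial b} = 2 \sum \left[(mx_i + b) - y_i\right] = 0$$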

After a bit of algebra, you get these equations…

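$$m = \frac{n \sum(x_i y_i) - \sum x_i \sum y_i}{n \sum(x_i^2) - \left(\sum x_i\right)^2}$$

$$b = \frac{\sum(x_i^2) \sum y_i - \sum x_i \sum(x_i y_i)}{n \sum(x_i^2) - \left(\sum x_i\right)^2}$$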

Where…

$n$ = number of data points
$\sum x_i$ = sum of the x values
$\sum y_i$ = sum of the y values
$\sum(x_i^2)$ = sum of the x² values
$\sum(x_i y_i)$ = sum of the xy products
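For readers who want to see the bit of algebra, here is one way it can go (a sketch by elimination; other routes work just as well). Set each partial derivative to zero, divide by 2, and distribute the sums to get two equations in the two unknowns m and b.

$$
\begin{aligned}
m \sum(x_i^2) + b \sum x_i &= \sum(x_i y_i) \\
m \sum x_i + b\,n &= \sum y_i
\end{aligned}
$$

Multiply the first equation by $n$ and the second by $\sum x_i$, then subtract to eliminate b.

$$
m \left[ n \sum(x_i^2) - \left(\sum x_i\right)^2 \right] = n \sum(x_i y_i) - \sum x_i \sum y_i
$$

Multiply the second equation by $\sum(x_i^2)$ and the first by $\sum x_i$, then subtract to eliminate m.

$$
b \left[ n \sum(x_i^2) - \left(\sum x_i\right)^2 \right] = \sum(x_i^2) \sum y_i - \sum x_i \sum(x_i y_i)
$$

Dividing each result by the bracketed factor gives the expressions for m and b.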

m and b sample calculation

Here's a small data set. Determine its line of best fit the hard way. Pretend you don't have a calculator or spreadsheet app that can figure this out in one step. Pretend you had to actually use the equations given in this section.

Sample data set
x    y
10   8.04
8    6.95
13   7.58
9    8.81
11   8.33
14   9.96
6    7.24
4    4.26
12   10.84
7    4.82
5    5.68

Add columns for x² and xy and compute all those values.

Sample data set
x    y       x²    xy
10   8.04    100   80.40
8    6.95    64    55.60
13   7.58    169   98.54
9    8.81    81    79.29
11   8.33    121   91.63
14   9.96    196   139.44
6    7.24    36    43.44
4    4.26    16    17.04
12   10.84   144   130.08
7    4.82    49    33.74
5    5.68    25    28.40

Total up each column.

Sample data set
x    y       x²     xy
10   8.04    100    80.40
8    6.95    64     55.60
13   7.58    169    98.54
9    8.81    81     79.29
11   8.33    121    91.63
14   9.96    196    139.44
6    7.24    36     43.44
4    4.26    16     17.04
12   10.84   144    130.08
7    4.82    49     33.74
5    5.68    25     28.40
∑x   ∑y      ∑x²    ∑xy
99   82.51   1001   797.60
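If you'd rather not add the columns by hand, the totals can be checked with a few lines of code. This is only a sketch; the variable names are my own, and the data are the eleven points from the table.

```python
# The eleven (x, y) pairs from the sample data set.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(xs)                                   # number of data points
sum_x = sum(xs)                               # total of the x column
sum_y = sum(ys)                               # total of the y column
sum_x2 = sum(x * x for x in xs)               # total of the x^2 column
sum_xy = sum(x * y for x, y in zip(xs, ys))   # total of the xy column

# Rounding hides the tiny floating point error in the decimal sums.
print(n, sum_x, round(sum_y, 2), sum_x2, round(sum_xy, 2))
```

The printed values should agree with the totals row of the table.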

Substitute numbers into equations and be done: first the slope m

$$m = \frac{n \sum(x_i y_i) - \sum x_i \sum y_i}{n \sum(x_i^2) - \left(\sum x_i\right)^2}$$

$$m = \frac{(11)(797.60) - (99)(82.51)}{(11)(1001) - (99)^2}$$

$$m = 0.500$$

and then the intercept b.

$$b = \frac{\sum(x_i^2) \sum y_i - \sum x_i \sum(x_i y_i)}{n \sum(x_i^2) - \left(\sum x_i\right)^2}$$

$$b = \frac{(1001)(82.51) - (99)(797.60)}{(11)(1001) - (99)^2}$$

$$b = 3.00$$
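The same calculation can be sketched as a small function that applies the two equations for m and b directly. The function name `least_squares_fit` is hypothetical, and the guard against a zero denominator is my addition, not something discussed in this section.

```python
def least_squares_fit(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Both equations share the same denominator.
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        # Assumption: treat identical x values as an error (vertical line).
        raise ValueError("all x values are identical; slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    return m, b


# The eleven (x, y) pairs from the sample data set.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

m, b = least_squares_fit(xs, ys)
print(round(m, 3), round(b, 2))
```

To three significant figures, the results should match the hand calculation.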

the coefficients of determination and correlation

How good is the line of best fit? Are some bests better than others? Here's one way to decide. Swap the explanatory and response variables.

x = m'y + b'

The slope of this new linear equation is just the old one with all the x's replaced by y's and vice versa. (Note that, because multiplication is commutative, the numerator hasn't really changed.)

$$m' = \frac{n \sum(y_i x_i) - \sum y_i \sum x_i}{n \sum(y_i^2) - \left(\sum y_i\right)^2}$$

Now, multiply this new slope by the old slope. Don't ask why, just do it.

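$$m\,m' = \left(\frac{n \sum(x_i y_i) - \sum x_i \sum y_i}{n \sum(x_i^2) - \left(\sum x_i\right)^2}\right)\left(\frac{n \sum(y_i x_i) - \sum y_i \sum x_i}{n \sum(y_i^2) - \left(\sum y_i\right)^2}\right)$$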

This product is known as the coefficient of determination

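$$r^2 = \frac{\left(n \sum(x_i y_i) - \sum x_i \sum y_i\right)^2}{\left(n \sum(x_i^2) - \left(\sum x_i\right)^2\right)\left(n \sum(y_i^2) - \left(\sum y_i\right)^2\right)}$$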

and its square root is called the coefficient of correlation.

$$r = \frac{n \sum(x_i y_i) - \sum x_i \sum y_i}{\sqrt{n \sum(x_i^2) - \left(\sum x_i\right)^2}\,\sqrt{n \sum(y_i^2) - \left(\sum y_i\right)^2}}$$
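One way to see why the product of the two slopes is a sensible measure: when the points fall exactly on a line, the swapped slope is the reciprocal of the original slope, so the product is exactly 1. Scatter in the data pulls the product below 1. Here is a minimal sketch of that idea; the function names and the two small data sets are hypothetical.

```python
def slope(us, vs):
    """Least squares slope of vs regressed on us."""
    n = len(us)
    sum_u = sum(us)
    sum_v = sum(vs)
    sum_u2 = sum(u * u for u in us)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    return (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)


def coefficient_of_determination(xs, ys):
    """Product of the y-on-x slope and the x-on-y slope."""
    return slope(xs, ys) * slope(ys, xs)


# Hypothetical data: the first set lies exactly on y = 2x + 3,
# the second has two points nudged off that line.
xs = [1.0, 2.0, 3.0, 4.0]
exact_ys = [5.0, 7.0, 9.0, 11.0]
noisy_ys = [5.0, 8.0, 8.0, 11.0]

r2_exact = coefficient_of_determination(xs, exact_ys)
r2_noisy = coefficient_of_determination(xs, noisy_ys)
print(r2_exact, r2_noisy)
```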

r² and r sample calculation

Continue using the sample data set. Add a column for y² and determine its sum.

Sample data set
x    y       x²     xy       y²
10   8.04    100    80.40    64.6416
8    6.95    64     55.60    48.3025
13   7.58    169    98.54    57.4564
9    8.81    81     79.29    77.6161
11   8.33    121    91.63    69.3889
14   9.96    196    139.44   99.2016
6    7.24    36     43.44    52.4176
4    4.26    16     17.04    18.1476
12   10.84   144    130.08   117.5056
7    4.82    49     33.74    23.2324
5    5.68    25     28.40    32.2624
∑x   ∑y      ∑x²    ∑xy      ∑y²
99   82.51   1001   797.60   660.1727

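Substitute and calculate to get r², the coefficient of determination.

$$r^2 = \frac{\left(n \sum(x_i y_i) - \sum x_i \sum y_i\right)^2}{\left(n \sum(x_i^2) - \left(\sum x_i\right)^2\right)\left(n \sum(y_i^2) - \left(\sum y_i\right)^2\right)}$$

$$r^2 = \frac{\left[(11)(797.60) - (99)(82.51)\right]^2}{\left[(11)(1001) - (99)^2\right]\left[(11)(660.1727) - (82.51)^2\right]}$$

$$r^2 = 0.667$$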

Take the square root of this to get r, the coefficient of correlation. Use the positive root since the line of best fit has a positive slope.

$$r = +\sqrt{r^2}$$

$$r = +\sqrt{0.667}$$

$$r = +0.816$$
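As a check on the arithmetic, here is a short sketch that computes both coefficients for the sample data set. The variable names are my own. Computing r from its own formula, rather than from the square root of r², has the side benefit that the sign comes out automatically from the numerator.

```python
import math

# The eleven (x, y) pairs from the sample data set.
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

numerator = n * sum_xy - sum_x * sum_y
denom_x = n * sum_x2 - sum_x ** 2   # x factor of the denominator
denom_y = n * sum_y2 - sum_y ** 2   # y factor of the denominator

r2 = numerator ** 2 / (denom_x * denom_y)        # coefficient of determination
r = numerator / math.sqrt(denom_x * denom_y)     # coefficient of correlation

print(round(r2, 3), round(r, 3))
```

Rounded to three decimal places, both values should agree with the hand calculation.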