Linear Regression

Practice

practice problem 1

electric-energy.txt
In the United States, electric energy is measured in kilowatt hours and purchased with dollars. This data set came from 12 months of electrical bills for a New York City apartment in the early years of the Twenty-first Century.
  1. Plot a graph of cost vs. energy consumed and determine the equation of the best fit straight line.
  2. Explain the significance of the coefficients m, b, and r².

solution

  1. Here's what the graph looks like.

  2. The slope (m) of a linear function is the rate of change of the vertical quantity (y) with respect to the horizontal quantity (x). It should be apparent that the slope of this graph is the average price for electricity per kilowatt hour. The real question should be, why doesn't this graph intercept the vertical axis at the origin? Surely, if I were to use no energy I should pay no money. When I don't go to a restaurant, I don't get charged. Why should electricity be any different? Well, there are two answers to this question. One is that utility companies as legal monopolies are trying to extract every penny they can from their captive customers. The second, which is the contention of the utilities themselves, is that there are fixed expenses associated with every customer regardless of how much energy they consume: maintenance, administration, insurance, etc. Such fixed expenses are gathered under the umbrella term "basic service charges". Thus, in this bill …
    1. m (the slope) is the average price for electric energy: 14.7¢ per kilowatt hour;
    2. b (the y-intercept) is the average basic service charge: $9.81 per month. Note the lone data point on the left hand side of the graph that illustrates this policy. The entire household was on vacation for the entire month, the electricity was shut off at the circuit breaker, and yet still there was a charge.
    3. r (the correlation coefficient) shows the correlation between energy and cost: 0.96. The more important quantity for this analysis is the square of this number, the coefficient of determination: r² = 0.92. This number shows that 92% of the variation in these electric bills is explained by the amount of energy consumed. The remaining 8% of the variation is due to seasonal effects (electricity is always more expensive in summer when air conditioner use drives up demand) and bracket billing policies (the first quarter megawatt hour or so is cheaper than the rest with this utility). These secondary variations are evident in the data point in the extreme upper right hand corner of the graph that lies well above the line of best fit. It occurred during the summer when rates and consumption were at their highest.
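The three coefficients discussed above can be computed directly from the least-squares formulas. What follows is a minimal Python sketch, not necessarily the tool used to produce the numbers in this solution. The function name linear_fit and the sample bills are hypothetical, invented for illustration; they are not the contents of electric-energy.txt.

```python
import math


def linear_fit(xs, ys):
    """Return (m, b, r) for the least-squares line y = m*x + b.

    m is the slope, b is the y-intercept, and r is the
    correlation coefficient.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")

    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Sums of squared deviations and of cross products
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary for the fit to be defined")

    m = sxy / sxx
    b = y_mean - m * x_mean
    r = sxy / math.sqrt(sxx * syy)
    return m, b, r


# Hypothetical monthly bills (made-up numbers, for illustration only)
energy_kwh = [0, 150, 220, 310, 400, 520]
cost_usd = [9.50, 32.00, 41.00, 56.50, 68.00, 90.00]

m, b, r = linear_fit(energy_kwh, cost_usd)
print(f"slope m = {m:.4f} $/kWh")
print(f"intercept b = {b:.2f} $")
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

With real billing data in place of the made-up lists, m would be the average price per kilowatt hour, b the basic service charge, and r squared the fraction of the variation in cost explained by energy use.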

practice problem 2

dash-world-records.txt (utf-8)
The text file referenced above has data on the world records for the 100 m dash. The data are broken up into four groups:
  1. men's electronically-timed world records
  2. men's hand-timed world records
  3. women's electronically-timed world records
  4. women's hand-timed world records

Analyze this data.

  1. Perform a linear regression on both men's and women's world record times as a function of the year the record was set.
  2. Explain the significance of the numerical results.
  3. Make an interesting prediction.

Source: International Association of Athletics Federations (IAAF)

solution

  1. The graph …

    [slide]

  2. The numbers …

                men               women
    y =         mx + b            mx + b
    m =         −0.009052 s/yr    −0.02399 s/yr
    b =         +27.84 s          +58.32 s
    r =         −0.9511           −0.9199
         
    The slope of this graph shows us that men's times are decreasing by approximately 0.01 seconds each year. Women's times are decreasing faster, by roughly 0.02 seconds per year, which is more than twice the men's rate.
         
    The y-intercept would be the world record in the year zero (a year that does not exist, by the way). Extrapolating this linear fit back 20 centuries would be a stupid thing to do. Surely there was someone around at the turn of the first millennium who could run a hundred meters in under 27 seconds. The y-intercept for women is extra foolish. Nearly a minute to run 100 m? I don't think so. Linear regression is nice, but it isn't a religion. You don't have to believe everything it says.
         
    The r value gives us an indication of how well the data can be explained by a linear model. Squaring −0.9511 gives us 0.9046, which means a linear model explains 90% of the variation in men's world record 100 m dash times. That's quite a reasonable fit to an artificial model. The fit is not quite as tight for the women's times. Squaring −0.9199 yields a coefficient of determination of 0.8462. Thus a linear model only explains 85% of the variation in women's world record 100 m dash times. Still pretty good for a messy data set like this one.
         
  3. I find it somewhat surprising that the trends in world record times can be so well explained by a linear model. I would have expected that the data would show the athletes approaching some limit. Surely, humans can't keep running faster and faster indefinitely. There must be some performance "wall" ahead of them — something to keep them from running faster than a speeding bullet. As far as the last century goes, this appears not to be the case. Times have been shrinking at a steady rate. Assuming they keep up like this, women sprinters will eventually outrun their male counterparts some time in the middle of the Twenty-first Century. We can even predict the year at which the transition will occur. Set the two regression equations equal and see what happens.
         
    (mx + b)men  =  (mx + b)women
    (−0.009052 x + 27.84)  =  (−0.02399 x + 58.32)
    (0.02399 − 0.009052) x  =  (58.32 − 27.84)
    0.014938 x  =  30.48
    x  =  (30.48 ÷ 0.014938)
    x  =  2040

If you really felt that world record times would follow a linear progression you might even try determining the day in 2040 when the women catch up to the men. But since I recognize the limitations of this model, I won't be entering the office "men-vs.-women-hundred-meter-dash" pool. In fact, if we choose a slightly different data set, we'll end up predicting a significantly different transition year. These calculations are left as an exercise for the reader.
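The crossover calculation above is easy to check numerically. Here is a short Python sketch that uses the slopes and intercepts quoted in the solution; the function name crossover is hypothetical, chosen for illustration.

```python
def crossover(m1, b1, m2, b2):
    """Return the x where the lines y = m1*x + b1 and y = m2*x + b2 meet."""
    if m1 == m2:
        raise ValueError("parallel lines never cross")
    # m1*x + b1 = m2*x + b2  ->  (m1 - m2)*x = b2 - b1
    return (b2 - b1) / (m1 - m2)


# Regression coefficients from the solution (slope in s/yr, intercept in s)
m_men, b_men = -0.009052, 27.84
m_women, b_women = -0.02399, 58.32

year = crossover(m_men, b_men, m_women, b_women)
time_at_crossover = m_men * year + b_men

print(f"crossover year: {year:.2f}")
print(f"predicted record time then: {time_at_crossover:.2f} s")
```

The fractional part of the year is what one would convert into a calendar day, for anyone who trusts the linear model that far.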

practice problem 3

vostok.txt
Snow rarely gets a chance to melt in Antarctica, even in the summer when the sun never sets. In the interior of the continent, the temperature of the air hasn't been above the freezing point of water in any significant way for the last 900,000 years. The snow that falls there accumulates and accumulates and accumulates until it compresses into rock solid ice — up to 4.5 km thick in some regions. Since the snow that falls is originally fluffy with air, the ice that eventually forms still holds remnants of this air — very, very old air. By examining the isotopic composition of the gases in carefully extracted cores of this ice we can learn things about the past climatic conditions on earth. By extension we might also predict some things about the climate of the future. The columns in this data set are as follows:
  1. Age of air (years before present)
  2. Temperature anomaly with respect to the mean recent time value (℃)
  3. Carbon dioxide concentration (ppm)
  4. Dust concentration (ppm)

Source: Adapted from Petit, et al. 1999.

Questions …

  1. CO2
    1. Construct a set of overlapping time series graphs for CO2 concentration and temperature anomaly.
    2. Construct a scatter plot of temperature anomaly vs. CO2 concentration.
    3. How are atmospheric carbon dioxide concentration and temperature anomaly related?
    4. What temperature anomaly might one expect given current atmospheric CO2 levels?

solution

  1. CO2
    1. Here are the overlapping time series graphs. The data show a definite correlation. The two quantities go up and down in near synchrony.

      [slide]

    2. Here's the scatter plot of the two time-varying quantities plotted against one another. The data form a dense cloud that is roughly oval-shaped. The best fit line slices nicely through the data.

      [slide]

    3. Temperature varies linearly with atmospheric carbon dioxide concentration. Low CO2 levels go with a cooler climate and high CO2 levels go with a warmer climate.
    4. What does our linear regression analysis predict given current carbon dioxide levels?
        
      y = mx + b
      y = (0.0908 ℃/ppm)(386 ppm) − 25.23 ℃
      y = +9.8 ℃

      The current consensus among working climate scientists is that the globe will warm +5 ℃ on average over the course of the Twenty-first Century. The increase is expected to be smaller than average near the equator and greater than average near the poles. Since the Vostok ice cores were collected in Antarctica, our prediction of approximately +10 ℃ is right in line with those made by more sophisticated means.

      Correlation is not causation, however. Graphs like those used in this problem cannot tell us whether carbon dioxide affects temperature, temperature affects carbon dioxide, or some third factor is affecting both. We need a theoretical model that describes which way the cause and effect work. That model is described in more detail in the section of this book that deals with heat transfer by radiation.

      Carbon dioxide is a greenhouse gas. Its role in atmospheric thermodynamics is much like the glass in a greenhouse. It is transparent to visible light, but not to infrared. Visible light easily punches through the atmosphere. It is absorbed by the ground and then reradiated as infrared. The infrared is partly blocked by the atmosphere and has a hard time escaping out into space. This little delay keeps the earth comfortably warm. Water vapor, carbon dioxide, methane, and other gases have been shown to play a significant role in this process. They all interact with infrared radiation. These properties have been measured in tabletop laboratory experiments that had no direct connection to climatology.

      Atmospheric carbon dioxide levels have increased steadily over the past 100 to 150 years. This is due to the burning of coal, petroleum, and natural gas as well as deforestation and other changes in land use associated with the Industrial Revolution. During this same time period, average global temperatures have been generally increasing and there is no reason to believe that this trend will quit anytime soon. Climate models all show that as long as CO2 concentrations stay somewhere around their turn of the Twenty-first Century levels, global temperatures will continue to increase for the next 100 years. This conclusion is based on solid scientific reasoning and is regarded by nearly all climate scientists as valid. The scientific questions that remain unanswered are: how can we increase the precision and reliability of our global climate predictions and what effect will the inevitable changes have on life as we know it? The question of what is to be done about all of this is left to the people to answer.
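The temperature anomaly predicted in part 4 is a single evaluation of the best fit line. As a small Python sketch, using the slope, intercept, and CO2 level quoted in the solution (the function name predict is hypothetical):

```python
def predict(m, b, x):
    """Evaluate the best fit line y = m*x + b at x."""
    return m * x + b


slope = 0.0908        # degrees Celsius per ppm, from the solution
intercept = -25.23    # degrees Celsius, from the solution
co2_now = 386         # ppm, the value used in the solution

anomaly = predict(slope, intercept, co2_now)
print(f"predicted temperature anomaly: {anomaly:+.1f} degrees Celsius")
```

Swapping in a different CO2 concentration shows how sensitive the linear prediction is to the input. Whether the fit can be trusted that far outside the range of the ice core data is a separate question.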

practice problem 4

anscombe-data.txt
This collection of four hypothetical data sets in one table was created by F.J. Anscombe in 1973 for use as a teaching tool. The data don't correspond to any real experiment. They are just a bunch of numbers with a peculiar behavior. Identify this peculiarity by calculating the coefficients m, b, and r for each of the four data sets, then look at each graph with your eyes and employ your brain to make a judgment. Is linear regression the right tool for analyzing this data? If not, why not and what should be done instead? The columns should be paired up in the following manner …
  1. X and Y1
  2. X and Y2
  3. X and Y3
  4. X4 and Y4

Source: Graphs in Statistical Analysis. Anscombe, F.J. The American Statistician. Vol. 27 No. 1 (February 1973): 19.

solution

These data sets have been rigged to have the same slope (0.50), y-intercept (3.00), and correlation (0.82). Only one of them should be analyzed with a best fit straight line. This shows that there is more to data analysis than number crunching. Any brainless computer can process data. An actual working brain is needed to understand it.
  1. A linear fit is useful here. Not much more needs to be said.
       
    [slide]  
       
  2. A linear fit is not useful here. This is probably a quadratic or some other kind of polynomial.
       
    [slide] [slide]
       
  3. That one outlier should be removed and a linear fit tried again. An alternate solution would be to investigate further. Someone may have entered the wrong number or a piece of equipment may have failed. (My money is on the former.)
       
    [slide] [slide]
       
  4. The linear fit is strongly affected by that one outlier. Without it, however, there isn't enough variation to see a trend. There isn't much that can be done with this data set. We would need to know what these numbers are all about before we should even consider graphing them. Maybe a graph isn't even the right idea.
       
    [slide] [slide]
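The claim that all four data sets share nearly the same slope, y-intercept, and correlation can be checked with a few lines of Python. This is a sketch under stated assumptions: the values below are the ones published in Anscombe (1973), typed in by hand rather than read from anscombe-data.txt, and the function name linear_fit is hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Return (m, b, r) for the least-squares line y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_mean - m * x_mean
    r = sxy / math.sqrt(sxx * syy)
    return m, b, r


# Anscombe's four data sets (X is shared by the first three)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

pairs = [("X, Y1", x, y1), ("X, Y2", x, y2), ("X, Y3", x, y3), ("X4, Y4", x4, y4)]

results = {}
for name, xs, ys in pairs:
    results[name] = linear_fit(xs, ys)
    m, b, r = results[name]
    print(f"{name}: m = {m:.2f}, b = {b:.2f}, r = {r:.2f}")
```

The printed coefficients are not enough to tell the four sets apart. Only the graphs reveal which one deserves a straight line.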
       

practice problem 5

standard-atmosphere.txt
This text file provides standard meteorological data for the earth's atmosphere as a function of altitude above sea level.
  1. Find the transformation that will relate the pressure to altitude with a linear equation.
  2. Write the nonlinear equation that results.

solution