Research 10: Linear Regression

practice

  1. electric-energy.txt
    In the United States, electric energy is measured in kilowatt hours and purchased with dollars. This data set came from 12 months of electrical bills for a New York City apartment in the early years of the twenty-first century.
    1. Plot a graph of cost vs. energy consumed and determine the equation of the best fit straight line.
    2. Explain the significance of the coefficients m, b, and r².
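A minimal sketch of the least-squares calculation behind m, b, and r² is given here for readers who want to check their spreadsheet or calculator. The function name `linear_fit` and the sample numbers are hypothetical; the numbers are made up for illustration and are not taken from electric-energy.txt.

```python
def linear_fit(xs, ys):
    """Return slope m, intercept b, and coefficient of determination r2
    for the least-squares line y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of squared deviations and cross deviations about the means
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_mean - m * x_mean
    r2 = sxy ** 2 / (sxx * syy)
    return m, b, r2


# Made-up numbers for illustration only (not from electric-energy.txt)
energy_kwh = [100.0, 200.0, 300.0, 400.0]
cost_usd = [25.0, 40.0, 55.0, 70.0]

m, b, r2 = linear_fit(energy_kwh, cost_usd)
print(m, b, r2)
```

With cost on the vertical axis and energy on the horizontal axis, the slope carries units of dollars per kilowatt hour and the intercept carries units of dollars.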
  2. standard-atmosphere.txt
    This text file provides standard meteorological data for the earth's atmosphere as a function of altitude above sea level.
    1. Find the transformation that will relate the pressure to altitude with a linear equation.
    2. Write the nonlinear equation that results.
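One candidate transformation is to take the natural logarithm of the dependent variable. This works when a quantity changes roughly exponentially with the independent variable. The sketch below uses synthetic data generated from a known exponential, not the standard atmosphere file, and the helper name `linear_fit` is hypothetical. Whether this is the right transformation for the pressure data is something to judge from the transformed graph.

```python
import math


def linear_fit(xs, ys):
    """Return slope m and intercept b of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_mean - m * x_mean


# Synthetic data generated from y = 5 * exp(-0.5 * x), for illustration only
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [5.0 * math.exp(-0.5 * x) for x in xs]

# Transform: ln(y) = ln(A) + k*x is a straight line in x
ln_ys = [math.log(y) for y in ys]
k, ln_a = linear_fit(xs, ln_ys)

# Undo the transform to recover the nonlinear equation y = A * exp(k * x)
A = math.exp(ln_a)
print(f"y = {A:.3f} * exp({k:.3f} * x)")
```

If the semilog plot is not straight, a log-log plot (logarithms of both variables) tests for a power law instead.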
  3. vostok.txt
    Snow rarely gets a chance to melt in Antarctica, even in the summer when the sun never sets. In the interior of the continent, the temperature of the air hasn't been above the freezing point of water in any significant way for the last 900,000 years. The snow that falls there accumulates and accumulates and accumulates until it compresses into rock solid ice — up to 4.5 km thick in some regions. Since the snow that falls is originally fluffy with air, the ice that eventually forms still holds remnants of this air — very, very old air. By examining the isotopic composition of the gases in carefully extracted cores of this ice we can learn things about the past climatic conditions on earth. By extension we might also predict some things about the climate of the future. The columns in this data set are as follows:
    1. Age of air (years before present)
    2. Temperature anomaly with respect to the mean recent time value (℃)
    3. Carbon dioxide concentration (ppm)
    4. Dust concentration (ppm)

    Source: Adapted from Petit, et al. 1999.

    Questions …

    1. CO2
      1. Construct a set of overlapping time series graphs for CO2 concentration and temperature anomaly.
      2. Construct a scatter plot of temperature anomaly vs. CO2 concentration.
      3. How are atmospheric carbon dioxide concentration and temperature anomaly related?
      4. What temperature anomaly might one expect given current atmospheric CO2 levels?
  4. anscombe-data.txt
    This collection of four hypothetical data sets in one table was created by F.J. Anscombe in 1973 for use as a teaching tool. The data don't correspond to any real experiment. They are just a bunch of numbers with a peculiar behavior. Identify this peculiarity by calculating the coefficients m, b, and r for each of the four data sets, then look at each graph with your eyes and employ your brain to make a judgment. Is linear regression the right tool for analyzing this data? If not, why not and what should be done instead? The columns should be paired up in the following manner …
    1. X and Y1
    2. X and Y2
    3. X and Y3
    4. X4 and Y4
    Source: Anscombe, F.J. "Graphs in Statistical Analysis," The American Statistician. Vol. 27, No. 1 (1973): 19.

statistical

  1. For each of the following data sets …
      1. determine the equation of the best fit straight line(s) and
      2. explain the significance of the coefficients m, b, and r².
    Here are the data sets …
    1. braking-distance.txt
      In this road test, braking distances were measured for different cars traveling at 60 mph and 80 mph. Graph these distances against one another.
      Source: "Road Test Summary." Road & Track. (July 1998): 186-87.
    2. satellite-failures.txt
      Satellites in low earth orbit (LEO) operate between 250 and 1500 km above the ground. Because Earth's atmosphere extends hundreds of miles into space, LEO satellites eventually experience enough friction that they fall back to earth and burn up. The accompanying text file gives the number of low earth orbit satellites that reentered the earth's atmosphere and the number of sunspots for each year since 1969. Graph the number of reentered satellites vs. the number of sunspots.
      Source: NASA Goddard Space Flight Center.
    3. soap.txt
      The weight of the soap in a bathroom shower was recorded almost every day for about a month. Graph the mass of this soap as a function of time.
    4. standard-atmosphere.txt
      This text file provides standard meteorological data for the earth's atmosphere as a function of altitude above sea level. Graph temperature as a function of altitude for the tropospheric portion of the atmosphere from sea level to 11 km. (Do not analyze the entire data set. The atmosphere above 11 km behaves much differently.)
    5. toaster.txt
      The duration of the toast cycle was measured for different light-dark settings of a two-slot electric bread toaster. Graph cycle time as a function of light-dark setting for this toaster when it held one and two slices of bread.
    6. wavelength-of-light.txt
      In this experiment the wavelengths of the visible line spectra for an excited gas were measured using two different methods. Graph these trials against one another.
  2. Answer the questions associated with the following data sets.
    1. Determine the year when women sprinters will run as fast as their male counterparts in the 100 m dash using …
      1. dash-electronic-timing.txt (utf-8)
        only those world records that were timed electronically (as opposed to manually).
      2. dash-olympic-gold-medals.txt (utf-8)
        olympic gold medal winners (as opposed to world record setters).
      Compare your results with those obtained in practice problem 3.
      Source: International Association of Athletics Federations (IAAF)
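Finding the year when two trends meet amounts to fitting a line to each series and solving for the intersection of the two lines. The sketch below shows that arithmetic with made-up numbers, not the sprint records. The names `linear_fit` and `crossing_point` are hypothetical.

```python
def linear_fit(xs, ys):
    """Return slope m and intercept b of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_mean - m * x_mean


def crossing_point(fit_a, fit_b):
    """Return (x, y) where the lines y = m_a*x + b_a and y = m_b*x + b_b meet."""
    m_a, b_a = fit_a
    m_b, b_b = fit_b
    if m_a == m_b:
        raise ValueError("parallel lines never cross")
    x = (b_b - b_a) / (m_a - m_b)
    return x, m_a * x + b_a


# Made-up values for illustration only (not from the dash data files)
years = [1900.0, 1950.0, 2000.0]
series_a = [13.0, 12.0, 11.0]
series_b = [11.0, 10.5, 10.0]

x_cross, y_cross = crossing_point(linear_fit(years, series_a),
                                  linear_fit(years, series_b))
print(x_cross, y_cross)
```

A crossing that falls outside the range of the data is an extrapolation, and it is only as trustworthy as the assumption that both trends stay linear.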
    2. co2-mauna-loa.txt
      Mauna Loa Observatory on the "Big Island" of Hawaii has been recording atmospheric carbon dioxide concentrations for nearly half a century beginning in the year 1958. Readings are taken continuously, but only the monthly averages are reported. Values are reported in parts per million (ppm).
      1. Construct a graph of atmospheric CO2 concentration vs. time.
      2. What two obvious behaviors are revealed in your graph?
      3. Split the data set in half and perform a linear regression analysis on the data for the years …
        1. 1958-1980 and
        2. 1981-2006.
      4. Compare the behavior of CO2 levels in the first half of the data set to the second half.
      Source: Scripps Institution of Oceanography
    3. gw-vardo.txt
      Global warming is most easily observed in long term temperature measurements taken at high latitudes (near the poles). Vardø is a village in the extreme northeast of Norway on the Barents Sea. Despite being a few degrees north of the Arctic Circle, its harbor remains ice free due to the warm North Atlantic drift current (an extension of the Gulf Stream). Vardø's climate is mild for its latitude, which means it varies from a few ℃ above freezing in the summer to a few ℃ below freezing in the winter. A location with such a stable climate is a good place to check for human induced climate change.
      1. Construct a graph of monthly average temperature vs. time for the period 1881 to 2006.
      2. Using linear regression, determine the following quantities for the whole data set …
        1. the rate of change of temperature in ℃ per century
        2. the uncertainty in this value
        3. the coefficient of determination
        4. the root-mean-square error (if you have the ability to calculate this number)
      3. Divide the data set up into four equal intervals of roughly 378 months (31.5 years) and repeat.
      4. Compile your results in a table like the one below and comment on the manner in which temperatures have changed at Vardø in this 125 year period. (Use the results of all four calculated columns in your analysis, not just the rate of temperature change.)
                 
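      | time interval           | ΔT/Δt (℃/100 y) | uncertainty (℃/100 y) | r² | rmse (℃) |
      |-------------------------|-----------------|-----------------------|----|----------|
      | overall (1881-2006)     |                 |                       |    |          |
      | 1st quarter (1881-1912) |                 |                       |    |          |
      | 2nd quarter (1912-1943) |                 |                       |    |          |
      | 3rd quarter (1944-1975) |                 |                       |    |          |
      | 4th quarter (1975-2006) |                 |                       |    |          |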
       
      Source: NASA Goddard Institute for Space Studies
    4. gw-central-park.txt
      [Note: This is an extension of the previous problem, but it can be worked on independently with little loss of meaning.]
      Surface air temperatures have increased in New York City on the order of one degree Celsius in the twentieth century, consistent with the trend of global warming. New York is the largest city in the United States and the fourth largest metropolitan area on the planet. 8.5 million people live within the city limits and an additional 10 million are within commuting distance. With a gross metropolitan product approaching one trillion dollars ($10¹²) the economy of New York City is larger than that of all but a dozen or so nations. This geographic concentration of people and economic power must certainly have an effect on the local climate. Repeat the analysis described in the previous problem using 125 years' worth of temperature measurements taken in Central Park in New York City.
      1. Construct a graph of monthly average temperature vs. time for the period 1881 to 2006.
      2. Using linear regression, determine the following quantities for the whole data set …
        1. the rate of change of temperature in ℃ per century
        2. the uncertainty in this value
        3. the coefficient of determination
        4. the root-mean-square error (if you have the ability to calculate this number)
      3. Divide the data set up into four equal intervals of roughly 378 months (31.5 years) and repeat.
      4. Compile your results in a table like the one below and comment on the manner in which temperatures have changed in New York City in this 125 year period. (Use the results of all four calculated columns in your analysis, not just the rate of temperature change.)
                 
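      | time interval           | ΔT/Δt (℃/100 y) | uncertainty (℃/100 y) | r² | rmse (℃) |
      |-------------------------|-----------------|-----------------------|----|----------|
      | overall (1881-2006)     |                 |                       |    |          |
      | 1st quarter (1881-1912) |                 |                       |    |          |
      | 2nd quarter (1912-1943) |                 |                       |    |          |
      | 3rd quarter (1944-1975) |                 |                       |    |          |
      | 4th quarter (1975-2006) |                 |                       |    |          |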
       
      Source: NASA Goddard Institute for Space Studies
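For readers whose software does not report every quantity requested in the two temperature problems, here is a minimal sketch of how the slope, its uncertainty (taken here to be the standard error of the slope), the coefficient of determination, and the root-mean-square error can be computed by hand. The function name `regression_stats` and the sample numbers are hypothetical; the numbers are made up and do not come from either temperature file. The rmse here divides by n, which is an assumption; some texts divide by n − 2 instead.

```python
import math


def regression_stats(xs, ys):
    """Least-squares line plus slope uncertainty, r2, and rmse."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_mean - m * x_mean
    # Sum of squared residuals about the fitted line
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return {
        "m": m,
        "b": b,
        "m_err": math.sqrt(sse / (n - 2) / sxx),  # standard error of the slope
        "r2": 1.0 - sse / syy,
        "rmse": math.sqrt(sse / n),  # assumption: divide by n, not n - 2
    }


# Made-up values for illustration only (not from either temperature file)
years = [0.0, 1.0, 2.0, 3.0, 4.0]
temps_c = [1.0, 3.0, 2.0, 5.0, 4.0]

stats = regression_stats(years, temps_c)
# Convert slope and uncertainty from degrees per year to degrees per century
print(100 * stats["m"], 100 * stats["m_err"], stats["r2"], stats["rmse"])
```

The factor of 100 gives ℃ per century only if the time axis is expressed in years, so monthly readings need a decimal-year time value first. The same function can then be called on each quarter of the data by slicing the two lists.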