Linear Regression Instructions


Linear regression

Helen Burn

2019-04-15

Activities

•Introducing linear regression

•Describing relationship patterns

•How much is explained by regression?

Learning objectives connected to linear regression

•Create scatterplots for bivariate data using graphing technology where

appropriate. Lesson: point plots

•Sensibly choose which variable should be the response and which the explanatory variable, and know when it does and doesn’t matter. (Other nomenclature for explanatory/response: predictor/response or independent/dependent.) Lesson: response and explanatory variables

–External evidence of which direction causation goes, e.g. hours worked explain total pay.

–Why are you making a prediction?

*To deduce something that would be hard to measure (or a future value) from something easy to measure.

*Hypothesis formation.

•Determine whether a straight-line model is appropriate for describing a

given relationship. Lesson: ﬂexibility

–Students can distinguish between situations where the relationship is approximately linear and where it is not. Examples: height versus age, BMI versus weight, BMI versus height (which has a crazy, upside-down, whistle-shaped cloud).

–residual, e.g. heteroscedasticity

–covariate

•Interpret the correlation coefficient in terms of sign (positive/negative/null) and strength of correlation.

•Use terms such as equation, function, model, and formula appropriately.

•Interpret the slope of the regression in terms of the relationship between an incremental change in x and the corresponding incremental change in y.

•Translate a difference in the input to the corresponding difference in the

output. (Rule of 4 from calculus reform.)

–from the graph

–from the regression coefficient b


–Effect size.

–What’s a big change in input? (A couple of SDs of x.) What’s a strong relationship? One that results in a big change in the output (e.g. an SD of y). The correlation coefficient directly expresses the translation of SDs in the input into SDs in the output.

•Identify the residual of a point given the location of the point and the regression function.

•Use the regression equation for prediction

–plug in an input to get an output

–recognize extrapolation as unsafe

–proper prediction includes the residual variation around the model.

•Use technology to find linear regression models and correlation coefficients for a pair of variables. Lesson: relationship-patterns

•Understand the pitfalls of extrapolation

•Be able to make a point plot using technology and to relate the location of

each point to the corresponding row in a data table.

•Develop an intuition for how a mathematical function can describe the

pattern in a point-plot cloud.

•Recognize settings and variables for which regression is an appropriate

technique.

•Be able to use the slope as a concise description of a relationship.

•Recognize what residuals from a regression model have to say.

•Understand how a regression model can be used for prediction.

•Insofar as the correlation coefficient is a topic in your course (and it need not be!), establish the connections between regression slopes and correlation coefficients.

Additional resources


•Instructor orientation

•Role in statistical practice

•Classroom discussion

•Assessment

•Tips for an active classroom

•Student pre-requisites

•Looking forward

•Pitfalls


Orientation for instructors

Linear regression is one of the oldest and most widely used statistical techniques. It is used to describe or model a relationship between a quantitative response variable and one or more explanatory variables.

Many, perhaps most, introductory statistics courses cover simple regression, which is a special case of linear regression in which the response variable y is modeled as a straight-line function of the explanatory variable x, that is, y = f(x) = a + bx. The slope b and intercept a constitute a concise but very limited way of describing important features of the relationship between the response and explanatory variables.
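For instructors who want to see the arithmetic behind the straight-line model, here is a minimal sketch of a least-squares fit, writing the line as y = a + bx with intercept a and slope b. The data (hours worked and total pay) are made up for illustration, and the helper name `fit_line` is our own, not part of any app or package mentioned in this document.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares, both about the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope, in y-units per x-unit
    a = y_bar - b * x_bar    # intercept, in y-units
    return a, b

# Made-up data: hours worked (x) and total pay in dollars (y).
hours = [10, 20, 30, 40]
pay = [160, 290, 460, 590]

a, b = fit_line(hours, pay)
print(f"intercept a = {a:.1f} dollars, slope b = {b:.2f} dollars per hour")
```

Note that the slope comes out in dollars per hour, which is the unit-ful reading of b emphasized in the rest of these notes.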

Role in statistical practice

It’s fair to say that simple regression is too simple to support contemporary

research and has been for some decades. It is uncommon for there to be just

a single explanatory variable. A more general technique, multiple regression,

supports the use of multiple explanatory variables.

Conceptual pitfalls

There are many potential pitfalls in teaching about simple regression. One has to do with nomenclature. Mathematicians describe a and b as “coefficients” or “parameters.” In statistics, the meaning of “parameter” is different (referring to a population) and the values of a and b generated by regression are “statistics” (referring to a sample from the population). And a “coefficient” in a formula like a + bx is not particularly similar to a “correlation coefficient.”

Usually, the slope parameter b is the quantity of interest. The slope parameter is not, in general, a pure number. Instead, it is a quantity expressed in units. Modeling spending versus age? Then b will have units like dollars per year.

Many instructors are tempted to use Greek-like notation in teaching regression. If you’re going to use sophisticated mathematical notation to convey concepts, you are assuming your students know something about that notation. This might include:

•Greek letters and their Roman equivalents, e.g. distinguishing among β, B, and b, or between µ and m, and remembering that µ is not cognate to u.

•The different meanings of subscripts and superscripts, e.g. the distinct meanings of $\beta^2$ (exponentiation) and $\beta_2$ (identifying one in a series).

•The various (inconsistent and sometimes contradictory) notations for distinguishing between estimates and population parameters:

–Parameters: β, µ, σ, and informally b, m, s

–Estimates: $\hat{\beta}$, b, $\hat{b}$, $\hat{\mu}$, m, $\bar{m}$, s, $\hat{\sigma}$

It’s unlikely that you intend for your students to have to deal with such complexity, so try to keep the notation as simple as possible. We suggest:


•b: the slope of the regression line as estimated from data

•$R^2$: the coefficient of determination

•r: the correlation coefficient

•$s_x$ and $s_y$: the standard deviations of the x and y variables

The correlation coefficient is a pure number that combines three pieces of information: b and the standard deviations $s_x$ and $s_y$ of the x and y variables. The relationship is

$$r = b \, \frac{s_x}{s_y}$$

Note that $s_y$ has the same units as y and $s_x$ the same units as x. Thus, the ratio $s_x/s_y$ cancels out the units of b.
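The relationship between the slope and the correlation coefficient, r = b · s_x / s_y, can be checked numerically. The sketch below uses made-up data and only the Python standard library; it computes r once from the slope and the two standard deviations, and once from the usual sum-of-products formula, so the two can be compared.

```python
from statistics import mean, stdev

# Made-up data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b = sxy / sxx                        # slope: y-units per x-unit
s_x, s_y = stdev(x), stdev(y)        # sample standard deviations

r_from_slope = b * s_x / s_y         # units cancel, leaving a pure number
r_direct = sxy / (sxx * syy) ** 0.5  # usual definition of r

print(f"r from slope: {r_from_slope:.6f}, r directly: {r_direct:.6f}")
```

The two values agree up to rounding error, which is a quick way to convince students that r carries no information beyond b and the two standard deviations.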

In multiple regression, it makes sense to describe a model using unit-ful coefficients like b, but there is no equivalent to the relationship between r and b in simple regression. Given the importance of multiple regression, it seems sensible to teach simple regression in terms of the unit-ful coefficient b rather than the unitless r.

Almost all statistics textbooks present r as a means to quantify the “strength” of the relationship between two quantitative variables. It is that, but it is equally applicable to situations where one or both of the variables are binary, for instance yes/no or win/lose or A/B.

The analog to r in multiple regression is $\sqrt{R^2}$, where $R^2$, the coefficient of determination, represents the fraction of the variance in the response variable that is captured by the model. $R^2$ (“R-squared”) is an important summary description of a model. It makes sense, then, to prepare students for $R^2$ by using it as a descriptive statistic even in simple regression. You might be tempted to refer to this as $r^2$, but do recall that $R^2$ is a more generally applicable statistic that encompasses the special case of $r^2$ in simple regression.
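To make “fraction of the variance captured by the model” concrete, the following sketch computes R² for a simple regression as the variance of the fitted values divided by the variance of the response, and compares it with the square of the correlation coefficient. The data are made up, and the variable names are our own.

```python
from statistics import mean, pvariance

# Made-up data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 2.8, 4.6, 4.9, 6.5]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b = sxy / sxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Fraction of the variance in y captured by the model values.
r_squared = pvariance(fitted) / pvariance(y)
# Equivalent form: one minus the fraction left in the residuals.
r_squared_alt = 1 - pvariance(residuals) / pvariance(y)
# Correlation coefficient, for comparison.
r = sxy / (sxx * syy) ** 0.5

print(f"R^2 = {r_squared:.4f}, 1 - resid/total = {r_squared_alt:.4f}, r^2 = {r * r:.4f}")
```

All three numbers coincide for simple regression. Only the first two forms carry over to models with several explanatory variables.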

When we use coefficients like b to quantify a relationship, we set up an interpretation of b as a kind of translation factor from x units to y units. That is, a one-unit increase in x is associated with a b-unit increase in y.

Sometimes simple regression is presented as a way to predict a value of y given the value of x. This use is seriously misleading. A proper prediction should not be in the form of a single number, but a probability assigned to each possible outcome. In the case of simple regression, a meaningful prediction is that the output y has a normal distribution with mean $a + bx$ and a standard deviation corresponding roughly to the standard deviation of the residuals, that is, of the differences between the y-values and the corresponding model values.
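A prediction in the form of a distribution can be sketched in a few lines. The example below fits a line to made-up data, estimates the residual standard deviation, and returns a normal distribution centered on a + bx. It is a rough illustration only: it ignores the uncertainty in a and b themselves, and the function name `predict` is our own.

```python
from math import sqrt
from statistics import NormalDist, mean

# Made-up data, purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 2.9, 4.4, 4.8, 6.1, 6.5]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx
a = y_bar - b * x_bar

# Residual standard deviation (n - 2 because two coefficients were fitted).
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
resid_sd = sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))

def predict(x_new):
    """Prediction as a distribution: normal, centered on the model value."""
    return NormalDist(mu=a + b * x_new, sigma=resid_sd)

prediction = predict(3.5)
low, high = prediction.inv_cdf(0.025), prediction.inv_cdf(0.975)
print(f"model value {prediction.mean:.2f}; central 95% range {low:.2f} to {high:.2f}")
```

Reporting the range alongside the model value keeps the residual variation in view, which is the point of the paragraph above.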

$R^2$ (or $r^2$, if you insist) has a central role in statistical inference. The ratio

$$F = (n-1)\,\frac{R^2}{1-R^2}$$

is an informative quantity with respect to p-values and confidence intervals.


For the p-value, when n is large, an F of 3.84 corresponds to p = 0.05, an F of around 7 corresponds to p = 0.01, and an F of 12 to p = 0.001. (You can read off the p-value by looking up the quantile of F in the F distribution with 1 and n − 1 degrees of freedom.)

The 95% confidence interval on b is

$$\mathrm{CI}_{95} = b\left(1 \pm \sqrt{3.84/F}\right)$$

Note that the t-statistic on b is simply $t = \sqrt{F}$. A reason to use F instead of t is that F generalizes to multiple regression while t does not.
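The two rules of thumb above, F = (n − 1) R² / (1 − R²) and CI = b (1 ± √(3.84 / F)), are easy to turn into a small calculator. The sketch below does exactly that with hypothetical inputs; the function names are our own. Exact treatments of simple regression use n − 2 where this rule of thumb uses n − 1, and a t quantile in place of 1.96 (whose square is 3.84), so expect small differences from software output when n is small.

```python
from math import sqrt

def f_statistic(r_squared, n):
    """Rule-of-thumb F from R^2 and sample size n, as given in the text."""
    return (n - 1) * r_squared / (1 - r_squared)

def ci95_slope(b, f):
    """Approximate 95% confidence interval on the slope: b * (1 +/- sqrt(3.84 / F))."""
    half_width = abs(b) * sqrt(3.84 / f)
    return b - half_width, b + half_width

# Hypothetical summary numbers, not taken from any real data set.
n, r_squared, b = 100, 0.25, 2.0

f = f_statistic(r_squared, n)
low, high = ci95_slope(b, f)
print(f"F = {f:.1f}; 95% CI on b: {low:.2f} to {high:.2f}")
```

When F is exactly 3.84, the half-width equals |b| and the interval just touches zero, which matches the p = 0.05 threshold quoted above.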

The F statistic also generalizes to nonlinear formulas y = f(x). Roughly speaking, for a quadratic-shaped model, the n − 1 term in F should be replaced by n − 2.

Student pre-requisites

Students will need some background knowledge in order to follow lessons on

simple regression.

•Variable types: quantitative and categorical Lesson: variable types

•Point plot: (The term “scatter plot” has traditionally been used.) Lesson:

point plots

–each axis corresponds to a variable

–each row is one dot.

•Mathematical functions:

–translate a given input to an output by plugging the input into an arithmetic formula

–in writing the formula, we often use symbols, like m and b, to represent quantities.

–the straight-line function

*slope (primary importance here)

*intercept

•Understand distinctions between various reasons for examining relationships. Lesson: response and explanatory variables

–to make a prediction of the unknown value of a variable given the known

values of other variables

–to anticipate the result of an intervention (This is a form of prediction

that assumes a speciﬁc causal relationship)

–to demonstrate that two variables are connected in some way.

–to explore data in order to frame hypotheses about how the system

works.

•Standard deviations if using r. This is not central if focusing on slope and

intercept.


Creating an active classroom

See the document on general tips for creating an active classroom.

Some speciﬁc discussion topics/themes for linear regression:

1. BMI (from NHANES2) as a response variable. It’s important for students to know what this is. Explanation from the CDC & BMI calculator for students.

•age (r = 0.5; a reasonable scatterplot for assuming linearity)

•income (r = -0.07): shows a very diffuse scatter plot but also helps demo the app to students.

•pulse: weak relationship

•systolic: weak-to-moderate relationship

•diastolic: has outliers

•sleep_hour: weak-to-moderate, but negative, relationship

2. wage (from CPS85)

•age

•education

3. mother’s age (from Births_2014)

•father’s age. Moderate-size correlation. Ask what it means.

4. Open-ended exploring

5. Consider systolic blood pressure from the NHANES2 data.

•Background: Explain to students the difference between systolic and diastolic blood pressure. Each time the heart beats, the blood pressure in the arteries goes up. It quickly rises to a maximum and then decays until the next beat. Systolic is the maximum blood pressure in each beat, diastolic the minimum. The “pulse pressure” is the difference between the two. See this site on blood pressure.

•Tasks

1. Determine three explanatory variables that are predictive of systolic blood pressure.

2. For each of the three, list the strength of the relationship both as a fraction of the variation explained and as the change in systolic blood pressure per unit change of the explanatory variable.

3. Then check whether those three explanatory variables explain diastolic blood pressure as well. Which of systolic or diastolic blood pressure is better explained by the explanatory variables?

6. Diamonds: similar to the above, but predict the price of a diamond.

Assessment items

•Point plot and functions. We’ll ask students to sketch out some functions from prior knowledge (e.g. height versus age) and then indicate the range of values around the function. Then turn this around so that they deduce the function and range of residuals from the point plot.

•Explanatory vs response variable: prediction versus intervention vs description vs hypothesis formation.

•From data to function.

•Slopes and differences.

–Don’t use y = mx + b except as a reminder of what a slope is. Instead ...

*read the slope off a graph. Don’t worry about the intercept.

*read the slope off a regression report.

*“the effect size of x on y”????

–Differences: if the input changes, how much does the output change?

•With the app: Can we predict something hard to measure from something easy to measure?

–systolic blood pressure from height?

–income from BMI

•With the app: f(x) is not destiny. Predict BMI from education. The averages

differ, but there is a big range around the line. Can’t predict for an individual,

could say something about averages in a group.

•With the app: How much variation is explained?

Looking forward

Understand the different settings in which regression is used in practice. A good topic for discussion in the workshop. Use examples from the different settings:

•causation

•classification

•exploration: what might explain body mass index?

Defining “big” in terms of the individual variables, e.g. a couple of standard deviations. This relates to the discussion of “interpreting slope.”

A commonly used trichotomy for describing relationships between two variables is “negative” vs “zero”/“none” vs “positive”. In the context of simple regression, these correspond to the sign of the slope b. This can be misleading, since a zero value of b can occur even when there is a strong (nonlinear) relationship between y and x.
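A tiny constructed example shows how a zero slope can coexist with a perfect nonlinear relationship. In the sketch below, y is exactly x squared and the x values are symmetric about zero, so the least-squares slope comes out as zero even though y is completely determined by x. The data are invented for this purpose.

```python
from statistics import mean

# Constructed data: y is an exact (nonlinear) function of x.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

b = sxy / sxx  # least-squares slope of the straight-line model
print(f"slope b = {b}")
```

A point plot of these data makes the relationship obvious, which is a good argument for looking at the plot before summarizing with b or r.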

The slope b is a physical quantity that has dimension and units. For instance, if y is a person’s height in cm and x is a person’s weight in kg, the units of b will be cm/kg. (The “dimension” of this is L/M: length over mass.) Many mathematical educators prefer to de-emphasize physical units, preferring to regard b as a pure number. This is a mistake from a statistical point of view. The size of physical quantities is important. Interpreting b as large or small needs to be understood in the context of the problem.

The correlation coefficient r is a scaled version of b. The scaling is by the ratio of the standard deviations of the x and y variables, that is, $r = \frac{\sigma_x}{\sigma_y}\, b$. This scaling results in r being a pure number since the units of $\sigma_x/\sigma_y$ cancel out the units of b.
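One way to show students that b carries units while r does not is to change the units of x and refit. The sketch below uses made-up heights (cm) and weights (kg), then re-expresses weight in grams: the slope shrinks by a factor of 1000, while r is unchanged. The helper name `slope_and_r` is our own.

```python
from statistics import mean

def slope_and_r(x, y):
    """Return the least-squares slope b and the correlation coefficient r."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

# Made-up data: weight in kg (x) and height in cm (y).
weight_kg = [55.0, 62.0, 70.0, 81.0, 90.0]
height_cm = [160.0, 168.0, 171.0, 180.0, 183.0]

b_kg, r_kg = slope_and_r(weight_kg, height_cm)   # slope in cm per kg

# Same people, weight now recorded in grams.
weight_g = [1000.0 * w for w in weight_kg]
b_g, r_g = slope_and_r(weight_g, height_cm)      # slope in cm per g

print(f"b = {b_kg:.4f} cm/kg, b = {b_g:.7f} cm/g")
print(f"r = {r_kg:.4f} (kg), r = {r_g:.4f} (g)")
```

Whether 0.6 cm/kg or 0.0006 cm/g is “large” is a question about the problem context, not about the number itself.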

The slope b can be any numerical quantity. In contrast, the correlation coefficient must always satisfy −1 ≤ r ≤ 1. Many mathematics educators believe that this means that r describes the “strength” of the relationship between y and x. Whether or not this is true depends on what one means by “strength.” In scientific research, the intuition behind strength corresponds better to the slope b and includes the physical units of b. In statistics, when “strength” is taken to refer to how compelling the evidence is for a claim, an appropriate measure is the confidence interval on b. Another statistical quantity, the p-value on the slope, quantifies the evidence for a particular but very weak sort of claim: that b is anything but zero.

Although students are often drilled in the fact that −1 ≤ r ≤ 1, the reason why r is bounded in this way is subtle. It’s misleading to conclude that the bounds on r suggest that a “strong” relationship is one where |r| ≈ 1.

The correlation coefficient r predates the distinction between descriptive and inferential statistics and mixes together aspects of both. This leads to pedagogical challenges that could be avoided if relationships are described using b and inferences made using the confidence interval on b.

•Too much is made of the “optimality” of the estimates of the slope and

intercept. See the sum of squares Little App.

•Categorical explanatory variables can also be used. ANOVA is a general

procedure in linear regression. Almost every statistical method covered in

intro stats – proportions, differences in proportions, means, differences in

means, ANOVA – can be presented quite naturally as a linear regression

problem.

•Robust statistical methods are available to deal automatically with outliers,

without having to handle them as special cases.

•r is meaningless in multiple regression. $R^2$ is more general.

•Although y and x are conventional names given to the variables involved when discussing statistical and mathematical theory,
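The point in the list above about categorical explanatory variables can be illustrated with a two-group comparison. In the sketch below, group membership is coded as a 0/1 indicator (an assumption of this example, with made-up response values). Fitting a straight line to the indicator gives an intercept equal to the mean of group 0 and a slope equal to the difference in group means.

```python
from statistics import mean

# Made-up data: group indicator (0 or 1) and a quantitative response.
group = [0, 0, 0, 0, 1, 1, 1, 1]
y = [4.0, 5.0, 6.0, 5.0, 7.0, 8.0, 9.0, 8.0]

g_bar, y_bar = mean(group), mean(y)
sxy = sum((gi - g_bar) * (yi - y_bar) for gi, yi in zip(group, y))
sxx = sum((gi - g_bar) ** 2 for gi in group)

b = sxy / sxx          # slope on the indicator
a = y_bar - b * g_bar  # intercept

mean_0 = mean(yi for gi, yi in zip(group, y) if gi == 0)
mean_1 = mean(yi for gi, yi in zip(group, y) if gi == 1)

print(f"intercept a = {a}, mean of group 0 = {mean_0}")
print(f"slope b = {b}, difference in means = {mean_1 - mean_0}")
```

Seen this way, the familiar “difference in means” is the slope of a regression, so the same ideas about b and its confidence interval apply.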
