Project-1 of “Introduction to Statistical Learning and Machine Learning”
Yanwei Fu
September 28, 2016

1 Linear Regression and Nonlinear Bases

In class we discussed fitting a linear regression model by minimizing the squared
error. This classic model is the simplest version of many more complicated
models. However, it typically performs very poorly in practice. One of the
reasons it performs poorly is that it assumes that the target $y_i$ is a linear function
of the features $x_i$ with an intercept of zero. This drawback can be addressed
by adding a bias variable and using nonlinear bases (although nonlinear bases
may lead to overfitting).
In this question, you will start with a data set where least squares performs
poorly. You will then explore how adding a bias variable and using nonlinear
(polynomial) bases can drastically improve the performance. You will also explore
how the complexity of a basis affects both the training error and the test
error. In the final part of the question, it will be up to you to design a basis
with better performance than polynomial bases.
You can use Matlab, Python, or R. However, the sample code is provided in Matlab.

1.1 Adding a Bias Variable

Download the code from the course webpage, and start Matlab in a directory containing the extracted files. If you run the script example_basis.m, it will:
1. Load a one-dimensional regression dataset.
2. Fit a least-squares linear regression model.
3. Report the training error.
4. Report the test error (on a dataset not used for training).
5. Draw a figure showing the training data and what the linear model looks
like.


Unfortunately, this is an awful model of the data. The average squared training
error on the data set is over 28000 (as is the test error), and the figure produced
by the demo confirms that the predictions are usually nowhere near the training
data.

The y-intercept of this data is clearly not zero (it looks like it’s closer to
200), so we should expect to improve performance by adding a bias variable, so
that our model is
$y_i = w^T x_i + w_0$
instead of
$y_i = w^T x_i$.

Write a new function, leastSquaresBias, that has the same input/model/predict
format as the leastSquares function, but that includes a bias variable $w_0$.
In the report, you should explain the critical sections of your new function,
and include the updated plot and the updated training/test errors.
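
As a reference point, here is a minimal Matlab sketch of one possible implementation. It assumes the input/model/predict convention of the course code (a model struct holding a weight vector and a predict handle); the exact interface of the provided leastSquares function may differ.

function [model] = leastSquaresBias(X,y)
% Least squares with an added bias (intercept) variable w0.
n = size(X,1);
Z = [ones(n,1) X];          % prepend a column of ones for the bias
model.w = (Z'*Z)\(Z'*y);    % solve the normal equations Z'Z w = Z'y
model.predict = @predict;
end

function [yhat] = predict(model,Xtest)
t = size(Xtest,1);
yhat = [ones(t,1) Xtest]*model.w;   % apply the same bias column at test time
end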

1.2 Polynomial Basis

Adding a bias variable improves the prediction substantially, but the model
is still problematic because the target seems to be a non-linear function of
the input. Write a new function, leastSquaresBasis(x,y,deg), that takes a data
vector x (i.e., assuming we only have one feature) and the polynomial order
deg. The function should perform a least squares fit based on a matrix $X_{\mathrm{poly}}$
where each of its rows contains the values $(x_i)^j$ for $j = 0$ up to deg. E.g.,
leastSquaresBasis(x,y,3) should form the matrix
$$X_{\mathrm{poly}} = \begin{bmatrix} 1 & x_1 & (x_1)^2 & (x_1)^3 \\ 1 & x_2 & (x_2)^2 & (x_2)^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & (x_n)^2 & (x_n)^3 \end{bmatrix}$$

and fit a least squares model based on it.
In the report, you should explain the critical sections of your leastSquaresBasis
function, and report the training and test error for deg = 0 through deg = 10.
Explain the effect of deg on the training error and on the test error.
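
A minimal Matlab sketch of this function, again assuming the model/predict convention above; the first entry of model.w then plays the role of the bias from Section 1.1.

function [model] = leastSquaresBasis(x,y,deg)
% Least squares fit on a polynomial basis of order deg.
Xpoly = polyBasis(x,deg);
model.w = (Xpoly'*Xpoly)\(Xpoly'*y);   % normal equations on the basis matrix
model.deg = deg;
model.predict = @predict;
end

function [yhat] = predict(model,xtest)
yhat = polyBasis(xtest,model.deg)*model.w;
end

function [Z] = polyBasis(x,deg)
% Column j+1 holds the values (x_i)^j, for j = 0,...,deg.
n = length(x);
Z = zeros(n,deg+1);
for j = 0:deg
    Z(:,j+1) = x(:).^j;
end
end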

2 Regularization

Regularization is a core idea in machine learning: regularization can significantly
reduce overfitting when we fit very complicated models. In this question, you’ll
implement a simple L2-regularized least squares model and see the dramatic
performance improvement it can bring on test data.

2.1 Preprocessing – Data standardization

The goal of standardization is to make the d input attributes of the data comparable.
If we were predicting the price of real estate in Shanghai using as input
attributes the number of rooms and the price of the property 3 years ago (d = 2),
then we would run into trouble because the number of rooms is typically in the
range 1 to 4, while the price 3 years ago is probably in the range 500,000 to
5,000,000. We need to place these two attributes on the same scale so that
applying the same regularizer to their weights makes sense.
To achieve this, we standardize the input data by ensuring it has zero mean
and unit variance. We can do this by computing
$$x_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}$$

where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ is the empirical mean of the $j$'th attribute, and $\sigma_j^2 = \frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ is the empirical variance. Here, $i = 1, \cdots, n$ denotes the index over the data and $j = 1, \cdots, d$ is the index over input attributes.
The subtraction of the mean is not necessary for datasets that are already
known to be zero mean, such as acoustic time series. Likewise, for gray-scale
images we know that each attribute (pixel) has the same range of possible values,
so we can skip the variance normalization. If the attributes do not have the
same range of values, then the variance rescaling can prove to be very useful.

For example, if the first attribute is the number of people who entered a building,
the second attribute the temperature of the building, and the third attribute the
number of rooms in the building, then these three numbers are likely not on
the same scale. Variance rescaling places them on a reasonable comparative
scale.
The parameters $\bar{x}_j$ and $\sigma_j^2$ should be considered part of the model. So at
test time, we should center with respect to $\bar{x}_j$ and scale with respect to $\sigma_j$,
rather than centering and scaling the test set with respect to the mean and
standard deviation of the test set. To see why, imagine the test set consisted of
a single example; then scaling by the test set standard deviation, instead of the
value derived from the training set, would clearly be wrong, since the standard
deviation of a single data point is not defined. All this will be made clearer
in the following section.
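
A minimal Matlab sketch of this preprocessing (the function name standardizeCols is illustrative, not part of the provided code):

function [S, mu, sigma] = standardizeCols(X)
% Standardize each column of X to zero mean and unit variance, and
% return the training statistics so they can be reused at test time.
mu = mean(X,1);
sigma = std(X,1,1);   % normalize by n, matching the empirical variance above
S = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
end

At test time, reuse mu and sigma from the training set, e.g. Stest = bsxfun(@rdivide, bsxfun(@minus, Xtest, mu), sigma).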

2.2 Load and standardize the data

Download the prostate cancer dataset from the course website. In this prostate
cancer study, 9 variables – including age, log weight, log cancer volume, etc. –
were measured for 97 patients. We will now construct a model to predict the
9th variable as a linear combination of the other 8. A description of this dataset
appears in the textbook of Hastie et al.:
“The data for this example come from a study by Stamey et al.
(1989) that examined the correlation between the level of prostate
specific antigen (PSA) and a number of clinical measures, in 97 men
who were about to receive a radical prostatectomy. The goal is
to predict the log of PSA (lpsa) from a number of measurements
including log cancer volume (lcavol), log prostate weight lweight,
age, log of benign prostatic hyperplasia amount lbph, seminal vesicle
invasion svi, log of capsular penetration lcp, Gleason score gleason,
and percent of Gleason scores 4 or 5 pgg45.”
1. First, load the data and split it into a response vector (y) and a matrix of
attributes (X);
2. Second, randomly shuffle the order of the patients in the table and then
choose the first 50 patients as the training data. The remaining patients
will be the test data.
3. Set both variables (i.e., the training set of X and y) to have zero mean
and standardize the input variables to have unit variance (a sketch of these
steps follows).
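
A possible Matlab sketch of these three steps; the file name prostate.data and the column layout (response in the 9th column) are assumptions, so adjust them to the actual file on the course website:

% Step 1: load and split into attributes X and response y.
data = load('prostate.data');      % hypothetical file name
X = data(:,1:8);
y = data(:,9);                     % the 9th variable (lpsa)

% Step 2: shuffle the patients and take the first 50 for training.
n = size(X,1);
perm = randperm(n);
Xtrain = X(perm(1:50),:);   ytrain = y(perm(1:50));
Xtest  = X(perm(51:end),:); ytest  = y(perm(51:end));

% Step 3: standardize with training-set statistics only.
[Xtrain, Xbar, Xstd] = standardizeCols(Xtrain);   % sketch from Section 2.1
ybar = mean(ytrain);
ytrain = ytrain - ybar;            % zero-mean response
Xtest = bsxfun(@rdivide, bsxfun(@minus, Xtest, Xbar), Xstd);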
Note that in the training step, we will learn the bias term (i.e., the intercept
$\theta_0$) separately, i.e., we do not regularize the bias term. Recall that the purpose of
regularization is to get rid of unwanted input attributes. Mathematically, what we are
saying is that the bias term will be computed separately as follows:

$$\hat{\theta}_0 = \bar{y} - \bar{x}^T \hat{\theta}$$

where $\bar{y}$ is the mean of the elements of the training data vector y and $\bar{x}$
is the vector of 8 means for the input attributes. Note that in this case the
8-dimensional parameter vector $\hat{\theta}$ includes all the parameters other than the
bias term that have been learned with ridge regression. That is, we first learn
$\hat{\theta}$ using standardized data and then proceed to learn $\hat{\theta}_0$.

Figure 1: Regularization path for ridge regression.
When we encounter a new input $x^\star$ in the test set, we need to standardize
it before making a prediction. The actual prediction should be:

$$\hat{y} = \bar{y} + \sum_{j=1}^{8} \frac{x^\star_j - \bar{x}_j}{\sigma_j} \hat{\theta}_j$$

where $\bar{x}_j$ and $\sigma_j$ are the mean and standard deviation of the $j$-th attribute
obtained from the training data. One reason for standardizing the inputs is that
we want them to be comparable.
If one input were much bigger than the others, we would have wanted
to apply a different regularizer to it. By standardizing the inputs first, we only
need a single scalar regularization coefficient $\delta^2$.
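
In code, this prediction might look like the following sketch, where xstar is a new 1-by-8 input and theta is the parameter vector learned on the standardized training data (all names are illustrative):

xstar_s = (xstar - Xbar) ./ Xstd;   % standardize with training statistics
yhat = ybar + xstar_s * theta;      % the prediction formula above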

2.3 Ridge regression

We will now construct a model using ridge regression to predict the 9th variable
as a linear combination of the other 8.
The ridge method is a regularized version of least squares, with objective
function:
$$\min_{\theta \in \mathbb{R}^d} \; \|y - X\theta\|_2^2 + \delta^2 \|\theta\|_2^2$$

Here $\delta^2$ is a scalar, the input matrix $X \in \mathbb{R}^{n \times d}$, the output vector $y \in \mathbb{R}^n$,
and the parameter vector $\theta \in \mathbb{R}^d$.

Figure 2: Relative error of the ridge estimator against the regularization parameter $\delta^2$.
1. Write code for ridge regression starting from the following skeleton (one
possible completion is sketched after this list):

function [theta] = ridge(X, y, d2)
???
end

Compute the ridge regression solutions for a range of regularizers ($\delta^2$).
Plot the values of each $\theta$ on the y-axis against $\delta^2$ on the x-axis. This set
of plotted values is known as a regularization path. Your plot should look
like Figure 1. Hand in your version of this plot, along with the code you
used to generate it.
2. For each computed value of $\theta$, compute the train and test error. Remember,
you will have to standardize your test data with the same means and
standard deviations as above (Xbar, Xstd, ybar) before you can make a
prediction and compute your test error.
3. Choose a value of $\delta^2$ using cross-validation. What is this value? Show
all your intermediate cross-validation steps and the criterion you used to
choose $\delta^2$. Plot the train and test errors as a function of $\delta^2$. Your plot
should look like Figure 2. Hand in your version of this plot, along with
the code used to generate it.
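
A hedged Matlab sketch of the whole experiment follows: a closed-form completion of the ridge skeleton (saved as ridge.m), the regularization path, and a simple 5-fold cross-validation. The grid of $\delta^2$ values and the number of folds are illustrative choices, not prescribed by the assignment.

% ridge.m -- one possible completion of the skeleton.
function [theta] = ridge(X, y, d2)
% Closed-form ridge solution: (X'X + d2*I) theta = X'y.
theta = (X'*X + d2*eye(size(X,2)))\(X'*y);
end

% Regularization path and train/test errors (uses Xtrain, ytrain, Xtest,
% ytest, Xbar, Xstd, ybar from Section 2.2; ytrain is centered by ybar).
d2s = 10.^(-2:0.25:4);              % illustrative grid of delta^2 values
k = numel(d2s);
thetas   = zeros(size(Xtrain,2), k);
trainErr = zeros(1,k);  testErr = zeros(1,k);
for i = 1:k
    th = ridge(Xtrain, ytrain, d2s(i));
    thetas(:,i) = th;
    trainErr(i) = mean((ytrain - Xtrain*th).^2);
    testErr(i)  = mean((ytest - (ybar + Xtest*th)).^2);
end
semilogx(d2s, thetas');             % regularization path (cf. Figure 1)
xlabel('\delta^2'); ylabel('\theta_j');

% Simple 5-fold cross-validation on the training set over the same grid.
nTrain = size(Xtrain,1);
folds  = 1 + mod((0:nTrain-1)', 5); % fold labels 1..5
cvErr  = zeros(1,k);
for i = 1:k
    for f = 1:5
        tr = (folds ~= f);  va = (folds == f);
        th = ridge(Xtrain(tr,:), ytrain(tr), d2s(i));
        cvErr(i) = cvErr(i) + mean((ytrain(va) - Xtrain(va,:)*th).^2);
    end
end
cvErr = cvErr/5;
[~, best] = min(cvErr);             % criterion: minimum average CV error
bestD2 = d2s(best);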



