Project-1 of "Introduction to Statistical Learning and Machine Learning"
Yanwei Fu
September 28, 2016

1 Linear Regression and Nonlinear Bases

In class we discussed fitting a linear regression model by minimizing the squared error. This classic model is the simplest version of many more complicated models, but it typically performs very poorly in practice. One reason it performs poorly is that it assumes the target $y_i$ is a linear function of the features $x_i$ with an intercept of zero. This drawback can be addressed by adding a bias variable and by using nonlinear bases (although nonlinear bases may lead to overfitting). In this question, you will start with a data set on which least squares performs poorly. You will then explore how adding a bias variable and using nonlinear (polynomial) bases can drastically improve performance. You will also explore how the complexity of a basis affects both the training error and the test error. In the final part of the question, it will be up to you to design a basis with better performance than polynomial bases. You can use Matlab, Python, or R; however, the sample code is provided in Matlab.

1.1 Adding a Bias Variable

Download the code from the course webpage, and start Matlab in a directory containing the extracted files. If you run the script example_basis.m, it will:

1. Load a one-dimensional regression dataset.
2. Fit a least-squares linear regression model.
3. Report the training error.
4. Report the test error (on a dataset not used for training).
5. Draw a figure showing the training data and what the linear model looks like.

Unfortunately, this is an awful model of the data. The average squared training error on the data set is over 28000 (as is the test error), and the figure produced by the demo confirms that the predictions are usually nowhere near the training data. The y-intercept of this data is clearly not zero (it looks like it is closer to 200), so we should expect to improve performance by adding a bias variable, so that our model is
\[ y_i = w^\top x_i + w_0 \]
instead of
\[ y_i = w^\top x_i. \]
Write a new function, leastSquaresBias, that has the same input/model/predict format as the leastSquares function, but that includes a bias variable $w_0$ (a sketch appears after Section 1.2). In the report, you should explain the critical sections of your new function, show the updated plot, and report the updated training/test error.

1.2 Polynomial Basis

Adding a bias variable improves the prediction substantially, but the model is still problematic because the target seems to be a non-linear function of the input. Write a new function, leastSquaresBasis(x,y,deg), that takes a data vector x (i.e., assuming we only have one feature) and the polynomial order deg. The function should perform a least squares fit based on a matrix $X_{\mathrm{poly}}$ in which each row contains the values $(x_i)^j$ for $j = 0$ up to deg. E.g., leastSquaresBasis(x,y,3) should form the matrix
\[
X_{\mathrm{poly}} =
\begin{bmatrix}
1 & x_1 & (x_1)^2 & (x_1)^3 \\
1 & x_2 & (x_2)^2 & (x_2)^3 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_n & (x_n)^2 & (x_n)^3
\end{bmatrix}
\]
and fit a least squares model based on it. In the report, you should explain the critical sections of your leastSquaresBasis(x,y,3) function and report the training and test error for deg = 0 through deg = 10. Explain the effect of deg on the training error and on the test error.
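As a starting point for Section 1.1, here is a minimal MATLAB sketch of leastSquaresBias. It assumes that leastSquares returns a model struct holding a weight vector and a predict function handle; the actual interface of the provided code may differ, so adapt accordingly.

    function [model] = leastSquaresBias(X, y)
        % Append a column of ones so that w0 is learned as an extra weight.
        n = size(X, 1);
        Xb = [ones(n, 1), X];
        % Solve the normal equations (Xb'Xb) w = Xb'y.
        model.w = (Xb' * Xb) \ (Xb' * y);
        model.predict = @predict;
    end

    function [yhat] = predict(model, Xhat)
        % Apply the same bias augmentation to new inputs before predicting.
        yhat = [ones(size(Xhat, 1), 1), Xhat] * model.w;
    end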
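Similarly, for Section 1.2, a possible sketch of leastSquaresBasis follows. The helper polyBasis (a name of our own) builds the matrix $X_{\mathrm{poly}}$ shown above and is reused at prediction time so that test inputs pass through the same transformation.

    function [model] = leastSquaresBasis(x, y, deg)
        Xpoly = polyBasis(x, deg);
        model.deg = deg;
        model.w = (Xpoly' * Xpoly) \ (Xpoly' * y);  % least squares fit
        model.predict = @predict;
    end

    function [yhat] = predict(model, xhat)
        yhat = polyBasis(xhat, model.deg) * model.w;
    end

    function [Xpoly] = polyBasis(x, deg)
        % n-by-(deg+1) matrix whose (j+1)-th column is x.^j, for j = 0..deg.
        n = numel(x);
        Xpoly = zeros(n, deg + 1);
        for j = 0:deg
            Xpoly(:, j + 1) = x(:) .^ j;
        end
    end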
2 Regularization

Regularization is a core idea in machine learning: it can significantly reduce overfitting when we fit very complicated models. In this question, you will implement a simple L2-regularized least squares model and see the dramatic performance improvement it can bring on test data.

2.1 Preprocessing – Data standardization

The goal of standardization is to make the $d$ input attributes of the data comparable. If we were predicting the price of real estate in Shanghai using as input attributes the number of rooms and the price of the property 3 years ago ($d = 2$), then we would run into trouble because the number of rooms is typically in the range 1 to 4, while the price 3 years ago is probably in the range 500,000 to 5,000,000. We need to place these two attributes on the same scale so that applying the same regularizer to their weights makes sense. To achieve this, we standardize the input data so that it has zero mean and unit variance. We do this by computing
\[ x_{ij} \leftarrow \frac{x_{ij} - \bar{x}_j}{\sigma_j}, \]
where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ is the empirical mean of the $j$'th attribute and $\sigma_j^2 = \frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ is the empirical variance. Here, $i = 1, \dots, n$ indexes the data and $j = 1, \dots, d$ indexes the input attributes.

The subtraction of the mean is not necessary for datasets that are already known to be zero mean, such as acoustic time series. Likewise, for gray-scale images we know that each attribute (pixel) has the same range of possible values, so we can skip the variance normalization. If the attributes do not have the same range of values, then the variance rescaling can prove very useful. For example, if the first attribute is the number of people who entered a building, the second the temperature of the building, and the third the number of rooms in the building, then these three numbers are likely not on the same scale, and variance rescaling places them on a reasonable comparative scale.

The parameters $\bar{x}_j$ and $\sigma_j^2$ should be considered part of the model. So at test time, we should center with respect to $\bar{x}_j$ and scale with respect to $\sigma_j$ as computed on the training set, rather than centering and scaling the test set with respect to its own mean and standard deviation. To see why, imagine the test set consisted of a single example; then scaling by the test set standard deviation, instead of the value derived from the training set, would clearly be wrong, since the standard deviation of a single data point is not defined. All this will be made clearer in the following section.
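In MATLAB, this standardization might look like the sketch below; the variable names Xbar and Xstd are our own, and std(X, 1, 1) normalizes by $n$, matching the definition of $\sigma_j^2$ above.

    Xbar = mean(X, 1);     % 1-by-d vector of empirical means
    Xstd = std(X, 1, 1);   % 1-by-d standard deviations (1/n normalization)
    % Subtract the mean and divide by the standard deviation, column-wise.
    X = bsxfun(@rdivide, bsxfun(@minus, X, Xbar), Xstd);

Keeping Xbar and Xstd around is essential: they are part of the model and must be reused on the test data.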
2.2 Load and standardize the data

Download the prostate cancer dataset from the course website. In this prostate cancer study, 9 variables, including age, log weight, log cancer volume, etc., were measured for 97 patients. We will now construct a model to predict the 9th variable as a linear combination of the other 8. A description of this dataset appears in the textbook of Hastie et al.:

"The data for this example come from a study by Stamey et al. (1989) that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures, in 97 men who were about to receive a radical prostatectomy. The goal is to predict the log of PSA (lpsa) from a number of measurements including log cancer volume (lcavol), log prostate weight (lweight), age, log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45)."

1. First, load the data and split it into a response vector $y$ and a matrix of attributes $X$.
2. Second, randomly shuffle the order of the patients in the table and then choose the first 50 patients as the training data. The remaining patients will be the test data.
3. Set both variables (i.e., the training set of $X$ and $y$) to have zero mean, and standardize the input variables to have unit variance. (A sketch of these steps appears after Section 2.3.)

Note that in the training step we will learn the bias term (i.e., the intercept $\theta_0$) separately, i.e., we do not regularize the bias term. Recall that the purpose of regularization is to get rid of unwanted input attributes. Mathematically, what we are saying is that the bias term will be computed separately as
\[ \theta_0 = \bar{y} - \bar{x}^\top \hat{\theta}, \]
where $\bar{y}$ is the mean of the elements of the training data vector $y$ and $\bar{x}$ is the vector of 8 means of the input attributes. Note that in this case the 8-dimensional parameter vector $\hat{\theta}$ includes all the parameters other than the bias term and is learned with ridge regression. That is, we first learn $\hat{\theta}$ using standardized data and then proceed to learn $\theta_0$. When we encounter a new input $x^\star$ in the test set, we need to standardize it before making a prediction. The actual prediction should be
\[ \hat{y} = \bar{y} + \sum_{j=1}^{8} \frac{x^\star_j - \bar{x}_j}{\sigma_j}\, \hat{\theta}_j, \]
where $\bar{x}_j$ and $\sigma_j$ are the mean and standard deviation of the $j$'th attribute obtained from the training data. One reason for standardizing the inputs is that we want them to be comparable. Had one input been much bigger than the others, we would have wanted to apply a different regularizer to it. By standardizing the inputs first, we only need a single scalar regularization coefficient $\delta^2$.

Figure 1: Regularization path for ridge regression. (figure not shown)

2.3 Ridge regression

We will now construct a model using ridge regression to predict the 9th variable as a linear combination of the other 8. The ridge method is a regularized version of least squares, with objective function
\[ \min_{\theta \in \mathbb{R}^d} \; \| y - X\theta \|_2^2 + \delta^2 \| \theta \|_2^2. \]
Here $\delta^2$ is a scalar, the input matrix is $X \in \mathbb{R}^{n \times d}$, the output vector is $y \in \mathbb{R}^n$, and the parameter vector is $\theta \in \mathbb{R}^d$.

Figure 2: Relative error of the ridge estimator against the regularization parameter $\delta^2$. (figure not shown)

1. Write code for ridge regression starting from the following skeleton:

    function [theta] = ridge(X, y, d2)
    % ???
    end

Compute the ridge regression solutions for a range of regularizers $\delta^2$. Plot the values of each $\theta$ on the y-axis against $\delta^2$ on the x-axis. This set of plotted values is known as a regularization path. Your plot should look like Figure 1. Hand in your version of this plot, along with the code you used to generate it.

2. For each computed value of $\theta$, compute the train and test error. Remember, you will have to standardize your test data with the same means and standard deviations as above (Xbar, Xstd, ybar) before you can make a prediction and compute your test error.

3. Choose a value of $\delta^2$ using cross-validation. What is this value? Show all your intermediate cross-validation steps and the criterion you used to choose $\delta^2$. Plot the train and test errors as a function of $\delta^2$. Your plot should look like Figure 2. Hand in your version of this plot, along with the code used to generate it.
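For the data preparation of Section 2.2, a possible sketch follows. It assumes the dataset has already been loaded into a 97-by-9 matrix named data (a hypothetical variable; the loading call depends on the file format the course website provides).

    data = data(randperm(size(data, 1)), :);   % step 2: shuffle the patients
    X = data(:, 1:8);                          % step 1: 8 input attributes
    y = data(:, 9);                            % step 1: 9th variable as response
    Xtrain = X(1:50, :);    ytrain = y(1:50);  % first 50 patients for training
    Xtest  = X(51:end, :);  ytest  = y(51:end);

    % Step 3: standardize using training-set statistics only.
    Xbar = mean(Xtrain, 1);  Xstd = std(Xtrain, 1, 1);  ybar = mean(ytrain);
    Xtrain = bsxfun(@rdivide, bsxfun(@minus, Xtrain, Xbar), Xstd);
    ytrain = ytrain - ybar;                    % center the response
    % With zero-mean standardized inputs, the unregularized bias reduces to
    % theta0 = ybar, consistent with theta0 = ybar - xbar' * theta above.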
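For the skeleton in Section 2.3, setting the gradient of the objective to zero, $2X^\top(X\theta - y) + 2\delta^2\theta = 0$, gives the closed-form solution $\theta = (X^\top X + \delta^2 I)^{-1} X^\top y$, so a minimal completion of the skeleton is:

    function [theta] = ridge(X, y, d2)
        d = size(X, 2);
        % Closed-form ridge solution: solve (X'X + d2*I) theta = X'y.
        theta = (X' * X + d2 * eye(d)) \ (X' * y);
    end

The regularization path of task 1 can then be traced by sweeping $\delta^2$ over a grid (the grid below is our choice, not the assignment's):

    d2grid = logspace(-2, 4, 50);            % grid of delta^2 values
    thetas = zeros(size(Xtrain, 2), numel(d2grid));
    for k = 1:numel(d2grid)
        thetas(:, k) = ridge(Xtrain, ytrain, d2grid(k));
    end
    semilogx(d2grid, thetas');               % one curve per theta_j, as in Figure 1
    xlabel('\delta^2');  ylabel('\theta_j');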
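For tasks 2 and 3, one possible organization is sketched below. The 5-fold split and the mean-squared-error criterion are our assumptions, and for simplicity the folds reuse the full training-set standardization (strictly, one could re-standardize within each fold).

    % Task 2: train and test error for each theta on the grid.
    Xtest_s = bsxfun(@rdivide, bsxfun(@minus, Xtest, Xbar), Xstd);
    trainErr = zeros(size(d2grid));  testErr = zeros(size(d2grid));
    for k = 1:numel(d2grid)
        th = thetas(:, k);
        trainErr(k) = mean((ytrain - Xtrain * th) .^ 2);
        testErr(k)  = mean((ytest - (ybar + Xtest_s * th)) .^ 2);
    end

    % Task 3: 5-fold cross-validation on the 50 training patients.
    nfolds = 5;
    folds = 1 + mod(0:size(Xtrain, 1) - 1, nfolds);   % fold label per example
    cvErr = zeros(size(d2grid));
    for k = 1:numel(d2grid)
        for f = 1:nfolds
            th = ridge(Xtrain(folds ~= f, :), ytrain(folds ~= f), d2grid(k));
            res = ytrain(folds == f) - Xtrain(folds == f, :) * th;
            cvErr(k) = cvErr(k) + mean(res .^ 2) / nfolds;
        end
    end
    [~, best] = min(cvErr);
    bestD2 = d2grid(best);   % chosen value of delta^2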