Week 03 Polynomial Regression Assignment Instructions


Regression Week 3: Polynomial Regression Quiz
In this notebook you will compare different regression models in order to assess which model fits best. We will be
using polynomial regression as a means to examine this topic. In particular you will:
Write a function that takes an array and a degree and returns a data frame in which each column is the array
raised to an integer power, from 1 up to the given degree.
Use a plotting tool (e.g. matplotlib) to visualize polynomial regressions
Use a plotting tool (e.g. matplotlib) to visualize the same polynomial degree on different subsets of the data
Use a validation set to select a polynomial degree
Assess the final fit using test data
If you are doing the assignment with IPython Notebook
An IPython Notebook has been provided below for this quiz. It contains the instructions, quiz questions, and
partially-completed code for you to use, as well as some cells to test your code.
What you need to download
If you are using GraphLab Create:
Download the King County House Sales data in SFrame format: kc_house_data.gl.zip
Download the companion IPython Notebook: week-3-polynomial-regression-assignment-blank.ipynb
Save both of these files in the same directory (where you are calling IPython notebook from) and unzip the data
file.
If you are not using GraphLab Create:
Download the King County House Sales data csv file: kc_house_data.csv
NOTE: The following files are different from weeks 1 and 2
Download the King County House Sales training data csv file: wk3_kc_house_train_data.csv
Download the King County House Sales validation data csv file: wk3_kc_house_valid_data.csv
Download the King County House Sales testing data csv file: wk3_kc_house_test_data.csv
Download the King County House Sales subset 1 data csv file: wk3_kc_house_set_1_data.csv
Download the King County House Sales subset 2 data csv file: wk3_kc_house_set_2_data.csv
Download the King County House Sales subset 3 data csv file: wk3_kc_house_set_3_data.csv
Download the King County House Sales subset 4 data csv file: wk3_kc_house_set_4_data.csv
IMPORTANT: use the following types for columns when importing the csv files. Otherwise, they may not
be imported correctly: [str, str, float, float, float, float, int, str, int, int, int, int, int, int, int, int, str, float, float,
float, float]. If your tool of choice requires a dictionary of types for importing csv files (e.g. Pandas), use:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,
              'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,
              'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int,
              'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str,
              'sqft_lot':int, 'view':int}
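For those using Pandas, a minimal sketch of the import step might look as follows; the variable names train_data, valid_data and test_data are just illustrative, and dtype_dict is the dictionary defined above:
import pandas

# Passing dtype= forces each column to the intended type at import time.
sales = pandas.read_csv('kc_house_data.csv', dtype=dtype_dict)
train_data = pandas.read_csv('wk3_kc_house_train_data.csv', dtype=dtype_dict)
valid_data = pandas.read_csv('wk3_kc_house_valid_data.csv', dtype=dtype_dict)
test_data = pandas.read_csv('wk3_kc_house_test_data.csv', dtype=dtype_dict)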
Useful resources
You may need to install the software tools or use the free Amazon EC2 machine. Instructions for both options are
provided in the reading for Module 1.
If you are following the IPython Notebook and/or are new to numpy then you might find the following tutorial
helpful: numpy-tutorial.ipynb
If instead you are using other tools to do your homework
You are welcome, however, to write your own code and use any other libraries, like Pandas or R, to help you in the
process. If you would like to take this path, follow the instructions below.
1. You’re going to write a function that adds powers of a feature to columns of a data frame. For those using
SFrames:
Recall that if we have an SArray ‘tmp’ we can get a new SArray with all the values to the third power with:
tmp_cubed = tmp.apply(lambda x: x**3)
We can create an empty SFrame with:
my_SFrame = graphlab.SFrame()
And append the tmp to it with:
my_SFrame['power_1'] = tmp
Here ‘power_1’ refers to the power to which our feature was raised.
2. Write your own function called ‘polynomial_sframe’ (or otherwise) which accepts an array ‘feature’ and a maximal
‘degree’ and returns a data frame (e.g. SFrame) whose first column equals ‘feature’ and whose remaining columns
equal ‘feature’ raised to increasing integer powers up to ‘degree’.
e.g. if you’re using SFrames, you can complete the following function:
def polynomial_sframe(feature, degree):
    # assume that degree >= 1
    # initialize the SFrame:
    poly_sframe = graphlab.SFrame()
    # and set poly_sframe['power_1'] equal to the passed feature
    ...
    # first check if degree > 1
    if degree > 1:
        # then loop over the remaining degrees:
        for power in range(2, degree+1):
            # first we'll give the column a name:
            name = 'power_' + str(power)
            # assign poly_sframe[name] to be feature^power
            ...
    return poly_sframe
e.g. if you’re using Pandas, you can complete the following function:
def polynomial_dataframe(feature, degree):
    # feature is pandas.Series type
    # assume that degree >= 1
    # initialize the dataframe:
    poly_dataframe = pandas.DataFrame()
    # and set poly_dataframe['power_1'] equal to the passed feature
    ...
    # first check if degree > 1
    if degree > 1:
        # then loop over the remaining degrees:
        for power in range(2, degree+1):
            # first we'll give the column a name:
            name = 'power_' + str(power)
            # assign poly_dataframe[name] to be feature^power; use apply(*)
            ...
    return poly_dataframe
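Once the blanks are filled in, a quick sanity check of the completed function on a small made-up Series (the values here are arbitrary) should produce columns of increasing powers:
import pandas

# A tiny made-up feature to check the shape of the output.
tmp = pandas.Series([1.0, 2.0, 3.0])
poly = polynomial_dataframe(tmp, 3)
print(poly)
# Expected: power_1 = [1, 2, 3], power_2 = [1, 4, 9], power_3 = [1, 8, 27]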
3. For the remainder of the assignment we will be working with the house sales data as in the previous notebooks.
Load in the data and sort the sales data frame by ‘sqft_living’. When we plot the fitted values we want to join them
up in a line, and this works best if the variable on the x-axis (which will be ‘sqft_living’) is sorted. For houses with
identical square footage, we break the tie by price.
e.g. if you’re using SFrames
sales = graphlab.SFrame('kc_house_data.gl/')
sales = sales.sort(['sqft_living','price'])
e.g. if you're using Pandas
sales = pandas.read_csv('kc_house_data.csv', dtype=dtype_dict)
sales = sales.sort(['sqft_living','price'])
4. Make a degree-1 polynomial SFrame with sales[‘sqft_living’] as the feature. Call it ‘poly1_data’.
5. Add sales[‘price’] to poly1_data as this will be our output variable. e.g. if you’re using SFrames
poly1_data = polynomial_sframe(sales['sqft_living'], 1)
poly1_data['price'] = sales['price']
6. Use graphlab.linear_regression.create (or another linear regression library) to compute the regression weights for
predicting sales[‘price’] based on the degree-1 polynomial feature ‘sqft_living’. The result should be an intercept and
a slope. e.g. if you’re using graphlab create:
model1 = graphlab.linear_regression.create(poly1_data, target = 'price',
                                           features = ['power_1'], validation_set = None)
If you use graphlab.linear_regression.create() to estimate these models please ensure that you set validation_set =
None. This way you will get the same answer every time you run the code.
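If you are working in another library, one possible equivalent is scikit-learn's LinearRegression; this is only a sketch, assuming the Pandas version of poly1_data, and is not required by the assignment:
from sklearn.linear_model import LinearRegression

# Fit price as a linear function of the degree-1 feature 'power_1'.
model1 = LinearRegression()
model1.fit(poly1_data[['power_1']], poly1_data['price'])
print(model1.intercept_, model1.coef_)   # intercept and slope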
7. Next, produce a scatter plot of the training data (just square feet vs price) and add the fitted model. e.g.
with matplotlib and SFrames:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(poly1_data['power_1'], poly1_data['price'], '.',
         poly1_data['power_1'], model1.predict(poly1_data), '-')
The resulting plot should look like a cloud of points with a straight line passing through.
8. Now that you have plotted the results using a 1st degree polynomial, try it again using a 2nd degree and a 3rd
degree polynomial. Look at the fitted lines: do they appear as you would expect?
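For example, the degree-2 case follows the same pattern as steps 4 to 7; this is only a sketch, assuming GraphLab Create, and the degree-3 case is analogous:
poly2_data = polynomial_sframe(sales['sqft_living'], 2)
my_features = poly2_data.column_names()   # ['power_1', 'power_2']
poly2_data['price'] = sales['price']
model2 = graphlab.linear_regression.create(poly2_data, target = 'price',
                                           features = my_features, validation_set = None)
plt.plot(poly2_data['power_1'], poly2_data['price'], '.',
         poly2_data['power_1'], model2.predict(poly2_data), '-')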
9. Now try a 15th degree polynomial. Print out the coefficients and look at the resulting fitted line. Do you think this
degree is appropriate for these data? If we were to use a different subset of the data, do you think we would get
pretty much the same curve?
10. If you’re using SFrames then create four subsets as follows:
first split sales into 2 subsets with .random_split(.5) use seed = 0!
next split these into 2 more subsets (4 total) using random_split(0.5) again set seed = 0!
you should have 4 subsets of (approximately) equal size, call them set_1, set_2, set_3, and set_4
If you’re not using SFrames then please download the provided csv files for each subset.
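Translated into code, the splits above might look like this; a sketch assuming GraphLab Create, where the intermediate names semi_split1 and semi_split2 are just illustrative:
# Split sales in half, then split each half again; seed = 0 every time.
semi_split1, semi_split2 = sales.random_split(0.5, seed=0)
set_1, set_2 = semi_split1.random_split(0.5, seed=0)
set_3, set_4 = semi_split2.random_split(0.5, seed=0)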
11. Estimate a 15th degree polynomial on all 4 sets, plot the results and view the coefficients for all four models.
12. Quiz Question: Is the sign (positive or negative) for power_15 the same in all four models?
13. Quiz Question: True/False: the plotted fitted lines look the same in all four plots.
14. Since the “best” polynomial degree is unknown to us, we will use a validation set to select the best degree. If
you’re using SFrames then create training, validation and testing subsets as follows:
First split sales into training_and_validation and testing with sales.random_split(0.9) use seed = 1!
Next split training_and_validation into training and validation using .random_split(0.5) use seed = 1!
If you’re not using SFrames then please download the provided csv files for training, validation and test data.
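In code, the split described above might look like this; a sketch assuming GraphLab Create:
# 90% training_and_validation / 10% testing, then a 50/50 split of the former.
training_and_validation, testing = sales.random_split(0.9, seed=1)
training, validation = training_and_validation.random_split(0.5, seed=1)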
15. Now for each degree from 1 to 15:
Build a polynomial data set using training_data[‘sqft_living’] as the feature and the current degree
Add training_data[‘price’] as a column to your polynomial data set
Learn a model on TRAINING data to predict ‘price’ based on your polynomial data set at the current degree
Compute the RSS on VALIDATION for the current model (print or save the RSS)
Hint: in graphlab.linear_regression.create() you can set verbose = False to suppress the interim output of
linear_regression.create().
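One way this loop might look; a sketch assuming GraphLab Create, the polynomial_sframe function from step 2, and the training/validation split from step 14:
for degree in range(1, 15 + 1):
    # Build the polynomial features on TRAINING data and attach the target.
    poly_train = polynomial_sframe(training['sqft_living'], degree)
    my_features = poly_train.column_names()
    poly_train['price'] = training['price']

    # Learn a model on TRAINING data only.
    model = graphlab.linear_regression.create(poly_train, target = 'price',
                                              features = my_features,
                                              validation_set = None, verbose = False)

    # Compute the RSS on VALIDATION data for the current degree.
    poly_valid = polynomial_sframe(validation['sqft_living'], degree)
    errors = model.predict(poly_valid) - validation['price']
    rss = (errors * errors).sum()
    print(degree, rss)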
16. Quiz Question: Which degree (1, 2, …, 15) had the lowest RSS on Validation data?
17. Now that you have selected a degree, compute the RSS on TEST data for the model with the best degree from
the Validation data.
18. Quiz Question: What is the RSS on TEST data for the model with the degree selected from Validation
data? (Make sure you got the correct degree from the previous question.)