A3 Instructions V2
Assignment 3 (12 Marks)

Deadline: October 21, 2018, 11:59 pm

1. PROBLEM DESCRIPTION

In this assignment, we will practice executing ensemble methods in R and learn how those methods can help us improve prediction performance. The data is a simulated dataset with one binary label and 15 features. There are 2,000 records in the training data (with label values) and 2,000 records in the test data (without label values). For Task 1 and Task 2, please use the first 1,500 records in A3_train.csv for training and the last 500 records in A3_train.csv for computing performance.

2. TASKS

Task 1: Write your own implementation of Random Forest built from post-pruned rpart trees by modifying the R code template uploaded to the IVLE A3 folder. (3 marks)

● You might find the swirl exercise (BT5152 Tutorial 1 Decision Trees) from week 3 helpful if you need to refresh your memory on post-pruning an rpart decision tree.
● The performance metric is simple accuracy for binary labels.
● This part is for practice, to help you check your understanding of the Random Forest algorithm. In practice, most packages implement Random Forest using fully grown trees.
● Grading of this part is based on the correctness of your code and your understanding of Random Forest. Prediction performance won't be graded.

Task 2: Stacking of three algorithms: C5.0 (the C50 package) with default parameter values, KNN with k = 3, and your Random Forest from Task 1. The output of level 0 is a binary label (not a predicted probability). Logistic regression is used as the level-1 algorithm. The learning objective is to help you check your understanding of Stacking. (4 marks)

● The final output is a binary label and the performance metric is simple accuracy.
● You may use the same classification dataset provided in the template for Task 1. Make sure your implementation is able to report the prediction accuracy on the test dataset.
● Same as Task 1: grading of this part is based on the correctness of your code. Prediction performance won't be graded.
● In this task, you need to code the details of Stacking yourself. In other words, you are not allowed to use caretEnsemble or caretStack. You are allowed and encouraged to use these packages in Task 3.
● Bonus (up to 1 mark): You may include additional code and a half-page discussion comparing your stacking implementation against any of the level-0 models. You may also consider generalizing your stacking implementation so that it can be used on any classification dataset.

Task 3: Toy Data Competition. Now try your best to predict the true labels of the 2,000 rows in the test set file (the file without true labels). The performance metric is AUC; in other words, you are required to submit predicted probabilities. (5 marks)

● Grading of this task is based on your prediction performance and the reproducibility of your prediction results. You need to submit your predicted values and also the code that generates them, for verification purposes. If your AUC is around the median AUC of this class, your expected mark is 2.5 out of 5.
● To alleviate the TA's workload, your training code must complete within 5 minutes.
  ■ You can grid-search with Caret and submit only the code that builds your final model with the chosen parameters. On my 3-year-old ordinary desktop, XGBoost takes less than 1 second to train on this dataset.
● You are allowed to use any R packages for algorithms covered in our lectures, the required textbook, and tutorials up to and including Week 7. Packages for algorithms not covered so far are NOT allowed.
  ■ Only R is allowed. Python is not allowed in this exercise.
  ■ LightGBM is not allowed. GBM or XGBoost in R is allowed.
  ■ You are allowed to try different settings of any of the covered R packages. You do not need to stick to the (default) parameter settings used in the sample code from the tutorials. For example, you can change the parameter settings of the neuralnet or nnet packages in any way you like.
  ■ Using the randomForest package in R or Caret is allowed. There is no need to use the hand-coded version of Random Forest.
  ■ caretEnsemble or caretStack is allowed.
  ■ You can choose whether or not to use Caret.
● You are allowed and encouraged to create new features based on the raw data. Any function for feature engineering is allowed.
● You are allowed to drop features if you believe it helps performance. Using R packages to execute feature selection or dimension reduction methods is allowed.

Submissions and Grading

You can submit up to three files:
1. a *.R (or *.Rmd) file [required]
2. a *.csv for Q3 [required]
3. a *.pdf (or *.html generated by your R Markdown) file of your results and answers [optional]

Name all files by your student number (e.g., A0123456X.R, A0123456X.html, A0123456X.csv) and upload them to the IVLE workbin submission folder "A3". Do not zip your submissions. In your R script you can assume that the dataset files are in the same directory as the script, e.g.

train_data <- read.csv("A3_train.csv")

The pdf file is limited to a maximum of 2 pages including everything: A4 paper, default margins, 12-point font, single spacing. There is no need to try to fill 2 pages; correct answers matter much more for grading than the length of your answers. You may revise and resubmit as many times as you like before the deadline. Make sure to remove any old version that you don't wish to be graded.

If you have questions about the assignment, feel free to email the TA and cc me. Later, if you have questions about the grading of the assignment, you can also email the TA and cc me, because the TA (not me) will grade your assignment by following my grading rules listed below.

Grading Rules

All suspected plagiarism cases will receive 0 marks. Every day of late submission results in a 3-mark deduction, i.e. 4 days late = 0 marks.

For Q1, you should use the provided R code template without major modification. The completed R code should be correct.
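As a rough illustration of the structure Q1 asks for, a hand-coded Random Forest of post-pruned rpart trees might look like the sketch below. The function and variable names (fit_forest, predict_forest, a label column named "label") are hypothetical and this is not the actual IVLE template; it only assumes the rpart package that ships with R.

```r
# Sketch: Random Forest built from post-pruned rpart trees.
# Hypothetical names -- not the actual IVLE template.
library(rpart)

fit_forest <- function(train, n_trees = 50, mtry = 4) {
  features <- setdiff(names(train), "label")
  lapply(seq_len(n_trees), function(i) {
    # 1. Bootstrap sample of the training rows.
    boot <- train[sample(nrow(train), replace = TRUE), ]
    # 2. Random subset of the features for this tree.
    cols <- sample(features, mtry)
    tree <- rpart(label ~ ., data = boot[, c(cols, "label")], method = "class")
    # 3. Post-prune at the complexity parameter with the lowest
    #    cross-validated error (as in the week-3 swirl exercise).
    best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
    prune(tree, cp = best_cp)
  })
}

predict_forest <- function(forest, newdata) {
  # Majority vote over the individual trees' predicted labels.
  votes <- sapply(forest, function(t) {
    as.character(predict(t, newdata, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```

The template you are given may split these steps differently; what is graded is that the bootstrap sampling, the random feature subset, the post-pruning, and the majority vote are all present and correct.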
For Q2, you should have correct R code that can be used for prediction on at least the test classification dataset from Q1 and that reports the test prediction accuracy.

For Q3, your R code must reproduce the exact same prediction results as your *.csv submission. 0 marks will be given for Q3 if executing the R script/markdown doesn't produce the same csv file. 2.5 marks if the execution time on the TA's 2 GHz i7, 8 GB memory MacBook Air is more than 10 minutes. Submissions without a runnable R/Rmd file will receive a failing grade. Make sure all the dependency packages are imported, e.g. library(C50).

The TA may judge the quality of your code and deduct up to 2 marks. For example, marks may be deducted for unnecessary or repeated code, meaningless variable names, or excessive comments.
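For reference, the hand-coded stacking that Q2 requires can be wired up along the lines below. This is only a sketch, not the graded implementation: for self-containedness it uses an rpart tree and kNN from the recommended class package as stand-in level-0 learners, whereas your submission must use C5.0, kNN with k = 3, and your Task 1 forest. All function and column names here are hypothetical.

```r
# Sketch: hand-coded stacking (no caretEnsemble / caretStack).
# Stand-in level-0 learners: rpart and class::knn. In the assignment,
# substitute C5.0, kNN with k = 3, and your Task 1 forest.
library(rpart)
library(class)

stack_fit_predict <- function(train, test) {
  # Hold out part of the training data: level-0 models are fit on the
  # first part, and their predictions on the held-out part become the
  # features for the level-1 model.
  idx   <- seq_len(floor(0.75 * nrow(train)))
  lvl0  <- train[idx, ]
  hold  <- train[-idx, ]
  feats <- setdiff(names(train), "label")

  tree <- rpart(label ~ ., data = lvl0, method = "class")

  # Level-0 outputs are binary labels, as the task requires.
  level0_preds <- function(newdata) {
    data.frame(
      p_tree = predict(tree, newdata, type = "class"),
      p_knn  = knn(lvl0[, feats], newdata[, feats], lvl0$label, k = 3)
    )
  }

  # Level-1: logistic regression on the level-0 predictions.
  meta_train <- level0_preds(hold)
  meta_train$label <- hold$label
  meta <- glm(label ~ ., data = meta_train, family = binomial)

  # Final output is again a binary label.
  prob <- predict(meta, newdata = level0_preds(test), type = "response")
  factor(ifelse(prob > 0.5, levels(train$label)[2], levels(train$label)[1]),
         levels = levels(train$label))
}
```

Note that glm with family = binomial models the probability of the second factor level of the label, which is why probabilities above 0.5 map to that level in the final step.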