A3 Instructions V2
Assignment 3 (12 Marks)
Deadline: October 21, 2018, 11:59 pm
1. PROBLEM DESCRIPTION
In this assignment, we will practice executing ensemble methods in R and learn
how those methods could help us improve the prediction performance.
The data is a simulated dataset with one binary label and 15 features. There are
2000 records in the training data (with label values) and 2000 records in the
test data (without label values). For Task 1 and Task 2, please use the first
1500 records in A3_train.csv for training and the last 500 records in
A3_train.csv for computing performance.
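The split described above can be sketched as follows. This is a minimal sketch, not the provided template: the data frame is simulated so the block runs on its own, and the column names (x1 to x15, label) and the helper name accuracy are assumptions rather than the names used in A3_train.csv.

```r
# Simulated stand-in for A3_train.csv: 2000 rows, 15 features, one binary label.
# The column names (x1..x15, label) are assumptions, not the real file's names.
set.seed(1)
n <- 2000
full <- as.data.frame(matrix(rnorm(n * 15), nrow = n))
names(full) <- paste0("x", 1:15)
full$label <- factor(ifelse(full$x1 + full$x2 + rnorm(n) > 0, 1, 0))

# With the real data, the block above would be replaced by:
# full <- read.csv("A3_train.csv")

train <- full[1:1500, ]            # first 1500 records for training
valid <- full[1501:nrow(full), ]   # last 500 records for computing performance

# Simple accuracy: share of predicted labels that match the true labels.
accuracy <- function(truth, pred) {
  mean(as.character(truth) == as.character(pred))
}

# Baseline check: always predict the majority class of the training split.
majority <- names(which.max(table(train$label)))
baseline_acc <- accuracy(valid$label, rep(majority, nrow(valid)))
print(baseline_acc)
```

A majority-class baseline like this gives a floor against which the Task 1 and Task 2 accuracies can be compared.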
2. TASKS
Task 1: Write your own code for a Random Forest of post-pruned rpart trees by
modifying the R code template uploaded to the IVLE A3 folder. (3 marks)
● You might find the swirl exercise (BT5152 Tutorial 1 Decision
  Trees) from week 3 helpful if you need to refresh your memory on
  post-pruning an rpart decision tree.
● The performance metric is simple accuracy for binary labels.
● This part is for practice, to help you check your understanding
  of the Random Forest algorithm. In practice, most packages
  implement Random Forest using fully grown trees.
● Grading of this part is about the correctness of your code and
  checking your understanding of random forest. Prediction
  performance won’t be graded.
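The idea behind Task 1 can be sketched as below. This is an illustrative sketch under stated assumptions, not the IVLE template: the data is simulated, the function names fit_pruned_tree and predict_forest are hypothetical, and because rpart cannot sample features at each split, the sketch draws one random feature subset per tree, which is a simplification of the usual algorithm. Each tree is grown on a bootstrap sample and post-pruned at the cp value with the lowest cross-validated error.

```r
library(rpart)  # recommended package shipped with standard R installations

# Simulated toy data; the column names are assumptions for illustration only.
set.seed(42)
n <- 600
toy <- as.data.frame(matrix(rnorm(n * 6), nrow = n))
names(toy) <- paste0("x", 1:6)
toy$label <- factor(ifelse(toy$x1 - toy$x2 + 0.5 * rnorm(n) > 0, "1", "0"))
train <- toy[1:450, ]
valid <- toy[451:n, ]

# Grow one tree on a bootstrap sample with a random feature subset, then
# post-prune it at the cp value with the lowest cross-validated error.
fit_pruned_tree <- function(data, label, n_features) {
  features <- setdiff(names(data), label)
  rows <- sample(nrow(data), replace = TRUE)   # bootstrap sample of rows
  cols <- sample(features, n_features)         # per-tree feature subset
  tree <- rpart(reformulate(cols, response = label), data = data[rows, ],
                method = "class",
                control = rpart.control(cp = 0, minsplit = 10))
  cp_table <- tree$cptable
  # A tree with no splits has no cross-validation columns; return it as is.
  if (!"xerror" %in% colnames(cp_table)) return(tree)
  best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
  prune(tree, cp = best_cp)
}

# Majority vote over the class labels predicted by every tree.
predict_forest <- function(trees, newdata) {
  votes <- do.call(cbind, lapply(trees, function(tree) {
    as.character(predict(tree, newdata, type = "class"))
  }))
  apply(votes, 1, function(row) names(which.max(table(row))))
}

# An odd number of trees avoids tied votes for a binary label.
trees <- lapply(seq_len(25), function(i) {
  fit_pruned_tree(train, "label", n_features = 3)
})
pred <- predict_forest(trees, valid)
forest_acc <- mean(pred == as.character(valid$label))
print(forest_acc)
```

Growing with cp = 0 first and pruning afterwards is what makes this post-pruning; setting a larger cp up front would be pre-pruning instead.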
Task 2: Stacking of three algorithms: C50 with default parameter values, KNN
with k=3, and your random forest in Task 1. The output of level 0 is a binary
label (not predicted probability). Logistic regression is used for the level 1
algorithm. The learning objective is to help you check your understanding
about Stacking. (4 marks)
● The final output is a binary label and the performance metric is
  simple accuracy.
● You may use the same classification problem dataset provided in the
  template for Task 1. Make sure your implementation is able to report
  the prediction accuracy on the test dataset.
● Same as Task 1: grading of this part is about the correctness of your
  code. Prediction performance won’t be graded.
● In this question, you need to code the details of Stacking. In other
  words, you are not allowed to use caretEnsemble or caretStack. You
  are allowed and encouraged to use these packages in Task 3.
● Bonus (up to 1 mark): You may include additional code and a half
  page discussion comparing your stacking implementation and any of
  the level 0 models. You may also consider generalizing your stacking
  implementation such that it can be used on any classification dataset.
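The mechanics of hand-coded stacking can be sketched as follows. This is a sketch under stated assumptions rather than a reference solution: the data is simulated, and two rpart trees stand in for C5.0 and the Task 1 random forest so that the block runs with only recommended packages (KNN with k=3 comes from the class package). The names learners, level0_labels and the choice of 5 folds are hypothetical. Level 0 labels for the training rows are produced out of fold, so the level 1 logistic regression never sees a label predicted by a model that was trained on that same row.

```r
library(rpart)  # stand-in trees (in the assignment: C5.0 and the Task 1 forest)
library(class)  # knn()

set.seed(7)
n <- 800
toy <- as.data.frame(matrix(rnorm(n * 5), nrow = n))
names(toy) <- paste0("x", 1:5)
toy$label <- factor(ifelse(toy$x1 + toy$x2^2 - 1 + 0.5 * rnorm(n) > 0, "1", "0"))
train <- toy[1:600, ]
valid <- toy[601:n, ]
features <- paste0("x", 1:5)

to_int <- function(f) as.integer(as.character(f))

# Each level 0 learner takes (train, newdata) and returns 0/1 labels.
learners <- list(
  tree_full = function(tr, nd) {
    m <- rpart(label ~ ., data = tr, method = "class")
    to_int(predict(m, nd, type = "class"))
  },
  knn3 = function(tr, nd) {
    to_int(knn(tr[, features], nd[, features], cl = tr$label, k = 3))
  },
  tree_shallow = function(tr, nd) {
    m <- rpart(label ~ ., data = tr, method = "class",
               control = rpart.control(maxdepth = 2))
    to_int(predict(m, nd, type = "class"))
  }
)

# One column of predicted labels per level 0 learner.
level0_labels <- function(tr, nd) {
  as.data.frame(lapply(learners, function(f) f(tr, nd)))
}

# Out-of-fold level 0 labels for the training rows.
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
meta_train <- data.frame(matrix(NA_integer_, nrow(train), length(learners)))
names(meta_train) <- names(learners)
for (i in seq_len(k)) {
  held <- fold == i
  meta_train[held, ] <- level0_labels(train[!held, ], train[held, ])
}
meta_train$label <- train$label

# Level 1: logistic regression on the level 0 labels.
meta_model <- glm(label ~ ., data = meta_train, family = binomial)

# Refit level 0 on all training rows, then stack on the held-out rows.
meta_valid <- level0_labels(train, valid)
prob <- predict(meta_model, newdata = meta_valid, type = "response")
stack_pred <- ifelse(prob > 0.5, "1", "0")
stack_acc <- mean(stack_pred == as.character(valid$label))
level0_acc <- sapply(meta_valid, function(p) mean(p == to_int(valid$label)))
print(c(level0_acc, stack = stack_acc))
```

Because every learner shares the same (train, newdata) interface, swapping in other level 0 models or another dataset only requires editing the learners list.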
Task 3: Toy Data Competition. Now you try your best to predict the true label of
the 2000 rows in the test set file (the file without true label). The
performance metric is AUC. In other words, you are required to submit predicted
probabilities. (5 marks)
● Grading of this task is based on your prediction performance and
  reproducibility of your prediction results. You need to submit your
  predicted values and also the code to generate predicted values for
  verification purposes. If your AUC is around the median AUC of this class,
  your expected mark is 2.5 out of 5 in this assignment.
● To alleviate the workload of the TA, your training code must complete
  within 5 minutes.
  ■ You can grid search with caret and only submit the code to build your
    final model with the chosen parameters. On my 3-year-old normal
    desktop, XGBoost takes less than 1 second to train on this dataset.
● You are allowed to use any R packages for algorithms covered in our
  lectures, the required textbook, and tutorials before week 7 (including
  Week 7). Packages for algorithms not covered so far are NOT allowed.
  ■ Only R is allowed. Python is not allowed in this exercise.
  ■ LightGBM is not allowed. GBM or XGBoost in R is allowed.
  ■ At the same time, you are allowed to try different settings of any of
    the R packages covered. You do not need to stick to the (default)
    parameter settings used in the sample codes from tutorials. For
    example, you can change the parameter settings of neuralnet or nnet
    packages in any way that you like.
  ■ Using the randomForest package in R or caret is allowed. No need to use
    the hand-coded version of Random Forest.
  ■ caretEnsemble or caretStack is allowed.
  ■ You can choose to use caret or not.
● You are allowed and encouraged to create new features based on raw data.
  Any function for feature engineering is allowed.
● You are allowed to drop features if you believe it helps the performance.
  Using R packages to help you execute feature selection methods or
  dimension reduction methods is allowed.
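Since Task 3 is scored by AUC on predicted probabilities, it can help to check AUC on the held-out rows before submitting. The block below is a minimal base R sketch using the rank (Mann-Whitney) formulation; the function name auc and the toy vectors are assumptions for illustration, and packaged implementations could be used instead.

```r
# AUC via the rank-sum (Mann-Whitney) formulation, base R only.
# truth: 0/1 labels; prob: predicted probability of class 1.
auc <- function(truth, prob) {
  pos <- truth == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  ranks <- rank(prob)  # ties receive average ranks
  (sum(ranks[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c(0, 0, 1, 1)
auc_partial <- auc(truth, c(0.1, 0.4, 0.35, 0.8))  # one positive ranked below a negative
auc_perfect <- auc(truth, c(0.1, 0.2, 0.8, 0.9))   # every positive ranked above every negative
print(c(partial = auc_partial, perfect = auc_perfect))
```

AUC depends only on how the rows are ranked, which is why probabilities rather than hard labels are the natural input.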
Submissions and Grading
You can submit up to three files:
1. a *.R (or *.Rmd) file [required]
2. a *.csv for Q3 [required]
3. a *.PDF (or *.html generated by your rmarkdown) file of your results
and answers. [optional]
Name all files by your student number (e.g., A0123456X.R,
A0123456X.html, A0123456X.csv) and upload to the IVLE workbin
submission folder "A3". Do not zip your submissions.
In your R script you can assume that dataset files are in the same directory as
the R script, e.g. train_data <- read.csv("A3_train.csv")
The page limit of the pdf file is maximum 2 pages including everything. The
formatting is A4, default margin, 12 font size, single spacing. There is no need
to try to fill 2 pages. Correct answers are much more important than the length
of your answers for grading.
You may revise and submit as many times as you like before the deadline. Make
sure to remove any old version that you don’t wish to be graded.
If you have questions about the assignment, feel free to email the TA and cc me.
Later, if you have questions about grading of the assignment, then you can
email the TA and cc me, because the TA (not me) will grade your assignment by
following my grading rules listed below.
Grading Rules
All suspected plagiarism cases will receive 0 marks.
Every day of late submission will result in 3 marks deducted, i.e. 4 days late =
0 marks.
For Q1, you should use the provided R code template without major
modification. The completed R code should be correct.
For Q2, you should have correct R code that can be used for prediction of at
least the test classification dataset in Q1 and produce the test prediction
accuracy.
For Q3, R code that can reproduce the exact same prediction results as your
*.csv submission. 0 will be given for Q3 if executing the R script/markdown
doesn’t produce the same csv file. 2.5 marks if execution time on the TA’s
2 GHz i7, 8 GB memory MacBook Air is more than 10 minutes.
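One way to check the Q3 reproducibility and run-time requirements before submitting is sketched below. Everything here is a hypothetical stand-in: the function make_predictions, the simulated data, the seed value and the CSV column names (id, prob) are assumptions, since the required CSV layout is not specified above. The point is only the pattern: fix the random seed inside the pipeline, time one run, and confirm that two runs write identical files.

```r
# Hypothetical stand-in for the real pipeline: fixes the seed, fits a model,
# and returns predicted probabilities for the test rows.
make_predictions <- function(seed = 2018) {
  set.seed(seed)
  n <- 500
  train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  train$label <- rbinom(n, 1, plogis(train$x1 - train$x2))
  test <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  model <- glm(label ~ x1 + x2, data = train, family = binomial)
  # Column names id and prob are assumptions, not a required format.
  data.frame(id = seq_len(n), prob = predict(model, test, type = "response"))
}

# Time one full run, then run again and compare the written files.
timing <- system.time(first <- make_predictions())
second <- make_predictions()

out_a <- tempfile(fileext = ".csv")
out_b <- tempfile(fileext = ".csv")
write.csv(first, out_a, row.names = FALSE)
write.csv(second, out_b, row.names = FALSE)

same_file <- identical(readLines(out_a), readLines(out_b))
elapsed_seconds <- timing[["elapsed"]]
print(c(same_file = same_file, elapsed_seconds = elapsed_seconds))
```

If same_file is FALSE, an unseeded source of randomness (for example a resampling step run before set.seed) is a likely cause.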
Submissions without a runnable R/Rmd file will receive a failing grade. Make
sure all the dependency packages are imported, e.g. library(C50)
The TA can judge the quality of your code and deduct up to 2 marks. For example,
if you have unnecessary/repeated code, meaningless variable names, excessive
comments, marks may be deducted.