Efficient R Programming: A Practical Guide To Smarter Programming
- Preface
- 1. Introduction
- 2. Efficient Setup
- 3. Efficient Programming
- 4. Efficient Workflow
- 5. Efficient Input/Output
- 6. Efficient Data Carpentry
- 7. Efficient Optimization
- 8. Efficient Hardware
- 9. Efficient Collaboration
- 10. Efficient Learning
- A. Package Dependencies
- B. References
- Index
Efficient R Programming
A Practical Guide to Smarter Programming
Colin Gillespie and Robin Lovelace
Copyright © 2017 Colin Gillespie, Robin Lovelace. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (http://oreilly.com/safari). For more information,
contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Gillian McGarvey
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
December 2016: First Edition
Revision History for the First Edition
2016-11-29: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491950784 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Efficient R Programming,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information
and instructions contained in this work are accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without limitation responsibility for
damages resulting from the use of or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any code samples or other
technology this work contains or describes is subject to open source licenses or the
intellectual property rights of others, it is your responsibility to ensure that your use thereof
complies with such licenses and/or rights.
978-1-491-95078-4
[LSI]
Preface
Efficient R Programming is about increasing the amount of work you can do with R in a given
amount of time. It’s about both computational and programmer efficiency. There are many
excellent R resources about topics such as visualization (e.g., Chang 2012), data science (e.g.,
Grolemund and Wickham 2016), and package development (e.g., Wickham 2015). There are
even more resources on how to use R in particular domains, including Bayesian statistics,
machine learning, and geographic information systems. However, there are very few unified
resources on how to simply make R work effectively. Hints, tips, and decades of community
knowledge on the subject are scattered across hundreds of internet pages, email threads, and
discussion forums, making it challenging for R users to understand how to write efficient
code.
In our teaching we have found that this issue applies to beginners and experienced users alike.
Whether it’s a question of understanding how to use R’s vector objects to avoid for loops,
knowing how to set up your .Rprofile and .Renviron files, or the ability to harness R’s
excellent C++ interface to do the heavy lifting, the concept of efficiency is key. The book aims
to distill tips, warnings, and tricks of the trade into a single, cohesive whole that provides a
useful resource to R programmers of all stripes for years to come.
The content of the book reflects the questions that our students from a range of disciplines,
skill levels, and industries have asked over the years to make their R work faster. How can I
set up my system optimally for R programming work? How can one apply general principles
from computer science (such as do not repeat yourself, aka DRY) to the specifics of an R
script? How can R code be incorporated into an efficient workflow, including project
inception, collaboration, and write-up? And how can one quickly learn how to use new
packages and functions?
The book answers these questions and more in 10 self-contained chapters. Each chapter starts
with the basics and gets progressively more advanced, so there is something for everyone in
each one. While more advanced topics such as parallel programming and C++ may not be
immediately relevant to R beginners, the book helps to navigate R’s infamously steep learning
curve with a commitment to starting slow and building on strong foundations. Thus even
experienced R users are likely to find previously hidden gems of advice. While teaching this
material, we commonly hear “Why didn’t anyone tell me that before?”
Efficient programming should not be seen as an optional extra, and the importance of
efficiency grows with the size of projects and datasets. In fact, this book was devised while
teaching a course called R for Big Data, when it quickly became apparent that if you want to
work with large datasets, your code must work efficiently. Even with small datasets, efficient
code that is both fast to write and fast to run is a vital component of successful R projects. We
found that the concept of efficient programming is important in all branches of the R
community. Whether you are a sporadic user of R (e.g., for its unbeatable range of statistical
packages), looking to develop a package, or working on a large collaborative project in
which efficiency is mission-critical, code efficiency will have a major impact on your
productivity.
Ultimately, efficiency is about getting more output for less work input. To take the analogy of
a car, would you rather drive 1,000 km on a single tank (or a single charge of batteries) or
refuel a heavy, clunky, ugly car every 50 km? Or would you prefer to choose an altogether
more efficient vehicle and cycle? In the same way, efficient R code is better than inefficient R
code in almost every way: it is easier to read, write, run, share, and maintain. This book
cannot provide all the answers about how to produce such code, but it certainly can provide
ideas, example code, and tips to make a start in the right direction of travel.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Bold
Indicates the names of R packages.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by
context.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at
https://github.com/csgillespie/efficient.
This book is here to help you get your job done. In general, if example code is offered with
this book, you may use it in your programs and documentation. You do not need to contact us
for permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example code does
not require permission. Incorporating a significant amount of example code from this book
into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author,
publisher, and ISBN. For example: “Efficient R Programming by Colin Gillespie and Robin
Lovelace (O’Reilly). Copyright 2017 Colin Gillespie, Robin Lovelace, 978-1-491-95078-4.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
O’Reilly Safari
Safari (formerly Safari Books Online) is a membership-based training and reference
platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive
tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard
Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press,
Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress,
Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New
Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/efficient-r-programming.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at
http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
This book was written in the open, and many people contributed pull requests to fix minor
problems. Special thanks go to O’Reilly, who allowed this process, and to everyone who
contributed via GitHub: @Delvis, @richelbilderbeek, @adamryczkowski, @CSJCampbell,
@tktan, @nachti, Conor Lawless, @timcdlucas, Dirk Eddelbuettel, @wolfganglederer,
@HenrikBengtsson, @giocomai, and @daattali.
Many thanks also for the detailed feedback from the technical reviewers, Richard Cotton and
Garrett Grolemund.
Colin
To Esther, Nathan, and Niamh. Thanks for your patience.
Robin
Thanks to my housemates in Cornerstone Housing Cooperative for putting up with me being
antisocial while in book mode. To everyone at the University of Leeds for encouraging me to
pursue projects outside the usual academic pursuits of journal articles and conferences. And
thanks to everyone involved in the community of open source developers, users, and
communicators who made all this possible.
Chapter 1. Introduction
This chapter describes the wide range of people this book was written for, in terms of R and
programming experience, and how you can get the most out of it. Anyone setting out to
improve efficiency should have an understanding of precisely what they mean by the term,
and this is discussed with reference to algorithmic and programmer efficiency in “What Is
Efficiency?”, and with reference to R in particular in “What Is Efficient R Programming?”.
It may seem obvious, but it’s also worth thinking about why anyone would
bother with efficient code now that powerful computers are cheap and accessible. This is
covered in “Why Efficiency?”.
Happily, this book is not completely R-specific. Non-R programming skills that are needed
for efficient R programming, which you will develop during the course of following this
book, are covered in “Cross-Transferable Skills for Efficiency”. Atypically for a book about
programming, this section introduces touch typing and consistency, cross-transferable skills
that should improve your efficiency beyond programming. However, this is first and
foremost a book about programming and it wouldn’t be so without code examples in every
chapter. Despite being more conceptual and discursive, this opening chapter is no exception:
its penultimate section (“Benchmarking and Profiling”) describes two essential tools in the
efficient R programmer’s toolbox and how to use them with a couple of illustrative examples.
The final thing to say at the outset is how to use this book in conjunction with the book’s
associated package and its source code. This is covered in “Book Resources”.
Prerequisites
As emphasized in the next section, it’s useful to run code and experiment as you read. This
section, found at the beginning of each chapter, ensures that you have the necessary packages
for each chapter. The prerequisites for this chapter are:
- A working installation of R on your computer (see “Installing and Updating RStudio”).
- Install and load the microbenchmark, profvis, and ggplot2 packages (see “Installing R
  Packages” for tips on installing packages and keeping them up-to-date). You can ensure
  that these packages are installed by loading them as follows:
library("microbenchmark")
library("profvis")
library("ggplot2")
The prerequisites needed to run the code contained in the entire book are covered in “Book
Resources” at the end of this chapter.
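If you would rather check for these packages without attaching them, the following minimal sketch may help. It uses only base R; the object names pkgs, installed, and missing_pkgs are our own illustrative choices, and installation itself is deliberately left to the reader.

```r
# Packages listed as prerequisites for this chapter
pkgs <- c("microbenchmark", "profvis", "ggplot2")

# requireNamespace() returns FALSE (rather than stopping) if a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Report what is missing; nothing is installed automatically
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  message("Install them with install.packages() before continuing")
} else {
  message("All prerequisite packages are installed")
}
```

Unlike library(), this approach does not stop with an error at the first missing package, so you see the full list in one go.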
Who This Book Is for and How to Use It
This book is for anyone who wants to make their R code faster to type, faster to run, and
more scalable. These considerations generally come after learning the very basics of R for
data analysis; we assume you are either accustomed to R or proficient at programming in
other languages, although this book could still be of use for beginners. Thus the book should
be useful to people with a range of skill levels, who can broadly be divided into three groups:
For programmers with little experience with R
This book will help you navigate the quirks of R to make it work efficiently: it is easy to
write slow R code if you treat it as if it were another language.
For R users with little experience in programming
This book will show you many concepts and tricks of the trade, some of which are
borrowed from computer science, that will make your work more time effective.
For R beginners with little experience in programming
This book can steer you to get things right (or at least less wrong) at the outset. Bad
habits are easy to gain but hard to lose. Reading this book at the outset of your
programming career could save you many hours in the future searching the web for
issues covered in this book.
Identifying which group you best fit into will help you get the most out of it. For everyone, we
recommend reading Efficient R Programming while you have an active R project on the go,
whether it’s a collaborative task at work or simply a personal project at home. Why? The
scope of this book is wider than most programming textbooks (Chapter 4 covers project
management, for example) and working on a project outside the confines of it will help put
the concepts, recommendations, and code into practice. Going directly from words into action
in this way will help ensure that the information is consolidated: learn by doing.
If you’re an R novice and fit into the final category, we recommend that this active R project
not be an important deliverable, but another R resource. Though this book is generic, it is
likely that your usage of R will be largely domain-specific. For this reason, we recommend
reading it alongside teaching material in your chosen area. Furthermore, we advocate that all
readers use this book alongside other R resources such as the numerous vignettes, tutorials,
and online articles that the R community has produced (described in the following tip). At a
bare minimum, you should be familiar with data frames, looping, and simple plots, which you
will learn from these resources.
RESOURCES FOR LEARNING R
There are many places to find generic and domain-specific R teaching materials. For complete beginners, there are a
number of introductory resources, such as the excellent Student’s Guide to R and the more technical IcebreakeR tutorial.
R also comes preinstalled with guidance, revealed by entering help.start() into the R console, including the classic
official guide An Introduction to R, which is excellent, but daunting to many. Entering vignette() will display a list of
guides packaged within your R installation (and which hence do not require an internet connection). To see the vignette
for a specific topic, enter the vignette’s name into the same command; for example,
vignette(package = "dplyr", "introduction") shows the introductory vignette for the dplyr package.
Another early port of call should be the Comprehensive R Archive Network (CRAN) website. The Contributed
Documentation page contains a list of contributed resources, mainly tutorials, on subjects ranging from map making to
econometrics. The new bookdown website contains a list of complete (or near complete) books that cover domains such
as R for Data Science and Authoring Books with R Markdown. We recommend keeping your eye on the R-o-sphere via
the R-Bloggers website, popular Twitter feeds, and CRAN-affiliated email lists for up-to-date materials that can be used
in conjunction with this book.
What Is Efficiency?
In everyday life, efficiency roughly means working well. An efficient vehicle goes far without
guzzling gas. An efficient worker gets the job done fast without stress. And an efficient light
shines brightly with a minimum of energy consumption. In this final sense, efficiency (η) has
a formal definition as the ratio of work done (W, light output) per unit effort (Q, energy
consumption in this case):
$$\eta = \frac{W}{Q}$$
How does this translate into programming? Efficient code can be defined narrowly or
broadly. The first, more narrow definition is algorithmic efficiency: how fast the computer
can undertake a piece of work given a particular piece of code. This concept dates back to the
very origins of computing, as illustrated by the following quote by Ada Lovelace (1842) in
her notes on the work of Charles Babbage:
In almost every computation a great variety of arrangements for the succession of the
processes is possible, and various considerations must influence the selections amongst
them for the purposes of a calculating engine. One essential object is to choose that
arrangement which shall tend to reduce to a minimum the time necessary for completing
the calculation.
The second, broader definition of efficient computing is programmer productivity. This is the
amount of useful work a person (not a computer) can do per unit time. It may be possible to
rewrite your codebase in C to make it 100 times faster. But if this takes 100 human hours, it
may not be worth it. Computers can chug away day and night. People cannot. Human
productivity is the subject of Chapter 4.
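To make this trade-off concrete, here is a rough back-of-the-envelope sketch. The 100-hour rewrite and the 100-fold speedup come from the paragraph above; the 60-second runtime is purely a hypothetical figure chosen for illustration, and the calculation assumes that someone actually waits for every run to finish.

```r
# Assumed figures, for illustration only
rewrite_hours   <- 100  # programmer time spent on the rewrite (from the text)
speedup         <- 100  # the rewritten code runs 100 times faster (from the text)
runtime_seconds <- 60   # hypothetical runtime of the original code per run

# Time saved on each run after the rewrite
saved_per_run <- runtime_seconds - runtime_seconds / speedup

# Number of runs needed before the time saved equals the time invested
break_even_runs <- rewrite_hours * 3600 / saved_per_run
ceiling(break_even_runs)
```

Under these assumptions the code would need to run roughly six thousand times before the rewrite pays for itself, which is why it helps to measure before optimizing.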
By the end of this book, you should know how to write code that is efficient from both
algorithmic and productivity perspectives. Efficient code is also concise, elegant, and easy to
maintain, which is vital when working on large projects. But this raises the wider question:
what is different about efficient R code compared with efficient code in any other language?
What Is Efficient R Programming?
The issue flagged by Ada of having a great variety of ways to solve a problem is key to
understanding how efficient R programming differs from efficient programming in other
languages. R is notorious for allowing users to solve problems in many ways. This is due to
R’s inherent flexibility, in which almost “anything can be modified after it is created”
(Wickham 2014). R’s inventors, Ross Ihaka and Robert Gentleman, designed it to be this way:
a cell in a data frame can be selected in multiple ways in base R alone (three of which are
illustrated later in this chapter, in “Benchmarking Example”). This is useful because it allows
programmers to use the language as best suits their needs, but it can be confusing for people
looking for the right way of doing things and can cause inefficiencies if you don’t fully
understand the language.
R’s notoriety for being able to solve a problem in multiple ways has grown with the
proliferation of community-contributed packages. In this book, we focus on the best way of
solving problems from an efficiency perspective. Often it is instructive to discover why a
certain way of doing things is faster than other ways. However, if your aim is simply to get
stuff done, you only need to know what is likely to be the most efficient way. In this way, R’s
flexibility can be inefficient: although it may be easier to find a way of solving any given
problem in R than other languages, solving the problem with R may make it harder to find the
best way to solve that problem, as there are so many. This book tackles this issue head on by
recommending what we believe are the most efficient approaches. We hope you trust our
views, based on years of using and teaching R, but we also hope that you challenge them at
times and test them with benchmarks if you suspect there’s a better way of doing things
(thanks to R’s flexibility and ability to interface with other languages, there may well be).
It is well known that R code can lack algorithmic efficiency compared with low-level
languages for certain tasks, especially if it was written by someone who doesn’t fully
understand the language. But it is worth highlighting the numerous ways that R encourages
and guides efficiency, especially programmer efficiency:
- R is not compiled, but it calls compiled code. This means that you get the best of both
  worlds: thankfully, R removes the laborious stage of compiling your code before being
  able to run it, but provides impressive speed gains by calling compiled C, FORTRAN,
  and code in other languages behind the scenes.
- R is a functional and object-orientated language (Wickham 2014). This means that it is
  possible to write complex and flexible functions in R that get a huge amount of work
  done with a single line of code.
- R uses RAM for memory. This may seem obvious, but it’s worth saying: RAM is much
  faster than any hard disk system. Compared with databases, R is therefore very fast at
  common data manipulation, processing, and modeling operations. RAM is now cheaper
  than ever, meaning the potential downsides of this feature are further away than ever.
- R is supported by excellent integrated development environments (IDEs). The
  environment in which you program can have a huge impact on programmer efficiency
  as it can provide help quickly, allow for interactive plotting, and allow your R projects to
  be tightly integrated with other aspects of your project such as file management, version
  management, and interactive visualization systems, as discussed in “RStudio”.
- R has a strong user community. This boosts efficiency because if you encounter a
  problem that has not yet been solved, you can simply ask the community. If it is a new,
  clearly stated, and reproducible question asked on a popular forum such as Stack
  Overflow or an appropriate R list, you are likely to get a response from an accomplished
  R programmer within minutes. The obvious benefit of this crowd-sourced support
  system is that the efficiency benefits of the answer will, from that moment on, be
  available to everyone.
Efficient R programming is the implementation of efficient programming practices in R. All
languages are different, so efficient R code does not look like efficient code in another
language. Many packages have been optimized for performance so, for some operations,
achieving maximum computational efficiency may simply be a case of selecting the
appropriate package and using it correctly. There are many ways to get the same result in R,
and some are very slow. Therefore, not writing slow code should be prioritized over writing
fast code.
Returning to the analogy of the two cars sketched in the preface, efficient R programming for
some use cases can simply mean trading in your old, heavy, gas-guzzling SUV function for a
lightweight velomobile. The search for optimal performance often has diminishing returns,
so it is important to find bottlenecks in your code to prioritize work for maximum increases
in computational efficiency. Linking back to R’s notoriety as a flexible language, efficient R
programming can be interpreted as finding a solution that is fast enough in terms of
computational efficiency but as fast as possible in terms of programmer efficiency. After all,
you and your coworkers probably have better and more valuable things to do outside work,
so it is important that you get the job done quickly and take time off for other interesting
pursuits.
Why Efficiency?
Computers are always getting more powerful. Does this not reduce the need for efficient
computing? The answer is simple: no. In an age of Big Data and stagnating computer clock
speeds (see Chapter 8), computational bottlenecks are more likely than ever before to hamper
your work. An efficient programmer can “solve more complex tasks, ask more ambitious
questions, and include more sophisticated analyses in their research” (Visser et al. 2015).
A concrete example illustrates the importance of efficiency in mission-critical situations.
Robin was working on a tight contract for the UK’s Department for Transport to build the
Propensity to Cycle Tool, an online application that had to be ready for national deployment
in less than four months. For this work, he developed the function line2route() in the
stplanr package to generate routes via the CycleStreets API. Hundreds of thousands of
routes were needed, but, to his dismay, code slowed to a standstill after only a few thousand
routes. This endangered the contract. After eliminating other issues and via code profiling
(covered in “Code Profiling”), it was found that the slowdown was due to a bug in
line2route(): it suffered from the vector growing problem, discussed in “Memory
Allocation”.
The solution was simple. A single commit made line2route() more than ten times faster and
substantially shorter. This potentially saved the project from failure. The moral of this story is
that efficient programming is not merely a desirable skill: it can be essential.
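The vector growing problem mentioned in this story can be illustrated with a small sketch. This is not the actual line2route() code or its fix; the functions grow() and preallocate() are hypothetical examples written only to show the general pattern of appending inside a loop versus allocating the result once.

```r
n <- 1e4

# Method 1: grow the vector by one element on every iteration
grow <- function(n) {
  x <- NULL
  for (i in seq_len(n)) {
    x <- c(x, i^2)  # builds a new, longer vector each time
  }
  x
}

# Method 2: allocate the full vector once, then fill it in place
preallocate <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) {
    x[i] <- i^2
  }
  x
}

# Both approaches give the same result
stopifnot(identical(grow(n), preallocate(n)))

# Timings vary by machine, so no expected output is shown
system.time(grow(n))
system.time(preallocate(n))
```

The gap between the two typically widens as n increases, because each call to c() has to copy everything accumulated so far.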
There are many concepts and skills that are language-agnostic. Much of the knowledge
imparted in this book should be relevant to programming in other languages (and other
technical activities beyond programming). There are strong reasons for focusing on
efficiency in one language, however. In R, simply using replacement functions from a
different package can greatly improve efficiency, as discussed in relation to reading text files
in Chapter 5. This level of detail, with reproducible examples, would not be possible in a
general-purpose efficient programming book. Skills for efficient working, which apply
beyond R programming, are covered in the next section.
Cross-Transferable Skills for Efficiency
The meaning of efficient R code, as opposed to generic efficient code, should be clear from
reading the preceding two sections. However, that does not mean that the skills and concepts
covered in this book are not transferable to other languages and non-programming tasks.
Likewise, working on these cross-transferable skills will improve your R programming (as
well as other aspects of your working life). Two of these skills are especially important: touch
typing and use of a consistent style.
Touch Typing
The other side of the efficiency coin is programmer efficiency. There are many things that
will help increase the productivity of you and your collaborators, not least following the
advice of Philipp Janert to “think more, work less” (Janert 2010). The evidence suggests that
good diet, physical activity, plenty of sleep, and a healthy work-life balance can all boost your
speed and effectiveness at work (Jensen 2011; Pereira et al. 2015; Grant, Spurgeon, and
Wallace 2013).
While we recommend that the reader reflect on this evidence and their own well-being, this is
not a self-help book. It is a book about programming. However, there is one
nonprogramming skill that can have a huge impact on productivity: touch typing. This skill
can be relatively painless to learn, and can have a huge impact on your ability to write,
modify, and test R code quickly. Learning to touch type properly will pay off in small
increments throughout the rest of your programming life (of course, the benefits are not
constrained to R programming).
The key difference between a touch typist and someone who constantly looks down at the
keyboard, or who uses only two or three fingers for typing, is hand placement. Touch typing
involves positioning your hands on the keyboard with each finger of both hands touching or
hovering over a specific letter (Figure 1-1). This takes time and some discipline to learn.
Fortunately there are many resources that will help you get in the habit early, including the
open source software projects Klavaro and TypeFaster.
Consistent Style and Code Conventions
Getting into the habit of clear and consistent style when writing anything, be it code or poetry,
will have benefits in many other projects, programming or non-programming. As outlined in
“Coding Style”, style is to some extent a personal preference. However, it is worth noting the
conventions we use at the outset of this book, to maximize its readability. Throughout this
book we use a consistent set of conventions to refer to code.
- Package names are in bold, e.g., dplyr.
- Functions are in a code font, followed by parentheses, like plot() or median().
- Other R objects, such as data or function arguments, are in a code font without
  parentheses, like x and name.
- Occasionally, we’ll highlight the package of the function using two colons, like
  microbenchmark::microbenchmark(). Note that this notation can be efficient if you only
  need to use a package’s function once, as it avoids attaching the package.
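As a brief illustration of the two-colon notation, the sketch below calls a function from the tools package, which ships with R but is not normally attached at startup. The filename is a made-up example, and the variable names ext and attached_after are our own.

```r
# Call file_ext() from the tools package without library(tools)
ext <- tools::file_ext("analysis.Rmd")
ext

# The namespace is loaded for the call, but the package is not added to the
# search path (unless it had already been attached beforehand)
attached_after <- "package:tools" %in% search()
attached_after
```

Because nothing is attached, the notation also avoids name clashes between functions with the same name in different packages.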
The concepts of benchmarking and profiling are not R-specific. However, they are done in a
particular way in R, as outlined in the next section.
Benchmarking and Profiling
Benchmarking and profiling are key to efficient programming, especially in R.
Benchmarking is the process of testing the performance of specific operations repeatedly.
Profiling involves running many lines of code to find bottlenecks. Both are vital for
understanding efficiency, and we use them throughout the book. Their centrality to efficient
programming practice means they must be covered in this introductory chapter, despite being
seen by many as an intermediate or advanced R programming topic.
In some ways, benchmarks can be seen as the building blocks of profiles. Profiling can be
understood as automatically running many benchmarks for every line in a script and
comparing the results line by line. Because benchmarks are smaller, easier, and more
modular, we cover them first.
Benchmarking
Modifying elements from one benchmark to the next and recording the results after the
modification enables us to determine the fastest piece of code. Benchmarking is important in
the efficient programmer’s toolkit: you may think that your code is faster than mine, but
benchmarking allows you to prove it. The easiest way to benchmark a function is to use
system.time(). However, it is important to remember that we are taking a sample. We
wouldn’t expect a single person in London to be representative of the entire UK population;
similarly, a single benchmark provides us with a single observation on our function’s
behavior. Therefore, we’ll need to repeat the timing many times with a loop.
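The idea of repeating system.time() in a loop can be sketched as follows, using only base R. The number of repetitions (n_reps) and the inner loop of 10,000 calls are arbitrary values we have picked for illustration; a single lookup is too quick for system.time() to measure reliably on its own.

```r
# Expression to time: look up one value in a small data frame
df <- data.frame(v = 1:4, name = letters[1:4])

# Repeat the timing several times; each element is one observation
n_reps <- 10
elapsed <- numeric(n_reps)
for (r in seq_len(n_reps)) {
  elapsed[r] <- system.time(
    for (i in 1:10000) df[3, 2]  # many calls per timing, as one call is too quick
  )["elapsed"]
}

# Summarize the sample rather than trusting a single run
summary(elapsed)
```

The spread between the minimum and maximum gives a first impression of how noisy a single timing would have been.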
An alternative way of benchmarking is via the flexible microbenchmark package. This allows
us to easily run each function multiple times (by default, 100) in order to detect microsecond
differences in code performance. We then get a convenient summary of the results: the
minimum/maximum and lower/upper quartiles, and the mean/median times. We suggest
focusing on the median time to get a feel for the standard time and the quartiles to understand
the variability.
Benchmarking Example
A good example is testing different methods to look up a single value in a data frame. Note
that each argument in the following benchmark is a term to be evaluated (for multi-line
benchmarks, the term to be evaluated can be surrounded by curly brackets, {}).
library("microbenchmark")
df = data.frame(v = 1:4, name = letters[1:4])
microbenchmark(df[3, 2], df[3, "name"], df$name[3])
# Unit: microseconds
#           expr   min    lq  mean median    uq   max neval cld
#       df[3, 2] 17.99 18.96 20.16  19.38 19.77 35.14   100   b
#  df[3, "name"] 17.97 19.13 21.45  19.64 20.15 74.00   100   b
#     df$name[3] 12.48 13.81 15.81  14.48 15.14 67.24   100   a
The results summarize how long each query took: the minimum (min); lower and upper
quartiles (lq and uq, respectively); and the mean, median, and maximum (max) for each of the
number of evaluations (neval, with the default value of 100 used in this case). cld reports the
relative rank of each row in the form of compact letter display: in this case, df$name[3]
performs best, with a rank of a and a mean time of around 25% lower than the other two
functions.
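The curly-bracket form for multi-line terms can be sketched as follows. The labels dollar and two_step are illustrative names of our own, and the call is wrapped in a requireNamespace() check so the snippet still runs if microbenchmark is not installed. Timings depend on your machine, so no output is shown.

```r
df <- data.frame(v = 1:4, name = letters[1:4])

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  res <- microbenchmark::microbenchmark(
    # A single-line term
    dollar = df$name[3],
    # A multi-line term wrapped in curly brackets
    two_step = {
      column <- df[["name"]]
      column[3]
    },
    times = 100  # the default number of evaluations, stated explicitly
  )
  print(res)
}
```

Naming the terms makes the expr column of the summary easier to read when the code being timed spans several lines.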
When using microbenchmark(), you should pay careful attention to the units. In the previous
example, each function call takes approximately 20 microseconds, implying around 50,000
function calls could be done in a second. When comparing quick functions, the standard units
are:
milliseconds (ms)
one thousand function calls take a second;
microseconds (µs)
one million function calls take a second;
nanoseconds (ns)
one billion function calls take a second.
We can set the units we want to use with the unit argument (e.g., the results are reported in
seconds if we set unit = "s").
When thinking about computational efficiency, there are (at least) two measures:
Relative time
df$name[3] is 25% faster than df[3, "name"];
Absolute time
df$name[3] is five microseconds faster than df[3, "name"].
Both measures are useful, but it is important not to forget the underlying timescale. It makes
little sense to optimize a function that takes microseconds if there are operations that take
seconds to complete in your code.
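The arithmetic behind these figures can be checked with a few lines of base R. The two median times are copied from the benchmark output in “Benchmarking Example”; the variable names are our own, and your timings will differ.

```r
# Median times in microseconds, copied from the benchmark output
median_bracket <- 19.64  # df[3, "name"]
median_dollar  <- 14.48  # df$name[3]

# Absolute difference in microseconds
absolute_diff <- median_bracket - median_dollar

# Relative difference as a percentage of the slower call
relative_diff <- 100 * absolute_diff / median_bracket

# Approximate calls per second for a call that takes 20 microseconds
calls_per_second <- 1 / 20e-6

round(c(absolute = absolute_diff, relative = relative_diff), 1)
calls_per_second
```

On these numbers the absolute saving is about five microseconds and the relative saving is roughly a quarter, so the same result can sound trivial or impressive depending on which measure is quoted.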
Profiling
Benchmarking generally tests the execution time of one function against another. Profiling,
on the other hand, is about testing large chunks of code.
It is difficult to overemphasize the importance of profiling for efficient R programming.
Without a profile of what took longest, you will have only a vague idea of why your code is
taking so long to run. The following example (which generates Figure 1-2, an image of
ice-sheet retreat from 1985 to 2015) shows how profiling can be used to identify bottlenecks in
your R scripts:
library("profvis")
profvis(expr = {
  # Stage 1: load packages
  # library("rnoaa") # not necessary as data pre-saved
  library("ggplot2")
  # Stage 2: load and process data
  out = readRDS("extdata/out-ice.Rds")
  # bind_rows() is used here because rbind_all() is deprecated in dplyr
  df = dplyr::bind_rows(out, .id = "Year")
  # Stage 3: visualize output
  ggplot(df, aes(long, lat, group = paste(group, Year))) +
    geom_path(aes(colour = Year))
  ggsave("figures/icesheet-test.png")
}, interval = 0.01, prof_output = "ice-prof")
The results of this profiling exercise are displayed in Figure 1-3.
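The profvis package builds on R’s built-in sampling profiler, Rprof(). If profvis or the ice-sheet data are not available, the same idea can be tried with base R alone, as in the minimal sketch below. The function slow_task() is a hypothetical workload invented for this example; it is not part of the book’s code.

```r
# Hypothetical workload: slow enough for the sampling profiler to record it
slow_task <- function() {
  for (i in 1:200) {
    x <- sort(runif(1e5))
  }
  invisible(x)
}

prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)  # start sampling every 0.01 seconds
slow_task()
Rprof(NULL)                        # stop profiling

# Summarize time by function; 'by.self' excludes time spent in callees
prof_summary <- summaryRprof(prof_file)
head(prof_summary$by.self)

unlink(prof_file)
```

The text summary is less convenient than the interactive profvis display, but it answers the same question: which functions account for most of the running time.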
For more information about profiling and benchmarking, please refer to the Optimising code
chapter in Advanced R by Hadley Wickham (CRC Press), and “Code Profiling” in this book.
We recommend reading these additional resources while performing benchmarks and
profiles on your own code, perhaps based on the following exercises.
Figure 1-2. Visualization of North Pole ice-sheet decline, generated using the code profiled using the profvis package
Figure 1-3. Profiling results of loading and plotting NASA data on ice-sheet retreat
Exercises
Consider the following benchmark to evaluate different functions for calculating the
cumulative sum of all the whole numbers from 1 to 100:
x = 1:100 # initiate vector to cumulatively sum
# Method 1: with a for loop (10 lines)
cs_for = function(x) {
  for (i in x) {
    if (i == 1) {
      xc = x[i]
    } else {
      xc = c(xc, sum(x[1:i]))
    }
  }
  xc
}
# Method 2: with apply (3 lines)
cs_apply = function(x) {
  sapply(x, function(x) sum(1:x))
}
# Method 3: cumsum (1 line, not shown)
microbenchmark(cs_for(x), cs_apply(x), cumsum(x))
#> Unit: nanoseconds
#>         expr    min     lq   mean median     uq    max neval
#>    cs_for(x) 248145 316292 386893 370505 436382 697258   100
#>  cs_apply(x) 157610 198157 255241 233324 306013 478394   100
#>    cumsum(x)    561   1131   1796   1422   2075  18284   100
1. Which method is fastest and how many times faster is it?
2. Run the same benchmark, but with the results reported in seconds, on a vector of all
   the whole numbers from 1 to 50,000. Hint: also use the argument times = 1 so that
   each command is only run once to ensure that the results complete (even with a single
   evaluation, the benchmark may take up to or more than a minute to complete,
   depending on your system). Does the relative time difference increase or decrease?
   By how much?
3. Test how long the different methods for subsetting the data frame df, presented in
   “Benchmarking Example”, take on your computer. Is it faster or slower at subsetting
   than the computer on which this book was compiled?
4. Use system.time() and a for() loop to test how long it takes to perform the
   subsetting operation 50,000 times. Before testing this, do you think it will be more or
   less than one second for each subsetting method? Hint: the test for the first method is
   shown in the following code:
# Test how long it takes to subset the data frame 50,000 times:
system.time(
  for (i in 1:50000) {
    df[3, 2]
  }
)
5. Bonus exercise: try profiling a section of code you have written using profvis.
   Where are the bottlenecks? Were they where you expected?
Book Resources
R Package
This book has an associated R package that contains datasets and functions referenced in the
book. The package is hosted on GitHub and can be installed using the devtools package:
devtools::install_github("csgillespie/efficient")
The package also contains solutions (as vignettes) to the exercises found in this book. They
can be browsed with the following command:
browseVignettes(package = "efficient")
The following command will install all packages used to generate this book:
devtools::install_github("csgillespie/efficientR")
Online Version
We are grateful to O’Reilly for allowing us to develop this book online. The online version
constitutes a substantial additional resource to supplement this book, and will continue to
evolve in between reprints of the physical book. The book’s code also represents a substantial
learning opportunity in itself as it was written using R Markdown and the bookdown package,
allowing us to run the R code each time we compile the book to ensure that it works, and
allowing others to contribute to its longevity. To edit this chapter, for example, simply
navigate to https://github.com/csgillespie/efficientR/edit/master/01-introduction.Rmd while
logged into a GitHub account. The full source of the book is available at
https://github.com/csgillespie/efficientR where we welcome comments/questions on the Issue
Tracker and Pull Requests.
References
Wickham, Hadley. 2014a. Advanced R. CRC Press.
Visser, Marco D., Sean M. McMahon, Cory Merow, Philip M. Dixon, Sydne Record, and Eelke
Jongejans. 2015. “Speeding Up Ecological and Evolutionary Computations in R; Essentials of
High Performance Computing for Biologists.” Edited by Francis Ouellette. PLOS
Computational Biology 11 (3): e1004140. doi:10.1371/journal.pcbi.1004140.
Janert, Philipp K. 2010. Data Analysis with Open Source Tools. O’Reilly Media.
Jensen, Jørgen Dejgård. 2011. “Can Worksite Nutritional Interventions Improve Productivity
and Firm Profitability? A Literature Review.” Perspectives in Public Health 131 (4). SAGE
Publications: 184–92.
Pereira, Michelle Jessica, Brooke Kaye Coombes, Tracy Anne Comans, and Venerina
Johnston. 2015. “The Impact of Onsite Workplace Health-Enhancing Physical Activity
Interventions on Worker Productivity: A Systematic Review.” Occupational and
Environmental Medicine 72 (6). BMJ Publishing Group Ltd: 401–12.
Grant, Christine A, Louise M Wallace, and Peter C Spurgeon. 2013. “An Exploration of the
Psychological Factors Affecting Remote E-Worker’s Job Effectiveness, Well-Being and
Work-Life Balance.” Employee Relations 35 (5). Emerald Group Publishing Limited: 527–46.
Chapter 2. Efficient Setup
An efficient computer setup is analogous to a well-tuned vehicle. Its components work in
harmony. It is well serviced. It’s fast!
This chapter describes the setup that will enable a productive workflow. It explores how the
operating system, R version, startup files, and IDE can make your R work faster.
Understanding and at times changing these setup options can have many additional benefits.
That’s why we cover them at this early stage (hardware is covered in Chapter 8). By the end of
this chapter, you should understand how to set up your computer and R installation for
optimal efficiency. It covers the following topics:
R and the operating systems
System monitoring on Linux, Mac, and Windows
R version
How to keep your base R installation and packages up-to-date
R start-up
How and why to adjust your .Rprofile and .Renviron files
RStudio
An IDE to boost your programming productivity
BLAS and alternative R interpreters
Looks at ways to make R faster
Efficient programming is more than a series of tips: there is no substitute for in-depth
understanding. However, to help remember the key messages buried among the details, each
chapter from now on contains a Top Five Tips section after the prerequisites.
Prerequisites
Only one package needs to be installed to run the code in this chapter:
library("benchmarkme")
Top Five Tips for an Efficient R Setup
1. Use system monitoring to identify bottlenecks in your hardware/code.
2. Keep your R installation and packages up-to-date.
3. Make use of RStudio’s powerful autocompletion capabilities and shortcuts.
4. Store API keys in the .Renviron file.
5. Consider changing your BLAS library.
Operating System
R supports all three major operating system (OS) types: Linux, Mac, and Windows.¹ R is
platform-independent, although there are some OS-specific quirks, such as in relation to
file path notation (see “The Location of Startup Files”).
Basic OS-specific information can be queried from within R using Sys.info():
Sys.info()
#>  sysname            release   machine     user
#>  "Linux" "4.2.0-35-generic"  "x86_64"  "robin"
Translated into English, the preceding output means that R is running on a 64-bit (x86_64)
Linux distribution (4.2.0-35-generic is the Linux kernel version) and that the current user is
robin. Four other pieces of information (not shown) are also produced by the command, the
meaning of which is well documented in a help file revealed by entering ?Sys.info in the R
console.
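As a small illustration of how individual fields can be pulled out of this output, the following sketch uses only base R. The variable names (os_name, cpu_arch, is_64bit) are our own and are chosen purely for illustration.

```r
# Sys.info() returns a named character vector, so fields can be
# extracted by name (it can be NULL on platforms that do not support it)
info = Sys.info()
os_name = info[["sysname"]]    # e.g., "Linux", "Darwin", or "Windows"
cpu_arch = info[["machine"]]   # e.g., "x86_64"
# A pointer size of 8 bytes indicates a 64-bit build of R
is_64bit = .Machine$sizeof.pointer == 8
message("OS: ", os_name, "; architecture: ", cpu_arch,
        "; 64-bit R: ", is_64bit)
```

Storing these values at the top of a script makes it easy to branch on the operating system later on.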
TIP
The assertive.reflection package can be used to report additional information about your computer’s operating
system and R setup with functions for asserting operating system and other system characteristics. The assert_*()
functions work by testing the truth of the statement and erroring if the statement is untrue. On a Linux system
assert_is_linux() will run silently, whereas assert_is_windows() will cause an error. The package can also
test for the IDE you are using (e.g., assert_is_rstudio()), the capabilities of R
(assert_r_has_libcurl_capability(), etc.), and what OS tools are available (e.g.,
assert_r_can_compile_code()). These functions can be useful for running code that is designed only to run on
one type of setup.
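If you would rather not add a dependency, a similar check can be sketched in base R. The helper assert_os_type() below is hypothetical (it is not part of assertive.reflection) and only distinguishes the two OS families reported by .Platform$OS.type.

```r
# Hypothetical helper: stop with an error unless the OS family matches.
# .Platform$OS.type is either "unix" (Linux and Mac) or "windows".
assert_os_type = function(expected, actual = .Platform$OS.type) {
  if (!identical(actual, expected)) {
    stop("This code requires a '", expected, "' system, not '", actual, "'")
  }
  invisible(TRUE)
}
# Runs silently when the expectation matches the current system
assert_os_type(.Platform$OS.type)
# A mismatch raises an error, which is caught here for demonstration
res = tryCatch(assert_os_type("windows", actual = "unix"),
               error = function(e) conditionMessage(e))
print(res)
```

The actual argument exists only so that the behavior can be demonstrated on any machine; in real use it would be left at its default.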
Operating System and Resource Monitoring
Minor differences aside, R’s computational efficiency is broadly the same across different
operating systems.² Beyond the 32-bit versus 64-bit issue (covered in Chapter 8) and process
forking (covered in Chapter 7), another OS-related issue to consider is external dependencies:
programs that R packages depend on. Sometimes external package dependencies must be
installed manually (i.e., not using install.packages()). This is especially common on
Unix-based systems (Linux and Mac). On Debian-based operating systems such as Ubuntu, many R
packages can be installed at the OS level to ensure that external dependencies are also
installed (see “Installing R Packages with Dependencies”).
Resource monitoring is the process of checking the status of key OS variables. For
computationally intensive work, it is sensible to monitor system resources in this way.
Resource monitoring can help identify computational bottlenecks. Alongside R profiling
functions such as profvis (see “Code Profiling”), system monitoring provides a useful tool
for understanding how R is performing in relation to variables reporting the OS state, such as
how much RAM is in use, which relates to the wider question of whether more is needed
(covered in Chapter 8).
CPU resources allocated over time is another common OS variable that is worth monitoring.
A basic use case is to check whether your code is running in parallel (see Figure 2-1), and
whether there is spare CPU capacity on the OS that could be harnessed by parallel code.
Figure 2-1. Output from a system monitor (gnome-system-monitor running on Ubuntu) showing the resources consumed by
running the code presented in the second of the Exercises at the end of this section. The first increases RAM use, the second
is single-threaded, and the third is multithreaded.
System monitoring is a complex topic that spills over into system administration and server
management. Fortunately, there are many tools designed to ease monitoring on all major
operating systems.
On Linux, the shell command top displays key resource use figures for most
distributions. htop and Gnome’s System Monitor (gnome-system-monitor; see Figure 2-1)
are more refined alternatives, which use command-line and graphical user interfaces,
respectively. A number of options, such as nethogs, monitor internet usage.
On Mac, the Activity Monitor provides similar functionality. This can be initiated from
the Utilities folder in Launchpad.
On Windows, the Task Manager provides key information on RAM and CPU use by
process. This can be started in modern Windows versions by pressing Ctrl-Alt-Del or by
right-clicking the taskbar and selecting Start Task Manager.
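These OS-level tools can be complemented by a quick look from inside R itself. The following is a minimal sketch using base R and the parallel package that ships with R; the object x is a throwaway example.

```r
# Size of a single object held in RAM
x = rnorm(1e6)                 # one million doubles
x_size = object.size(x)
print(x_size, units = "MB")    # roughly 8 MB (8 bytes per double)
# gc() triggers garbage collection and reports R's current memory use
mem = gc()
mem[, 1:2]                     # cells in use and the equivalent in Mb
# Number of logical CPU cores detected (may be NA on some platforms)
n_cores = parallel::detectCores()
n_cores
```

Comparing these figures with what the system monitor reports for the R process is a simple way to check that you are watching the right process.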
Exercises
1. What is the exact version of your computer’s operating system?
2. Start an activity monitor, then execute the following code chunk. In it, lapply() (or
its parallel version, mclapply()) is used to apply the function median() over every
column in the data frame object X (see “The Apply Family” for more on the apply
family of functions). The reason this works is that a data frame is really a list of
vectors, with each vector forming a column. How do the system output log results on
your system compare to those presented in Figure 2-1?
# Note: uses 2+ GB RAM and takes several seconds depending on hardware
# 1: Create large dataset
X = as.data.frame(matrix(rnorm(1e8), nrow = 1e7))
# 2: Find the median of each column using a single core
r1 = lapply(X, median)
# 3: Find the median of each column using many cores
r2 = parallel::mclapply(X, median)
NOTE
mclapply only works in parallel on Mac and Linux. In Chapter 7 you’ll learn about the equivalent
function parLapply() that works in parallel on Windows.
3. What do you notice regarding CPU usage, RAM, and system time during and after
each of the three operations?
4. Bonus question: how would the results change depending on operating system?
R Version
It is important to be aware that R is an evolving software project, whose behavior changes
over time. In general, base R is very conservative about making changes that break backwards
compatibility. However, packages occasionally change substantially from one release to the
next; typically it depends on the age of the package. For most use cases, we recommend
always using the most up-to-date version of R and packages so you have the latest code. In
some circumstances (e.g., on a production server or working in a team), you may
alternatively want to use specific versions that have been tested to ensure stability. Keeping
packages up-to-date is desirable because new code tends to be more efficient, intuitive, robust,
and feature-rich. This section explains how.
TIP
Previous R versions can be installed from CRAN’s archive of previous R releases. The binary versions for all
OSes can be found at cran.r-project.org/bin/. To download binary versions for Ubuntu Wily, for example, see
https://cran.r-project.org/bin/linux/ubuntu/wily/. To pin specific versions of R packages you can use the packrat
package. For more on pinning R versions and R packages, see the following articles on RStudio’s website:
Using-Different-Versions-of-R and rstudio.github.io/packrat/.
Installing R
The method of installing R varies for Windows, Linux, and Mac.
On Windows, a single .exe file (hosted at cran.r-project.org/bin/windows/base/) will install
the base R package.
On a Mac, the latest version should be installed by downloading the .pkg files hosted at
https://cran.r-project.org/bin/macosx/.
On Linux, the installation method depends on the distribution of Linux installed, though the
principles are the same. We’ll cover how to install R on Debian-based systems, with links at
the end for details on other Linux distributions. The first stage is to add the CRAN repository
to ensure that the latest version is installed. If you are running Ubuntu 16.04, for example,
append the following line to the file /etc/apt/sources.list:
deb http://cran.rstudio.com/bin/linux/ubuntu xenial/
http://cran.rstudio.com is the mirror (which can be replaced by any of those listed at
https://cran.r-project.org/mirrors.html) and xenial is the release. See the Debian and Ubuntu
installation pages on CRAN for further details.
Once the appropriate repository has been added and the system updated (e.g., with sudo
apt-get update), r-base and other r- packages can be installed using the apt system. The
following two commands, for example, would install the base R package (a barebones install)
and the package RCurl, which has an external dependency:
sudo apt-get install r-base # install base R
sudo apt-get install r-cran-rcurl # install the rcurl package
apt-cache search "^r-.*" | sort will display all R packages that can be installed from apt
in Debian-based systems. In Fedora-based systems, the equivalent command is yum list R-\*.
Typical output from the second command is illustrated in the following example:
The following extra packages will be installed:
  libcurl3-nss
The following NEW packages will be installed
  libcurl3-nss r-cran-rcurl
0 to upgrade, 2 to newly install, 0 to remove and 16 not to upgrade.
Need to get 699 kB of archives.
After this operation, 2,132 kB of additional disk space will be used.
Do you want to continue? [Y/n]
Further details are provided at https://cran.r-project.org/bin/linux/ for Debian, Redhat, and
Suse OSs. R also works on FreeBSD and other Unix-based systems.³
Once R is installed, it should be kept up-to-date.
Updating R
R is a mature and stable language, so well-written code in base R should work on most
versions. However, it is important to keep your R version relatively up-to-date for the
following reasons:
Bug fixes are introduced in each version, making errors less likely.
Performance enhancements are made from one version to the next, meaning your code
may run faster in later versions.
Many R packages only work on recent versions of R.
Release notes with details on each of these issues are hosted at
https://cran.r-project.org/src/base/NEWS. R release versions have three components
corresponding to major.minor.patch changes. Generally, two or three patches are released
before the next minor increment, and each patch is released roughly every three months. R 3.2,
for example, has consisted of three versions: 3.2.0, 3.2.1, and 3.2.2.
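The major.minor.patch structure can be inspected and compared from within R. The sketch below uses base R only; the minimum version of 3.2.0 in the guard is an arbitrary example, not a recommendation.

```r
# The running R version, as an object that supports comparisons
current = getRversion()
current
# Version numbers compare component by component, not as text
package_version("3.10.0") > package_version("3.9.1")   # TRUE
"3.10.0" > "3.9.1"                                     # FALSE: text comparison
# Guard code that needs a minimum R version
if (current < "3.2.0") {
  warning("This script was written for R >= 3.2.0")
}
# The major and minor.patch components are also stored as text
c(major = R.version$major, minor = R.version$minor)
```

Comparing version objects rather than strings avoids the classic mistake of treating 3.10 as older than 3.9.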
On Ubuntu-based systems, new versions of R should be automatically detected through
the software management system, and can be installed with apt-get upgrade.
On Mac, the latest version should be installed by the user from the .pkg files mentioned
previously.
On Windows, the installr package makes updating easy:
# check and install the latest R version
installr::updateR()
For information about changes to expect in the next version, you can subscribe to R’s NEWS
RSS feed. It’s a good way of keeping up-to-date.
Installing R Packages
Large projects may need several packages to be installed. In this case, the required packages
can be installed at once. Using the example of packages for handling spatial data, this can be
done quickly and concisely with the following code:
pkgs = c("raster", "leaflet", "rgeos") # package names
install.packages(pkgs)
In the previous code, all the required packages are installed with two lines rather than three,
which reduces typing. Note that we can now reuse the pkgs object to load them all:
inst = lapply(pkgs, library, character.only = TRUE) # load them
In the previous code, library(pkgs[i]) is executed for every package stored in the text string
vector. We use library() here instead of require() because the former produces an error if
the package is not available.
Loading all packages at the beginning of a script is good practice as it ensures that all
dependencies have been installed before time is spent executing code. Storing package names
in a character vector object such as pkgs is also useful because it allows us to refer back to
them again and again.
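One way to reuse the pkgs vector is to install only those packages that are not yet present. The following is a minimal sketch; missing_pkgs() is a hypothetical helper of our own, and the interactive() guard is an extra precaution rather than anything required by install.packages().

```r
pkgs = c("raster", "leaflet", "rgeos")   # package names, as above
# Hypothetical helper: return the packages in 'pkgs' that are not installed
missing_pkgs = function(pkgs) {
  installed = rownames(installed.packages())
  setdiff(pkgs, installed)
}
to_install = missing_pkgs(pkgs)
# Only contact CRAN when something is actually missing, and never
# install packages from an unattended (non-interactive) script
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

This keeps repeated runs of a setup script fast, because packages that are already installed are skipped.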
Installing R Packages with Dependencies
Some packages have external dependencies (i.e., they call libraries outside R). On Unix-like
systems, these are best installed onto the operating system, bypassing install.packages. This
will ensure that the necessary dependencies are installed and set up correctly alongside the R
package. On Debian-based distributions such as Ubuntu, for example, packages with names
starting with r-cran- can be searched for and installed as follows (see
https://cran.r-project.org/bin/linux/ubuntu/ for a list of these):
apt-cache search r-cran- # search for available cran Debian packages
sudo apt-get install r-cran-rgdal # install the rgdal package (with dependencies)
On Windows, the installr package helps manage and update R packages with system-level
dependencies. For example, the Rtools package for compiling C/C++ code on Windows can
be installed with the following command:
installr::install.rtools()
Updating R Packages
An efficient R setup will contain up-to-date packages. This can be done for all packages by
using:
update.packages()
The default for this function is for the ask argument to be set to TRUE, giving control over
what is downloaded onto your system. This is generally desirable because updating dozens of
large packages can consume a large proportion of available system resources.
TIP
To update packages automatically, you can add the line utils::update.packages(ask = FALSE) to the .Last
function in the .Rprofile startup file (see the next section for more on .Rprofile). Thanks to Richard Cotton for this
tip.
An even more interactive method for updating packages in R is provided by RStudio via
Tools → Check for Package Updates. Many such time-saving tricks are enabled by RStudio,
as described in “Installing and Updating RStudio”. Next (after the exercises), we take a look at
how to configure R using startup files.
Exercises
1. What version of R are you using? Is it the most up-to-date?
2. Do any of your packages need updating?
R Startup
Every time R starts, a couple of startup scripts are run by default, as documented in ?Startup.
This section explains how to customize these files, allowing you to save API keys or load
frequently used functions. Before learning how to modify these files, we’ll take a look at how
to ignore them, with R’s startup arguments. If you want to turn custom setup on, it’s useful to
be able to turn it off (e.g., for debugging).
TIP
Some of R’s startup arguments can be controlled interactively in RStudio. See the online help file Customizing
RStudio for more on this.
R Startup Arguments
A number of arguments that relate to startup can be appended to the R startup command (R in a
shell environment). The following are particularly important:
--no-environ and --no-init-file
Tell R not to read the .Renviron and .Rprofile startup files (described in the next section),
respectively.
--no-restore
Tells R not to load a file called .RData (the default name for R session files) that may be
present in the current working directory.
--no-save
Tells R not to ask the user if they want to save objects saved in RAM when the session is
ended with q().
Adding each of these will make R load slightly faster, and --no-save means that slightly less
user input is needed when you quit. R’s default setting of loading data from the last session
automatically is potentially problematic in this context. See Appendix B of An Introduction to
R for more startup arguments.
TIP
A concise way to load a vanilla version of R with all of the preceding options enabled is with an option of the
same name:
R --vanilla
An Overview of R’s Startup Files
Two files are read each time R starts (unless one of the command-line options outlined
previously is used):
.Renviron
The primary purpose of which is to set environment variables. These tell R where to find
external programs, and can hold user-specific information that needs to be kept secret,
typically API keys.
.Rprofile
A plain text file (which is always called .Rprofile, hence its name) that simply runs lines
of R code every time R starts. If you want R to check for package updates each time it
starts (as explained in the previous section), you simply add the relevant line somewhere
in this file.
When R starts (unless it was launched with --no-environ), it first searches for .Renviron and
then .Rprofile, in that order. Although .Renviron is searched for first, we will look at .Rprofile
first as it is simpler and, for many setup tasks, more frequently useful. Both files can exist in
three directories on your computer.
WARNING
Modification of R’s startup files should not be taken lightly. This is an advanced topic. If you modify your startup
files in the wrong way, it can cause problems: a seemingly innocent call to setwd() in .Rprofile, for example, will
break devtools build and check functions.
Proceed with caution and, if you mess things up, just delete the offending files!
The Location of Startup Files
Confusingly, multiple versions of startup files can exist on the same computer, only one of
which will be used per session. Note also that these files should only be changed with caution
and if you know what you are doing. This is because they can make your R version behave
differently than other R installations, potentially reducing the reproducibility of your code.
Files in three folders are important in this process:
R_HOME
The directory in which R is installed. The etc subdirectory can contain startup files read
early on in the startup process. Find out where your R_HOME is with the R.home()
command.
HOME
The user’s home directory. Typically, this is /home/username on Unix machines or
C:\Users\username on Windows (since Windows 7). Ask R where your home directory is
with Sys.getenv("HOME").
R’s current working directory
This is reported by getwd().
It is important to know the location of the .Rprofile and .Renviron setup files that are being
used out of these three options. R only uses one .Rprofile and one .Renviron in any session; if
you have an .Rprofile file in your current project, R will ignore .Rprofile in R_HOME and HOME.
Likewise, .Rprofile in HOME overrides .Rprofile in R_HOME. The same applies to .Renviron: you
should remember that adding project-specific environment variables with .Renviron will
deactivate other .Renviron files.
To create a project-specific startup script, simply create an .Rprofile file in the project’s root
directory and start adding R code (e.g., via file.edit(".Rprofile")). Remember that this
will make .Rprofile in the home directory be ignored. The following commands will open
your .Rprofile from within an R editor:
file.edit("~/.Rprofile") # edit .Rprofile in HOME
file.edit(".Rprofile") # edit project-specific .Rprofile
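To see at a glance which of these files exist on your system, the three locations can be checked in one go. This is a minimal sketch using base R; the names given to the vector elements are our own labels.

```r
# Candidate startup files, listed from site-wide to project-specific
candidates = c(
  site_profile    = file.path(R.home("etc"), "Rprofile.site"),
  site_environ    = file.path(R.home("etc"), "Renviron.site"),
  user_profile    = path.expand("~/.Rprofile"),
  user_environ    = path.expand("~/.Renviron"),
  project_profile = file.path(getwd(), ".Rprofile"),
  project_environ = file.path(getwd(), ".Renviron")
)
# TRUE where a file is present on this system
found = file.exists(candidates)
data.frame(file = names(candidates), path = unname(candidates),
           exists = found, row.names = NULL)
```

Running this from a project directory before and after creating a project-specific .Rprofile shows which file will take precedence.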
WARNING
File paths provided by Windows operating systems will not always work in R. Specifically, if you use a path that
contains single backslashes, such as C:\DATA\data.csv, as provided by Windows, this will generate the error:
Error: unexpected input in "C:\". To overcome this issue, R provides two functions, file.path() and
normalizePath(). The former can be used to specify file locations without having to use symbols to represent
relative file paths, as follows: file.path("C:", "DATA", "data.csv"). The latter takes any input string for a
filename and outputs a text string that is standard (canonical) for the operating system.
normalizePath("C:/DATA/data.csv"), for example, outputs C:\\DATA\\data.csv on a Windows machine but
C:/DATA/data.csv on Unix-based platforms. Note that only the latter would work on both platforms, so standard
Unix file path notation is safe for all operating systems.
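The behavior of these two functions can be checked directly. The sketch below uses base R only, and the path is the illustrative one from the warning above, so the file does not need to exist.

```r
# file.path() joins components with "/" on every platform
p = file.path("C:", "DATA", "data.csv")
p
#> [1] "C:/DATA/data.csv"
# Backslashes must be doubled inside R strings; here they are converted
# to forward slashes using fixed (non-regex) matching
win_path = "C:\\DATA\\data.csv"
unix_style = gsub("\\", "/", win_path, fixed = TRUE)
unix_style
#> [1] "C:/DATA/data.csv"
# normalizePath() returns the canonical form for the current OS;
# mustWork = FALSE avoids a warning or error if the file does not exist
normalizePath(p, winslash = "/", mustWork = FALSE)
```

Setting winslash = "/" asks normalizePath() to use forward slashes on Windows too, which keeps paths consistent across platforms.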
Editing the .Renviron file in the same locations will have the same effect. The following code
will create a user-specific .Renviron file (where API keys and other cross-project environment
variables can be stored) without overwriting any existing file.
user_renviron = path.expand(file.path("~", ".Renviron"))
file.edit(user_renviron) # open with another text editor if this fails
TIP
The pathological package can help find where .Rprofile and .Renviron files are located on your system, thanks
to the os_path() function. The output of example(Startup) is also instructive.
The location, contents, and uses of each are outlined in more detail in the next section.
The .Rprofile File
By default, R looks for and runs .Rprofile files in the three locations described previously, in
a specific order. .Rprofile files are simply R scripts that run each time R runs. They can be
found within R_HOME, HOME, and the project’s home directory (reported by getwd()). To check if
you have a sitewide .Rprofile, which will run for all users on startup, run:
site_path = R.home(component = "home")
fname = file.path(site_path, "etc", "Rprofile.site")
file.exists(fname)
The preceding code checks for the presence of Rprofile.site in that directory. As outlined
previously, the .Rprofile located in your home directory is user-specific. Again, we can test
whether this file exists using:
file.exists("~/.Rprofile")
We can use R to create and edit .Rprofile (warning: do not overwrite your previous .Rprofile;
we suggest you try project-specific .Rprofile first):
file.edit("~/.Rprofile")
Example .Rprofile File
Example 2-1 provides a taste of what goes into .Rprofile. Note that this is simply a usual R
script, but with an unusual name. The best way to understand what is going on is to create this
same script, save it as .Rprofile in your current working directory, and then restart your R
session to observe what changes. To restart your R session from within RStudio, you can
click Session → Restart R or use the keyboard shortcut Ctrl-Shift-F10.
Example 2-1. Example contents of .Rprofile
# A fun welcome message
message("Hi Robin, welcome to R")
# Customize the R prompt that prefixes every command
# (use " " for a blank prompt)
options(prompt = "R4geo> ")
Let’s quickly explain each line of code. The first simply prints a message in the console each
time a new R session is started. The second modifies the console prompt (set to > by
default). Note that simply adding more lines to the .Rprofile will set more features. An
important aspect of .Rprofile (and .Renviron) is that each line is run once and only once for
each R session. That means that the options set within .Rprofile can easily be changed during
the session. The following command run midsession, for example, will return the default
prompt:
options(prompt = "> ")
More details on these and other potentially useful .Rprofile options are described
subsequently. For more suggestions of useful startup settings, see examples in
help("Startup") and online resources such as those at statmethods.net. The help pages for R
options (accessible with ?options) are also worth a read before writing your own .Rprofile.
Ever been frustrated by unwanted + symbols that prevent copied and pasted multiline functions
from working? These potentially annoying +s can be eradicated by adding options(continue
= " ") to your .Rprofile.
Setting options
The function options used previously contains a number of default settings. Executing
options() provides a good indication of what can be configured. The settings that can be
configured with options() are often related to personal preference (with few implications for
reproducibility) so the .Rprofile in your home directory is a sensible place to set them if you
want them to be set for all your projects that have no project-specific .Rprofile file. Other
illustrative options are shown here:
# With a customized prompt
options(prompt = "R> ", digits = 4, show.signif.stars = FALSE, continue = " ")
# With a longer prompt and empty 'continue' indent (default is "+ ")
options(prompt = "R4Geo> ", digits = 3, continue = " ")
The first call changes four default options in a single line:
The R prompt, from the boring > to the exciting R>
The number of digits displayed
Removing the stars after significant p-values
Removing the + in multiline functions
Try to avoid adding options to the startup file that make your code nonportable. For example,
adding options(stringsAsFactors = FALSE) to your startup script has additional effects for
read.table() and related functions, including read.csv(), making them convert text strings
into characters rather than into factors, as was the default before R 4.0.0. This may be useful
for you, but it can also make your code less portable, so be warned.
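When experimenting with options during a session, it helps to know that options() hands back the previous values, so any change can be undone. The following is a minimal sketch using base R; the particular options changed are only examples.

```r
# options() invisibly returns the previous values of the options it changes
old = options(digits = 4, show.signif.stars = FALSE)
print(pi)
#> [1] 3.142
# Restore the earlier settings by passing the saved list back
options(old)
restored = identical(getOption("digits"), old$digits)
restored
#> [1] TRUE
```

The same pattern is useful inside functions and scripts that need a temporary setting without permanently altering the user's session.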
Setting the CRAN mirror
To avoid setting the CRAN mirror each time you run install.packages(), you can
permanently set the mirror in your .Rprofile.
# `local` creates a new, empty environment
# This avoids polluting .GlobalEnv with the object r
local({
  r = getOption("repos")
  r["CRAN"] = "https://cran.rstudio.com/"
  options(repos = r)
})
The RStudio mirror is a virtual machine run by Amazon’s EC2 service, and it syncs with the
main CRAN mirror in Austria once per day. Since RStudio is using Amazon’s CloudFront,
the repository is automatically distributed around the world, so no matter where you are in the
world, the data doesn’t need to travel very far, and is therefore fast to download.
The fortunes package
This section illustrates the power of .Rprofile customization with reference to a package that
was developed for fun. The following code could easily be altered to automatically connect to
a database, or to ensure that the latest packages have been downloaded.
The fortunes package contains a number of memorable quotes, called R fortunes, that the
community has collected over many years. Each fortune has a number. To get fortune number
50, for example, enter:
fortunes::fortune(50)
#>
#> To paraphrase provocatively, 'machine learning is statistics minus any
#> checking of models and assumptions'.
#>    -- Brian D. Ripley (about the difference between machine learning and
#>       statistics)
#>       useR! 2004, Vienna (May 2004)
It is easy to make R print out one of these nuggets of truth each time you start a session by
adding the following to .Rprofile:
if (interactive())
  try(fortunes::fortune(), silent = TRUE)
The interactive() function tests whether R is being used interactively in a terminal. The
fortune() function is called within try(). If the fortunes package is not available, we avoid
raising an error and move on. By using ::, we avoid adding the fortunes package to our list
of attached packages.
TIP
Typing search() gives the list of attached packages. By using fortunes::fortune(), we avoid adding the
fortunes package to that list. The function .Last(), if it exists in the .Rprofile, is always run at the end of the
session. We can use it to install the fortunes package if needed. To load the package, we use require(), because
if the package isn’t installed, the require() function returns FALSE and raises a warning.
.Last = function() {
  cond = suppressWarnings(!require(fortunes, quietly = TRUE))
  if (cond)
    try(install.packages("fortunes"), silent = TRUE)
  message("Goodbye at ", date(), "\n")
}
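An alternative way to test whether a package is installed, without attaching it to the search path, is requireNamespace(). The sketch below wraps it in a hypothetical helper, has_pkg(), which is our own name and not part of any package.

```r
# Hypothetical helper: TRUE if a package is installed, without attaching it
has_pkg = function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}
# Print a fortune only in interactive sessions where the package is present
if (interactive() && has_pkg("fortunes")) {
  print(fortunes::fortune())
}
has_pkg("stats")   # a base package, so this is TRUE
```

Because requireNamespace() returns FALSE quietly when a package is missing, no try() or suppressWarnings() wrapper is needed for the check itself.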
Useful functions
You can use .Rprofile to define new helper functions or redefine existing ones so that they’re
faster to type. For example, we could load the following two functions for examining data
frames:
# ht == headtail
ht = function(d, n = 6) rbind(head(d, n), tail(d, n))
# Show the first 5 rows & first 5 columns of a data frame
hh = function(d) d[1:5, 1:5]
and a function for setting a nice plotting window:
nice_par = function(mar = c(3, 3, 2, 1), mgp = c(2, 0.4, 0), tck = -0.01,
                    cex.axis = 0.9, las = 1, mfrow = c(1, 1), ...) {
  par(mar = mar, mgp = mgp, tck = tck, cex.axis = cex.axis, las = las,
      mfrow = mfrow, ...)
}
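To see the first helper in action, and one possible refinement of the second, consider the following sketch. It uses the built-in mtcars and iris datasets; hh_safe() is a hypothetical variant of our own, not a function from the text above.

```r
ht = function(d, n = 6) rbind(head(d, n), tail(d, n))
# First and last two rows of a built-in dataset
ht(mtcars, n = 2)
# hh() as defined above raises an error on data frames with fewer than
# five columns and pads with NA rows when there are fewer than five rows.
# A hypothetical, more defensive variant:
hh_safe = function(d, n = 5) {
  d[seq_len(min(n, nrow(d))), seq_len(min(n, ncol(d))), drop = FALSE]
}
hh_safe(iris[1:3, 1:2])   # 3 rows and 2 columns, no error
```

Whether the extra robustness is worth the longer definition is a matter of taste for a personal helper function.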
Note that these functions are for personal use and are unlikely to interfere with code from
other people. Loading packages is a different matter because it changes which functions your
scripts can see: even if you use a certain package every day, we don’t recommend loading it in
your .Rprofile. Shortening long function names for interactive use (but not for reproducible
code writing) is another option for using .Rprofile to increase efficiency. If
you frequently use View(), for example, you may be able to save time by referring to it in
abbreviated form. This is illustrated in the following line of code, which makes it faster to
view datasets (although with IDE-driven autocompletion, outlined in the next section, the time
savings are smaller).
v = utils::View
Also beware of the dangers of loading many functions by default as it may make your code
less portable. Another potentially useful setting to change in .Rprofile is R’s current working
directory. If you want R to automatically set the working directory to the R folder of your
project, for example, you would add the following line of code to the project-specific
.Rprofile (bearing in mind the earlier warning that a call to setwd() in .Rprofile can break
devtools build and check functions):
setwd("R")
Creating hidden environments with .Rprofile
Beyond making your code less portable, another downside of putting functions in your
.Rprofile is that it can clutter up your workspace: when you run the ls() command, your
.Rprofile functions will appear. Also, if you run rm(list = ls()), your functions will be
deleted. One neat trick to overcome this issue is to use hidden objects and environments.
When an object name starts with ., by default it doesn’t appear in the output of the ls()
function:
.obj = 1
".obj" %in% ls()
#> [1] FALSE
This concept also works with environments. In the .Rprofile file, we can create a hidden
environment:
.env = new.env()
And then add functions to this environment:
.env$ht = function(d, n = 6) rbind(head(d, n), tail(d, n))
At the end of the .Rprofile file, we use attach, which makes it possible to refer to objects in
the environment by their names alone:
attach(.env)
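The effect of attaching a hidden environment can be verified in a session. In the sketch below, the name argument and its value "my_helpers" are our own additions, used so that the entry is easy to spot on the search path and to detach afterward.

```r
.env = new.env()
.env$ht = function(d, n = 6) rbind(head(d, n), tail(d, n))
# 'name' labels the entry on the search path
attach(.env, name = "my_helpers")
ht_listed = "ht" %in% ls()    # FALSE in a fresh session: not in the workspace
ht_found = exists("ht")       # TRUE: found via the search path
on_path = "my_helpers" %in% search()
c(listed = ht_listed, found = ht_found, on_search_path = on_path)
# Remove the helpers from the search path when no longer needed
detach("my_helpers")
```

Because the function lives on the search path rather than in the workspace, it is untouched by rm(list = ls()).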
The .Renviron File
The .Renviron file is used to store environment variables. It follows a similar startup routine
to the .Rprofile file: R first looks for a global .Renviron file, then for local versions. A typical
use of the .Renviron file is to specify the R_LIBS path, which determines where new packages are
installed:
# Linux
R_LIBS=~/R/library
# Windows
R_LIBS=C:/R/library
After setting this, install.packages() saves packages in the directory specified by R_LIBS.
The location of this directory can be referred back to subsequently as follows:
Sys.getenv("R_LIBS")
All currently stored environment variables can be seen by calling Sys.getenv() with no
arguments. Note that many environment variables are already preset and do not need to be
specified in .Renviron. HOME, for example, which can be seen with Sys.getenv("HOME"), is
taken from the operating system’s list of environment variables. A list of the most important
environment variables that can affect R’s behavior is documented in the little-known help
page help("environment variables").
To set or unset an environment variable for the duration of a session, use the following
commands:
Sys.setenv("TEST" = "test-string") # set an environment variable for the session
Sys.unsetenv("TEST") # unset it
Another common use of .Renviron is to store API keys and authentication tokens that will be
available from one session to another.⁴ A common use case is setting the environment
variable GITHUB_PAT, which will be detected by the devtools package via the function
github_pat(). To take another example, the following line in .Renviron sets the ZEIT_KEY
environment variable, which is used in the diezeit package:
ZEIT_KEY=PUT_YOUR_KEY_HERE
You will need to sign in and start a new R session for the environment variable (accessed by
Sys.getenv()) to be visible. To test if the example API key has been successfully added as an
environment variable, run the following:
Sys.getenv("ZEIT_KEY")
Using the .Renviron file for storing settings such as library paths and API keys is efficient
because it reduces the need to update your settings for every R session. Furthermore, the same
.Renviron file will work across different platforms, so keep it stored safely.
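Code that relies on a key stored in .Renviron benefits from failing early, with a clear message, when the key is missing. The following is a minimal sketch; get_api_key() and the DEMO_KEY variable are hypothetical names used only for this demonstration.

```r
# Hypothetical helper: fetch a key from an environment variable and fail
# with a clear message when it has not been set
get_api_key = function(name) {
  key = Sys.getenv(name, unset = NA)
  if (is.na(key) || !nzchar(key)) {
    stop("Environment variable '", name, "' is not set; add it to .Renviron")
  }
  key
}
# Demonstration with a temporary variable set for this session only
Sys.setenv(DEMO_KEY = "test-string")
get_api_key("DEMO_KEY")
#> [1] "test-string"
Sys.unsetenv("DEMO_KEY")
```

The unset = NA argument lets the helper distinguish a variable that was never defined from one that is defined but empty; both cases are treated as errors here.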
Example .Renviron file
My .Renviron file has grown over the years. I often switch between my desktop and laptop
computers, so to maintain a consistent working environment, I have the same .Renviron file
on all of my machines. As well as containing an R_LIBS entry and some API keys, my
.Renviron has a few other lines:
TMPDIR=/data/R_tmp/
When R is running, it creates temporary copies. On my work machine, the default
directory is a network drive.
R_COMPILE_PKGS=3
Byte compile all packages (covered in Chapter 3).
R_LIBS_SITE=/usr/lib/R/site-library:/usr/lib/R/library
I explicitly state where to look for packages. My university has a sitewide directory that
contains outdated packages. I want to avoid using this directory.
R_DEFAULT_PACKAGES=utils,grDevices,graphics,stats,methods
Explicitly state the packages to load. Note that I don’t load the datasets package, but I
ensure that methods is always loaded. Due to historical reasons, the methods package
isn’t loaded by default in certain applications (e.g., Rscript).
Exercises
1. What are the three locations where the startup files are stored? Where are these
locations on your computer?
2. For each location, does a .Rprofile or .Renviron file exist?
3. Create a .Rprofile file in your current working directory that prints the message
Happy efficient R programming each time you start R at this location.
4. What happens to the startup files in R_HOME if you create them in HOME or local project
directories?
RStudio
RStudio is an IDE for R. It makes life easy for R users and developers with its intuitive and
flexible interface. RStudio encourages good programming practice. Through its wide range
of features, RStudio can help make you a more efficient and productive R programmer.
RStudio can, for example, greatly reduce the amount of time spent remembering and typing
function names thanks to intelligent autocompletion. Some of the most important features of
RStudio include:
- Flexible window pane layouts to optimize use of screen space and enable fast interactive
  visual feedback
- Intelligent autocompletion of function names, packages, and R objects
- A wide range of keyboard shortcuts
- Visual display of objects, including a searchable data display table
- Real-time code checking, debugging, and error detection
- Menus to install and update packages
- Project management and integration with version control
- Quick display of function source code and help documents
The preceding list of features should make it clear that a well set-up IDE can be as important
as a well set-up R installation for becoming an efficient R programmer.5 As with R itself, the
best way to learn about RStudio is by using it. It is therefore worth reading through this
section in parallel with using RStudio to boost your productivity.
Installing and Updating RStudio
RStudio is a mature, feature-rich, and powerful IDE optimized for R programming, which has
become popular among R developers. The Open Source Edition is completely open source
(as can be seen from the project’s GitHub repo). It can be installed on all major OSs from the
RStudio website.
If you already have RStudio and would like to update it, simply click Help → Check for
Updates in the menu. For fast and efficient work, keyboard shortcuts should be used wherever
possible, reducing the reliance on the mouse. RStudio has many keyboard shortcuts that will
help with this. To get into good habits early, try accessing the RStudio Update interface
without touching the mouse. On Linux and Windows, drop-down menus are activated with the
Alt key, so the menu item can be found with: Alt-H-U.
On Mac, it works differently. Cmd-? should activate a search across menu items, allowing the
same operation to be achieved with Cmd-? update.
NOTE
In RStudio, the keyboard shortcuts differ between Linux and Windows versions on one hand and Mac on the
other. In this section, we generally only use the Windows/Linux shortcut keys for brevity. The Mac equivalent is
usually found by simply replacing Ctrl and Alt with the Mac-specific Cmd button.
Window Pane Layout
RStudio has four main window panes (see Figure 2-2), each of which serves a range of
purposes:
The Source pane
For editing, saving, and dispatching R code to the console (top left). Note that this pane
does not exist by default when you start RStudio: it appears when you open an R script
(e.g., via File → New File → R Script). A common task in this pane is to send code on the
current line to the console, via Ctrl/Cmd-Enter.
The Console pane
Any code entered here is processed by R, line by line. This pane is ideal for interactively
testing ideas before saving the final results in the Source pane above.
The Environment pane (top right)
Contains information about the current objects loaded in the workspace, including their
class, dimension (if they are a data frame), and name. This pane also contains tabbed
subpanes with a searchable history that was dispatched to the console and (if applicable to
the project) Build and Git options.
The Files pane (bottom right)
Contains a simple file browser, a Plots tab, Help and Package tabs, and a Viewer for
visualizing interactive R output such as those produced by the leaflet package and HTML
widgets.
Figure 2-2. RStudio panels
Using each of the panels effectively and navigating between them quickly is a skill that will
develop over time, and will only improve with practice.
Exercises
You are developing a project to visualize data. Test out the multipanel RStudio workflow by
following these steps:
1. Create a new folder for the input data using the Files pane.
2. Type downl in the Source pane and hit Enter to make the function download.file()
autocomplete. Then type ", which will autocomplete to "", paste the URL of a file to
download (e.g., https://www.census.gov/2010census/csv/pop_change.csv) and a
file name (e.g., pop_change.csv).
3. Execute the full command with Ctrl-Enter:
download.file("https://www.census.gov/2010census/csv/pop_change.csv",
              "extdata/pop_change.csv")
4. Write and execute a command to read the data, such as
pop_change = read.csv("extdata/pop_change.csv", skip = 2)
5. Use the Environment pane to click on the data object pop_change. Note that this runs
the command View(pop_change), which launches a data viewing tab in the top left
panel, for interactively exploring data frames (see Figure 2-3).
Figure 2-3. The data viewing tab in RStudio
6. Use the console to test different plot commands to visualize the data, saving the code
you want to keep back into the Source pane as pop_change.R.
7. Use the Plots tab in the Files pane to scroll through past plots. Save the best using the
Export drop-down button.
The previous example shows how understanding of these panes and how to use them
interactively can help with the speed and productivity of your R programming. Further, there
are a number of RStudio settings that can help ensure that it works for your needs.
RStudio Options
A range of project options and global options are available in RStudio from the Tools menu
(accessible in Linux and Windows from the keyboard via Alt-T). Most of these are
self-explanatory, but it is worth mentioning a few that can boost your programming efficiency:
- GIT/SVN project settings allow RStudio to provide a graphical interface to your
  version-control system, described in Chapter 9.
- R version settings allow RStudio to point to different R versions/interpreters, which may
  be faster for some projects.
- Restore .RData: untick this default to prevent loading previously created R objects. This
  will make R start more quickly and also reduce the chance of bugs due to previously
  created objects. For this reason, we recommend you untick this box.
- Code-editing options can make RStudio adapt to your coding style, for example, by
  preventing the autocompletion of braces, which some experienced programmers may
  find annoying. Enabling Vim mode makes RStudio act as a (partial) Vim emulator.
- Diagnostic settings can make RStudio more efficient by adding additional diagnostics or
  by removing diagnostics if they are slowing down your work. This may be an issue for
  people using RStudio to analyze large datasets on older low-spec computers.
- Appearance: if you are struggling to see the source code, changing the default font size
  may make you a more efficient programmer by reducing the time overhead associated
  with squinting at the screen. Other options in this area relate more to aesthetics. Settings
  such as font type and background color are also important because feeling comfortable
  in your programming environment can boost productivity. Go to Tools → Global
  Options to modify these.
Autocompletion
R provides some basic autocompletion functionality. Typing the beginning of a function
name, such as rn (short for rnorm()), and pressing the Tab key twice will result in the full
function names associated with this text string being printed. In this case, two options would
be displayed: rnbinom and rnorm, providing a useful reminder to the user about what is
available. The same applies to filenames enclosed in quotation marks: typing "te in the console
in a project that contains a file called test.R should result in the full name "test.R" being
autocompleted. RStudio builds on this functionality and takes it to a new level.
NOTE
The default settings for autocompletion in RStudio work well. They are intuitive and are likely to work for many
users, especially beginners. However, RStudio’s autocompletion options can be modified by navigating to Tools
→ Global Options → Code → Completion in RStudio’s top-level menu.
Instead of only autocompleting options when Tab is pressed, RStudio autocompletes them at
any point. Building on the previous example, RStudio’s autocompletion triggers when the
first three characters are typed: rno. The same functionality works when only the first
characters are typed, followed by Tab: automatic autocompletion does not replace Tab
autocompletion but supplements it. Note that in RStudio, two more options are provided to the
user after entering rn and pressing the Tab key compared with entering the same text into base
R’s console described in the previous paragraph: RNGkind and RNGversion. This illustrates
that RStudio’s autocompletion functionality is not case-sensitive in the same way that R is.
This is a good thing because R has no consistent function name style!
RStudio also has more intelligent autocompletion of objects and filenames than R’s built-in
command line. To test this functionality, try typing US, followed by the Tab key. After pressing
down until USArrests is selected, press Enter so it autocompletes. Finally, typing $ should
leave the following text on the screen and the four columns should be shown in a drop-down
box, ready for you to select the variable of interest with the down arrow.
USArrests$ # a drop-down menu of columns should appear in RStudio
To take a more complex example, variable names stored in the data slot of the class
SpatialPolygonsDataFrame (a class defined by the foundational spatial package sp) are
referred to in the long form spdf@data$varname.6 In this case, spdf is the object name, data is
the slot, and varname is the variable name. RStudio makes such S4 objects easier to use by
enabling autocompletion of the short form spdf$varname. Another example is RStudio’s
ability to find files hidden away in subfolders. Typing "te will find test.R even if it is located
in a subfolder such as R/test.R. There are a number of other clever autocompletion tricks that
can boost your productivity when using RStudio, which are best found by experimenting and
pressing the Tab key frequently during your R programming work.
Keyboard Shortcuts
RStudio has many useful shortcuts that can help make your programming more efficient by
reducing the need to reach for the mouse and point and click your way around code and
RStudio. These can be viewed by using a little known but extremely useful keyboard shortcut
(this can also be accessed via the Tools menu): Alt-Shift-K.
This will display the default shortcuts in RStudio. It is worth spending time identifying which
of these could be useful in your work and practicing interacting with RStudio rapidly with
minimal reliance on the mouse. The power of these shortcuts can be further
enhanced by setting your own keyboard shortcuts. However, as with setting .Rprofile and
.Renviron settings, this risks reducing the portability of your workflow.
Some more useful shortcuts are listed here. There are many more gems to find that could
boost your R writing productivity:
Ctrl-Z/Ctrl-Shift-Z
Undo/Redo
Ctrl-Enter
Execute the current line or code selection in the Source pane
Ctrl-Alt-R
Execute all the R code in the currently open file in the Source pane
Ctrl-Left/Right
Navigate code quickly, word by word
Home/End
Navigate to the beginning/end of the current line
Alt-Shift-Up/Down
Duplicate the current line up or down
Ctrl-D
Delete the current line
To set your own RStudio keyboard shortcuts, navigate to Tools → Modify Keyboard
Shortcuts.
Object Display and Output Table
It is useful to know what is in your current R environment. This information can be revealed
with ls(), but this function only provides object names. RStudio provides an efficient
mechanism to show currently loaded objects and their details in real-time: the Environment
tab in the top-right corner. It makes sense to keep an eye on which objects are loaded and to
delete objects that are no longer useful. Doing so will minimize the probability of confusion
in your workflow (e.g., by using the wrong version of an object) and reduce the amount of
RAM R needs. The details provided in the Environment tab include the object’s dimension and
some additional details depending on the object’s class (e.g., size in MB for large datasets).
A very useful feature of RStudio is its advanced viewing functionality. This is triggered either
by executing View(object) or by double-clicking on the object name in the Environment tab.
Although you cannot edit data in the Viewer (this should be considered a good thing from a
data integrity perspective), recent versions of RStudio provide an efficient search mechanism
to rapidly filter and view the records that are of most interest (see Figure 2-3).
Project Management
In the far top-right of RStudio there is a diminutive drop-down menu illustrated with R inside
a transparent box. This menu may be small and simple, but it is hugely efficient in terms of
organizing large, complex, and long-term projects.
The idea of RStudio projects is that the bulk of R programming work is part of a wider task,
which will likely consist of input data, R code, graphical and numerical outputs, and
documents describing the work. It is possible to scatter each of these elements at random
across your hard disks, but this is not recommended. Instead, the concept of projects
encourages reproducible working, such that anyone who opens the particular project folder
that you are working from should be able to repeat your analyses and replicate your results.
It is therefore highly recommended that you use projects to organize your work. It could save
hours in the long run. Organizing data, code, and outputs also makes sense from a portability
perspective: if you copy the folder (e.g., via GitHub), you can work on it from any computer
without worrying about having the right files on your current machine. These tasks are
implemented using RStudio’s simple project system, in which the following things happen
every time you open an existing project:
- The working directory automatically switches to the project’s folder. This enables data
  and script files to be referred to using relative file paths, which are much shorter than
  absolute file paths. This means that switching directories using setwd(), a common
  source of error for R users, is rarely, if ever, needed.
- The last previously open file is loaded into the Source pane. The history of R commands
  executed in previous sessions is also loaded into the History tab. This assists with
  continuity between one session and the next.
- The Files tab displays the associated files and folders in the project, allowing you to
  quickly find your previous work.
- Any settings associated with the project, such as Git settings, are loaded. This assists with
  collaboration and project-specific setup.
Each project is different, but most contain input data, R code, and outputs. To keep things tidy,
we recommend a subdirectory structure resembling the following:
project/
  - README.Rmd # Project description
  - set-up.R   # Required packages
  - R/         # For R code
  - input/     # Data files
  - graphics/
  - output/    # Results
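The layout above can also be created from R itself. The following is a minimal sketch using only base R; it builds the skeleton inside a temporary directory so that it cannot overwrite existing work, and the choice of tempdir() is an assumption made purely for this illustration:

```r
# Sketch: create the suggested project skeleton in a temporary directory
project = file.path(tempdir(), "project")
sub_dirs = c("R", "input", "graphics", "output")

for (d in sub_dirs)
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)

# Create empty placeholder files for the description and set-up script
file.create(file.path(project, c("README.Rmd", "set-up.R")))

list.files(project) # inspect the result
```

For a real project, replace tempdir() with the folder in which the project should live.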
Proper use of projects ensures that all R source files are neatly stashed in one folder with a
meaningful structure. This way, data and documentation can be found where one would expect
them. Under this system, figures and project outputs are first-class citizens within the project’s
design, each with their own folder.
Another approach to project management is to treat projects as R packages. This is not
recommended for most use cases, as it places restrictions on where you can put files.
However, if the aim is code development and sharing, creating a small R package may be the
way forward, even if you never intend to submit it on CRAN. Creating R packages is easier
than ever before, as documented in Learning R by Richard Cotton (O’Reilly) and, more
recently, in R Packages by Hadley Wickham (O’Reilly). The devtools package helps manage
R’s quirks, making the process much less painful. If you use GitHub, the advantage of this
approach is that anyone should be able to reproduce your work using
devtools::install_github("username/projectname"), although the administrative overhead
of creating an entire package for each small project will outweigh the benefits for many.
Note that a set-up.R or even a .Rprofile file in the project’s root directory enables
project-specific settings to be loaded each time people work on the project. As described in the
previous section, .Rprofile can be used to tweak how R works at startup. It is also a portable
way to manage R’s configuration on a project-by-project basis.
Another capability that RStudio has is excellent debugging support. Rather than re-invent the
wheel, I would like to direct interested readers to the RStudio website.
Exercises
1. Try modifying the look and appearance of your RStudio setup.
2. What is the keyboard shortcut to show the other shortcuts? (Hint: it begins with
Alt-Shift on Linux and Windows.)
3. Try as many of the shortcuts revealed by the previous step as you like. Write down
the ones that you think will save you time, perhaps on a Post-it note to go on your
computer.
BLAS and Alternative R Interpreters
In this section, we cover a few system-level options available to speed up R’s performance.
Note that for many applications, stability rather than speed is a priority, so these should only
be considered if a) you have exhausted options for writing your R code more efficiently and
b) you are confident tweaking system-level settings. This should therefore be seen as an
advanced section: if you are not interested in speeding up base R, feel free to skip to the next
section.
Many statistical algorithms manipulate matrices. R uses the Basic Linear Algebra Subprograms
(BLAS) framework for linear algebra operations. Whenever we carry out a matrix operation,
such as transpose or finding the inverse, we use the underlying BLAS library. By switching to
a different BLAS library, it may be possible to speed up your R code. Changing your BLAS
library is straightforward if you are using Linux, but can be tricky for Windows users.
The two open source alternative BLAS libraries are ATLAS and OpenBLAS. The Intel MKL
is another implementation, designed for Intel processors by Intel and used in Revolution R
(described in the next section), but it requires licensing fees. The MKL library is provided
with the Revolution analytics system. Depending on your application, by switching your
BLAS library, linear algebra operations can run several times faster than with the base BLAS
routines.
If you use Linux, you can find whether you have a BLAS library setting with the following
function, from benchmarkme:
library("benchmarkme")
get_linear_algebra()
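If benchmarkme is not installed, base R offers a rough alternative. The sketch below reports the LAPACK version with La_version() and times a single matrix cross-product; the matrix size of 500 is an arbitrary choice for this illustration, and the timing is only indicative. In recent versions of R, sessionInfo() also prints the paths of the BLAS and LAPACK libraries in use.

```r
# Report the LAPACK version that R is linked against (base R)
La_version()

# Time a matrix cross-product; rerun after switching BLAS libraries to compare
set.seed(1)
m = matrix(rnorm(500 * 500), nrow = 500)
timing = system.time(cp <- crossprod(m))
timing["elapsed"]
```

Running the same lines before and after changing the BLAS library gives a quick, informal impression of any speed-up for this one operation.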
Testing Performance Gains from BLAS
As an illustrative test of the performance gains offered by BLAS, the following test was run
on a new laptop running Ubuntu 15.10 on a sixth-generation Core i7 processor, before and
after OpenBLAS was installed.7
res = benchmark_std() # run a suite of tests to test R's performance
It was found that the installation of OpenBLAS led to a two-fold speed-up (from around 150
to 70 seconds). The majority of the speed gain was from the matrix algebra tests, as can be
seen in Figure 2-4. Note that the results of such tests are highly dependent on the
particularities of each computer. However, it clearly shows that programming benchmarks
(e.g., the calculation of 3,500,000 Fibonacci numbers) are not much faster, whereas matrix
calculations and functions receive a substantial speed boost. This demonstrates that the
speed-up you can expect from BLAS depends heavily on the type of computations you are
undertaking.
Figure 2-4. Performance gains obtained by changing the underlying BLAS library (tests from benchmark_std())
Other Interpreters
The R language can be separated from the R interpreter. The former refers to the meaning of
R commands, and the latter refers to how the computer executes the commands. Alternative
interpreters have been developed to try to make R faster and, while promising, none of the
following options has fully taken off.
- Microsoft R Open, formerly known as Revolution R Open (RRO), is the enhanced
  distribution of R from Microsoft. The key enhancement is that it uses multithreaded
  mathematics libraries, which can improve performance.
- Rho (previously called CXXR, short for C++), a reimplementation of the R interpreter
  for speed and efficiency. Of the new interpreters, this is the one that has the most recent
  development activity (as of April 2016).
- pqR (pretty quick R) is a new version of the R interpreter. One major downside is that it
  is based on R-2.15.0. The developer (Radford Neal) has made many improvements, some
  of which have now been incorporated into base R. pqR is an open source project
  licensed under the GPL. One notable improvement in pqR is that it is able to do some
  numeric computations in parallel with each other, and with other operations of the
  interpreter, on systems with multiple processors or processor cores.
- Renjin reimplements the R interpreter in Java, so it can run on the Java Virtual Machine
  (JVM). Since R will be pure Java, it can run anywhere.
- Tibco created a C++ based interpreter called TERR.
- Oracle also offers an R interpreter that uses Intel’s mathematics library and therefore
  achieves higher performance without changing R’s core.
At the time of writing, switching interpreters is something to consider carefully. But in the
future, it may become more routine.
Useful BLAS/Benchmarking Resources
- The gcbd package benchmarks performance of a few standard linear algebra operations
  across a number of different BLAS libraries as well as a GPU implementation. It has an
  excellent vignette summarizing the results.
- Brett Klamer provides a nice comparison of ATLAS, OpenBLAS, and Intel MKL BLAS
  libraries. He also gives a description of how to install the different libraries.
- The official R manual section on BLAS.
Exercise
1. What BLAS system is your version of R using?
References
Cotton, Richard. 2013. Learning R. O’Reilly Media.
Wickham, Hadley. 2015c. R Packages. O’Reilly Media.
1 All CRAN packages are automatically tested on these systems, in addition to Solaris. R has also been reported to run on
more exotic operating systems, including those used in smartphones and game consoles (Peng 2014).
2 Benchmarking conducted for the presentation “R on Different Platforms” at useR! 2006 found that R was marginally faster
on Windows than on Linux setups. Similar results were reported in an academic paper, with R completing statistical analyses
faster on a Linux than on a Mac (Sekhon 2006). In 2015 Revolution R supported these results with slightly faster run times
for certain benchmarks on Ubuntu than Mac systems. The data from the benchmarkme package also suggests that running
code under the Linux OS is marginally faster.
3 See Jason French’s “Installing R in Linux” for more information on installing R on a variety of Linux distributions.
4 See vignette("api-packages") from the httr package for more on this.
5 Other open source R IDEs exist, including RKWard, Tinn-R, and JGR. emacs is another popular software environment.
However, it has a very steep learning curve.
6 Slots are elements of an object (specifically, S4 objects) analogous to a column in a data.frame but referred to with @ not
$.
7 OpenBLAS was installed on the computer via sudo apt-get install libopenblas-base, which is automatically detected
and used by R.
Chapter 3. Efficient Programming
Many people who use R would not describe themselves as programmers. Instead, they tend to
have advanced domain-level knowledge and understand standard R data structures such as
vectors and data frames, but have little formal training in computing. Sound familiar? In that
case, this chapter is for you.
In this chapter, we will discuss “big picture” programming techniques. We cover general
concepts and R programming techniques about code optimization, before describing
idiomatic programming structures. We conclude the chapter by examining relatively easy
ways of speeding up code using the compiler package and parallel processing using multiple
CPUs.
Prerequisites
In this chapter, we introduce two new packages, compiler and memoise. The compiler package
comes with R, so it will already be installed.
library("compiler")
library("memoise")
We also use the pryr and microbenchmark packages in the exercises.
Top Five Tips for Efficient Programming
1. Be careful never to grow vectors.
2. Vectorize code whenever possible.
3. Use factors when appropriate.
4. Avoid unnecessary computation by caching variables.
5. Byte compile packages for an easy performance boost.
General Advice
Low-level languages like C and Fortran demand more from the programmer. They force you
to declare the type of every variable used, give you the burdensome responsibility of memory
management, and have to be compiled. The advantage of such languages, compared with R, is
that they are faster to run. The disadvantage is that they take longer to learn and cannot be run
interactively.
NOTE
The Wikipedia page on compiler optimizations gives a nice overview of standard optimization techniques.
R users don’t tend to worry about data types. This is advantageous in terms of creating
concise code, but can result in R programs that are slow. While optimizations such as going
parallel can double speed, poor code can easily run hundreds of times slower, so it’s
important to understand the causes of slow code. These are covered in The R Inferno by
Patrick Burns (Lulu.com), which should be considered essential reading for any aspiring R
programmer.
Ultimately, calling an R function always ends up calling some underlying C/Fortran code. For
example, the base R function runif() only contains a single line that consists of a call to
C_runif().
function(n, min = 0, max = 1)
  .Call(C_runif, n, min, max)
A golden rule in R programming is to access the underlying C/Fortran routines as quickly as
possible; the fewer function calls required to achieve this, the better. For example, suppose x
is a standard vector of length n. Then
x = x + 1
involves a single function call to the + function. Whereas the for loop
for (i in seq_len(n))
  x[i] = x[i] + 1
has
- n function calls to +
- n function calls to the [ function
- n function calls to the [<- function (used in the assignment operation)
- A function call to for and to the seq_len() function
It isn’t that the for loop is slow; rather it is because we have many more function calls. Each
individual function call is quick, but the total combination is slow.
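The difference can be seen with base R's system.time(). This is a rough sketch rather than a careful benchmark; the vector length of 1e5 is an arbitrary choice for this illustration, and the timings will vary between machines and R versions:

```r
n = 1e5
x = runif(n)

# Vectorized: a single call to +
system.time(y1 <- x + 1)

# Loop: n calls each to +, [ and [<-
y2 = x
system.time(for (i in seq_len(n)) {
  y2[i] = y2[i] + 1
})

identical(y1, y2) # check that both approaches give the same values
```

For more reliable timings, the microbenchmark package repeats each expression many times.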
NOTE
Everything in R is a function call. When we execute 1 + 1, we are actually executing `+`(1, 1).
Exercise
1. Use the microbenchmark package to compare the vectorized construct x = x + 1 to
the for loop version. Try varying the size of the input vector.
Memory Allocation
Another general technique is to be careful with memory allocation. If possible, pre-allocate
your vector and then fill in the values.
TIP
You should also consider preallocating memory for data frames and lists. Never grow an object. A good rule of
thumb is to compare your objects before and after a for loop; have they increased in length?
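The same idea applies to lists. The following sketch contrasts a list that grows inside a loop with one created at its final length using vector("list", n); the value n = 1000 and the squared values stored in the list are arbitrary choices for this illustration:

```r
n = 1000

# Growing: the list starts empty and is extended on every iteration
grown = list()
for (i in seq_len(n))
  grown[[i]] = i^2

# Preallocating: create a list of the final length, then fill it in
filled = vector("list", n)
length_before = length(filled)
for (i in seq_len(n))
  filled[[i]] = i^2

length(filled) == length_before # the loop did not change the length
identical(grown, filled)        # both lists hold the same values
```

Comparing the length before and after the loop is the rule of thumb from the tip: the preallocated list keeps its length, whereas the grown list does not.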
Let’s consider three methods of creating a sequence of numbers. Method 1 creates an empty
vector and gradually increases (or grows) the length of the vector:
method1 = function(n) {
  vec = NULL # Or vec = c()
  for (i in seq_len(n))
    vec = c(vec, i)
  vec
}
Method 2 creates an object of the final length and then changes the values in the object by
subscripting:
method2 = function(n) {
  vec = numeric(n)
  for (i in seq_len(n))
    vec[i] = i
  vec
}
Method 3 directly creates the final object:
method3 = function(n) seq_len(n)
To compare the three methods, we use the microbenchmark() function from the previous
chapter:
microbenchmark(times = 100, unit = "s",
               method1(n), method2(n), method3(n))
Table 3-1 shows the timing in seconds on my machine for these three methods for a selection
of values of n. The relationships for varying n are all roughly linear on a log-log scale, but
the timings between methods are drastically different. Notice that the timings are no longer
trivial. When $n = 10^7$, method 1 takes around an hour whereas method 2 takes two seconds and
method 3 is almost instantaneous. Remember the golden rule: access the underlying C/Fortran
code as quickly as possible.
Table 3-1. Time in seconds to create sequences. When $n = 10^7$, method 1 takes around an
hour while the other methods take less than three seconds.
n        Method 1   Method 2   Method 3
$10^5$       0.21       0.02       0.00
$10^6$      25.50       0.22       0.00
$10^7$    3827.00       2.21       0.00
Vectorized Code
NOTE
Technically x = 1 creates a vector of length 1. In this section, we use vectorized to indicate that functions work
with vectors of all lengths.
Recall the golden rule in R programming: access the underlying C/Fortran routines as
quickly as possible; the fewer function calls required to achieve this, the better. With this
in mind, many R functions are vectorized; that is, the function’s inputs and/or outputs naturally
work with vectors, reducing the number of function calls required. For example, the code
x = runif(n) + 1
performs two vectorized operations. First, runif() returns n random numbers. Second, we
add 1 to each element of the vector. In general, it is a good idea to exploit vectorized
functions. Consider this piece of R code that calculates the sum of log(x):
log_sum = 0
for (i in 1:length(x))
  log_sum = log_sum + log(x[i])
WARNING
Using 1:length(x) can lead to hard-to-find bugs when x has length zero. Instead, use seq_along(x) or
seq_len(length(x)).
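The zero-length case is easy to check directly. This short sketch uses an empty numeric vector and counts how many times a loop body would run with each construct; the counter names are introduced only for this illustration:

```r
x = numeric(0)

1:length(x)   # the sequence 1, 0: a loop over it would run twice
seq_along(x)  # an empty sequence: a loop over it would not run

# Count iterations with each construct
bad_iterations = 0
for (i in 1:length(x))
  bad_iterations = bad_iterations + 1

good_iterations = 0
for (i in seq_along(x))
  good_iterations = good_iterations + 1

c(bad = bad_iterations, good = good_iterations)
```

Inside the first loop, x[i] would be evaluated for i = 1 and i = 0, neither of which is a valid element of an empty vector.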
This code could easily be vectorized via
log_sum = sum(log(x))
Writing code this way has a number of benefits:
- It’s faster. When $n = 10^7$ the R way is about 40 times faster.
- It’s neater.
- It doesn’t contain a bug when x is of length 0.
As with the general example in “General Advice”, the slowdown isn’t due to the for loop.
Instead, it’s because there are many more function calls.
Exercises
1. Time the two methods for calculating the log sum.
2. What happens when length(x) = 0 (i.e., we have an empty vector)?
Example: Monte Carlo integration
It’s also important to make full use of R functions that use vectors. For example, suppose we
wish to estimate the integral $\int_0^1 x^2 \, dx$ using a Monte Carlo method. Essentially, we throw darts
at the curve and count the number of darts that fall below the curve (as in Figure 3-1).
Monte Carlo integration
1. Initialize: hits = 0
2. for i in 1:N
   a. Generate two random numbers, $U_1$, $U_2$, between 0 and 1
   b. If $U_2 < U_1^2$, then hits = hits + 1
3. end for
4. Area estimate = hits/N
Implementing this Monte Carlo algorithm in R would typically lead to something like:
monte_carlo = function(N) {
  hits = 0
  for (i in seq_len(N)) {
    u1 = runif(1)
    u2 = runif(1)
    if (u1^2 > u2)
      hits = hits + 1
  }
  return(hits / N)
}
In R, this takes a few seconds:
N = 500000
system.time(monte_carlo(N))
#>    user  system elapsed
#>   2.828   0.008   2.842
In contrast, a more R-centric approach would be:
monte_carlo_vec = function(N) mean(runif(N)^2 > runif(N))
The monte_carlo_vec() function contains (at least) four aspects of vectorization:
- The runif() function call is now fully vectorized.
- We raise entire vectors to a power via ^.
- Comparisons using > are vectorized.
- Using mean() is quicker than an equivalent for loop.
The function monte_carlo_vec() is around 30 times faster than monte_carlo().
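As a sanity check on the vectorized version, the estimate can be compared with the exact value of the integral, which is 1/3. This is a small sketch; the seed and the number of darts (1e5) are arbitrary choices made for this illustration so that the result is reproducible:

```r
monte_carlo_vec = function(N) mean(runif(N)^2 > runif(N))

set.seed(1) # fixed seed so that the sketch is reproducible
estimate = monte_carlo_vec(1e5)

estimate              # Monte Carlo estimate of the area
abs(estimate - 1 / 3) # absolute error relative to the exact value
```

Increasing N reduces the typical size of the error, at the cost of generating more random numbers.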
Figure 3-1. Example of Monte Carlo integration. To estimate the area under the curve, throw random points at the graph
and count the number of points that lie under the curve.
Exercise
1. Verify that monte_carlo_vec() is faster than monte_carlo(). How does this relate to
the number of darts (i.e., the size of N) that is used?
Communicating with the User
When we create a function, we often want the function to give efficient feedback on the
current state. For example, are there missing arguments or has a numerical calculation failed?
There are three main techniques for communicating with the user.
Fatal Errors: stop()
Fatal errors are raised by calling stop() (i.e., execution is terminated). When stop() is called,
there is no way for a function to continue. For instance, when we generate random numbers
using rnorm(), the first argument is the sample size, n. If the number of observations to return
is negative, an error is raised. When we need to raise an error, we should do so as quickly
as possible; otherwise, it’s a waste of resources. Hence, the first few lines of a function
typically perform argument checking.
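Argument checking at the top of a function might look like the following sketch. The function sample_mean() and its error message are hypothetical and exist only for this illustration:

```r
# Hypothetical function: validate the argument first, then do the work
sample_mean = function(n) {
  if (!is.numeric(n) || length(n) != 1 || n < 1)
    stop("n must be a single number greater than or equal to 1")
  mean(rnorm(n))
}

sample_mean(10)                              # valid input: returns a number
result = try(sample_mean(-5), silent = TRUE) # invalid input: error raised at once
inherits(result, "try-error")
```

Because the check comes first, no random numbers are generated when the input is invalid.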
Suppose we call a function that raises an error. What then? Efficient, robust code catches the
error and handles it appropriately. Errors can be caught using try() and tryCatch(). For
example,
# Suppress the error message
good = try(1 + 1, silent = TRUE)
bad = try(1 + "1", silent = TRUE)
When we inspect the objects, the variable good just contains the number 2:
good
#> [1] 2
However, the bad object is a character string with class try-error and a condition attribute
that contains the error message:
bad
#> [1] "Error in 1 + \"1\" : non-numeric argument to binary operator\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in 1 + "1": non-numeric argument to binary operator>
We can use this information in a standard conditional statement:
if (inherits(bad, "try-error")) {
  # Do something
}
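tryCatch() allows the handling code to sit next to the call that may fail. In this minimal sketch, safe_add() is a hypothetical wrapper that returns NA and emits a message instead of stopping:

```r
# Hypothetical wrapper: return NA (with a message) when the addition fails
safe_add = function(a, b) {
  tryCatch(
    a + b,
    error = function(e) {
      message("Caught: ", conditionMessage(e))
      NA
    }
  )
}

safe_add(1, 1)   # no error: the sum is returned
safe_add(1, "1") # error caught: a message is shown and NA is returned
```

Whether returning NA is the right response depends on the application; in other cases it may be better to re-raise the error with a more informative message.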
Further details on error handling, as well as some excellent advice on general debugging
techniques, are given in Advanced R by Hadley Wickham (CRC Press).
Warnings: warning()
Warnings are generated using the warning() function. When a warning is raised, it indicates
potential problems. For example, mean(NULL) returns NA and also raises a warning.
When we come across a warning in our code, it is important to solve the problem and not just
ignore the issue. While ignoring warnings saves time in the short term, warnings can often
mask deeper issues that have crept into our code.
WARNING
Warnings can be hidden using suppressWarnings().
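Rather than hiding a warning, it can be handled explicitly. The sketch below catches the warning from mean(NULL) with tryCatch() and, for contrast, shows suppressWarnings() applied to a conversion whose warning is understood and expected; the variable names are chosen only for this illustration:

```r
# Handle the warning explicitly and decide what to return
res = tryCatch(
  mean(NULL),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    NA_real_
  }
)

# Suppress a warning only when its cause is understood and expected
quiet = suppressWarnings(as.numeric("not a number"))
```

In the first case the code records that something unusual happened; in the second the warning is simply discarded, which is why suppression should be used sparingly.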
Informative Output: message() and cat()
To give informative output, use the message() function. For example, in the poweRlaw
package, the message() function is used to give the user an estimate of expected run time.
Providing a rough estimate of how long the function takes allows the user to optimize their
time. Similar to warnings, messages can be suppressed with suppressMessages().
Another function used for printing messages is cat(). In general, cat() should only be used
in print()/show() methods. For example, look at the function definition of the S3 print
method for difftime objects: getS3method("print", "difftime").
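One practical reason to prefer message() is that the caller stays in control of the output. In this sketch, slow_square() is a hypothetical function that reports what it is doing; the caller can silence it without changing the function:

```r
# Hypothetical function that reports progress with message()
slow_square = function(x) {
  message("Squaring ", length(x), " values")
  x^2
}

slow_square(1:3)                         # the message is shown
out = suppressMessages(slow_square(1:3)) # the message is silenced
out
```

Output produced with cat() is not affected by suppressMessages(), which is one reason to reserve cat() for print()/show() methods.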
Exercise
1. The stop() function has an argument call. that indicates if the function call should
be part of the error message. Create a function and experiment with this option.
Invisible Returns
The invisible() function allows you to return a temporarily invisible copy of an object.
This is particularly useful for functions that return values that can be assigned, but are not
printed when they are not assigned. For example, suppose we have a function that plots the
data and fits a straight line:
regression_plot = function(x, y, ...) {
  # Plot and pass additional arguments to default plot method
  plot(x, y, ...)
  # Fit regression model
  model = lm(y ~ x)
  # Add line of best fit to the plot
  abline(model)
  invisible(model)
}
When the function is called, a scatter graph is plotted with the line of best fit, but the output is
invisible. However, when we assign the function’s output to an object (i.e., out = regression_plot(x,
y)), the variable out contains the output of the lm() call.
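The visibility of a return value can be inspected with base R's withVisible(). The following sketch avoids plotting altogether; add_quietly() is a hypothetical function used only for this illustration:

```r
# Hypothetical function that returns its result invisibly
add_quietly = function(a, b) invisible(a + b)

add_quietly(1, 2)       # nothing is printed at the console
out = add_quietly(1, 2) # but the value can still be assigned
out

# withVisible() reports the value together with its visibility flag
withVisible(add_quietly(1, 2))$visible
```

Wrapping a call in parentheses, as in (add_quietly(1, 2)), is another way to force an invisible result to print.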
Another example is hist(). Typically, we don’t want anything displayed in the console when
we call the function:
hist(x)
However, if we assign the output to an object, out = hist(x), the object out is actually a list
containing, inter alia, information on the midpoints, breaks, and counts.
Factors
Factors are much maligned objects. While at times they are awkward, they do have their uses.
A factor is used to store categorical variables. This data type is unique to R (or at least not
common among programming languages). The difference between factors and strings is
important because R treats factors and strings differently. Although factors look similar to
character vectors, they are actually integers. This leads to initially surprising behavior:
x = 4:6
c(x)
#> [1] 4 5 6
c(factor(x))
#> [1] 1 2 3
In this case, the c() function is using the underlying integer representation of the factor.
(This output is from the version of R used at the time of writing; from R 4.1.0, c() applied
to a factor returns a factor, and as.integer() exposes the integer codes.)
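The integer representation can be inspected in a way that does not depend on how c() treats factors. This is a small sketch using only base R:

```r
f = factor(4:6)

as.integer(f)               # the underlying integer codes
levels(f)                   # the labels, stored as character strings
as.integer(as.character(f)) # recovers the original values
```

Converting through as.character() first is the usual way to get back the original numbers from a factor whose levels look numeric.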
DealingwiththewrongcaseofbehaviorisacommonsourceofinefficiencyforRusers.
Often,categoricalvariablesgetstoredas1,2,3,4,and5,withassociateddocumentation
elsewherethatexplainswhateachnumbermeans.Thisisclearlyapain.Alternatively,we
storethedataasacharactervector.Whilethisisfine,thesemanticsarewrongbecauseit
doesn’tconveythatthisisacategoricalvariable.It’snotsensibletosaythatyoushould
alwaysorneverusefactors,sincefactorshavebothpositiveandnegativefeatures.Instead,we
needtoexamineeachcaseindividually.
Asageneralrule,ifyourvariablehasaninherentorder(e.g.,smallversuslarge)oryouhave
afixedsetofcategories,thenyoushouldconsiderusingafactor.
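To make the integer representation concrete, here is a small sketch using made-up data (the vectors sizes and f are illustrative and not taken from the text). It shows the labels, the underlying integer codes, and the usual way to recover numeric values stored in a factor.

```r
sizes = c("small", "large", "medium", "small")
fac = factor(sizes, levels = c("small", "medium", "large"))

levels(fac)      # the fixed set of categories, in the order we specified
#> [1] "small"  "medium" "large"
as.integer(fac)  # the underlying integer codes
#> [1] 1 3 2 1

# Converting a factor of numbers: go via character to get the labels back
f = factor(4:6)
as.integer(f)
#> [1] 1 2 3
as.integer(as.character(f))
#> [1] 4 5 6
```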
Inherent Order
Factors can be used for ordering in graphics. For instance, suppose we have a dataset where
the variable type takes one of three values, small, medium, or large. Clearly, there is an
ordering. Using a standard boxplot() call,
boxplot(y ~ type)
would create a boxplot where the x-axis was alphabetically ordered. By converting type into a
factor, we can easily specify the correct ordering. The levels must match the values in the
data exactly, including case:
boxplot(y ~ factor(type, levels = c("small", "medium", "large")))
WARNING
Most users interact with factors via the read.csv() function, where, in R versions before 4.0.0, character
columns are automatically converted to factors by default. This feature can be irritating if our data is messy and
we want to clean and recode variables. Typically when reading in data via read.csv(), we use the
stringsAsFactors = FALSE argument. Although this argument can be added to the global options() list and
placed in the .Rprofile, this leads to nonportable code, so should be avoided.
Fixed Set of Categories
Suppose our dataset relates to months of the year:
m = c("January", "December", "March")
If we sort m in the usual way, sort(m), we perform standard alphanumeric ordering, placing
December first. This is technically correct, but not that helpful. We can use factors to remedy
this problem by specifying the admissible levels:
# month.name contains the 12 months
fac_m = factor(m, levels = month.name)
sort(fac_m)
#> [1] January  March    December
#> 12 Levels: January February March April May June July August ... December
Exercise
1. Factors are slightly more space-efficient than characters. Create a character vector
and corresponding factor, and use pryr::object_size() to calculate the space
needed for each object.
The Apply Family
The apply functions can be an alternative to writing for loops. The general idea is to apply (or
map) a function to each element of an object. For example, you can apply a function to each
row or column of a matrix. A list of available functions and their descriptions is given in
Table 3-2. In general, all apply functions have similar properties:
• Each function takes at least two arguments: an object and another function. The function
  is passed as an argument.
• Every apply function has the dots (...) argument, which is used to pass on arguments to
  the function provided to the FUN argument. sapply(list((1:3)^2, 2:7), mean, trim = 0.4),
  for example, passes the trim argument to the mean function call for each vector
  in the list.
Using apply functions when possible can lead to shorter, more succinct, idiomatic R code. In
this section, we will cover the three main functions, apply(), lapply(), and sapply(). Since
the apply functions are covered in most R textbooks, we just give a brief introduction to the
topic and provide pointers to other resources at the end of this section.
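The way the dots argument forwards extra arguments can be seen by running the sapply() call quoted above with and without trim. This is only a sketch; the object name vals is introduced here for illustration.

```r
vals = list((1:3)^2, 2:7)

# Without extra arguments: the ordinary mean of each vector
sapply(vals, mean)
#> [1] 4.666667 4.500000

# trim = 0.4 is forwarded through ... to every mean() call
sapply(vals, mean, trim = 0.4)
#> [1] 4.0 4.5

# Equivalent, more explicit form using an anonymous function
sapply(vals, function(v) mean(v, trim = 0.4))
#> [1] 4.0 4.5
```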
NOTE
Most people rarely use the other apply functions. For example, I have only used eapply() once. Students in my
class uploaded R scripts. Using source(), I was able to read in the scripts to a separate environment. I then
applied a marking scheme to each environment using eapply(). Using separate environments, I avoided object
name clashes.
Table 3-2. The apply family of functions from base R
Function   Description
apply      Apply functions over array margins
by         Apply a function to a data frame split by factors
eapply     Apply a function over values in an environment
lapply     Apply a function over a list or vector
mapply     Apply a function to multiple list or vector arguments
rapply     Recursively apply a function to a list
tapply     Apply a function over a ragged array
The apply() function is used to apply a function to each row or column of a matrix. In many
data science problems, this is a common task. For example, to calculate the standard deviation
of each row:
data("ex_mat", package = "efficient")
# MARGIN = 1: corresponds to rows
row_sd = apply(ex_mat, 1, sd)
The first argument of apply() is the object of interest. The second argument is the MARGIN.
This is a vector giving the subscripts that the function (the third argument) will be applied
over. When the object is a matrix, a margin of 1 indicates rows, and 2 indicates columns. So
to calculate the column standard deviations, the second argument is changed to 2:
col_sd = apply(ex_mat, 2, sd)
Additional arguments can be passed to the function that is to be applied to the data. For
example, to pass the na.rm argument to the sd() function, we have:
row_sd = apply(ex_mat, 1, sd, na.rm = TRUE)
The apply() function also works on higher dimensional arrays; a one-dimensional array is a
vector, a two-dimensional array is a matrix.
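As a self-contained illustration of the MARGIN argument that does not rely on the efficient package, the sketch below uses a small made-up matrix m and a three-dimensional array arr (both names are introduced here for illustration).

```r
m = matrix(1:6, nrow = 2)  # 2 rows, 3 columns, filled column by column

apply(m, 1, sum)           # MARGIN = 1: one result per row
#> [1]  9 12
apply(m, 2, sum)           # MARGIN = 2: one result per column
#> [1]  3  7 11

# For a 2 x 3 x 2 array, MARGIN = 3 applies the function to each 2 x 3 slice
arr = array(1:12, dim = c(2, 3, 2))
apply(arr, 3, sum)
#> [1] 21 57
```

For simple row and column sums or means, base R also provides rowSums(), colSums(), rowMeans(), and colMeans(), which are usually faster than the equivalent apply() call.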
The lapply() function is similar to apply(). The main differences are that the input types are
vectors or lists and the return type is a list. Essentially, we apply a function to each element of
a list or vector. The functions sapply() and vapply() are similar to lapply(), but the return
type is not necessarily a list.
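The differences in return type can be checked directly. The sketch below uses a small made-up list x; it also shows that vapply() stops with an error when a result does not match the declared template, which is captured here with try() so that the example runs to completion.

```r
x = list(a = 1:3, b = 4:6)

res_l = lapply(x, mean)              # always a list
res_s = sapply(x, mean)              # simplified to a named numeric vector here
res_v = vapply(x, mean, numeric(1))  # as sapply(), but the return type is declared

is.list(res_l)
#> [1] TRUE
res_v
#> a b 
#> 2 5 

# range() returns two values, so it does not match numeric(1)
res = try(vapply(x, range, numeric(1)), silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```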
Example: Movies Dataset
The Internet Movie Database is a website that collects movie data supplied by studios and fans.
It is one of the largest movie databases on the web and is maintained by Amazon. The
ggplot2movies package contains about 60,000 movies stored as a data frame:
data(movies, package = "ggplot2movies")
Movies are rated between 1 and 10 by fans. Columns 7 to 16 of the movies dataset give the
percentage of voters for a particular rating.
ratings = movies[, 7:16]
For example, 4.5% of voters rated the first movie a 1:
ratings[1, ]
#>    r1  r2  r3  r4   r5   r6   r7   r8  r9 r10
#> 1 4.5 4.5 4.5 4.5 14.5 24.5 24.5 14.5 4.5 4.5
We can use the apply() function to investigate voting patterns. The function
nnet::which.is.max() finds the position of the maximum in a vector, but breaks ties at random;
which.max() just returns the first position. Using apply(), we can easily determine the most
popular rating for each movie and plot the results:
popular = apply(ratings, 1, nnet::which.is.max)
plot(table(popular))
Figure 3-2 highlights the fact that voting patterns are clearly not uniform between 1 and 10.
The most popular vote is the highest rating, 10. Clearly if you went to the trouble of voting
for a movie, it was either very good or very bad (there is also a peak at 1). Rating a movie 7
is also a popular choice (search the web for “most popular number” and you will see that 7
dominates the rankings).
Figure 3-2. Movie voting preferences
Type Consistency
When programming, it is helpful if the return value from a function always takes the same
form. Unfortunately, not all base R functions follow this idiom. For example, the functions
sapply() and [.data.frame() aren’t type-consistent:
two_cols = data.frame(x = 1:5, y = letters[1:5])
zero_cols = data.frame()
sapply(two_cols, class)   # a character vector
sapply(zero_cols, class)  # a list
two_cols[, 1:2]           # a data.frame
two_cols[, 1]             # an integer vector
This can cause unexpected problems. The functions lapply() and vapply() are
type-consistent, as are dplyr::select() and dplyr::filter(). The purrr package has some
type-consistent alternatives to base R functions. For example, you can use map_dbl() to replace
Map(), and flatten_df() to replace unlist().
Other resources
Almost every R book has a section on the apply functions. Here are the resources we feel are
most helpful:
• Each function has a number of examples in the associated help page. You can directly
  access the examples using the example() function. For example, to run the apply()
  examples, use example("apply").
• There is a very detailed Stack Overflow answer describing when, where, and how to
  use each of the functions.
• In a similar vein, Neil Saunders has a nice blog post giving an overview of the functions.
• The apply functions are an example of functional programming. Chapter 16 of R for
  Data Science by Grolemund and Wickham (O’Reilly) describes the interplay between
  loops and functional programming in more detail, whereas Advanced R by Hadley
  Wickham (CRC Press) gives a more in-depth description of the topic.
Exercises
1. Rewrite the preceding sapply() function calls using vapply() to ensure type
consistency.
2. How would you make subsetting data frames with [ type consistent? Hint: look at the
drop argument.
Caching Variables
A straightforward method for speeding up code is to calculate objects once and reuse the
value when necessary. This could be as simple as replacing sd(x) in multiple function calls
with the object sd_x, which is defined once and reused. For example, suppose we wish to
normalize each column of a matrix. However, instead of using the standard deviation of each
column, we will use the standard deviation of the entire dataset:
apply(x, 2, function(i) mean(i) / sd(x))
This is inefficient because the value of sd(x) is constant, so recalculating the standard
deviation for every column is unnecessary. Instead, we should evaluate once and store the
result:
sd_x = sd(x)
apply(x, 2, function(i) mean(i) / sd_x)
If we compare the two methods on a 100 row by 1,000 column matrix, the cached version is
around 100 times faster (Figure 3-3).
Figure 3-3. Performance gains obtained from caching the standard deviation in a 100 by 1,000 matrix
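A quick way to convince yourself that the cached and uncached versions give the same answer is sketched below on simulated data (the matrix and the seed are arbitrary choices made for this illustration). Timings are not shown because they depend on the machine; system.time() or microbenchmark() can be wrapped around each call to compare them.

```r
set.seed(1)
x = matrix(rnorm(100 * 1000), nrow = 100)  # 100 rows, 1,000 columns

# Uncached: sd(x) is recomputed for each of the 1,000 columns
uncached = apply(x, 2, function(i) mean(i) / sd(x))

# Cached: sd(x) is computed once and reused
sd_x = sd(x)
cached = apply(x, 2, function(i) mean(i) / sd_x)

all.equal(uncached, cached)
#> [1] TRUE
```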
A more advanced form of caching is to use the memoise package. If a function is called
multiple times with the same input, it may be possible to speed things up by keeping a cache of
known answers that it can retrieve. The memoise package allows us to easily store the value of
a function call and returns the cached result when the function is called again with the same
arguments. This package trades off memory versus speed, since the memoised function stores
all previous inputs and outputs. To cache a function, we simply pass the function to the
memoise function.
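The idea behind memoisation can be sketched in a few lines of base R. The helper simple_memoise() below is hypothetical and is not how the memoise package is implemented; it assumes a function of a single scalar argument and uses an environment as the cache. The counter n_calls is only there to show how often the real work is done.

```r
# Hypothetical helper, for illustration only (not the memoise package)
simple_memoise = function(f) {
  cache = new.env()  # persists between calls to the returned function
  function(x) {
    key = as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)  # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE)  # look up the stored result
  }
}

n_calls = 0
slow_square = function(x) {
  n_calls <<- n_calls + 1  # count how often the real work is done
  x^2
}
m_square = simple_memoise(slow_square)

m_square(4)  # computed and stored
#> [1] 16
m_square(4)  # retrieved from the cache
#> [1] 16
n_calls
#> [1] 1
```

The memory cost mentioned above is visible here: every distinct input adds one entry to cache, and nothing is ever removed.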
The classic memoise example is the factorial function. Another example is to limit use of a
web resource. For example, suppose we are developing a Shiny (an interactive graphic)
application in which the user can fit the regression line to data. The user can remove points
and refit the line. An example function would be:
# Argument indicates row to remove
plot_mpg = function(row_to_remove) {
  data(mpg, package = "ggplot2")
  mpg = mpg[-row_to_remove, ]
  plot(mpg$cty, mpg$hwy)
  lines(lowess(mpg$cty, mpg$hwy), col = 2)
}
We can use memoise to speed this up by caching results. A quick benchmark
library("memoise")
library("microbenchmark")
m_plot_mpg = memoise(plot_mpg)
microbenchmark(times = 10, unit = "ms", m_plot_mpg(10), plot_mpg(10))
#> Unit: milliseconds
#>            expr   min    lq  mean median    uq   max neval cld
#>  m_plot_mpg(10)  0.04 4e-02  0.07  8e-02 8e-02   0.1    10  a
#>    plot_mpg(10) 40.20 1e+02 95.52  1e+02 1e+02 107.1    10   b
suggests that we can obtain a speed-up of more than 100-fold.
Exercise
1. Construct a boxplot of timings for the standard plotting function and the memoised
version.
Function Closures
WARNING
The following section is meant to provide an introduction to function closures with example use cases. See
Advanced R by Hadley Wickham (CRC Press) for a detailed introduction.
More advanced caching is available using function closures. A closure in R is an object that
contains functions bound to the environment the closure was created in. Technically, all
functions in R have this property, but we use the term function closure to denote functions
where the environment is not .GlobalEnv. One of the environments associated with a
function is known as the enclosing environment; that is, where the function was created. This
allows us to store values between function calls. Suppose we want to create a stopwatch type
function. This is easily achieved with a function closure:
# <<- assigns values to the parent environment
stop_watch = function() {
  start_time = NULL
  start = function() start_time <<- Sys.time()
  stop = function() {
    stop_time = Sys.time()
    difftime(stop_time, start_time)
  }
  list(start = start, stop = stop)
}
watch = stop_watch()
The object watch is a list that contains two functions. One function for starting the timer:
watch$start()
and the other for stopping the timer:
watch$stop()
Without using function closures, the stopwatch function would be longer, more complex, and
therefore more inefficient. When used properly, function closures are very useful
programming tools for writing concise code.
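The way a closure keeps state between calls can be seen in an even smaller sketch than the stopwatch. The function make_counter() is a hypothetical example introduced here; each call to it creates a new enclosing environment with its own count.

```r
make_counter = function() {
  count = 0
  function() {
    count <<- count + 1  # updates count in the enclosing environment
    count
  }
}

counter_a = make_counter()
counter_b = make_counter()

counter_a()
#> [1] 1
counter_a()
#> [1] 2
counter_b()  # a separate closure with its own count
#> [1] 1
```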
Exercises
1. Write a stopwatch function without using function closures.
2. Many stopwatches have the ability to measure not only your overall time but also
your individual laps. Add a lap() function to the stop_watch() function that will
record individual times, while still keeping track of the overall time.
NOTE
A related idea to function closures is nonstandard evaluation (NSE), or programming on the language. NSE crops
up all the time in R. For example, when we execute plot(height, weight), R automatically labels the x- and
y-axis of the plot with height and weight. This is a powerful concept that enables us to simplify code. More detail is
given in the “Nonstandard evaluation” section of Advanced R by Hadley Wickham.
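The axis-labeling behavior described in the note relies on capturing the expression passed to a function rather than its value. A minimal sketch of that mechanism, using the base R functions substitute() and deparse(), is shown below; label_of() is a hypothetical helper, not part of plot().

```r
label_of = function(x) {
  deparse(substitute(x))  # capture the expression, not its value
}

height = c(150, 160, 170)
label_of(height)
#> [1] "height"
label_of(log(height))
#> [1] "log(height)"
```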
The Byte Compiler
The compiler package, written by R Core member Luke Tierney, has been part of R since
version 2.13.0. The compiler package allows R functions to be compiled, resulting in a byte
code version that may run faster.¹ The compilation process eliminates a number of costly
operations the interpreter has to perform, such as variable lookup.
Since R 2.14.0, all of the standard functions and packages in base R are precompiled into byte
code. This is illustrated by the base function mean():
getFunction("mean")
#> function (x, ...)
#> UseMethod("mean")
#> <bytecode: 0x48eec88>
#> <environment: namespace:base>
The third line shows that the function has been byte-compiled; the hexadecimal value is the
memory address of the bytecode. This means that the compiler package has
translated the R function into another language that can be interpreted by a very fast
interpreter. Amazingly, the compiler package is almost entirely pure R, with just a few C
support routines.
Example: The Mean Function
The compiler package comes with R, so we just need to load the package in the usual way:
library("compiler")
Next, we create an inefficient function for calculating the mean. This function takes in a
vector, calculates the length, and then updates the m variable.
mean_r = function(x) {
  m = 0
  n = length(x)
  for (i in seq_len(n))
    m = m + x[i] / n
  m
}
This is clearly a bad function and we should just use the mean() function, but it’s a useful
comparison. Compiling the function is straightforward:
cmp_mean_r = cmpfun(mean_r)
Then we use the microbenchmark() function to compare the three variants:
# Generate some data
x = rnorm(1000)
microbenchmark(times = 10, unit = "ms", # milliseconds
               mean_r(x), cmp_mean_r(x), mean(x))
#> Unit: milliseconds
#>           expr   min    lq  mean median    uq  max neval cld
#>      mean_r(x) 0.358 0.361 0.370  0.363 0.367 0.43    10   c
#>  cmp_mean_r(x) 0.050 0.051 0.052  0.051 0.051 0.07    10  b
#>        mean(x) 0.005 0.005 0.008  0.007 0.008 0.03    10 a
The compiled function is around seven times faster than the uncompiled function. Of course,
the native mean() function is faster, but compiling does make a significant difference
(Figure 3-4).
Figure 3-4. Comparison of mean functions
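Compilation should change only the speed of a function, never its result. The sketch below checks this for mean_r() on a small made-up vector. Note that from R 3.4.0 onward JIT compilation is enabled by default, so on a recent version of R the uncompiled function is compiled automatically after its first few calls and the timing gap shown above may be much smaller.

```r
library("compiler")

mean_r = function(x) {
  m = 0
  n = length(x)
  for (i in seq_len(n))
    m = m + x[i] / n
  m
}
cmp_mean_r = cmpfun(mean_r)

x = c(2, 4, 6, 8)
mean_r(x)
#> [1] 5
cmp_mean_r(x)
#> [1] 5
all.equal(mean_r(x), cmp_mean_r(x))
#> [1] TRUE
```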
Compiling Code
There are a number of ways to compile code. The easiest is to compile individual functions
using cmpfun(), but this obviously doesn’t scale. If you create a package, you can
automatically compile the package on installation by adding
ByteCompile: true
to the DESCRIPTION file. At the time of writing, most R packages installed using
install.packages() are not compiled (since R 3.5.0, packages are byte-compiled on
installation by default). We can enable (or force) packages to be compiled by starting R with
the environment variable R_COMPILE_PKGS set to a positive integer value and specifying that
we install the package from source:
## Windows users will need Rtools
install.packages("ggplot2", type = "source")
Or, if we want to avoid altering the .Renviron file, we can specify an additional argument:
install.packages("ggplot2", type = "source", INSTALL_opts = "--byte-compile")
A final option is to use just-in-time (JIT) compilation.² The enableJIT() function disables
JIT compilation if the argument is 0. Arguments 1, 2, or 3 implement different levels of
optimization. JIT can also be enabled by setting the environment variable R_ENABLE_JIT to
one of these values.
TIP
We recommend setting the compile level to the maximum value of 3.
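A short sketch of switching the JIT level from within a session follows. It relies on two documented behaviors of enableJIT(): the function returns the level that was in effect before the call, and a negative argument returns the current level without changing it. The variable name previous_level is introduced here for illustration.

```r
library("compiler")

# enableJIT() returns the level that was in effect before the call
previous_level = enableJIT(0)  # 0 switches JIT compilation off
enableJIT(3)                   # 3 is the highest optimization level
enableJIT(previous_level)      # restore the original setting

# A negative argument queries the current level without changing it
enableJIT(-1) == previous_level
#> [1] TRUE
```

Saving and restoring the previous level, as done here, avoids leaving the session in a different state from the one it started in.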
The impact of compiling on install will vary from package to package. For packages that
already have lots of precompiled code, speed gains will be small (R Core Team 2016).
WARNING
Not all packages work when compiled on installation.
References
Burns, Patrick. 2011. The R Inferno. Lulu.com.
Wickham, Hadley. 2014a. Advanced R. CRC Press.
Grolemund, G., and H. Wickham. 2016. R for Data Science. O’Reilly Media.
R Core Team. 2016. “R Installation and Administration.” R Foundation for Statistical
Computing. https://cran.r-project.org/doc/manuals/r-release/R-admin.html.
¹ The authors have yet to find a situation where byte-compiled code runs significantly slower.
² It appears that in R 3.4, this optimization will be enabled by default.
Chapter 4. Efficient Workflow
Efficient programming is an important skill for generating the correct result, on time. Yet
coding is only one part of a wider skill set needed for successful outcomes for projects
involving R programming. Unless your project is to write generic R code (i.e., unless you are
on the R Core Team), the project will probably transcend the confines of the R world; it must
engage with a whole range of other factors. In this context, we define workflow as the sum of
practices, habits, and systems that enable productivity.¹ To some extent, workflow is about
personal preferences. Everyone’s mind works differently so the most appropriate workflow
varies from person to person and from one project to the next. Project management practices
will also vary depending on the scale and type of the project. It’s a big topic, but it can
usefully be condensed into five top tips.
Prerequisites
This chapter focuses on workflow. For project planning and management, we’ll use the
DiagrammeR package. For project reporting, we’ll focus on R Markdown and knitr, which
are bundled with RStudio (but can be installed independently if needed). We’ll suggest other
packages that are worth investigating, but are not required for this particular chapter.
library("DiagrammeR")
Top Five Tips for Efficient Workflow
1. Start without writing code but with a clear mind and perhaps a pen and paper. This
will ensure that you keep your objectives at the forefront of your mind without
getting lost in the technology.
2. Make a plan. The size and nature will depend on the project but timelines, resources,
and chunking the work will make you more effective when you start.
3. Select the packages you will use for implementing the plan early. Minutes spent
researching and selecting from the available options could save hours in the future.
4. Document your work at every stage: work can only be effective if it’s communicated
clearly and code can only be efficiently understood if it’s commented.
5. Make your entire workflow as reproducible as possible. knitr can help with this in
the phase of documentation.
A Project Planning Typology
Appropriate project management structures and workflow depend on the type of project you
are undertaking. The following typology demonstrates the links between project type and
project management requirements.²
Data analysis
Here, you are trying to explore datasets to discover something interesting or answer some
questions. The emphasis is on speed of manipulating your data to generate interesting
results. Formality is less important in this type of project. Sometimes this analysis
project may only be part of a larger project (the data may have to be created in a lab, for
example). How the data analysts interact with the rest of the team may be as important for
the project’s success as how they interact with each other.
Package creation
Here you want to create code that can be reused across projects, possibly by people
whose use cases you don’t know (if you make it publicly available). The emphasis in this
case will be on clarity of user interface and documentation, meaning style and code
review are important. Robustness and testing are important in this type of project, too.
Reporting and publishing
Here you are writing a report, journal paper, or book. The level of formality varies
depending upon the audience, but you have additional worries like how much code it
takes to arrive at the conclusions, and how much output the code creates.
Software applications
This could range from a simple Shiny app to R being embedded in the server of a much
larger piece of software. Either way, since there is limited opportunity for human
interaction, the emphasis is on robust code and gracefully dealing with failure.
Based on these observations, we recommend thinking about which type of workflow, file
structure, and project management system suits your project best. Sometimes it’s best not to
be prescriptive, so we recommend trying different working practices to discover which
works best, time permitting.³
There are, however, concrete steps that can be taken to improve workflow in most projects
that involve R programming. Learning them will, in the long run, improve productivity and
reproducibility. With these motivations in mind, the purpose of this chapter is simple: to
highlight some key ingredients of an efficient R workflow. It builds on the concept of an
R/RStudio project, introduced in Chapter 2, and is ordered chronologically throughout the
stages involved in a typical project’s lifespan, from inception to publication:
Project planning
This should happen before any code has been written, to avoid time wasted using a
mistaken analysis strategy. Project management is the art of making project plans
happen.
Package selection
After planning your project, you should identify which packages are most suitable to
getting the work done quickly and effectively. With rapid increases in the number and
performance of packages, it is more important than ever to consider the range of options
at the outset. For example, *_join() from dplyr is often more appropriate than merge(),
as we’ll see in Chapter 6.
Publication
This final stage is relevant if you want your R code to be useful for others in the long
term. To this end, “Publication” touches on documentation using knitr and the much
stricter approach to code publication of package development.
Project Planning and Management
Good programmers working on a complex project will rarely just start typing code. Instead,
they will plan the steps needed to complete the task as efficiently as possible: “smart
preparation minimizes work” (Berkun 2005). Although search engines are useful for
identifying the appropriate strategy, trial-and-error approaches (e.g., typing code at random
and Googling the inevitable error messages) are usually highly inefficient.
Strategic thinking is especially important during a project’s inception: if you make a bad
decision early on, it will have cascading negative impacts throughout the project’s entire
lifespan. So detrimental and ubiquitous is this phenomenon in software development that a
term has been coined to describe it: technical debt. This has been defined as “not quite right
code which we postpone making right” (Kruchten, Nord, and Ozkaya 2012). Dozens of
academic papers have been written on the subject, but from the perspective of beginning a
project (i.e., in the planning stage, where we are now), all you need to know is that it is
absolutely vital to make sensible decisions at the outset. If you do not, your project may be
doomed to failure or incessant rounds of refactoring.
To minimize technical debt at the outset, the best place to start may be with a pen and paper
and an open mind. Sketching out your ideas and deciding precisely what you want to do, free
from the constraints of a particular piece of technology, can be a rewarding exercise before
you begin. Project planning and visioning can be a creative process not always well-suited to
the linear logic of computing, despite recent advances in project management software, some
of which are outlined in the bullet points that follow.
Scale can loosely be defined as the number of people working on a project. It should be
considered at the outset because the importance of project management increases
exponentially with the number of people involved. Project management may be trivial for a
small project, but if you expect it to grow, implementing a structured workflow early on
could avoid problems later. On small projects consisting of a one-off script, project
management may be a distracting waste of time. Large projects involving dozens of people,
on the other hand, require much effort dedicated to project management: regular meetings,
division of labor, and a scalable project management system to track progress, issues, and
priorities will inevitably consume a large proportion of the project’s time. Fortunately, a
multitude of dedicated project management systems have been developed to cater to projects
across a range of scales. These include, in rough ascending order of scale and complexity, the
following:
• The interactive code-sharing site GitHub, which is described in more detail in Chapter 9
• ZenHub, a browser plugin that is “the first and only project management suite that works
  natively within GitHub”
• Web-based and easy-to-use tools such as Trello
• Dedicated desktop project management software such as ProjectLibre and GanttProject
• Fully featured, enterprise scale, open source project management systems such as
  OpenProject and redmine
Regardless of the software (or lack thereof) used for project management, it involves
considering the project’s aims in the context of available resources (e.g., computational and
programmer resources), project scope, timescales, and suitable software. And these things
should be considered together. To take one example, is it worth the investment of time needed
to learn a particular R package that is not essential to completing the project but which will
make the code run faster? Does it make more sense to hire another programmer or invest in
more computational resources to meet an urgent deadline?
Minutes spent thinking through such issues before writing a single line can save hours in the
future. This is emphasized in books such as The Art of Project Management by Scott Berkun
(O’Reilly) and the “Guide to the Project Management Body of Knowledge” by PMBoK and
useful online resources such as those by teamgantt.com and lasa.org.uk, which focus exclusively
on project planning. This section condenses some of the most important lessons from this
literature in the context of typical R projects (i.e., those that involve data analysis, modeling,
and visualization).
Chunking Your Work
Once a project overview has been devised and stored, in mind (for small projects, if you trust
that as storage medium!) or written, a plan with a timeline can be drawn up. The up-to-date
visualization of this plan can be a powerful reminder to you and collaborators of the progress
on the project so far. More importantly, the timeline provides an overview of what needs to be
done next. Setting start dates and deadlines for each task will help prioritize the work and
ensure that you are on track. Breaking a large project into smaller chunks is highly
recommended, making huge, complex tasks more achievable and modular (PMBoK 2000).
Chunking the work will also make collaboration easier, as we shall see in Chapter 9.
The tasks that a project should be split into will depend on the nature of the work. The phases
illustrated in Figure 4-1 represent a rough starting point, not a template. The programming
phase will usually need to be split into at least data tidying, processing, and visualization.
Figure 4-1. Schematic illustrations of key project phases and levels of activity over time, based on the “Guide to the Project
Management Body of Knowledge” (PMBoK 2000)
Making Your Workflow SMART
A more rigorous (but potentially onerous) way to project plan is to divide the work into a
series of objectives and track their progress throughout the project’s duration. One way to
check if an objective is appropriate for action and review is by using the SMART criteria:
• Specific: is the objective clearly defined and self-contained?
• Measurable: is there a clear indication of its completion?
• Attainable: can the target be achieved?
• Realistic: have sufficient resources been allocated to the task?
• Time-bound: is there an associated completion date or milestone?
If the answer to each of these questions is yes, the task is likely to be suitable to include in the
project’s plan. Note that this does not mean all project plans need to be uniform. A project
plan can take many forms, including a short document, a Gantt chart (see Figure 4-2), or
simply a clear vision of the project’s steps in mind.
Figure 4-2. A Gantt chart created using DiagrammeR illustrating the steps needed to complete this book at an early stage
of its development
Visualizing Plans with R
Various R packages can help visualize the project plan. Though these are useful, they cannot
compete with the dedicated project management software outlined at the outset of this section.
However, if you are working on a relatively simple project, it is useful to know that R can
help represent and keep track of your work. Packages for plotting project progress include:⁴
plan
Provides basic tools to create burndown charts (which concisely show whether a project
is on time or not) and Gantt charts.
plotrix
A general-purpose plotting package that provides basic Gantt chart-plotting functionality.
Enter example(gantt.chart) for details.
DiagrammeR
A new package for creating network graphs and other schematic diagrams in R. This
package provides an R interface to simple flowchart file formats such as mermaid and
GraphViz.
The small example that follows (which provides the basis for creating charts like Figure 4-2)
illustrates how DiagrammeR can take simple text inputs to create informative up-to-date Gantt
charts. Such charts can greatly help with the planning and task management of long and
complex R projects, as long as they do not take away valuable programming time from core
project objectives.
library("DiagrammeR")
# Define the Gantt chart and plot the result (not shown)
mermaid("gantt
        Section Initiation
        Planning           :a1, 2016-01-01, 10d
        Data processing    :after a1, 30d")
In this example, gantt defines the subsequent data layout. Section refers to the project’s
section (useful for large projects, with milestones), and each new line refers to a discrete task.
Planning, for example, has the code a1, begins on the first day of 2016, and lasts for 10 days.
Data processing starts once a1 has finished and lasts for 30 days.
See knsv.github.io/mermaid/gantt.html for more detailed documentation.
Exercises
1. What are the three most important work chunks of your current R project?
2. What is the meaning of SMART objectives (see Making Your Workflow SMART)?
3. Run the code chunk at the end of this section to see the output.
4. Bonus exercise: modify this code to create a basic Gantt chart of an R project you are
working on.
Package Selection
A good example of the importance of prior planning to minimize effort and reduce technical
debt is package selection. An inefficient, poorly supported, or simply outdated package can
waste hours. When a more appropriate alternative is available, this waste can be prevented by
prior planning. There are many poor packages on CRAN and much duplication so it’s easy to
go wrong. Just because a certain package can solve a particular problem doesn’t mean that it
should.
Used well, however, packages can greatly improve productivity: not reinventing the wheel is
part of the ethos of open source software. If someone has already solved a particular technical
problem, you don’t have to rewrite their code, which allows you to focus on solving the
applied problem. Furthermore, because R packages are generally (but not always) written by
competent programmers and subject to user feedback, they may work faster and more
effectively than the hastily prepared code you may have written. All R code is open source and
potentially subject to peer review. A prerequisite of publishing an R package is that developer
contact details must be provided, and many packages provide a site for issue tracking.
Furthermore, R packages can increase programmer productivity by dramatically reducing the
amount of code they need to write because all the code is packaged in functions behind the
scenes.
Let’s look at an example. Imagine a project for which you would like to find the distance
between sets of points (origins, o, and destinations, d) on the Earth’s surface. Background
reading shows that a good approximation of great circle distance, which accounts for the
curvature of the Earth, can be made by using the Haversine formula, which you duly
implement, involving much trial and error:
# Function to convert degrees to radians
deg2rad = function(deg) deg * pi / 180
# Create origins and destinations
o = c(lon = -1.55, lat = 53.80)
d = c(lon = -1.61, lat = 54.98)
# Convert to radians
o_rad = deg2rad(o)
d_rad = deg2rad(d)
# Find differences in longitude and latitude (in radians)
delta_lon = (o_rad[1] - d_rad[1])
delta_lat = (o_rad[2] - d_rad[2])
# Calculate distance with Haversine formula
a = sin(delta_lat / 2)^2 + cos(o_rad[2]) * cos(d_rad[2]) * sin(delta_lon / 2)^2
c = 2 * asin(min(1, sqrt(a)))
(d_hav1 = 6371 * c) # multiply by Earth's radius in km
#> [1] 131
This method works but it takes time to write, test, and debug. It would be much better to
package it up into a function. Or even better, use a function that someone else has written and
put in a package:
# Find great circle distance with geosphere (result in meters)
(d_hav2 = geosphere::distHaversine(o, d))
#> [1] 131415
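As a middle ground between the hardcoded script and the package, the same steps can be wrapped in a reusable function. The sketch below is a hypothetical helper, haversine_km(), written for this illustration; it assumes inputs given as c(lon, lat) in degrees and uses the same Earth radius of 6,371 km as the hardcoded calculation.

```r
# Hypothetical helper wrapping the hardcoded steps shown above.
# o and d are c(lon, lat) in degrees; r is the Earth's radius in km.
haversine_km = function(o, d, r = 6371) {
  deg2rad = function(deg) deg * pi / 180
  o_rad = deg2rad(unname(o))
  d_rad = deg2rad(unname(d))
  delta_lon = o_rad[1] - d_rad[1]
  delta_lat = o_rad[2] - d_rad[2]
  a = sin(delta_lat / 2)^2 +
    cos(o_rad[2]) * cos(d_rad[2]) * sin(delta_lon / 2)^2
  2 * asin(min(1, sqrt(a))) * r
}

o = c(lon = -1.55, lat = 53.80)
d = c(lon = -1.61, lat = 54.98)
round(haversine_km(o, d))
#> [1] 131
```

Even in this form the function handles only a single pair of points and has no input checking, which is part of the case for preferring a tested package function.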
The difference between the hardcoded method and the package method is striking. One is
seven lines of tricky R code involving many subsetting stages and small, similar functions
(e.g., sin and asin), which are easy to confuse. The other is one line of simple code. The
package method using geosphere took perhaps a hundredth of the time and gave a more accurate
result (because it uses a more accurate estimate of the radius of the Earth). This means that
a couple of minutes searching for a package to estimate great circle distances would have
been time well spent at the outset of this project. But how do you search for packages?
Searching for R Packages
Building on the preceding example, how can you find out if there is a package to solve your
particular problem? The first stage is to guess: if it is a common problem, someone has
probably tried to solve it. The second stage is to search. A simple Google query, haversine
formula R, returned a link to the geosphere package in the second result (a hardcoded
implementation was first).
Beyond Google, there are also several sites for searching for packages and functions.
rdocumentation.org provides a multifield search environment to pinpoint the function or
package you need. Amazingly, the search for haversine in the Description field yielded 10
results from eight packages: R has at least eight implementations of the Haversine formula!
This shows the importance of careful package selection as there are often many packages that
do the same job, as we will see in the next section. There is also a way to find the function
from within R, with RSiteSearch(), which opens a URL in your browser linking to a number
of functions (40) and vignettes (2) that mention the text string:
# Search CRAN for mentions of haversine
RSiteSearch("haversine")
How to Select a Package
Due to the conservative nature of base R development, which rightly prioritizes stability over
innovation, much of the innovation and performance gains in the R ecosystem have occurred
in recent years in the packages. The increased ease of package development (Wickham 2015c)
and interfacing with other languages (Eddelbuettel et al. 2011) has accelerated their number,
quality, and efficiency. An additional factor has been the growth in collaboration and peer
review in package development, driven by code-sharing websites such as GitHub and online
communities such as rOpenSci for peer reviewing code.
Performance, stability, and ease of use should be high on the priority list when choosing
which package to use. Another more subtle factor is that some packages work better together
than others. The R package ecosystem is composed of interrelated packages. Knowing
something of these interdependencies can help you select a package suite when the project
demands a number of diverse yet interrelated programming tasks. The tidyverse, for example,
contains many interrelated packages that work well together, such as readr, tidyr, and dplyr.⁵
These may be used together to read, tidy, and then process the data, as outlined in the
subsequent sections.
There is no hard and fast rule about which package you should use and new packages are
emerging all the time. The ultimate test will be empirical evidence: does it get the job done on
your data? However, the following criteria should provide a good indication of whether a
package is worth an investment of your precious time, or even installing on your computer:
Is it mature?
The longer a package has been available, the more time it will have had for obvious bugs to
be ironed out. The age of a package on CRAN can be seen from its Archive page on CRAN.
We can see from the ggplot2 archive, for example, that ggplot2 was first released on
June 10, 2007 and that it has had 29 releases. The most recent of these at the time of
writing was ggplot2 2.1.0; reaching 1 or 2 in the first digit of package versions is
usually an indication from the package author that the package has reached a high level
of stability.
Is it actively developed?
It is a good sign if packages are frequently updated. A frequently updated package will
have its latest version published recently on CRAN. The CRAN package page for
ggplot2, for example, said Published: 2016-03-01, which was less than six months old
at the time of writing.
Is it well documented?
This is not only an indication of how much thought, care, and attention has gone into the
package, it also has a direct impact on its ease of use. Using a poorly documented
package can be inefficient due to the hours spent trying to work out how to use it! To
check if the package is well documented, look at the help pages associated with its key
functions (e.g., ?ggplot), try the examples (e.g., example(ggplot)), and search for
package vignettes (e.g., vignette(package = "ggplot2")).
Is it well used?
This can be seen by searching for the package name online. Most packages that have a
strong user base will produce thousands of results when typed into a generic search
engine such as Google. More specific (and potentially useful) indications of use will
narrow down the search to particular users. A package widely used by the programming
community will likely be visible on GitHub. At the time of writing, a search for ggplot2
on GitHub yielded over 400 repositories and almost 200,000 matches in committed code!
Likewise, a package that has been adopted for use in academia will tend to be mentioned
in Google Scholar (again, ggplot2 scores extremely well in this measure, with over
5,000 hits).
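Some of these checks can also be scripted for packages that are already installed. The following is a minimal sketch using only base R. The helper pkg_summary() is hypothetical (written for this illustration, not part of any package), and it is demonstrated on the utils package because that ships with R; packages that ship with R have no Date/Publication field, so that element is NULL here:

```r
# pkg_summary() is a hypothetical helper, not part of any package
pkg_summary = function(pkg) {
  desc = utils::packageDescription(pkg)
  version = utils::packageVersion(pkg)
  vignettes = utils::vignette(package = pkg)$results
  list(
    version = as.character(version),
    # First digit of the version number (a rough maturity signal)
    major_version = unclass(version)[[1]][1],
    # Date of the latest CRAN release; NULL for packages that ship with R
    published = desc[["Date/Publication"]],
    n_vignettes = nrow(vignettes)
  )
}

pkg_summary("utils")
```

Running the same function on a CRAN package you have installed (ggplot2, say) should fill in the published element, which gives a quick view of how recently it was updated.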
An article in simplystats discusses this issue with reference to the proliferation of GitHub
packages (those that are not available on CRAN). In this context, well-regarded and
experienced package creators and indirect data such as the amount of GitHub activity are also
highlighted as reasons to trust a package.
The websites of MRAN and METACRAN can help the package-selection process by
providing further information on each package uploaded to CRAN. METACRAN, for
example, provides metadata about R packages via a simple API and the provision of badges to
show how many downloads a particular package has per month. Returning to the Haversine
example given previously, we could find out how many times two packages that implement
the formula are downloaded each month with the following URLs:
http://cranlogs.r-pkg.org/badges/last-month/geosphere (downloads of geosphere)
http://cranlogs.r-pkg.org/badges/last-month/geoPlot (downloads of geoPlot)
It is clear from the results reported that geosphere is by far the more popular package, so it is a
sensible and mature choice for dealing with distances on the Earth's surface.
Publication
The final stage in a typical project workflow is publication. Although it's the final stage to be
worked on, that does not mean you should only document after the other stages are complete:
making documentation integral to your overall workflow will make this stage much easier
and more efficient.
Whether the final output is a report containing graphics produced by R, an online platform
for exploring results, or well-documented code that colleagues can use to improve their
workflow, starting it early is a good plan. In every case, the programming principles of
reproducibility, modularity, and DRY (don't repeat yourself) will make your publications
faster to write, easier to maintain, and more useful to others.
Instead of attempting a comprehensive treatment of the topic, we will touch briefly on a
couple of ways of documenting your work in R: dynamic reports and R packages. There is a
wealth of online resources on each of these; to avoid duplication of effort, the focus is on
documentation from a workflow-efficiency perspective.
Dynamic Documents with R Markdown
When writing a report using R outputs, a typical workflow has historically been to 1) do the
analysis, 2) save the resulting graphics and record the main results outside the R project, and
3) open a program unrelated to R such as LibreOffice to import and communicate the results
in prose. This is inefficient: it makes updating and maintaining the outputs difficult (when the
data changes, steps 1 to 3 will have to be done again) and there is overhead involved in
jumping between incompatible computing environments.
To overcome this inefficiency in the documentation of R outputs, the R Markdown framework
was developed. Used in conjunction with the knitr package, we have:
The ability to process code chunks (via knitr)
A notebook interface for R (via RStudio)
The ability to render output to multiple formats (via pandoc)
R Markdown documents are plain text and have the file extension .Rmd. This framework
allows for documents to be generated automatically. Furthermore, nothing is efficient unless
you can quickly redo it. Documenting your code inside dynamic documents in this way
ensures that analysis can be quickly rerun.
NOTE
This note briefly explains R Markdown for the uninitiated. R Markdown is a form of Markdown. Markdown is a
pure text document format that has become a standard for documentation for software. It is the default format for
displaying text on GitHub. R Markdown allows the user to embed R code in a Markdown document. This is a
powerful addition to Markdown, as it allows custom images, tables, and even interactive visualizations to be
included in your R documents. R Markdown is an efficient file format to write in because it is lightweight,
human- and computer-readable, and much less verbose than HTML and LaTeX. The first draft of this book was
written in R Markdown.
In an R Markdown document, results are generated on the fly by including code chunks. Code
chunks are R code preceded by ```{r, options} on the line before the R code and followed by
``` on the line after it. For example, suppose we have the code chunk
```{r eval=TRUE, echo=TRUE}
(1:5)^2
```
in an R Markdown document. The eval=TRUE in the code indicates that the code should be
evaluated, while echo=TRUE controls whether the R code is displayed. When we compile the
document, we get
(1:5)^2
#> [1]  1  4  9 16 25
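To see the whole cycle from the R console, the sketch below writes the chunk above into a minimal .Rmd file in a temporary directory and compiles it. The filename demo.Rmd and the YAML header are illustrative assumptions, and rendering is attempted only if the rmarkdown package and pandoc are available:

```r
# Build the three backticks programmatically so this listing stays readable
fence = strrep("`", 3)
rmd_lines = c(
  "---",
  "title: \"Demo\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r eval=TRUE, echo=TRUE}"),
  "(1:5)^2",
  fence
)
rmd_path = file.path(tempdir(), "demo.Rmd")  # hypothetical filename
writeLines(rmd_lines, rmd_path)

# Compile only when the required tools are installed
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_path = rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_path)  # path of the generated HTML file
}
```

Changing eval or echo to FALSE in the chunk header and rendering again is a quick way to see what each option does.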
R Markdown via knitr provides a wide range of options to customize what is displayed and
evaluated. Once you have adapted to this workflow, it is highly efficient, especially as RStudio
provides a number of shortcuts that make it easy to create and modify code chunks. To create
a chunk while editing an .Rmd file, for example, press Ctrl+Alt+I on Windows or Linux
(Cmd+Option+I on a Mac) or select the option from the Code drop-down menu.
Once your document has compiled, it should appear on your screen in the file format
requested. If an HTML file has been generated (as is the default), RStudio provides a feature
that allows you to put it up online rapidly. This is done using the rpubs website, a store of a
huge number of dynamic documents (which could be a good source of inspiration for your
publications). Assuming you have an RStudio account, clicking the Publish button at the top of
the HTML output window will instantly publish your work online, with a minimum of effort,
enabling fast and efficient communication with many collaborators and the public.
An important advantage of dynamically documenting work this way is that when the data or
analysis code changes, the results will be updated in the document automatically. This can
save hours of fiddly copying and pasting of R output between different programs. Also, if
your client wants pages and pages of documented output, knitr can provide them with a
minimum of typing (e.g., by creating slightly different versions of the same plot over and
over again). From a delivery of content perspective, that is certainly an efficiency gain
compared with hours of copying and pasting figures!
If your R Markdown documents include time-consuming processing stages, a speed boost can
be attained after the first build by setting knitr::opts_chunk$set(cache = TRUE) in the first
chunk of the document. This setting was used to reduce the build times of this book, as can be
seen on GitHub.
Furthermore, dynamic documents written in R Markdown can compile into a range of output
formats including HTML, PDF, and Microsoft's docx. There is a wealth of information on the
details of dynamic report writing that is not worth replicating here. Key references are
RStudio's excellent website on R Markdown hosted at rmarkdown.rstudio.com and, for a more
detailed account of dynamic documents with R, Dynamic Documents with R and Knitr by
Yihui Xie (CRC Press).
R Packages
A strict approach to project management and workflow is treating your projects as R
packages. This approach has advantages and limitations. The major risk with treating a
project as a package is that the package is quite a strict way of organizing work. Packages are
suited for code-intensive projects where code documentation is important. An intermediate
approach is to use a dummy package that includes a DESCRIPTION file in the root directory
telling project users which packages must be installed for the code to work. This book is
based on a dummy package so that we can easily keep the dependencies up-to-date (see the
book's DESCRIPTION file online for insight into how this works).
Creating packages is good practice in terms of learning to correctly document your code,
store example data, and even (via vignettes) ensure reproducibility. But it can take a lot of
extra time so should not be taken lightly. This approach to R workflow is appropriate for
managing complex projects that repeatedly use the same routines that can be converted into
functions. Creating project packages can provide a foundation for generalizing your code for
use by others, e.g., via publication on GitHub or CRAN. And R package development has been
made much easier in recent years by the development of the devtools package, which is
highly recommended for anyone attempting to write an R package.
A number of essential elements of R packages differentiate them from other R projects.
Three of these are outlined here from an efficiency perspective:
The DESCRIPTION file contains key information about the package, including which
packages are required for the code contained in your package to work (e.g., using
Imports:). This is efficient because it means that anyone who installs your package will
automatically install the other packages it depends on.
The R/ folder contains all the R code that defines your package's functions. Placing your
code in a single place and encouraging you to make your code modular in this way can
greatly reduce duplication of code on large projects. Furthermore, the documentation of
R packages through Roxygen tags such as #' This function does this... makes it
easy for others to use your work. This form of efficient documentation is facilitated by
the roxygen2 package.
The data/ folder contains example data for demonstrating to others how the functions
work and for transporting datasets that will be frequently used in your workflow. Data can be
added automatically to your package project using the devtools package, with
devtools::use_data() (in more recent versions this function lives in the usethis package, as
usethis::use_data()). This can increase efficiency by providing a way of distributing
small-to-medium-sized datasets and making them available when the package is loaded
with the function data("data_set_name").
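To make the first two elements concrete, the sketch below writes a minimal DESCRIPTION file with write.dcf() and defines one function documented with Roxygen tags. The package name mypkg, the deg2rad() function, and the choice of geosphere as a dependency are made up for illustration; a real package needs further fields (such as a license and author details):

```r
pkg_dir = file.path(tempdir(), "mypkg")  # hypothetical package directory
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# DESCRIPTION: Imports lists the packages installed alongside this one
description = data.frame(
  Package = "mypkg",
  Title = "Illustrative Project Package",
  Version = "0.0.1",
  Imports = "geosphere",
  stringsAsFactors = FALSE
)
write.dcf(description, file.path(pkg_dir, "DESCRIPTION"))

# R/deg2rad.R: Roxygen tags (#') sit directly above the function
writeLines(c(
  "#' Convert degrees to radians",
  "#'",
  "#' @param deg Numeric vector of angles in degrees.",
  "#' @return Numeric vector of angles in radians.",
  "#' @export",
  "deg2rad = function(deg) deg * pi / 180"
), file.path(pkg_dir, "R", "deg2rad.R"))

# Check the result: read the dependency field back and run the function
imports = read.dcf(file.path(pkg_dir, "DESCRIPTION"), fields = "Imports")
imports
source(file.path(pkg_dir, "R", "deg2rad.R"))
deg2rad(180)
```

In practice the skeleton is usually generated by tools such as devtools rather than written by hand; the point here is only to show how little is needed for the dependency information and the documentation to live next to the code.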
The testthat package makes it easier than ever to test your R code as you go, ensuring that
nothing breaks. This, combined with continuous integration services such as that provided by
Travis, makes updating your codebase as efficient and robust as possible. This, and more, is
described in Testing R Code by Richard Cotton (CRC Press).
As with dynamic documents, package development is a large topic. For small one-off projects,
the time taken in learning how to set up a package may not be worth the savings. However,
packages provide a rigorous way of storing code, data, and documentation that can greatly
boost productivity in the long run. For more on R packages, see R Packages by Hadley
Wickham (O'Reilly); the online version provides all you need to know about writing R
packages for free.
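As a minimal sketch of what such a test looks like, the following checks a small helper function. The function deg2rad() is a made-up example, and the block falls back to base R's stopifnot() when testthat is not installed:

```r
# Made-up function under test
deg2rad = function(deg) deg * pi / 180

if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("deg2rad() converts known angles", {
    testthat::expect_equal(deg2rad(180), pi)
    testthat::expect_equal(deg2rad(c(0, 90)), c(0, pi / 2))
  })
} else {
  # Base R fallback: stops with an error if any condition is not TRUE
  stopifnot(isTRUE(all.equal(deg2rad(180), pi)))
}
```

Inside a package, tests like this would normally live in files under tests/testthat/ so that they run automatically whenever the package is checked.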
References
Berkun, Scott. 2005. The Art of Project Management. O'Reilly Media.
Kruchten, Philippe, Robert L Nord, and Ipek Ozkaya. 2012. “Technical Debt: From Metaphor
to Theory and Practice.” IEEE Software, no. 6. IEEE: 18–21.
PMBoK, A. 2000. “Guide to the Project Management Body of Knowledge.” Project
Management Institute, Pennsylvania USA.
Wickham, Hadley. 2015c. R Packages. O'Reilly Media.
Eddelbuettel, Dirk, Romain François, J. Allaire, John Chambers, Douglas Bates, and Kevin
Ushey. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8):
1–18.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Vol. 29. CRC Press.
Cotton, Richard. 2016b. Testing R Code.
1 The Oxford Dictionary's definition of workflow is similar, with a more industrial feel: “The sequence of industrial,
administrative, or other processes through which a piece of work passes from initiation to completion.”
2 Thanks to Richard Cotton for suggesting this typology.
3 The importance of workflow has not gone unnoticed by the R community, and there are a number of different suggestions
to boost R productivity. Rob Hyndman, for example, advocates the strategy of using four self-contained scripts to break up
R work into manageable chunks: load.R, clean.R, func.R, and do.R.
4 For a more comprehensive discussion of Gantt charts in R, please refer to stackoverflow.com/questions/3550341.
5 An excellent overview of the tidyverse, formerly known as the hadleyverse, and its benefits is available from
barryrowlingson.github.io/hadleyverse.
Chapter 5. Efficient Input/Output
This chapter explains how to efficiently read and write data in R. Input/output (I/O) is the
technical term for reading and writing data: the process of getting information into a
particular computer system (in this case, R) and then exporting it to the outside world again
(in this case, as a file format that other software can read). Data I/O will be needed on projects
where data comes from, or goes to, external sources. However, the majority of R resources
and documentation start with the optimistic assumption that your data has already been loaded,
ignoring the fact that importing datasets into R and exporting them to the world outside the R
ecosystem can be a time-consuming and frustrating process. Tricky, slow, or ultimately
unsuccessful data I/O can cripple efficiency right at the outset of a project. Conversely,
reading and writing your data efficiently will make your R projects more likely to succeed in
the outside world.
The first section introduces rio, a metapackage for efficiently reading and writing data in a
range of file formats. rio requires only two intuitive functions for data I/O, making it efficient
to learn and use. Next, we explore in more detail efficient functions for reading files stored in
common plain-text file formats from the readr and data.table packages. Binary formats,
which can dramatically reduce file sizes and read/write times, are covered next.
With the accelerating digital revolution and growth in open data, an increasing proportion of
the world's data can be downloaded from the internet. This trend is set to continue, making
the section “Getting Data from the Internet”, on downloading and importing data from the web,
important for future-proofing your I/O skills. The benchmarks in this chapter demonstrate that
the choice of file format and packages for data I/O can have a huge impact on computational
efficiency.
Before reading in a single line of data, it is worth considering a general principle for
reproducible data management: never modify raw data files. Raw data should be seen as
read-only, and should contain information about its provenance. Keeping the original filename
and commenting on its source are a couple of ways to improve reproducibility, even when the
data are not publicly available.
Prerequisites
R can read data from a variety of sources. We begin by discussing the generic package rio
that handles a wide variety of data types. Special attention is paid to CSV files, which leads to
the readr and data.table packages. The relatively new package feather is introduced as a
binary file format that has cross-language support.
library("rio")
library("readr")
library("data.table")
library("feather")
We also use the WDI package to illustrate accessing online datasets:
library("WDI")
Top Five Tips for Efficient Data I/O
1. If possible, keep the names of local files downloaded from the internet or copied
onto your computer unchanged. This will help you trace the provenance of the data in
the future.
2. R's native file format is .Rds. These files can be imported and exported using
readRDS() and saveRDS() for fast and space-efficient data storage.
3. Use import() from the rio package to efficiently import data from a wide range of
formats, avoiding the hassle of loading format-specific libraries.
4. Use readr or data.table equivalents of read.table() to efficiently import large text
files.
5. Use file.size() and object.size() to keep track of the size of files and R objects
and take action if they get too big.
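Tips 2 and 5 can be tried out with base R alone. The following sketch uses the built-in mtcars dataset and temporary files (both chosen purely for illustration) to save and reload an .Rds file and to compare sizes on disk and in memory:

```r
# Any data frame will do; mtcars ships with R
rds_path = tempfile(fileext = ".Rds")
csv_path = tempfile(fileext = ".csv")

saveRDS(mtcars, rds_path)     # tip 2: R's native binary format
write.csv(mtcars, csv_path)   # plain-text equivalent for comparison
mtcars_rds = readRDS(rds_path)

# Tip 5: sizes on disk (in bytes) and in memory
sizes = c(rds = file.size(rds_path), csv = file.size(csv_path))
sizes
format(object.size(mtcars_rds), units = "Kb")

# The object survives the round trip unchanged
identical(mtcars, mtcars_rds)
```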
Versatile Data Import with rio
rio is a veritable multitool for data I/O. rio provides easy-to-use and computationally efficient
functions for importing and exporting tabular data in a range of file formats. As stated in the
package's vignette, rio aims to “simplify the process of importing data into R and exporting
data from R.” The vignette goes on to explain how many of the functions for data I/O
described in R's Data Import/Export manual are outdated (e.g., referring to WriteXLS but not
the more recent readxl package) and difficult to learn.
This is why rio is covered at the outset of this chapter: if you just want to get data into R with a
minimum of time learning new functions, there is a fair chance that rio can help for many
common file formats. At the time of writing, these include .csv, .feather, .json, .dta, .xls, .xlsx,
and Google Sheets (see the package's GitHub page for up-to-date information). In the
following example, we illustrate the key rio functions of import() and export():
library("rio")
# Specify a file
fname = system.file("extdata/voc_voyages.tsv", package = "efficient")
# Import the file (uses the fread function from data.table)
voyages = import(fname)
# Export the file as an Excel spreadsheet
export(voyages, "voc_voyages.xlsx")
There was no need to specify the optional format argument for the data import and export
functions because the format is inferred from the suffix, which, in the previous example, is
.tsv and .xlsx, respectively. You can override the inferred file format for both functions with
the format argument. You could, for example, create a comma-delimited file called
voc_voyages.xlsx with export(voyages, "voc_voyages.xlsx", format = "csv"). However,
this would not be a good idea because it is important to ensure that a file's suffix matches its
format.
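The suffix-based inference can be tried on any data frame. In the sketch below, the temporary filenames are arbitrary and the rio calls run only if the package is installed; tools::file_ext() from base R shows the suffix from which the format is inferred:

```r
csv_path = file.path(tempdir(), "mtcars_demo.csv")
txt_path = file.path(tempdir(), "mtcars_demo.txt")
tools::file_ext(csv_path)  # the suffix used to infer the format

if (requireNamespace("rio", quietly = TRUE)) {
  rio::export(mtcars, csv_path)                  # format inferred from .csv
  rio::export(mtcars, txt_path, format = "csv")  # suffix overridden (not advised)
  cars_csv = rio::import(csv_path)
  cars_txt = rio::import(txt_path, format = "csv")
  print(all.equal(cars_csv, cars_txt))
}
```

The second export illustrates why overriding the format is rarely a good idea: anyone importing the .txt file later has to know that it is really comma-delimited.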
To provide another example, the following code chunk downloads information about the
countries of the world stored in .json and imports it as a data frame (downloading data from
the internet is covered in more detail in “Getting Data from the Internet”):
caps = import("https://github.com/mledoze/countries/raw/master/countries.json")
TIP
The ability to import and use .json data is increasingly important, as JSON is a standard output format for
many APIs. The jsonlite and geojsonio packages have been developed to make this as easy as possible.
Exercises
1. The final line of the first code chunk in this section (the call to export()) shows a neat
feature of rio and some other packages: the output format is determined by the suffix of the
filename, which makes for concise code. Try opening the voc_voyages.xlsx file with an
editor such as LibreOffice Calc or Microsoft Excel to ensure that the export worked, before
removing this rather inefficient file format from your system:
file.remove("voc_voyages.xlsx")
2. Try saving the voyages data frame in three other file formats of your choosing
(see vignette("rio") for supported formats). Try opening these in external
programs. Which file formats are more portable?
3. As a bonus exercise, create a simple benchmark to compare the write times for the
different file formats used to complete the previous exercise. Which is fastest? Which
is the most space-efficient?
Plain-Text Formats
Plain-text data files are encoded in a format (typically UTF-8) that can be read by humans and
computers alike. The great thing about plain text is its simplicity and ease of use: any
programming language can read a plain-text file. The most common plain-text format is .csv,
comma-separated values, in which columns are separated by commas and rows are separated
by line breaks. This is illustrated in the simple example here:
Person,Nationality,Country of Birth
Robin,British,England
Colin,British,Scotland
There is often more than one way to read data into R, and .csv files are no exception. The
method you choose has implications for computational efficiency. This section investigates
methods for getting plain-text files into R, with a focus on three approaches: base R's
plain-text reading functions such as read.csv(); the data.table approach, which uses the function
fread(); and the newer readr package, which provides read_csv() and other read_*()
functions such as read_tsv(). Although these functions perform differently, they are largely
cross-compatible, as illustrated in the following code chunk, which loads data on the
concentration of CO2 in the atmosphere over time:
df_co2 = read.csv("extdata/co2.csv")
df_co2_readr = readr::read_csv("extdata/co2.csv")
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#>   X1 = col_integer(),
#>   time = col_double(),
#>   co2 = col_double()
#> )
df_co2_dt = data.table::fread("extdata/co2.csv")
WARNING
In general, you should never “hand-write” a CSV file. Instead, you should use write.csv() or an equivalent
function. The Internet Engineering Task Force has a CSV definition that facilitates sharing CSV files between
tools and operating systems.
NOTE
Note that a function derived from another in this context means that it calls another function. The functions such as
read.csv() and read.delim(), in fact, are wrappers around read.table(). This can be seen in the source code of
read.csv(), for example, which shows that the function is roughly the equivalent of read.table(file, header =
TRUE, sep = ",").
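The wrapper relationship can be checked directly. The sketch below reads the small example CSV shown earlier, first with read.csv() and then with read.table() using the arguments that read.csv() sets by default (besides header and sep, these are quote, dec, fill, and comment.char), and compares the results:

```r
csv_text = "Person,Nationality,Country of Birth
Robin,British,England
Colin,British,Scotland"

via_csv = read.csv(text = csv_text)
via_table = read.table(text = csv_text, header = TRUE, sep = ",",
                       quote = "\"", dec = ".", fill = TRUE,
                       comment.char = "")

identical(via_csv, via_table)
names(via_csv)  # spaces in the header are replaced with dots
```

Typing read.csv at the console (without parentheses) prints the function's source, which is the quickest way to confirm which defaults a wrapper changes.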
Although this section is focused on reading text files, it demonstrates the wider principle that
the speed and flexibility advantages of additional read functions can be offset by the
disadvantages of additional package dependency (in terms of complexity and maintaining the
code) for small datasets. The real benefits kick in on large datasets. Of course, there are some
data types that require a certain package to load in R: the readstata13 package, for example,
was developed solely to read in .dta files generated by versions of Stata 13 and above.
Figure 5-1 demonstrates that the relative performance gains of the data.table and readr
approaches increase with data size, especially for data with many rows. Below around 1 MB,
read.csv() is actually faster than read_csv(), while fread() is much faster than both,
although these savings are likely to be inconsequential for such small datasets.
For files above 100 MB in size, fread() and read_csv() can be expected to be around five
times faster than read.csv(). This efficiency gain may be inconsequential for a one-off file
of 100 MB running on a fast computer (which still takes less than a minute with read.csv()),
but could represent an important speed-up if you frequently load large text files.
When tested on a large (4 GB) .csv file, it was found that fread() and read_csv() were
almost identical in load times and that read.csv() took about five times longer. Reading the
file consumed more than 10 GB of RAM, making it unsuitable to run on many computers (see
“Random Access Memory” for more on memory). Note that both readr and base methods can
be made significantly faster by prespecifying the column types at the outset (see the following
code chunk). Further details are provided by the help in ?read.table.
read.csv(file_name,colClasses=c("numeric","numeric"))
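The line above assumes an existing file_name. As a self-contained sketch, the following writes a small two-column file with made-up values to a temporary location and compares the column classes that read.csv() guesses with those obtained when colClasses is specified:

```r
# Made-up values, written to a temporary file for illustration
csv_path = tempfile(fileext = ".csv")
write.csv(data.frame(time = 1959:1963,
                     co2 = c(315.4, 316.9, 317.6, 318.5, 319.0)),
          csv_path, row.names = FALSE)

guessed = read.csv(csv_path)
specified = read.csv(csv_path, colClasses = c("numeric", "numeric"))

sapply(guessed, class)    # time is guessed to be integer
sapply(specified, class)  # both columns are numeric, as requested
```

On a file this small the speed difference is negligible; the benefit of skipping the type-guessing step only becomes noticeable on large files.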
In some cases with R programming, there is a trade-off between speed and robustness. This is
illustrated here with reference to differences in how readr, data.table, and base R handle
unexpected values. Figure 5-1 highlights the benefit of switching to fread() and (eventually)
to read_csv() as the dataset size increases. For a small (1 MB) dataset, fread() is about five
times faster than base R.
Figure 5-1. Benchmarks of base, data.table, and readr approaches for reading CSV files, using the functions read.csv(),
fread(), and read_csv(), respectively. The facets ranging from 2 to 200 represent the number of columns in the CSV file.
Differences Between fread() and read_csv()
The file voc_voyages was taken from a dataset on Dutch naval expeditions and used with
permission from the CWI Database Architectures Group. The data is described more fully at
monetdb.org. From this dataset, we primarily use the voyages table, which lists Dutch
shipping expeditions by their date of departure.
fname = system.file("extdata/voc_voyages.tsv", package = "efficient")
voyages_base = read.delim(fname)
When we run the equivalent operation using readr,
voyages_readr = readr::read_tsv(fname)
#> Parsed with column specification:
#> cols(
#>   .default = col_character(),
#>   number = col_integer(),
#>   trip = col_integer(),
#>   tonnage = col_integer(),
#>   departure_date = col_date(format = ""),
#>   cape_arrival = col_date(format = ""),
#>   cape_departure = col_date(format = ""),
#>   arrival_date = col_date(format = ""),
#>   next_voyage = col_integer()
#> )
#> See spec(...) for full column specifications.
#> Warning: 2 parsing failures.
#>  row            col  expected   actual
#> 4403   cape_arrival date like  2-01-01
#> 4592 cape_departure date like  8-05-17
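warnings are raised about values that could not be parsed as the guessed column type (in the
output shown, two malformed dates in the cape_arrival and cape_departure columns). This is
because read_*() decides what class each variable is based on the first 1,000 rows, rather than
all rows, as base read.*() functions do. The built variable, which contains a date-like string
in row 2841, shows how the approaches differ. Printing that element:
voyages_base$built[2841] # a factor.
#> [1] 1721-01-01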
#>182Levels:17841,8611351594160016121613161416151619...taken1672
voyages_readr$built[2841] # a character string: built was read as a character column
#> [1] "1721-01-01"
Reading the file using data.table:
# Verbose warnings not shown
voyages_dt = data.table::fread(fname)
generates five warning messages stating that columns 2, 4, 9, 10, and 11 were Bumped to type
character on data row ..., with the offending rows printed in place of .... Instead of
changing the offending values to NA, as readr does when a value cannot be parsed as the
guessed type, fread() automatically converts any columns it initially considered numeric into
characters. An additional feature of fread() is that it can read in a selection of the columns,
either by their index or name, using the select argument. This is illustrated in the following
code by reading in only half (the first 11) of the columns from the voyages dataset and
comparing the result to using fread() on all columns.
library("microbenchmark")
microbenchmark(times = 5,
  with_select = data.table::fread(fname, select = 1:11),
  without_select = data.table::fread(fname)
)
#> Unit: milliseconds
#>            expr   min    lq  mean median    uq   max neval
#>     with_select  9.52  9.58  9.68   9.71  9.74  9.86     5
#>  without_select 16.02 16.45 16.57  16.64 16.76 16.98     5
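Base R has no select argument, but read.csv() and read.table() skip any column whose colClasses entry is "NULL". The sketch below shows this comparable technique on a small temporary file with made-up columns rather than on the voyages data:

```r
csv_path = tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3,
                     name = c("a", "b", "c"),
                     value = c(1.5, 2.5, 3.5)),
          csv_path, row.names = FALSE)

# NA means "guess the type as usual"; "NULL" means "skip this column"
first_two = read.csv(csv_path, colClasses = c(NA, NA, "NULL"))
names(first_two)
```

As with select in fread(), skipping unneeded columns reduces both the time taken to parse the file and the memory used by the result.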
To summarize, the differences between base, readr, and data.table functions for reading in
data go beyond code execution times. The functions read_csv() and fread() boost speed
partially at the expense of robustness because they decide column classes based on a small
sample of available data. The similarities and differences between the approaches are
summarized for the Dutch shipping data in Table 5-1.
Table 5-1. Comparison of the classes created by base, readr, and data.table for a selection of
variables in the voyages dataset
Packages    number   boatname   built      departure_date
base        integer  factor     factor     factor
readr       integer  character  character  date
data.table  integer  character  character  character
Table 5-1 shows four main similarities and differences between the three types of read
functions:
For uniform data such as the number variable in Table 5-1, all reading methods yield the
same result (integer, in this case).
For columns that are obviously characters such as boatname, the base method results in
factors (unless stringsAsFactors is set to FALSE, which became the default in R 4.0.0),
whereas fread() and read_csv() functions return characters.
For columns in which the first 1,000 rows are of one type but which contain anomalies,
such as built and departure_date in the shipping example, fread() coerces the result to
characters. read_csv() and siblings, by contrast, keep the class that is correct for the first
1,000 rows and set the anomalous records to NA. This is illustrated in Table 5-1, where
read_tsv() produces a date class for the departure_date variable, whereas fread()
returns characters.
read_*() functions generate objects of class tbl_df, an extension of the data.frame
class, as discussed in “Efficient Data Processing with dplyr”. fread() generates objects
of class data.table. These can be used as standard data frames but differ subtly in
their behavior.
The tbl_df class that read_csv() adds on top of data.frame makes no practical
difference unless the tibble package is loaded, as described in “Efficient Data Frames with
tibble” in the next chapter.
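The effect of where a reader looks when deciding column types can be reproduced with a small synthetic file. In the sketch below, the file and its built column are invented for illustration: 1,500 integer years followed by one date-like string. Base R inspects every row before deciding; the readr call, which runs only if readr is installed, shows one way to avoid relying on a guess by declaring the column type explicitly:

```r
csv_path = tempfile(fileext = ".csv")
built = c(as.character(rep(1700L, 1500)), "1721-01-01")
write.csv(data.frame(built = built), csv_path,
          row.names = FALSE, quote = FALSE)

# Base R reads all rows before deciding, so the column is not numeric
base_df = read.csv(csv_path)
class(base_df$built)

if (requireNamespace("readr", quietly = TRUE)) {
  # Declaring the type up front sidesteps guessing altogether
  readr_df = readr::read_csv(
    csv_path,
    col_types = readr::cols(built = readr::col_character())
  )
  print(class(readr_df$built))
}
```

Specifying types in this way costs a little typing but makes the import both faster and more predictable, which matters most for files that are read repeatedly.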
The wider point associated with these tests is that functions that save time can also lead to
additional considerations or complexities in your workflow. Taking a look at what is going
on under the hood of fast functions to increase speed, as we have done in this section, can help
you understand the additional consequences of choosing fast functions over slower functions
from base R.
Preprocessing Text Outside R
There are circumstances when datasets become too large to read directly into R. Reading in a
4 GB text file using the functions tested previously, for example, consumes all available RAM
on a 16 GB machine. To overcome this limitation, external stream processing tools can be
used to preprocess large text files. The following command, using the Linux command line
shell (or Windows-based Linux shell emulator Cygwin) command split, for example, will
break a large multi-GB file into many 100 MB chunks, each of which is more manageable for R:
split -b100m bigfile.csv
The result is a series of files, set to 100 MB each with the -b100m argument in the previous
code. By default, these will be called xaa, xab, and so on, and could be read in one chunk at a
time (e.g., using read.csv(), fread(), or read_csv(), described in the previous section) without
crashing most modern computers.
Splitting a large file into individual chunks may allow it to be read into R. This is not an
efficient way to import large datasets, however, because each chunk is a nonrandom sample of
the data. A more efficient, robust, and scalable way to work with large datasets is via
databases, covered in “Working with Databases” in the next chapter.
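Where external tools are unavailable, a file can also be processed in pieces from within R by reading a fixed number of lines at a time from an open connection. This is a base R sketch of the same chunking idea rather than the split approach itself; the small temporary file stands in for a large one, and the chunk size of 4 lines is arbitrary:

```r
csv_path = tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10)^2), csv_path,
          row.names = FALSE, quote = FALSE)

con = file(csv_path, open = "r")
header = strsplit(readLines(con, n = 1), ",")[[1]]

total = 0
repeat {
  lines = readLines(con, n = 4)     # next chunk of raw text lines
  if (length(lines) == 0) break     # end of file reached
  chunk = read.csv(text = lines, header = FALSE, col.names = header)
  total = total + sum(chunk$value)  # only a running summary is kept
}
close(con)
total
```

Because the connection stays open between calls, each readLines() call continues where the previous one stopped. Unlike split with the -b option, which cuts at a fixed number of bytes, this approach always breaks the file at line boundaries.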
Binary File Formats
There are limitations to plain-text files. Even the trusty CSV format is “restricted to tabular
data, lacks type-safety, and has limited precision for numeric values” (Eddelbuettel, Stokely,
and Ooms 2016). Once you have read in the raw data (e.g., from a plain-text file) and tidied it
(covered in the next chapter), it is common to want to save it for future use. Saving it after
tidying is recommended to reduce the chance of having to run all the data-cleaning code
again. We recommend saving tidied versions of large datasets in one of the binary formats
covered in this section as this will decrease read/write times and file sizes, making your data
more portable.¹
Unlike plain-text files, data stored in binary formats cannot be read by humans. This allows
space-efficient data compression, but means that the files will be less language-agnostic. R's
native file format, .Rds, for example, may be difficult to read and write using external
programs such as Python or LibreOffice Calc. This section provides an overview of binary
file formats in R, with benchmarks to show how they compare with the plain-text format .csv
covered in the previous section.
Native Binary Formats: Rdata or Rds?
.Rds and .RData are R's native binary file formats. These formats are optimized for speed and
compression ratios. But what is the difference between them? The following code chunk
demonstrates the key difference:
save(df_co2, file = "extdata/co2.RData")
saveRDS(df_co2, "extdata/co2.Rds")
load("extdata/co2.RData")
df_co2_rds = readRDS("extdata/co2.Rds")
identical(df_co2, df_co2_rds)
#> [1] TRUE
The first method is the most widely used. It uses the save() function, which takes any number
of R objects and writes them to a file, which must be specified by the file = argument.
save() is similar to save.image(), except that save.image() saves all the objects currently
loaded in R.
The second method is slightly less used, but we recommend it. Apart from being slightly
more concise for saving single R objects, the readRDS() function is more flexible; as shown
in the subsequent line, the resulting object can be assigned to any name. In this case, we called
it df_co2_rds (which we show to be identical to df_co2, loaded with the load() command),
but we could have called it anything or simply printed it to the console.
Using saveRDS() is good practice because it forces you to specify object names. If you use
save() without care, you could forget the names of the objects you saved and accidentally
overwrite objects that already exist.
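The risk of accidental overwriting is easy to demonstrate. In the following base R sketch (the object names and temporary files are arbitrary), an object is saved in both formats and then changed in the session; readRDS() leaves the changed object alone, whereas load() silently replaces it:

```r
x = 1:3
rdata_path = tempfile(fileext = ".RData")
rds_path = tempfile(fileext = ".Rds")
save(x, file = rdata_path)
saveRDS(x, rds_path)

x = "work in progress"    # x changes later in the session

y = readRDS(rds_path)     # restored under a name of our choosing
x_before_load = x         # still "work in progress"

load(rdata_path)          # silently overwrites the current x
x
```

If you do need to inspect an .RData file safely, load() accepts an envir argument, so the objects can be loaded into a separate environment (e.g., one created with new.env()) instead of the workspace.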
The Feather File Format
Feather was developed as a collaboration between R and Python developers to create a fast,
light, and language-agnostic format for storing data frames. The following code chunks show
how it can be used to save and then reload the df_co2 dataset loaded previously, first in R and
then in Python:
library("feather")
write_feather(df_co2, "extdata/co2.feather")
df_co2_feather = read_feather("extdata/co2.feather")
# Python
import feather
path = 'extdata/co2.feather'
df_co2_feather = feather.read_dataframe(path)
Benchmarking Binary File Formats
We know that binary formats are advantageous from space and read/write time perspectives,
but how much so? The benchmarks in this section, based on large matrices containing
random numbers, are designed to help answer this question. Figure 5-2 shows the relative
efficiency gains of the feather and Rds formats, compared with base CSV. From left to right,
Figure 5-2 shows benefits in terms of file size, read times, and write times.
In terms of file size, Rds files perform the best, occupying just over a quarter of the hard
disk space compared with the equivalent CSV files. The equivalent feather format also
outperformed the CSV format, occupying around half the disk space.
The results of this simple disk usage benchmark show that saving data in a compressed binary
format can save space and, if your data will be shared online, reduce download times and
bandwidth usage. But how does each method compare from a computational
efficiency perspective? The read and write times for each file format are illustrated in the
middle and right-hand panels of Figure 5-2.
Figure 5-2. Comparison of the performance of binary formats for reading and writing datasets with 20 columns with the
plain-text format CSV; the functions used to read the files were read.csv(), readRDS(), and feather::read_feather(),
respectively. The functions used to write the files were write.csv(), saveRDS(), and feather::write_feather().
The results show that file size is not a reliable predictor of data read and write times. This is
due to the computational overheads of compression. Although feather files occupied more
disk space, they were roughly equivalent in terms of read times: the functions read_feather()
and readRDS() were consistently around 10 times faster than read.csv(). In terms of write
times, feather excels: write_feather() was around 10 times faster than write.csv(), whereas
saveRDS() was only around 1.2 times faster.
NOTE
Note that the performance of different file formats depends on the content of the data being saved. The
benchmarks here showed savings for matrices of random numbers. For real-life data, the results would be quite
different. The voyages dataset, saved as an Rds file, occupied less than a quarter of the disk space of the original
TSV file, whereas the file size was larger than the original when saved as a feather file!
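Because results depend so heavily on the data, it is worth running a quick benchmark on your own files. The sketch below is a simplified, base-R-only version of the comparison (it omits feather and uses system.time() rather than a benchmarking package); the data frame of random numbers and the temporary file paths are illustrative assumptions, and timings will vary between machines:

```r
set.seed(1)
# 10,000 rows and 20 columns of random numbers
m = as.data.frame(matrix(runif(1e4 * 20), ncol = 20))
csv_path = tempfile(fileext = ".csv")
rds_path = tempfile(fileext = ".Rds")

write_times = c(
  csv = system.time(write.csv(m, csv_path, row.names = FALSE))[["elapsed"]],
  rds = system.time(saveRDS(m, rds_path))[["elapsed"]]
)
read_times = c(
  csv = system.time(read.csv(csv_path))[["elapsed"]],
  rds = system.time(readRDS(rds_path))[["elapsed"]]
)
sizes = c(csv = file.size(csv_path), rds = file.size(rds_path))

rbind(write_times, read_times)  # elapsed time in seconds
sizes / 1024^2                  # file sizes in megabytes
```

Replacing m with one of your own tidied datasets gives a more realistic picture than random numbers, which compress poorly.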
Protocol Buffers
Google's Protocol Buffers offer a portable, efficient, and scalable solution to binary data
storage. A recent package, RProtoBuf, provides an R interface. This approach is not covered
in this book, as it is new, advanced, and not (at the time of writing) widely used in the R
community. The approach is discussed in detail in a paper on the subject, which also provides
an excellent overview of the issues associated with different file formats (Eddelbuettel,
Stokely, and Ooms 2016).
Getting Data from the Internet
The following code chunk shows how the functions download.file() (footnote 2) and unzip() can be used to
download and unzip a dataset from the internet. R can automate processes that are often
performed manually (e.g., through the graphical user interface of a web browser) with
potential advantages for reproducibility and programmer efficiency. The result is data stored
neatly in the data directory ready to be imported. Note that we deliberately kept the filename
intact to help with documentation, enhancing understanding of the data’s provenance, so future
users can quickly find out where the data came from. Note also that part of the dataset is
stored in the efficient package. Using R for basic file management can help create a
reproducible workflow, as illustrated here:
url = "https://www.monetdb.org/sites/default/files/voc_tsvs.zip"
download.file(url, "voc_tsvs.zip") # download file
unzip("voc_tsvs.zip", exdir = "data") # unzip files
file.remove("voc_tsvs.zip") # tidy up by removing the zip file
This workflow equally applies to downloading and loading single files. Note that one could
make the code more concise by replacing the second line with df = read.csv(url).
However, we recommend downloading the file to disk so that if for some reason the import fails (e.g.,
if you would like to skip the first few lines), you don’t have to keep downloading the file over
and over again. The following code downloads and loads data on atmospheric concentrations
of CO2. Note that this dataset is also available from the datasets package.
library("readr") # provides read_csv()
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv"
download.file(url, "extdata/co2.csv")
df_co2 = read_csv("extdata/co2.csv")
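The advice to download once and then work from the local copy can be wrapped in a small helper. The following is a sketch under stated assumptions: download_once() is a hypothetical name introduced here (it is not part of base R or any package), and a local file:// URL on a Unix-alike system stands in for a real web address so that the example runs offline.

```r
# Hypothetical helper: download a file only if no local copy exists yet
download_once <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile, ...)
  }
  invisible(destfile)
}

# Stand-in for a remote dataset: a small CSV served through a file:// URL
src <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), src, row.names = FALSE)
src_url <- paste0("file://", normalizePath(src, winslash = "/"))

dest <- file.path(tempdir(), "cache", "example.csv")
download_once(src_url, dest, quiet = TRUE) # first call copies the file
download_once(src_url, dest, quiet = TRUE) # second call does nothing
df_example <- read.csv(dest)
```

Because the import step reads from dest, arguments such as skip can be adjusted and rerun without touching the network again.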
There are now many R packages to assist with the download and import of data. The
organization rOpenSci supports a number of these. The following example illustrates this
using the WDI package (not supported by rOpenSci) to access World Bank data on CO2
emissions in the transport sector:
library("WDI")
WDIsearch("CO2") # search for data on a topic
co2_transport = WDI(indicator = "EN.CO2.TRAN.ZS") # import data
There will be situations where you cannot download the data directly or when the data cannot
be made available. In this case, simply providing a comment relating to the data’s origin (e.g.,
# Downloaded from http://example.com) before referring to the dataset can greatly improve
the utility of the code to yourself and others.
There are a number of R packages that provide more advanced functionality than simply
downloading files. The CRAN task view on web technologies provides a comprehensive list.
The two main packages for interacting with web pages are httr and RCurl. The former package
provides a (relatively) user-friendly interface for executing standard HTTP methods such as
GET and POST. It also provides support for web authentication protocols and returns HTTP
status codes that are essential for debugging. The RCurl package focuses on lower-level
support and is particularly useful for web-based XML support or FTP operations.
Accessing Data Stored in Packages
Most well-documented packages provide some example data for you to play with. This can
help demonstrate use cases in specific domains that use a particular data format. The
command data(package = "package_name") will show the datasets in a package. Datasets
provided by dplyr, for example, can be viewed with data(package = "dplyr").
Raw data (i.e., data that has not been converted into R’s native .Rds format) is usually located
within the subfolder extdata in R, which corresponds to inst/extdata when developing
packages. The function system.file() outputs file paths associated with specific packages. To
see all the external files within the readr package, for example, you could use the following
command:
list.files(system.file("extdata", package = "readr"))
#> [1] "challenge.csv"     "compound.log"      "epa78.txt"
#> [4] "example.log"       "fwf-sample.txt"    "massey-rating.txt"
#> [7] "mtcars.csv"        "mtcars.csv.bz2"    "mtcars.csv.zip"
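The same lookup can be tried without installing anything, using a package that ships with R. This is a small sketch; the choice of the stats package and the object names are assumptions made for illustration.

```r
# system.file() returns the full path of a file inside an installed package
desc_path <- system.file("DESCRIPTION", package = "stats")

# A file that does not exist yields an empty string rather than an error
missing_path <- system.file("extdata", "no-such-file.csv", package = "stats")

# List everything at the top level of the package's installation directory
pkg_files <- list.files(system.file(package = "stats"))
```

Checking for the empty string (for example with nzchar()) is a simple way to test whether a package actually provides a given file before trying to read it.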
Further, to look around to see what files are stored in a particular package, you could type the
following, taking advantage of RStudio’s intellisense file completion capabilities (using copy
and paste to enter the file path):
system.file(package = "readr")
#> [1] "/home/robin/R/x86_64-pc-linux-gnu-library/3.3/readr"
Hitting Tab after the second command should trigger RStudio to create a miniature pop-up
box listing the files within the folder, as illustrated in Figure 5-3.
Figure 5-3. Discovering files in R packages using RStudio’s intellisense
References
Eddelbuettel, Dirk, Murray Stokely, and Jeroen Ooms. 2016. “RProtoBuf: Efficient Cross-Language
Data Serialization in R.” Journal of Statistical Software 71 (2): 1–24.
doi:10.18637/jss.v071.i02.
1. Geographical data, for example, can be slow to read in external formats. A large .shp or .geojson file can take more than
100 times longer to load than an equivalent .Rds or .Rdata file.
2. Since R 3.2.3 the base function download.file() can be used to download from secure (https://) connections on any
operating system.
Chapter 6. Efficient Data Carpentry
There are many words for data processing. You can clean, hack, manipulate, munge, refine,
and tidy your dataset, ready for the next stage. Each word says something about perceptions
that people have about the process: data processing is often seen as dirty work, an unpleasant
necessity that must be endured before the real fun and important work begins. This perception
is wrong. Getting your data ship-shape is a respectable and in some cases vital skill. For this
reason, we use the more admirable term data carpentry.
This metaphor is not accidental. Carpentry is the process of taking rough pieces of wood and
working with care, diligence, and precision to create a finished product. A carpenter does not
hack at the wood at random. He or she will inspect the raw material and select the right tool
for the job. In the same way, data carpentry is the process of taking rough, raw, and to some
extent randomly arranged input data and creating neatly organized and tidy data. Learning the
skill of data carpentry early will yield benefits for years to come. “Give me six hours to chop
down a tree and I will spend the first four sharpening the axe,” as the saying goes.
Data processing is a critical stage in any project involving datasets from external sources (i.e.,
most real-world applications). In the same way that technical debt, discussed in Chapter 5, can
cripple your workflow, working with messy data can lead to project management hell.
Fortunately, done efficiently, at the outset of your project (rather than halfway through when it
may be too late) and using appropriate tools, this data processing stage can be highly
rewarding. More importantly, from an efficiency perspective, working with clean data will be
beneficial for every subsequent stage of your R project. So, for data-intensive applications,
this could be the most important chapter in this book. In it, we cover the following topics:
- Tidying data with tidyr
- Processing data with dplyr
- Working with databases
- Data processing with data.table
Prerequisites
This chapter relies on a number of packages for data cleaning and processing. Check that they
are installed on your computer and load them with:
library("tibble")
library("tidyr")
library("stringr")
library("readr")
library("dplyr")
library("data.table")
RSQLite and ggmap are also used in a couple of examples, though they are not central to the
chapter’s content.
Top Five Tips for Efficient Data Carpentry
1. Time spent preparing your data at the beginning can save hours of frustration in the
long run.
2. Tidy data provides a concept for organizing data, and the package tidyr provides
some functions for this work.
3. The data_frame class defined by the tibble package makes datasets efficient to print
and easy to work with.
4. dplyr provides fast and intuitive data processing functions; data.table has unmatched
speed for some data processing applications.
5. The %>% pipe operator can help clarify complex data processing workflows.
Efficient Data Frames with tibble
tibble is a package that defines a new data frame class for R, the tbl_df. These tibble diffs (as
their inventor suggests they should be pronounced) are like the base class data.frame but with
more user-friendly printing, subsetting, and factor handling.
NOTE
A tibble data frame is an S3 object with three classes, tbl_df, tbl, and data.frame. Since the object has the
data.frame tag, this means that if a tbl_df or tbl method isn’t available, the object will be passed on to the
appropriate data.frame function.
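The fallback described in the note is ordinary S3 dispatch and can be sketched without loading tibble. In the block below, describe() is a hypothetical generic invented for the illustration, and the class vector is attached by hand to a plain data frame.

```r
# Hypothetical generic with, at first, only a data.frame method
describe <- function(x, ...) UseMethod("describe")
describe.data.frame <- function(x, ...) "data.frame method"

# Mimic a tibble's class vector on a plain data frame
obj <- structure(data.frame(x = 1:3), class = c("tbl_df", "tbl", "data.frame"))

# No tbl_df or tbl method exists, so dispatch falls through to data.frame
before <- describe(obj)

# Once a more specific method is defined, it takes precedence
describe.tbl_df <- function(x, ...) "tbl_df method"
after <- describe(obj)
```

Dispatch walks the class vector from left to right and uses the first method it finds, which is why tibbles work with functions written only for data frames.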
To create a tibble data frame, we use the tibble function:
library("tibble")
tibble(x = 1:3, y = c("A", "B", "C"))
#> # A tibble: 3 × 2
#>       x     y
#>   <int> <chr>
#> 1     1     A
#> 2     2     B
#> 3     3     C
The previous example illustrates the main differences between the tibble and base R
approaches to data frames:
- When printed, the tibble diff reports the class of each variable. data.frame objects do
not.
- Character vectors are not coerced into factors when they are incorporated into a tbl_df,
as can be seen by the <chr> heading between the variable name and the second column.
By contrast, data.frame() coerces characters into factors (by default in versions of R before 4.0.0),
which can cause problems further down the line.
- When printing a tibble diff to screen, only the first 10 rows are displayed. The number
of columns printed depends on the window size.
Other differences can be found in the associated help page: help("tibble").
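One of the problems caused by silent factor coercion can be reproduced in base R. The sketch below sets stringsAsFactors explicitly so that it behaves the same way in any R version; the data values are made up for the illustration.

```r
# stringsAsFactors = TRUE reproduces the historical data.frame() default
df_fct <- data.frame(x = c("10", "20", "5"), stringsAsFactors = TRUE)
df_chr <- data.frame(x = c("10", "20", "5"), stringsAsFactors = FALSE)

# Converting a factor to numeric returns the internal level codes...
codes <- as.numeric(df_fct$x)

# ...whereas converting the character column returns the intended values
values <- as.numeric(df_chr$x)
```

Here codes is 1, 2, 3 while values is 10, 20, 5. A calculation on the factor version would therefore run without error and give the wrong answer.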
NOTE
You can create a tibble data frame row by row using the tribble() function.
Exercise
1. Create the following data frame:
df_base = data.frame(colA = "A")
Try and guess the output of the following commands:
print(df_base)
df_base$colA
df_base$col
df_base$colB
Now create a tibble data frame and repeat the preceding commands.
Tidying Data with tidyr and Regular Expressions
A key skill in data analysis is understanding the structure of datasets and being able to reshape
them. This is important from a workflow efficiency perspective: more than half of a data
analyst’s time can be spent reformatting datasets (Wickham 2014b), so getting it into a suitable
form early could save hours in the future. Converting data into a tidy form is also
advantageous from a computational efficiency perspective because it is usually faster to run
analysis and plotting commands on tidy data.
Data tidying includes data cleaning and data reshaping. Data cleaning is the process of
reformatting and labeling messy data. Packages including stringi and stringr can help update
messy character strings using regular expressions; assertive and assertr packages can
perform diagnostic checks for data integrity at the outset of a data analysis project. A
common data-cleaning task is the conversion of nonstandard text strings into date formats as
described in the lubridate vignette (see vignette("lubridate")). Tidying is a broader
concept, however, and also includes reshaping data so that it is in a form more conducive to
data analysis and modeling. The process of reshaping is illustrated by Tables 6-1 and 6-2,
provided by Hadley Wickham and loaded using the following code:
library("efficient")
data(pew) # see ?pew - dataset from the efficient package
pew[1:3, 1:4] # take a look at the data
#> # A tibble: 3 × 4
#>   religion `<$10k` `$10--20k` `$20--30k`
#>      <chr>   <int>      <int>      <int>
#> 1 Agnostic      27         34         60
#> 2  Atheist      12         27         37
#> 3 Buddhist      27         21         30
Tables 6-1 and 6-2 show a subset of the wide pew and long (tidy) pewt datasets, respectively.
They have different dimensions, but they contain precisely the same information. Column
names in the wide form in Table 6-1 became a new variable in the long form in Table 6-2.
According to the concept of tidy data, the long form is correct. Note that correct here is used
in the context of data analysis and graphical visualization. Because R is a vector-based
language, tidy data also has an efficiency advantage: it’s often faster to operate on a few long
columns than several short ones. Furthermore, the powerful and efficient packages dplyr and
ggplot2 were designed around tidy data. Wide data, however, can be space efficient, and is
common for presentation in summary tables, so it’s useful to be able to transfer between wide
(or otherwise untidy) and tidy formats.
Tidy data has the following characteristics (Wickham 2014b):
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Because there is only one observational unit in the example (religions), it can be described in
a single table. Large and complex datasets are usually represented by multiple tables, with
unique identifiers or keys to join them together (Codd 1979).
Two common operations facilitated by tidyr are gathering and splitting columns.
Make Wide Tables Long with gather()
Gathering means making wide tables long by converting column names to a new variable.
This is done with the function gather() (the inverse of which is spread()). The process is
illustrated in Tables 6-1 and 6-2. The code that performs this operation is provided in the
following code block. This converts a table with 18 rows and 10 columns into a tidy dataset
with 162 rows and 3 columns (compare the output with the output of pew, shown previously):
dim(pew)
#> [1] 18 10
pewt = gather(data = pew, key = Income, value = Count, -religion)
dim(pewt)
#> [1] 162   3
pewt[c(1:3, 50), ]
#> # A tibble: 4 × 3
#>   religion   Income Count
#>      <chr>    <chr> <int>
#> 1 Agnostic    <$10k    27
#> 2  Atheist    <$10k    12
#> 3 Buddhist    <$10k    27
#> 4 Orthodox $20--30k    23
The previous code demonstrates the three arguments that gather() requires:
1. data, a data frame in which column names will become row values.
2. key, the name of the categorical variable into which the column names in the original
datasets are converted.
3. value, the name of the column that will hold the cell values.
As with other functions in the tidyverse, all arguments are given using bare names, rather than
character strings. Arguments 2 and 3 can be specified by the user, and have no relation to the
existing data. Furthermore, an additional argument, set as -religion, was used to remove the
religion variable from the gathering, ensuring that the values in this column remain as the first
column in the output. If no -religion argument is specified, all column names are used in the
key, meaning the results simply report all 180 column/value pairs resulting from the input
dataset with 10 columns by 18 rows:
gather(pew)
#> # A tibble: 180 × 2
#>        key    value
#>      <chr>    <chr>
#> 1 religion Agnostic
#> 2 religion  Atheist
#> 3 religion Buddhist
#> 4 religion Catholic
#> # ... with 176 more rows
Table 6-1. First three rows and four columns of the aggregated Pew dataset from Wickham (2014a) in an untidy form
Religion   <$10k   $10–20k   $20–30k
Agnostic   27      34        60
Atheist    12      27        37
Buddhist   27      21        30
Table 6-2. Long form of the Pew dataset represented in the previous table showing the minimum values for annual
incomes (includes part-time work)
Religion   Income    Count
Agnostic   <$10k     27
Atheist    <$10k     12
Buddhist   <$10k     27
Agnostic   $10–20k   34
Atheist    $10–20k   27
Buddhist   $10–20k   21
Agnostic   $20–30k   60
Atheist    $20–30k   37
Buddhist   $20–30k   30
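To make the wide-to-long transformation concrete, the following sketch rebuilds the long table by hand in base R, using only the nine values shown in the two tables. It is an illustration of what gathering does, not tidyr's implementation; pew_small and pew_long are names invented here.

```r
# The wide subset of the Pew data (values copied from the table above)
pew_small <- data.frame(
  religion = c("Agnostic", "Atheist", "Buddhist"),
  `<$10k` = c(27, 12, 27),
  `$10-20k` = c(34, 27, 21),
  `$20-30k` = c(60, 37, 30),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Every column except the identifier is gathered
income_cols <- setdiff(names(pew_small), "religion")

pew_long <- data.frame(
  religion = rep(pew_small$religion, times = length(income_cols)),
  Income = rep(income_cols, each = nrow(pew_small)),  # old column names
  Count = unlist(pew_small[income_cols], use.names = FALSE),  # old cell values
  stringsAsFactors = FALSE
)
```

The 3 × 4 wide table becomes a 9 × 3 long table: the identifier column is repeated once per gathered column, the old column names become the values of Income, and the cells are stacked into Count.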
Split Joint Variables with separate()
Splitting means taking a variable that is really two variables combined and creating two
separate columns from it. A classic example is age-sex variables (e.g., m0-10 and f0-10 to
represent males and females in the 0 to 10 age band). Splitting such variables can be done with
the separate() function, as illustrated in Tables 6-3 and 6-4 and in the following code chunk.
See ?separate for more information on this function.
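agesex = c("m0-10", "f0-10") # create compound variable
n = c(3, 5) # create a value for each observation
agesex_df = tibble(agesex, n) # create a data frame
# sep = 1 splits after the first character; the default separator would split at "-"
separate(agesex_df, col = agesex, into = c("sex", "age"), sep = 1)
#> # A tibble: 2 × 3
#>     sex   age     n
#> * <chr> <chr> <dbl>
#> 1     m  0-10     3
#> 2     f  0-10     5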
Table 6-3. Joined age and sex variables in one column
agesex   n
m0-10    3
f0-10    5
Table 6-4. Age and sex variables separated by the function separate
sex   age    n
m     0-10   3
f     0-10   5
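For comparison, the same split can be sketched in base R by cutting each string after its first character. The object name agesex_split is an assumption made for this example.

```r
agesex <- c("m0-10", "f0-10")
n <- c(3, 5)

agesex_split <- data.frame(
  sex = substr(agesex, 1, 1),   # first character only
  age = substring(agesex, 2),   # everything from the second character on
  n = n,
  stringsAsFactors = FALSE
)
```

This fixed-position approach only works because the sex code is always exactly one character long; a regular expression would be needed for codes of varying length.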
Other tidyr Functions
There are other tidying operations that tidyr can perform, as described in the package’s
vignette (vignette("tidy-data")). The wider issue of manipulation is a large topic with
major potential implications for efficiency (Spector 2008) and this section only covers some
of the key operations. More important is understanding the principles behind converting
messy data into standard output forms.
These same principles can also be applied to the representation of model results. The broom
package provides a standard output format for model results, easing interpretation (see the
broom vignette). The function broom::tidy() can be applied to a wide range of model objects
and returns the model’s output as a standardized data frame.
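The idea of a model summary held in a plain data frame can be approximated in base R. The sketch below is not how broom works internally and does not reproduce its column names; it only shows a coefficient table converted to a data frame, using the built-in cars dataset.

```r
# Fit a simple linear model on a dataset that ships with R
fit <- lm(dist ~ speed, data = cars)

# coef(summary()) returns a matrix; convert it to a data frame
coef_table <- as.data.frame(coef(summary(fit)))

# Move the term names from the row names into an ordinary column
coef_table$term <- rownames(coef_table)
rownames(coef_table) <- NULL
```

With one row per model term and one column per statistic, the result can be filtered, joined, and plotted like any other tidy dataset.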
Usually, it is more efficient to use the nonstandard evaluation version of variable names, as
these can be autocompleted by RStudio. In some cases, you may want to use standard
evaluation and refer to variable names using quotation marks. To do this, _ can be added to
dplyr and tidyr function names to allow the use of standard evaluation. Thus the standard
evaluation version of separate(agesex_df, agesex, c("sex", "age"), 1) is
separate_(agesex_df, "agesex", c("sex", "age"), 1).
Regular Expressions
Regular expressions (commonly known as regex) form a language for describing and
manipulating text strings. There are books on the subject, and several good tutorials on regex
in R, such as Handling and Processing Strings in R by Gaston Sanchez (Trowchez Editions),
so we’ll just scratch the surface of the topic, and provide a taste of what is possible. Regex is a
deep topic. However, knowing the basics can save a huge amount of time from a data-tidying
perspective, by automating the cleaning of messy strings.
In this section, we teach both stringr and base R ways of doing pattern matching. The former
provides easy-to-remember function names and consistency. The latter is useful to know as
you’ll find lots of base R regex code in other people’s code because stringr is relatively new
and not installed by default. The foundational regex operation is to detect whether a particular
text string exists in an element, which is done with grepl() and str_detect() in base R and
stringr, respectively:
library("stringr")
x = c("Hi I'm Robin.", "DoB 1985")
grepl(pattern = "9", x = x)
#> [1] FALSE  TRUE
str_detect(string = x, pattern = "9")
#> [1] FALSE  TRUE
NOTE
stringr does not include a direct replacement for grep(). You can use which(str_detect()) instead.
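The relationship between grep() and a logical detector wrapped in which() can be checked in base R, as in this short sketch (which(grepl()) plays the role that which(str_detect()) plays in stringr):

```r
x <- c("Hi I'm Robin.", "DoB 1985")

# grep() returns the positions of matching elements...
idx_grep <- grep(pattern = "9", x = x)

# ...which is what which() gives when applied to the logical result of grepl()
idx_which <- which(grepl(pattern = "9", x = x))

# value = TRUE returns the matching elements themselves
matches <- grep(pattern = "9", x = x, value = TRUE)
```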
Notice that str_detect() begins with str_. This is a common feature of all stringr functions.
This can be efficient because if you want to do some regex work, you just need to type str_
and then press the Tab key to see a list of all the options. The various base R regex function
names, by contrast, are hard to remember, including regmatches(), strsplit(), and gsub().
The stringr equivalents have more intuitive names that relate to the intention of the functions:
str_match_all(), str_split(), and str_replace_all(), respectively.
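For readers who meet the base R functions in other people's code, the following sketch shows each of the three in action on a small example vector. The patterns are chosen here purely for illustration.

```r
x <- c("Hi I'm Robin.", "DoB 1985")

# regmatches() extracts the text located by gregexpr(): all runs of digits
digits <- regmatches(x, gregexpr("[0-9]+", x))

# strsplit() breaks each string into pieces at the separator
words <- strsplit(x, split = " ")

# gsub() replaces every match, here deleting all digits
no_digits <- gsub("[0-9]", "", x)
```

Both regmatches() and strsplit() return a list with one element per input string, because the number of matches or pieces can differ from string to string.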
There is more to say on the topic, but rather than repeat what has been said elsewhere, we feel
it is more efficient to direct the interested reader toward existing excellent resources for
learning regex in R. We recommend reading, in order:
- The Strings chapter of R for Data Science by Grolemund and Wickham (O’Reilly)
- The stringr vignette (vignette("stringr"))
- The detailed tutorial on regex in base R (Sanchez 2013)
- For more advanced topics, reading the documentation and online articles about the
stringi package, on which stringr depends
Exercises
1. What are the three criteria of tidy data?
2. Load and look at subsets of these datasets. The first is the Pew datasets we’ve been
using already. The second reports the points that define, roughly, the geographical
boundaries of different London boroughs. What is untidy about each?
head(pew, 10)
#> # A tibble: 10 × 10
#>   religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75
#>      <chr> <int>   <int>   <int>   <int>   <int>  <int>
#> 1 Agnostic    27      34      60      81      76    137
#> 2  Atheist    12      27      37      52      35     70
#> 3 Buddhist    27      21      30      34      33     58
#> 4 Catholic   418     617     732     670     638   1116
#> # ... with 6 more rows, and 3 more variables: `$75--100k` <int>,
#> #   `$100--150k` <int>, `>150k` <int>
data(lnd_geo_df)
head(lnd_geo_df, 10)
#>                    name_date population      x      y
#> 1               Bromley-2001     295535 544362 172379
#> 2               Bromley-2001     295535 549546 169911
#> 3               Bromley-2001     295535 539596 160796
#> 4               Bromley-2001     295535 533693 170730
#> 5               Bromley-2001     295535 533718 170814
#> 6               Bromley-2001     295535 534004 171442
#> 7               Bromley-2001     295535 541105 173356
#> 8               Bromley-2001     295535 544362 172379
#> 9  Richmond upon Thames-2001     172330 523605 176321
#> 10 Richmond upon Thames-2001     172330 521455 172362
3. Convert each of the preceding datasets into tidy form.
4. Consider the following string of phone numbers and fruits from “Stringr: Modern,
Consistent String Processing” by Hadley Wickham (The R Journal):
strings = c("2197338965", "329-293-8753", "banana", "5957947569",
  "3872876718", "apple", "233.398.9187",
  "4829523315", "2399238115", "8425664692",
  "Work:579-499-7527", "$1000", "Home:543.355.3679")
Write functions in stringr and base R that return:
- A logical vector reporting whether or not each string contains a number
- Complete words only, without extraneous nonletter characters
str_detect(string = strings, pattern = "[0-9]")
#>  [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
#> [12]  TRUE  TRUE
# [A-Za-z] matches letters only; the range [A-z] would also match [, \, ], ^, _ and `
str_extract(strings, pattern = "[A-Za-z]+")
#> [1] NA       NA       "banana" NA       NA       "apple"  NA
#> [8] NA       NA       NA       "Work"   NA       "Home"
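One detail worth checking when writing letter classes is the difference between the ranges A-z and A-Za-z. The sketch below shows the difference and gives one possible base R counterpart to the two stringr calls; the four test strings (including "snake_case") are invented for the example, and perl = TRUE is used so that ranges are interpreted by character code.

```r
# The range A-z spans the punctuation that sits between "Z" and "a" in ASCII
underscore_loose <- grepl("[A-z]", "_", perl = TRUE)     # TRUE
underscore_strict <- grepl("[A-Za-z]", "_", perl = TRUE) # FALSE

strings <- c("banana", "$1000", "Work:579-499-7527", "snake_case")

# Does each string contain a digit?
has_number <- grepl("[0-9]", strings)

# First run of letters in each string, NA where there is none
m <- regexpr("[A-Za-z]+", strings, perl = TRUE)
first_word <- rep(NA_character_, length(strings))
first_word[m > 0] <- regmatches(strings, m)
```

regmatches() silently drops non-matching elements, which is why the result is written into a pre-filled vector of NA values to keep it the same length as the input.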
Efficient Data Processing with dplyr
After tidying your data, the next stage is typically data processing. This includes the creation
of new data, such as a new column that is some function of existing columns, or data analysis,
the process of asking directed questions of the data and exporting the results in a
user-readable form.
Following the advice in “Package Selection”, we have carefully selected an appropriate
package for these tasks: dplyr, which roughly means data frame pliers. dplyr has a number of
advantages over base R and data.table approaches to data processing:
- dplyr is fast to run (due to its C++ backend) and intuitive to type.
- dplyr works well with tidy data, as described previously.
- dplyr works well with databases, providing efficiency gains on large datasets.
Furthermore, dplyr is efficient to learn (see Chapter 10). It has a small number of intuitively
named functions, or verbs. These were partly inspired by SQL, one of the longest established
languages for data analysis, which combines multiple simple functions (such as SELECT and
WHERE, roughly analogous to dplyr::select() and dplyr::filter()) to create powerful
analysis workflows. Likewise, dplyr functions were designed to be used together to solve a
wide range of data processing challenges (see Table 6-5).
Table 6-5. dplyr verb functions
dplyr function(s)   Description                                                Base R functions
filter(), slice()   Subset rows by attribute (filter) or position (slice)      subset(), [
arrange()           Return data ordered by variable(s)                         order()
select()            Subset columns                                             subset(), [, [[
rename()            Rename columns                                             colnames()
distinct()          Return unique rows                                         !duplicated()
mutate()            Create new variables (transmute drops existing variables)  transform(), [[
summarize()         Collapse data into a single row                            aggregate(), tapply()
sample_n()          Return a sample of the data                                sample()
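The base R column of the table can be made concrete with a few one-liners. This sketch uses the built-in mtcars dataset, which is an assumption made for illustration (the chapter itself works with other data), and the conversion factor to kilometers per liter is approximate.

```r
cars_sub <- subset(mtcars, cyl == 4)              # rows by attribute, like filter()
cars_pos <- mtcars[1:5, ]                         # rows by position, like slice()
cars_ord <- mtcars[order(mtcars$mpg), ]           # ordered rows, like arrange()
cars_sel <- mtcars[, c("mpg", "cyl")]             # subset of columns, like select()
cars_unq <- mtcars[!duplicated(mtcars$cyl), ]     # first row per cyl value
cars_new <- transform(mtcars, kpl = mpg * 0.425)  # new variable, like mutate()

# One row per group, like group_by() followed by summarize()
cars_agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```

Each line uses a different calling convention ($, [, formula, bare names), which is the inconsistency that dplyr's verbs were designed to remove.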
Unlike the base R analogues, dplyr’s data processing functions work in a consistent way. Each
function takes a data frame object as its first argument and creates another data frame.
Variables can be called directly without using the $ operator. dplyr was designed to be used
with the pipe operator %>% provided by the magrittr package, allowing each data processing
stage to be represented as a new line. This is illustrated in the following code chunk, which
loads a tidy country-level dataset of greenhouse gas emissions from the efficient package,
and then identifies the countries with the greatest absolute growth in emissions from 1971 to
2012:
library("dplyr")
data("ghg_ems", package = "efficient")
top_table =
  ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>%
  group_by(Country) %>%
  summarize(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))
The results, illustrated in Table 6-6, show that the US has the highest growth and average
emissions from the transport sector, followed closely by China in terms of growth. The aim of this
code chunk is not for you to read and understand every line; it is to provide a taster of dplyr’s unique
syntax, which is described in more detail throughout the rest of this section.
Table 6-6. The top three countries in terms of average CO2 emissions from transport since 1971, and growth in
transport emissions over that period (MTCO2e/yr)
Country         Mean   Growth
United States   1462   709
China           214    656
India           85     170
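The Growth column above is computed as diff(range(x)), that is, the maximum minus the minimum of each country's series. The sketch below, on invented numbers for two hypothetical countries, shows that this equals the change from first to last year only when emissions rise throughout the period.

```r
# Hypothetical emissions data (values invented for illustration only)
toy <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year = rep(c(1971, 1990, 2012), times = 2),
  Transportation = c(10, 30, 25, 5, 6, 9)
)

# Maximum minus minimum per country, as in the chained example
growth_range <- tapply(toy$Transportation, toy$Country,
                       function(x) diff(range(x)))

# Last value minus first value per country (rows are already in year order)
growth_first_last <- tapply(toy$Transportation, toy$Country,
                            function(x) x[length(x)] - x[1])
```

Country A peaks in 1990, so the two measures differ (20 versus 15); country B rises steadily, so both give 4.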
Building on the learning by doing ethic, the remainder of this section works through these
functions to process and begin to analyze a dataset on economic equality provided by the
World Bank. The input dataset can be loaded as follows:
# Load global inequality data
data(wb_ineq)
dplyr is a large package and can be seen as a language in its own right. Following the walk
before you run principle, we’ll start simply, by filtering and aggregating rows.
Renaming Columns
Renaming data columns is a common task that can make writing code faster by using short,
intuitive names. The dplyr function rename() makes this easy.
Note that in the following code block the variable name is surrounded by back-quotes (`). This allows R
to refer to column names that are nonstandard. Note also the syntax: rename() takes the data
frame as the first argument and then renames variables by specifying new_variable_name =
original_name.
library("dplyr")
wb_ineq = rename(wb_ineq, code = `Country Code`)
To rename multiple columns, the variable names are simply separated by commas. The base
R and dplyr way of doing this is illustrated in an older version of the dataset (not run) to show
how long, clunky, and inefficient names can be converted into short and lean ones.
# The dplyr way (rename two variables)
wb_ineq = rename(wb_ineq,
  top10 = `Income share held by highest 10% [SI.DST.10TH.10]`,
  bot10 = `Income share held by lowest 10% [SI.DST.FRST.10]`)
# The base R way (rename five variables)
names(wb_ineq)[5:9] = c("top10", "bot10", "gini", "b40_cons", "gdp_percap")
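Renaming by position, as in the base R line above, breaks silently if the column order changes. A slightly more defensive base R variant matches on the old name instead. The sketch below uses a small hypothetical data frame whose values are invented.

```r
# Hypothetical data frame with a nonstandard column name
df <- data.frame(`Country Code` = c("AUS", "NZL"), gini = c(33.3, 36.2),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Rename by matching the existing name rather than by column position
names(df)[names(df) == "Country Code"] <- "code"
```

If the old name is not present, the comparison selects nothing and the data frame is left unchanged, so a check such as stopifnot("code" %in% names(df)) afterward is a cheap safeguard.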
Changing Column Classes
The class of R objects is critical to performance. If a class is incorrectly specified (e.g., if
numbers are treated as factors or characters), this will lead to incorrect results. The class of
all columns in a data frame can be queried using the function str() (short for display the
structure of an object) or, as illustrated in the following code, by applying class() to each
column with vapply(), here with the inequality data loaded previously (footnote 1).
vapply(wb_ineq, class, character(1))
#>     Country        code        Year   Year Code       top10       bot10
#> "character" "character"   "integer" "character"   "numeric"   "numeric"
#>        gini    b40_cons  gdp_percap
#>   "numeric"   "numeric"   "numeric"
This shows that although we loaded the data correctly, all columns are seen by R as
characters. This means we cannot perform numerical calculations on the dataset:
mean(wb_ineq$gini) fails.
Visual inspection of the data (e.g., via View(wb_ineq)) clearly shows that all columns except
for 1 to 4 (Country, Country Code, Year, and Year Code) should be numeric. We can reassign
the classes of the numeric variables one by one:
wb_ineq$gini = as.numeric(wb_ineq$gini)
mean(wb_ineq$gini, na.rm = TRUE) # now the mean is calculated
#> [1] 40.5
However, the purpose of programming languages is to automate tasks and reduce typing. The
following code chunk reclassifies all of the numeric variables using data.matrix(), which
converts the data frame to a numeric matrix:
cols_to_change = 5:9 # column ids to change
wb_ineq[cols_to_change] = data.matrix(wb_ineq[cols_to_change])
vapply(wb_ineq, class, character(1))
#>     Country        code        Year   Year Code       top10       bot10
#> "character" "character"   "integer" "character"   "numeric"   "numeric"
#>        gini    b40_cons  gdp_percap
#>   "numeric"   "numeric"   "numeric"
As is so often the case with R, there are many ways to solve the problem. The following code
is a one-liner using unlist(), which converts list objects into vectors:
wb_ineq[cols_to_change] = as.numeric(unlist(wb_ineq[cols_to_change]))
Another one-liner to achieve the same result uses dplyr’s mutate_each function:
wb_ineq = mutate_each(wb_ineq, "as.numeric", cols_to_change)
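A further base R variant, sketched here on a small hypothetical data frame rather than on wb_ineq, converts each selected column in turn with lapply(). The column names and values are invented for the illustration.

```r
# Hypothetical data frame whose numeric columns were read in as character
df <- data.frame(country = c("A", "B"),
                 gini = c("40.5", "33.1"),
                 top10 = c("30", "25.5"),
                 stringsAsFactors = FALSE)

cols_to_change <- c("gini", "top10")

# lapply() returns one converted vector per column, which is assigned back
df[cols_to_change] <- lapply(df[cols_to_change], as.numeric)

classes <- vapply(df, class, character(1))
mean_gini <- mean(df$gini)
```

Because each column is converted separately, this approach keeps the columns apart at every step and works the same way whether they are selected by name or by position.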
As with other operations, there are other ways of achieving the same result in R, including the
Filtering Rows
dplyr offers an alternative way of filtering data, using filter().
# Base R: wb_ineq[wb_ineq$Country == "Australia", ]
aus2 = filter(wb_ineq, Country == "Australia")
filter() is slightly more flexible than [: filter(wb_ineq, code == "AUS", Year ==
1974) works as well as filter(wb_ineq, code == "AUS" & Year == 1974), and takes any
number of conditions (see ?filter). filter() is slightly faster than base R (footnote 2). By avoiding the
$ symbol, dplyr makes subsetting code concise and consistent with other dplyr functions. The
first argument is a data frame and subsequent raw variable names can be treated as vector
objects, which is a defining feature of dplyr. In the next section, we’ll learn how this syntax
can be used alongside the %>% pipe command to write clear data manipulation commands.
There are dplyr equivalents of many base R functions, but these usually work slightly
differently. The dplyr equivalent of aggregate, for example, is to use the grouping function
group_by in combination with the general-purpose function summarize (not to be confused
with summary in base R), as we shall see in “Data Aggregation”.
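One practical difference between [ and the filtering functions concerns missing values. The sketch below shows it in base R on a small hypothetical data frame, with subset() standing in for filter(); dplyr's filter() is documented to drop rows where the condition evaluates to NA in the same way.

```r
# Hypothetical data with a missing country code
df <- data.frame(code = c("AUS", "AUS", NA, "NZL"),
                 Year = c(1974, 1975, 1974, 1974),
                 stringsAsFactors = FALSE)

# With [, a condition that evaluates to NA produces a row of NAs
by_bracket <- df[df$code == "AUS" & df$Year == 1974, ]

# subset() keeps only the rows where the condition is TRUE
by_subset <- subset(df, code == "AUS" & Year == 1974)
```

Here by_bracket has two rows (the genuine match plus an all-NA row), whereas by_subset has one.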
Chaining Operations
Another interesting feature of dplyr is its ability to chain operations together. This overcomes
one of the aesthetic issues with R code: you can end up with very long commands with many
functions nested inside one another to answer relatively simple questions. Combined with the
group_by() function, pipes can help condense thousands of lines of data into something
human-readable. Here’s how you could use the chains to summarize average Gini indexes per
decade, for example:
wb_ineq %>%
  select(Year, gini) %>%
  mutate(decade = floor(Year / 10) * 10) %>%
  group_by(decade) %>%
  summarize(mean(gini, na.rm = TRUE))
#> # A tibble: 6 × 2
#>   decade `mean(gini, na.rm = TRUE)`
#>    <dbl>                      <dbl>
#> 1   1970                       40.1
#> 2   1980                       37.8
#> 3   1990                       42.0
#> 4   2000                       40.5
#> # ... with 2 more rows
Often the best way to learn is to try and break something, so try running the preceding
commands with different dplyr verbs. By way of explanation, this is what happened:
1. Only the columns Year and gini were selected, using select().
2. A new variable, decade, was created (e.g., 1989 becomes 1980).
3. This new variable was used to group rows in the data frame with the same decade.
4. The mean value per decade was calculated, illustrating how average income
inequality was greatest in the 1990s but has since decreased slightly.
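The four steps above can be traced in base R on a handful of invented values. This is a sketch of the logic only; toy and decade_means are hypothetical names and the Gini numbers are made up.

```r
# Hypothetical Gini values, invented for illustration
toy <- data.frame(Year = c(1979, 1980, 1989, 1990, 1999),
                  gini = c(40, 38, 36, NA, 42))

# Integer-divide by 10 and scale back up: 1989 becomes 1980
toy$decade <- floor(toy$Year / 10) * 10

# Mean per decade; the formula interface drops rows with missing values
decade_means <- aggregate(gini ~ decade, data = toy, FUN = mean)
```

The result has one row per decade (1970, 1980, 1990) with means 40, 37 and 42; the missing 1990 value is left out of the 1990s mean rather than turning it into NA.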
Let’s ask another question to see how dplyr chaining workflow can be used to answer
questions interactively: what are the five most unequal years for countries containing the letter
g? Here’s how chains can help organize the analysis needed to answer this question step by
step:
wb_ineq %>%
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarize(gini = mean(gini, na.rm = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)
#> Selecting by gini
#> # A tibble: 5 × 2
#>    Year  gini
#>   <int> <dbl>
#> 1  1980  46.9
#> 2  1993  46.0
#> 3  2013  44.5
#> 4  1981  43.6
#> # ... with 1 more rows
The preceding chain consists of six lines: the data frame, followed by five stages, each of
which corresponds to a new line and dplyr function:
1. Filter the data to keep the countries we’re interested in (any selection criteria could be used in
place of grepl("g", Country)).
2. Group the output by year.
3. Summarize, for each year, the mean Gini index.
4. Arrange the results by average Gini index.
5. Select only the top five most unequal years.
To see why this method is preferable to the nested function approach, take a look at the latter.
Even after indenting properly, it looks terrible and is almost impossible to understand!
top_n(
  arrange(
    summarize(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm = TRUE)),
    desc(gini)),
  n = 5)
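The equivalence between nesting and chaining can be seen with a toy operator. In the sketch below, %then% is a hypothetical operator defined only for this illustration; it is a simplified stand-in and not how magrittr implements %>%.

```r
# Hypothetical operator: pass the left-hand side to the function on the right
`%then%` <- function(lhs, fun) fun(lhs)

# Nested form: read from the inside out
nested <- sqrt(sum(c(4, 9, 3)))

# Chained form: read from left to right, in the order the steps happen
chained <- c(4, 9, 3) %then% sum %then% sqrt
```

Both forms compute the same value. The chained version simply rewrites f(g(x)) so that the data comes first and each step follows in the order it is applied.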
This section has provided only a taste of what is possible with dplyr and why it makes sense
from code-writing and computational-efficiency perspectives. For a more detailed account of
data processing with R using this approach, we recommend R for Data Science by Grolemund
and Wickham (O’Reilly).
Exercises
1. Try running each of the preceding chaining examples line by line, so the first two
entries for the first example look like this:
wb_ineq
#> # A tibble: 6,925 × 9
#>       Country  code  Year `Year Code` top10 bot10  gini b40_cons
#>         <chr> <chr> <int>       <chr> <dbl> <dbl> <dbl>    <dbl>
#> 1 Afghanistan   AFG  1974      YR1974    NA    NA    NA       NA
#> 2 Afghanistan   AFG  1975      YR1975    NA    NA    NA       NA
#> 3 Afghanistan   AFG  1976      YR1976    NA    NA    NA       NA
#> 4 Afghanistan   AFG  1977      YR1977    NA    NA    NA       NA
#> # ... with 6,921 more rows, and 1 more variables: gdp_percap <dbl>
followed by:
wb_ineq %>%
  select(Year, gini)
#> # A tibble: 6,925 × 2
#>    Year  gini
#>   <int> <dbl>
#> 1  1974    NA
#> 2  1975    NA
#> 3  1976    NA
#> 4  1977    NA
#> # ... with 6,921 more rows
Explain in your own words what changes each time.
2. Use chained dplyr functions to answer the following question: in which year did
countries without an a in their name have the lowest level of inequality?
Data Aggregation
Data aggregation involves creating summaries of data based on a grouping variable, in a
process that has been referred to as split-apply-combine. The end result usually has the same
number of rows as there are groups. Because aggregation is a way of condensing datasets, it
can be a very useful technique for making sense of large datasets. The following code finds
the number of unique countries (country being the grouping variable) from the ghg_ems
dataset stored in the efficient package:
# Package available from github.com/csgillespie/efficient
data(ghg_ems, package = "efficient")
names(ghg_ems)
#> [1] "Country"        "Year"           "Electricity"    "Manufacturing"
#> [5] "Transportation" "Other"          "Fugitive"
nrow(ghg_ems)
#> [1] 7896
length(unique(ghg_ems$Country))
#> [1] 188
Note that while there are almost 8,000 rows, there are fewer than 200 countries. Thus factors
would have been a more space-efficient way of storing the country data.
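The space saving can be checked with object.size(). The sketch below does not use the ghg_ems data; it simulates a character vector of the same length with at most 188 distinct values, and the country labels are placeholders.

```r
# Simulated stand-in for the Country column: 7,896 values, up to 188 unique
set.seed(42)
countries <- sample(paste("Country", 1:188), size = 7896, replace = TRUE)

size_chr <- object.size(countries)          # stored as character
size_fct <- object.size(factor(countries))  # stored as integer codes plus levels
```

A factor stores one integer per element plus a single copy of each level, so the saving grows with the ratio of rows to distinct values.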
To aggregate the dataset using dplyr, you divide the task into two parts: group the dataset first
and then summarize, as illustrated next (footnote 3).
library("dplyr")
group_by(ghg_ems, Country) %>%
  summarize(mean_eco2 = mean(Electricity, na.rm = TRUE))
#> # A tibble: 188 × 2
#>       Country mean_eco2
#>         <chr>     <dbl>
#> 1 Afghanistan       NaN
#> 2     Albania     0.641
#> 3     Algeria    23.015
#> 4      Angola     0.791
#> # ... with 184 more rows
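The split-apply-combine steps that group_by() and summarize() perform can be written out explicitly in base R. The following sketch uses invented values for three hypothetical countries, one of which has no recorded data at all.

```r
# Hypothetical emissions data; country C has only missing values
toy <- data.frame(Country = c("A", "A", "B", "B", "C"),
                  Electricity = c(1, 3, 10, NA, NA))

# Split: one numeric vector per country
groups <- split(toy$Electricity, toy$Country)

# Apply and combine: one mean per group, returned as a named vector
means <- vapply(groups, mean, numeric(1), na.rm = TRUE)
```

Country C illustrates where the NaN for Afghanistan in the output above comes from: once the missing values are removed there is nothing left to average, and the mean of an empty vector is NaN.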
NOTE
The previous example relates to wider programming: how much work should one function do? The work could
have been done with a single aggregate() call. However, the Unix philosophy states that programs should “do
one thing well,” which is how dplyr’s functions were designed. Shorter functions are easier to understand and
debug. But having too many functions can also make your call stack confusing.
To reinforce the point, this operation is also performed in the following code on the wb_ineq
dataset:
data(wb_ineq, package = "efficient")
countries = group_by(wb_ineq, Country)
summarize(countries, gini = mean(gini, na.rm = TRUE))
#> # A tibble: 176 × 2
#>       Country  gini
#>         <chr> <dbl>
#> 1 Afghanistan   NaN
#> 2     Albania  30.4
#> 3     Algeria  37.8
#> 4      Angola  50.6
#> # ... with 172 more rows
Note that summarize is highly versatile, and can be used to return a customized range of
summary statistics:
summarize(countries,
  # number of rows per country
  obs = n(),
  med_t10 = median(top10, na.rm = TRUE),
  # standard deviation
  sdev = sd(gini, na.rm = TRUE),
  # number with gini > 30
  n30 = sum(gini > 30, na.rm = TRUE),
  sdn30 = sd(gini[gini > 30], na.rm = TRUE),
  # range
  dif = max(gini, na.rm = TRUE) - min(gini, na.rm = TRUE)
)
#> # A tibble: 176 × 7
#>       Country   obs med_t10  sdev   n30  sdn30   dif
#>         <chr> <int>   <dbl> <dbl> <int>  <dbl> <dbl>
#> 1 Afghanistan    40      NA   NaN     0     NA    NA
#> 2     Albania    40    24.4  1.25     3  0.364  2.78
#> 3     Algeria    40    29.8  3.44     2  3.437  4.86
#> 4      Angola    40    38.6 11.30     2 11.300 15.98
#> # ... with 172 more rows
To showcase the power of summarize() used on a grouped_df, the previous code reports a
wide range of customized summary statistics per country:
- The number of rows in each country group
- Standard deviation of Gini indices
- Median proportion of income earned by the top 10%
- The number of years in which the Gini index was greater than 30
- The standard deviation of Gini index values over 30
- The range of Gini index values reported for each country
Exercises
1. Referbacktothegreenhousegasemissionsexampleattheoutsetofsection
“EfficientDataProcessingwithdplyr”,inwhichwefoundthetopthreecountriesin
termsofemissionsgrowthinthetransportsector.
a. Explaininwordswhatisgoingonineachline.
b. Trytofindthetopthreecountriesintermsofemissionsin2012—howisthe
listdifferent?
Nonstandard Evaluation
The final thing to say about dplyr does not relate to the data but to the syntax of the functions.
Note that many of the arguments in the code examples in this section are provided as raw
names; they are raw variable names not surrounded by quotation marks (e.g., Country rather
than "Country"). This is called nonstandard evaluation (NSE) (see vignette("nse")). NSE
was used deliberately, with the aim of making the functions more efficient for interactive use.
NSE reduces typing and allows autocompletion in RStudio.
This is fine when using R interactively. But when you’d like to use R noninteractively, code is
generally more robust using standard evaluation because it minimizes the chance of creating
obscure scope-related bugs. Using standard evaluation also avoids having to declare global
variables if you include the code in a package. For this reason, most functions in tidyr and
dplyr have two versions: one that uses NSE (the default) and another that uses standard
evaluation and requires the variable names to be provided in quotation marks. The standard
evaluation versions of functions are denoted with the affix _. This is illustrated in the
following code with the group_by() and summarize() functions:
#1:DefaultNSEfunction
group_by(cars,cut(speed,c(0,10,100)))%>%summarize(mean(dist))
#>#Atibble:2×2
#>`cut(speed,c(0,10,100))``mean(dist)`
#><fctr><dbl>
#>1(0,10]15.8
#>2(10,100]49.0
#2:Standardevaluationusingquotemarks
group_by_(cars,"cut(speed,c(0,10,100))")%>%summarize_("mean(dist)")
#>#Atibble:2×2
#>`cut(speed,c(0,10,100))``mean(dist)`
#><fctr><dbl>
#>1(0,10]15.8
#>2(10,100]49.0
#3:Standardevaluationusingformula,tildenotation
#(recommendedstandardevaluationmethod)
group_by_(cars,~cut(speed,c(0,10,100)))%>%summarize_(~mean(dist))
#>#Atibble:2×2
#>`cut(speed,c(0,10,100))``mean(dist)`
#><fctr><dbl>
#>1(0,10]15.8
#>2(10,100]49.0
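The same distinction exists in base R, which may make the trade-off easier to see. In this sketch (using the built-in cars data; the variable name col is our own), subset() uses NSE, while [[ takes the column name as an ordinary string:

```r
# NSE: `speed` is looked up inside `cars`, convenient when typing interactively
fast_nse = subset(cars, speed > 20)

# Standard evaluation: the column name is an ordinary string, so it can be
# stored in a variable or passed as a function argument
col = "speed"
fast_se = cars[cars[[col]] > 20, ]

# Both approaches select the same rows
all.equal(fast_nse, fast_se)
```

The second form is more verbose, but nothing in it depends on how names are resolved inside the call, which is the property that matters in noninteractive code.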
Combining Datasets
Theusefulnessofadatasetcansometimesbegreatlyenhancedbycombiningitwithother
data.Ifwecouldmergetheglobalghg_emsdatasetwithgeographicdata,forexample,we
couldvisualizethespatialdistributionofclimatepollution.Forthepurposesofthissection,
wejoinghg_emstotheworlddataprovidedbyggmaptoillustratetheconceptsandmethods
ofdatajoining(alsoreferredtoasmerging).
library("ggmap")
world=map_data("world")
names(world)
#>[1]"long""lat""group""order""region""subregion"
Visuallycomparethisnewdatasetoftheworldwithghg_ems(e.g.,viaView(world);
View(ghg_ems)).Itisclearthatthecolumnregionintheformercontainsthesame
informationasCountryinthelatter.Thiswillbethejoiningvariable;renamingitinworld
willmakethejoinmoreefficient.
world=rename(world,Country=region)
ghg_ems$All=rowSums(ghg_ems[3:7])
TIP
Ensurethatbothjoiningvariableshavethesameclass(combiningcharacterandfactorcolumnscancause
havoc).
How large is the overlap between ghg_ems$Country and world$Country? We can find out
using the %in% operator, which tests whether each element of one vector is present in
another vector. Specifically, we will find out how many unique country names from ghg_ems
are present in the world dataset:
unique_countries_ghg_ems=unique(ghg_ems$Country)
unique_countries_world=unique(world$Country)
matched=unique_countries_ghg_ems%in%unique_countries_world
table(matched)
#>matched
#>FALSETRUE
#>20168
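To make the behavior of %in% concrete, here is a minimal sketch on two made-up vectors (not the real datasets). It returns one logical value per element of the left-hand vector, so table() or sum() can count the matches:

```r
ghg_names = c("Albania", "Gambia, The", "United States", "Angola")   # toy data
world_names = c("Albania", "Angola", "Gambia", "USA")                 # toy data

matched_toy = ghg_names %in% world_names
matched_toy
#> [1]  TRUE FALSE FALSE  TRUE

sum(matched_toy)          # number of names with an exact match
#> [1] 2
ghg_names[!matched_toy]   # names that still need attention
#> [1] "Gambia, The"   "United States"
```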
This comparison exercise has been fruitful: most of the countries in the ghg_ems dataset exist in the
world dataset. But what about the 20 country names that do not match? We can identify these as
follows:
(unmatched_countries_ghg_ems=unique_countries_ghg_ems[!matched])
#>[1]"Antigua&Barbuda""Bahamas,The"
#>[3]"Bosnia&Herzegovina""Congo,Dem.Rep."
#>[5]"Congo,Rep.""Coted'Ivoire"
#>[7]"EuropeanUnion(15)""EuropeanUnion(28)"
#>[9]"Gambia,The""Korea,Dem.Rep.(North)"
#>[11]"Korea,Rep.(South)""Macedonia,FYR"
#>[13]"RussianFederation""SaintKitts&Nevis"
#>[15]"SaintVincent&Grenadines""SaoTome&Principe"
#>[17]"Trinidad&Tobago""UnitedKingdom"
#>[19]"UnitedStates""World"
Itisclearfromtheoutputthatsomeofthenonmatches(e.g.,theEuropeanUnion)arenot
countriesatall.However,otherssuchasGambiaandtheUnitedStatesclearlyshouldhave
matches.Fuzzymatchingcanhelpfindwhichcountriesdomatch,asillustratedbythefirst
nonmatchingcountryhere:
(unmatched_country=unmatched_countries_ghg_ems[1])
#>[1]"Antigua&Barbuda"
unmatched_world_selection=agrep(pattern=unmatched_country,
unique_countries_world,max.distance=10)
unmatched_world_countries=unique_countries_world[unmatched_world_selection]
Whatjusthappened?Weverifiedthatthefirstunmatchingcountryintheghg_emsdatasetwas
notintheworldcountrynames.Soweusedthemorepowerfulagreptosearchforfuzzy
matches(withthemax.distanceargumentsetto10).Theresultsshowthatthecountry
Antigua&Barbudafromtheghg_emsdatamatchestwocountriesintheworlddataset.Wecan
updatethenamesinthedatasetwearejoiningtoaccordingly:
world$Country[world$Country%in%unmatched_world_countries]=
unmatched_countries_ghg_ems[1]
Thiscodereducesthenumberofcountrynamesintheworlddatasetbyreplacingboth
“Antigua”and“Barbuda”with“Antigua&Barbuda”.Thiswouldnotworktheotherway
around:howwouldoneknowwhethertochange“Antigua&Barbuda”to“Antigua”orto
“Barbuda”?
Thusfuzzymatchingisstillalaboriousprocessthatmustbecomplementedbyhuman
judgment.IttakesahumantoknowforsurethatUnitedStatesisrepresentedasUSAinthe
worlddataset,withoutriskingfalsematchesviaagrep.
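One way to record such human judgments reproducibly is a hand-written lookup table. The following is only a sketch on toy data: the vector world_names stands in for world$Country, and the lookup entries are the matches discussed in this section.

```r
# Toy stand-in for the country names found in the world dataset
world_names = c("USA", "France", "Antigua", "Barbuda")

# Hand-checked recodings: names = spelling in world, values = spelling in ghg_ems
lookup = c("USA" = "United States",
           "Antigua" = "Antigua & Barbuda",
           "Barbuda" = "Antigua & Barbuda")

# Replace only the names that appear in the lookup table
to_fix = world_names %in% names(lookup)
world_names[to_fix] = unname(lookup[world_names[to_fix]])
world_names
```

Keeping the table in the script documents each decision, so a false match can be traced and corrected later.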
Working with Databases
InsteadofloadingallthedataintoRAM,asRdoes,databasesquerydatafromtheharddisk.
ThiscanallowasubsetofaverylargedatasettobedefinedandreadintoRquickly,without
havingtoloaditfirst.Rcanconnecttodatabasesinanumberofways,whicharebriefly
touchedonbelow.Thesubjectofdatabasesisalargeareaundergoingrapidevolution.Rather
thanaimingatcomprehensivecoverage,wewillprovidepointerstodevelopmentsthatenable
efficientaccesstoawiderangeofdatabasetypes.Anup-to-datehistoryofR’sinterfacesto
databasescanbefoundintheREADMEoftheDBIpackage,whichprovidesacommon
interfaceandsetofclassesfordriverpackages(suchasRSQLite).
RODBCisaveteranpackageforqueryingexternaldatabasesfromwithinR,usingtheOpen
DatabaseConnectivity(ODBC)API.ThefunctionalityofRODBCisdescribedinthe
package’svignette(seevignette("RODBC")),andtodayitsmainuseistoprovideanR
interfacetoSQLServerdatabases,whichlackaDBIinterface.
The DBI package is a unified framework for accessing databases that allows for other drivers
to be added as modular packages. Thus new packages that build on DBI (RMySQL,
RPostgreSQL, and RSQLite) can be seen partly as replacements for RODBC (see
vignette("backend") for more on how DBI drivers work). Because the DBI syntax applies to
a wide range of database types, we use it here with a worked example.
Imagineyouhaveaccesstoadatabasethatcontainstheghg_emsdataset.
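# Connect to a database driver
library("RSQLite")
# ghg_db is assumed to hold the path to the SQLite database file
con = dbConnect(SQLite(), dbname = ghg_db) # Other drivers also take username & password arguments
dbListTables(con)
rs = dbSendQuery(con, "SELECT * FROM `ghg_ems` WHERE (`Country` != 'World')")
df_head = dbFetch(rs, n = 6) # extract first 6 rows
dbClearResult(rs) # free the result set once the rows have been fetched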
The preceding code chunk shows how the function dbConnect connects to an external database
(in this case, an SQLite database). For client-server databases such as MySQL, username and
password arguments are also used to establish the connection. Next, we query which tables are
available with dbListTables, query the database (without yet extracting the results to R) with
dbSendQuery, and, finally, load the results into R with dbFetch.
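If no such database is at hand, the same sequence of calls can be tried on a temporary in-memory SQLite database. This is a sketch rather than the setup used in this chapter: the table contents are made up, and it assumes the DBI and RSQLite packages are installed.

```r
library("DBI")

# ":memory:" creates a temporary SQLite database that vanishes on disconnect
con_demo = dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Made-up rows standing in for the ghg_ems table
toy_ems = data.frame(Country = c("Albania", "Angola", "World"),
                     Electricity = c(0.6, 0.8, 12000))
dbWriteTable(con_demo, "ghg_ems", toy_ems)
dbListTables(con_demo)

# Send the query, fetch (at most) six rows, then free the result set
rs_demo = dbSendQuery(con_demo,
                      "SELECT * FROM ghg_ems WHERE Country != 'World'")
df_demo = dbFetch(rs_demo, n = 6)
dbClearResult(rs_demo)

dbDisconnect(con_demo)
df_demo
```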
TIP
Besurenevertoreleaseyourpasswordbyenteringitdirectlyintothecommand.Instead,werecommendsaving
sensitiveinformationsuchasdatabasepasswordsandAPIkeysin.Renviron,describedinChapter2.Assuming
youhadsavedyourpasswordastheenvironmentvariablePSWRD,youcouldenterpwd=Sys.getenv("PSWRD")
tominimizetheriskofexposingyourpasswordthroughaccidentallyreleasingthecodeoryoursessionhistory.
RecentlytherehasbeenashifttothenoSQLapproachtostoringlargedatasets.Thisis
illustratedbytheemergenceanduptakeofsoftwaresuchasMongoDBandApacheCassandra
thathaveRinterfacesviapackagesmongoliteandRJDBC,whichcanconnecttoApache
CassandradatastoresandanysourcecompliantwiththeJavaDatabaseConnectivity(JDBC)
API.
MonetDBisarecentalternativetorelationalandnoSQLapproachesthatofferssubstantial
efficiencyadvantagesforhandlinglargedatasets(Kerstenetal.2011).Atutorialonthe
MonetDBwebsiteprovidesanexcellentintroductiontohandlingdatabasesfromwithinR.
Therearemanywiderconsiderationsinrelationtodatabasesthatwewillnotcoverhere:who
willmanageandmaintainthedatabase?Howwillitbebackeduplocally(localcopiesshould
bestoredtoreducerelianceonthenetwork)?Whatistheappropriatedatabaseforyour
project?Theseissuescanhavemajoreffectsonefficiency,especiallyonlarge,data-intensive
projects.However,wewillnotcoverthemherebecauseitisafast-movingfield.Instead,we
directtheinterestedreadertowardresourcesonthesubject,including:
Thewebsiteforsparklyr,arecentlycreatedpackageforefficientlyinterfacingwiththe
ApacheSparkstack.
db-engines.com/en/,awebsitecomparingtherelativemeritsofdifferentdatabases.
Thedatabasesvignettefromthedplyrpackage.
GettingstartedwithMongoDBinR,anintroductoryvignetteonnonrelationaldatabases
andmapreducefromthemongolitepackage.
Databases and dplyr
ToaccessadatabaseinRviadplyr,youmustuseoneofthesrc_*()functionstocreatea
source.ContinuingwiththeSQLiteexamplepreviouslygiven,youwouldcreateatblobject
thatcanbequeriedbydplyrasfollows:
library("dplyr")
ghg_db=src_sqlite(ghg_db)
ghg_tbl=tbl(ghg_db,"ghg_ems")
Theghg_tblobjectcanthenbequeriedinasimilarwayasastandarddataframe.For
example,supposewewishedtofilterbyCountry.Thenweusethefilter()functionas
before:
rm_world=ghg_tbl%>%
filter(Country!="World")
Inthiscode,dplyrhasactuallygeneratedthenecessarySQLcommand,whichcanbe
examinedusingexplain(rm_world).Whenworkingwithdatabases,dplyruseslazy
evaluation:thedataisonlyfetchedatthelastmomentwhenit’sneeded.TheSQLcommand
associatedwithrm_worldhasn’tyetbeenexecuted;thisiswhytail(rm_world)doesn’twork.
Byusinglazyevaluation,dplyrismoreefficientathandlinglargedatastructuresbecauseit
avoidsunnecessarycopying.WhenyouwantyourSQLcommandtobeexecuted,use
collect(rm_world).
ThefinalstagewhenworkingwithdatabasesinRistodisconnect:
dbDisconnect(conn=con)
Exercises
FollowtheworkedexampleheretocreateandqueryadatabaseonlandpricesintheUK
usingdplyrasafrontendtoanSQLitedatabase.Thefirststageistoreadinthedata:
#Seehelp("land_df",package="efficient")fordetails
data(land_df,package="efficient")
ThenextstageistocreateanSQLitedatabasetoholdthedata:
#install.packages("RSQLite")#RequiresRSQLitepackage
my_db=src_sqlite("land.sqlite3",create=TRUE)
land_sqlite=copy_to(my_db,land_df,indexes=list("postcode","price"))
1. Whatclassisthenewobjectland_sqlite?
2. Whydidweusetheindexesargument?
Fromtheprecedingcode,wecanseethatwehavecreatedatbl.Thiscanbeaccessed
usingdplyrinthesamewayasanydataframecan.Nowwecanquerythedata.You
canuseSQLcodetoquerythedatabasedirectlyorusestandarddplyrverbsonthe
table.
#Method1:usingsql
tbl(my_db,sql('SELECT"price","postcode","old/new"FROMland_df'))
#>Source:query[??x3]
#>Database:sqlite3.8.6[land.sqlite3]
#>
#>pricepostcode`old/new`
#><int><chr><chr>
#>184000CW95EUN
#>2123500TR138JHN
#>3217950PL339DLN
#>4147000EX395XTN
#>#...withmorerows
3. Howwouldyouperformthesamequeryusingselect()?Tryittoseeifyougetthe
sameresult(hint:usebackticksfortheold/newvariablename).
#>Source:query[??x3]
#>Database:sqlite3.8.6[land.sqlite3]
#>
#>pricepostcode`old/new`
#><int><chr><chr>
#>184000CW95EUN
#>2123500TR138JHN
#>3217950PL339DLN
#>4147000EX395XTN
#>#...withmorerows
Data Processing with data.table
data.tableisamaturepackageforfastdataprocessingthatpresentsanalternativetodplyr.
Thereissomecontroversyaboutwhichismoreappropriatefordifferenttasks.4Whichis
moreefficienttosomeextentdependsonpersonalpreferencesandwhatyouareusedto.Both
arepowerfulandefficientpackagesthattaketimetolearn,soitisbesttolearnoneandstick
withit,ratherthanhavethedualityofusingtwoforsimilarpurposes.Therearesituationsin
whichoneworksbetterthananother:dplyrprovidesamoreconsistentandflexibleinterface
(e.g.,withitsinterfacetodatabases,demonstratedintheprevioussection),soformost
purposeswerecommendlearningdplyrfirstifyouarenewtobothpackages.dplyrcanalso
beusedtoworkwiththedata.tableclassusedbythedata.tablepackagesoyoucangetthe
bestofbothworlds.
However, data.table is faster than dplyr for some operations, offers some functionality
unavailable in other packages, and has a mature and advanced user community. data.table
supports rolling joins, which allow rows in one table to be selected based on proximity
between shared variables (typically time), and non-equi joins, where join criteria can be
inequalities rather than equalities.
Thissectionprovidesafewexamplestoillustratehowdata.tableisuniqueand(attheriskof
inflamingthedebatefurther)somebenchmarksyoucanusetoexplorewhichismore
efficient.Asemphasizedthroughoutthebook,efficientcodewritingisoftenmoreimportant
thanefficientexecutiononmanyeverydaytasks,sotosomeextentit’samatterofpreference.
The foundational object class of data.table is the data.table. Like dplyr’s tbl_df,
data.table’s data.table objects behave in the same way as the base data.frame class.
However, the data.table paradigm has some unique features that make it highly
computationally efficient for many common tasks in data analysis. Building on subsetting
methods using [ and filter(), mentioned previously, we’ll see data.table’s unique
approach to subsetting. Like base R, data.table uses square brackets but (unlike base R but
like dplyr) uses nonstandard evaluation, so you need not refer to the object name inside the
brackets:
library("data.table")
data(wb_ineq_renamed)#fromtheefficientpackage
wb_ineq_dt=data.table(wb_ineq_renamed)#converttodata.tableclass
aus3a=wb_ineq_dt[Country=="Australia"]
NOTE
Notethatthesquarebracketsdonotneedacommatorefertorowswithdata.tableobjects;inbaseR,you
wouldwritewb_ineq_renamed[wb_ineq_renamed$Country=="Australia",].
To boost performance, you can set keys, analogous to primary keys in databases. These are
supercharged row names that order the table based on one or more variables. This allows a
binary search algorithm to subset the rows of interest, which is much, much faster than the
vector scan approach used in base R (see vignette("datatable-keys-fast-subset")).
data.table uses the key values for subsetting by default so the variable does not need to be
mentioned again. Instead, using keys, the search criteria are provided as a list (invoked in the
following code chunk with the concise .() syntax, which is synonymous with list()).
setkey(wb_ineq_dt,Country)
aus3b=wb_ineq_dt[.("Australia")]
Theresultisthesame,sowhyaddtheextrastageofsettingthekey?Thereasonisthatthis
one-offsortingoperationcanleadtosubstantialperformancegainsinsituationswhere
repeatedlysubsettingrowsonlargedatasetsconsumesalargeproportionofcomputational
timeinyourworkflow.ThisisillustratedinFigure6-1,whichcomparesfourmethodsof
subsettingincrementallylargerversionsofthewb_ineqdataset.
Figure 6-1 demonstrates that data.table is much faster than base R and dplyr at subsetting. As
with the external packages used to read in data (see “Plain-Text Formats”), the relative
benefits of data.table improve with dataset size, approaching a ~70-fold improvement on
base R and a ~50-fold improvement on dplyr as the dataset size reaches half a gigabyte.
Interestingly, even the nonkey implementation of the data.table subset method is faster than
the alternatives. This is because data.table creates a key internally by default before
subsetting. The process of creating the key accounts for the ~10-fold speed-up in cases where
the key has been pregenerated.
This section has introduced data.table as a complementary approach to base and dplyr
methods for data processing. It offers performance gains due to its implementation in C and
the use of keys for subsetting tables. data.table offers much more, however, including highly
efficient data reshaping, dataset merging (also known as joining, as with left_join() in
dplyr), and grouping. For further information on data.table, we recommend reading the
package’s datatable-intro, datatable-reshape, and datatable-reference-semantics
vignettes.
Figure6-1.Benchmarkillustratingtheperformancegainstobeexpectedfordifferentdatasetsizes
References
Wickham, Hadley. 2014b. “Tidy Data.” The Journal of Statistical Software 14 (5).
Codd, E. F. 1979. “Extending the database relational model to capture more meaning.” ACM
Transactions on Database Systems 4 (4): 397–434. doi:10.1145/320107.320109.
Spector, Phil. 2008. Data Manipulation with R. Springer Science & Business Media.
Sanchez, Gaston. 2013. “Handling and Processing Strings in R.” Trowchez Editions.
http://bit.ly/handlingstringsR.
Grolemund, G., and H. Wickham. 2016. R for Data Science. O’Reilly Media.
Wickham, Hadley. 2010. “Stringr: Modern, Consistent String Processing.” The R Journal 2
(2): 38–40.
Kersten, Martin L, Stratos Idreos, Stefan Manegold, Erietta Liarou, and others. 2011. “The
Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few
Seconds.” PVLDB Challenges and Visions 3.
1 str(wb_ineq) is another way to see the contents of an object, but produces more verbose output.
2 Note that filter is also the name of a function used in the base stats library. Typically, packages avoid using names
already taken in base R, but this is an exception.
3 The equivalent code in base R is e_ems = aggregate(ghg_ems$Electricity, list(ghg_ems$Country), mean, na.rm =
TRUE, data = ghg_ems); nrow(ghg_ems).
4 One question on the Stack Overflow website titled “data.table vs dplyr” illustrates this controversy and delves into the
philosophy underlying each approach.
Chapter 7. Efficient Optimization
DonaldKnuthisalegendaryAmericancomputerscientistwhodevelopedanumberofthekey
algorithmsthatweusetoday(see,forexample,?Random).Onthesubjectofoptimization,he
gavethisadvice:
Therealproblemisthatprogrammershavespentfartoomuchtimeworryingabout
efficiencyinthewrongplacesandatthewrongtimes;prematureoptimizationistherootof
allevil(oratleastmostofit)inprogramming.
Knuth’spointisthatitiseasytoundertakecodeoptimizationinefficiently.Whendeveloping
code,thecausesofinefficienciesmayshiftsothatwhatoriginallycausedslownessatthe
beginningofyourworkmaynotberelevantatalaterstage.Thismeansthattimespent
optimizingcodeearlyinthedevelopmentalstagecouldbewasted.Evenworse,thereisa
trade-offbetweencodespeedandcodereadability;we’vealreadymadethistrade-offonceby
usingreadable(butslow)RcomparedwithverboseCcode!
Forthisreason,thischapterispartofthelatterhalfofthebook.Thepreviouschapters
deliberatelyfocusedonconcepts,packages,andfunctionstoincreaseefficiency.Theseare
(relatively)easywaysofsavingtimethat,onceimplemented,willworkforfutureprojects.
Codeoptimization,bycontrast,isanadvancedtopicthatshouldonlybetackledoncelow
hangingfruitforefficiencygainshavebeentaken.
Inthischapterweassumethatyoualreadyhavewell-developedcodethatismature
conceptuallyandhasbeentriedandtested.Nowyouwanttooptimizethiscode,butnot
prematurely.Thechapterisorganizedasfollows.First,webeginwithgeneralhintsandtips
aboutoptimizingbaseRcode.Codeprofilingcanidentifykeybottlenecksinthecodeinneed
ofoptimization,andthisiscoveredinthenextsection.“ParallelComputing”discusseshow
parallelcodecanovercomeefficiencybottlenecksforsomeproblems.Thefinalsection
explainshowRcppcanbeusedtoefficientlyincorporateC++codeintoanRanalysis.
Prerequisites
Inthischapter,someoftheexamplesrequireaworkingC++compiler.Theinstallation
methoddependsonyouroperatingsystem:
Linux
Acompilershouldalreadybeinstalled.Otherwise,installr-baseandacompilerwillbe
installedasadependency.
Mac
InstallXcode.
Windows
InstallRtools.MakesureyouselecttheversionthatcorrespondstoyourversionofR.
Thepackagesusedinthischapterare:
library("microbenchmark")
library("ggplot2movies")
library("profvis")
library("Rcpp")
Top Five Tips for Efficient Optimization
1. Before you start to optimize your code, ensure that you know where the bottleneck
lies; use a code profiler.
2. If the data in your data frame is all of the same type, consider converting it to a
matrix for a speed boost.
3. Use specialized row and column functions whenever possible.
4. The parallel package is ideal for Monte Carlo simulations.
5. For optimal performance, consider rewriting key parts of your code in C++.
Code Profiling
Oftenyouwillhaveworkingcode,butsimplywantittorunfaster.Insomecases,it’sobvious
wherethebottlenecklies.Sometimesyouwillguess,relyingonintuition.Adrawbackofthis
isthatyoucouldbewrongandwastetimeoptimizingthewrongpieceofcode.Tomakeslow
coderunfaster,itisimportanttofirstdeterminewheretheslowcodelives.Thisisthe
purposeofcodeprofiling.
The Rprof() function is a built-in tool for profiling the execution of R expressions. At
regular time intervals, the profiler stops the R interpreter, records the current function call
stack, and saves the information to a file. The results from Rprof() are stochastic. Each time
we run a function in R, the conditions have changed. Hence, each time you profile your code, the
result will be slightly different.
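For reference, a bare-bones Rprof() session looks roughly like the following sketch. The workload (repeatedly sorting random numbers for about half a second) is invented for illustration, and the timings reported will differ on every run.

```r
prof_file = tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)   # start sampling the call stack
start = Sys.time()
while (difftime(Sys.time(), start, units = "secs") < 0.5) {
  x = sort(runif(1e5))              # made-up workload to profile
}
Rprof(NULL)                         # stop profiling

# Summarize the samples: time attributed to each function
prof_summary = summaryRprof(prof_file)
head(prof_summary$by.self)
```

The by.self table lists the functions in which samples were recorded, which is the raw information that graphical profilers present visually.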
Unfortunately,Rprof()isnotuser-friendly.Forthisreason,werecommendusingtheprofvis
packageforprofilingyourRcode.profvisprovidesaninteractivegraphicalinterfacefor
visualizingcode-profilingdatafromRprof().
Getting Started with profvis
After installing profvis (e.g., with install.packages("profvis")), it can be used to profile R
code. As a simple example, we will use the movies dataset, which contains information on
about 60,000 movies. First, we’ll select movies that are classed as comedies, then plot the year
the movie was made versus the movie rating and draw a local polynomial regression line to
pick out the trend. The main function from the profvis package is profvis(), which profiles
the code and creates an interactive HTML page of the results. The first argument of
profvis() is the R expression of interest. This can be many lines long:
library("profvis")
profvis({
data(movies,package="ggplot2movies")#Loaddata
movies=movies[movies$Comedy==1,]
plot(movies$year,movies$rating)
model=loess(rating~year,data=movies)#loessregressionline
j=order(movies$year)
lines(movies$year[j],model$fitted[j])#Addlinetotheplot
})
The previous code produces an interactive HTML page (see Figure 7-1). On the left side is the
code and on the right is a flame graph (the horizontal direction is time in milliseconds and the
vertical direction is the call stack).
Figure7-1.Outputfromprofvis
The left-hand panel gives the amount of time spent on each line of code. It shows that the
majority of time is spent calculating the loess() smoothing line. The bottom line of the right
panel also highlights that most of the execution time is spent on the loess() function.
Traveling up the call stack, we see that loess() calls simpleLoess(), which in turn calls the
.C() function.
Theconclusionfromthisgraphisthatifoptimizationwererequired,itwouldmakesenseto
focusontheloess()andpossiblytheorder()functioncalls.
Example: Monopoly Simulation
MonopolyisaboardgamethatoriginatedintheUnitedStatesover100yearsago.The
objectiveofthegameistogoaroundtheboardandpurchasesquares(properties).Ifother
playerslandonyourproperties,theyhavetopayatax.Theplayerwiththemostmoneyatthe
endofthegamewins.Tomakethingsmoreinteresting,thereareChanceandCommunity
Chestsquares.Ifyoulandononeofthesesquares,youdrawacard,whichmaysendyouto
otherpartsoftheboard.TheotherspecialsquareisJail.OnewayofenteringJailistoroll
threesuccessivedoubles.
TheefficientpackagecontainsaMonteCarlofunctionforsimulatingasimplifiedgameof
monopoly.Bykeepingtrackofwhereapersonlandswhengoingaroundtheboard,weobtain
anestimateoftheprobabilityoflandingonacertainsquare.Theentirecodeisaround100
lineslong.Inorderforprofvistofullyprofilethecode,theefficientpackageneedstobe
installedfromsource:
devtools::install_github("csgillespie/efficient",args="--with-keep.source")
Thefunctioncanthenbeprofiledviathefollowingcode,whichresultsinFigure7-2.
library("efficient")
profvis(simulate_monopoly(10000))
Figure7-2.CodeprofilingforsimulatingthegameofMonopoly
Theoutputfromprofvisshowsthatthevastmajorityoftime(around65%)isspentinthe
move_square()function.
InMonopoly,movingaroundtheboardiscomplicatedbythefactthatrollingadouble(apair
of1s,2s,…,6s)isspecial:
Rolltwodice(total1):total_score=total1.
Ifyougetadouble,rollagain(total2)andtotal_score=total1+total2.
Ifyougetadouble,rollagain(total3)andtotal_score=total1+total2+total3.
Ifrollthreeisadouble,gotoJail;otherwise,movetotal_score.
Thefunctionmove_square()capturesthislogic.Nowthatweknowwherethecodeisslow,
howcanwespeedupthecomputation?Inthenextsection,wewilldiscussstandardtechniques
thatcanbeused.Wewillthenrevisitthisexample.
Efficient Base R
InR,thereisoftenmorethanonewaytosolveaproblem.Inthissection,wehighlight
standardtricksoralternativemethodsthatmayimproveperformance.
The if() Versus ifelse() Functions
ifelse()isavectorizedversionofthestandardcontrol-flowfunctionif(test)if_yes
elseif_nothatworksasfollows:
ifelse(test,if_yes,if_no)
Intheprecedingimaginaryexample,thereturnvalueisfilledwithelementsfromtheif_yes
andif_noargumentsthataredeterminedbywhethertheelementoftestisTRUEorFALSE.For
example,supposewehaveavectorofexammarks.ifelse()couldbeusedtoclassifythem
aspassorfail:
marks=c(25,55,75)
ifelse(marks>=40,"pass","fail")
#>[1]"fail""pass""pass"
Ifthelengthofthetestconditionisequalto1(i.e.,length(test)==1),thenthestandard
conditionalstatement
mark=25
if(mark>=40){
"pass"
}else{
"fail"
}
isaroundfivetotentimesfasterthanifelse(mark>=40,"pass","fail").
Anadditionalquirkofifelse()isthatalthoughitismoreprogrammerefficient,asitismore
conciseandunderstandablethanmultilinealternatives,itisoftenlesscomputationally
efficientthanamoreverbosealternative.Thisisillustratedwiththefollowingbenchmark,in
whichthesecondoptionrunsabout20timesfaster,despitetheresultsbeingidentical:
marks=runif(n=10e6,min=30,max=99)
system.time({
result1=ifelse(marks>=40,"pass","fail")
})
#>usersystemelapsed
#>4.2930.3514.667
system.time({
result2=rep("fail",length(marks))
result2[marks>=40]="pass"
})
#>usersystemelapsed
#>0.1920.0520.244
identical(result1,result2)
#>[1]TRUE
ThereistalkontheR-develemaillistofspeedingupifelse()inbaseR.Asimplesolutionis
tousetheif_else()functionfromdplyr,although,asdiscussedinthesamethread,itcannot
replaceifelse()inallsituations.Forourexamresulttestexample,if_else()worksfine
andismuchfasterthanbaseR’simplementation(althoughitisstillaroundthreetimesslower
thanthehardcodedsolution):
system.time({
result3=dplyr::if_else(marks>=40,"pass","fail")
})
#>usersystemelapsed
#>1.0650.1881.253
identical(result1,result3)
#>[1]TRUE
Sorting and Ordering
Sorting a vector is relatively quick; sorting a vector of length 10^7 takes around 0.01 seconds.
If you only sort a vector once at the top of a script, then don’t worry too much about this.
However, if you are sorting inside a loop or in a Shiny application, then it can be worthwhile
thinking about how to optimize this operation.
Therearecurrentlythreesortingalgorithms,c("shell","quick","radix"),thatcanbe
specifiedinthesort()function,withradixbeinganewadditiontoR3.3.Typically,the
radix(thenondefaultoption)isthemostcomputationallyefficientoptionformostsituations
(itisaround20%fasterwhensortingalargevectorofdoubles).
Anotherusefultrickistopartiallyordertheresults.Forexample,ifyouonlywanttodisplay
thetop10results,thenusethepartialargument(i.e.,sort(x,partial=1:10)).Forvery
largevectors,thiscangiveathree-foldspeedincrease.
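Both options are easy to try. In the sketch below (the vector x is random test data), the radix method returns the same result as the default, and a partial sort places the ten smallest values in their final positions without fully sorting the rest:

```r
set.seed(1)
x = runif(1e5)

# Same result, different algorithm
identical(sort(x), sort(x, method = "radix"))

# Partial sorting: positions 1 to 10 hold what a full sort would put there;
# the remaining elements are left in no particular order
top10 = sort(x, partial = 1:10)[1:10]
identical(top10, sort(x)[1:10])
```

Note that the partial argument cannot be combined with decreasing = TRUE, so to obtain the largest values this way you would partially sort on the last positions instead.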
Reversing Elements
The rev() function provides a reversed version of its argument. If you wish to sort in
decreasing order, sort(x, decreasing = TRUE) is marginally (around 10%) faster than
rev(sort(x)).
Which Indices are TRUE?
TodeterminewhichindexofavectororarrayisTRUE,wewouldtypicallyusethewhich()
function.Ifwewanttofindtheindexofjusttheminimumormaximumvalue(i.e.,which(x
==min(x))),thenusingtheefficientwhich.min()/which.max()variantscanbeordersof
magnitudefaster(seeFigure7-3).
Figure7-3.Comparisonofwhich.min()withwhich()
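The two approaches can be compared on a small vector. Note one difference in this sketch: which.min() returns only the first position of the minimum, whereas which(x == min(x)) returns every position that ties for it.

```r
x = c(5, 2, 9, 2, 7)

which(x == min(x))   # scans x twice and returns all tied positions
#> [1] 2 4
which.min(x)         # single pass, first position only
#> [1] 2
which.max(x)
#> [1] 3
```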
Converting Factors to Numerics
Afactorisjustavectorofintegerswithassociatedlevels.Occasionally,wewanttoconverta
factorintoitsnumericalequivalent.Themostefficientwayofdoingthis(especiallyforlong
factors)is:
as.numeric(levels(f))[f]
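A short sketch shows why the conversion has to go through the levels: calling as.numeric() directly on a factor returns the internal integer codes rather than the values the labels represent. The factor f here is made up for illustration.

```r
f = factor(c("10", "20", "10", "5"))
levels(f)                   # levels are sorted as strings
#> [1] "10" "20" "5"

as.numeric(f)               # internal codes, usually not what is wanted
#> [1] 1 2 1 3
as.numeric(levels(f))[f]    # convert each level once, then index by code
#> [1] 10 20 10  5
```

Because only the levels are converted, the amount of string-to-number parsing depends on the number of distinct values rather than on the length of the factor.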
Logical AND and OR
The logical AND (&) and OR (|) operators are vectorized functions and are typically used
during multicriteria subsetting operations. The following code, for example, returns TRUE for
all elements of x less than 0.4 or greater than 0.6:
x<0.4|x>0.6
#>[1]TRUEFALSETRUE
When R executes this comparison, it will always calculate x > 0.6 regardless of the value of
x < 0.4. In contrast, the nonvectorized versions, && and ||, only evaluate the second component if
needed. This is efficient and leads to neater code:
#Weonlycalculatethemeanifdatadoesn'tcontainNAs
if(!anyNA(x)&&mean(x)>0){
#Dosomething
}
comparedto
if(!anyNA(x)){
if(mean(x)>0){
#dosomething
}
}
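However, care must be taken not to use && or || on vectors. In versions of R before 4.3.0, they
only evaluate the first element of the vector, giving the incorrect answer; from R 4.3.0 onward,
an operand of length greater than one is an error. The older behavior is illustrated here: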
x<0.4||x>0.6
#>[1]TRUE
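When a vector-valued comparison has to feed a single if() decision, one option is to reduce it to length one with any() or all() first. The sketch below also demonstrates the short-circuit behavior of &&: the right-hand side is never evaluated when the left-hand side is already FALSE. The vector x is made-up data.

```r
x = c(0.1, 0.5, 0.9)

# Vectorized test, then collapsed to a single TRUE/FALSE
outside = x < 0.4 | x > 0.6
any(outside)
#> [1] TRUE
all(outside)
#> [1] FALSE

# Short-circuit: stop() is never reached because the left side is FALSE
FALSE && stop("this is not evaluated")
#> [1] FALSE
```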
Row and Column Operations
Indataanalysis,weoftenwanttoapplyafunctiontoeachcolumnorrowofadataset.For
example,wemightwanttocalculatethecolumnorrowsums.Theapply()functionmakes
thistypeofoperationstraightforward.
#Secondargument:1->rows.2->columns
apply(data_set,1,function_name)
Thereareoptimizedfunctionsforcalculatingrowandcolumnsums/means(rowSums(),
colSums(),rowMeans(),andcolMeans())thatshouldbeusedwheneverpossible.Thepackage
matrixStatscontainsmanyoptimizedrow/columnfunctions.
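As a quick check that the specialized functions are drop-in replacements, the following sketch compares them with apply() on a random matrix (made-up data) and times both with system.time(); the timings will vary by machine and are therefore not shown.

```r
set.seed(1)
m = matrix(runif(1e6), nrow = 1000)

# Same answers...
all.equal(apply(m, 1, sum), rowSums(m))
all.equal(apply(m, 2, mean), colMeans(m))

# ...but the specialized version avoids calling an R function once per row
system.time(for (i in 1:20) apply(m, 1, sum))
system.time(for (i in 1:20) rowSums(m))
```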
is.na()andanyNA()
Totestwhetheravector(orotherobject)containsmissingvalues,weusetheis.na()
function.Oftenweareinterestedinwhetheravectorcontainsanymissingvalues.Inthiscase,
anyNA(x)ismoreefficientthanany(is.na(x)).
Matrices
A matrix is similar to a data frame: it is a two-dimensional object, and subsetting and other
functions work in the same way. However, all matrix elements must have the same type.
Matrices tend to be used during statistical calculations. The lm() function, for example,
internally converts the data to a matrix before calculating the results; any characters are thus
recoded as numeric dummy variables.
Matrices are generally faster than data frames. For example, the datasets ex_mat and ex_df
from the efficient package each have 1,000 rows and 100 columns and contain the same
random numbers. However, selecting rows from the data frame is about 150 times slower
than selecting them from the matrix, as illustrated here:
data(ex_mat,ex_df,package="efficient")
microbenchmark(times=100,unit="ms",ex_mat[1,],ex_df[1,])
#>Unit:milliseconds
#>exprminlqmeanmedianuqmaxneval
#>ex_mat[1,]0.002520.003680.05650.005310.005935.08100
#>ex_df[1,]0.770580.874061.08940.967711.100456.36100
TIP
Usethedata.matrix()functiontoefficientlyconvertadataframeintoamatrix.
The integer data type
Numbers in R are usually stored in double-precision floating-point format, which is
described in detail in A First Course in Statistical Programming with R (Braun and Murdoch
2007) and “What Every Computer Scientist Should Know About Floating-Point Arithmetic”
(Goldberg). The term double refers to the fact that on 32-bit systems (for which the format
was developed) two memory locations are used to store a single number. Each
double-precision number is accurate to about 16 significant decimal digits.
NOTE
Whencomparingfloating-pointnumbers,weshouldbeparticularlycarefulbecausey=sqrt(2)*sqrt(2)isnot
exactly2—it’salmost2.Usingsprintf("%.17f",y)willgiveyouthetruevalueofy(to17decimalplaces).
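In practice this means that equality tests on doubles should allow for a small tolerance. A minimal sketch:

```r
y = sqrt(2) * sqrt(2)

y == 2                    # exact comparison fails
#> [1] FALSE
isTRUE(all.equal(y, 2))   # comparison with a small tolerance succeeds
#> [1] TRUE
abs(y - 2) < 1e-8         # the same idea written out by hand
#> [1] TRUE
```

The tolerance of 1e-8 in the last line is an arbitrary choice for this example; an appropriate value depends on the scale of the numbers being compared.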
Integersareanothernumericdatatype.IntegersprimarilyexisttobepassedtoCorFortran
code.Youwillnotneedtocreateintegersformostapplications.However,theyare
occasionallyusedtooptimizesubsettingoperations.Whenwesubsetadataframeormatrix,
weareinteractingwithCcodeandmightbetemptedtouseintegerswiththepurposeof
speedingupourcode.Forexample,ifwelookattheargumentsfortheheadfunction
args(head.matrix)
#>function(x,n=6L,...)
#>NULL
TIP
Usingthe:operatorautomaticallycreatesavectorofintegers.
weseethatthedefaultargumentfornis6Lratherthansimply6(theLisshortforliteraland
isusedtocreateaninteger).Thisgivesatinyspeedboost(around0.1microseconds!).
x=runif(10)
microbenchmark(head(x,6.0),head(x,6L),times=1000000)
#Unit:microseconds
#exprminlqmeanmedianuqmaxnevalcld
#head(x,6)7.0678.3099.0588.6869.0981052661e+06a
#head(x,6L)6.9478.2198.9338.5949.0071063071e+06a
Becausethisfunctionisubiquitous,thislow-leveloptimizationisuseful.Ingeneral,ifyouare
worriedaboutshavingmicrosecondsoffyourRcoderuntime,youshouldprobablyconsider
switchingtoanotherlanguage.
Integersaremorespace-efficient.Thefollowingcodecomparesthesizeofanintegervector
tothatofastandardnumericvector:
pryr::object_size(1:10000)
#>40kB
pryr::object_size(seq(1,10000,by=1.0))
#>80kB
Theresultsshowthattheintegerversionisroughlyhalfthesize.However,mostmathematical
operationswillconverttheintegervectorintoastandardnumericalvector,asillustratedin
thefollowingcodechunk:
is.integer(1L+1)
#>[1]FALSE
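Which operations keep the integer type can be checked with typeof(). In this sketch, arithmetic between two integers stays integer, but mixing in a double, or dividing, produces a double:

```r
typeof(1L + 1L)   # integer + integer
#> [1] "integer"
typeof(1L + 1)    # integer + double
#> [1] "double"
typeof(2L * 3L)
#> [1] "integer"
typeof(4L / 2L)   # division always returns a double
#> [1] "double"
typeof(1:3)       # the : operator creates integers
#> [1] "integer"
```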
Furtherstoragesavingscanbeobtainedusingthebitpackage.
Sparse matrices
Another data structure that can be stored efficiently is a sparse matrix. This is simply a matrix
where most of the elements are zero. Conversely, if most elements are nonzero, the matrix is
considered dense. The proportion of zero elements is called the sparsity. Large, sparse
matrices often crop up when performing numerical calculations. Typically, our data isn’t
sparse, but the resulting data structures we create may be sparse. There are a number of
techniques/methods used to store sparse matrices. Methods for creating sparse matrices can
be found in the Matrix package.1
Asanexample,supposewehavealargematrixinwhichthediagonalelementsarenonzero:
library("Matrix")
N=10000
sp=sparseMatrix(1:N,1:N,x=1)
m=diag(1,N,N)
Bothobjectscontainthesameinformation,butthedataisstoreddifferently.Becausewehave
thesamevaluemultipletimesinthematrix,weonlyneedtostorethevalueonceandlinkitto
multiplematrixlocations.Thematrixobjectstoreseachindividualelement,whereasthe
sparsematrixobjectonlystoresthelocationofthenonzeroelements.Thisismuchmore
memory-efficient,asillustratedinthefollowingcode:
pryr::object_size(sp)
#>161kB
pryr::object_size(m)
#>800MB
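To see why storing only the nonzero locations saves so much space, the idea can be sketched in base R with a plain triplet table. This is a hypothetical illustration of the principle, not a description of how the Matrix package stores its objects; the names dense, triplets, and rebuilt are made up for the example, and a smaller N keeps it quick to run:

```r
# Hypothetical illustration of the idea, not how the Matrix package stores data
N = 1000
dense = diag(1, N, N)                 # stores all N * N entries

# Triplet form: row index, column index, and value of each nonzero entry
nz = which(dense != 0, arr.ind = TRUE)
triplets = data.frame(i = nz[, "row"], j = nz[, "col"], x = dense[nz])

# Rebuild the dense matrix from the triplets to confirm nothing was lost
rebuilt = matrix(0, N, N)
rebuilt[cbind(triplets$i, triplets$j)] = triplets$x

identical(rebuilt, dense)
as.numeric(object.size(dense)) / as.numeric(object.size(triplets))
```

The dense matrix needs storage proportional to N squared, whereas the triplet table grows only with the number of nonzero entries.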
Exercises
1. Create a vector, x. Benchmark any(is.na(x)) against anyNA(). Do the results vary
with the size of the vector?
2. Examine the following function definitions to give you an idea of how integers are
used:
tail.matrix()
lm()
3. Construct a matrix of integers and a matrix of numerics. Using
pryr::object_size(), compare the objects.
4. How does the function seq.int(), which was used in the tail.matrix() function,
differ from the standard seq() function?
NOTE
A related memory-saving idea is to replace logical vectors with vectors from the bit package, which take up just
over 1/30th of the space (but you can't use NAs).
Example: Optimizing the move_square() Function
Figure 7-2 shows that our main bottleneck in simulating the game of Monopoly is the
move_square() function. Within this function, we spend around 50% of the time creating a
data frame, 20% calculating row sums, and the remainder on comparison operations. This
piece of code can be optimized fairly easily (while still retaining the same overall structure)
by incorporating the following improvements:[2]
- Instead of using seq(1, 6) to generate the six possible values of rolling a die, use 1:6.
Also, instead of a data frame, use a matrix and perform a single call to the sample()
function:
matrix(sample(1:6, 6, replace = TRUE), ncol = 2)
Overall, this revised line is around 25 times faster; most of the speed boost came from
switching to a matrix.
- Use rowSums() instead of apply(). The apply() function call is already faster because
we switched from a data frame to a matrix (around three times). Using rowSums() with a
matrix gives a 10-fold speed boost.
- Use && in the if condition; this is about twice as fast as &.
Impressively, the refactored code runs 20 times faster than the original code. Compare
Figures 7-2 and 7-4; the main speed boost comes from using a matrix instead of a data
frame.
Figure 7-4. Code profiling of the optimized code
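The three improvements can be combined in a few lines. The following is a minimal sketch of the techniques only; it is not the authors' move_square() function, and the variable names are illustrative:

```r
set.seed(1)  # reproducible rolls, for illustration only

# Three rolls of two dice, one roll per row
rolls = matrix(sample(1:6, 6, replace = TRUE), ncol = 2)

# Row totals: apply() works, but rowSums() is the specialised, faster route
total_apply = apply(rolls, 1, sum)
total_rowsums = rowSums(rolls)

# A roll is a double when both dice show the same value
is_double = rolls[, 1] == rolls[, 2]

# && compares single values and stops at the first FALSE
if (is_double[1] && is_double[2] && is_double[3]) {
  message("Three doubles in a row")
}
```

Both total_apply and total_rowsums hold the same values; the difference lies only in how quickly they are computed.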
Exercise
1. The move_square() function shown in Figure 7-4 uses a vectorized solution.
Whenever we move, we always roll six dice, then examine the outcome and
determine the number of doubles. However, this is potentially wasteful, since the
probability of getting one double is 1/6 and two doubles is 1/36. Another method is to
only roll additional dice if and when they are needed. Implement and time this
solution.
Parallel Computing
This section provides a brief foray into the world of parallel computing. It only looks at
methods for parallel computing on shared memory systems. This simply means computers in
which multiple CPU cores can access the same block of memory (i.e., most laptops and desktops sold
worldwide). This section provides a flavor of what is possible; for a fuller account of parallel
processing in R, see Parallel R by McCallum and Weston (O'Reilly).
The foundational package for parallel computing in R is parallel. In recent R versions (since
R 2.14.0), this comes preinstalled with base R. The parallel package must still be loaded
before use, however, and you must determine the number of available cores
manually, as illustrated in the following code:
library("parallel")
no_of_cores = detectCores()
NOTE
The value returned by detectCores() turns out to be operating-system and chip-maker dependent; see
help("detectCores") for full details. For most standard machines, detectCores() returns the number of
simultaneous threads.
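Because the reported value is platform dependent, it can help to guard against an undetermined count before deciding how many workers to start. A minimal sketch follows; keeping one core free is an assumption made for illustration, not a rule from the parallel package:

```r
library("parallel")

# detectCores() can return NA when the count cannot be determined
no_of_cores = detectCores()
if (is.na(no_of_cores)) no_of_cores = 1L

# Hypothetical policy: leave one core free for the rest of the system
workers = max(1L, no_of_cores - 1L)
workers
```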
Parallel Versions of Apply Functions
The most commonly used parallel applications are parallelized replacements of lapply(),
sapply(), and apply(). The parallel implementations and their arguments are shown in the
following code example:
parLapply(cl, x, FUN, ...)
parApply(cl = NULL, X, MARGIN, FUN, ...)
parSapply(cl = NULL, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
The key point is that there is very little difference in arguments between parLapply() and
lapply(), so the barrier to using (this form of) parallel computing is low, assuming you are
proficient with the apply family of functions. Each of these functions has an argument cl,
which is created by a makeCluster() call. This function, among other things, specifies the
number of processors to use.
Example: Snakes and Ladders
Parallel computing is ideal for Monte Carlo simulations. Each core independently simulates a
realization from the model. At the end, we gather up the results. In the efficient package, there
is a function that simulates a single game of Snakes and Ladders: snakes_ladders().[3]
The following code illustrates how to simulate N games using sapply():
N = 10^4
sapply(1:N, snakes_ladders)
Rewriting this code to make use of the parallel package is straightforward. Begin by making a
cluster object:
library("parallel")
cl = makeCluster(4)
Then simply swap sapply() for parSapply():
parSapply(cl, 1:N, snakes_ladders)
It is important to stop the created clusters, as failing to do so can lead to memory leaks,[4] as
illustrated in the following code:
stopCluster(cl)
If we achieved perfect parallelization on a machine with four (or more) cores, we would obtain a
four-fold speed-up (we set makeCluster(4)). However, it is rare to achieve this optimal
speed-up, since there is always some overhead from communication with the workers.
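The whole create, compute, and stop cycle can be tried without the efficient package. In the sketch below, toy_game() is a hypothetical stand-in for snakes_ladders() (it simply counts die rolls needed to reach 100), and two workers are used to keep the example modest:

```r
library("parallel")

# Hypothetical stand-in for snakes_ladders(): rolls a die until the running
# total reaches 100 and returns the number of rolls taken.
toy_game = function(i) {
  position = 0
  rolls = 0
  while (position < 100) {
    position = position + sample(1:6, 1)
    rolls = rolls + 1
  }
  rolls
}

N = 100
cl = makeCluster(2)             # two workers keeps the sketch modest
clusterSetRNGStream(cl, 123)    # reproducible random numbers on the workers
res = parSapply(cl, 1:N, toy_game)
stopCluster(cl)                 # always release the workers

length(res)
range(res)
```

For a task this small the cost of starting the workers will outweigh any gain; the benefit appears only when each simulation is expensive or N is large.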
Exit Functions with Care
Always call stopCluster() to free resources when you finish with the cluster object.
However, if the parallel code is within a function call that results in an error, the
stopCluster() command would be omitted.
The on.exit() function handles this problem with a minimum of fuss; regardless of how the
function ends, the expression passed to on.exit() is always run. In the context of parallel
programming, we will have something similar to:
simulate = function(cores) {
  cl = makeCluster(cores)
  on.exit(stopCluster(cl))
  # Do something
}
TIP
Another common use of on.exit() is with the par() function. If you use par() to change graphical parameters
within a function, on.exit() ensures that these parameters are reset to their previous value when the function ends.
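The guarantee that on.exit() provides can be checked without starting a cluster. In this minimal sketch, risky() and log_env are hypothetical names; the registered expression records a message in place of a real clean-up step such as stopCluster():

```r
# An environment used to record that the clean-up code ran
log_env = new.env()
log_env$events = character(0)

risky = function(fail) {
  # Registered clean-up: runs on normal return and on error alike
  on.exit(log_env$events <- c(log_env$events, "cleaned up"))
  if (fail) stop("something went wrong")
  "finished"
}

res_ok = risky(FALSE)                      # normal exit
res_err = try(risky(TRUE), silent = TRUE)  # exit through an error

res_ok
inherits(res_err, "try-error")
log_env$events
```

The clean-up message is recorded twice, once for each call, even though the second call never reached its final line.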
Parallel Code under Linux and OS X
If you are using Linux or OS X, then another way of running code in parallel is to use the
mclapply() and mcmapply() functions:
# This will run on Windows, but will only use 1 core
mclapply(1:N, snakes_ladders)
These functions use forking; that is, creating a new copy of a process running on the CPU.
However, Windows does not support this low-level functionality in the way that Linux does.
The main advantage of mclapply() is that you don't have to start and stop cluster objects. The
big disadvantage is that on Windows machines, you are limited to a single core.
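Code that has to run on every platform can choose the number of cores at run time. The following is a small sketch under the assumption that two cores are acceptable on Linux and OS X; the squaring function merely stands in for real work:

```r
library("parallel")

# Forking is unavailable on Windows, where mc.cores must be 1
cores = if (.Platform$OS.type == "windows") 1L else 2L

squares = mclapply(1:4, function(i) i^2, mc.cores = cores)
unlist(squares)
```

As with lapply(), the result is a list with one element per input value, returned in the original order.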
Rcpp
Sometimes R is just slow. You've tried every trick you know, and your code is still crawling
along. At this point, you could consider rewriting key parts of your code in another, faster
language. R has interfaces to other languages via packages, such as Rcpp, rJava, rPython,
and recently V8. These provide R interfaces to C++, Java, Python, and JavaScript,
respectively. Rcpp is the most popular of these (Figure 7-5).
Figure 7-5. Downloads per day from the RStudio CRAN mirror of packages that provide R interfaces to other languages
C++ is a modern, fast, and very well-supported language with libraries for performing many
kinds of computational tasks. Rcpp makes incorporating C++ code into your R workflow
easy.
Although C/Fortran routines can be called using the .Call() function, this is not recommended
because using .Call() can be a painful experience. Rcpp provides a friendly API that lets you
write high-performance code, bypassing R's tricky C API. Typical bottlenecks that C++
addresses are loops and recursive functions.
C++ is a powerful programming language about which entire books have been written. This
section therefore is focused on getting started and providing a flavor of what is possible. It is
structured as follows. After ensuring that your computer is set up for Rcpp, we proceed by
creating a simple C++ function, to show how C++ compares with R ("A Simple C++
Function"). This is converted into an R function using cppFunction() in "The cppFunction()
Command".
The remainder of the chapter explains C++ data types ("C++ Data Types"), illustrates how to
source C++ code directly ("The sourceCpp() Function"), explains vectors ("Vectors and
Loops") and Rcpp sugar ("C++ with Sugar on Top"), and finally provides guidance on
further resources on the subject ("Rcpp Resources").
A Simple C++ Function
To write and compile C++ functions, you need a working C++ compiler (see "Prerequisites").
The code in this chapter was generated using version 0.12.7 of Rcpp.
Rcpp is well documented, as illustrated by the number of vignettes on the package's CRAN
page. In addition to its popularity, many other packages depend on Rcpp, which can be seen
by looking at the Reverse Imports section.
To check that you have everything needed for this chapter, run the following piece of code
from the efficient package:
efficient::test_rcpp()
A C++ function is similar to an R function: you pass a set of inputs to a function, some code is
run, and a single object is returned. However, there are some key differences:
- In the C++ function, each statement must be terminated with ;. In R, we use ; only when we
have multiple statements on the same line.
- We must declare object types in the C++ version. In particular, we need to declare the
types of the function arguments, the return values, and any intermediate objects we
create.
- The function must have an explicit return statement. Similar to R, there can be multiple
returns, but the function will terminate when it hits its first return statement.
- You do not use assignment when creating a function.
- Object assignment must use the = sign. The <- operator isn't valid.
- One-line comments can be created using //. Multiline comments are created using
/*...*/.
Suppose we want to create a function that adds two numbers together. In R, this would be a
simple one-line affair:
add_r = function(x, y) x + y
In C++, it is a bit more long-winded:
/* Return type double
 * Two arguments, also doubles
 */
double add_cpp(double x, double y) {
  double value = x + y;
  return value;
}
If we were writing a C++ program, we would also need another function called main(). We
would then compile the code to obtain an executable. The executable is platform-dependent.
The beauty of using Rcpp is that it makes it very easy to call C++ functions from R and the
user doesn't have to worry about the platform, compilers, or the R/C++ interface.
The cppFunction() Command
If we pass the C++ function created in the previous section as a text string argument to
cppFunction()
library("Rcpp")
cppFunction('
  double add_cpp(double x, double y) {
    double value = x + y;
    return value;
  }
')
Rcpp will magically compile the C++ code and construct a function that bridges the gap
between R and C++. After running the code shown previously, we now have access to the
add_cpp() function
add_cpp
#> function (x, y)
#> .Primitive(".Call")(<pointer: 0x2b9e590670e0>, x, y)
and can call the add_cpp() function in the usual way:
add_cpp(1, 2)
#> [1] 3
We don't have to worry about compilers. Also, if you include this function in a package, users
don't have to worry about any of the Rcpp magic. It just works.
C++ Data Types
The most basic type of variable is an integer, int. The C++ standard only guarantees that an
int can store values in at least the range -32767 to +32767 (16 bits); on most modern platforms
an int is 32 bits wide and ranges from -2^31 to 2^31 - 1. To store floating-point numbers,
there are single-precision numbers (float) and double-precision numbers (double). A double
typically takes twice as much memory as a float (in general, we should always work with
double-precision numbers unless we have a compelling reason to switch to floats). For single
characters, we use the char data type.
NOTE
There is also something called an unsigned int, which is guaranteed to cover at least 0 to 65,535, and a long int,
which is guaranteed to hold values up to at least 2^31 - 1.
A pointer object is a variable that points to an area of memory that has been given a name.
Pointers are a very powerful (but primitive) facility contained in the C++ language. They
can be very efficient because rather than passing large objects around, we pass a pointer
to the memory location; in other words, rather than pass the house, we just give the address.
We won't use pointers in this chapter, but mention them for completeness. Table 7-1 gives an
overview.
Table 7-1. Overview of key C++ object types
Type      Description
char      A single character
int       An integer
float     A single-precision floating-point number
double    A double-precision floating-point number
void      A valueless quantity
The sourceCpp() Function
The cppFunction() is great for getting small examples up and running. But it is better
practice to put your C++ code in a separate file (with file extension .cpp) and use the function
call sourceCpp("path/to/file.cpp") to compile them. However, we do need to include a few
headers at the top of the file. The first line we add gives us access to the Rcpp functions. The
file Rcpp.h contains a list of function and class definitions supplied by Rcpp. This file will be
located where Rcpp is installed. The include line
#include <Rcpp.h>
causes the compiler to replace that line with the contents of the named source file. This means
that we can access the functions defined by Rcpp. To access the Rcpp functions, we would
have to type Rcpp::function_1. To avoid typing Rcpp::, we use the namespace facility:
using namespace Rcpp;
Now we can just type function_1(); this is the same concept that R uses for managing
function name collisions when loading packages. Above each function we want to export/use
in R, we add the tag:
// [[Rcpp::export]]
NOTE
Similar to packages and the library() function in R, we access additional functions via #include. A standard
header to include is #include <math.h>, which contains standard mathematics functions.
This would give the complete file:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double add_cpp(double x, double y) {
  double value = x + y;
  return value;
}
There are two main benefits with putting your C++ functions in separate files. First, we have
the benefit of syntax highlighting (RStudio has great support for C++ editing). Second, it's
easier to make syntax errors when switching between R and C++ in the same file. To save
space, we'll omit the headers for the remainder of the chapter.
Vectors and Loops
Let's now consider a slightly more complicated example. Here we want to write our own
function that calculates the mean. This is just an illustrative example: R's version is much
better and more robust to scale differences in our data. For comparison, let's create a
corresponding R function; this is the same function we used in Chapter 3. The function
takes a single vector x as input and returns the mean value, m:
mean_r = function(x) {
  m = 0
  n = length(x)
  for (i in 1:n)
    m = m + x[i] / n
  m
}
This is a very bad R function; we should just use the base function mean() for real-world
applications. However, the purpose of mean_r() is to provide a comparison for the C++
version, which we will write in a similar way.
In this example, we will let Rcpp smooth the interface between C++ and R by using the
NumericVector data type. This Rcpp data type mirrors the R vector object type. Other
common classes are IntegerVector, CharacterVector, and LogicalVector.
In the C++ version of the mean function, we specify the argument types: x (NumericVector)
and the return value (double). The C++ version of the mean() function is a few lines longer.
Almost always, the corresponding C++ version will be, possibly much, longer. In general, R
optimizes for reduced development time; C++ optimizes for fast execution time. The
corresponding C++ function for calculating the mean is:
double mean_cpp(NumericVector x) {
  int i;
  int n = x.size();
  double mean = 0;
  for (i = 0; i < n; i++) {
    mean = mean + x[i] / n;
  }
  return mean;
}
To use the C++ function, we need to source the file (remember to put the necessary headers
in):
sourceCpp("src/mean_cpp.cpp")
Although the C++ version is similar, there are a few crucial differences.
1. We use the .size() method to find the length of x.
2. The for loop has a more complicated syntax.
for (variable initialization; condition; variable update) {
  // Code to execute
}
In this example, the loop initializes i = 0 and will continue running until i < n is
false. The statement i++ increases the value of i by 1; essentially it's just a shortcut
for i = i + 1.
3. Similar to i++, C++ provides other operators to modify variables in place. For
example, we could rewrite part of the loop as
mean += x[i] / n;
The previous code adds x[i] / n to the value of mean. Other similar operators are
-=, *=, /=, and i--.
4. Indexing of a C++ vector starts at 0, not 1.
To compare the C++ and R functions, we'll generate some normal random numbers:
x = rnorm(1e4)
Then call the microbenchmark() function (the results are plotted in Figure 7-6).
# com_mean_r is the compiled version of mean_r
z = microbenchmark(
  mean(x), mean_r(x), com_mean_r(x), mean_cpp(x),
  times = 1000
)
In this simple example, the Rcpp variant is around 100 times faster than the corresponding
pure R version. This sort of speed-up is not uncommon when switching to an Rcpp solution.
Notice that the Rcpp version and standard base function mean() run at roughly the same
speed; after all, the base R function is written in C. However, mean() uses a more sophisticated
algorithm when calculating the mean to ensure accuracy.
Figure 7-6. Comparison of mean functions
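To give a feel for what a more careful algorithm might look like, here is a sketch of a two-pass mean written in R. It is an assumption about the general flavor of refinement (a first estimate followed by a correction from the residuals), not a transcription of the C code behind mean(); mean_two_pass() is a hypothetical name:

```r
# The loop-based mean from the text
mean_r = function(x) {
  m = 0
  n = length(x)
  for (i in 1:n)
    m = m + x[i] / n
  m
}

# Sketch of a two-pass mean: compute a first estimate, then add the average
# of the residuals to correct for accumulated rounding error
mean_two_pass = function(x) {
  n = length(x)
  m = sum(x) / n
  m + sum(x - m) / n
}

set.seed(1)
x = rnorm(1e4)
all.equal(mean_r(x), mean(x))
all.equal(mean_two_pass(x), mean(x))
```

On well-behaved data such as this, all three versions agree to within floating-point tolerance; the extra pass matters mainly when values differ greatly in scale.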
Exercises
Consider the following piece of code:
double test1() {
  double a = 1.0 / 81;
  double b = 0;
  for (int i = 0; i < 729; ++i)
    b = b + a;
  return b;
}
1. Save the function test1() in a separate file. Make sure it works.
2. Write a similar function in R and compare the speed of the C++ and R versions.
3. Create a function called test2(), in which the double variables have been replaced
by float. Do you still get the correct answer?
4. Change b = b + a to b += a to make your code more like C++.
5. (Difficult!) What's the difference between i++ and ++i?
Matrices
Each vector type has a corresponding matrix equivalent: NumericMatrix, IntegerMatrix,
CharacterMatrix, and LogicalMatrix. We use these types in a similar way to how we used
NumericVectors. The main differences are:
- When we initialize, we need to specify the number of rows and columns:
// 10 rows, 5 columns
NumericMatrix mat(10, 5);
// Length 10
NumericVector v(10);
- We subset using (), i.e., mat(5, 4).
- The first element in a matrix is mat(0, 0); remember that indexes start with 0, not 1.
- To determine the number of rows and columns, we use the .nrow() and .ncol()
methods.
C++ with Sugar on Top
Rcpp sugar brings a higher level of abstraction to C++ code written using the Rcpp API.
What this means in practice is that we can write C++ code in the style of R. For example,
suppose we wanted to find the squared difference of two vectors; a squared residual in
regression. In R, we would use
sq_diff_r = function(x, y) (x - y)^2
Rewriting the function in standard C++ would give
NumericVector res_c(NumericVector x, NumericVector y) {
  int i;
  int n = x.size();
  NumericVector residuals(n);
  for (i = 0; i < n; i++) {
    residuals[i] = pow(x[i] - y[i], 2);
  }
  return residuals;
}
With Rcpp sugar, we can rewrite this code to be more succinct and have more of an R feel:
NumericVector res_sugar(NumericVector x, NumericVector y) {
  return pow(x - y, 2);
}
In the previous C++ code, the pow() function and x - y are valid due to Rcpp sugar. Other
functions that are available include the d/q/p/r statistical functions, such as rnorm() and
pnorm(). The sweetened versions aren't usually faster than the C++ versions, but typically
there's very little difference between the two. However, with the sugared variety, the code is
shorter and is constantly being improved.
Exercises
1. Construct an R version (using a for loop rather than the vectorized solution),
res_r(), and compare the three function variants.
2. In the previous example, res_sugar() is faster than res_c(). Do you know why?
Rcpp Resources
The aim of this section was to provide an introduction to Rcpp. One of the selling points of
Rcpp is that there is a great deal of documentation available.
- The Rcpp website.
- The original Journal of Statistical Software paper describing Rcpp (Eddelbuettel and
François 2011) and the follow-up book Seamless R and C++ Integration with Rcpp
(Eddelbuettel 2013).
- Hadley Wickham provides a very readable chapter on Rcpp in Advanced R that goes into
a bit more detail than this section.
- The Rcpp section on the Stack Overflow website. Questions are often answered by the
Rcpp authors.
References
Braun, John, and Duncan J. Murdoch. 2007. A First Course in Statistical Programming with R.
Vol. 25. Cambridge: Cambridge University Press.
Goldberg, David. 1991. "What Every Computer Scientist Should Know About Floating-Point
Arithmetic." ACM Computing Surveys (CSUR) 23 (1). ACM: 5–48.
McCallum, Ethan, and Stephen Weston. 2011. Parallel R. O'Reilly Media.
Eddelbuettel, Dirk, and Romain François. 2011. "Rcpp: Seamless R and C++ Integration."
Journal of Statistical Software 40 (8): 1–18.
Eddelbuettel, Dirk. 2013. Seamless R and C++ Integration with Rcpp. Springer.
Wickham, Hadley. 2014a. Advanced R. CRC Press.
[1] Technically this isn't in base R; it's a recommended package.
[2] Solutions are available in the efficient package vignette.
[3] The idea for this example came to one of the authors after a particularly long and dull game of Snakes and Ladders with
his son.
[4] See github.com/npct/pct-shiny/issues/292 for a real-world example of the dangers of not stopping created cores.
Chapter 8. Efficient Hardware
This chapter is odd for a book on R programming. It contains very little code, and yet the
chapter has the potential to speed up your algorithms by orders of magnitude. This chapter
considers the impact that your computer has on your time.
Your hardware is crucial. It will not only determine how fast you can solve your problem, but
also whether you can even tackle the problem of interest. This is because everything is loaded
in RAM. Of course, having a more powerful computer costs money. The goal is to help you
decide whether the benefits of upgrading your hardware are worth that extra cost.
We begin this chapter with a background section on computer storage and memory and how it
is measured. Then we consider individual computer components, and conclude with renting
machines in the cloud.
Prerequisites
This chapter will focus on assessing your hardware and the benefit of upgrading. We will use
the benchmarkme package to quantify the effect of changing your CPU.
library("benchmarkme")
Top Five Tips for Efficient Hardware
1. Use the package benchmarkme to assess your CPU's number-crunching ability; is it
worth upgrading your hardware?
2. If possible, add more RAM.
3. Double-check that you have installed a 64-bit version of R.
4. Cloud computing is a cost-effective way of obtaining more computer power.
5. Solid-state drives typically won't have much impact on the speed of your R code but
will increase your overall productivity because I/O is much faster.
Background: What Is a Byte?
A computer cannot store "numbers" or "letters." The only thing a computer can store and
work with is bits. A bit is binary; it is either a 0 or a 1. In fact, from a physics perspective, a bit
is just a blip of electricity that either is or isn't there.
In the past, the ASCII character set dominated computing. This set defines 128 characters
including the digits 0 to 9, upper- and lowercase letters, and a few control characters such as a
new line. Storing these characters required 7 bits because 2^7 = 128, but 8 bits were typically
used for performance reasons. Table 8-1 gives the binary representation of a few
characters.
Table 8-1. The bit representation of a few ASCII characters
Bit representation    Character
01000001              A
01000010              B
01000011              C
01000100              D
01000101              E
01010010              R
The limitation of only having 256 characters led to the development of Unicode, a standard
framework aimed at creating a single character set for every reasonable writing system.
Typically, Unicode characters require 16 bits of storage.
Eight bits is one byte, or ASCII character. So two ASCII characters would use two bytes or 16
bits. A pure text document containing 100 characters would use 100 bytes (800 bits). Note that
markup, such as font information or metadata, can impose a substantial memory overhead: an
empty .docx file requires about 3,700 bytes of storage.
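The bit patterns in Table 8-1 can be reproduced from within R. The helper to_bits() below is a hypothetical name introduced for this sketch; it relies only on the base functions utf8ToInt() and intToBits():

```r
# Convert a single ASCII character to its 8-bit representation
to_bits = function(ch) {
  code = utf8ToInt(ch)                   # character to integer code, e.g. 65
  bits = as.integer(intToBits(code))     # 32 bits, least significant first
  paste(rev(bits[1:8]), collapse = "")   # keep 8 bits, most significant first
}

to_bits("A")  # "01000001"
to_bits("R")  # "01010010"

# A pure ASCII string uses one byte per character
nchar("Efficient R", type = "bytes")  # 11
```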
When computer scientists first started to think about computer memory, they noticed that 2^10 =
1024 ≃ 10^3 and 2^20 = 1,048,576 ≃ 10^6, so they adopted the shorthand of kilo- and megabytes.
Of course, everyone knew that it was just a shorthand, and it was really a binary power. When
computers became more widespread, foolish people like you and me just assumed that kilo
actually meant 10^3 bytes.
Fortunately, the IEEE Standards Board intervened and created conventional, internationally
adopted definitions of the International System of Units (SI) prefixes. So a kilobyte (kB) is 10^3
= 1000 bytes and a megabyte (MB) is 10^6 bytes or 10^3 kilobytes (see Table 8-2). A petabyte is
approximately 100 million drawers filled with text. Astonishingly, Google processes around
20 petabytes of data every day.
Table 8-2. Data-conversion table. Source: http://physics.nist.gov/cuu/Units/binary.html
Factor   Name   Symbol   Origin        Derivation
2^10     kibi   Ki       Kilobinary:   (2^10)^1
2^20     mebi   Mi       Megabinary:   (2^10)^2
2^30     gibi   Gi       Gigabinary:   (2^10)^3
2^40     tebi   Ti       Terabinary:   (2^10)^4
2^50     pebi   Pi       Petabinary:   (2^10)^5
Even though there is now an agreed upon standard for discussing memory, not everyone
follows it. Microsoft Windows, for example, uses 1 MB to mean 2^20 B. Even more confusing,
the capacity of a 1.44 MB floppy disk is a mixture, 1 MB = 10^3 × 2^10 B. Typically RAM is
specified in kibibytes, but hard-drive manufacturers follow the SI standard!
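The difference between the decimal and binary prefixes is simple arithmetic, sketched below in R. The variable names are illustrative only:

```r
# SI (decimal) prefixes versus IEC (binary) prefixes, in bytes
kB = 10^3;  MB = 10^6;  GB = 10^9
KiB = 2^10; MiB = 2^20; GiB = 2^30

# The gap between the two conventions grows with the prefix
c(kilo = KiB / kB, mega = MiB / MB, giga = GiB / GB)

# The "1.44 MB" floppy disk mixes both: 1.44 * 10^3 * 2^10 bytes
floppy_bytes = 1440 * 2^10
floppy_bytes        # 1474560
floppy_bytes / MB   # size in SI megabytes
floppy_bytes / MiB  # size in mebibytes
```

Neither of the last two figures is 1.44, which is why the floppy disk label matches neither convention.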
Random Access Memory
Random access memory (RAM) is a type of computer memory that can be accessed randomly:
any byte of memory can be accessed without touching the preceding bytes. RAM is found in
computers, phones, tablets, and even printers. The amount of RAM R has access to is
incredibly important. Since R loads objects into RAM, the amount of RAM you have available
can limit the size of dataset you can analyze.
Even if the original dataset is relatively small, your analysis can generate large objects. For
example, suppose we want to perform standard cluster analysis. The built-in dataset
USArrests is a data frame with 50 rows and four columns. Each row corresponds to a state in
the US:
head(USArrests, 3)
#>         Murder Assault UrbanPop Rape
#> Alabama   13.2     236       58 21.2
#> Alaska    10.0     263       48 44.5
#> Arizona    8.1     294       80 31.0
If we want to group states that have similar crime statistics, a standard first step is to calculate
the distance or similarity matrix:
d = dist(USArrests)
When we inspect the object size of the original dataset and the distance object using the pryr
package:
pryr::object_size(USArrests)
#> 5.23 kB
pryr::object_size(d)
#> 14.3 kB
We have managed to create an object that is almost three times larger than the original dataset.
NOTE
The distance object d is actually a vector that contains the distances in the upper triangular region.
Conceptually, the object d represents a symmetric n × n matrix, where n is the number of rows in
USArrests. Clearly, as n increases, the size of d increases at a rate of O(n^2). So if our original dataset
contained 10,000 records, the associated distance matrix would contain almost 10^8 values. Of
course, since the matrix is symmetrical, this corresponds to around 50 million unique values.
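The quadratic growth can be checked directly. The following sketch counts the pairwise distances that dist() stores; n_pairs() is a hypothetical helper, and the storage estimate assumes 8 bytes per value and ignores object overhead:

```r
# Number of pairwise distances stored by dist() for n observations
n_pairs = function(n) n * (n - 1) / 2

d = dist(USArrests)
length(d)                    # 1225 distances for the 50 states
n_pairs(nrow(USArrests))     # the same count from the formula

# Rough storage estimate for 10,000 records at 8 bytes per double
n_pairs(10000)               # 49,995,000 values
n_pairs(10000) * 8 / 10^6    # approximately 400 MB
```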
TIP
A rough rule of thumb is that your RAM should be three times the size of your dataset.
Another benefit of having more onboard RAM is that the garbage collector, a process that
runs periodically to free up system memory occupied by R, is called less often. It is
straightforward to determine how much RAM you have using the benchmarkme package:
benchmarkme::get_ram()
#> 16.3 GB
It is sometimes possible to increase your computer's RAM. On a computer motherboard,
there are typically two to four RAM or memory slots. If you have free slots, then you can add
more memory. RAM comes in the form of dual in-line memory modules (DIMMs) that can be
slotted into the motherboard spaces (see Figure 8-1 for an example).
Figure 8-1. Three DIMM slots on a computer motherboard used for increasing the amount of available RAM. Source:
Wikimedia
However, it is common that all slots are already taken. This means that to upgrade your
computer's memory, some or all of the DIMMs will have to be removed. To go from 8 GB to
16 GB, for example, you may have to discard the two 4 GB RAM cards and replace them with
two 8 GB cards. Increasing your laptop/desktop from 4 GB to 16 GB or 32 GB is cheap and
should definitely be considered. As R Core member Uwe Ligges states:
fortunes::fortune(192)
#>
#> RAM is cheap and thinking hurts.
#>    -- Uwe Ligges (about memory requirements in R)
#>       R-help (June 2007)
It is a testament to the design of R that it is still relevant and its popularity is growing. Ross
Ihaka, one of the originators of the R programming language, made a throw-away comment
in 2003:
fortunes::fortune(21)
#>
#> I seem to recall that we were targeting 512k Macintoshes. In our dreams
#> we might have seen 16Mb Sun.
#>    -- Ross Ihaka (in reply to the question whether R&R thought when they
#>       started out that they would see R using 16G memory on a dual Opteron
#>       computer)
#>       R-help (November 2003)
Considering that a standard smartphone now contains 1 GB of RAM, the fact that R was
designed for "basic" computers but can scale across clusters is impressive. R's origins on
computers with limited resources help explain its efficiency at dealing with large datasets.
Exercises
The following two exercises aim to help you determine if it is worthwhile to upgrade your
RAM.
1. R loads everything into memory (i.e., your computer's RAM). How much RAM does
your computer have?
2. Using your preferred search engine, how much does it cost to double the amount of
available RAM on your system?
Hard Drives: HDD Versus SSD
You are using R because you want to analyze data. The data is typically stored on your hard
drive, but not all hard drives are equal. Unless you have a fairly expensive laptop, your
computer probably has a standard hard disk drive (HDD). HDDs were first introduced by IBM
in 1956. Data is stored using magnetism on a rotating platter, as shown in Figure 8-2. The
faster the platter spins, the faster the HDD can perform. Many laptop drives spin at either
5,400 or 7,200 RPM (revolutions per minute). The major advantage of HDDs is that they are
cheap, making a 1 TB laptop standard.
NOTE
In the authors' experience, having an SSD drive doesn't make too much of a difference to R. However, the
reduction in boot time and general tasks makes an SSD drive a wonderful purchase.
Figure 8-2. A standard 2.5" hard drive, found in most laptops. Source: Wikimedia
Solid-state drives (SSDs) can be thought of as large but more sophisticated versions of USB
sticks. They have no moving parts, and information is stored in microchips. Since there are
no moving parts, reading/writing is much quicker. SSDs have other benefits: they are quieter,
allow faster boot time (no spin-up time), and require less power (more battery life).
The read/write speed for a standard HDD is usually in the region of 50 to 100 MB/s (usually
closer to 50 MB/s). For SSDs, speeds are typically over 200 MB/s. For top-of-the-range models
this can approach 500 MB/s. If you're wondering, read/write speeds for RAM are around 2 to
20 GB/s. So at best, SSDs are at least one order of magnitude slower than RAM, but still faster
than standard HDDs.
TIP
If you are unsure about what type of hard drive you have, then time how long your computer takes to reach the
login screen. If it is less than five seconds, you probably have an SSD.
Operating Systems: 32-Bit or 64-Bit
R comes in two versions: 32-bit and 64-bit. Your operating system also comes in two
versions, 32-bit and 64-bit. Ideally, you want 64-bit versions of both R and the operating
system. Using a 32-bit version of either has severe limitations on the amount of RAM R can
access. So when we suggest that you should just buy more RAM, this assumes that you are
using a 64-bit operating system, with a 64-bit version of R.
NOTE
If you are using an OS version from the last five years, it is unlikely to be a 32-bit OS.
A 32-bit machine can access at most only 4 GB of RAM. Although some CPUs offer solutions
to this limitation, if you are running a 32-bit operating system, then R is limited to around 3
GB of RAM. If you are running a 64-bit operating system but only a 32-bit version of R, then
you have access to slightly more memory (but not much). Modern systems should run a 64-bit
operating system, with a 64-bit version of R. Your memory limit is now measured as 8 TB for
Windows machines and 128 TB for Unix-based OSes. An easy method for determining if you
are running a 64-bit version of R is to run
.Machine$sizeof.pointer
which will return 8 if you are running a 64-bit version of R.
To find precise details, consult the R help pages help("Memory-limits") and help("Memory").
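The pointer size is reported in bytes, so multiplying by 8 gives the familiar 32 or 64. A minimal sketch (the variable names are illustrative):

```r
# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build
pointer_bytes = .Machine$sizeof.pointer
r_bits = pointer_bytes * 8
r_bits

# R.version$arch reports the architecture R was built for
R.version$arch
```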
Exercises
These exercises aim to condense the previous section into the key points.
1. Are you using a 32-bit or 64-bit version of R?
2. If you are using Windows, what are the results of running the command
memory.limit()?
Central Processing Unit
The central processing unit (CPU), or the processor, is the brain of a computer. The CPU is
responsible for performing numerical calculations. The faster the processor, the faster R will
run. The clock speed (or clock rate, measured in hertz) is the frequency with which the CPU
executes instructions. The faster the clock speed, the more instructions a CPU can execute in a
second. CPU clock speed for a single CPU has been fairly static in the last couple of years,
hovering around 3.4 GHz (see Figure 8-3).
Figure 8-3. CPU clock speed. The data for this figure was collected from web forums and Wikipedia. It is intended to
indicate general trends in CPU speed.
Unfortunately, we can't simply use clock speeds to compare CPUs, since the internal
architecture of a CPU plays a crucial role in determining its performance. The R package
benchmarkme provides functions for benchmarking your system and contains data from
previous benchmarks. Figure 8-4 shows the relative performance for over 150 CPUs.
Figure 8-4. CPU benchmarks from the R package, benchmarkme. Each point represents an individual CPU result.
Running the benchmarks and comparing your CPU to others is straightforward using the
benchmarkme package. After loading the package, we can benchmark your CPU
res = benchmark_std()
and compare the results to other users:
plot(res)
# Upload your benchmarks for future users
upload_results(res)
You get the model specifications of the top CPUs using get_datatable(res).
Cloud Computing
Cloud computing uses networks of remote servers, instead of a local computer, to store and
analyze data. It is now becoming increasingly popular to rent cloud computing resources.
Amazon EC2
Amazon Elastic Compute Cloud (EC2) is one of a number of providers of this service. EC2
makes it (relatively) easy to run R instances in the cloud. Users can configure the operating
system, CPU, hard drive type, the amount of RAM, and where the project is physically located.
If you want to run a server in the Amazon EC2 cloud, you have to select the system you are
going to boot up. There is a vast array of prepackaged system images. Some of these images
are just basic operating systems, such as Debian or Ubuntu, which require further
configuration. There is also an Amazon machine image that specifically targets R and
RStudio.
Exercise
1. To assess whether you should consider cloud computing, find out how much it would
cost to rent a machine comparable to your laptop in the cloud for one year.
Chapter 9. Efficient Collaboration
Large projects inevitably involve many people. This poses risks but also creates opportunities
for improving computational efficiency and productivity, especially if project collaborators
are reading and committing code. This chapter provides guidance on how to minimize the
risks and maximize the benefits of collaborative R programming.
Collaborative working has a number of benefits. A team with a diverse skill set is usually
stronger than a team with a very narrow focus. It makes sense to specialize: clearly defining
roles such as statistician, frontend developer, system administrator, and project manager will
make your team stronger. Even if you are working alone, dividing the work into discrete
branches in this way can be useful, as discussed in Chapter 4.
Collaborative programming provides an opportunity for people to review each other’s code.
This can be encouraged by using a uniform style with many comments, as described in
“Coding Style”. Like using a clear style in human language, following a style guide has the
additional advantage of making your code more understandable to others.
When working on complex programming projects with multiple interdependencies, version
control is essential. Even on small projects, tracking the progress of your project’s codebase
has many advantages and makes collaboration much easier. Fortunately, it is now easier than
ever before to integrate version control into your project, using RStudio’s interface to the
version control software git and online code-sharing websites such as GitHub. This is the
subject of “Version Control”.
The final section, “Code Review”, addresses the question of working in a team and
performing code reviews.
Prerequisites
This chapter deals with coding standards and techniques. The only packages required for this
chapter are lubridate and dplyr. These packages are used to illustrate good practices.
Top Five Tips for Efficient Collaboration
1. Maintain a consistent coding style.
2. Think carefully about your comments and keep them up to date.
3. Use version control whenever possible.
4. Use informative commit messages.
5. Don’t be afraid to elicit feedback from colleagues.
Coding Style
To be a successful programmer, you need to use a consistent programming style. There is no
single correct style, but using multiple styles in the same project is wrong (Bååth 2012). To
some extent, good style is subjective and up to personal taste. There are, however, general
principles that most programmers agree on, such as:
- Use modular code
- Comment your code
- Don’t Repeat Yourself (DRY)
- Be concise, clear, and consistent
Good coding style will make you more efficient even if you are the only person who reads it.
When your code is read by multiple readers or you are developing code with coworkers,
having a consistent style is even more important. There are a number of R style guides online
that are broadly similar, including those by Google, Hadley Wickham, and Richie Cotton. The
style followed in this book is based on a combination of Hadley Wickham’s guide and our
own preferences (we follow Yihui Xie in preferring = to <- for assignment, for example).
In line with the principle of automation (automate any task where doing so saves time),
the easiest way to improve your code is to ask your computer to do it using RStudio.
Reformatting Code with RStudio
RStudio can automatically clean up poorly indented and formatted code. To do this, select the
lines that need to be formatted (e.g., via Ctrl-A to select the entire script), then automatically
indent them with Ctrl-I. The shortcut Ctrl-Shift-A will reformat the code, adding spaces for
maximum readability. An example is provided here:
# Poorly indented/formatted code
if(!exists("x")){
x=c(3,5)
y=x[2]}
This code chunk works but is not pleasant to read. RStudio automatically indents the code
after the if statement as follows:
# Automatically indented code (Ctrl-I in RStudio)
if(!exists("x")){
  x=c(3,5)
  y=x[2]}
This is a start, but it’s still not easy to read. This can be fixed in RStudio as illustrated in the
following code chunk (these options can be seen in the Code menu, accessed with Alt-C on
Windows/Linux computers):
# Automatically reformat the code (Ctrl-Shift-A in RStudio)
if (!exists("x")) {
  x = c(3, 5)
  y = x[2]
}
Note that some aspects of style are subjective; for example, we would not leave a space after
the if and ).
Filenames
Filenames should use the .R extension and should be lowercase (e.g., load.R). Avoid spaces.
Use a dash or underscore to separate words.
# Good names
normalize.R
load.R
# Bad names
Normalize.r
load data.R
Section 1.1 of Writing R Extensions provides more detailed guidance on filenames, such as
avoiding non-English alphabetic characters as they cannot be guaranteed to work across
locales. While the guidelines are strict, the guidance aids in making your scripts more
portable.
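These conventions are simple enough to check mechanically. The following is a minimal sketch rather than an official rule set: the regular expression and the variable names are our own assumptions, and the pattern encodes only "lowercase, no spaces, dash or underscore as separator, .R extension".

```r
# Hypothetical filename check based on the conventions described above
file_names = c("normalize.R", "load.R", "Normalize.r", "load data.R")

# Lowercase letters, digits, underscores, and dashes, followed by .R
is_good = grepl("^[[:lower:][:digit:]_-]+\\.R$", file_names)

data.frame(file_names, is_good)
```

A check like this could be run over list.files() in a project directory to flag scripts that do not follow the convention.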
Loading Packages
Library function calls should be at the top of your script. When loading an essential package,
use library instead of require since a missing package will then raise an error. If a package
isn’t essential, use require and appropriately capture the warning raised. Package names
should be surrounded with quotation marks.
# Good
library("dplyr")
# Non-standard evaluation
library(dplyr)
Avoid listing every package you may need; instead just include the packages you actually use.
If you find that you are loading many packages, consider putting all packages in a file called
packages.R and using source appropriately.
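The difference between the two loading functions can be sketched as follows. The package name below is hypothetical, chosen only because it is almost certainly not installed; the fallback message is likewise our own illustration of "appropriately capturing the warning".

```r
# Hypothetical optional package; it is assumed NOT to be installed
optional_pkg = "notARealPackage123"

# require() returns FALSE (with a warning) instead of stopping the script
has_optional = suppressWarnings(
  require(optional_pkg, character.only = TRUE, quietly = TRUE)
)

if (!has_optional) {
  message("Optional package not available; continuing without it")
}
```

With library() the same missing package would stop the script with an error, which is the behavior you want for essential dependencies.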
Commenting
Comments can greatly improve the efficiency of collaborative projects by helping everyone
to understand what each line of code is doing. However, comments should be used carefully;
plastering your script with comments does not necessarily make it more efficient, and too
many comments can be inefficient. Updating heavily commented code can be a pain: not
only will you have to change all the R code, you’ll also have to rewrite or delete all the
comments!
Ensure that your comments are meaningful. Avoid using verbose English to explain standard
R code. The following comment, for example, adds no useful information because it is
obvious by reading the code that x is being set to 1:
# Setting x equal to 1
x = 1
Instead, comments should provide context. Imagine that x was being used as a counter (in
which case it should probably have a more meaningful name, like counter, but we’ll continue
to use x for illustrative purposes). In that case, the comment could explain your intention for
its future use:
# Initialize counter
x = 1
The previous example illustrates that comments are more useful if they provide context and
explain the programmer’s intention (McConnell 2004). Each comment line should begin with
a single hash (#), followed by a space. Comments can be toggled (turned on and off) in this
way with Ctrl-Shift-C in RStudio. The double hash (##) can be reserved for R output. If you
follow your comment with four dashes (# ----) RStudio will enable code folding until the
next instance of this.
Object Names
“When I use a word,” Humpty Dumpty said, in a rather scornful tone, “it means just what I
choose it to mean, neither more nor less.”
Lewis Carroll, Through the Looking Glass, Chapter 6
It is important for objects and functions to be named consistently and sensibly. To take a silly
example, imagine if all objects in your projects were called x, xx, xxx, etc. The code would
run fine. However, it would be hard for other people, and a future you, to figure out what was
going on, especially when you got to the object xxxxxxxxxx!
For this reason, giving a clear and consistent name to your objects, especially if they are
going to be used many times in your script, can boost project efficiency (if an object is only
used once, its name is less important, a case where x could be acceptable). Following
discussion in “The State of Naming Conventions in R” by Rasmus Bååth and elsewhere, we
suggest an underscore_separated style for function and object names.[1] Unless you are
creating an S3 object, avoid using a . in the name (this will help avoid confusing Python
programmers!). Names should be concise yet meaningful.
In functions, the required arguments should always be first, followed by optional arguments.
The special ... argument should come last. If your argument has a boolean value, use
TRUE/FALSE instead of T/F for clarity.
WARNING
It’s tempting to use T/F as shortcuts. But it is easy to accidentally redefine these variables (e.g., F = 10). R raises
an error if you try to redefine TRUE/FALSE.
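The difference between the shortcuts and the reserved words can be demonstrated in a few lines. This is an illustrative sketch: the variable names f_redefined and true_protected are our own.

```r
# F is an ordinary variable, so it can be overwritten without complaint
F = 10
f_redefined = !identical(F, FALSE)  # F no longer means FALSE

# Remove our copy so that base R's definition of F is visible again
rm(F)

# TRUE is a reserved word; attempting to assign to it raises an error
attempt = try(eval(parse(text = "TRUE = 10")), silent = TRUE)
true_protected = inherits(attempt, "try-error")

c(f_redefined = f_redefined, true_protected = true_protected)
```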
While it’s possible to write arguments that depend on other arguments, try to avoid using this
idiom as it makes the default behavior harder to understand. Typically, it’s
easier to set an argument to have a default value of NULL and check its value using is.null()
than to use missing(). Where possible, avoid using names of existing functions.
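A minimal sketch of these argument conventions is shown below. The function center_values() and its arguments are hypothetical, invented purely for illustration: the required argument comes first, optional arguments follow, ... comes last, and the optional argument uses a NULL default that is checked with is.null().

```r
# Hypothetical function illustrating argument order and a NULL default
center_values = function(x, center = NULL, na_rm = TRUE, ...) {
  # Fall back to the mean when no centering value is supplied
  if (is.null(center)) {
    center = mean(x, na.rm = na_rm, ...)
  }
  x - center
}

center_values(c(1, 2, 3))              # centered on the mean
center_values(c(1, 2, 3), center = 1)  # centered on a user-supplied value
```

Because the default is an explicit NULL, a caller can also pass center = NULL programmatically, which is awkward to do when the function relies on missing().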
Example Package
The lubridate package is a good example of a package that has a consistent naming system,
which makes it easy for users to guess its features and behavior. Dates are encoded in a
variety of ways, but the lubridate package has a neat set of functions whose names consist of
the three letters y, m, and d (for year, month, and day) in the order they appear in the input. For example:
library("lubridate")
ymd("2012-01-02")
dmy("02-01-2012")
mdy("01-02-2012")
Assignment
The two most common ways of assigning values to objects in R are <- and =. In most (but
not all) contexts, they can be used interchangeably. Regardless of which operator you prefer,
consistency is key, particularly when working in a group. In this book we use the = operator
for assignment, as it’s faster to type and more consistent with other languages.
The one place where a difference occurs is during function calls. Consider the following
piece of code used for timing random number generation:
system.time(expr1 <- rnorm(10e5))
system.time(expr2 = rnorm(10e5)) # error
The first line will run correctly and create a variable called expr1. The second line will raise
an error. When we use = in a function call, it changes from an assignment operator to an
argument passing operator. For further information about assignment, see ?assignOps.
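The behavior described above can be verified without stopping the session by wrapping the failing call in try(). This is a sketch: the smaller sample size and the variable name attempt are our own choices.

```r
# <- inside a call performs an assignment, so expr1 is created
system.time(expr1 <- rnorm(1e5))
exists("expr1")

# = inside a call names an argument; system.time() has no argument
# called expr2, so the call fails and nothing is assigned
attempt = try(system.time(expr2 = rnorm(1e5)), silent = TRUE)
inherits(attempt, "try-error")
exists("expr2")
```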
Spacing
Consistent spacing is an easy way of making your code more readable. Even a simple
command such as x = x + 1 takes a bit more time to understand when the spacing is
removed (i.e., x=x+1). You should add a space around the operators +, -, /, and *. Include a
space around the assignment operators, <- and =. Additionally, add a space around any
comparison operators such as == and <. The latter rule helps avoid bugs:
# Bug. x now equals 1
x[x<-1]
# Correct. Selecting values less than -1
x[x < -1]
The exceptions to the space rule are :, ::, and :::, as well as $ and @ symbols for selecting
subparts of objects. As with English, add a space after a comma:
z[z$colA > 1990, ]
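The spacing bug is easy to reproduce. In this sketch (the vectors x and y are made-up data), the correctly spaced comparison leaves the vector untouched, whereas the unspaced version is parsed as an assignment and silently overwrites it.

```r
x = c(-3, -2, 0, 2)
less_than_minus_one = x[x < -1]  # comparison: selects -3 and -2
x                                # x is unchanged

y = c(-3, -2, 0, 2)
y[y<-1]                          # parsed as y[y <- 1]: an assignment
y                                # y has been overwritten with 1
```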
Indentation
Use two spaces to indent code. Never mix tabs and spaces. RStudio can automatically convert
the tab character to spaces (see Tools -> Global options -> Code).
Curly Braces
Consider the following code:
# Bad style, fails
if(x < 5)
{
y}
else {
x}
Typing this straight into R will result in an error. An opening curly brace, {, should not go on
its own line and should always be followed by a line break. A closing curly brace should
always go on its own line (unless it’s followed by an else, in which case the else should go
on the same line as the closing brace). The code inside curly braces should be indented (and RStudio will enforce
this rule), as shown in the following code chunk:
# Good style
if(x < 5){
  x
} else {
  y
}
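The reason the first layout fails is that, at the top level, R treats the if statement as complete once the closing brace is followed by a line break, so the else on the next line has nothing to attach to. The sketch below checks this with parse(), which only parses the code and never evaluates it; the variable names are our own.

```r
# The two layouts stored as text so that they can be parsed, not run
bad_code = "if(x < 5)\n{\ny}\nelse {\nx}"
good_code = "if(x < 5){\n  x\n} else {\n  y\n}"

bad_result = try(parse(text = bad_code), silent = TRUE)
good_result = try(parse(text = good_code), silent = TRUE)

inherits(bad_result, "try-error")  # the else on its own line is a syntax error
is.expression(good_result)         # the recommended layout parses cleanly
```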
Exercise
1. Look at the difference between your style and RStudio’s based on a representative R
script that you have written (see “Coding Style”). What are the similarities? What are
the differences? Are you consistent? Write these down and think about how you can
use the results to improve your coding style.
Version Control
When a project gets large, complicated, or mission critical, it is important to keep track of
how it evolves. In the same way that Dropbox saves a backup of your files, version control
systems keep a backup of your code. The only difference is that version control systems back
up your code forever.
The version control system we recommend is Git, a command-line application created by
Linus Torvalds, who also invented Linux.[2] The easiest way to integrate your R projects with
Git, if you’re not accustomed to using a shell (e.g., the Unix command line), is with RStudio’s
Git tab in the top right-hand window (see Figure 9-1). This shows that a number of files have
been modified (as illustrated with the blue M symbol) and that some are new (as illustrated
with the yellow ? symbol). Checking the tick-box will enable these files to be committed.
Commits
Commits are the basic units of version control. Keep your commits atomic: each one should
only do one thing. Document your work with clear and concise commit messages, and use the
present tense (e.g., add analysis functions).
Committing code only updates the files on your local branch. To update the files stored on a
remote server (e.g., on GitHub), you must push the commit. This can be done using git push
from a shell or using the green up arrow in RStudio, as illustrated in Figure 9-1. The blue
down arrow will pull the latest version of the repository from the remote.[3]
Figure 9-1. The Git tab in RStudio
Git Integration in RStudio
How do you enable this functionality on your installation of RStudio? RStudio can be a GUI
for Git only if Git has been installed and RStudio can find it. You need a working installation of
Git (e.g., installed through apt-get install git on Ubuntu/Debian or via GitHub Desktop for
Mac and Windows). RStudio can be linked to your Git installation via Tools → Global
Options in the Git/SVN tab. This tab also provides a link to a help page on RStudio/Git.
Once Git has been linked to your RStudio installation, it can be used to track changes in a new
project by selecting Create a git repository when creating a new project. The tab
illustrated in Figure 9-1 will appear, allowing functionality for interacting with Git via
RStudio.
RStudio provides a useful GUI for navigating past commits. This allows you to see the entire
history of your project. To navigate and view the details of past commits, click on the Diff
button in the Git pane, as illustrated in Figure 9-2.
Figure 9-2. The Git history navigation interface
GitHub
GitHub is an online platform that makes sharing your work and collaborating on code easy.
There are alternatives such as GitLab. The focus here is on GitHub as it’s by far the most
popular among R developers. Also, through the command devtools::install_github(),
preview versions of a package can be installed and updated in an instant. This makes GitHub
packages a great way to access the latest functionality. And GitHub makes it easy to get your
work out there to the world for efficiently collaborating with others, without the restraints
placed on CRAN packages.
To install the GitHub version of the benchmarkme package, for example, you would enter
devtools::install_github("csgillespie/benchmarkme")
Note that csgillespie is the GitHub user and benchmarkme is the package name. Replacing
csgillespie with robinlovelace in the previous code would install Robin’s version of the
package. This is useful for fast collaboration with many people, but you must remember that
GitHub packages will not update automatically with the command update.packages() (see
“Updating R Packages”).
WARNING
Although GitHub is fantastic for collaboration, it can end up creating more problems than it solves if your
collaborators are not Git-literate. In one project, Robin eventually abandoned using GitHub after his collaborator
found it impossible to work with. More time was being spent debugging Git/GitHub than actually working. Our
advice therefore is to never impose Git and always ensure that other lines of communication (e.g., phone calls,
emails) are open because different people prefer different ways of communicating.
Branches, Forks, Pulls, and Clones
Git is a large program that takes a long time to learn in-depth. However, getting to grips with
the basics of some of its more advanced functions can make you a more efficient
collaborator. Using and merging branches, for example, allows you to test new features in a
self-contained environment before they are used in production (e.g., when shifting to an
updated version of a package that is not backwards compatible). Instead of bogging you down
with a comprehensive discussion of what is possible, this section cuts to the most important
features for collaboration: branches, forks, pulls, and clones. For a more detailed description
of Git’s powerful functionality, we recommend Jenny Bryan’s book, Happy Git and
GitHub for the useR.
Branches are distinct versions of your repository. Git allows you to jump seamlessly between
different versions of your entire project. To create a new branch called test, you need to enter
the shell and use the Git command line:
git checkout -b test
This is equivalent to entering two commands: git branch test to create the branch and then
git checkout test to check out that branch. Checking out a branch means switching to it. Any
changes will not affect your previous branch. In RStudio, you can jump quickly between
branches using the drop-down menu in the top right of the Git pane. This is illustrated in
Figure 9-1: see the master text followed by a down arrow. Clicking on this will allow you to
select other branches.
Forks are like branches, but they exist on other people’s computers. You can fork a repository
on GitHub easily, as described on the site’s help pages. If you want an exact copy of this
repository (including the commit history), you can clone this fork to your computer using the
command git clone or by using a Git GUI such as GitHub Desktop. From a collaboration
perspective, this is preferable to cloning the original repository directly, because any changes can be
pushed back online easily if you are working from your own fork. You cannot push to forks
that you have not created, unless someone has granted you access. If you want your work to
be incorporated into the original repository, you can use a pull request. Note: if you don’t need the
project’s entire commit history, you can simply download a zip file containing the latest
version of the repository from GitHub (at the top right of any GitHub repository).
A pull request (PR) is a mechanism on GitHub by which your code can be added to an existing
project. One of the most useful features of a PR from a collaboration perspective is that it
provides an opportunity for others to comment on your code, line by line, before it gets
merged. This is all done online on GitHub, as discussed in GitHub’s online help. Following
feedback, you may want to refactor code written by you or others.
Code Review
What is a code review?[4] Simply put, when we have finished working on a piece of code, a
colleague reviews our work and considers questions such as:
- Is the code correct and properly documented?
- Could the code be improved?
- Does the code conform to existing style guidelines?
- Are there any automated tests? If so, are they sufficient?
A good code review shares knowledge and best practices.
A lightweight code review can take a variety of forms. For example, it could be as simple as
emailing around some code for comments, or “over the shoulder,” where someone literally
looks over your shoulder while you code. More formal techniques include paired
programming where two developers work side by side on the same project.
Regardless of the review method being employed, there are a number of points to remember.
First, as with all forms of feedback, be constructive. Rather than pointing out flaws, give
suggested improvements. Closely related is giving praise when appropriate. Second, if you
are reviewing a piece of code, set a time frame or the number of lines of code you will review.
For example, you might spend one hour reviewing a piece of code, or review a maximum of
400 lines. Third, a code review should be performed before the code is merged into a larger
codebase; fix mistakes as soon as possible.
Many R users don’t work on a team or in a group; instead, they work by themselves.
Practically, there isn’t usually anyone nearby to review their code. However, there is still the
option of an unofficial code review. For example, if you have hosted code on an online
repository such as GitHub, users will naturally give feedback on your code (especially if you
make it clear that you welcome feedback). Another good place is Stack Overflow (covered in
detail in Chapter 10). This site allows you to post answers to other users’ questions. When you
post an answer, if your code is unclear, this will be flagged in comments below your answer.
References
Bååth, Rasmus. 2012. “The State of Naming Conventions in R.” The R Journal 4 (2): 74–75.
https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf.
McConnell, Steve. 2004. Code Complete. Pearson Education.
[1] One notable exception is packages in Bioconductor, where variable names are camelCase. In this case, you should match
the existing style.
[2] We recommend 10 Years of Git: An Interview with Git Creator Linus Torvalds from Linux.com for more information on
this topic.
[3] For a more detailed account of this process, see GitHub’s help pages.
[4] This section is being written with small teams in mind. Larger teams should consult a more detailed text on code review.
Chapter 10. Efficient Learning
As with any vibrant open source software community, R is fast moving. This can be
disorienting because it means that you can never finish learning R. On the other hand, it makes
R a fascinating subject because there is always more to learn. Even experienced R users keep
finding new functionality that helps solve problems more quickly and elegantly. Therefore,
learning how to learn is one of the most important skills to have if you want to learn R
in-depth. We emphasize depth of learning because it is more efficient to learn something
properly than to Google it repeatedly every time you forget how it works.
This chapter aims to equip you with concepts, guidance, and tips that will accelerate your
transition from an R hacker to an R programmer. This inevitably involves effective use of R’s
help, reading R source code, and use of online material.
Prerequisites
The only package used in this section is swirl:
library("swirl")
Top Five Tips for Efficient Learning
1. Use R’s internal help (e.g., with ?, ??, vignette(), and apropos()). Try swirl.
2. Read about the latest developments in established outlets such as the Journal of
Statistical Software, the R Journal, R lists, and the blogosphere.
3. If stuck, ask for help! A clear question posted in an appropriate place, using
reproducible code, should get a quick and enlightening answer.
4. For more in-depth learning, nothing can beat immersive R books and tutorials. Do
some research and decide which resources you should use.
5. One of the best ways to consolidate learning is to write it up and pass on the
knowledge; telling the story of what you’ve learned will also help others.
Using R’s Internal Help
Sometimes the best place to look for help is within R itself. Using R’s help has three main
advantages from an efficiency perspective:
- It’s faster to query R from inside your IDE than to switch context and search for help on
a different platform (e.g., the internet, which has countless distractions).
- It works offline.
- Learning to read R’s documentation (and source code) is a powerful skill in itself that
will improve your R programming.
The main disadvantage of R’s internal help is that it is terse and in some cases sparse. Do not
expect to always be able to find the answer in R, so be prepared to look elsewhere for more
detailed help and context. From a learning perspective, becoming acquainted with R’s
documentation is often better than finding the solution from a different source because it was
written by developers, largely for developers. Therefore, with R documentation you learn
about functions from the horse’s mouth. R help also sometimes sheds light on a function’s
history through references to academic papers.
As you look to learn about a topic or function in R, it is likely that you will have a search
strategy of your own, ranging from broad to narrow:
1. Searching R and installed packages for help on a specific topic.
2. Reading up on package vignettes.
3. Getting help on a specific function.
4. Looking into the source code.
In many cases, you may already have gone through stages 1 and 2. Often you can stop at stage
3 and simply use the function without worrying about exactly how it works. In every case, it is
useful to be aware of this hierarchical approach to learning from R’s internal help, so you can
start with the big picture (and avoid going down a misguided route early on) and then quickly
focus in on the functions that are most related to your task.
To illustrate this approach in action, imagine that you are interested in a specific topic:
optimization. The remainder of this section will work through stages 1 to 4 outlined
previously as if we wanted to find out more about this topic, with occasional diversions from
it to see how specific help functions work in more detail. The final method of learning from
R’s internal resources covered in this section is swirl, a package for interactive learning.
Searching R for Topics
A wide-boundary search for a topic in R will often begin with a search for instances of a
keyword in the documentation and function names. Using the example of optimization, you
could start with a search for a text string related to the topic of interest:
# help.search("optim") # or, more concisely
??optim
Note that the ?? symbol is simply a useful shorthand version of the function help.search(). It
is sometimes useful to use the full function rather than the shorthand version, because it
allows you to specify a number of options. To search for all help pages that mention the more
specific term “optimization” in the title or concept fields of the help pages, for example, the following
command would be used:
help.search(pattern = "optimisation|optimization",
            fields = c("title", "concept"))
This will return a shorter (and potentially more efficiently focused) list of help pages than the
wide-ranging ??optim call. To make the search even more specific, we can use the package
argument to constrain the search to a single package. This can be very useful when you know
that a function exists in a specific package but you cannot remember what it is called:
help.search(pattern = "optimisation|optimization",
            fields = c("title", "concept"), package = "stats")
Another function for searching R is apropos(). It prints to the console any R objects
(including hidden functions, those beginning with ., and datasets) whose name matches a
given text string. Because it does not search R’s documentation, it tends to return fewer results
than help.search(). Its use and typical outputs can be seen in the following examples:
apropos("optim")
#> [1] "constrOptim" "optim" "optimHess" "optimise" "optimize"
apropos("lm")[1:6] # show only first six results
#> [1] ".__C__anova.glm" ".__C__anova.glm.null" ".__C__diagonalMatrix"
#> [4] ".__C__generalMatrix" ".__C__glm" ".__C__glm.null"
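Because apropos() treats its first argument as a regular expression, the noisy results of a plain substring search can be narrowed down. The sketch below, whose patterns and variable names are our own choices, anchors the match to the start of the name and then restricts the results to functions via the mode argument.

```r
# Anchor the pattern so that only names starting with "optim" match
optim_names = apropos("^optim")
optim_names

# Restrict the search to objects that are functions
optim_functions = apropos("^optim", mode = "function")
optim_functions
```

The exact results depend on which packages are attached, since apropos() only searches objects on the search path.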
To search all R packages, including those you have not installed locally, for a specific topic,
there are a number of options. For obvious reasons, this requires internet access. The most
rudimentary way to see what packages are available from CRAN, if you are using RStudio, is
to use its autocompletion functionality for package names. To take an example, if you are
looking for a package for geospatial data analysis, you could do worse than enter the text
string geo as an argument into the package installation function (e.g., install.packages(geo))
and press the Tab key when the cursor is between the o and the ) in the example. The
resulting options are shown in Figure 10-1. Selecting one from the drop-down menu will
result in it being completed with surrounding quotation marks, as necessary.
Figure 10-1. Package name autocompletion in action in RStudio for packages beginning with geo
Finding and Using Vignettes
Some packages contain vignettes. These are pieces of long-form documentation that allow
package authors to go into detail explaining how the package works (Wickham 2015c). In
general, they are high quality. Because they can be used to illustrate real-world use cases,
vignettes can be the best way to understand functions in the context of broader explanations
and longer examples than are provided in function help pages. Although many packages lack
vignettes, they deserve a subsection of their own because they can boost the efficiency with
which package functions are used in an integrated workflow.
NOTE
If you are frustrated because a certain package lacks a vignette, you can create one. This can be a great way of
learning about and consolidating your knowledge of a package. To create a vignette, first download the source
code of a package and then use devtools::use_vignette() (in devtools 2.0 and later, this function lives in the
usethis package as usethis::use_vignette()). To add a vignette to the efficient package used in
this book, for example, you could clone the repo (e.g., using the command git clone
git@github.com:csgillespie/efficient). Once you have opened the repo as a project (e.g., in RStudio), you
could create a vignette called “efficient-learning” with the command use_vignette("efficient-learning").
To browse any vignettes associated with a particular package, we can use the handy function
browseVignettes():
browseVignettes(package = "benchmarkme")
This is roughly equivalent to vignette(package = "benchmarkme") but opens a new page in a
browser and lets you navigate all the vignettes in that particular package. For an overview of
all vignettes available from R packages installed on your computer, try browsing all available
vignettes with browseVignettes(). You may be surprised at how many hidden gems there are
in there!
How best to use vignettes depends on the vignette in question and your aims. In general, you
should expect to spend longer reading vignettes than other types of R documentation. The
Introduction to dplyr vignette (opened with vignette("introduction", package =
"dplyr")), for example, contains almost 4,000 words of prose, example code, and outputs
that illustrate how its functions work. We recommend working through the examples and
typing the example code in order to learn by doing.
Another way to learn from package vignettes is to view their source code. You can find where
vignette source code lives by looking in the vignettes/ folder of the package’s source code.
dplyr’s vignettes, for example, can be viewed (and edited) online. A quick way to view a
vignette’s R code is with the edit() function:
v = vignette("introduction", package = "dplyr")
edit(v)
Getting Help on Functions
All functions have help pages. These contain, at a minimum, a list of the input arguments and
the nature of the output that can be expected. Once a function has been identified (e.g., using
one of the methods outlined in “Searching R for Topics”), its help page can be displayed by
prefixing the function name with ?. Continuing with the previous example, the help page
associated with the command optim() (for general-purpose optimization) can be invoked as
follows:
# help("optim") # or, more concisely:
?optim
In general, help pages describe what functions do, not how they work. This is one of the
reasons that function help pages are thought (by some) to be difficult to understand. In
practice, this means that the help page does not describe the underlying mathematics or
algorithm in detail; its aim is to describe the interface.
A help page is divided into a number of sections. The help for optim() is typical in that it has
a title (General-purpose Optimization) followed by short Description, Usage, and Arguments
sections. The Description is usually just a sentence or two explaining what the function does. Usage
shows the arguments that the function needs to work. And Arguments describes what kind of
objects the function expects. Longer sections typically include Details and Examples, which
provide some context and (usually reproducible) examples of how the function can
be used, respectively. The typically short Value, References, and See Also sections facilitate
efficient learning by explaining what the output means, where you can find academic
literature on the subject, and which functions are related.
optim() is a mature and heavily used function so it has a long help page; you’ll probably be
glad to know that not all help pages are this long! With so much potentially overwhelming
information in a single help page, the placement of the short, dense sections at the beginning
is efficient because it helps you to understand the fundamentals of a function in few words.
Learning how to read and quickly interpret such help pages will greatly help your ability to
learn R. Take some time to study the help for optim() in detail.
It is worth discussing the contents of the Usage section in particular, because this contains
information that may not be immediately obvious:
optim(par, fn, gr = NULL, ...,
      method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"),
      lower = -Inf, upper = Inf, control = list(), hessian = FALSE)
This contains two pieces of critical information:
1. The essential arguments that must be provided for the function to work (par and fn
in this case, as gr has a default value) before the ... symbol; and
2. optional arguments that control how the function works (method, lower, and hessian
in this case). ... are optional arguments whose values depend on the other arguments
(which will be passed to the function represented by fn in this case).
Let’s see how this works in practice by trying to run optim() to find the minimum value of the
function y = x^4 - x^2:
fn = function(x) {
  x^4 - x^2
}
optim(par = 0, fn = fn)
#> Warning in optim(par = 0, fn = fn): one-dimensional optimization
#> by Nelder-Mead is unreliable: use "Brent" or optimize() directly
#> $par
#> [1] 0.707
#>
#> $value
#> [1] -0.25
#>
#> $counts
#> function gradient
#>       58       NA
#>
#> $convergence
#> [1] 0
#>
#> $message
#> NULL
The results show that the minimum value of fn(x) is found when x = 0.707... (1/√2), with a
minimum value of -0.25. It took 58 evaluations of the function for optim() to converge on
this value. Each of these output values is described in the Value section of the help page.
From the help pages, we could guess that providing the function call without specifying par
(i.e., optim(fn = fn)) would fail, which indeed it does.
The most helpful section is often the Examples. These lie at the bottom of the help page and
show precisely how the function works. You can either copy and paste the code, or actually
run the example code using the example command (it is well worth running these examples
due to the graphics produced):
example(optim)
NOTE
When a package is added to CRAN, the example part of the documentation is run on all major platforms. This
helps ensure that a package works on multiple systems.
Another useful section in the help file is See Also. In the optim() help page, it links to
optimize(), which may be more appropriate for this use case.
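Following the hint in the warning message and the See Also section, the same one-dimensional problem can be handed to optimize(), or to optim() with method = "Brent". This is a sketch only: the search interval c(0, 2) and the starting value are our own assumptions, chosen so that the positive minimum lies inside the interval.

```r
fn = function(x) {
  x^4 - x^2
}

# optimize() searches a bounded interval for a one-dimensional minimum
res_optimize = optimize(fn, interval = c(0, 2))
res_optimize$minimum    # location of the minimum, close to 1/sqrt(2)
res_optimize$objective  # value at the minimum, close to -0.25

# optim() with Brent's method requires finite lower and upper bounds
res_brent = optim(par = 0.5, fn = fn, method = "Brent", lower = 0, upper = 2)
res_brent$par
```

Neither call triggers the Nelder-Mead warning, because both methods are designed for one-dimensional problems.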
Reading R Source Code
R is open source. This means that we can view the underlying source code and examine any
function. Of course the code is complex, and diving straight into the source code won’t help
that much. However, watching the GitHub R source code mirror will allow you to monitor
small changes that occur. This gives a nice entry point into a complex codebase. Likewise,
examining the source of small functions such as NCOL is informative (e.g.,
getFunction("NCOL")).
TIP
Subscribing to the R NEWS blog is an easy way of keeping track of future changes.
Many R packages are developed in the open on GitHub or r-forge. Select a few well-known
packages and examine their sources. A good package to start with is drat. This is a relatively
simple package developed by Dirk Eddelbuettel (author of Rcpp) that only contains a few
functions. It gives you an excellent pointer into software development by one of the key R
package writers.
A shortcut for browsing R’s source code is provided by the RStudio IDE: clicking on a
function and then pressing the F2 key will open its source code in the file editor. This works
both for functions that exist in R and its packages and functions that you created in another R
script (so long as it is within your project directory). Although reading source code can be
interesting in itself, it is probably best done in the context of a specific question, such as “how
can I use a function name as an argument in my own function?” (looking at the source code of
apply() may help here).
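Reading the source of apply() shows that it begins by passing its FUN argument through match.fun(), which accepts either a function or the name of one. The sketch below borrows that idiom; apply_twice() is a hypothetical function written for illustration, not part of R.

```r
# Hypothetical function that accepts a function or its name, as apply() does
apply_twice = function(x, FUN, ...) {
  # match.fun() turns a character string such as "sqrt" into the function
  FUN = match.fun(FUN)
  FUN(FUN(x, ...), ...)
}

apply_twice(16, "sqrt")  # function given by name
apply_twice(16, sqrt)    # function given as an object
```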
swirl
swirl is an interactive teaching platform for R. It offers a number of extensions and, for the
pioneering, the ability for others to create custom extensions. The learning curve and method
will not work for everyone, but this package is worth flagging as a potent self-teaching
resource. In some ways, swirl can be seen as the ultimate internal R help as it allows dedicated
learning sessions, based on multiple choice questions, all within a usual R session. To enter
the swirl world, just enter the following. The resultant instructions will explain the rest:
library("swirl")
swirl()
OnlineResources
TheRcommunityhasastrongonlinepresence,providingmanyresourcesforlearning.Over
time,therehasfortunatelybeenatendencyforRresourcestobecomemoreuserfriendlyand
up-to-date.ManyresourcesthathavebeenonCRANformanyyearsaredatedbynowsoit’s
moreefficienttonavigatedirectlytothemostup-to-dateandefficient-to-useresources.
Cheatsheetsareshortdocumentssummarizinghowtodocertainthings.RStudio,for
example,providesexcellentcheatsheetsondplyr,rmarkdown,andtheRStudioIDEitself.
The R-project website contains six detailed official manuals, plus a giant PDF file containing
documentation for all recommended packages. These include An Introduction to R, The R
Language Definition, and R Installation and Administration, all of which are recommended
for people wanting to learn general R skills. If you are developing a package and want to
submit it to CRAN, the Writing R Extensions manual is recommended reading, although it has
to some extent been superseded by R Packages by Hadley Wickham (O’Reilly), the source
code of which is available online. While these manuals are long, they contain important
information written by experienced R programmers.
For more domain-specific and up-to-date information on developments in R, we recommend
checking out academic journals. The R Journal regularly publishes articles describing new R
packages, as well as general programming hints. Similarly, the articles in the Journal of
Statistical Software have a strong R bias. Publications in these journals are generally of very
high quality and have been rigorously peer reviewed. However, they may be rather technical
for R novices.
The wider community provides a much larger body of information, of more variable quality,
than the official R resources. The Contributed Documentation page on R’s home page
contains dozens of tutorials and other resources on a wide range of topics. Some of these are
excellent, although many are not kept up to date. An excellent resource for browsing R help
pages online is provided by rdocumentation.org.
Lower-grade but more frequently released information can be found in the blogosphere.
Central to this is R-bloggers, a blog aggregator of content contributed by bloggers who write
about R (in English). It is a great way to get exposed to new and different packages. Similarly,
monitoring the #rstats Twitter tag keeps you up to date with the latest news.
There are also mailing lists, Google groups, and the Stack Exchange Q&A sites. Before
requesting help, read a few other questions to learn the format of the site. Make sure you
search previous questions so you are not duplicating work. Perhaps the most important point
is to remember that people aren’t under any obligation to answer your question. One of the
fantastic things about the open source community is that you can ask questions and one of
the core developers may answer your question for free. But remember, everyone is busy!
Stack Overflow
The number one place on the internet for getting help on programming is Stack Overflow.
This website provides a platform for asking and answering questions. Through site
membership, questions and answers are voted up or down. Users of Stack Overflow earn
reputation points when their question or answer is up-voted. Anyone (with enough reputation)
can edit a question or answer. This helps the content remain relevant.
Questions are tagged. The R questions can be found under the R tag. The R page contains
links to official documentation, free resources, and various other links. Members of the Stack
Overflow R community have tagged, using r-faq, a few questions that often crop up.
Mailing Lists and Groups
There are many mailing lists and Google groups focused on R and particular packages. The
main list for getting help is R-help. This is a high-volume mailing list, with around a dozen
messages per day. A more technical mailing list is R-devel. This list is intended for questions
and discussion about code development in R. The discussion on this list is very technical. It’s a
good place to be introduced to new ideas, but it’s not the place to ask about these ideas! There
are many other special-interest mailing lists covering topics ranging from high-performance
computing to ecology. Many popular packages also have their own mailing list or Google
group (e.g., ggplot2 and shiny). The key piece of advice is, before mailing a list, to read the
relevant mailing archive and check that your message is appropriate.
Asking a Question
A great way to get specific help on a difficult topic is to ask for help. However, asking a good
question is not easy. Three common mistakes, and ways to avoid them, are outlined here:
1. Asking a question that has already been asked: make sure that you’ve properly
searched for the answer before posting.
2. Asking a question whose answer can be found in R’s help: make sure that you’ve properly
read the relevant help pages before asking.
3. Asking a question that does not contain a reproducible example: create a simple version of
your data, show the code you’ve tried, and display the result you are hoping for.
Your question should contain just enough information that your problem is clear and can be
reproduced, while avoiding unnecessary details. Fortunately, there is a Stack
Overflow question, “How to make a great R reproducible example?”, that provides
excellent guidance. Additional guides that explain how to create good programming questions
are provided by Stack Overflow and the R mailing list posting guide.
Minimal Dataset
What is the smallest dataset you can construct that will reproduce your issue? Your actual
dataset may contain 10^5 rows and 10^4 columns, but to get your idea across you might only
need four rows and three columns. Making small example datasets is easy. For example, to
create a data frame with two numeric columns and a column of characters, use the following:
set.seed(1)
example_df = data.frame(x = rnorm(4), y = rnorm(4), z = sample(LETTERS, 4))
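Note that the call to set.seed() ensures that anyone who runs the code will get the same random
number stream (given the same R version and random number generator settings). Alternatively,
you can use one of the many datasets that come with R; list them with
library(help = "datasets").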
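To make the two options above concrete, here is a small sketch. It first checks that resetting the seed reproduces the same draws, then builds a four-row example from the built-in mtcars dataset. The object names (first_draw, second_draw, small_cars) and the choice of mtcars columns are arbitrary, picked only for illustration.

```r
# Resetting the seed before each call reproduces the same random numbers
set.seed(1)
first_draw <- rnorm(4)
set.seed(1)
second_draw <- rnorm(4)
same_stream <- identical(first_draw, second_draw)
print(same_stream)

# Alternatively, carve a small example out of a dataset that ships with R
small_cars <- head(mtcars[, c("mpg", "cyl", "wt")], 4)
print(small_cars)
```

Because mtcars is available in every standard R installation, anyone answering the question can recreate small_cars without needing your files.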
If creating an example dataset isn’t possible, then use dput() on your actual dataset. This will
create an ASCII text representation of the object that will enable anyone to recreate the object:
dput(example_df)
#> structure(list(
#>   x = c(-0.626453810742332, 0.183643324222082, -0.835628612410047,
#>   1.59528080213779),
#>   y = c(0.329507771815361, -0.820468384118015, 0.487429052428485,
#>   0.738324705129217),
#>   z = structure(c(3L, 4L, 1L, 2L), .Label = c("J", "R", "S", "Y"),
#>   class = "factor")),
#>   .Names = c("x", "y", "z"), row.names = c(NA, -4L), class = "data.frame")
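The round trip can be checked before posting. The sketch below captures the dput() text, rebuilds the object from that text alone, and compares the two. The names dput_text and rebuilt_df are illustrative. The comparison uses all.equal() rather than identical() on the assumption that dput()'s default settings may round doubles slightly when writing them out.

```r
set.seed(1)
example_df <- data.frame(x = rnorm(4), y = rnorm(4), z = sample(LETTERS, 4))

# Capture what dput() prints: this is the text you would paste into a question
dput_text <- paste(capture.output(dput(example_df)), collapse = "\n")

# Someone reading the question can rebuild the object from the text alone
rebuilt_df <- eval(parse(text = dput_text))

# all.equal() tolerates tiny rounding differences in the printed doubles
round_trip_ok <- isTRUE(all.equal(example_df, rebuilt_df))
print(round_trip_ok)
```

The exact dput() output depends on the R version (for example, whether z is stored as a factor or a character vector), but the rebuilt object should match the original either way.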
Minimal Example
What you should not do is simply copy and paste your entire function into your question. It’s
unlikely that your entire function doesn’t work, so just simplify it to the bare minimum. The
aim is to target your actual issue. Avoid copying and pasting large blocks of code; remove
superfluous lines that are not part of the problem. Before asking your question, can you run
your code in a clean R environment and reproduce your error?
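One way to approximate a clean R environment is to run the example in a fresh process started with the --vanilla flag, which skips saved workspaces and startup files. The sketch below is a minimal illustration that assumes the Rscript executable sits in R.home("bin"). The script contents and the variable names are made up for the example.

```r
# Write the minimal example to a temporary script file
script <- tempfile(fileext = ".R")
writeLines(c(
  "set.seed(1)",
  "example_df <- data.frame(x = rnorm(4), y = rnorm(4))",
  "print(colMeans(example_df))"
), script)

# Assumption: Rscript lives alongside the running R installation
rscript <- file.path(R.home("bin"), "Rscript")

# --vanilla ignores .RData, .Rprofile and .Renviron, so nothing from your
# current session can leak into the run
status <- system2(rscript, c("--vanilla", shQuote(script)))
print(status)  # an exit status of 0 means the script ran without error
```

If the example only fails in your normal session and not in the fresh process, the cause is likely something in your workspace or startup files rather than in the code you are about to post.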
Learning In Depth
In the age of the internet and social media, many people feel lucky if they have time to go for
a walk, let alone sit down to read a book. But it is undeniable that learning R in depth is a
time-consuming activity. Reading a book or a large tutorial (and completing the practical
examples contained within) may not be the most efficient way to solve a particular problem in
the short term, but it can be one of the best ways to learn R programming properly, especially
in the long run.
In-depth learning differs from shallow, incremental learning because rather than discovering
how a specific function works, you find out how systems of functions work together. To take
a metaphor from civil engineering, in-depth learning is about building strong foundations on
which a wide range of buildings can be constructed. In-depth learning can be highly efficient
in the long run because it will pay back over many years, regardless of the domain-specific
problem you want to use R to tackle. Shallow learning, to continue the metaphor, is more like
erecting many temporary structures: they can solve a specific problem in the short term, but
they will not be durable. Flimsy dwellings can be swept away. Shallow memories can be
forgotten.
Having established that time spent learning in depth can, counterintuitively, be efficient, it is
worth thinking about how to learn in depth. This varies from person to person. It does not
involve passively absorbing sacred information transmitted year after year by the R gods. It is
an active, participatory process. To ensure that memories are rapidly actionable, you must
learn by doing. Learning from a cohesive, systematic, and relatively comprehensive resource
will help you to see the many interconnections between the different elements of R
programming and how they can be combined for efficient work.
There are a number of such resources, including this book. Although the understandable
tendency will be to use it incrementally, dipping in and out of different sections when different
problems arise, we also recommend reading it systematically to see how the different
elements of efficiency fit together. It is likely that as you work progressively through this
book, in parallel with solving real-world problems, you will realize that the solution is not to
have the right resource at hand but to be able to use the tools provided by R efficiently. Once
you hit this level of proficiency, you should have the confidence to address most problems
encountered from first principles. Over time, your first port of call should move away from
Google and even R’s internal help to simply giving it a try. Informed trial and error, and
intelligent experimentation, can be the best approach to both learning and solving problems
quickly, once you are equipped with the tools to do so. That’s why this is the last section in the
book.
If you have already worked through all the examples in this book, or if you want to learn
areas not covered in it, there are many excellent resources for extending and deepening your
knowledge of R programming for fast and effective work, and to do new things with it.
Because R is a large and ever-evolving language, there is no definitive list of resources for
taking your R skills to new heights. However, the following list, in rough ascending order of
difficulty and depth, should provide plenty of material and motivation for in-depth learning of
R.
1. Free webinars and online courses provided by RStudio and DataCamp. Both
organizations are well regarded and keep their content up to date, but there are likely
other good sources of online courses. We recommend that you push your
abilities, rather than going over the same material covered in this book.
2. R for Data Science (Grolemund and Wickham 2016), a book introducing many
concepts and tidy packages for working with data (a free online version is available
from r4ds.had.co.nz).
3. R Programming for Data Science (Peng 2014), which provides in-depth coverage of
analysis and visualization of datasets.
4. Advanced R (Wickham 2014a), an advanced book that looks at the
internals of how R works (free from adv-r.had.co.nz).
Spread the Knowledge
The final thing to say on the topic of efficient learning relates to the old (~2000 years old!)
saying docendo discimus:
by teaching we learn
This means that passing on information is one of the best ways to consolidate your learning. It
was largely by helping others learn R that we became proficient R users.
Demand for R skills is growing, so there are many opportunities to teach R. Whether it’s
helping your colleague use apply() or writing a blog post on solving certain problems in R,
teaching others R can be a rewarding experience. Furthermore, spreading the knowledge can
be efficient: it will improve your own understanding of the language and benefit the entire
community, providing positive feedback to the movement toward open source software in
data-driven computing.
Assuming you have completed this book, the only remaining thing to say is “Well done! You
are now an efficient R programmer.” We hope you direct your newly found skills toward the
greater good and pass on the wisdom to others along the way.
Appendix A. Package Dependencies
The book uses datasets stored in the efficient GitHub package, which can be installed (after
devtools has been installed) as follows:
devtools::install_github("csgillespie/efficient",
  args = "--with-keep.source")
The book depends on the following CRAN packages:
Name | Title | Version
assertive.reflection | Assertions for Checking the State of R (Cotton 2016a) | 0.0.3
benchmarkme | Crowd Sourced System Benchmarks (Gillespie 2016) | 0.3.0
bookdown | Authoring Books with R Markdown (Xie 2016a) | 0.1
cranlogs | Download Logs from the RStudio CRAN Mirror (Csardi 2015) | 2.1.0
data.table | Extension of Data.frame (Dowle et al. 2015) | 1.9.6
devtools | Tools to Make Developing R Packages Easier (H. Wickham and Chang 2016a) | 1.12.0
DiagrammeR | Create Graph Diagrams and Flowcharts Using R (Sveidqvist et al. 2016) | 0.8.4
dplyr | A Grammar of Data Manipulation (Wickham and Francois 2016) | 0.5.0
drat | Drat R Archive Template (Carl Boettiger et al. 2016) | 0.1.1
efficient | Becoming an Efficient R Programmer (Gillespie and Lovelace 2016) | 0.1.1
feather | R Bindings to the Feather API (H. Wickham 2016a) | 0.3.0
formatR | Format R Code Automatically (Xie 2016b) | 1.4
fortunes | R Fortunes (Zeileis and R community 2016) | 1.5.3
geosphere | Spherical Trigonometry (Hijmans 2016) | 1.5.5
ggmap | Spatial Visualization with ggplot2 (Kahle and Wickham 2016) | 2.6.1
ggplot2 | An Implementation of the Grammar of Graphics (H. Wickham and Chang 2016b) | 2.1.0
ggplot2movies | Movies Data (H. Wickham 2015a) | 0.0.1
knitr | A General-Purpose Package for Dynamic Report Generation in R (Xie 2016c) | 1.14
lubridate | Make Dealing with Dates a Little Easier (Grolemund, Spinu, and Wickham 2016) | 1.5.6
microbenchmark | Accurate Timing Functions (Mersmann 2015) | 1.4.2.1
profvis | Interactive Visualizations for Profiling R Code (Chang and Luraschi 2016) | 0.3.2
pryr | Tools for Computing on the Language (H. Wickham 2015b) | 0.1.2
Rcpp | Seamless R and C++ Integration (Eddelbuettel et al. 2016) | 0.12.7
readr | Read Tabular Data (Wickham, Hester, and Francois 2016) | 1.0.0
rio | A Swiss-Army Knife for Data I/O (Chan and Leeper 2016) | 0.4.12
RSQLite | SQLite Interface for R (Wickham, James, and Falcon 2014) | 1.0.0
tibble | Simple Data Frames (Wickham, Francois, and Müller 2016) | 1.2
tidyr | Easily Tidy Data with spread() and gather() Functions (H. Wickham 2016b) | 0.6.0
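The versions listed above can be compared against what is installed locally. The following is a rough sketch, not part of the book's tooling: check_versions() is a hypothetical helper, and the three packages in `required` are just a sample taken from the table.

```r
# A few of the versions from the table above (extend as needed)
required <- c(dplyr = "0.5.0", data.table = "1.9.6", Rcpp = "0.12.7")

# Hypothetical helper: report which packages are installed at or above
# the required version
check_versions <- function(required) {
  installed <- vapply(names(required), function(pkg) {
    # system.file() returns "" when the package is not installed
    if (nzchar(system.file(package = pkg))) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
  ok <- mapply(function(have, want) {
    !is.na(have) && utils::compareVersion(have, want) >= 0
  }, installed, required)
  data.frame(package = names(required), required = unname(required),
             installed = unname(installed), ok = unname(ok),
             stringsAsFactors = FALSE)
}

print(check_versions(required))
```

Rows where `ok` is FALSE point to packages that are missing or older than the version the book was built with.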
Appendix B. References
Bååth, Rasmus. 2012. “The State of Naming Conventions in R.” The R Journal 4 (2): 74–75.
https://journal.r-project.org/archive/2012-2/RJournal_2012-2_Baaaath.pdf.
Berkun, Scott. 2005. The Art of Project Management. O’Reilly Media.
Braun, John, and Duncan J Murdoch. 2007. A First Course in Statistical Programming with R.
Vol. 25. Cambridge University Press Cambridge.
Burns, Patrick. 2011. The R Inferno. Lulu.com.
Carl Boettiger, Dirk Eddelbuettel with contributions by, Sebastian Gibb, Colin Gillespie, Jan
Górecki, Matt Jones, Thomas Leeper, Steven Pav, and Jan Schulz. 2016. Drat: Drat R Archive
Template. https://CRAN.R-project.org/package=drat.
Chan, Chung-hong, and Thomas J. Leeper. 2016. Rio: A Swiss-Army Knife for Data I/O.
https://CRAN.R-project.org/package=rio.
Chang, Winston. 2012. R Graphics Cookbook. O’Reilly Media.
Chang, Winston, and Javier Luraschi. 2016. Profvis: Interactive Visualizations for Profiling R
Code. https://CRAN.R-project.org/package=profvis.
Codd, E. F. 1979. “Extending the database relational model to capture more meaning.” ACM
Transactions on Database Systems 4 (4): 397–434. doi:10.1145/320107.320109.
Cotton, Richard. 2013. Learning R. O’Reilly Media.
Cotton, Richard. 2016a. Assertive.reflection: Assertions for Checking the State of R.
https://CRAN.R-project.org/package=assertive.reflection.
Cotton, Richard. 2016b. Testing R Code.
Csardi, Gabor. 2015. Cranlogs: Download Logs from the ‘RStudio’ ‘CRAN’ Mirror.
https://CRAN.R-project.org/package=cranlogs.
Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E
Antonyan. 2015. Data.table: Extension of Data.frame.
https://CRAN.R-project.org/package=data.table.
Eddelbuettel, Dirk. 2013. Seamless R and C++ Integration with Rcpp. Springer.
Eddelbuettel, Dirk, and Romain François. 2011. “Rcpp: Seamless R and C++ Integration.”
Journal of Statistical Software 40 (8): 1–18.
Eddelbuettel, Dirk, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Douglas Bates, and
John Chambers. 2016. Rcpp: Seamless R and C++ Integration.
https://CRAN.R-project.org/package=Rcpp.
Eddelbuettel, Dirk, Romain François, J. Allaire, John Chambers, Douglas Bates, and Kevin
Ushey. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8):
1–18.
Eddelbuettel, Dirk, Murray Stokely, and Jeroen Ooms. 2016. “RProtoBuf: Efficient Cross-
Language Data Serialization in R.” Journal of Statistical Software 71 (2): 1–24.
doi:10.18637/jss.v071.i02.
Gillespie, Colin. 2016. Benchmarkme: Crowd Sourced System Benchmarks.
https://CRAN.R-project.org/package=benchmarkme.
Gillespie, Colin, and Robin Lovelace. 2016. Efficient: Becoming an Efficient R Programmer.
Goldberg, David. 1991. “What Every Computer Scientist Should Know About Floating-Point
Arithmetic.” ACM Computing Surveys (CSUR) 23 (1). ACM: 5–48.
Grant, Christine A, Louise M Wallace, and Peter C Spurgeon. 2013. “An Exploration of the
Psychological Factors Affecting Remote E-Worker’s Job Effectiveness, Well-Being and
Work-Life Balance.” Employee Relations 35 (5). Emerald Group Publishing Limited: 527–46.
Grolemund, G., and H. Wickham. 2016. R for Data Science. O’Reilly Media.
Grolemund, Garrett, Vitalie Spinu, and Hadley Wickham. 2016. Lubridate: Make Dealing with
Dates a Little Easier. https://CRAN.R-project.org/package=lubridate.
Hijmans, Robert J. 2016. Geosphere: Spherical Trigonometry.
https://CRAN.R-project.org/package=geosphere.
Janert, Philipp K. 2010. Data Analysis with Open Source Tools. O’Reilly Media.
Jensen, Jørgen Dejgård. 2011. “Can Worksite Nutritional Interventions Improve Productivity
and Firm Profitability? A Literature Review.” Perspectives in Public Health 131 (4). SAGE
Publications: 184–92.
Kahle, David, and Hadley Wickham. 2016. Ggmap: Spatial Visualization with Ggplot2.
https://CRAN.R-project.org/package=ggmap.
Kersten, Martin L, Stratos Idreos, Stefan Manegold, Erietta Liarou, and others. 2011. “The
Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few
Seconds.” PVLDB Challenges and Visions 3.
Kruchten, Philippe, Robert L Nord, and Ipek Ozkaya. 2012. “Technical Debt: From Metaphor
to Theory and Practice.” IEEE Software, no. 6. IEEE: 18–21.
Lovelace, Ada Countess. 1842. “Translator’s notes to an article on Babbage’s Analytical
Engine.” Scientific Memoirs 3. 691–731.
Lovelace, Robin, and Morgane Dumont. 2016. Spatial Microsimulation with R. CRC Press.
http://bit.ly/spatialmicrosimR.
McCallum, Ethan, and Stephen Weston. 2011. Parallel R. O’Reilly Media.
McConnell, Steve. 2004. Code Complete. Pearson Education.
Mersmann, Olaf. 2015. Microbenchmark: Accurate Timing Functions.
https://CRAN.R-project.org/package=microbenchmark.
Peng, Roger. 2014. R Programming for Data Science. Leanpub.
https://leanpub.com/rprogramming.
Pereira, Michelle Jessica, Brooke Kaye Coombes, Tracy Anne Comans, and Venerina
Johnston. 2015. “The Impact of Onsite Workplace Health-Enhancing Physical Activity
Interventions on Worker Productivity: A Systematic Review.” Occupational and
Environmental Medicine 72 (6). BMJ Publishing Group Ltd: 401–12.
PMBoK, A. 2000. “Guide to the Project Management Body of Knowledge.” Project
Management Institute, Pennsylvania USA.
R Core Team. 2016. “R Installation and Administration.” R Foundation for Statistical
Computing. https://cran.r-project.org/doc/manuals/r-release/R-admin.html.
Sanchez, Gaston. 2013. “Handling and Processing Strings in R.” Trowchez Editions.
http://bit.ly/handlingstringsR.
Sekhon, Jasjeet S. 2006. “The Art of Benchmarking: Evaluating the Performance of R on
Linux and OS X.” The Political Methodologist 14 (1): 15–19.
Spector, Phil. 2008. Data Manipulation with R. Springer Science & Business Media.
Sveidqvist, Knut, Mike Bostock, Chris Pettitt, Mike Daines, Andrei Kashcha, and Richard
Iannone. 2016. DiagrammeR: Create Graph Diagrams and Flowcharts Using R.
https://CRAN.R-project.org/package=DiagrammeR.
Visser, Marco D., Sean M. McMahon, Cory Merow, Philip M. Dixon, Sydne Record, and Eelke
Jongejans. 2015. “Speeding Up Ecological and Evolutionary Computations in R; Essentials of
High Performance Computing for Biologists.” Edited by Francis Ouellette. PLOS
Computational Biology 11 (3): e1004140. doi:10.1371/journal.pcbi.1004140.
Wickham, Hadley. 2010. “Stringr: Modern, Consistent String Processing.” The R Journal 2
(2): 38–40.
Wickham, Hadley. 2014a. Advanced R. CRC Press.
Wickham, Hadley. 2014b. “Tidy Data.” The Journal of Statistical Software 59 (10).
Wickham, Hadley. 2015a. Ggplot2movies: Movies Data.
https://CRAN.R-project.org/package=ggplot2movies.
Wickham, Hadley. 2015b. Pryr: Tools for Computing on the Language.
https://CRAN.R-project.org/package=pryr.
Wickham, Hadley. 2015c. R Packages. O’Reilly Media.
Wickham, Hadley. 2016a. Feather: R Bindings to the Feather ‘API’.
https://CRAN.R-project.org/package=feather.
Wickham, Hadley. 2016b. Tidyr: Easily Tidy Data with spread() and gather() Functions.
https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley, and Winston Chang. 2016a. Devtools: Tools to Make Developing R
Packages Easier. https://CRAN.R-project.org/package=devtools.
Wickham, Hadley, and Winston Chang. 2016b. Ggplot2: An Implementation of the Grammar of
Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, and Romain Francois. 2016. Dplyr: A Grammar of Data Manipulation.
https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, Romain Francois, and Kirill Müller. 2016. Tibble: Simple Data Frames.
https://CRAN.R-project.org/package=tibble.
Wickham, Hadley, Jim Hester, and Romain Francois. 2016. Readr: Read Tabular Data.
https://CRAN.R-project.org/package=readr.
Wickham, Hadley, David A. James, and Seth Falcon. 2014. RSQLite: SQLite Interface for R.
https://CRAN.R-project.org/package=RSQLite.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. Vol. 29. CRC Press.
Xie, Yihui. 2016a. Bookdown: Authoring Books with R Markdown.
https://CRAN.R-project.org/package=bookdown.
Xie, Yihui. 2016b. FormatR: Format R Code Automatically.
https://CRAN.R-project.org/package=formatR.
Xie, Yihui. 2016c. Knitr: A General-Purpose Package for Dynamic Report Generation in R.
https://CRAN.R-project.org/package=knitr.
Zeileis, Achim, and the R community. 2016. Fortunes: R Fortunes.
https://CRAN.R-project.org/package=fortunes.
Index
Symbols
&, && (AND operator), Logical AND and OR
.csv files, Plain-Text Formats
.Rdata, Native Binary Formats: Rdata or Rds?
.Rds, Native Binary Formats: Rdata or Rds?
.Renviron file, An Overview of R’s Startup Files, The .Renviron File–Example .Renviron file
  location of, The Location of Startup Files
  storing passwords in, Working with Databases
.Rprofile, An Overview of R’s Startup Files, The .Rprofile File–Creating hidden environments with .Rprofile, The fortunes package
  hidden environments with, Creating hidden environments with .Rprofile
  location of, The Location of Startup Files
  setting CRAN mirror, Setting the CRAN mirror
  setting options, Setting options
  useful functions, Useful functions
? prefix, Getting Help on Functions
?? symbol, Searching R for Topics
|, || (OR operator), Logical AND and OR
A
aggregation (see data aggregation)
algorithmic efficiency, What Is Efficiency?
AND (&, &&) operator, Logical AND and OR
anyNA() function, is.na() and anyNA()
apply function family, The Apply Family–Other resources
  movies dataset example, Example: Movies Dataset
  parallel versions of, Parallel Versions of Apply Functions
  resources for, Other resources
  type consistency and, Type Consistency
apply() function, The Apply Family–Other resources, Row and Column Operations
apropos(), Searching R for Topics
argument passing, assignment vs., Assignment
ASCII character set, Background: What Is a Byte?
assertive.reflection package, Operating System
autocompletion, Autocompletion–Autocompletion
B
base R
  converting factors to numerics, Converting Factors to Numerics
  determining which indices are TRUE, Which Indices are TRUE?
  if() vs. ifelse() functions, The if() Versus ifelse() Functions
  integer data type, The integer data type
  is.na() and anyNA(), is.na() and anyNA()
  logical AND and OR, Logical AND and OR
  matrices, Matrices–Sparse matrices
  pattern matching with, Regular Expressions
  reversing elements, Reversing Elements
  row and column operations, Row and Column Operations
  sorting and ordering, Sorting and Ordering
Basic Linear Algebra System (see BLAS)
benchmarking
  binary file formats, Benchmarking Binary File Formats–Protocol Buffers
  BLAS resources, Useful BLAS/Benchmarking Resources
  for efficient programming, Benchmarking and Profiling–Benchmarking Example
benchmarkme package, Random Access Memory, Central Processing Unit
binary file formats
  benchmarking, Benchmarking Binary File Formats–Protocol Buffers
  feather, The Feather File Format
  for IO, Binary File Formats–Protocol Buffers
  Protocol Buffers for, Protocol Buffers
  Rds vs. Rdata, Native Binary Formats: Rdata or Rds?
BLAS framework, BLAS and Alternative R Interpreters–Useful BLAS/Benchmarking Resources
  benchmarking resources, Useful BLAS/Benchmarking Resources
  testing performance gains from, Testing Performance Gains from BLAS
braces, curly ({}), Curly Braces
branches, Branches, Forks, Pulls, and Clones
broom package, Other tidyr Functions
byte, Background: What Is a Byte?
C
C++
  cppFunction() command, The cppFunction() Command
  data types, C++ Data Types
  R functions vs., A Simple C++ Function
  Rcpp and, A Simple C++ Function
  Rcpp sugar and, C++ with Sugar on Top
  sourceCpp() function, The sourceCpp() Function
caching
  function closures, Function Closures
  variables, Caching Variables–Exercises
cat() function, Informative Output: message() and cat()
categorical variables, Factors
central processing unit (CPU), Central Processing Unit
chaining, Chaining Operations
cheat sheets, Online Resources
chunking, Chunking Your Work
class, of columns, Changing Column Classes
clones, Branches, Forks, Pulls, and Clones
closures, Function Closures
cloud computing, Cloud Computing
code profiling, Code Profiling–Example: Monopoly Simulation
  efficiency and, Benchmarking and Profiling, Profiling–Exercises
  profvis, Getting Started with profvis–Example: Monopoly Simulation
code review, Code Review
coding style
  assigning objects to values, Assignment
  commenting, Commenting
  curly braces, Curly Braces
  filenames, Filenames
  for efficient collaboration, Coding Style–Exercise
  importance of consistency, Consistent Style and Code Conventions
  indentation, Indentation
  loading packages, Loading Packages
  lubridate package example, Example Package
  object names, Object Names
  reformatting code with RStudio, Reformatting Code with RStudio
  spacing, Spacing
collaboration, Efficient Collaboration–Code Review
  code review, Code Review
  tips for, Top Five Tips for Efficient Collaboration
  version control, Version Control–Branches, Forks, Pulls, and Clones
columns
  apply() function and, Row and Column Operations
  changing class, Changing Column Classes
  renaming, Renaming Columns
comments/commenting, Commenting
commits, Commits
compiler package, The Byte Compiler–Compiling Code, Example: The Mean Function
compiling code, Compiling Code
Comprehensive R Archive Network (see CRAN entries)
CPU (central processing unit), Central Processing Unit
CRAN (Comprehensive R Archive Network), Who This Book Is for and How to Use It
CRAN mirror, Setting the CRAN mirror
csv files, Plain-Text Formats
curly braces, Curly Braces
D
data aggregation, Data Aggregation–Data Aggregation
data carpentry, Efficient Data Carpentry–Data Processing with data.table
  combining datasets, Combining Datasets–Combining Datasets
  data frames with tibble, Efficient Data Frames with tibble
  data processing with data.table, Data Processing with data.table–Data Processing with data.table
  databases and, Working with Databases–Databases and dplyr
  dplyr for data processing, Efficient Data Processing with dplyr–Chaining Operations
  tidyr for, Tidying Data with tidyr and Regular Expressions–Regular Expressions
  tips for, Top Five Tips for Efficient Data Carpentry
data frames, Efficient Data Frames with tibble
data input/output (see input/output (IO))
data processing, Efficient Data Processing with dplyr
  (see also data carpentry)
  data aggregation, Data Aggregation–Data Aggregation
  data.table for, Data Processing with data.table–Data Processing with data.table
  dplyr, Efficient Data Processing with dplyr–Chaining Operations
  nonstandard evaluation, Nonstandard Evaluation
data tidying, Tidying Data with tidyr and Regular Expressions–Other tidyr Functions
  gather(), Make Wide Tables Long with gather()
  regular expressions and, Regular Expressions
  splitting joint variables with separate(), Split Joint Variables with separate()
  tidyr for, Tidying Data with tidyr and Regular Expressions–Regular Expressions
data.table package, Data Processing with data.table–Data Processing with data.table
databases
  data carpentry and, Working with Databases–Databases and dplyr
  dplyr and, Databases and dplyr
datasets
  combining, Combining Datasets–Combining Datasets
  for illustrating questions, Minimal Dataset
DBI, Working with Databases
deep learning, Learning In Depth
dependencies, R packages with, Installing R Packages with Dependencies
documentation, R Markdown for, Dynamic Documents with R Markdown
double-precision floating-point format, The integer data type
dplyr
  chaining operations with, Chaining Operations
  changing column classes, Changing Column Classes
  data aggregation, Data Aggregation–Data Aggregation
  data processing with, Efficient Data Processing with dplyr–Chaining Operations
  database access via, Databases and dplyr
  filtering rows, Filtering Rows
  nonstandard evaluation, Nonstandard Evaluation
  renaming columns, Renaming Columns
  verb functions, Efficient Data Processing with dplyr
drat package, Reading R Source Code
dual in-line memory modules (DIMMs), Random Access Memory
dynamic documentation, Dynamic Documents with R Markdown
E
EC2 (Elastic Compute Cloud), Amazon EC2
efficiency
  about, What Is Efficient R Programming?–What Is Efficient R Programming?
  benchmarking, Benchmarking and Profiling–Benchmarking Example
  consistent code style/conventions, Consistent Style and Code Conventions
  cross-transferable skills for, Cross-Transferable Skills for Efficiency–Consistent Style and Code Conventions
  defined, What Is Efficiency?
  importance of, Why Efficiency?
  in R programming, What Is Efficient R Programming?–What Is Efficient R Programming?
  of programmer, Touch Typing
  profiling, Benchmarking and Profiling, Profiling–Exercises
  touch typing, Touch Typing
  ways in which R encourages/guides, What Is Efficient R Programming?
efficient package, Example: Monopoly Simulation, Finding and Using Vignettes
Elastic Compute Cloud (EC2), Amazon EC2
F
factors, Factors
  converting to numerics, Converting Factors to Numerics
  for fixed set of categories, Fixed Set of Categories
  inherent order, Inherent Order
fatal errors, Fatal Errors: stop()
feather (file format), The Feather File Format
file paths, The Location of Startup Files
file.path() function, The Location of Startup Files
filenames, consistent style for, Filenames
filter() function, Filtering Rows
forks, Branches, Forks, Pulls, and Clones
fread() function
  read_csv() vs., Differences Between fread() and read_csv()–Differences Between fread() and read_csv()
  speed of, Plain-Text Formats
function calls
  assignment operator vs. argument passing operator, Assignment
  library, Loading Packages
  minimizing, General Advice
function closures, Function Closures
functions, help pages for, Getting Help on Functions–Getting Help on Functions
fuzzy matching, Combining Datasets
G
gather() function, Make Wide Tables Long with gather()
Gentleman, Robert, What Is Efficient R Programming?
Git
  about, Version Control
  branches, Branches, Forks, Pulls, and Clones
  clones, Branches, Forks, Pulls, and Clones
  forks, Branches, Forks, Pulls, and Clones
  pull requests, Branches, Forks, Pulls, and Clones
  RStudio and, Git Integration in RStudio
GitHub, GitHub
graphics, factors for ordering in, Inherent Order
H
hard disc drive (HDD), Hard Drives: HDD Versus SSD
hardware, Efficient Hardware–Amazon EC2
  bits and bytes, Background: What Is a Byte?
  cloud computing, Cloud Computing
  CPU, Central Processing Unit
  hard drives, Hard Drives: HDD Versus SSD
  operating systems, Operating Systems: 32-Bit or 64-Bit
  RAM, Random Access Memory–Random Access Memory
  tips for, Top Five Tips for Efficient Hardware
HDD (hard disc drive), Hard Drives: HDD Versus SSD
help, R
  functions, Getting Help on Functions–Getting Help on Functions
  R’s internal help, Using R’s Internal Help–swirl
  searching for topics in, Searching R for Topics
  source code, Reading R Source Code
  swirl, swirl
  vignettes, Finding and Using Vignettes
help.start() function, Who This Book Is for and How to Use It
helper functions, Useful functions
hidden environments, Creating hidden environments with .Rprofile
I
IDE (integrated development environment) (see RStudio)
if() function, ifelse() function vs., The if() Versus ifelse() Functions
Ihaka, Ross, What Is Efficient R Programming?, Random Access Memory
indentation, Indentation
indices, determining which are TRUE, Which Indices are TRUE?
input/output (IO), Efficient Input/Output–Accessing Data Stored in Packages
  accessing data stored in packages, Accessing Data Stored in Packages
  binary file formats, Binary File Formats–Protocol Buffers
  data from internet, Getting Data from the Internet
  plain-text formats, Plain-Text Formats–Preprocessing Text Outside R
  rio, Versatile Data Import with rio–Versatile Data Import with rio
  tips for, Top Five Tips for Efficient Data I/O
installation
  R, Installing R
  R packages, R Package, Installing R Packages
  R packages with dependencies, Installing R Packages with Dependencies
integer data type, The integer data type
integrated development environment (IDE) (see RStudio)
internal help, R, Using R’s Internal Help–swirl
International System of Units (SI) prefixes, Background: What Is a Byte?
internet, data from, Getting Data from the Internet
interpreters, Other Interpreters
invisible() function, Invisible Returns
IO (see input/output)
is.na() function, is.na() and anyNA()
J
joining, Combining Datasets–Combining Datasets
joining variable, Combining Datasets
K
keyboard shortcuts, Installing and Updating RStudio, Keyboard Shortcuts
Knuth, Donald, Efficient Optimization
L
lapply() function, The Apply Family
learning, Efficient Learning–Spread the Knowledge
  asking questions efficiently, Asking a Question
  in depth, Learning In Depth
  online resources, Who This Book Is for and How to Use It, Online Resources–Mailing Lists and Groups
  R’s internal help for, Who This Book Is for and How to Use It, Using R’s Internal Help–swirl
  Stack Overflow site, Stack Overflow
  teaching, Spread the Knowledge
  tips for, Top Five Tips for Efficient Learning
library function calls, Loading Packages
Linux
  C++ compiler, Prerequisites
  parallel code under, Parallel Code under Linux and OS X
  R installation, Installing R
  system monitoring on, Operating System and Resource Monitoring
loops, Rcpp and, Vectors and Loops–Matrices
lubridate package, Example Package
M
Mac OS
  C++ compiler installation, Prerequisites
  R installation, Installing R
  R updates, Updating R
  system monitoring on, Operating System and Resource Monitoring
mailing lists, Mailing Lists and Groups
matrices, Matrices–Sparse matrices
  integer data type, The integer data type
  sparse, Sparse matrices
memoise package, Caching Variables
memory allocation, Memory Allocation
merging, Combining Datasets–Combining Datasets
message() function, Informative Output: message() and cat()
METACRAN, How to Select a Package
microbenchmark package, Benchmarking
Microsoft R Open, Other Interpreters
missing values, is.na() and anyNA()
MonetDB, Working with Databases
Monopoly (game), Example: Optimizing the move_square() Function
Monte Carlo simulation
  code profiling, Example: Monopoly Simulation
  parallel computing for, Example: Snakes and Ladders
  vectorized code, Example: Monte Carlo integration
MRAN, How to Select a Package
N
non-standard evaluation (NSE), Exercises, Nonstandard Evaluation
normalizePath() function, The Location of Startup Files
noSQL, Working with Databases
O
objectdisplay,ObjectDisplayandOutputTable
objects
assigningtovalues,Assignment
namingof,ObjectNames
onlinelearningresources,OnlineResources-MailingListsandGroups,MailingListsand
Groups
mailinglists,MailingListsandGroups
R-bloggers,OnlineResources
StackOverflow,StackOverflow
operatingsystem(OS)
32-bitvs.64-bit,OperatingSystems:32-Bitor64-Bit
Rsetup,OperatingSystem-Exercises
resourcemonitoringand,OperatingSystemandResourceMonitoring-Exercises
optim()function,GettingHelponFunctions-GettingHelponFunctions
optimization,EfficientOptimization-RcppResources
codeprofiling,CodeProfiling-Example:MonopolySimulation
efficientbaseR,EfficientBaseR-Exercises
movie_square()function,Example:Optimizingthemove_square()Function
parallelcomputing,ParallelComputing-ParallelCodeunderLinuxandOSX
Rcpp,Rcpp-RcppResources
tipsfor,TopFiveTipsforEfficientOptimization
options()function,Settingoptions
OR(|,||)operator,LogicalANDandOR
Oracle,R-interpreter,OtherInterpreters
ordering,InherentOrder,SortingandOrdering
OS(seeoperatingsystem)
OSX,parallelcodeunder,ParallelCodeunderLinuxandOSX
P
packages
loading, Loading Packages
(see also .Renviron file)
R (see R packages)
panes, RStudio layout, Window Pane Layout-Exercises
parallel computing, Parallel Computing-Parallel Code under Linux and OS X
apply functions, Parallel Versions of Apply Functions
exit functions, Exit Functions with Care
Snakes and Ladders simulation, Example: Snakes and Ladders
under Linux or OS X, Parallel Code under Linux and OS X
parallel package, Parallel Computing, Example: Snakes and Ladders
passwords, storing in .Renviron, Working with Databases
pathological package, The Location of Startup Files
plain-text data files
fread() vs. read_csv(), Differences Between fread() and read_csv()-Differences Between fread() and read_csv()
I/O with, Plain-Text Formats-Preprocessing Text Outside R
limitations to, Binary File Formats
preprocessing text outside R, Preprocessing Text Outside R
pointer object, C++ Data Types
pqrR, Other Interpreters
profiling (see code profiling)
profvis, Getting Started with profvis-Example: Monopoly Simulation
basics, Getting Started with profvis
Monopoly simulation example, Example: Monopoly Simulation
programmer productivity/efficiency, What Is Efficiency?
(see also workflow)
programming, Efficient Programming-Compiling Code
apply function family, The Apply Family-Other resources
byte compiler, The Byte Compiler-Compiling Code
caching variables, Caching Variables-Exercises
communicating with user, Communicating with the User-Invisible Returns
factors, Factors
general advice, General Advice-Exercise
memory allocation, Memory Allocation
tips for, Prerequisites
vectorized code, Vectorized Code-Exercise
project management
chunking, Chunking Your Work
RStudio, Project Management
SMART criteria for objectives, Making Your Workflow SMART
visualizing plans with R, Visualizing Plans with R
project planning
package selection, Package Selection-How to Select a Package
project management and, Project Planning and Management-Visualizing Plans with R
typology, A Project Planning Typology-A Project Planning Typology
visualizing plans with R, Visualizing Plans with R
Protocol Buffers
binary data storage with, Protocol Buffers
pryr package, Random Access Memory
publication, Publication-R Packages
R Markdown framework for documentation, Dynamic Documents with R Markdown
treating projects as R packages, R Packages
pull request (PR), Branches, Forks, Pulls, and Clones
Q
questions
asking efficiently, Asking a Question
avoiding redundant, Online Resources
minimal dataset for illustrating, Minimal Dataset
minimal example for illustrating, Minimal Example
R
R
C++ functions vs., A Simple C++ Function
installing, Installing R
updating, Updating R
R Markdown, Dynamic Documents with R Markdown
R package ecosystem, How to Select a Package
R packages
accessing data stored in, Accessing Data Stored in Packages
installation, R Package, Installing R Packages
installation with dependencies, Installing R Packages with Dependencies
searching for, Searching for R Packages
selection as part of planning process, Package Selection-How to Select a Package
selection criteria, How to Select a Package
treating projects as, Project Management, R Packages
updating, Updating R Packages
R startup, R Startup-Exercises
arguments, R Startup Arguments
location of startup files, The Location of Startup Files-The Location of Startup Files
R-bloggers, Online Resources
R-project website, Online Resources
random access memory (RAM), Random Access Memory-Random Access Memory
Rcpp, Rcpp-Rcpp Resources
C++ data types, C++ Data Types
C++ functions, A Simple C++ Function
cppFunction() command, The cppFunction() Command
matrices, Matrices
resources/documentation, Rcpp Resources
sourceCpp() function, The sourceCpp() Function
sugar, C++ with Sugar on Top
vectors and loops, Vectors and Loops-Matrices
Rdata, Native Binary Formats: Rdata or Rds?
Rds, Native Binary Formats: Rdata or Rds?
read.csv() function, Plain-Text Formats
readr package, Prerequisites, Plain-Text Formats-Differences Between fread() and read_csv()
read_csv() function
factors and, Inherent Order
fread() vs., Differences Between fread() and read_csv()-Differences Between fread() and read_csv()
speed of, Plain-Text Formats
reformatting, Reformatting Code with RStudio
regular expressions, Regular Expressions
rename() function, Renaming Columns
Renjin, Other Interpreters
Renviron file, An Overview of R’s Startup Files, The .Renviron File-Example .Renviron file
location of, The Location of Startup Files
storing passwords in, Working with Databases
resource monitoring, Operating System and Resource Monitoring-Exercises
rev() function, Reversing Elements
Rho, Other Interpreters
rio package, Versatile Data Import with rio-Versatile Data Import with rio
RODBC, Working with Databases
rows, filtering with dplyr, Filtering Rows
Rprof() function, Code Profiling
Rprofile, An Overview of R’s Startup Files, The .Rprofile File-Creating hidden environments with .Rprofile, The fortunes package
hidden environments with, Creating hidden environments with .Rprofile
location of, The Location of Startup Files
setting CRAN mirror, Setting the CRAN mirror
setting options, Setting options
useful functions, Useful functions
RStudio
autocompletion, Autocompletion-Autocompletion
Git integration in, Git Integration in RStudio
installing and updating, Installing and Updating RStudio
keyboard shortcuts, Installing and Updating RStudio, Keyboard Shortcuts
object display and output table, Object Display and Output Table
options, RStudio Options
project management, Project Management
R package updates, Updating R Packages
reformatting code with, Reformatting Code with RStudio
setup, RStudio-Project Management
window pane layout, Window Pane Layout-Exercises
RStudio mirror, Setting the CRAN mirror
S
separate() function, Split Joint Variables with separate()
setup, Efficient Setup-Exercise
alternative R interpreters, Other Interpreters
BLAS framework, BLAS and Alternative R Interpreters-Useful BLAS/Benchmarking Resources
installing R, Installing R
operating system, Operating System-Exercises
R package installation, Installing R Packages
R startup, R Startup-Exercises
R version, R Version-Exercises
RStudio, RStudio-Project Management
tips for, Top Five Tips for an Efficient R Setup
updating R, Updating R
updating R packages, Updating R Packages
shared memory systems (see parallel computing)
shortcuts, keyboard, Installing and Updating RStudio, Keyboard Shortcuts
SI (International System of Units) prefixes, Background: What Is a Byte?
SMART criteria, Making Your Workflow SMART
solid state drive (SSD), Hard Drives: HDD Versus SSD
sorting, Sorting and Ordering
source code, reading, Reading R Source Code
sourceCpp() function, The sourceCpp() Function
spacing, Spacing
sparse matrices, Sparse matrices
SSD (solid state drive), Hard Drives: HDD Versus SSD
Stack Overflow (programming help site), Stack Overflow
startup files, R, An Overview of R’s Startup Files-Example .Renviron file
.Renviron, The .Renviron File-Example .Renviron file
.Rprofile, The .Rprofile File-Creating hidden environments with .Rprofile
location of, The Location of Startup Files-The Location of Startup Files
startup, R, R Startup-Exercises
stop() function, Fatal Errors: stop()
stream processing, Preprocessing Text Outside R
stringr, pattern matching with, Regular Expressions
style (see coding style)
subsetting, Data Processing with data.table, The integer data type
sugar, C++ with Sugar on Top
swirl, swirl
Sys.getenv() function, The .Renviron File
system variables (see .Renviron file)
T
tables, gather() function and, Make Wide Tables Long with gather()
tbl_df data frame class, Efficient Data Frames with tibble
teaching, as form of learning, Spread the Knowledge
technical debt, Project Planning and Management
TERR, Other Interpreters
tibble, Efficient Data Frames with tibble
Tibco, Other Interpreters
tidy data, Tidying Data with tidyr and Regular Expressions
tidyr package, Tidying Data with tidyr and Regular Expressions-Other tidyr Functions
data tidying with, Tidying Data with tidyr and Regular Expressions-Other tidyr Functions
splitting joint variables with separate(), Split Joint Variables with separate()
various functions, Other tidyr Functions
touch typing, Touch Typing
U
Ubuntu
R packages with dependencies, Installing R Packages with Dependencies
R updates, Updating R
updating
R, Updating R
R packages, Updating R Packages
user, communicating with, Communicating with the User-Invisible Returns
fatal errors, Fatal Errors: stop()
informative output, Informative Output: message() and cat()
invisible returns, Invisible Returns
warnings, Warnings: warning()
V
values, assigning objects to, Assignment
variables, caching, Caching Variables-Exercises
vector
determining which indices are TRUE, Which Indices are TRUE?
matrices and, Matrices
pre-allocating, Memory Allocation
Rcpp and, Vectors and Loops-Matrices
sorting, Sorting and Ordering
vectorized code
efficient programming and, Vectorized Code-Exercise
Monte Carlo integration, Example: Monte Carlo integration
version control, Version Control-Branches, Forks, Pulls, and Clones
branches, Branches, Forks, Pulls, and Clones
clones, Branches, Forks, Pulls, and Clones
commits, Commits
forks, Branches, Forks, Pulls, and Clones
Git integration in RStudio, Git Integration in RStudio
GitHub, GitHub
pull requests, Branches, Forks, Pulls, and Clones
vignette() function, Who This Book Is for and How to Use It
vignettes, finding/using, Who This Book Is for and How to Use It, Finding and Using Vignettes
W
warning() function, Warnings: warning()
wide boundary search, Searching R for Topics
wide data, Tidying Data with tidyr and Regular Expressions
window panes, in RStudio layout, Window Pane Layout-Exercises
Windows
C++ compiler installation, Prerequisites
file paths in R, The Location of Startup Files
R installation, Installing R
R packages with dependencies, Installing R Packages with Dependencies
R updates, Updating R
system monitoring on, Operating System and Resource Monitoring
workflow, Efficient Workflow-R Packages
chunking, Chunking Your Work
defined, Efficient Workflow
package selection, Package Selection-How to Select a Package
project management and, Project Planning and Management-Visualizing Plans with R
project planning typology, A Project Planning Typology-A Project Planning Typology
publication, Publication-R Packages
About the Authors
Colin Gillespie is senior lecturer (associate professor) at Newcastle University, UK. His
research interests are high-performance statistical computing and Bayesian statistics. He is
regularly employed as a consultant by Jumping Rivers and has been teaching R since 2005 at
a variety of levels, ranging from beginners to advanced programming.
Robin Lovelace is a researcher at the Leeds Institute for Transport Studies (ITS) and the
Leeds Institute for Data Analytics (LIDA). Robin has many years of experience using R for academic
research and has taught numerous R courses at all levels. He has developed a number of
popular R resources, including Introduction to Visualising Spatial Data in R and Spatial
Microsimulation with R (Lovelace and Dumont 2016). These skills have been applied on a
number of projects with real-world applications, including the Propensity to Cycle Tool, a
nationally scalable interactive online mapping application, and the stplanr package.
Colophon
The animal on the cover of Efficient R Programming is the grey heron (Ardea cinerea). Grey
herons are large wading birds, measuring up to 100 cm in height with a nearly 200 cm
wingspan. They are long-legged, which lets them easily wade in the shallows of their native
wetland habitat. They hunt fish, amphibians, small mammals, and insects by standing
motionless in shallow water throughout the day, then striking unsuspecting prey with their
long bill. At night, they roost in trees or on cliffs, where they also lay eggs and raise their
young.
Grey herons can be found throughout Europe, Asia, and Africa. Most grey herons live in the
same region year round, but those living in colder northern regions migrate south for the
winter. They are mostly grey in color, with a white neck and black streaks on the head and
wings.
Grey herons have been a part of several ancient mythological systems. During the New
Kingdom period in Egypt, the deity Bennu, god of the sun, creation, and rebirth, was
represented as a grey heron. In pre-Christian Rome, the grey heron was a symbol of
divination used by priests to predict the future.
Many of the animals on O’Reilly covers are endangered; all of them are important to the
world. To learn more about how you can help, go to animals.oreilly.com.
The cover image is from Meyers Kleines Lexicon. The cover fonts are URW Typewriter and
Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad
Condensed; and the code font is Dalton Maag’s Ubuntu Mono.
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
Colin
Robin
1. Introduction
Prerequisites
Who This Book Is for and How to Use It
What Is Efficiency?
What Is Efficient R Programming?
Why Efficiency?
Cross-Transferable Skills for Efficiency
Touch Typing
Consistent Style and Code Conventions
Benchmarking and Profiling
Benchmarking
Benchmarking Example
Profiling
Book Resources
R Package
Online Version
References
2. Efficient Setup
Prerequisites
Top Five Tips for an Efficient R Setup
Operating System
Operating System and Resource Monitoring
R Version
Installing R
Updating R
Installing R Packages
Installing R Packages with Dependencies
Updating R Packages
R Startup
R Startup Arguments
An Overview of R’s Startup Files
The Location of Startup Files
The .Rprofile File
Example .Rprofile File
The .Renviron File
RStudio
Installing and Updating RStudio
Window Pane Layout
RStudio Options
Autocompletion
Keyboard Shortcuts
Object Display and Output Table
Project Management
BLAS and Alternative R Interpreters
Testing Performance Gains from BLAS
Other Interpreters
Useful BLAS/Benchmarking Resources
References
3. Efficient Programming
Prerequisites
Top Five Tips for Efficient Programming
General Advice
Memory Allocation
Vectorized Code
Communicating with the User
Fatal Errors: stop()
Warnings: warning()
Informative Output: message() and cat()
Invisible Returns
Factors
Inherent Order
Fixed Set of Categories
The Apply Family
Example: Movies Dataset
Type Consistency
Caching Variables
Function Closures
The Byte Compiler
Example: The Mean Function
Compiling Code
References
4. Efficient Workflow
Prerequisites
Top Five Tips for Efficient Workflow
A Project Planning Typology
Project Planning and Management
Chunking Your Work
Making Your Workflow SMART
Visualizing Plans with R
Package Selection
Searching for R Packages
How to Select a Package
Publication
Dynamic Documents with R Markdown
R Packages
Reference
5. Efficient Input/Output
Prerequisites
Top Five Tips for Efficient Data I/O
Versatile Data Import with rio
Plain-Text Formats
Differences Between fread() and read_csv()
Preprocessing Text Outside R
Binary File Formats
Native Binary Formats: Rdata or Rds?
The Feather File Format
Benchmarking Binary File Formats
Protocol Buffers
Getting Data from the Internet
Accessing Data Stored in Packages
References
6. Efficient Data Carpentry
Prerequisites
Top Five Tips for Efficient Data Carpentry
Efficient Data Frames with tibble
Tidying Data with tidyr and Regular Expressions
Make Wide Tables Long with gather()
Split Joint Variables with separate()
Other tidyr Functions
Regular Expressions
Efficient Data Processing with dplyr
Renaming Columns
Changing Column Classes
Filtering Rows
Chaining Operations
Data Aggregation
Nonstandard Evaluation
Combining Datasets
Working with Databases
Databases and dplyr
Data Processing with data.table
References
7. Efficient Optimization
Prerequisites
Top Five Tips for Efficient Optimization
Code Profiling
Getting Started with profvis
Example: Monopoly Simulation
Efficient Base R
The if() Versus ifelse() Functions
Sorting and Ordering
Reversing Elements
Which Indices are TRUE?
Converting Factors to Numerics
Logical AND and OR
Row and Column Operations
is.na() and anyNA()
Matrices
Example: Optimizing the move_square() Function
Parallel Computing
Parallel Versions of Apply Functions
Example: Snakes and Ladders
Exit Functions with Care
Parallel Code under Linux and OS X
Rcpp
A Simple C++ Function
The cppFunction() Command
C++ Data Types
The sourceCpp() Function
Vectors and Loops
Matrices
C++ with Sugar on Top
Rcpp Resources
References
8. Efficient Hardware
Prerequisites
Top Five Tips for Efficient Hardware
Background: What Is a Byte?
Random Access Memory
Hard Drives: HDD Versus SSD
Operating Systems: 32-Bit or 64-Bit
Central Processing Unit
Cloud Computing
Amazon EC2
9. Efficient Collaboration
Prerequisites
Top Five Tips for Efficient Collaboration
Coding Style
Reformatting Code with RStudio
Filenames
Loading Packages
Commenting
Object Names
Example Package
Assignment
Spacing
Indentation
Curly Braces
Version Control
Commits
Git Integration in RStudio
GitHub
Branches, Forks, Pulls, and Clones
Code Review
References
10. Efficient Learning
Prerequisites
Top Five Tips for Efficient Learning
Using R’s Internal Help
Searching R for Topics
Finding and Using Vignettes
Getting Help on Functions
Reading R Source Code