README.md 2/5/2019

BuildyourownPandasCub

ThisrepositorycontainsadetailedprojectthatteachesyouhowtobuildyourownPythondata

analysislibrary.

TargetStudent

ThisprojectistargetedtowardsthosewhounderstandthefundamentalsofPythonandwishto

buildtheirowndataanalysislibrarysimilartoPandasfromscratch.

Pre-Requisites

Intermediate knowledge of Python

Helpful to have heard about special methods

Helpful to have used NumPy and Pandas before

Objectives

MostdatascientistswhousePythonrelyonPandas.Inthisassignment,wewillbuildPandasCub,

alibrarythatimplementsmanyofthemostcommonandusefulmethodsfoundinPandas.Pandas

Cubwill:

HaveaDataFrameclasswithdatastoredinNumPyarrays

UsespecialmethodsdeﬁnedinthePythondatamodel

HaveanicelyformatteddisplayoftheDataFrameinthenotebook

Selectsubsetsofdatawiththebracketsoperator

Implementaggregationmethods‒sum,min,max,mean,median,etc...

Implementnon‒aggregationmethodssuchasisna,unique,rename,drop

Groupbyoneortwocolumns

Havemethodsspeciﬁctostringcolumns

SettinguptheDevelopmentEnvironment

Irecommendcreatinganewenvironmentusingthecondapackagemanager.Ifyoudonothave

conda,youcandownloaditherealongwiththeentireAnacondadistribution.ChoosePython3.

Whenbeginningdevelopmentonanewlibrary,it'sagoodideatouseacompletelyseparate

environmenttowriteyourcode.

Createtheenvironmentwiththeenvironment.ymlﬁle

Condaallowsyoutoautomatetheenvironmentcreationbycreatinganenvironment.ymlﬁle.The

contentsoftheﬁleareminimalandaredisplayedbelow.

name: pandas_cub
dependencies:
- python=3.6
- pandas
- jupyter
- pytest

Thisﬁlewillbeusedtocreateanewenvironmentnamedpandas_cub.ItwillinstallPython3.6in

acompletelyseparatefolderinyourﬁlesystemalongwithpandas,jupyter,andpytest.Therewill

actuallybemanymorepackagesinstalledasthoselibrarieshavedependenciesoftheirown.

Visitthispageformoreinformationoncondaenvironments.

Commandtocreatenewenvironment

Inthetopleveldirectoryofthisrepository,wheretheenvironment.ymlﬁleislocated,runthe

followingcommandfromyourcommandline.

conda env create -f environment.yml

Theabovecommandwilltakesometimetocomplete.Onceitcompletes,theenvironmentwillbe

created.

Listtheenvironments

Runthecommandconda env listtoshowalltheenvironmentsyouhave.Therewillbea*next

totheactiveenvironment,whichwilllikelybebase,thedefaultenvironmentthateveryonestarts

in.

Activatethepandas̲cubenvironment

Creatingtheenvironmentdoesnotmeanitisactive.Youmustactivateinordertouseit.Usethe

followingcommandtoactivateit.

conda activate pandas_cub

Youshouldseepandas_cubinparenthesesprecedingyourcommandprompt.Youcanrunthe

commandconda env listtoconﬁrmthatthe*hasmovedtopandas_cub.

Deactivateenvironment

Youshouldonlyusethepandas_cubenvironmenttodevelopthislibrary.Whenyouaredonewith

thissession,runthecommandconda deactivatetoreturntoyourdefaultcondaenvironment.

Test‒DrivenDevelopmentwithpytest

Thecompletionofeachpartofthisprojectispredicateduponpassingthetestswritteninthe

test_dataframe.pymoduleinsidethetestsfolder.

Wewillrelyuponthepytestlibrarytotestourcode.Weinstalleditalongwithacommandlinetool

withthesamenameduringourenvironmentcreation.

Test‒Drivendevelopmentisapopularapproachfordevelopment.Itinvolveswritingtestsﬁrstand

thenwritingcodethatpassesthetests.

Testing

Allthetestsarelocatedinthetest_dataframe.pymodulefoundinthetestsdirectory.Torunall

thetestsinthisﬁlerunthefollowingonthecommandline.

$ pytest tests/test_dataframe.py

Ifyourunthiscommandrightnow,allthetestswillfail.Asyoucompletethestepsintheproject,

youwillstartpassingthetests.Onceallthetestsarepassed,theprojectwillbecomplete.

Automatedtestdiscovery

Thepytestlibraryhasrulesforautomatedtestdiscovery.Itisn'tnecessarytosupplythepathto

thetestmoduleifyourdirectoriesandmodulenamesfollowthoserules.Youcansimplyrun

pytesttorunallthetestsinthislibrary.

Runningspeciﬁctests

Ifyouopenuponeofthetestmoduletest_dataframe.py,youwillseethetestsgroupedunder

diﬀerentclasses.Eachmethodoftheclassesrepresentsexactlyonetest.Torunallthetestswithin

asingleclass,appendtwocolonsfollowedbytheclassname.Thefollowingisaconcreteexample:

$ pytest tests/test_dataframe.py::TestDataFrameCreation

Itispossibletorunjustasingletestbyappendingtwomorecolonsfollowedbythemethodname.

Anotherconcreteexamplefollows:

$ pytest tests/test_dataframe.py::TestDataFrameCreation::test_input_types

Theanswerisinpandas̲cub̲ﬁnal

Thepandas_cub_finaldirectorycontainsthecompleted__init__.pyﬁlethatcontainsthecode

thatpassesallthetests.Onlylookatthisﬁleafteryouhaveattemptedtocompletethesectionon

yourown.

ManuallytestinaJupyterNotebook

Duringdevelopment,it'sgoodtohaveaplacetomanuallyexperimentwithyournewcodesoyou

canseeitinaction.WewillbeusingtheJupyterNotebooktoquicklyseehowourDataFrameis

changing.Withinthepandas_cubenvironment,launchaJupyterNotebookandopenuptheTest

Notebook.ipynbnotebook.

Autoreloading

Theﬁrstcellloadsanotebookmagicextensionwhichautomaticallyreloadscodefromﬁlesthat

havechanged.Normally,wewouldhavetorestartthekernelifwemadechangestoourcodeto

seeitreﬂectitscurrentstate.Thismagiccommandsavesusfromdoingthis.

Importingpandas̲cub

Thisnotebookisatthesamelevelastheinnerpandas_cubdirectory.Thismeansthatwecan

importpandas_cubdirectlyintoournamespacewithoutchangingdirectories.Technically,

pandas_cubisaPythonpackage,whichisadirectorycontaininga__init__.pyﬁle.Itisthis

initializationﬁlethatgetsrunwhenwewriteimport pandas_cub as pdc.

pandas_cub_finalisalsoimportedsoyoucanseehowthecompletedobjectissupposedto

behave.

AtestDataFrame

AsimpletestDataFrameiscreatedforpandas_cub,pandas_cub_final,andpandas.Theoutput

forallthreeDataFramesareproducedinthenotebook.Therecurrentlyisnonicerepresentation

forpandas_cubDataFrames.

StartingPandasCub

Youwillbeeditingasingleﬁleforthisproject‒the__init__.pyﬁlefoundinthepandas_cub

directory.Itcontainsskeletoncodefortheentireproject.Youwon'tbedeﬁningyourownclasses

ormethods,butyouwillbeﬁllingoutthemethodbodies.

Openupthisﬁlenow.Youwillseemanyincompletemethodsthathavethekeywordpassastheir

lastline.Somemethodshavecodewithacommentthatsays'yourcodehere'.Thesearethe

methodsthatyouwillbeediting.Othermethodsarecompleteandwon'tneedediting.

Howtocompletetheproject

Keepthe__init__.pyﬁleopenatalltimes.Thisistheonlyﬁlethatyouwillbeediting.Readand

completeeachnumberedsectionbelow.Editthemethodindicatedineachsectionandthenrun

thetest.Onceyoupassthattest,moveontothenextsection.

1.CheckDataFrameconstructorinputtypes

OurDataFrameclassisconstructedwithasingleparameter,data.Wearegoingtoforceourusers

tosetthisvalueasadictionarythathasstringsasthekeysandone‒dimensionalNumPyarraysas

thevalues.Thekeyswilleventuallybecomethecolumnnamesandthearrayswillbethevaluesof

thosecolumns.

Inthisstep,wewillﬁlloutthe_check_input_typesmethod.Thismethodwillensurethatour

usershavepassedusavaliddataparameter.Noticethatthisdataisalreadyassignedtothe

_datainstancevariable,meaningyouwillaccesstowithinthemethodwithself._data.

Speciﬁcally,_check_input_typesmustdothefollowing:

raiseaTypeErrorifdataisnotadictionary

raiseaTypeErrorifthekeysofdataarenotstrings

raiseaTypeErrorifthevaluesofdataarenotNumPyarrays

raiseaValueErrorifthevaluesofdataarenot1‒dimensional

Runthefollowingcommandtotestthissection:

$ pytest tests/test_dataframe.py::TestDataFrameCreation::test_input_types
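The four checks listed above could be sketched as follows. This is only an illustration, not the code in pandas_cub_final: it is written as a standalone function with the hypothetical name check_input_types that takes data as an argument, whereas the real method reads self._data, and the error messages are made up.

```python
import numpy as np

def check_input_types(data):
    """Standalone stand-in for _check_input_types (hypothetical name)."""
    if not isinstance(data, dict):
        raise TypeError("`data` must be a dictionary")
    for name, values in data.items():
        if not isinstance(name, str):
            raise TypeError("keys of `data` must be strings")
        if not isinstance(values, np.ndarray):
            raise TypeError("values of `data` must be NumPy arrays")
        if values.ndim != 1:
            raise ValueError("values of `data` must be one-dimensional")
```

Checking the container first means the later checks can safely iterate over data.items().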

2.Checkarraylengths

Wearenowguaranteedthatdataisadictionaryofstringsmappedtoone‒dimensionalarrays.

EachcolumnofdatainourDataFramemusthavethesamenumberofelements.Inthisstep,you

mustensurethatthisisthecase.Editthe_check_array_lengthsmethodandraisea

ValueErrorifanyofthearraysarediﬀerinlength.

Runthefollowingtest:

$ pytest tests/test_dataframe.py::TestDataFrameCreation::test_array_length
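One possible way to express the length check is to collect the distinct lengths in a set. The function name check_array_lengths and its data argument are assumptions made so the sketch runs on its own; the real method works on self._data.

```python
import numpy as np

def check_array_lengths(data):
    """Standalone stand-in for _check_array_lengths (hypothetical name)."""
    # A set keeps one entry per distinct length
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("all arrays must have the same length")
```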

3.Changeunicodearraystoobject

Bydefault,wheneveryoucreateaNumPyarrayofPythonstrings,itwilldefaultthedatatypeof

thatarraytounicode.Unicodearraysaremorediﬃculttomanipulateanddon'thavetheﬂexibility

thatwedesire.So,ifouruserpassesusaUnicodearray,wewillcoverittoadatatypecalled

'object'.Thisisaﬂexibletypeandwillhelpuslaterwhencreatingmethodsjustforstringcolumns.

ThistypeallowsanyPythonobjectswithinthearray.

Inthisstep,youwillchangethedatatypeofUnicodearraystoobject.Youwilldothisbychecking

eacharraysdatatypekind.Thedatatypekindisasingle‒charactervalueavailablebydoing

array.dtype.kind.Usetheastypearraymethodtochangeitstype.

Editthe_convert_unicode_to_objectmethodandverifywiththetest_unicode_to_object

test.
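A minimal sketch of the conversion, assuming a standalone function (hypothetical name convert_unicode_to_object) that returns a new dictionary rather than working on self._data:

```python
import numpy as np

def convert_unicode_to_object(data):
    """Return a new dict in which Unicode arrays are stored as object arrays."""
    new_data = {}
    for name, values in data.items():
        if values.dtype.kind == 'U':          # 'U' is the kind of Unicode arrays
            new_data[name] = values.astype('object')
        else:
            new_data[name] = values
    return new_data

converted = convert_unicode_to_object({'fruit': np.array(['apple', 'pear'])})
print(converted['fruit'].dtype.kind)  # O
```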

4.FindthenumberofrowsintheDataFramewiththelenfunction

ThenumberofrowsarereturnedwhenpassingaPandasDataFrametothebuiltinlenfunction.

Wewillmakepandas̲cubbehavethesameexactway.

Todosoweneedtoimplementthespecialmethod__len__.ThisiswhatPythoncallwheneveran

objectispassedtothelenfunction.

Editthe__len__methodandhaveitreturnthenumberofrows.Testwithtest_len.

5.Returncolumnsasalist

Inpandas,callingdf.columnsreturnsasequenceofthecolumnnames.Ourcolumnnamesare

currentlythekeysinour_datadictionary.Pythonprovidesthepropertydecoratorwhichallows

ustoexecutecodeonsomethingthatappearstobejustaninstancevariable.

Editthecolumns'method'(reallyaproperty)toreturnalistofthecolumnsinorder.Sinceweare

workingwithPython3.6,thedictionarykeysareinternallyordered.Takeadvantageofthis.

Validatewiththetest_columnstest.

6.Setnewcolumnnames

Inthisstep,wewillbeassigningallnewcolumnstoourDataFramebysettingthecolumns

propertyequaltoalist.ThisistheexactsamesyntaxasitiswithPandas.Aconcreteexample

belowshowshowyouwouldsetnewcoulmnsfora3‒columnDataFrame.

README.md 2/5/2019

6 / 10

Completethefollowingtasks:

RaiseaTypeErroriftheobjectusedtosetnewcolumnsisnotalist

RaiseaValueErrorifthenumberofcolumnnamesinthelistdoesnotmatchthecurrent

DataFrame

RaiseaTypeErrorifanyofthecolumnsarenotstrings

RaiseaValueErrorifanyofthecolumnnamesareduplicatedinthelist

Reassignthe_datavariablesothatallthekeyshavebeenupdated

df.columns = ['state', 'age', 'fruit']

Pythonallowsyoutosetcolumnsbyusingthedecoratorcolumns.setter.Thevalueontheright

handsideoftheassignmentstatementispassedtothemethod.Editthe'column'method

decoratedbycolumns.setterandtestwithtest_set_columns.

7.Theshapeproperty

TheshapepropertyinPandasreturnsatupleofthenumberofrowsandcolumns.Theproperty

decoratorisusedagainhere.EditittohaveourDataFramedothesameasPandas.Testwith

test_shape

8.Uncomment_repr_html_method

ThisisamethodspeciﬁcallyusedbyIPythontorepresentyourobjectintheJupyterNotebook.

Thismethodmustreturnastringofhtml.Thismethodisfairlycomplexandyoumustknowsome

basichtmltocomplete.Idecidedtoimplementthismethodforyou.Uncommentitandtestthe

outputinthenotebook.YoushouldnowseeanicelyformattedrepresentationofyourDataFrame.

9.Thevaluesproperty

InPandas,valuesisapropertythatreturnsasinglearrayofallthecolumnsofdata.Our

DataFramewilldothesame.Editthevaluespropertyandconcatenateallthecolumnarraysinto

asingletwo‒dimensionalNumPyarray.Returnthisarray.TheNumPycolumn_stackfunctioncan

behelpfulhere.Testwithtest_values.

10.Thedtypesproperty

InPandas,thedtypespropertyreturnsaSeriescontainingthedatatypeofeachcolumnwiththe

columnnamesintheindex.OurDataFramedoesn'thaveanindex.Instead,returnatwo‒column

DataFrame.Putthecolumnnamesunderthe'ColumnName'columnandthedatatype(bool,int,

string,orﬂoat)underthecolumnname'DataType'.

Atthetopofthe__init__.pymodulethereexistsaDTYPE_NAMEdictionary.Useittoconvert

fromarraykindtothestringnameofthedatatype.Testwithtest_dtypes.

11.Selectasinglecolumnwiththebrackets

InPandas,youcanselectasinglewithdf['colname'].OurDataFramewilldothesame.Tomake

anobjectworkwiththebrackets,youmustimplementthe__getitem__specialmethod.This

methodispassedasingleparameter,thevaluewithinthebrackets.

Inthisstep,useisinstancetocheckwhetheritemisastring.Ifitisreturnaonecolumn

DataFrameofthatcolumn.

ThesetestsareunderatheTestSelectionclass.Runthetest_one_columntest.

12.Selectmultiplecolumnswithalist

OurDataFramewillalsobeabletoselectmultiplecolumnsifgivenalistwithinthebrackets.For

example,df[['colname1', 'colname2']]willreturnatwocolumnDataFrame.

Continueeditingthe__getitem__method.Ifitemisalist,returnaDataFrameofjustthose

columns.Runthetest_multiple_columns

13.BooleanSelectionwithaDataFrame

InPandas,youcanﬁlterforspeciﬁcrowsofaDataFramebypassinginabooleanSeries/arrayto

thebrackets.Forinstance,thefollowingwillselectallrowssuchthataisgreaterthan10.

>>> s = df['a'] > 10

>>> df[s]

Thisiscalledbooleanselection.WewillmakeourDataFrameworksimilarly.Editthe__getitem__

methodandcheckwhetheritemisaDataFrame.Ifitisthendothefollowing:

Ifitismorethanonecolumn,raiseaValueError

Extracttheunderlyingarrayfromthesinglecolumn

Iftheunderlyingarraykindisnotboolean('b')raiseaValueError

UsethebooleanarraytoreturnanewDataFramewithjusttherowswheretheboolean

arrayisTruealongwithallthecolumns.

Runtest_simple_booleantotest

14.

Asinglestringselectsonecolumn‒>df['colname']

Alistofstringsselectsmultiplecolumns‒>df[['colname1', 'colname2']]

AonecolumnDataFrameofbooleansthatﬁltersrows‒>df[df_bool]

Rowandcolumnselectionsimultaneously‒>df[rs, cs]

csandrscanbeintegers,slices,oralistofintegers

rscanalsobeaone‒columnbooleanDataFrame

Implementtheﬁrsttwoitemsinthelistandthencopyandpasteallthecodefrom

pandas̲cub̲ﬁnal.

10.TabCompletion

IPythonhelpsusagainbyprovidinguswiththe_ipython_key_completions_method.Returna

listofthetabcompletionsyouwouldliketohaveavailablewheninsidethebracketsoperator.

11.Createoroverwriteacolumn

Tomakeanassignmentwiththebracketsoperator,Pythonmakesthe__setitem__methodwhich

acceptstwovalues,thekeyandthevalue.Wewillonlyimplementthesimplecaseofaddinga

newcolumnoroverwritinganoldone.

12.headandtailmethods

Havethesemethodsacceptasingleparameternandreturntheﬁrst/lastnrows.

13.Genericaggregationmethods

Aggregationmethodsreturnasinglevalueforeachcolumn.Wewillonlyimplementcolumn‒wise

aggregationsandnotrow‒wise.

Writeagenericmethod_aggthatacceptsanaggregationfunctionasastring.Usethegetattr

functiontogettheactualNumPyfunction.

Stringcolumnswithmissingvalueswillnotwork.Exceptthiserroranddon'treturncolumnswhere

theaggregationcannotbefound.

Deﬁningthe_aggmethodwillmakealltheotheraggregationmethodswork.
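The getattr lookup and the skip-on-error behaviour could be sketched as below. The standalone function agg and its data argument are hypothetical, and catching TypeError specifically is an assumption about which error a string column containing None raises.

```python
import numpy as np

def agg(data, aggfunc):
    """Apply the NumPy function named `aggfunc` to every column (sketch)."""
    func = getattr(np, aggfunc)        # e.g. 'sum' -> np.sum
    result = {}
    for name, values in data.items():
        try:
            value = func(values)
        except TypeError:
            continue                   # skip columns that cannot be aggregated
        result[name] = np.array([value])
    return result

data = {'a': np.array([1, 2, 3]),
        'b': np.array(['x', None, 'y'], dtype='object')}
print(list(agg(data, 'sum')))  # ['a']
```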

14.isnamethod

ReturnaDataFrameofthesameshapethathasabooleanforeverysinglevalueinthe

DataFrame.Usenp.isnanexceptinthecaseforstringswhichyoucanuseavectorizedequality

expressiontoNone

15.countmethod

Returnthenumberofnon‒missingvaluesforeachcolumn

16.uniquemethod

Returnalistofone‒columndataframesofuniquevaluesineachcolumn.Ifthereisasingle

column.ReturnjusttheDataFrame

17.nuniquemethod

Returnthenumberofuniquevaluesforeachcolumn

18.value_countsmethod

Returnalistoftwo‒columnDataFrameswiththeﬁrstcolumnnameasthenameoftheoriginal

columnandthesecondcolumnname'count'containingthenumberofoccurrencesforeachvalue.

UsetheCountermethodofthecollectionsmodule.ReturntheDataFrameswithsortedvalues

fromgreatesttoleast.Youholdoﬀonsortinguntilyouhavedeﬁnedthesort_valuesmethod.

AcceptabooleanparameternormalizethatreturnsrelativefrequencieswhenTrue.

IfthecallingDataFramehasasinglecolumn,returnasingleDataFrame.

19.renamemethod

Acceptadictionaryofoldcolumnnamesmappedtonewcolumnnames.ReturnanewDataFrame

20.dropmethod

AcceptalistofcolumnnamesandreturnaDataFramewithoutthosecolumns

21.Non‒aggregationmethods

Thereareseveralnon‒aggregationmethodsthatfunctionsimilarly.Createagenericmethod

_non_aggthatcanimplement:

abs

cummin

cummax

cumsum

clip

round

copy

ThecumminandcummaxfunctionsinNumPynecessitatedotnotationtoreach.Wecannotuse

getattrforthisandinsteadhavetousethemorespecializedattrgetterfromtheoperator

library.

Noticethatsomeofthesehaveparameters.Collectthemwith*args.

22.diffandpct_changemethods

Returntherawdiﬀerenceorpercentagechangebetweenrowsgivenadistancen.Youcandrop

theﬁrstnrows.

23.ArithmeticandComparisonOperators

Allthearithmeticandcomparisonoperatorshavespecialmethodsavailable.Forinstance__add__

isusedfortheplussign,and__le__isusedforlessthanorequalto.Eachofthesemethods

acceptsasingleotherparameter.

Writeagenericmethod,_operthatworkswitheachofthesemethods.

24.sort_valuesmethod

Thismethodtakestwoparameters.Thesortingcolumnorcolumns(asastringorlist)anda

booleanforthedirectiontosort.YouwillneedtouseNumPy'sargsorttogettheorderofthe

sortforasinglecolumnandlexsorttosortmultiplecolumns.

25.samplemethod

ThismethodrandomlysamplestherowsoftheDataFrame.Youcaneitherchooseanexact

numbertosamplewithnorafractionwithfrac.Samplewithreplacementbyusingtheboolean

replace.Youcanalsosettherandomnumberseed.

26.straccessor

Inthe__init__method,therewasalinethatcreatedstrasaninstancevariablewiththe

StringMethodstype.

Allthestringmethodsusethegeneric_str_methodmethodwhichacceptsthenameofthe

method,thecolumnnameandanymethod‒speciﬁcparameters.

Modifythegeneric_str_methodtomakealltheotherstringmethodswork.

27.pivot_tablemethod

Thisisbyfarthemostcomplexmethodtoimplement.Allowrowsandcolumnstobecolumn

nameswho'suniquevaluesformthegroups.Aggregatethecolumnpassedtothevalues

parameterwiththeaggfuncstring.

AlloweitherrowsorcolumnstobeNone.IfvaluesoraggfuncisNonethenﬁndthefrequency

(likeinvalue_counts).

28.Automaticallyadddocumentation

Thismethodisalreadycompletedandautomaticallyaddsdocumentationtotheaggregation

methodsbysettingthe__doc__attribute.

29.ReadingsimpleCSVs

Implementtheread_csvfunctionbyreadingthrougheachline.Assumetheﬁrstlinehasthe

columnnames.Usethesecondlinetoassignthedatatypesofeachcolumn.
