Spark Exercise Instructions

Copyright © 2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
Cloudera Developer Training for
Apache Spark:
Hands-On Exercises
General Notes
Hands-On Exercise: Setting Up
Hands-On Exercise: Viewing the Spark Documentation
Hands-On Exercise: Using the Spark Shell
Hands-On Exercise: Getting Started with RDDs
Hands-On Exercise: Working with Pair RDDs
Hands-On Exercise: Using HDFS
Hands-On Exercise: Running Spark Shell on a Cluster
Hands-On Exercise: Working With Partitions
Hands-On Exercise: Viewing Stages in the Spark Application UI
Hands-On Exercise: Caching RDDs
Hands-On Exercise: Checkpointing RDDs
Hands-On Exercise: Writing and Running a Spark Application
Hands-On Exercise: Configuring Spark Applications
Hands-On Exercise: Exploring Spark Streaming
Hands-On Exercise: Writing a Spark Streaming Application
Hands-On Exercise: Iterative Processing with Spark
Hands-On Exercise: Using Broadcast Variables
Hands-On Exercise: Using Accumulators
Hands-On Exercise: Importing Data With Sqoop
General Notes
Cloudera's training courses use a Virtual Machine running the CentOS Linux
distribution. This VM has Spark and CDH 5 (Cloudera's Distribution, including
Apache Hadoop) installed. Although Hadoop and Spark typically run on a cluster of
multiple machines, in this course you will be using a cluster running locally on a
single node, which is referred to as Pseudo-Distributed mode.
Getting Started
The VM is set to automatically log in as the user training. Should you log out at
any time, you can log back in as the user training with the password training.
Should you need it, the root password is training. You may be prompted for this
if, for example, you want to change the keyboard layout. In general, you should not
need this password since the training user has unlimited sudo privileges.
Working with the Virtual Machine
In some command-line steps in the exercises, you will see lines like this:
$ hdfs dfs -put shakespeare \
/user/training/shakespeare
The dollar sign ($) at the beginning of each line indicates the Linux shell prompt.
The actual prompt will include additional information (e.g.,
[training@localhost workspace]$ ) but this is omitted from these
instructions for brevity.
Some commands are to be executed in the Python or Scala Spark Shells; those are
shown with pyspark> or scala> prompts respectively.
The backslash (\) at the end of the first line of a command signifies that the
command is not completed, and continues on the next line. You can enter the code
exactly as shown (on two lines), or you can enter it on a single line. If you do the
latter, you should not type in the backslash.
Completing the Exercises
As the exercises progress, and you gain more familiarity with Spark, we provide
fewer step-by-step instructions; as in the real world, we merely give you a
requirement and it's up to you to solve the problem! You should feel free to refer to
the solutions provided, ask your instructor for assistance, or consult with your
fellow students.
Solutions
Spark Shell solutions (Python or Scala)
Many of the exercises in this course are done in the interactive Spark Shell. If you
need help completing an exercise, you can refer to the solutions files in
~/training_materials/sparkdev/solutions. Files with a .pyspark
extension contain commands that are meant to be pasted into the Python Spark
Shell; files ending in .scalaspark are meant to be pasted into the Scala Spark
Shell.
The Python shell is particular about code pastes because of the necessity for proper
whitespace/tab alignment. To avoid this issue, we suggest you use iPython, and use
the %paste "magic" function instead of the terminal window's paste function.
Spark Application solutions
Some of the exercises involve running complete programs rather than using the
interactive Spark Shell. For these, Python solutions are in
~/training_materials/sparkdev/solutions with the extension .py. Run
these solutions using spark-submit as described in the "Writing a Spark Application"
exercise.
Scala applications are provided in the context of a Maven project directory, located
in ~/exercises/projects. Within the project directory, the solution code is in
src/main/scala/solution. Follow the instructions in the "Writing a Spark
Application" exercise to compile, package, and run Scala solutions.
Hands-On Exercise: Setting Up
Files Used in This Exercise:
~/scripts/sparkdev/training_setup_sparkdev.sh
In this Exercise you will set up your course environment.
Complete this exercise before moving on to later exercises.
Set Up Your Environment
1. Before starting the exercises, run the course setup script in a terminal window:
$ ~/scripts/sparkdev/training_setup_sparkdev.sh
This script does the following:
- Sets up workspace projects for the course exercises in ~/exercises
- Makes sure the correct services are running for the cluster
This is the end of the Exercise
Hands-On Exercise: Viewing the
Spark Documentation
In this Exercise you will familiarize yourself with the Spark documentation.
1. Start Firefox in your Virtual Machine and visit the Spark documentation on your
local machine, using the provided bookmark or opening the URL
file:/usr/lib/spark/docs/_site/index.html
2. From the Programming Guides menu, select the Spark Programming Guide.
Briefly review the guide. You may wish to bookmark the page for later review.
3. From the API Docs menu, select either Scaladoc or Python API, depending on
your language preference. Bookmark the API page for use during class. Later
exercises will refer you to this documentation.
This is the end of the Exercise
Hands-On Exercise: Using the Spark
Shell
In this Exercise you will start the Spark Shell and read a file into a Resilient
Distributed Data Set (RDD).
You may choose to do this exercise using either Scala or Python. Follow the
instructions below for Python, or skip to the next section for Scala.
Most of the later exercises assume you are using Python, but Scala solutions are
provided on your virtual machine, so you should feel free to use Scala if you prefer.
Using the Python Spark Shell
1. In a terminal window, start the pyspark shell:
$ pyspark
You may get several INFO and WARNING messages, which you can disregard. If
you don't see the In[n]> prompt after a few seconds, hit Return a few times to
clear the screen output.
Note: Your environment is set up to use IPython shell by default. If you would
prefer to use the regular Python shell, set IPYTHON=0 before starting pyspark.
2. Spark creates a SparkContext object for you called sc. Make sure the object
exists:
pyspark> sc
Pyspark will display information about the sc object such as:
<pyspark.context.SparkContext at 0x2724490>
3. Using command completion, you can see all the available SparkContext methods:
type sc. (sc followed by a dot) and then the [TAB] key.
4. You can exit the shell by hitting Ctrl-D or by typing exit.
Using the Scala Spark Shell
1. In a terminal window, start the Scala Spark Shell:
$ spark-shell
You may get several INFO and WARNING messages, which you can disregard. If
you don't see the scala> prompt after a few seconds, hit Enter a few times to
clear the screen output.
2. Spark creates a SparkContext object for you called sc. Make sure the object
exists:
scala> sc
Scala will display information about the sc object such as:
res0: org.apache.spark.SparkContext =
org.apache.spark.SparkContext@2f0301fa
3. Using command completion, you can see all the available SparkContext methods:
type sc. (sc followed by a dot) and then the [TAB] key.
4. You can exit the shell by hitting Ctrl-D or by typing exit.
This is the end of the Exercise
Hands-On Exercise: Getting Started
with RDDs
Files Used in This Exercise:
Data files (local):
~/training_materials/sparkdev/data/frostroad.txt
~/training_materials/sparkdev/data/weblogs/2013-09-15.log
Solution:
solutions/LogIPs.pyspark
solutions/LogIPs.scalaspark
In this Exercise you will practice using RDDs in the Spark Shell.
You will start by reading a simple text file. Then you will use Spark to explore the
Apache web server output logs of the customer service site of a fictional mobile
phone service provider called Loudacre.
Load and view text file
1. Start the Spark Shell if you exited it from the previous exercise. You may use
either Scala (spark-shell) or Python (pyspark). These instructions assume
you are using Python.
2. Review the simple text file we will be using by viewing (without editing) the file
in a text editor. The file is located at:
~/training_materials/sparkdev/data/frostroad.txt.
3. Define an RDD to be created by reading in a simple test file:
pyspark> mydata = sc.textFile(\
"file:/home/training/training_materials/sparkdev/\
data/frostroad.txt")
4. Note that Spark has not yet read the file. It will not do so until you perform an
operation on the RDD. Try counting the number of lines in the dataset:
pyspark> mydata.count()
The count operation causes the RDD to be materialized (created and
populated). The number of lines should be displayed, e.g.
Out[4]: 23
5. Try executing the collect operation to display the data in the RDD. Note that
this returns and displays the entire dataset. This is convenient for very small
RDDs like this one, but be careful using collect for more typical large datasets.
pyspark> mydata.collect()
6. Using command completion, you can see all the available transformations and
operations you can perform on an RDD. Type mydata. and then the [TAB] key.
Explore the Loudacre web log files
In this exercise you will be using data in
~/training_materials/sparkdev/data/weblogs. Initially you will work
with the log file from a single day. Later you will work with the full data set
consisting of many days' worth of logs.
7. Review one of the .log files in the directory. Note the format of the lines, e.g.
116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100]
"GET /KBDOC-00031.html HTTP/1.0" 200 1388
"http://www.loudacre.com" "Loudacre CSR Browser"
In this line, 116.180.70.237 is the IP address, 128 is the user ID, and
"GET /KBDOC-00031.html HTTP/1.0" is the request.
8. Set a variable for the data file so you do not have to retype it each time.
pyspark> logfile="file:/home/training/\
training_materials/sparkdev/data/weblogs/\
2013-09-15.log"
9. Create an RDD from the data file.
pyspark> logs = sc.textFile(logfile)
10. Create an RDD containing only those lines that are requests for JPG files.
pyspark> jpglogs=\
logs.filter(lambda x: ".jpg" in x)
11. View the first 10 lines of the data using take:
pyspark> jpglogs.take(10)
12. Sometimes you do not need to store intermediate data in a variable, in which
case you can combine the steps into a single line of code. For instance, if all you
need is to count the number of JPG requests, you can execute this in a single
command:
pyspark> sc.textFile(logfile).filter(lambda x: \
".jpg" in x).count()
13. Now try using the map function to define a new RDD. Start with a very simple
map that returns the length of each line in the log file.
pyspark> logs.map(lambda s: len(s)).take(5)
This prints out an array of five integers corresponding to the first five lines in the
file.
14. That's not very useful. Instead, try mapping to an array of words for each line:
pyspark> logs.map(lambda s: s.split()).take(5)
This time it prints out five arrays, each containing the words in the
corresponding log file line.
15. Now that you know how map works, define a new RDD containing just the IP
addresses from each line in the log file. (The IP address is the first "word" in each
line).
pyspark> ips = logs.map(lambda s: s.split()[0])
pyspark> ips.take(5)
16. Although take and collect are useful ways to look at data in an RDD, their
output is not very readable. Fortunately, though, they return arrays, which you
can iterate through:
pyspark> for x in ips.take(10): print x
17. Finally, save the list of IP addresses as a text file:
pyspark> ips.saveAsTextFile(\
"file:/home/training/iplist")
18. In a terminal window, list the contents of the /home/training/iplist
folder. You should see multiple files. The one you care about is part-00000,
which should contain the list of IP addresses. "Part" (partition) files are
numbered because there may be results from multiple tasks running on the
cluster; you will learn more about this later.
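The field extraction used in the steps above can also be tried outside Spark. The following is a minimal sketch in plain Python (no SparkContext required) that applies the same split-based logic to the sample log line shown earlier in this exercise. The variable names are illustrative only and are not part of the course files. It is written for Python 3, whereas the shell examples in this exercise use Python 2 print syntax.

```python
# Sample line copied from the log format shown earlier in this exercise.
line = ('116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] '
        '"GET /KBDOC-00031.html HTTP/1.0" 200 1388 '
        '"http://www.loudacre.com" "Loudacre CSR Browser"')

# Same logic as the lambda passed to logs.map(): split on whitespace.
fields = line.split()

ip_address = fields[0]   # the first "word" on the line
user_id = fields[2]      # the third field on the line

# Same test as the lambda passed to logs.filter() for JPG requests.
is_jpg = ".jpg" in line

print(ip_address, user_id, is_jpg)
```

Checking the logic on one line like this is a quick way to debug a lambda before passing it to map or filter on the full RDD.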
If You Have More Time
If you have more time, attempt the following challenges:
1. Challenge 1: As you did in the previous step, save a list of IP addresses, but this
time, use the whole web log data set (weblogs/*) instead of a single day's log.
Tip: You can use the up-arrow to edit and execute previous commands.
You should only need to modify the lines that read and save the files.
2. Challenge 2: Use RDD transformations to create a dataset consisting of the IP
address and corresponding user ID for each request for an HTML file. (Disregard
requests for other file types). The user ID is the third field in each log file line.
Display the data in the form ipaddress/userid, e.g.:
165.32.101.206/8
100.219.90.44/102
182.4.148.56/173
246.241.6.175/45395
175.223.172.207/4115
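One possible way to approach the per-line logic for Challenge 2 is sketched below in plain Python 3. This is not the provided solution: format_html_request is a hypothetical helper name, the second sample line is made up for illustration, and the sketch assumes the requested path is always the seventh whitespace-separated field, as it is in the sample log line shown earlier. In the Spark Shell, the same test and formatting would typically be split between a filter and a map.

```python
def format_html_request(line):
    """Return 'ipaddress/userid' for an HTML request, or None otherwise.

    Hypothetical helper: assumes the requested path is the 7th
    whitespace-separated field, as in the sample log line.
    """
    fields = line.split()
    if len(fields) < 7 or not fields[6].endswith(".html"):
        return None
    return fields[0] + "/" + fields[2]


html_line = ('116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] '
             '"GET /KBDOC-00031.html HTTP/1.0" 200 1388 '
             '"http://www.loudacre.com" "Loudacre CSR Browser"')

# Made-up JPG request used only to show the "disregard" case.
jpg_line = ('10.0.0.1 - 42 [15/Sep/2013:23:59:53 +0100] '
            '"GET /theme.jpg HTTP/1.0" 200 100 '
            '"http://www.loudacre.com" "Loudacre CSR Browser"')

print(format_html_request(html_line))
print(format_html_request(jpg_line))
```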
Review the API Documentation for RDD Operations
1. Visit the Spark API page you bookmarked previously. Follow the link at the top
for the RDD class and review the list of available methods.
This is the end of the Exercise
Hands-On Exercise: Working with
Pair RDDs
Files Used in This Exercise:
Data files (local)
~/training_materials/sparkdev/data/weblogs/*
~/training_materials/sparkdev/data/accounts.csv
Solution:
solutions/UserRequests.pyspark
solutions/UserRequests.scalaspark
In this Exercise you will continue exploring the Loudacre web server log files,
as well as the Loudacre user account data, using key-value Pair RDDs.
This time, work with the entire set of data files in the weblogs folder rather than just
a single day's logs.
1. Using MapReduce, count the number of requests from each user.
a. Use map to create a Pair RDD with the user ID as the key, and the integer
1 as the value. (The user ID is the third field in each line.) Your data will
look something like this:
(userid,1)
(userid,1)
(userid,1)
b. Use reduceByKey to sum the values for each user ID. Your RDD data will be
similar to:
(userid,5)
(userid,7)
(userid,2)
2. Display the user IDs and hit count for the users with the 10 highest hit counts.
a. Use map to reverse the key and value, like this:
(5,userid)
(7,userid)
(2,userid)
b. Use sortByKey(False) to sort the swapped data by count.
3. Create an RDD where the user ID is the key, and the value is the list of all the IP
addresses that user has connected from. (IP address is the first field in each
request line.)
Hint: Map to (userid, ipaddress) and then use groupByKey.
(userid,20.1.34.55)
(userid,245.33.1.1)
(userid,65.50.196.141)
The grouped result will look something like this:
(userid,[20.1.34.55, 74.125.239.98])
(userid,[75.175.32.10, 245.33.1.1, 66.79.233.99])
(userid,[65.50.196.141])
4. The data set in
~/training_materials/sparkdev/data/accounts.csv consists of
information about Loudacre's user accounts. The first field in each line is the
user ID, which corresponds to the user ID in the web server logs. The other fields
include account details such as creation date, first and last name and so on.
Join the accounts data with the weblog data to produce a dataset keyed by user
ID which contains the user account information and the number of website hits
for that user.
a. Map the accounts data to key/value-list pairs: (userid, [values…])
(userid1,[userid1,2008-11-24 10:04:08,\N,Cheryl,West,4905
Olive Street,San Francisco,CA,…])
(userid2,[userid2,2008-11-23
14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl
Street,Richmond,CA,…])
(userid3,[userid3,2008-11-02 17:12:12,2013-07-18
16:42:36,Melissa,Roman,3539 James Martin
Circle,Oakland,CA,…])
b. Join the Pair RDD with the set of userid/hit counts calculated in the first
step.
(userid1,([userid1,2008-11-24
10:04:08,\N,Cheryl,West,4905 Olive Street,San
Francisco,CA,…],4))
(userid2,([userid2,2008-11-23
14:05:07,\N,Elizabeth,Kerns,4703 Eva Pearl
Street,Richmond,CA,…],8))
(userid3,([userid3,
2008-11-02 17:12:12,2013-07-18
16:42:36,Melissa,Roman,3539 James Martin
Circle,Oakland,CA,…],1))
c. Display the user ID, hit count, first name (4th value, index 3) and last name
(5th value, index 4) for the first 5 elements, e.g.:
userid1 4 Cheryl West
userid2 8 Elizabeth Kerns
userid3 1 Melissa Roman
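The counting, swapping, sorting, and grouping steps in this exercise can be traced on a handful of records without Spark. The sketch below is plain Python 3 and uses made-up (userid, ipaddress) records. It only mimics what the Pair RDD operations compute and says nothing about how Spark distributes the work.

```python
from collections import defaultdict

# Made-up (userid, ipaddress) records standing in for parsed log lines.
requests = [
    ("userid1", "20.1.34.55"),
    ("userid2", "245.33.1.1"),
    ("userid1", "74.125.239.98"),
    ("userid3", "65.50.196.141"),
    ("userid1", "20.1.34.55"),
]

# Step 1a: map each request to a (userid, 1) pair.
pairs = [(user, 1) for user, _ in requests]

# Step 1b: sum the values for each key (a per-key reduce).
hits = defaultdict(int)
for user, one in pairs:
    hits[user] += one

# Step 2: swap to (count, userid), then sort with the highest count first.
by_count = sorted(((n, user) for user, n in hits.items()), reverse=True)

# Step 3: collect every IP address under its user ID (a group by key).
ips_by_user = defaultdict(list)
for user, ip in requests:
    ips_by_user[user].append(ip)

print(by_count)
print(dict(ips_by_user))
```

Note that Python's sorted compares whole tuples, so ties on the count are broken by user ID here. Sorting a Pair RDD by key looks only at the key, so the order of tied entries may differ.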
If You Have More Time
If you have more time, attempt the following challenges:
1. Challenge 1: Use keyBy to create an RDD of account data with the postal code
(9th field in the CSV file) as the key.
Hint: refer to the Spark API for more information on the keyBy operation
Tip: Assign this new RDD to a variable for use in the next challenge
2. Challenge 2: Create a pair RDD with postal code as the key and a list of names
(Last Name,First Name) in that postal code as the value.
Hint: First name and last name are the 4th and 5th fields respectively
Optional: Try using the mapValues operation
3. Challenge 3: Sort the data by postal code, then for the first five postal codes,
display the code and list the names in that postal zone, e.g.
--- 85003
Jenkins,Thad
Rick,Edward
Lindsay,Ivy
--- 85004
Morris,Eric
Reiser,Hazel
Gregg,Alicia
Preston,Elizabeth
This is the end of the Exercise
Hands-On Exercise: Using HDFS
Files Used in This Exercise:
Data files (local)
~/training_materials/sparkdev/data/weblogs/*
Solution:
solutions/SparkHDFS.pyspark
solutions/SparkHDFS.scalaspark
In this Exercise you will practice working with files in HDFS, the Hadoop
Distributed File System.
Exploring HDFS
HDFS is already installed, configured, and running on your virtual machine.
1. Open a terminal window (if one is not already open) by double-clicking the
Terminal icon on the desktop.
2. Most of your interaction with the system will be through a command-line
wrapper called hdfs. If you run this program with no arguments, it prints a
help message. To try this, run the following command in a terminal window:
$ hdfs
3. The hdfs command is subdivided into several subsystems. The subsystem for
working with the files in the file system is called FsShell. This subsystem can be
invoked with the command hdfs dfs. In the terminal window, enter:
$ hdfs dfs
You see a help message describing all the commands associated with the
FsShell subsystem.
4. Enter:
$ hdfs dfs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple
entries, one of which is /user. Individual users have a "home" directory under
this directory, named after their username; your username in this course is
training, therefore your home directory is /user/training.
5. Try viewing the contents of the /user directory by running:
$ hdfs dfs -ls /user
You will see your home directory in the directory listing.
6. List the contents of your home directory by running:
$ hdfs dfs -ls /user/training
There are no files yet, so the command silently exits. This is different than if you
ran hdfs dfs -ls /foo, which refers to a directory that doesn't exist and
which would display an error message.
Note that the directory structure in HDFS has nothing to do with the directory
structure of the local filesystem; they are completely separate namespaces.
Uploading Files
Besides browsing the existing filesystem, another important thing you can do with
FsShell is to upload new data into HDFS.
7. Change directories to the local filesystem directory containing the sample data
for the course.
$ cd ~/training_materials/sparkdev/data
If you perform a regular Linux ls command in this directory, you will see a few
files, including the weblogs directory you used in previous exercises.
8. Insert this directory into HDFS:
$ hdfs dfs -put weblogs /user/training/weblogs
This copies the local weblogs directory and its contents into a remote HDFS
directory named /user/training/weblogs.
9. List the contents of your HDFS home directory now:
$ hdfs dfs -ls /user/training
You should see an entry for the weblogs directory.
10. Now try the same dfs -ls command but without a path argument:
$ hdfs dfs -ls
You should see the same results. If you do not pass a directory name to the -ls
command, it assumes you mean your home directory, i.e. /user/training.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands, they are
considered relative to your home directory.
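The relative-path rule above can be illustrated with a small plain Python 3 sketch. resolve_hdfs_path is a hypothetical helper written only to show which absolute HDFS path a given FsShell argument refers to. It is not part of Hadoop and does not reflect how HDFS itself resolves paths internally.

```python
import posixpath

HDFS_HOME = "/user/training"  # the home directory stated in this exercise


def resolve_hdfs_path(path, home=HDFS_HOME):
    """Return the absolute HDFS path that an FsShell argument refers to.

    Hypothetical helper for illustration: relative paths are taken
    relative to the home directory; absolute paths are left alone.
    """
    if not path:
        return home
    return posixpath.normpath(posixpath.join(home, path))


print(resolve_hdfs_path("weblogs"))   # relative: resolved under the home directory
print(resolve_hdfs_path("/user"))     # absolute: unchanged
print(resolve_hdfs_path(""))          # no argument: the home directory itself
```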
Viewing and Manipulating Files
Now view some of the data you just copied into HDFS.
11. Enter:
$ hdfs dfs -cat weblogs/2014-03-08.log | tail -n 50
This prints the last 50 lines of the file to your terminal. This command is useful
for viewing the output of Spark programs. Often, an individual output file is very
large, making it inconvenient to view the entire file in the terminal. For this
reason, it is often a good idea to pipe the output of the dfs -cat command into
head, tail, more, or less.
12. To download a file to work with on the local filesystem use the dfs -get
command. This command takes two arguments: an HDFS path and a local path. It
copies the HDFS contents into the local filesystem:
$ hdfs dfs -get weblogs/2013-09-22.log ~/logfile.txt
$ less ~/logfile.txt
13. There are several other operations available with the hdfs dfs command to
perform most common filesystem manipulations: mv, rm, cp, mkdir, and so on.
Enter:
$ hdfs dfs
This displays a brief usage report of the commands available within FsShell.
Try playing around with a few of these commands.
Accessing HDFS files in Spark
14. In the Spark Shell, create an RDD based on one of the files you uploaded to HDFS.
For example:
pyspark> logs=sc.textFile("hdfs://localhost/\
user/training/weblogs/2014-03-08.log")
15. Save the JPG requests in the dataset to HDFS:
pyspark> logs.filter(lambda s: ".jpg" in s).\
saveAsTextFile("hdfs://localhost/user/training/jpgs")
16. View the created directory and files it contains.
$ hdfs dfs -ls jpgs
$ hdfs dfs -cat jpgs/* | more
17. Optional: Explore the NameNode UI: http://localhost:50070. In
particular, try menu selection Utilities > Browse the Filesystem.
This is the end of the Exercise
Hands-On Exercise: Running Spark
Shell on a Cluster
In this Exercise you will start the Spark Standalone master and worker
daemons, explore the Spark Master and Spark Worker User Interfaces (UIs),
and start the Spark Shell on the cluster.
Note that in this course you are running a "cluster" on a single host. This would
never happen in a production environment, but is useful for exploration, testing, and
practicing.
Start the Spark Standalone Cluster
1. In a terminal window, start the Spark Master and Spark Worker daemons:
$ sudo service spark-master start
$ sudo service spark-worker start
Note: You can stop the services by replacing start with stop, or force the
service to restart by using restart. You may need to do this if you suspend
and restart the VM.
View the Spark Standalone Cluster UI
2. Start Firefox on your VM and visit the Spark Master UI by using the provided
bookmark or visiting http://localhost:18080/.
3. You should not see any applications in the Running Applications or Completed
Applications areas because you have not run any applications on the cluster yet.
4. A real-world Spark cluster would have several workers configured. In this class
we have just one, running locally, which is named by the date it started, the host
it is running on, and the port it is listening on. For example, the worker that
appears in the log output later in this exercise is named
worker-20140603111601-localhost-7078.
5. Click on the worker ID link to view the Spark Worker UI and note that there are
no executors currently running on the node.
6. Return to the Spark Master UI and take note of the URL shown at the top. You
may wish to select and copy it into your clipboard.
Start Spark Shell on the cluster
7. Return to your terminal window and exit Spark Shell if it is still running.
8. Start Spark Shell again, this time setting the MASTER environment variable to
the Master URL you noted in the Spark Standalone Web UI. For example, to start
pyspark:
$ MASTER=spark://localhost:7077 pyspark
Or the Scala shell:
$ spark-shell --master spark://localhost:7077
You will see additional info messages confirming registration with the Spark
Master. (You may need to hit Enter a few times to clear the screen log and see
the shell prompt.) For example:
…INFO cluster.SparkDeploySchedulerBackend: Connected to
Spark cluster with app ID app-20140604052124-0017
…INFO client.AppClient$ClientActor: Executor added:
app-20140604052124-0017/0 on worker-20140603111601-
localhost-7078 (localhost:7078) with 1 cores
9. You can confirm that you are connected to the correct master by viewing the
sc.master property:
pyspark> sc.master
10. Execute a simple operation to test execution on the cluster. For example,
pyspark> sc.textFile("weblogs/*").count()
11. Reload the Spark Standalone Master UI in Firefox and note that now the Spark
Shell appears in the list of running applications.
12. Click on the application ID (app-xxxxxxx) to see an overview of the
application, including the list of executors running (or waiting to run) tasks from
this application. In our small classroom cluster, there is just one, running on the
single node in the cluster, but in a real cluster there could be multiple executors
running on the cluster's nodes.
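When copying the Master URL from the web UI into the MASTER variable, it can help to check that the value has the expected spark://host:port shape. The following plain Python 3 sketch shows one way to do that with the standard library. parse_master_url is a hypothetical helper name, and the check reflects only the URL form used in this exercise, not every master setting Spark accepts.

```python
from urllib.parse import urlparse


def parse_master_url(url):
    """Split a standalone master URL into (host, port).

    Hypothetical helper: accepts only the spark://host:port form
    used in this exercise and raises ValueError for anything else.
    """
    parsed = urlparse(url)
    if parsed.scheme != "spark" or parsed.hostname is None or parsed.port is None:
        raise ValueError("not a standalone master URL: %r" % url)
    return parsed.hostname, parsed.port


host, port = parse_master_url("spark://localhost:7077")
print(host, port)
```

A value copied from the wrong place, such as the web UI address http://localhost:18080/, is rejected by this check because its scheme is not spark.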
This is the end of the Exercise
Hands-On Exercise: Working With
Partitions
Files Used in This Exercise:
Data files (local):
~/training_materials/sparkdev/data/activations
Stubs:
stubs/TopModels.pyspark
stubs/TopModels.scalaspark
Solutions:
solutions/TopModels.pyspark
solutions/TopModels.scalaspark
In this Exercise you will find the most common type of cellular device
activated in a given data set.
As you work through the exercise, you will explore RDD partitioning.
The Data
Review the data in ~/training_materials/sparkdev/data/activations.
Each XML file contains data for all the devices activated by customers during a
specific month.
1. Copy this data to HDFS.
Sample input data:
<activations>
<activation timestamp="1225499258" type="phone">
<account-number>316</account-number>
<device-id>
d61b6971-33e1-42f0-bb15-aa2ae3cd8680
</device-id>
<phone-number>5108307062</phone-number>
<model>iFruit 1</model>
</activation>
</activations>
The Task
Your code should go through a set of activation XML files and output the top
device models activated.
The output will look something like:
iFruit 1 (392)
Sorrento F00L (224)
MeeToo 1.0 (12)
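To make the data flow concrete, the sketch below parses the sample activation record with Python 3's standard library and prints each model in the "model (count)" form shown above. models_in is a hypothetical stand-in written for this illustration. The getactivations and getmodel functions provided in the TopModels stub may be implemented differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The sample record from this exercise, with the device ID on one line.
sample = """<activations>
  <activation timestamp="1225499258" type="phone">
    <account-number>316</account-number>
    <device-id>d61b6971-33e1-42f0-bb15-aa2ae3cd8680</device-id>
    <phone-number>5108307062</phone-number>
    <model>iFruit 1</model>
  </activation>
</activations>"""


def models_in(xml_text):
    """Return the model name of every activation element in the text.

    Hypothetical helper, not the stub's getactivations/getmodel.
    """
    root = ET.fromstring(xml_text)
    return [activation.find("model").text
            for activation in root.iter("activation")]


# Count occurrences of each model and print them, most common first.
counts = Counter(models_in(sample))
for model, n in counts.most_common():
    print("%s (%d)" % (model, n))
```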
1. Start with the TopModels stub script. Note that for convenience you have been
provided with functions to parse the XML, as that is not the focus of this Exercise.
Copy the stub code into the Spark Shell.
2. Read the XML files into an RDD, then call toDebugString on that RDD. This
will display the number of partitions, which will be the same as the number of
files that were read:
pyspark> print activations.toDebugString()
This will display the lineage (the list of dependent RDDs; this will be discussed
more in the next chapter). The one you care about here is the current RDD,
which is at the top of the list.
3. Use mapPartitions to map each partition to an XML Tree structure based on
parsing the contents of that partition as a string. You can call the provided
function getactivations by passing it the partition iterator from
mapPartitions; it will return an array of XML elements for each activation tag
in the partition. For example:
pyspark> activations.mapPartitions(lambda xml: \
getactivations(xml))
4. Map each activation tag to the model name of the device activated using the
provided getmodel function.
5. Call toDebugString on the new RDD and note that the partitioning has been
maintained: one partition for each file.
6. Count the number of occurrences of each model and display the top few. (Refer
to earlier chapters for a reminder on how to use MapReduce to count
occurrences if you need to.)
^0,!18,!top(n)!K,18)+!1)!+20@(.I!18,!$#!K)01!@)@*(.-!K)+,(0>!`)1,!18.1!I)*!&>
L2((!3,,+!1)!A,I!18,!ZBB!=I!5)*31>!
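The counting logic of this task can be sketched locally in plain Python, without a Spark cluster. This is only an illustration under assumptions: parse_models and top_models are hypothetical names (the course stub provides getactivations and getmodel, whose implementations are not shown here), and the sample string is a shortened stand-in for the input format above. In the Spark Shell the same idea would map each model to a (model, 1) pair, reduce by key, and swap each pair to (count, model) before calling top(n).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened, made-up sample in the same shape as the activation files.
SAMPLE = """<activations>
  <activation timestamp="1225499258" type="phone">
    <model>iFruit 1</model>
  </activation>
  <activation timestamp="1225499300" type="phone">
    <model>MeeToo 1.0</model>
  </activation>
  <activation timestamp="1225499400" type="phone">
    <model>iFruit 1</model>
  </activation>
</activations>"""

def parse_models(xml_text):
    """Return the model name of every <activation> element in one XML document."""
    root = ET.fromstring(xml_text)
    return [a.find("model").text for a in root.iter("activation")]

def top_models(xml_documents, n):
    """Count models across documents and return the n largest (count, model) pairs."""
    counts = Counter()
    for doc in xml_documents:
        counts.update(parse_models(doc))
    # Putting the count first mirrors keying the RDD by count before top(n).
    return sorted(((c, m) for m, c in counts.items()), reverse=True)[:n]

print(top_models([SAMPLE], 2))
```

Sorting on (count, model) tuples is the local equivalent of keying by count: the first tuple element decides the order.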
Note: Leave the Spark Shell running for the next exercise.
This is the end of the Exercise
Hands-On Exercise: Viewing Stages
in the Spark Application UI
Files Used in This Exercise:
Data files (HDFS):
activations/*
Solutions:
solutions/TopModels.pyspark
solutions/TopModels.scalaspark
In this exercise you will use the Spark Application UI to view the execution
stages for a job.
In the last Exercise, you wrote a script in the Spark Shell to parse XML files
containing device activation data, and count the number of each type of device
model activated. Now you will review the stages and tasks that were executed.
1. Make sure the Spark Shell is still running from the last Exercise. If it is not, or if
you did not complete the last Exercise, restart the shell and paste in the code
from the solution file for the previous Exercise.
2. In a browser, view the Spark Application UI: http://localhost:4040/
3. Make sure the Stages tab is selected.
4. Look in the Completed Stages section and you should see the stages of the
exercise you completed.
Things to note:
a. The stages are numbered, but numbers do not relate to the order of
execution. Note the times the stages were submitted.
b. The number of tasks in the first stage corresponds to the number of
partitions, which for this example corresponds to the number of files
processed.
c. The Shuffle Write column indicates how much data that stage copied
between tasks. This is useful to know because copying too much data
across the network can cause performance issues.
5. Click on the stages to view details about that stage. Things to note:
a. The Summary Metrics area shows you how much time was spent on
various steps. This can help you narrow down performance problems.
b. The Tasks area lists each task. The Locality Level column indicates
whether the process ran on the same node where the partition was
physically stored or not. Remember that Spark will always attempt to run
tasks where the data is, but may not always be able to if the node is busy.
c. In a real-world cluster, the executor column in the Task area would
display the different worker nodes which ran the tasks. (In this single-
node cluster, all tasks run on the same host.)
Note: Leave the Spark Shell running for the next exercise.
This is the end of the Exercise
Hands-On Exercise: Caching RDDs
Files Used in This Exercise:
Data files (HDFS):
activations/*
Solutions:
solutions/TopModels.pyspark
solutions/TopModels.scalaspark
In this exercise you will explore the performance effect of caching an RDD.
The easiest way to see caching in action is to compare the time it takes to complete
an operation on a cached and uncached RDD.
1. Make sure the Spark Shell is still running from the last exercise. If it isn't, restart
it and paste in the code from the solution file.
2. Execute a count action on the RDD containing the list of activated models:
pyspark> models.count()
3. Take note of the time it took to complete the count operation. The output will
look something like this:
14/04/07 05:47:17 INFO SparkContext: Job finished:
count at <ipython-input-3-986fd9b5da19>:1, took
17.718823392 s
4. Now cache the RDD:
pyspark> models.cache()
5. Execute the count again. This time should take about the same amount of time
the last one did. It may even take a little longer, because in addition to running
the operation, it is also caching the results.
6. Re-execute the count. Because the data is now cached, you should see a
substantial reduction in the amount of time the operation takes.
7. In your browser, view the Spark Application UI and select the Storage tab. You
will see a list of cached RDDs (in this case, just the models RDD you cached
above). Click the RDD description to see details about partitions and caching.
8. Click on the Executors tab and take note of the amount of memory used and
available for our one worker node.
Note that the classroom environment has a single worker node with a small
amount of memory allocated, so you may see that not all of the dataset is actually
cached in memory. In the real world, for good performance a cluster will have
more nodes, each with more memory, so that more of your active data can be
cached.
9. Optional: Set the RDD's persistence level to StorageLevel.DISK_ONLY and
compare both the compute times and the storage report in the Spark Application
Web UI. (Hint: Because you have already persisted the RDD at a different level,
you will need to unpersist first before you can set a new level.)
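If reading the elapsed time out of the log output is awkward, a small helper can time any action directly. The sketch below is an assumption-laden convenience, not part of the course materials: timed and slow_count are hypothetical names, and slow_count merely stands in for an expensive action so the snippet runs without Spark. In the pyspark shell you could pass models.count as the callable, once before and twice after models.cache().

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.time()
    result = action()
    return result, time.time() - start

def slow_count():
    """Stand-in for an expensive action such as counting an uncached RDD."""
    time.sleep(0.05)
    return 3

result, seconds = timed(slow_count)
print("count=%d took %.3f s" % (result, seconds))
```

Comparing the three timings (uncached, first cached run, second cached run) shows the pattern described in the steps above.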
This is the end of the Exercise
Hands-On Exercise: Checkpointing
RDDs
Files Used in This Exercise:
Stubs:
stubs/IterationTest.pyspark
stubs/IterationTest.scalaspark
Solutions:
solutions/IterationTest.pyspark
solutions/IterationTest.scalaspark
In this exercise you will see how checkpointing affects an RDD's lineage.
Create an iterative RDD that results in a stack overflow
1. Create an RDD by parallelizing an array of numbers.
pyspark> mydata = sc.parallelize([1,2,3,4,5])
2. Loop 200 times. Each time through the loop, create a new RDD based on the
previous iteration's result by adding 1 to each element.
pyspark> for i in range(200):
    mydata = mydata.map(lambda myInt: myInt + 1)
3. Collect and display the data in the RDD.
pyspark> for x in mydata.collect(): print x
4. Show the final RDD using toDebugString(). Note that the base RDD of the
lineage is the parallelized array, e.g. ParallelCollectionRDD[1]
5. Repeat the loop, which adds another 200 iterations to the lineage. Then collect
and display the RDD elements again. Did it work?
Tip: When an exception occurs, the application output may exceed your
terminal window’s scroll buffer. You can adjust the size of the scroll buffer by
selecting Edit > Profile Preferences > Scrolling and changing the Scrollback
lines setting.
6. Keep adding to the lineage by repeating the loop, and testing by collecting the
elements. Eventually, the collect() operation should give you an error
indicating a stack overflow.
In Python, the error message will likely report "excessively deep
recursion required".
In Scala the base exception will be StackOverflowError, which you will
have to scroll up in your shell window to see; the immediate exception
will probably be related to the Block Manager thread not responding,
such as "Error sending message to BlockManagerMaster".
7. Take note of the total number of iterations that resulted in the stack overflow.
Fix the stack overflow problem by checkpointing the
RDD
8. Exit and restart the Spark Shell following the error in the previous section.
9. Enable checkpointing by calling sc.setCheckpointDir("checkpoints")
10. Paste in the previous code to create the RDD.
11. As before, create an iterative RDD dependency, using at least the number of
iterations that previously resulted in stack overflow.
12. Inside the loop, add two steps that are executed every 10 iterations:
a. Checkpoint the RDD
b. Materialize the RDD by performing an action such as count.
13. After looping, collect and view the elements of the RDD to confirm the job
executes without a stack overflow.
14. Examine the lineage of the RDD; note that rather than going back to the base, it
goes back to the most recent checkpoint.
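The loop with periodic checkpointing might be shaped as in the sketch below. It is written against any object that offers map, checkpoint, count and collect, so it can be tried without Spark: FakeRDD is a hypothetical local stand-in that only records how often checkpoint was called, and iterate_with_checkpoints is an illustrative name, not a function from the course files. With a real RDD you would first call sc.setCheckpointDir and then pass in the RDD created by sc.parallelize.

```python
def iterate_with_checkpoints(rdd, iterations, every=10):
    """Add 1 to each element `iterations` times, checkpointing every `every` passes."""
    for i in range(1, iterations + 1):
        rdd = rdd.map(lambda x: x + 1)
        if i % every == 0:
            rdd.checkpoint()  # mark this RDD for checkpointing
            rdd.count()       # an action materializes it, so the checkpoint is written
    return rdd

class FakeRDD(object):
    """Hypothetical stand-in exposing the RDD method names used above."""
    checkpoints = 0

    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return FakeRDD(f(x) for x in self.data)

    def checkpoint(self):
        FakeRDD.checkpoints += 1

    def count(self):
        return len(self.data)

    def collect(self):
        return self.data

final = iterate_with_checkpoints(FakeRDD([1, 2, 3, 4, 5]), 200)
print(final.collect(), FakeRDD.checkpoints)
```

Calling checkpoint before the action matters: the checkpoint is only written when a job runs on that RDD.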
This is the end of the Exercise
Hands-On Exercise: Writing and
Running a Spark Application
Files and Directories Used in This Exercise:
Data files (HDFS)
/user/training/weblogs
Scala Project Directory:
~/exercises/projects/countjpgs
Scala Classes:
stubs.CountJPGs
solution.CountJPGs
Python Stub:
stubs/CountJPGs.py
Python Solution:
solutions/CountJPGs.py
In this exercise you will write your own Spark application instead of using the
interactive Spark Shell application.
Write a simple program that counts the number of JPG requests in a web log file. The
name of the file should be passed in to the program as an argument.
This is the same task you did earlier in the "Getting Started With RDDs" exercise.
The logic is the same, but this time you will need to set up the SparkContext object
yourself.
Depending on which programming language you are using, follow the appropriate
set of instructions below to write a Spark program.
Before running your program, be sure to exit from the Spark Shell.
Write a Spark application in Python
You may use any text editor you wish. If you don’t have an editor preference,
you may wish to use gedit, which includes language-specific support for Python.
1. A simple stub file to get started has been provided:
~/training_materials/sparkdev/stubs/CountJPGs.py. This stub
imports the required Spark class and sets up your main code block. Copy this
stub to your work area and edit it to complete this exercise.
2. Set up a SparkContext using the following code:
sc = SparkContext()
3. In the body of the program, load the file passed in to the program, count the
number of JPG requests, and display the count. You may wish to refer back to the
"Getting Started with RDDs" exercise for the code to do this.
4. Run the program, passing the name of the log file to process, e.g.:
$ spark-submit CountJPGs.py weblogs/*
5. By default, the program will run locally. Re-run the program, specifying the
cluster master in order to run it on the cluster:
$ spark-submit --master spark://localhost:7077 \
CountJPGs.py weblogs/*
6. Visit the Standalone Spark Master UI and confirm that the program is running on
the cluster.
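For orientation, a completed Python program could look roughly like the sketch below. It is not the provided solution file: is_jpg_request and main are hypothetical names, and treating any line that contains ".jpg" as a JPG request is an assumption about the log format. The pyspark import sits inside main so that the filter function can be tried on its own without Spark installed.

```python
import sys

def is_jpg_request(line):
    """Assumed rule: a log line is a JPG request if it mentions '.jpg'."""
    return ".jpg" in line

def main(logfile):
    # Imported here so the helper above can be exercised without Spark.
    from pyspark import SparkContext
    sc = SparkContext()
    count = sc.textFile(logfile).filter(is_jpg_request).count()
    print("Number of JPG requests: %d" % count)
    sc.stop()

# Only run the job when exactly one file argument was supplied.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Passing a named function to filter instead of a lambda keeps the matching rule easy to check by hand.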
Write a Spark application in Scala
You may use any text editor you wish. If you don’t have an editor preference,
you may wish to use gedit, which includes language-specific support for Scala.
If you are familiar with the IntelliJ IDEA IDE, you may choose to use that; the
provided project directories include IntelliJ configuration.
1. A Maven project to get started has been provided:
~/exercises/projects/countjpgs.
2. Edit the Scala code in src/main/scala/stubs/CountJPGs.scala.
3. Set up a SparkContext using the following code:
val sc = new SparkContext()
4. In the body of the program, load the file passed in to the program, count the
number of JPG requests, and display the count. You may wish to refer back to the
"Getting Started with RDDs" exercise for the code to do this.
5. From the countjpgs working directory, build your project using the following
command:
$ mvn package
6. If the build is successful, it will generate a JAR file called countjpgs-1.0.jar
in countjpgs/target. Run the program using the following command:
$ spark-submit \
--class stubs.CountJPGs \
target/countjpgs-1.0.jar weblogs/*
7. By default, the program will run locally. Re-run the program, specifying the
cluster master in order to run it on the cluster:
$ spark-submit \
--class stubs.CountJPGs \
--master spark://localhost:7077 \
target/countjpgs-1.0.jar weblogs/*
8. Visit the Standalone Spark Master UI and confirm that the program is running on
the cluster.
This is the end of the Exercise
Hands-On Exercise: Configuring
Spark Applications
Files Used in This Exercise:
Data files (HDFS)
/user/training/weblogs
Properties files (local)
spark.conf
log4j.properties
In this exercise you will practice setting various Spark configuration options.
You will work with the CountJPGs program you wrote in the prior Exercise.
Set configuration options at the command line
1. Rerun the CountJPGs Python or Scala program you wrote in the previous
exercise, this time specifying an application name. For example:
$ spark-submit --master spark://localhost:7077 \
--name 'Count JPGs' \
CountJPGs.py weblogs/*
2. Visit the Standalone Spark Master UI (http://localhost:18080/) and note
the application name listed is the one specified in the command line.
3. Optional: While the application is running, visit the Spark Application UI and
view the Environment tab. Take note of the spark.* properties such as
master, appName, and driver properties.
Set configuration options in a configuration file
4. Change directories to your exercise working directory. (If you are working in
Scala, that is the countjpgs project directory.)
5. Using a text editor, create a file in the working directory called myspark.conf,
containing settings for the properties shown below:
spark.app.name My Spark App
spark.ui.port 4141
spark.master spark://localhost:7077
6. Re-run your application, this time using the properties file instead of using the
script options to configure Spark properties:
$ spark-submit \
--properties-file myspark.conf \
CountJPGs.py \
weblogs/*
7. While the application is running, view the Spark Application UI at the alternate
port you specified to confirm that it is using the correct port:
http://localhost:4141
8. Also visit the Standalone Spark Master UI to confirm that the application
correctly ran on the cluster with the correct app name.
Optional: Set configuration properties programmatically
9. Following the example from the slides, modify the CountJPGs program to set the
application name and UI port programmatically.
a. First create a SparkConf object and set its spark.app.name and
spark.ui.port properties
b. Then use the SparkConf object when creating the SparkContext.
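One possible shape for the programmatic configuration in Python is sketched below. The names spark_settings and create_context are hypothetical, and the example values are simply the ones used in the properties file earlier in this exercise. The sketch relies on SparkConf.setAll, which takes a list of (key, value) pairs; the pyspark import is kept inside create_context so the settings list can be inspected without Spark installed. The example from the slides may differ.

```python
def spark_settings(app_name, ui_port):
    """Return the (property, value) pairs this exercise asks you to set."""
    return [("spark.app.name", app_name), ("spark.ui.port", str(ui_port))]

def create_context(app_name="My Spark App", ui_port=4141):
    """Build a SparkContext from a SparkConf carrying the settings above."""
    # Imported lazily so spark_settings can be tried without Spark.
    from pyspark import SparkConf, SparkContext
    conf = SparkConf().setAll(spark_settings(app_name, ui_port))
    return SparkContext(conf=conf)

print(spark_settings("My Spark App", 4141))
```

In the CountJPGs program, a call such as sc = create_context() would then replace sc = SparkContext().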
Set logging levels
10. Copy the template file
$SPARK_HOME/conf/log4j.properties.template to
log4j.properties in your exercise working directory.
11. Edit log4j.properties. The first line currently reads:
log4j.rootCategory=INFO, console
Replace INFO with DEBUG:
log4j.rootCategory=DEBUG, console
12. Rerun your Spark application. Because the current directory is on the Java
classpath, your log4j.properties file will set the logging level to DEBUG.
13. Notice that the output now contains both the INFO messages it did before and
DEBUG messages, e.g.:
14/03/19 11:40:45 INFO MemoryStore: ensureFreeSpace(154293) called
with curMem=0, maxMem=311387750
14/03/19 11:40:45 INFO MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 150.7 KB, free 296.8 MB)
14/03/19 11:40:45 DEBUG BlockManager: Put block broadcast_0 locally
took 79 ms
14/03/19 11:40:45 DEBUG BlockManager: Put for block broadcast_0
without replication took 79 ms
Debug logging can be useful when debugging, testing, or optimizing your code,
but in most cases generates unnecessarily distracting output.
14. Edit the log4j.properties file to replace DEBUG with WARN and try again.
This time notice that no INFO or DEBUG messages are displayed, only WARN
messages.
15. You can also set the log level for the Spark Shell by placing the
log4j.properties file in your working directory before starting the shell.
Try starting the shell from the directory in which you placed the file and note
that only WARN messages now appear.
Note: During the rest of the exercises, you may change these settings depending on
whether you find the extra logging messages helpful or distracting.
This is the end of the Exercise
Hands-On Exercise: Exploring Spark
Streaming
Files Used in This Exercise:
Solution Scala script:
solutions/SparkStreaming.scalaspark
In this exercise, you will explore Spark Streaming using the Scala Spark Shell.
This exercise has two parts:
• Review the Spark Streaming documentation
• A series of step-by-step instructions in which you will use Spark
Streaming to count words in a stream
Review the Spark Streaming documentation
1. View the Spark Streaming API by visiting the Spark Scaladoc API (which you
bookmarked previously in the class) and selecting the
org.apache.spark.streaming package in the package pane on the left.
2. Follow the links at the top of the package page to view the DStream and
PairDStreamFunctions classes; these will show you the functions available
on a DStream of regular RDDs and Pair RDDs respectively.
3. You may also wish to view the Spark Streaming Programming Guide (select
Programming Guides > Spark Streaming on the Spark documentation main
page).
Count words in a stream
For this section, you will simulate streaming text data through a network socket
using the nc command. This command takes input from the console (stdin) and
sends it to the port you specify, so that the text you type is sent to the client program
(which will be your Spark Streaming application).
4. In a terminal window, enter the command
$ nc -lkv 1234
Anything you type will be sent to port 1234. You will return to this window after
you have started your Spark Streaming Context.
5. Start a separate terminal for running the Spark Shell. Copy
/usr/lib/spark/conf/log4j.properties to the local directory and edit
it to set the logging level to ERROR. (This is to reduce the level of logging output,
which otherwise would make it difficult to see the interactive output from the
streaming job.)
6. Start the Spark Scala Shell. In order to use Spark Streaming interactively, you
need to either run the shell on a Spark cluster, or locally with at least two
threads. For this exercise, run locally with two threads, by typing:
$ spark-shell --master local[2]
7. In the Spark Shell, import the classes you need for this example. You may copy
the commands from these instructions, or if you prefer, copy from the solution
script file provided (SparkStreaming.scalaspark).
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds
8. Create a Spark Streaming Context, starting with the Spark Context provided by
the shell, with a batch duration of 5 seconds:
val ssc = new StreamingContext(sc,Seconds(5))
9. Create a DStream to read text data from port 1234 (the same port you are
sending text to using the nc command, started in the first step).
val mystream = ssc.socketTextStream("localhost",1234)
10. Use MapReduce to count the occurrence of words on the stream.
val words = mystream.flatMap(line => line.split("\\W"))
val wordCounts = words.map(x =>
(x, 1)).reduceByKey((x,y) => x+y)
11. Print out the first 10 word count pairs in each batch:
wordCounts.print()
12. Start the Streaming Context. This will trigger the DStream to connect to the
socket, and start batching and processing the input every 5 seconds. Call
awaitTermination to wait for the task to complete.
ssc.start()
ssc.awaitTermination()
13. Go back to the terminal window in which you started the nc command. You
should see a message indicating that nc accepted a connection from your
DStream.
14. Enter some text. Every five seconds you should see output in the Spark Shell
window such as:
-------------------------------------------
Time: 1396631265000 ms
-------------------------------------------
(never,1)
(purple,1)
(I,1)
(a,1)
(ve,1)
(seen,1)
(cow,1)
15. To restart the application, type Ctrl-C to exit Spark Shell, then restart the shell
and use command history or paste in the application commands again.
16. When you are done, close the nc process in the other terminal window.
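To see what the flatMap, map and reduceByKey chain computes for a single batch, the same logic can be tried on an ordinary list of lines in plain Python. This is a local illustration only: count_words is a hypothetical name, and, unlike the Scala split shown in the exercise, this sketch drops the empty strings that appear between adjacent separators.

```python
import re
from collections import Counter

def count_words(lines):
    """Split each line on non-word characters and count the resulting words."""
    counts = Counter()
    for line in lines:
        # Empty strings arise where two separators are adjacent; skip them.
        counts.update(w for w in re.split(r"\W", line) if w)
    return counts

batch = ["I've never seen a purple cow"]
for word, n in sorted(count_words(batch).items()):
    print((word, n))
```

Because the apostrophe is a non-word character, "I've" is split into "I" and "ve", which matches the sample output above.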
Hands-On Exercise: Writing a Spark
Streaming Application
Files and Directories Used in This Exercise:
Project:
~/exercises/projects/streaminglogs
Stub class:
stubs.StreamingLogs
Solution class:
solution.StreamingLogs
Test script:
~/training_materials/sparkdev/examples/streamtest.py
In this exercise, you will write a Spark Streaming application to count
Knowledge Base article requests by User ID.
Count Knowledge Base article requests
Now that you are familiar with using Spark Streaming, try a more realistic task: read
in streaming web server log data, and count the number of requests for Knowledge
Base articles.
To simulate a streaming data source, you will use the provided streamtest.py
Python script, which waits for a connection on the host and port specified and, once
it receives a connection, sends the contents of the file(s) specified to the client
(which will be your Spark Streaming application). You can specify the speed at
which the data should be sent (lines per second).
1. Stream the Loudacre weblog files at a rate of 20 per second. In a separate
terminal window, run:
$ python \
~/training_materials/sparkdev/examples/streamtest.py \
localhost 1234 20 \
/home/training/training_materials/sparkdev/data/weblogs/*
Note that this script exits after the client disconnects; you will need to restart the
script when you restart your Spark Application.
2. A Maven project folder has been provided for your Spark Streaming application:
exercises/projects/streaminglogs. To complete the exercise, start
with the stub code in src/main/scala/stubs/StreamingLogs.scala,
which imports the necessary classes and sets up the Streaming Context.
3. Create a DStream by reading the data from the host and port provided as input
parameters.
4. Filter the DStream to only include lines containing the string "KBDOC".
5. For each RDD in the filtered DStream, display the number of items, that is, the
number of requests for KB articles.
6. Save the filtered logs to text files.
7. To test your application, build your application JAR file using the mvn package
command. Run your application locally and be sure to specify two threads; at
least two threads or nodes are required to run a streaming application,
while our VM cluster has only one. The StreamingLogs application takes two
parameters: the host name and port number to connect the DStream to; specify
the same host and port that the test script is listening on.
$ spark-submit \
--class stubs.StreamingLogs \
--master local[2] \
target/streamlog-1.0.jar localhost 1234
(Use --class solution.StreamingLogs to run the solution class
instead.)
8. Verify the count output, and review the contents of the files.
9. Challenge: In addition to displaying the count every second (the batch duration),
count the number of KB requests over a window of 10 seconds. Print out the
updated 10 second total every 2 seconds.
a. Hint 1: Use the countByWindow function.
b. Hint 2: Use of window operations requires checkpointing. Use the
ssc.checkpoint(directory) function before starting the SSC to
enable checkpointing.
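The arithmetic behind the challenge (a 10 second window reported every 2 seconds over 1 second batches) can be checked with a small plain-Python simulation. This is only a model of the windowing idea, not Spark code: windowed_totals is a hypothetical name and the per-second counts are made up. In the application itself, countByWindow takes the window duration and the slide duration, for example Seconds(10) and Seconds(2).

```python
def windowed_totals(batch_counts, window=10, slide=2):
    """Given one count per 1-second batch, return the window total at each slide."""
    totals = []
    for end in range(slide, len(batch_counts) + 1, slide):
        start = max(0, end - window)  # early windows are shorter than `window`
        totals.append(sum(batch_counts[start:end]))
    return totals

per_second = [1] * 12  # made-up data: one KB request in each of twelve batches
print(windowed_totals(per_second))
```

The totals grow until a full window of batches has arrived and then level off, which is the behaviour to expect from the real window counts.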
This is the end of the Exercise
Hands-On Exercise: Iterative
Processing with Spark
Files Used in This Exercise:
Data files (local):
~/training_materials/sparkdev/data/devicestatus.txt
Stubs:
stubs/KMeansCoords.pyspark
stubs/KMeansCoords.scalaspark
Solutions:
solutions/KMeansCoords.pyspark
solutions/KMeansCoords.scalaspark
In this exercise, you will practice implementing iterative algorithms in Spark
by calculating k-means for a set of points.
Review the Data
Review the data file, then copy it to HDFS:
~/training_materials/sparkdev/data/devicestatus.txt.
This file contains a sample of device status data. For this exercise, the fields you care
about are the last two (fields 13 and 14), which represent the location (latitude and
longitude) of the device:
2014-03-15:13:10:20|Titanic 2500|15e758be-8624-46aa-80a3-
b6e08e979600|77|70|40|22|13|0|enabled|connected|enabled|38.92539179
59|-122.78959506
2014-03-15:13:10:20|Sorrento F41L|2d6862a6-2659-4e07-9c68-
6ea31e94cda0|4|16|23|enabled|enabled|connected|33|79|44|35.48129955
43|-120.306768128
Calculate k-means for device location
If you are already familiar with calculating k-means, try doing the exercise on your
own. Otherwise, follow the step-by-step process below.
1. Start by copying the provided KMeansCoords stub file, which contains the
following convenience functions used in calculating k-means:
• closestPoint: given a (latitude/longitude) point and an array of
current center points, returns the index in the array of the center closest
to the given point
• addPoints: given two points, return a point which is the sum of the two
points, that is, (x1+x2, y1+y2)
• distanceSquared: given two points, returns the squared distance of
the two. This is a common calculation required in graph analysis.
2. Set the variable K (the number of means to calculate). For this exercise we
recommend you start with 5.
3. Set the variable convergeDist. This will be used to decide when the k-means
calculation is done: when the amount the locations of the means changes
between iterations is less than convergeDist. A "perfect" solution would be 0;
this number represents a "good enough" solution. We recommend starting with
a value of 0.1.
4. Parse the input file, which is delimited by the character '|', into
(latitude,longitude) pairs (the 13th and 14th fields in each line). Only include
known locations (that is, filter out (0,0) locations). Be sure to cache the
resulting RDD because you will access it each time through the iteration.
5. Create a K-length array called kPoints by taking a random sample of K location
points from the RDD as starting means (center points). E.g.
data.takeSample(False, K, 42)
6. Iteratively calculate a new set of K means until the total distance between the
means calculated for this iteration and the last is smaller than convergeDist.
For each iteration:
a. For each coordinate point, use the provided closestPoint function to
map each point to the index in the kPoints array of the location closest
to that point. The resulting RDD should be keyed by the index, and the
value should be the pair: (point, 1). (The value '1' will later be used to
count the number of points closest to a given mean.) E.g.
(1, ((37.43210, -121.48502), 1))
(4, ((33.11310, -111.33201), 1))
(0, ((39.36351, -119.40003), 1))
(1, ((40.00019, -116.44829), 1))
b. Reduce the result: for each center in the kPoints array, sum the latitudes
and longitudes, respectively, of all the points closest to that center, and
the number of closest points. E.g.
(0, ((2638919.87653,-8895032.182481), 74693))
(1, ((3654635.24961,-12197518.55688), 101268))
(2, ((1863384.99784,-5839621.052003), 48620))
(3, ((4887181.82600,-14674125.94873), 126114))
(4, ((2866039.85637,-9608816.13682), 81162))
c. The reduced RDD should have (at most) K members. Map each to a new
center point by calculating the average latitude and longitude for each set
of closest points: that is, map (index,(totalX,totalY),n) to
(index,(totalX/n, totalY/n))
d. Collect these new points into a local map or array keyed by index.
e. Use the provided distanceSquared method to calculate how much
each center "moved" between the current iteration and the last. That is,
for each center in kPoints, calculate the distance between that point
and the corresponding new point, and sum those distances. That is the
delta between iterations; when the delta is less than convergeDist,
stop iterating.
f. Copy the new center points to the kPoints array in preparation for the
next iteration.
7. When the iteration is complete, display the final K center points.
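The iteration in step 6 can be rehearsed on a small local list before writing it against an RDD. The plain-Python sketch below follows sub-steps a to f; every function in it is an illustrative reimplementation under assumed behaviour (the stub's closestPoint, addPoints and distanceSquared are not reproduced here), and the four sample points are made up.

```python
def distance_squared(p1, p2):
    """Squared distance between two (x, y) points."""
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2

def add_points(p1, p2):
    """Component-wise sum of two points."""
    return (p1[0] + p2[0], p1[1] + p2[1])

def closest_point(p, centers):
    """Index of the center nearest to point p."""
    return min(range(len(centers)), key=lambda i: distance_squared(p, centers[i]))

def kmeans(points, k_points, converge_dist=0.1):
    """Iterate until the summed squared movement of the centers is below converge_dist."""
    k_points = list(k_points)
    delta = float("inf")
    while delta >= converge_dist:
        # a, b: key each point by its closest center; sum coordinates and counts.
        sums = {}
        for p in points:
            i = closest_point(p, k_points)
            total, n = sums.get(i, ((0.0, 0.0), 0))
            sums[i] = (add_points(total, p), n + 1)
        # c, d: average each group to get the new center, keyed by index.
        new_points = {i: (t[0] / n, t[1] / n) for i, (t, n) in sums.items()}
        # e: how far the centers moved in this iteration.
        delta = sum(distance_squared(k_points[i], new_points[i]) for i in new_points)
        # f: adopt the new centers for the next pass.
        for i, p in new_points.items():
            k_points[i] = p
    return k_points

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
print(kmeans(points, [(0.0, 0.0), (10.0, 10.0)]))
```

In the Spark version, the inner loop over points becomes a map to (index, (point, 1)) followed by reduceByKey, and only the K new centers are collected back to the driver.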
This is the end of the Exercise
Hands-On Exercise: Using Broadcast
Variables
Files Used in This Exercise:
Data files (HDFS):
weblogs/*
Data files (local):
~/training_materials/sparkdev/data/targetmodels.txt
Stubs:
stubs/TargetModels.pyspark
stubs/TargetModels.scalaspark
Solutions:
solutions/TargetModels.pyspark
solutions/TargetModels.scalaspark
In this exercise you will filter web requests to include only those from devices
included in a list of target models.
Loudacre wants to do some analysis on web traffic produced from specific devices.
The list of target models is in
~/training_materials/sparkdev/data/targetmodels.txt
Filter the web server logs to include only those requests from devices in the list.
(The model name of the device will be in the line in the log file.) Use a broadcast
variable to pass the list of target devices to the workers that will run the filter tasks.
Hint: Use the stub file for this exercise in
~/training_materials/sparkdev/stubs for the code to load in the list of
target models.
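The filter itself can be tried locally before wiring it to a broadcast variable. In the sketch below, matches_target is a hypothetical helper and both the target list and the log lines are made-up stand-ins for targetmodels.txt and the weblogs. With Spark, the list would be wrapped once with sc.broadcast(targets) and read inside the filter function through the broadcast variable's value attribute.

```python
def matches_target(line, target_models):
    """True if the log line mentions any model in the target list."""
    return any(model in line for model in target_models)

targets = ["iFruit 1", "MeeToo 1.0"]  # stand-in for the contents of targetmodels.txt
logs = [
    "GET /KBDOC-00001.html iFruit 1",     # made-up log lines
    "GET /index.html Sorrento F00L",
]
print([line for line in logs if matches_target(line, targets)])
```

Broadcasting matters because the target list is shipped to each worker once rather than with every task.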
This is the end of the Exercise
Hands-On Exercise: Using
Accumulators
Files Used in This Exercise:
Data files (HDFS):
weblogs/*
Solutions:
RequestAccumulator.pyspark
RequestAccumulator.scalaspark
In this exercise you will count the number of different types of files requested
in a set of web server logs.
Using accumulators, count the number of each type of file (HTML, CSS and JPG)
requested in the web server log files.
Hint: use the file extension string to determine the type of request,
i.e. .html, .css, .jpg.
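The classification step from the hint can be prototyped without Spark. In this sketch file_type is a hypothetical helper, the request lines are made up, and an ordinary dictionary stands in for the three accumulators. With Spark, each counter would be created with sc.accumulator(0) and incremented with its add method from inside a foreach over the log lines.

```python
def file_type(line):
    """Return 'html', 'css' or 'jpg' based on the extension found in the line, else None."""
    for ext in ("html", "css", "jpg"):
        if "." + ext in line:
            return ext
    return None

counts = {"html": 0, "css": 0, "jpg": 0}  # local stand-in for three accumulators
for line in ["GET /a.html", "GET /b.jpg", "GET /c.css", "GET /d.jpg", "GET /e.png"]:
    kind = file_type(line)
    if kind is not None:
        counts[kind] += 1
print(counts)
```

Accumulators suit this task because workers only ever add to them, and the driver reads the totals after the action completes.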
This is the end of the Exercise
Hands-On Exercise: Importing Data
With Sqoop
Files Used in This Exercise:
Solutions:
solutions/sqoop-movie-import.sh
solutions/AverageMovieRatings.pyspark
solutions/AverageMovieRatings.scalaspark
In this Exercise you will import data from a relational database using Sqoop.
The data you load here will be used in subsequent exercises.

Consider the MySQL database movielens, derived from the MovieLens project
from University of Minnesota. (See note at the end of this Exercise.) The database
consists of several related tables, but we will import only two of these: movie,
which contains information on about 3,900 movies; and movierating, which has
about 1,000,000 ratings of those movies.
Review the Database Tables
First, review the database tables to be loaded into Hadoop.
1. Log in to MySQL:
$ mysql --user=training --password=training movielens
2. Review the structure and contents of the movie table:
mysql> DESCRIBE movie;
. . .
mysql> SELECT * FROM movie LIMIT 5;
3. Note the column names for the table:
____________________________________________
4. Review the structure and contents of the movierating table:
mysql> DESCRIBE movierating;
mysql> SELECT * FROM movierating LIMIT 5;
5. Note these column names:
____________________________________________
6. Exit mysql:
mysql> quit
Import with Sqoop
You invoke Sqoop on the command line to perform several commands. With it you
can connect to your database server to list the databases (schemas) to which you
have access, and list the tables available for loading. For database access, you
provide a connection string to identify the server and, if required, your username
and password.
1. Show the commands available in Sqoop:
$ sqoop help
2. List the databases (schemas) in your database server:
$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training
(Note: Instead of entering --password training on the command line, you
may prefer to enter -P, and let Sqoop prompt you for the password, which is
then not displayed when you type it.)
3. List the tables in the movielens database:
$ sqoop list-tables \
--connect jdbc:mysql://localhost/movielens \
--username training --password training
4. Import the movie table into HDFS:
$ sqoop import \
--connect jdbc:mysql://localhost/movielens \
--username training --password training \
--fields-terminated-by '\t' --table movie
5. Verify that the command has worked. Note that like Spark output, Sqoop output
is stored in multiple partition files rather than a single file. Take note of the
format of the file: movieID[tab]name[tab]year
$ hdfs dfs -ls movie
$ hdfs dfs -tail movie/part-m-00000
6. Import the movierating table into HDFS by repeating the last two steps, but
for the movierating table.
Read and process the data in Spark
7. Start the Spark Shell.
8. Read in all the movie ratings, keyed by movie ID. (Split the input line on the tab
character: \t)
9. Challenge: calculate the average rating for each movie.
10. Challenge: Save the average ratings to files in the format:
movieID[tab]name[tab]rating
(Hint: join with the data from the movie table)
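
Steps 8 to 10 could be approached along the following lines in Python. This is a sketch under stated assumptions rather than the provided solution: the function names are hypothetical, the movierating columns are assumed to be userID, movieID, rating in that order (check this against the column names you noted earlier), the HDFS directory names are assumed to match the table names, and sc is the SparkContext predefined in the Spark Shell.

```python
def parse_rating(line):
    """Return (movieID, rating). ASSUMPTION: columns are userID, movieID, rating."""
    fields = line.split("\t")
    return int(fields[1]), float(fields[2])


def parse_movie(line):
    """Return (movieID, name) from a movieID[tab]name[tab]year line."""
    fields = line.split("\t")
    return int(fields[0]), fields[1]


def format_row(item):
    """Turn (movieID, (name, average)) into movieID[tab]name[tab]rating."""
    movie_id, (name, average) = item
    return "{0}\t{1}\t{2}".format(movie_id, name, average)


def average_ratings(sc, ratings_path="movierating", movies_path="movie"):
    """Return an RDD of formatted lines, one per movie that has ratings."""
    ratings = sc.textFile(ratings_path).map(parse_rating)
    # Carry (sum, count) per movie so the average needs only one reduce.
    totals = ratings.mapValues(lambda r: (r, 1)).reduceByKey(
        lambda a, b: (a[0] + b[0], a[1] + b[1]))
    averages = totals.mapValues(lambda t: t[0] / t[1])
    movies = sc.textFile(movies_path).map(parse_movie)
    # join pairs each movie name with its average by movieID.
    return movies.join(averages).map(format_row)
```

The result could then be written out with saveAsTextFile on the returned RDD. Because join is an inner join, movies with no ratings do not appear in the output.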
This is the end of the Exercise
Note:
This exercise uses the MovieLens data set, or subsets thereof. This data is freely
available for academic purposes, and is used and distributed by Cloudera with
the express permission of the UMN GroupLens Research Group. If you would
like to use this data for your own research purposes, you are free to do so, as
long as you cite the GroupLens Research Group in any resulting publications. If
you would like to use this data for commercial purposes, you must obtain
explicit permission. You may find the full dataset, as well as detailed license
terms, at http://www.grouplens.org/node/73
