Data Analyst Training Exercise Manual

Data_Analyst_Training_Exercise_Manual%20

Data_Analyst_Training_Exercise_Manual%20

Data_Analyst_Training_Exercise_Manual

Data_Analyst_Training_Exercise_Manual

Data_Analyst_Training_Exercise_Manual%20

User Manual:

Open the PDF directly: View PDF PDF.
Page Count: 76

Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
1
Cloudera Data Analyst Training:
Using Pig, Hive, and Impala
with Hadoop
Hands-On Exercises
!"#"$%&'()*"+',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'-!
.%#/+01#'23"$45+"6'7%*%'8#9"+*':5*;'.%/))<'=))&+',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'>!
.%#/+01#'23"$45+"6'?+5#9'@59'A)$'2=B'@$)4"++5#9',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'CD!
.%#/+01#'23"$45+"6'E#%&FG5#9'E/'H%I<%59#'7%*%'J5*;'@59',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'-C!
.%#/+01#'23"$45+"6'E#%&FG5#9'75+<%$%*"'7%*%'K"*+'J5*;'@59',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'-L!
.%#/+01#'23"$45+"6'23*"#/5#9'@59'J5*;'K*$"%I5#9'%#/'?7M+',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'DN!
.%#/+01#'23"$45+"6'OP##5#9'.5Q"'RP"$5"+'A$)I'*;"'K;"&&S'K4$5<*+S'%#/'.P"',,,,,,,,,,,,,,,,,,'DT!
.%#/+01#'23"$45+"6'7%*%'U%#%9"I"#*'J5*;'.5Q"',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'NN!
1<*5)#%&'.%#/+01#'23"$45+"6'!%5#5#9'8#+59;*'J5*;'K"#*5I"#*'E#%&F+5+',,,,,,,,,,,,,,,,,,,,,,,,,,,'>C!
.%#/+01#'23"$45+"6'7%*%'=$%#+A)$I%*5)#'J5*;'.5Q"',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'>>!
.%#/+01#'23"$45+"6'8#*"$%4*5Q"'E#%&F+5+'J5*;'8I<%&%',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'VD!
7%*%'U)/"&'O"A"$"#4"',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'WX!
O"9P&%$'23<$"++5)#'O"A"$"#4"',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'WN!
"#$%#&!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
2
General Notes
'()*+,-./0!1-.23234!5)*-0,0!*0,!.!62-1*.(!7.5823,!9:;<!=218!.!-,5,31!6,-02)3!)>!'?@!.(-,.+A!
2301.((,+!.3+!5)3>24*-,+!>)-!A)*B!C8,!:;!-*30!23!D0,*+)E+201-2F*1,+!7)+,G!.!5)3>24*-.12)3!
18.1!,3.F(,0!.!@.+))D!5(*01,-!1)!-*3!)3!.!0234(,!7.5823,B!
Points to Note While Working in the VM
C,' C8,!:;!20!0,1!1)!.*1)7.125.((A!()4!23!.0!18,!*0,-!trainingB!H8)*(+!A)*!()4!)*1G!A)*!5.3!
()4!F.5I!23!.0!18,!*0,-!training!=218!18,!D.00=)-+!trainingB!C8,!-))1!D.00=)-+!20!
.(0)!trainingG!18)*48!A)*!5.3!D-,>2J!.3A!5)77.3+!=218!sudo!1)!-*3!21!.0!-))1B!
-,' KJ,-520,0!)>1,3!5)31.23!01,D0!=218!5)77.3+0!18.1!())I!(2I,!1820L!
$ hadoop fs -put accounting_reports_taxyear_2013 \
/user/training/tax_analysis/
C8,!$!0A7F)(!-,D-,0,310!18,!5)77.3+!D-)7D1B!?)!not!235(*+,!1820!58.-.51,-!=8,3!
5)DA234!.3+!D.01234!5)77.3+0!231)!A)*-!1,-723.(!=23+)=B!M(0)G!18,!F.5I0(.08!02432>2,0!
18.1!18,!5)77.3+!5)3123*,0!)3!18,!3,J1!(23,B!N)*!7.A!,218,-!,31,-!18,!5)+,!.0!08)=3!
9)3!1=)!(23,0<G!)-!)721!18,!F.5I0(.08!.3+!1AD,!18,!5)77.3+!)3!.!0234(,!(23,B!!
D,' M(18)*48!7.3A!01*+,310!.-,!5)7>)-1.F(,!*0234!OPQR!1,J1!,+21)-0!(2I,!62!)-!,7.50G!0)7,!
72481!D-,>,-!.!4-.D825.(!1,J1!,+21)-B!C)!236)I,!18,!4-.D825.(!,+21)-!>-)7!18,!5)77.3+!
(23,G!1AD,!gedit!>)(()=,+!FA!18,!D.18!)>!18,!>2(,!A)*!=208!1)!,+21B!MDD,3+234!&!1)!18,!
5)77.3+!.(()=0!A)*!1)!1AD,!.++212)3.(!5)77.3+0!=82(,!18,!,+21)-!20!012((!)D,3B!@,-,!20!
.3!,J.7D(,!)>!8)=!1)!,+21!.!>2(,!3.7,+!myfile.txtL!
$ gedit myfile.txt &
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
3
Class-Specific VM Customization
N)*-!:;!20!*0,+!23!0,6,-.(!)>!'()*+,-./0!1-.23234!5(.00,0B!C820!D.-125*(.-!5(.00!+),0!3)1!
-,S*2-,!0)7,!)>!18,!0,-625,0!18.1!01.-1!FA!+,>.*(1G!=82(,!)18,-!0,-625,0!18.1!+)!3)1!01.-1!FA!
+,>.*(1!.-,!8,(D>*(!>)-!1820!5(.00B!T,!8.6,!05-2D1,+!18,!5(.00E0D,52>25!5*01)72U.12)30!.3+!
01-)34(A!-,5)77,3+!18.1!A)*!-*3!18,!>)(()=234!5)77.3+!=8,3,6,-!A)*!01.-1!18,!:;L!
$ ~/scripts/analyst/toggle_services.sh
C820!=2((!5)30,-6,!7,7)-A!.3+!235-,.0,!D,->)-7.35,!)>!18,!62-1*.(!7.5823,B!M>1,-!-*33234!
1820!5)77.3+G!A)*!7.A!0.>,(A!243)-,!.3A!7,00.4,0!.F)*1!0,-625,0!18.1!8.6,!.(-,.+A!F,,3!
01.-1,+!)-!08*1!+)=3B!
Points to Note During the Exercises
K%I<&"'K)&P*5)#+'
Q>!A)*!3,,+!.!8231!)-!=.31!1)!58,5I!A)*-!=)-IG!18,!sample_solution!0*F+2-,51)-A!=21823!
,.58!,J,-520,!+2-,51)-A!5)31.230!5)7D(,1,!5)+,!0.7D(,0B!
H%*4;0P<'K4$5<*'
Q>!A)*!.-,!*3.F(,!1)!5)7D(,1,!.3!,J,-520,G!=,!8.6,!D-)62+,+!.!05-2D1!1)!5.158!A)*!*D!
.*1)7.125.((AB!K.58!,J,-520,!8.0!2301-*512)30!>)-!-*33234!18,!5.158E*D!05-2D1B!
$ADIR'2#Q5$)#I"#*'Y%$5%Z&"'
$ADIR!20!.!08)-15*1!18.1!D)2310!1)!18,!/home/training/training_materials/
analyst!+2-,51)-AG!=8258!5)31.230!18,!5)+,!.3+!+.1.!A)*!=2((!*0,!23!18,!,J,-520,0B!!
M"J"$'K*"<0ZF0K*"<'8#+*$P4*5)#+'%+'[)P':)$\'=;$)P9;'=;"+"'23"$45+"+'
M0!18,!,J,-520,0!D-)4-,00G!.3+!A)*!4.23!7)-,!>.72(2.-21A!=218!18,!1))(0!A)*/-,!*0234G!=,!
D-)62+,!>,=,-!01,DEFAE01,D!2301-*512)30B!N)*!08)*(+!>,,(!>-,,!1)!.0I!A)*-!2301-*51)-!>)-!
.00201.35,!.1!.3A!127,G!)-!1)!5)30*(1!=218!A)*-!>,(()=!01*+,310B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
4
])#P+'23"$45+"+'
;.3A!)>!18,!,J,-520,0!5)31.23!)3,!)-!7)-,!)D12)3.(!VF)3*0W!0,512)30B!T,!,35)*-.4,!A)*!1)!
=)-I!18-)*48!18,0,!2>!127,!-,7.230!.>1,-!A)*!>23208!18,!7.23!,J,-520,!.3+!=)*(+!(2I,!.3!
.++212)3.(!58.((,34,!1)!D-.5125,!=8.1!A)*!8.6,!(,.-3,+B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
5
Hands-On Exercise: Data Ingest With
Hadoop Tools!
8#'*;5+'"3"$45+"'F)P'J5&&'<$%4*54"'P+5#9'*;"'.%/))<'4)II%#/'&5#"'P*5&5*F'*)'5#*"$%4*'
J5*;'.%/))<^+'75+*$5ZP*"/'M5&"+F+*"I'_.7MK`'%#/'P+"'Ka))<'*)'5I<)$*'*%Z&"+'A$)I'%'
$"&%*5)#%&'/%*%Z%+"'*)'.7MK,'
Prepare your Virtual Machine
X.*358!18,!:;!2>!A)*!8.6,3/1!.(-,.+A!+)3,!0)G!.3+!18,3!-*3!18,!>)(()=234!5)77.3+!1)!F))01!
D,->)-7.35,!FA!+20.F(234!0,-625,0!18.1!.-,!3)1!3,,+,+!>)-!1820!5(.00L!
$ ~/scripts/analyst/toggle_services.sh
Step 1: Exploring HDFS
C,' YD,3!.!1,-723.(!=23+)=!92>!)3,!20!3)1!.(-,.+A!)D,3<!FA!+)*F(,E5(25I234!18,!C,-723.(!
25)3!)3!18,!+,0I1)DB!P,J1G!58.34,!1)!18,!+2-,51)-A!>)-!1820!,J,-520,!FA!-*33234!18,!
>)(()=234!5)77.3+L!
$ cd $ADIR/exercises/data_ingest
-,' N)*!5.3!*0,!18,!hadoop!fs!5)77.3+!1)!231,-.51!=218!18,!@.+))D!?201-2F*1,+!
Z2(,0A01,7!9@?ZH<!>-)7!18,!5)77.3+!(23,B!Q3!18,!1,-723.(!=23+)=G!,31,-L!
$ hadoop fs
C820!+20D(.A0!.!8,(D!7,00.4,!+,05-2F234!.((!0*F5)77.3+0!.00)52.1,+!=218!hadoop fsB!
D,' [*3!18,!>)(()=234!5)77.3+L!
$ hadoop fs -ls /
C820!(2010!18,!5)31,310!)>!18,!@?ZH!-))1!+2-,51)-AB!Y3,!)>!18,!+2-,51)-2,0!(201,+!20!/userB!
K.58!*0,-!)3!18,!5(*01,-!8.0!.!\8)7,/!+2-,51)-A!F,()=!/user!5)--,0D)3+234!1)!820!)-!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
6
8,-!*0,-!Q?B!H235,!A)*-!*0,-!Q?!)3!18,!5(*01,-!20!trainingG!A)*-!8)7,!+2-,51)-A!23!
@?ZH!20!/user/trainingB!
N,' C)!0,,!18,!5)31,310!)>!A)*-!8)7,!+2-,51)-A!9=8258!20!5*--,31(A!,7D1A<G!-*3!18,!
>)(()=234!5)77.3+L!
$ hadoop fs -ls /user/training
>,' Q>!A)*!+)!3)1!0D,52>A!.!D.18G!hadoop fs!.00*7,0!A)*!.-,!-,>,--234!1)!A)*-!8)7,!
+2-,51)-AB!C8,-,>)-,G!18,!>)(()=234!5)77.3+!20!,S*26.(,31!1)!18,!)3,!.F)6,L!
$ hadoop fs -ls
V,' ;)01!)>!A)*-!=)-I!=2((!F,!23!18,!/dualcore!+2-,51)-AG!0)!5-,.1,!18.1!3)=L!
$ hadoop fs -mkdir /dualcore
W,' P,J1G!.++!.!T,F!0,-6,-!()4!>2(,!1)!1820!3,=!+2-,51)-A!23!@?ZHL!
$ hadoop fs -put $ADIR/data/access.log /dualcore
Overwriting Files in Hadoop
Unlike the UNIX shell, Hadoop won’t overwrite files and directories. This feature helps
protect users from accidentally replacing data that may have taken hours to produce. If
you need to replace a file or directory in HDFS, you must first remove the existing one.
Please keep this in mind in case you make a mistake and need to repeat a step during
the Hands-On Exercises.
To remove a file:
$ hadoop fs -rm /dualcore/example.txt
To remove a directory (recursively):
$ hadoop fs -rm -r /dualcore/example/
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
7
L,' :,-2>A!18,!(.01!01,D!FA!(201234!18,!5)31,310!)>!18,!/dualcore!+2-,51)-A!.4.23B!N)*!
08)*(+!)F0,-6,!18.1!18,!access.log!>2(,!20!D-,0,31!.3+!)55*D2,0!$#]G&&^G%]_!FA1,0!)>!
0D.5,!23!@?ZHL!
$ hadoop fs -ls /dualcore
T,' C)!D-.5125,!-,7)6234!.!>2(,G!A)*!7.A!3)=!+,(,1,!18,!>2(,!A)*!`*01!.++,+B!
$ hadoop fs -rm /dualcore/access.log
Step 2: Importing Database Tables into HDFS with Sqoop
?*.(5)-,!01)-,0!23>)-7.12)3!.F)*1!210!,7D()A,,0G!5*01)7,-0G!D-)+*510G!.3+!)-+,-0!23!.!
;AHaX!+.1.F.0,B!Q3!18,!3,J1!>,=!01,D0G!A)*!=2((!,J.723,!1820!+.1.F.0,!F,>)-,!*0234!HS))D!1)!
27D)-1!210!1.F(,0!231)!@?ZHB!
C,' X)4!23!1)!;AHaX!.3+!0,(,51!18,!dualcore!+.1.F.0,L!
$ mysql --user=training --password=training dualcore
-,' P,J1G!(201!18,!.6.2(.F(,!1.F(,0!23!18,!dualcore!+.1.F.0,!9mysql>!-,D-,0,310!18,!
;AHaX!5(2,31!D-)7D1!.3+!20!3)1!D.-1!)>!18,!5)77.3+<L!
mysql> SHOW TABLES;!
D,' [,62,=!18,!01-*51*-,!)>!18,!employees!1.F(,!.3+!,J.723,!.!>,=!)>!210!-,5)-+0L!
mysql> DESCRIBE employees;!
mysql> SELECT emp_id, fname, lname, state, salary FROM
employees LIMIT 10;
N,' KJ21!;AHaX!FA!1AD234!quitG!.3+!18,3!821!18,!,31,-!I,AL!
mysql> quit
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
8
Data Model Reference
For your convenience, you will find a reference section depicting the structure for the
tables you will use in the exercises at the end of this Exercise Manual.
>,' P,J1G!-*3!18,!>)(()=234!5)77.3+G!=8258!27D)-10!18,!employees!1.F(,!231)!18,!
/dualcore!+2-,51)-A!5-,.1,+!,.-(2,-!*0234!1.F!58.-.51,-0!1)!0,D.-.1,!,.58!>2,(+L!
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table employees
Hiding Passwords
Typing the database password on the command line is a potential security risk since
others may see it. An alternative to using the --password argument is to use -P and let
Sqoop prompt you for the password, which is then not visible when you type it.
Sqoop Code Generation
After running the sqoop import command above, you may notice a new file named
employee.java in your local directory. This is an artifact of Sqoop’s code generation
and is really only of interest to Java developers, so you can ignore it.
!
V,' [,620,!18,!D-,62)*0!5)77.3+!.3+!27D)-1!18,!customers!1.F(,!231)!@?ZHB!
W,' [,620,!18,!D-,62)*0!5)77.3+!.3+!27D)-1!18,!products!1.F(,!231)!@?ZHB!
L,' [,620,!18,!D-,62)*0!5)77.3+!.3+!27D)-1!18,!orders!1.F(,!231)!@?ZHB!
! !
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
9
T,' P,J1G!A)*!=2((!27D)-1!18,!order_details!1.F(,!231)!@?ZHB!C8,!5)77.3+!20!0(2481(A!
+2>>,-,31!F,5.*0,!1820!1.F(,!)3(A!8)(+0!-,>,-,35,0!1)!-,5)-+0!23!18,!orders!.3+!
products!1.F(,G!.3+!(.5I0!.!D-27.-A!I,A!)>!210!)=3B!')30,S*,31(AG!A)*!=2((!3,,+!1)!
0D,52>A!18,!--split-by!)D12)3!.3+!2301-*51!HS))D!1)!+262+,!18,!27D)-1!=)-I!.7)34!
7.D!1.0I0!F.0,+!)3!6.(*,0!23!18,!order_id!>2,(+B!M3!.(1,-3.126,!20!1)!*0,!18,!-m 1!
)D12)3!1)!>)-5,!HS))D!1)!27D)-1!.((!18,!+.1.!=218!.!0234(,!1.0IG!F*1!1820!=)*(+!
02432>25.31(A!-,+*5,!D,->)-7.35,B!
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table order_details \
--split-by=order_id
Bonus Exercise #1: Running a Python MapReduce Job
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!,.-(2,-!01,D0!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!5)3123*,!
=218!1820!)D12)3.(!F)3*0!,J,-520,B!
?*.(5)-,/0!@*7.3![,0)*-5,0!7.3.4,-!-,S*,01,+!18.1!=,!D-)62+,!827!=218!18,!3*7F,-!)>!
,7D()A,,0!=218!0.(.-2,0!)>!.1!(,.01!bcdG###G!4-)*D,+!FA!01.1,G!1)!8,(D!827!D(.3!7,,12340!1)!
23>)-7!18)0,!,7D()A,,0!)>!.!-,5,31!-,12-,7,31!D(.3!58.34,!18.1!.>>,510!18,7B!!
M!0)>1=.-,!,3423,,-!)3!)*-!1,.7!-,5,31(A!.11,3+,+!'()*+,-./0!?,6,()D,-!1-.23234!.3+!=.0!
,.4,-!1)!1-A!=-21234!0)7,!;.D[,+*5,!5)+,!23!eA18)3G!0)!A)*!=2((!F-2,>(A!,J.723,!18,!5)+,!
08,!=-)1,!.3+!18,3!-*3!21!1)!D-)+*5,!18,!+.1.!)*-!@[!+,D.-17,31!-,S*,01,+B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
10
C,' '8.34,!1)!18,!bonus_01!0*F+2-,51)-A!)>!18,!5*--,31!,J,-520,L!
$ cd bonus_01
-,' KJ.723,!18,!;.D[,+*5,!5)+,!>)-!F)18!18,!7.DD,-!.3+!-,+*5,-!0)!A)*!5.3!0,,!8)=!21!
=2((!D-)+*5,!18,!+.1.!=,!8.6,!F,,3!.0I,+!1)!D-)62+,L!
$ cat mapper.py
$ cat reducer.py
D,' P,J1G!,J.723,!18,!08,((!05-2D1!18.1!18,!0)>1=.-,!,3423,,-!=-)1,L!
$ cat runjob.sh
M0!A)*!5.3!0,,G!1820!`)F!+,>23,0!18,!D.18!)>!18,!f.6.!(2F-.-A!9fM[<!>2(,!18.1!5)31.230!
0*DD)-1!>)-!@.+))D!H1-,.7234B!Q1!.(0)!+,>23,0!18,!)*1D*1!+2-,51)-AG!7.I234!0*-,!18.1!21!
+),0!3)1!.(-,.+A!,J201!9>)-!,J.7D(,G!2>!.!D-,62)*0!`)F!8.+!.(-,.+A!5-,.1,+!21<!F,>)-,!
0*F7211234!18,!`)F!>)-!,J,5*12)3!)3!18,!5(*01,-B!
N,' KJ,5*1,!1820!08,((!05-2D1!1)!-*3!18,!`)FL!
$ ./runjob.sh
@.+))D!08)*(+!01.-1!+20D(.A234!01.1*0!7,00.4,0!1)!A)*-!05-,,3!=21823!.!>,=!0,5)3+0G!
.3+!18,!`)F!08)*(+!-*3!>)-!.!>,=!723*1,0!F,>)-,!5)35(*+234!=218!.!7,00.4,!,JD(.23234!
18.1!18,!)*1D*1!20!23!18,!/user/training/empcounts!+2-,51)-A!923!@?ZH<!.3+!
-,1*-3234!5)31-)(!1)!A)*-!1,-723.(B!
>,' X201!18,!5)31,310!)>!1820!+2-,51)-A!23!@?ZHL!
$ hadoop fs -ls /user/training/empcounts
N)*!08)*(+!0,,!18-,,!1AD,0!)>!21,70!23!1820!+2-,51)-AL!.!_SUCCESS!>2(,!23+25.1234!18.1!
18,!`)F!5)7D(,1,+!0*55,00>*((AG!.!_logs!+2-,51)-A!5)31.23234!()4!>2(,0!>-)7!18,!`)F!92>!
.3A!=,-,!D-)+*5,+<G!.3+!)3,!)-!7)-,!>2(,0!=8)0,!3.7,0!01.-1!=218!partB!C8,0,!>2(,0!
5)31.23!18,!.51*.(!)*1D*1!-,5)-+0!5-,.1,+!FA!18,!`)FB!!!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
11
V,' H235,!)*-!`)F!027D(A!D-)+*5,0!.!07.((!0*77.-A!)>!18,!23D*1!+.1.G!21!=)*(+!F,!7)-,!
5)36,32,31!1)!-,1-2,6,!18,!5)31,310!)>!18,!@?ZH!)*1D*1!+2-,51)-A!.3+!7,-4,!18,7!231)!.!
0234(,!>2(,!)3!18,!()5.(!+20IB!C8,!hadoop fs -getmerge!5)77.3+!+),0!,J.51(A!1820L!
$ hadoop fs -getmerge /user/training/empcounts results.txt
C8,!results.txt!>2(,!=2((!3)1!235(*+,!.3A!+.1.!>-)7!18,!_SUCCESS!>2(,!)-!_logs!
0*F+2-,51)-A!23!@?ZH!F,5.*0,!1820!5)77.3+!243)-,0!>2(,0!=8)0,!3.7,0!F,423!=218!.3!
*3+,-05)-,!)-!.!D,-2)+!9+)1<B!!
W,' P)=!18.1!1820!20!.!()5.(!>2(,G!A)*!5.3!,J.723,!210!)*1D*1!=218!18,!OPQR!cat!5)77.3+L!
$ cat results.txt
N)*!08)*(+!3)=!F,!.F(,!1)!.30=,-!18,!>)(()=234!S*,012)30L!
.B @)=!7.3A!01.1,0!7.158!18,!5-21,-2.!0D,52>2,+!FA!18,!@*7.3![,0)*-5,0!
+,D.-17,31g!
FB T8258!01.1,!5)31.230!18,!7)01!,7D()A,,0!=8)!7.158!18,0,!5-21,-2.g!9@231L!21!
20!18,!0.7,!01.1,!=8,-,!?*.(5)-,!=.0!>)*3+,+G!=8,-,!18,!5)-D)-.1,!
8,.+S*.-1,-0!.-,!()5.1,+G!.3+!=8,-,!=,!8.6,!18,!7)01!,7D()A,,0!)6,-.((B<!
Bonus Exercise #2: Running a Java MapReduce Job
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!>2-01!F)3*0!,J,-520,!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!
5)3123*,!=218!1820!0,5)3+!F)3*0!,J,-520,B!
C8,!0)>1=.-,!,3423,,-!=8)!=-)1,!18,!)-2423.(!eA18)3!;.D[,+*5,!`)F!=.0!0)!,J521,+!.F)*1!
=8.1!08,!(,.-3,+!23!'()*+,-./0!?,6,()D,-!5(.00!18.1!08,!=.31,+!1)!5-,.1,!.!;.D[,+*5,!`)F!
23!f.6.!F.0,+!)3!18,!eA18)3!)3,!A)*!`*01!-.3B!C820!`)F!20!0(2481(A!+2>>,-,31!23!18.1!21!2+,312>2,0!
01.1,0!=218!.1!(,.01!"d!,7D()A,,0!8.6234!.!0.(.-A!)>!bd#G###!)-!7)-,B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
12
C,' '8.34,!1)!18,!bonus_02!0*F+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd ../bonus_02
Q>!A)*!8.DD,3!1)!I3)=!18,!f.6.!D-)4-.77234!(.34*.4,G!8.6,!.!())I!.1!18,!5)+,!23!18,!
java/src/example/!0*F+2-,51)-AB!Q1!5)31.230!5)77,310!18.1!,JD(.23!18,!5)+,!>)-!
18,!;.DD,-!.3+![,+*5,-G!.0!=,((!.0!18,!?-26,-!5(.00!18.1!5)3>24*-,0!.3+!0*F7210!18,!`)F!
1)!18,!5(*01,-B!
-,' C8,!0)>1=.-,!,3423,,-!=-)1,!.!05-2D1!18.1!*0,0!18,!MD.58,!M31!F*2(+!0A01,7!1)!5)7D2(,G!
D.5I.4,G!.3+!-*3!18,!5)+,B!C820!7,.30!A)*!5.3!D,->)-7!.((!)>!18,0,!01,D0!FA!1AD234!`*01!
)3,!5)77.3+L!
$ ant run
D,' Y35,!18,!`)F!20!5)7D(,1,G!*0,!hadoop fs -cat!1)!+20D(.A!18,!)*1D*1!23!@?ZH!+2-,51(A!
1)!18,!1,-723.(!93)1,L!18,!f.6.!`)F!20!5)3>24*-,+!1)!=-21,!210!)*1D*1!1)!.!0(2481(A!+2>>,-,31!
)*1D*1!D.18!18.3!18,!eA18)3!`)F<L!
$ hadoop fs -cat /user/training/empcounts_java/part*
!
This is the end of the Exercise
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
13
Hands-On Exercise: Using Pig for ETL
Processing!
8#'*;5+'"3"$45+"'F)P'J5&&'<$%4*54"'P+5#9'@59'*)'"3<&)$"S'4)$$"4*S'%#/'$")$/"$'/%*%'5#'
A5&"+'A$)I'*J)'/5AA"$"#*'%/'#"*J)$\+,'[)P'J5&&'A5$+*'"3<"$5I"#*'J5*;'+I%&&'+%I<&"+')A'
*;5+'/%*%'P+5#9'@59'5#'&)4%&'I)/"S'%#/')#4"'F)P'%$"'4)#A5/"#*'*;%*'F)P$'2=B'+4$5<*+'
J)$\'%+'F)P'"3<"4*S'F)P'J5&&'P+"'*;"I'*)'<$)4"++'*;"'4)I<&"*"'/%*%'+"*+'5#'.7MK'ZF'
P+5#9'@59'5#'U%<O"/P4"'I)/",'
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!18,!D-,62)*0!)3,B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!18,!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
Background Information
?*.(5)-,!8.0!-,5,31(A!01.-1,+!*0234!)3(23,!.+6,-120,7,310!1)!.11-.51!3,=!5*01)7,-0!1)!)*-!
,E5)77,-5,!021,B!K.58!)>!18,!1=)!.+!3,1=)-I0!=,!*0,!D-)62+,0!+.1.!.F)*1!18,!.+0!18,A/6,!
D(.5,+B!C820!235(*+,0!18,!021,!=8,-,!18,!.+!=.0!D(.5,+G!18,!+.1,!=8,3!21!=.0!D(.5,+G!=8.1!
I,A=)-+0!1-244,-,+!210!+20D(.AG!=8,18,-!18,!*0,-!5(25I,+!18,!.+G!.3+!18,!D,-E5(25I!5)01B!
O3>)-1*3.1,(AG!18,!+.1.!>-)7!,.58!3,1=)-I!20!23!.!+2>>,-,31!>)-7.1B!K.58!>2(,!.(0)!5)31.230!
0)7,!236.(2+!-,5)-+0B!h,>)-,!=,!5.3!.3.(AU,!18,!+.1.G!=,!7*01!>2-01!5)--,51!18,0,!D-)F(,70!
FA!*0234!e24!1)L!
Z2(1,-!236.(2+!-,5)-+0!
[,)-+,-!>2,(+0!
')--,51!235)30201,352,0!
T-21,!18,!5)--,51,+!+.1.!1)!@?ZH!!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
14
Step #1: Working in the Grunt Shell
Q3!1820!01,DG!A)*!=2((!D-.5125,!-*33234!e24!5)77.3+0!23!18,!i-*31!08,((B!!
C,' '8.34,!1)!18,!+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd $ADIR/exercises/pig_etl
-,' ')DA!.!07.((!3*7F,-!)>!-,5)-+0!>-)7!18,!23D*1!>2(,!1)!.3)18,-!>2(,!)3!18,!()5.(!>2(,!
0A01,7B!T8,3!A)*!01.-1!e24G!A)*!=2((!-*3!23!()5.(!7)+,B!Z)-!1,01234G!A)*!5.3!=)-I!>.01,-!
=218!07.((!()5.(!>2(,0!18.3!(.-4,!>2(,0!23!@?ZHB!!
Q1!20!3)1!,00,312.(!1)!58))0,!.!-.3+)7!0.7D(,!8,-,!j!`*01!.!8.3+>*(!)>!-,5)-+0!23!18,!
5)--,51!>)-7.1!=2((!0*>>25,B!O0,!18,!5)77.3+!F,()=!1)!5.D1*-,!18,!>2-01!"d!-,5)-+0!0)!
A)*!8.6,!,3)*48!1)!1,01!A)*-!05-2D1L!
$ head -n 25 $ADIR/data/ad_data1.txt > sample1.txt
D,' H1.-1!18,!i-*31!08,((!23!()5.(!7)+,!0)!18.1!A)*!5.3!=)-I!=218!18,!()5.(!sample1.txt!
>2(,B!!
$ pig -x local
M!D-)7D1!23+25.1,0!18.1!A)*!.-,!3)=!23!18,!i-*31!08,((L!
grunt>
N,' X).+!18,!+.1.!23!18,!sample1.txt!>2(,!231)!e24!.3+!+*7D!21L!
grunt> data = LOAD 'sample1.txt';
grunt> DUMP data;
N)*!08)*(+!0,,!18,!"d!-,5)-+0!18.1!5)7D-20,!18,!0.7D(,!+.1.!>2(,B!!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
15
>,' X).+!18,!>2-01!1=)!5)(*730/!+.1.!>-)7!18,!0.7D(,!>2(,!.0!58.-.51,-!+.1.G!.3+!18,3!+*7D!
18.1!+.1.L!
grunt> first_2_columns = LOAD 'sample1.txt' AS
(keyword:chararray, campaign_id:chararray);
grunt> DUMP first_2_columns;
V,' O0,!18,!DESCRIBE!5)77.3+!23!e24!1)!-,62,=!18,!058,7.!)>!first_2_colsL!
grunt> DESCRIBE first_2_columns;
C8,!058,7.!.DD,.-0!23!18,!i-*31!08,((B!
O0,!18,!DESCRIBE!5)77.3+!=82(,!D,->)-7234!18,0,!,J,-520,0!.3A!127,!A)*!=)*(+!(2I,!
1)!-,62,=!058,7.!+,>23212)30B!!
W,' H,,!=8.1!8.DD,30!2>!A)*!-*3!18,!DESCRIBE!5)77.3+!)3!dataB![,5.((!18.1!=8,3!A)*!
().+,+!dataG!A)*!+2+!not!+,>23,!.!058,7.B!!
grunt> DESCRIBE data;
L,' K3+!A)*-!i-*31!08,((!0,002)3L!
grunt> QUIT;
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
16
Step #2: Processing Input Data from the First Ad Network
Q3!1820!01,DG!A)*!=2((!D-)5,00!18,!23D*1!+.1.!>-)7!18,!>2-01!.+!3,1=)-IB!Z2-01G!A)*!=2((!5-,.1,!.!
e24!05-2D1!23!.!>2(,G!.3+!18,3!A)*!=2((!-*3!18,!05-2D1B!;.3A!D,)D(,!>23+!=)-I234!1820!=.A!
,.02,-!18.3!=)-I234!+2-,51(A!23!18,!i-*31!08,((B!
C,' K+21!18,!first_etl.pig!>2(,!1)!5)7D(,1,!18,!LOAD!01.1,7,31!.3+!-,.+!18,!+.1.!>-)7!
18,!0.7D(,!A)*!`*01!5-,.1,+B!C8,!>)(()=234!1.F(,!08)=0!18,!>)-7.1!)>!18,!+.1.!23!18,!>2(,B!
Z)-!027D(2521AG!A)*!08)*(+!(,.6,!18,!date!.3+!time!>2,(+0!0,D.-.1,G!0)!,.58!=2((!F,!)>!
1AD,!chararrayG!-.18,-!18.3!5)36,-1234!18,7!1)!.!0234(,!>2,(+!)>!1AD,!datetimeB!
!
Index&
Field&
Data&Type&
Description&
Example&
#!
I,A=)-+!
58.-.--.A!
k,A=)-+!18.1!1-244,-,+!.+!
tablet
$!
5.7D.243l2+!
58.-.--.A!
O32S*,(A!2+,312>2,0!)*-!.+!
A3
"!
+.1,!
58.-.--.A!
?.1,!)>!.+!+20D(.A!
05/29/2013
&!
127,!
58.-.--.A!
C27,!)>!.+!+20D(.A!
15:49:21
%!
+20D(.Al021,!
58.-.--.A!
?)7.23!=8,-,!.+!08)=3!
www.example.com
d!
=.0l5(25I,+!
231!
T8,18,-!.+!=.0!5(25I,+!
1
]!
5D5!
231!
')01!!D,-!5(25IG!23!5,310!
106
c!
5)*31-A!
58.-.--.A!
P.7,!)>!5)*31-A!23!=8258!.+!-.3!
USA
_!
D(.5,7,31!
58.-.--.A!
T8,-,!)3!D.4,!=.0!.+!+20D(.A,+!
TOP
-,' Y35,!A)*!8.6,!,+21,+!18,!LOAD!01.1,7,31G!1-A!21!)*1!FA!-*33234!A)*-!05-2D1!23!()5.(!
7)+,L!
$ pig -x local first_etl.pig
;.I,!0*-,!18,!)*1D*1!())I0!5)--,51!92B,BG!18.1!A)*!8.6,!18,!>2,(+0!23!18,!,JD,51,+!)-+,-!
.3+!18,!6.(*,0!.DD,.-!0272(.-!23!>)-7.1!1)!18.1!08)=3!23!18,!1.F(,!.F)6,<!F,>)-,!A)*!
5)3123*,!=218!18,!3,J1!01,DB!!
D,' ;.I,!,.58!)>!18,!>)(()=234!58.34,0G!-*33234!A)*-!05-2D1!23!()5.(!7)+,!.>1,-!,.58!)3,!1)!
6,-2>A!18.1!A)*-!58.34,!20!5)--,51L!
.B OD+.1,!A)*-!05-2D1!1)!>2(1,-!)*1!.((!-,5)-+0!=8,-,!18,!5)*31-A!>2,(+!+),0!3)1!
5)31.23!USAB!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
17
FB T,!3,,+!1)!01)-,!18,!>2,(+0!23!.!+2>>,-,31!)-+,-!18.3!=,!-,5,26,+!18,7B!!O0,!.!
FOREACH!!GENERATE!01.1,7,31!1)!5-,.1,!.!3,=!-,(.12)3!5)31.23234!18,!
>2,(+0!23!18,!0.7,!)-+,-!.0!08)=3!23!18,!>)(()=234!1.F(,!918,!country!>2,(+!20!
3)1!235(*+,+!0235,!.((!-,5)-+0!3)=!8.6,!18,!0.7,!6.(*,<L!
!
Field&
Description&
5.7D.243l2+!
O32S*,(A!2+,312>2,0!)*-!.+!
+.1,!
?.1,!)>!.+!+20D(.A!
127,!
C27,!)>!.+!+20D(.A!
I,A=)-+!
k,A=)-+!18.1!1-244,-,+!.+!
+20D(.Al021,!
?)7.23!=8,-,!.+!08)=3!
D(.5,7,31!
T8,-,!)3!D.4,!=.0!.+!+20D(.A,+!
=.0l5(25I,+!
T8,18,-!.+!=.0!5(25I,+!
5D5!
')01!!D,-!5(25IG!23!5,310!
5B OD+.1,!A)*-!05-2D1!1)!5)36,-1!18,!keyword!>2,(+!1)!*DD,-5.0,!.3+!1)!-,7)6,!
.3A!(,.+234!)-!1-.2(234!=821,0D.5,!98231L!A)*!5.3!3,01!5.((0!1)!18,!1=)!F*2(1E23!
>*3512)30!2302+,!18,!FOREACH!!GENERATE!01.1,7,31!>-)7!18,!(.01!
01.1,7,31<B!
N,' M++!18,!5)7D(,1,!+.1.!>2(,!1)!@?ZHL!
$ hadoop fs -put $ADIR/data/ad_data1.txt /dualcore
>,' K+21!first_etl.pig!.3+!58.34,!18,!D.18!23!18,!LOAD!01.1,7,31!1)!7.158!18,!D.18!)>!
18,!>2(,!A)*!`*01!.++,+!1)!@?ZH!9/dualcore/ad_data1.txt<B!!!
V,' P,J1G!-,D(.5,!DUMP!=218!.!STORE!01.1,7,31!18.1!=2((!=-21,!18,!)*1D*1!)>!A)*-!
D-)5,00234!.0!1.FE+,(2721,+!-,5)-+0!1)!18,!/dualcore/ad_data1!+2-,51)-AB!!
W,' [*3!1820!05-2D1!23!e24/0!;.D[,+*5,!7)+,!1)!.3.(AU,!18,!,312-,!>2(,!23!@?ZHL!
$ pig first_etl.pig
Q>!A)*-!05-2D1!>.2(0G!58,5I!A)*-!5)+,!5.-,>*((AG!>2J!18,!,--)-G!.3+!18,3!1-A!-*33234!21!.4.23B!
?)3/1!>)-4,1!18.1!A)*!7*01!-,7)6,!)*1D*1!23!@?ZH!>-)7!.!D-,62)*0!-*3!F,>)-,!A)*!
,J,5*1,!18,!05-2D1!.4.23B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
18
L,' '8,5I!18,!>2-01!"#!)*1D*1!-,5)-+0!18.1!A)*-!05-2D1!=-)1,!1)!@?ZH!.3+!,30*-,!18,A!())I!
5)--,51!9A)*!5.3!243)-,!18,!7,00.4,!V5.1L!O3.F(,!1)!=-21,!1)!)*1D*1!01-,.7Wm!1820!027D(A!
8.DD,30!F,5.*0,!A)*!.-,!=-21234!7)-,!+.1.!=218!18,!fs -cat!5)77.3+!18.3!A)*!.-,!
-,.+234!=218!18,!head!5)77.3+<L!
$ hadoop fs -cat /dualcore/ad_data1/part* | head -20
.B M-,!18,!>2,(+0!23!18,!5)--,51!)-+,-g!
FB M-,!.((!18,!I,A=)-+0!3)=!23!*DD,-5.0,g!
Step #3: Processing Input Data from the Second Ad Network
P)=!18.1!A)*!8.6,!0*55,00>*((A!D-)5,00,+!18,!+.1.!>-)7!18,!>2-01!.+!3,1=)-IG!5)3123*,!FA!
D-)5,00234!+.1.!>-)7!18,!0,5)3+!)3,B!!
C,' '-,.1,!.!07.((!0.7D(,!)>!18,!+.1.!>-)7!18,!0,5)3+!.+!3,1=)-I!18.1!A)*!5.3!1,01!()5.((A!
=82(,!A)*!+,6,()D!A)*-!05-2D1L!
$ head -n 25 $ADIR/data/ad_data2.txt > sample2.txt
!
-,' K+21!18,!second_etl.pig!>2(,!1)!5)7D(,1,!18,!LOAD!01.1,7,31!.3+!-,.+!18,!+.1.!>-)7!
18,!0.7D(,!A)*!`*01!5-,.1,+!98231L!18,!>2,(+0!.-,!5)77.E+,(2721,+<B!C8,!>)(()=234!1.F(,!
08)=0!18,!)-+,-!)>!>2,(+0!23!1820!>2(,L!
!
Index&
Field&
Data&Type&
Description&
Example&
#!
5.7D.243l2+!
58.-.--.A!
O32S*,(A!2+,312>2,0!)*-!.+!
A3
$!
+.1,!
58.-.--.A!
?.1,!)>!.+!+20D(.A!
05/29/2013
"!
127,!
58.-.--.A!
C27,!)>!.+!+20D(.A!
15:49:21
&!
+20D(.Al021,!
58.-.--.A!
?)7.23!=8,-,!.+!08)=3!
www.example.com
%!
D(.5,7,31!
58.-.--.A!
T8,-,!)3!D.4,!=.0!.+!+20D(.A,+!
TOP
d!
=.0l5(25I,+!
231!
T8,18,-!.+!=.0!5(25I,+!
Y
]!
5D5!
231!
')01!D,-!5(25IG!23!5,310!
106
c!
I,A=)-+!
58.-.--.A!
k,A=)-+!18.1!1-244,-,+!.+!
tablet
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
19
D,' Y35,!A)*!8.6,!,+21,+!18,!LOAD!01.1,7,31G!*0,!18,!DESCRIBE!I,A=)-+!.3+!18,3!-*3!
A)*-!05-2D1!23!()5.(!7)+,!1)!58,5I!18.1!18,!058,7.!7.158,0!18,!1.F(,!.F)6,L!
$ pig -x local second_etl.pig
N,' [,D(.5,!DESCRIBE!=218!.!DUMP!01.1,7,31!.3+!18,3!7.I,!,.58!)>!18,!>)(()=234!
58.34,0!1)!second_etl.pigG!-*33234!1820!05-2D1!23!()5.(!7)+,!.>1,-!,.58!58.34,!1)!
6,-2>A!=8.1!A)*/6,!+)3,!F,>)-,!A)*!5)3123*,!=218!18,!3,J1!01,DL!
.B C820!.+!3,1=)-I!0)7,127,0!()40!.!426,3!-,5)-+!1=25,B!M++!.!01.1,7,31!1)!18,!
second_etl.pig!>2(,!0)!18.1!A)*!-,7)6,!.3A!+*D(25.1,!-,5)-+0B!Q>!A)*!8.6,!
+)3,!1820!5)--,51(AG!A)*!08)*(+!)3(A!0,,!)3,!-,5)-+!=8,-,!18,!
display_site!>2,(+!8.0!.!6.(*,!)>!siliconwire.example.comB!
FB M0!F,>)-,G!A)*!3,,+!1)!01)-,!18,!>2,(+0!23!.!+2>>,-,31!)-+,-!18.3!A)*!-,5,26,+!
18,7B!!O0,!.!FOREACH!!GENERATE!01.1,7,31!1)!5-,.1,!.!3,=!-,(.12)3!
5)31.23234!18,!>2,(+0!23!18,!0.7,!)-+,-!A)*!*0,+!1)!=-21,!18,!)*1D*1!>-)7!>2-01!
.+!3,1=)-I!908)=3!.4.23!23!18,!1.F(,!F,()=<!.3+!.(0)!*0,!18,!UPPER!.3+!
TRIM!>*3512)30!1)!5)--,51!18,!keyword!>2,(+!.0!A)*!+2+!,.-(2,-L!
!
Field&
Description&
5.7D.243l2+!
O32S*,(A!2+,312>2,0!)*-!.+!
+.1,!
?.1,!)>!.+!+20D(.A!
127,!
C27,!)>!.+!+20D(.A!
I,A=)-+!
k,A=)-+!18.1!1-244,-,+!.+!
+20D(.Al021,!
?)7.23!=8,-,!.+!08)=3!
D(.5,7,31!
T8,-,!)3!D.4,!=.0!.+!+20D(.A,+!
=.0l5(25I,+!
T8,18,-!.+!=.0!5(25I,+!
5D5!
')01!!D,-!5(25IG!23!5,310!
5B C8,!+.1,!>2,(+!23!1820!+.1.!0,1!20!23!18,!>)-7.1!MM-DD-YYYYG!=82(,!18,!+.1.!
A)*!D-,62)*0(A!=-)1,!20!23!18,!>)-7.1!MM/DD/YYYYB!K+21!18,!FOREACH!!
GENERATE!01.1,7,31!1)!5.((!18,!REPLACE(date, '-', '/')!>*3512)3!
1)!5)--,51!1820B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
20
>,' Y35,!A)*!.-,!0*-,!18,!05-2D1!=)-I0!()5.((AG!.++!18,!>*((!+.1.!0,1!1)!@?ZHL!
$ hadoop fs -put $ADIR/data/ad_data2.txt /dualcore
V,' K+21!18,!05-2D1!1)!8.6,!21!LOAD!18,!>2(,!A)*!`*01!.++,+!1)!@?ZHG!.3+!18,3!-,D(.5,!18,!
DUMP!01.1,7,31!=218!.!STORE!01.1,7,31!1)!=-21,!A)*-!)*1D*1!.0!1.FE+,(2721,+!-,5)-+0!
1)!18,!/dualcore/ad_data2!+2-,51)-AB!!
W,' [*3!A)*-!05-2D1!.4.2301!18,!+.1.!A)*!.++,+!1)!@?ZHL!
$ pig second_etl.pig
L,' '8,5I!18,!>2-01!$d!)*1D*1!-,5)-+0!=-211,3!23!@?ZH!FA!A)*-!05-2D1L!
$ hadoop fs -cat /dualcore/ad_data2/part* | head -15
.B ?)!A)*!0,,!.3A!+*D(25.1,!-,5)-+0g!
FB M-,!18,!>2,(+0!23!18,!5)--,51!)-+,-g!
5B M-,!.((!18,!I,A=)-+0!23!*DD,-5.0,g!
+B Q0!18,!+.1,!>2,(+!23!18,!5)--,51!9MM/DD/YYYY<!>)-7.1g!
!
This is the end of the Exercise
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
21
Hands-On Exercise: Analyzing Ad
Campaign Data with Pig!
7P$5#9'*;"'<$"Q5)P+'"3"$45+"S'F)P'<"$A)$I"/'2=B'<$)4"++5#9')#'/%*%'+"*+'A$)I'*J)'
)#&5#"'%/'#"*J)$\+,'8#'*;5+'"3"$45+"S'F)P'J5&&'J$5*"'@59'+4$5<*+'*;%*'%#%&FG"'*;5+'/%*%'*)'
)<*5I5G"')P$'%/Q"$*5+5#9S';"&<5#9'7P%&4)$"'*)'+%Q"'I)#"F'%#/'%**$%4*'#"J'4P+*)I"$+,'
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!18,!D-,62)*0!)3,B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!18,!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
Step #1: Find Low Cost Sites
h)18!.+!3,1=)-I0!58.-4,!*0!)3(A!=8,3!.!*0,-!5(25I0!)3!)*-!.+B!C820!20!2+,.(!>)-!?*.(5)-,!
0235,!)*-!4).(!20!1)!F-234!3,=!5*01)7,-0!1)!)*-!021,B!@)=,6,-G!0)7,!021,0!.3+!I,A=)-+0!.-,!
7)-,!,>>,5126,!18.3!)18,-0!.1!.11-.51234!D,)D(,!231,-,01,+!23!18,!3,=!1.F(,1!=,!.+6,-120,B!
T218!1820!23!723+G!A)*!=2((!F,423!FA!2+,312>A234!=8258!021,0!8.6,!18,!()=,01!1)1.(!5)01B!!
C,' '8.34,!1)!18,!+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd $ADIR/exercises/analyze_ads
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
22
-,' YF1.23!.!()5.(!0*F0,1!)>!18,!23D*1!+.1.!FA!-*33234!18,!>)(()=234!5)77.3+L!
$ hadoop fs -cat /dualcore/ad_data1/part* \
| head -n 100 > test_ad_data.txt
N)*!5.3!243)-,!18,!7,00.4,!V5.1L!O3.F(,!1)!=-21,!1)!)*1D*1!01-,.7GW!=8258!.DD,.-0!
F,5.*0,!A)*!.-,!=-21234!7)-,!+.1.!=218!18,!fs -cat!5)77.3+!18.3!A)*!.-,!-,.+234!
=218!18,!head!5)77.3+B!
()*"6!M0!7,312)3,+!23!18,!D-,62)*0!,J,-520,G!21!20!>.01,-!1)!1,01!e24!05-2D10!FA!*0234!.!
()5.(!0*F0,1!)>!18,!23D*1!+.1.B!N)*!5.3!*0,!()5.(!0*F0,10!)>!+.1.!=8,3!1,01234!e24!05-2D10!
18-)*48)*1!1820!5)*-0,B!M(18)*48!,JD(2521!01,D0!.-,!3)1!D-)62+,+!>)-!5-,.1234!()5.(!+.1.!
0*F0,10!23!*D5)7234!,J,-520,0G!+)234!0)!=2((!8,(D!A)*!D,->)-7!18,!,J,-520,0!7)-,!
S*25I(AB!
D,' YD,3!18,!low_cost_sites.pig!>2(,!23!A)*-!,+21)-G!.3+!18,3!7.I,!18,!>)(()=234!
58.34,0L!
.B ;)+2>A!18,!LOAD!01.1,7,31!1)!-,.+!18,!0.7D(,!+.1.!23!18,!
test_ad_data.txt!>2(,B!
FB M++!.!(23,!18.1!5-,.1,0!.!3,=!-,(.12)3!1)!235(*+,!)3(A!-,5)-+0!=8,-,!
was_clicked!8.0!.!6.(*,!)>!1B!
5B i-)*D!1820!>2(1,-,+!-,(.12)3!FA!18,!display_site!>2,(+B!
+B '-,.1,!.!3,=!-,(.12)3!18.1!235(*+,0!1=)!>2,(+0L!18,!display_site!.3+!18,!
1)1.(!5)01!)>!.((!5(25I0!)3!18.1!021,B!
,B H)-1!18.1!3,=!-,(.12)3!FA!5)01!923!.05,3+234!)-+,-<!!
>B ?20D(.A!`*01!18,!>2-01!18-,,!-,5)-+0!1)!18,!05-,,3!
N,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!0.7D(,!+.1.L!
$ pig –x local low_cost_sites.pig
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
23
>,' Q3!18,!LOAD!01.1,7,31G!-,D(.5,!18,!test_ad_data.txt!>2(,!=218!.!>2(,!4()F!9D.11,-3<!
18.1!=2((!().+!F)18!18,!/dualcore/ad_data1!.3+!/dualcore/ad_data2!
+2-,51)-2,0!9.3+!+),0!not!().+!.3A!)18,-!+.1.G!0*58!.0!18,!1,J1!>2(,0!>-)7!18,!D-,62)*0!
,J,-520,<B'
V,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!+.1.!23!@?ZHL!
$ pig low_cost_sites.pig
RP"+*5)#L!T8258!18-,,!021,0!8.6,!18,!()=,01!)6,-.((!5)01g!
Step #2: Find High Cost Keywords
C8,!1,-70!*0,-0!1AD,!=8,3!+)234!0,.-58,0!7.A!D-)7D1!18,!021,!1)!+20D(.A!.!?*.(5)-,!
.+6,-120,7,31B!H235,!)3(23,!.+6,-120,-0!5)7D,1,!>)-!18,!0.7,!0,1!)>!I,A=)-+0G!0)7,!)>!
18,7!5)01!7)-,!18.3!)18,-0B!N)*!=2((!3)=!=-21,!0)7,!e24!X.123!1)!+,1,-723,!=8258!
I,A=)-+0!8.6,!F,,3!18,!7)01!,JD,3026,!>)-!*0!)6,-.((B!!
C,' H235,!1820!=2((!F,!.!0(2481!6.-2.12)3!)3!18,!5)+,!A)*!8.6,!`*01!=-211,3G!5)DA!18.1!>2(,!.0!
high_cost_keywords.pigL!
$ cp low_cost_sites.pig high_cost_keywords.pig
-,' K+21!18,!high_cost_keywords.pig!>2(,!.3+!7.I,!18,!>)(()=234!18-,,!58.34,0L!
.B i-)*D!FA!18,!keyword!>2,(+!2301,.+!)>!display_site!
FB H)-1!23!+,05,3+234!)-+,-!)>!5)01!
5B ?20D(.A!18,!1)D!>26,!-,0*(10!1)!18,!05-,,3!2301,.+!)>!18,!1)D!18-,,!.0!F,>)-,!
D,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!+.1.!23!@?ZHL!
$ pig high_cost_keywords.pig
RP"+*5)#L!T8258!>26,!I,A=)-+0!8.6,!18,!8248,01!)6,-.((!5)01g!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
24
Bonus Exercise #1: Count Ad Clicks
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!,.-(2,-!01,D0!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!5)3123*,!
=218!1820!)D12)3.(!F)3*0!,J,-520,B!
Y3,!27D)-1.31!01.120125!=,!8.6,3/1!A,1!5.(5*(.1,+!20!18,!1)1.(!3*7F,-!)>!5(25I0!)*-!.+0!8.6,!
-,5,26,+B!?)234!0)!=2((!8,(D!)*-!7.-I,1234!+2-,51)-!D(.3!8,-!3,J1!.+!5.7D.243!F*+4,1B!
C,' '8.34,!1)!18,!bonus_01!0*F+2-,51)-A!)>!18,!5*--,31!,J,-520,L!
$ cd bonus_01
-,' K+21!18,!total_click_count.pig!>2(,!.3+!27D(,7,31!18,!>)(()=234L!
.B i-)*D!18,!-,5)-+0!9>2(1,-,+!FA!was_clicked == 1<!0)!18.1!A)*!5.3!5.((!18,!
.44-,4.1,!>*3512)3!23!18,!3,J1!01,DB!
FB Q36)I,!18,!COUNT!>*3512)3!1)!5.(5*(.1,!18,!1)1.(!)>!5(25I,+!.+0!98231L!F,5.*0,!
=,!08)*(+3/1!8.6,!.3A!3*((!-,5)-+0G!A)*!5.3!*0,!18,!COUNT!>*3512)3!2301,.+!)>!
COUNT_STARG!.3+!18,!58)25,!)>!>2,(+!A)*!0*DD(A!1)!18,!>*3512)3!20!.-F21-.-A<B!
5B ?20D(.A!18,!-,0*(1!1)!18,!05-,,3!
D,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!+.1.!23!@?ZHL!
$ pig total_click_count.pig
RP"+*5)#L!@)=!7.3A!5(25I0!+2+!=,!-,5,26,g!
Bonus Exercise #2: Estimate The Maximum Cost of The Next
Ad Campaign
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!,.-(2,-!01,D0!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!5)3123*,!
=218!1820!)D12)3.(!F)3*0!,J,-520,B!
T8,3!A)*!-,D)-1,+!18,!1)1.(!3*7F,-!)>!5(25I0!1)!)*-!;.-I,1234!?2-,51)-G!08,!0.2+!18.1!8,-!
4).(!20!1)!4,1!.F)*1!18-,,!127,0!18.1!.7)*31!+*-234!18,!3,J1!5.7D.243B!O3>)-1*3.1,(AG!
F,5.*0,!18,!5)01!20!F.0,+!)3!18,!021,!.3+!I,A=)-+G!08,!+),03/1!I3)=!8)=!7*58!1)!F*+4,1!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
25
>)-!18.1!5.7D.243B!H8,!.0I,+!A)*!1)!8,(D!FA!,0127.1234!18,!=)-01!5.0,!97)01!,JD,3026,<!5)01!
F.0,+!)3!d#G###!5(25I0B!N)*!=2((!+)!1820!FA!>23+234!18,!7)01!,JD,3026,!.+!.3+!18,3!
7*(12D(A234!21!FA!18,!3*7F,-!)>!5(25I0!08,!=.310!1)!.582,6,!23!18,!3,J1!5.7D.243B!
C,' h,5.*0,!1820!5)+,!=2((!F,!0272(.-!1)!18,!5)+,!A)*!=-)1,!23!18,!D-,62)*0!01,DG!01.-1!FA!
5)DA234!18.1!>2(,!.0!project_next_campaign_cost.pigL!
$ cp total_click_count.pig project_next_campaign_cost.pig
-,' K+21!18,!project_next_campaign_cost.pig!>2(,!.3+!7.I,!18,!>)(()=234!
7)+2>25.12)30L!
.B H235,!A)*!.-,!1-A234!1)!+,1,-723,!18,!8248,01!D)002F(,!5)01G!A)*!08)*(+!3)1!
(2721!A)*-!5.(5*(.12)3!1)!18,!5)01!>)-!.+0!.51*.((A!5(25I,+B![,7)6,!18,!FILTER!
01.1,7,31!0)!18.1!A)*!5)302+,-!18,!D)002F2(21A!18.1!.3A!.+!72481!F,!5(25I,+B!
FB '8.34,!18,!.44-,4.1,!>*3512)3!1)!18,!)3,!18.1!-,1*-30!18,!7.J27*7!6.(*,!23!
18,!cpc!>2,(+!98231L!+)3/1!>)-4,1!1)!58.34,!18,!3.7,!)>!18,!-,(.12)3!1820!>2,(+!
F,()340!1)G!23!)-+,-!1)!.55)*31!>)-!18,!-,7)6.(!)>!18,!FILTER!01.1,7,31!23!
18,!D-,62)*0!01,D<B!
5B ;)+2>A!A)*-!FOREACH...GENERATE!01.1,7,31!1)!7*(12D(A!18,!6.(*,!
-,1*-3,+!FA!18,!.44-,4.1,!>*3512)3!FA!18,!1)1.(!3*7F,-!)>!5(25I0!=,!,JD,51!1)!
8.6,!23!18,!3,J1!5.7D.243!
+B ?20D(.A!18,!-,0*(1234!6.(*,!1)!18,!05-,,3B!
D,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!+.1.!23!@?ZHL!
$ pig project_next_campaign_cost.pig
RP"+*5)#L!T8.1!20!18,!7.J27*7!A)*!,JD,51!1820!5.7D.243!72481!5)01g!N)*!5.3!
5)7D.-,!A)*-!0)(*12)3!1)!18,!)3,!23!18,!bonus_02/sample_solution/!
0*F+2-,51)-AB!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
26
Bonus Exercise #3: Calculating Click Through Rate (CTR)
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!,.-(2,-!01,D0!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!5)3123*,!
=218!1820!)D12)3.(!F)3*0!,J,-520,B!
C8,!5.(5*(.12)30!A)*!+2+!.1!18,!01.-1!)>!1820!,J,-520,!4.6,!*0!.!-)*48!2+,.!.F)*1!18,!0*55,00!
)>!.+!5.7D.243G!F*1!+2+3/1!.55)*31!>)-!18,!>.51!18.1!0)7,!021,0!+20D(.A!)*-!.+0!7)-,!18.3!
)18,-0B!C820!7.I,0!21!+2>>25*(1!1)!+,1,-723,!8)=!,>>,5126,!)*-!.+0!=,-,!FA!027D(A!5)*31234!
18,!3*7F,-!)>!5(25I0!)3!)3,!021,!.3+!5)7D.-234!21!1)!18,!3*7F,-!)>!5(25I0!)3!.3)18,-!021,B!
Y3,!7,1-25!18.1!=)*(+!.(()=!*0!1)!F,11,-!7.I,!0*58!5)7D.-20)30!20!18,!'(25IEC8-)*48![.1,!
9http://tiny.cloudera.com/ade03a<G!5)77)3(A!.FF-,62.1,+!.0!'C[B!C820!6.(*,!20!
027D(A!18,!D,-5,31.4,!)>!.+0!08)=3!18.1!*0,-0!.51*.((A!5(25I,+G!.3+!5.3!F,!5.(5*(.1,+!FA!
+262+234!18,!3*7F,-!)>!5(25I0!FA!18,!1)1.(!3*7F,-!)>!.+0!08)=3B!
C,' '8.34,!1)!18,!bonus_03!0*F+2-,51)-A!)>!18,!5*--,31!,J,-520,L!
$ cd ../bonus_03
-,' K+21!18,!lowest_ctr_by_site.pig!>2(,!.3+!27D(,7,31!18,!>)(()=234L!
.B T21823!18,!3,01,+!FOREACHG!>2(1,-!18,!-,5)-+0!1)!235(*+,!)3(A!-,5)-+0!=8,-,!
18,!.+!=.0!5(25I,+B!
FB '-,.1,!.!3,=!-,(.12)3!)3!18,!(23,!18.1!>)(()=0!18,!FILTER!01.1,7,31!=8258!
5)*310!18,!3*7F,-!)>!-,5)-+0!=21823!18,!5*--,31!4-)*D!
5B M++!.3)18,-!(23,!F,()=!18.1!1)!5.(5*(.1,!18,!5(25IE18-)*48!-.1,!23!.!3,=!>2,(+!
3.7,+!ctr!
+B M>1,-!18,!3,01,+!FOREACHG!0)-1!18,!-,5)-+0!23!.05,3+234!)-+,-!)>!5(25I18-)*48!
-.1,!.3+!+20D(.A!18,!>2-01!18-,,!1)!18,!05-,,3B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
27
D,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!+.1.!23!@?ZHL!
$ pig lowest_ctr_by_site.pig
RP"+*5)#L!T8258!18-,,!021,0!8.6,!18,!()=,01!5(25I!18-)*48!-.1,g!
Q>!A)*!012((!8.6,!127,!-,7.23234G!7)+2>A!A)*-!05-2D1!1)!+20D(.A!18,!18-,,!I,A=)-+0!=218!18,!
8248,01!5(25IE18-)*48!-.1,B!N)*!5.3!5)7D.-,!A)*-!0)(*12)3!1)!18,!
highest_ctr_by_keyword.pig!>2(,!23!18,!sample_solution!+2-,51)-AB!
This is the end of the Exercise
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
28
Hands-On Exercise: Analyzing Disparate
Data Sets with Pig!
8#'*;5+'"3"$45+"S'F)P'J5&&'<$%4*54"'4)IZ5#5#9S'b)5#5#9S'%#/'%#%&FG5#9'*;"'<$)/P4*'+%&"+'
/%*%'<$"Q5)P+&F'"3<)$*"/'A$)I'7P%&4)$"^+'UFKRB'/%*%Z%+"'+)'F)P'4%#')Z+"$Q"'*;"'
"AA"4*+'*;%*')P$'$"4"#*'%/Q"$*5+5#9'4%I<%59#';%+';%/')#'+%&"+,''
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!D-,62)*0!)3,0B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!.3A!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
Step #1: Show Per-Month Sales Before and After Campaign
h,>)-,!=,!D-)5,,+!=218!7)-,!0)D820125.1,+!.3.(A020G!A)*!08)*(+!>2-01!5.(5*(.1,!18,!3*7F,-!
)>!)-+,-0!?*.(5)-,!-,5,26,+!,.58!7)318!>)-!18,!18-,,!7)3180!F,>)-,!)*-!.+!5.7D.243!F,4.3!
9Z,F-*.-A!j!MD-2(G!"#$&<G!.0!=,((!.0!>)-!18,!7)318!+*-234!=8258!)*-!5.7D.243!-.3!9;.AG!
"#$&<B!
C,' '8.34,!1)!18,!+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd $ADIR/exercises/disparate_datasets
-,' YD,3!18,!count_orders_by_period.pig!>2(,!23!A)*-!,+21)-B!T,!8.6,!D-)62+,+!18,!
LOAD!01.1,7,31!.0!=,((!.0!.!FILTER!01.1,7,31!18.1!*0,0!.!-,4*(.-!,JD-,002)3!1)!7.158!
18,!-,5)-+0!23!18,!+.1.!-.34,!A)*/((!.3.(AU,B!;.I,!18,!>)(()=234!.++212)3.(!58.34,0L!
.B Z)(()=234!18,!FILTER!01.1,7,31G!5-,.1,!.!3,=!-,(.12)3!=218!`*01!)3,!>2,(+L!18,!
)-+,-/0!A,.-!.3+!7)318!98231L!*0,!18,!SUBSTRING!F*2(1E23!>*3512)3!1)!,J1-.51!
18,!>2-01!D.-1!)>!18,!order_dtm!>2,(+G!=8258!5)31.230!18,!7)318!.3+!A,.-<B!
FB ')*31!18,!3*7F,-!)>!)-+,-0!23!,.58!)>!18,!7)3180!A)*!,J1-.51,+!23!18,!
D-,62)*0!01,DB!
5B ?20D(.A!18,!5)*31!FA!7)318!1)!18,!05-,,3!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
29
D,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!+.1.!23!@?ZHL!
$ pig count_orders_by_period.pig
RP"+*5)#L!?),0!18,!+.1.!0*44,01!18.1!18,!.+6,-120234!5.7D.243!=,!01.-1,+!23!;.A!(,+!1)!
.!0*F01.312.(!235-,.0,!23!)-+,-0g!
Step #2: Count Advertised Product Sales by Month
Y*-!.3.(A020!>-)7!18,!D-,62)*0!01,D!0*44,010!18.1!0.(,0!235-,.0,+!+-.7.125.((A!18,!0.7,!
7)318!=,!F,4.3!.+6,-120234B!P,J1G!A)*/((!5)7D.-,!18,!0.(,0!)>!18,!0D,52>25!D-)+*51!=,!
.+6,-120,+!9D-)+*51!Q?!n$"c%&%_<!+*-234!18,!0.7,!D,-2)+!1)!0,,!=8,18,-!18,!235-,.0,!23!
0.(,0!=.0!.51*.((A!-,(.1,+!1)!)*-!5.7D.243B!
N)*!=2((!F,!`)23234!1=)!+.1.!0,10!+*-234!1820!D)-12)3!)>!18,!,J,-520,B!H235,!1820!20!18,!>2-01!`)23!
A)*!8.6,!+)3,!=218!e24!+*-234!5(.00G!3)=!20!.!4))+!127,!1)!7,312)3!.!12D!18.1!5.3!8.6,!.!
D-)>)*3+!,>>,51!)3!18,!D,->)-7.35,!)>!A)*-!05-2D1B!Z2(1,-234!)*1!*3=.31,+!+.1.!>-)7!,.58!
-,(.12)3!before!A)*!`)23!18,7G!.0!=,/6,!+)3,!23!)*-!,J.7D(,G!7,.30!18.1!A)*-!05-2D1!=2((!
3,,+!1)!D-)5,00!(,00!+.1.!.3+!=2((!>23208!7)-,!S*25I(AB!T,!=2((!+205*00!0,6,-.(!7)-,!e24!
D,->)-7.35,!12D0!(.1,-!23!5(.00G!F*1!1820!)3,!20!=)-18!(,.-3234!3)=B!
C,' K+21!18,!count_tablet_orders_by_period.pig!>2(,!.3+!27D(,7,31!18,!
>)(()=234L!
.B f)23!18,!1=)!-,(.12)30!)3!18,!order_id!>2,(+!18,A!8.6,!23!5)77)3!
FB '-,.1,!.!3,=!-,(.12)3!>-)7!18,!`)23,+!+.1.!18.1!5)31.230!.!0234(,!>2,(+L!18,!
)-+,-/0!A,.-!.3+!7)318G!0272(.-!1)!=8.1!A)*!+2+!D-,62)*0(A!23!18,!
count_orders_by_period.pig!>2(,B!
5B i-)*D!18,!-,5)-+0!FA!7)318!.3+!18,3!5)*31!18,!-,5)-+0!23!,.58!4-)*D!!
+B ?20D(.A!18,!-,0*(10!1)!A)*-!05-,,3!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
30
-,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!+.1.!23!@?ZHL!
$ pig count_tablet_orders_by_period.pig
RP"+*5)#L!?),0!18,!+.1.!08)=!.3!235-,.0,!23!0.(,0!)>!18,!.+6,-120,+!D-)+*51!
5)--,0D)3+234!1)!18,!7)318!23!=8258!?*.(5)-,/0!5.7D.243!=.0!.5126,g!
Bonus Exercise #1: Calculate Average Order Size
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!,.-(2,-!01,D0!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!5)3123*,!
=218!1820!)D12)3.(!F)3*0!,J,-520,B!
Q1!.DD,.-0!18.1!)*-!.+6,-120234!5.7D.243!=.0!0*55,00>*(!23!4,3,-.1234!3,=!)-+,-0!>)-!
?*.(5)-,B!H235,!=,!0,((!1820!1.F(,1!.1!.!0(2481!()00!1)!.11-.51!3,=!5*01)7,-0G!(,1/0!0,,!2>!
5*01)7,-0!=8)!F*A!1820!1.F(,1!.(0)!F*A!)18,-!182340B!N)*!=2((!=-21,!5)+,!1)!5.(5*(.1,!18,!
.6,-.4,!3*7F,-!)>!21,70!>)-!.((!)-+,-0!18.1!5)31.23!18,!.+6,-120,+!1.F(,1!+*-234!18,!
5.7D.243!D,-2)+B!
C,' '8.34,!1)!18,!bonus_01!0*F+2-,51)-A!)>!18,!5*--,31!,J,-520,L!
$ cd bonus_01
-,' K+21!18,!average_order_size.pig!>2(,!1)!5.(5*(.1,!18,!.6,-.4,!.0!+,05-2F,+!.F)6,B!
T82(,!18,-,!.-,!7*(12D(,!=.A0!1)!.582,6,!1820G!=,!-,5)77,3+!A)*!27D(,7,31!18,!
>)(()=234L!
.B Z2(1,-!18,!)-+,-0!FA!+.1,!9*0234!.!-,4*(.-!,JD-,002)3<!1)!235(*+,!)3(A!18)0,!
D(.5,+!+*-234!18,!5.7D.243!D,-2)+!9;.A!$G!"#$&!18-)*48!;.A!&$G!"#$&<!
FB KJ5(*+,!.3A!)-+,-0!=8258!+)!3)1!5)31.23!18,!.+6,-120,+!D-)+*51!9D-)+*51!Q?!
n$"c%&%_<!
5B '-,.1,!.!3,=!-,(.12)3!5)31.23234!18,!order_id!.3+!product_id!>2,(+0!>)-!
18,0,!)-+,-0B!
+B ')*31!18,!1)1.(!3*7F,-!)>!D-)+*510!D,-!)-+,-!
,B '.(5*(.1,!18,!.6,-.4,!3*7F,-!)>!D-)+*510!>)-!.((!)-+,-0!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
31
D,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!1-A!-*33234!A)*-!05-2D1!.4.2301!18,!+.1.!23!@?ZHL!
$ pig average_order_size.pig
RP"+*5)#L!?),0!18,!+.1.!08)=!18.1!18,!.6,-.4,!)-+,-!5)31.23,+!.1!(,.01!1=)!21,70!23!
.++212)3!1)!18,!1.F(,1!=,!.+6,-120,+g!
Bonus Exercise #2: Segment Customers for Loyalty Program
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!,.-(2,-!01,D0!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!5)3123*,!
=218!1820!)D12)3.(!F)3*0!,J,-520,B!
?*.(5)-,!20!5)302+,-234!01.-1234!.!()A.(1A!-,=.-+0!D-)4-.7B!C820!=2((!D-)62+,!,J5(*026,!
F,3,>210!1)!)*-!F,01!5*01)7,-0G!=8258!=2((!8,(D!*0!1)!-,1.23!18,7B!M3)18,-!.+6.31.4,!20!18.1!
21!=2((!.(0)!.(()=!*0!1)!5.D1*-,!,6,3!7)-,!+.1.!.F)*1!8)=!18,A!08)D!=218!*0m!>)-!,J.7D(,G!=,!
5.3!,.02(A!1-.5I!18,2-!23E01)-,!D*-58.0,0!=8,3!18,0,!5*01)7,-0!426,!*0!18,2-!-,=.-+0!
D-)4-.7!3*7F,-!.1!58,5I)*1B!!!
C)!F,!5)302+,-,+!>)-!18,!D-)4-.7G!.!5*01)7,-!7*01!8.6,!7.+,!.1!(,.01!>26,!D*-58.0,0!>-)7!
?*.(5)-,!+*-234!"#$"B!C8,0,!5*01)7,-0!=2((!F,!0,47,31,+!231)!4-)*D0!F.0,+!)3!18,!1)1.(!
-,1.2(!D-25,!)>!.((!D*-58.0,0!,.58!7.+,!+*-234!18.1!A,.-L!
@&%*5#PIL!e*-58.0,0!1)1.(,+!.1!(,.01!b$#G###!
!)&/L!e*-58.0,0!1)1.(,+!.1!(,.01!bdG###!F*1!(,00!18.3!b$#G###!
K5&Q"$L!e*-58.0,0!1)1.(,+!.1!(,.01!b"Gd##!F*1!(,00!18.3!bdG###!
H235,!=,!.-,!5)302+,-234!18,!1)1.(!0.(,0!D-25,!)>!)-+,-0!23!.++212)3!1)!18,!3*7F,-!)>!)-+,-0!.!
5*01)7,-!8.0!D(.5,+G!3)1!,6,-A!5*01)7,-!=218!.1!(,.01!>26,!)-+,-0!+*-234!"#$"!=2((!S*.(2>AB!
Q3!>.51G!)3(A!.F)*1!)3,!D,-5,31!)>!)*-!5*01)7,-0!=2((!F,!,(242F(,!>)-!7,7F,-082D!23!)3,!)>!
18,0,!18-,,!4-)*D0B!
?*-234!1820!,J,-520,G!A)*!=2((!=-21,!18,!5)+,!3,,+,+!1)!>2(1,-!18,!(201!)>!)-+,-0!F.0,+!)3!+.1,G!
4-)*D!18,7!FA!5*01)7,-!Q?G!5)*31!18,!3*7F,-!)>!)-+,-0!D,-!5*01)7,-G!.3+!18,3!>2(1,-!1820!1)!
,J5(*+,!.3A!5*01)7,-!=8)!+2+!3)1!8.6,!.1!(,.01!>26,!)-+,-0B!N)*!=2((!18,3!`)23!1820!
23>)-7.12)3!=218!18,!)-+,-!+,1.2(0!.3+!D-)+*510!+.1.!0,10!23!)-+,-!1)!5.(5*(.1,!18,!1)1.(!0.(,0!
)>!18)0,!)-+,-0!>)-!,.58!5*01)7,-G!0D(21!18,7!231)!18,!4-)*D0!F.0,+!)3!18,!5-21,-2.!+,05-2F,+!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
32
.F)6,G!.3+!18,3!=-21,!18,!+.1.!>)-!,.58!4-)*D!95*01)7,-!Q?!.3+!1)1.(!0.(,0<!231)!.!0,D.-.1,!
+2-,51)-A!23!@?ZHB!
C,' '8.34,!1)!18,!bonus_02!0*F+2-,51)-A!)>!18,!5*--,31!,J,-520,L!
$ cd ../bonus_02
-,' K+21!18,!loyalty_program.pig!>2(,!.3+!27D(,7,31!18,!01,D0!+,05-2F,+!.F)6,B!C8,!
5)+,!1)!().+!18,!18-,,!+.1.!0,10!A)*!=2((!3,,+!20!.(-,.+A!D-)62+,+!>)-!A)*B!
D,' M>1,-!A)*!8.6,!=-211,3!18,!5)+,G!-*3!21!.4.2301!18,!+.1.!23!@?ZHL!
$ pig loyalty_program.pig
N,' Q>!A)*-!05-2D1!5)7D(,1,+!0*55,00>*((AG!*0,!18,!hadoop fs -getmerge!5)77.3+!1)!
5-,.1,!.!()5.(!1,J1!>2(,!>)-!,.58!4-)*D!0)!A)*!5.3!58,5I!A)*-!=)-I!93)1,!18.1!18,!3.7,!)>!
18,!+2-,51)-A!08)=3!8,-,!7.A!3)1!F,!18,!0.7,!.0!18,!)3,!A)*!58)0,<L!
$ hadoop fs -getmerge /dualcore/loyalty/platinum platinum.txt
$ hadoop fs -getmerge /dualcore/loyalty/gold gold.txt
$ hadoop fs -getmerge /dualcore/loyalty/silver silver.txt
>,' O0,!18,!OPQR!head!.3+o)-!tail!5)77.3+0!1)!58,5I!.!>,=!-,5)-+0!.3+!,30*-,!18.1!18,!
1)1.(!0.(,0!D-25,0!>.((!231)!18,!5)--,51!-.34,0L!
$ head platinum.txt
$ tail gold.txt
$ head silver.txt
V,' Z23.((AG!5)*31!18,!3*7F,-!)>!5*01)7,-0!23!,.58!4-)*DL!
$ wc -l platinum.txt
$ wc -l gold.txt
$ wc -l silver.txt
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
33
This is the end of the Exercise
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
34
Hands-On Exercise: Extending Pig with
Streaming and UDFs!
8#'*;5+'"3"$45+"'F)P'J5&&'P+"'*;"'STREAM'\"FJ)$/'5#'@59'*)'%#%&FG"'I"*%/%*%'A$)I'
7P%&4)$"^+'4P+*)I"$'+"$Q54"'4%&&'$"4)$/5#9+'*)'5/"#*5AF'*;"'4%P+"')A'%'+P//"#'5#4$"%+"'
5#'4)I<&%5#*+,'[)P'J5&&'*;"#'P+"'*;5+'/%*%'5#'4)#bP#4*5)#'J5*;'%'P+"$0/"A5#"/'AP#4*5)#'
*)'<$)<)+"'%'+)&P*5)#'A)$'$"+)&Q5#9'*;"'<$)Z&"I,'
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!D-,62)*0!)3,0B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!.3A!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
Background Information
?*.(5)-,!)*10)*-5,0!210!5.((!5,31,-!)D,-.12)30!.3+!)*-!5)010!8.6,!-,5,31(A!-20,3!+*,!1)!.3!
235-,.0,!23!18,!6)(*7,!)>!5.((0!8.3+(,+!FA!18,0,!.4,310B!O3>)-1*3.1,(AG!=,!+)!3)1!8.6,!
.55,00!1)!18,!5.((!5,31,-/0!+.1.F.0,G!F*1!18,A!D-)62+,!*0!=218!-,5)-+2340!)>!18,0,!5.((0!01)-,+!
23!;e&!>)-7.1B!hA!*0234!e24/0!STREAM!I,A=)-+!1)!236)I,!.!D-)62+,+!eA18)3!05-2D1G!A)*!5.3!
,J1-.51!18,!5.1,4)-A!.3+!127,01.7D!>-)7!18,!>2(,0G!.3+!18,3!.3.(AU,!18.1!+.1.!1)!(,.-3!=8.1!
20!5.*0234!18,!-,5,31!235-,.0,!23!5.((0B!!!
Step #1: Extract Call Metadata
P)1,L!H235,!18,!eA18)3!(2F-.-A!=,!.-,!*0234!>)-!,J1-.51234!18,!1.40!+),03p1!0*DD)-1!@?ZHG!=,!
-*3!1820!05-2D1!23!()5.(!7)+,!)3!.!07.((!0.7D(,!)>!18,!5.((!-,5)-+2340B!h,5.*0,!A)*!=2((!*0,!
e24/0!()5.(!7)+,G!18,-,!=2((!F,!3)!3,,+!1)!V082DW!18,!05-2D1!1)!18,!3)+,0!23!18,!5(*01,-B!
C,' '8.34,!1)!18,!+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd $ADIR/exercises/extending_pig
-,' M!D-)4-.77,-!)3!)*-!1,.7!D-)62+,+!*0!=218!.!eA18)3!05-2D1!9readtags.py<!>)-!
,J1-.51234!18,!7,1.+.1.!>-)7!18,!;e&!>2(,0B!C820!05-2D1!1.I,0!18,!D.18!)>!.!>2(,!)3!18,!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
35
5)77.3+!(23,!.3+!-,1*-30!.!-,5)-+!5)31.23234!>26,!1.FE+,(2721,+!>2,(+0L!18,!>2(,!D.18G!
5.((!5.1,4)-AG!.4,31!Q?G!5*01)7,-!Q?G!.3+!18,!127,01.7D!)>!=8,3!18,!.4,31!.30=,-,+!18,!
5.((BB!
N)*-!>2-01!01,D!20!1)!5-,.1,!.!1,J1!>2(,!5)31.23234!18,!D.180!)>!18,!>2(,0!1)!.3.(AU,G!=218!)3,!
(23,!>)-!,.58!>2(,B!N)*!5.3!,.02(A!5-,.1,!18,!+.1.!23!18,!-,S*2-,+!>)-7.1!FA!5.D1*-234!18,!
)*1D*1!)>!18,!OPQR!find!5)77.3+L!
$ find $ADIR/data/cscalls/ -name '*.mp3' > call_list.txt
D,' K+21!18,!extract_metadata.pig!>2(,!.3+!7.I,!18,!>)(()=234!58.34,0L!
.B [,D(.5,!18,!8.-+5)+,+!D.-.7,1,-!23!18,!SUBSTRING!>*3512)3!*0,+!1)!>2(1,-!
FA!7)318!=218!.!D.-.7,1,-!3.7,+!MONTH!=8)0,!6.(*,!A)*!5.3!.00243!)3!18,!
5)77.3+!(23,B!C820!=2((!7.I,!21!,.0A!1)!58,5I!18,!(,.+234!5.((!5.1,4)-2,0!>)-!
+2>>,-,31!7)3180!=218)*1!8.6234!1)!,+21!18,!05-2D1B!
FB M++!18,!5)+,!3,5,00.-A!1)!5)*31!5.((0!FA!5.1,4)-A!
5B ?20D(.A!18,!1)D!18-,,!5.1,4)-2,0!9F.0,+!)3!3*7F,-!)>!5.((0<!1)!18,!05-,,3B!
N,' Y35,!A)*!8.6,!7.+,!18,0,!58.34,0G!-*3!A)*-!05-2D1!1)!58,5I!18,!1)D!18-,,!5.1,4)-2,0!23!
18,!7)318!F,>)-,!?*.(5)-,!01.-1,+!18,!)3(23,!.+6,-120234!5.7D.243L!
$ pig -x local -param MONTH=2013-04 extract_metadata.pig
>,' P)=!-*3!18,!05-2D1!.4.23G!1820!127,!0D,52>A234!18,!D.-.7,1,-!>)-!;.AL!
$ pig -x local -param MONTH=2013-05 extract_metadata.pig
C8,!)*1D*1!08)*(+!5)3>2-7!18.1!3)1!)3(A!20!5.((!6)(*7,!0*F01.312.((A!8248,-!23!;.AG!18,!
SHIPPING_DELAY!5.1,4)-A!8.0!7)-,!18.3!1=25,!18,!.7)*31!)>!5.((0!.0!18,!)18,-!1=)B!
Step #2: Choose Best Location for Distribution Center
C8,!.3.(A020!A)*!`*01!5)7D(,1,+!*35)6,-,+!.!D-)F(,7B!?*.(5)-,/0!:25,!e-,02+,31!)>!
YD,-.12)30!(.*358,+!.3!236,0124.12)3!F.0,+!)3!A)*-!>23+2340!.3+!8.0!3)=!5)3>2-7,+!18,!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
36
5.*0,L!)*-!)3(23,!.+6,-120234!5.7D.243!20!23+,,+!.11-.51234!7.3A!3,=!5*01)7,-0G!F*1!7.3A!
)>!18,7!(26,!>.-!>-)7!?*.(5)-,/0!)3(A!+201-2F*12)3!5,31,-!23!e.()!M(1)G!'.(2>)-32.B!M((!)*-!
082D7,310!.-,!1-.30D)-1,+!FA!1-*5IG!0)!.3!)-+,-!5.3!1.I,!*D!1)!>26,!+.A0!1)!+,(26,-!
+,D,3+234!)3!18,!5*01)7,-/0!()5.12)3B!!
!
C)!0)(6,!1820!D-)F(,7G!?*.(5)-,!=2((!)D,3!.!3,=!+201-2F*12)3!5,31,-!1)!27D-)6,!082DD234!
127,0B!!
C8,!qQe!5)+,0!>)-!18,!18-,,!D-)D)0,+!021,0!.-,!#"$$_G!]&$&^G!.3+!c_"&cB!N)*!=2((!())I!*D!
18,!(.121*+,!.3+!()3421*+,!)>!18,0,!qQe!5)+,0G!.0!=,((!.0!18,!qQe!5)+,0!)>!5*01)7,-0!=8)!8.6,!
-,5,31(A!)-+,-,+G!*0234!.!0*DD(2,+!+.1.!0,1B!Y35,!A)*!8.6,!18,!5))-+23.1,0G!A)*!=2((!236)I,!
18,!*0,!18,!HaversineDistInMiles!O?Z!+201-2F*1,+!=218!?.1.Z*!1)!+,1,-723,!8)=!>.-!
,.58!5*01)7,-!20!>-)7!18,!18-,,!+.1.!5,31,-0B!N)*!=2((!18,3!5.(5*(.1,!18,!.6,-.4,!+201.35,!
>)-!.((!5*01)7,-0!1)!,.58!)>!18,0,!+.1.!5,31,-0!23!)-+,-!1)!D-)D)0,!18,!)3,!18.1!=2((!F,3,>21!
18,!7)01!5*01)7,-0B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
37
C,' M++!18,!1.FE+,(2721,+!>2(,!7.DD234!qQe!5)+,0!1)!(.121*+,o()3421*+,!D)2310!1)!@?ZHL!
$ hadoop fs -mkdir /dualcore/distribution
$ hadoop fs -put $ADIR/data/latlon.tsv \
/dualcore/distribution
-,' M!5)E=)-I,-!D-)62+,+!.!05-2D1!9create_cust_location_data.pig<!18.1!>23+0!18,!
qQe!5)+,0!>)-!5*01)7,-0!=8)!D(.5,+!)-+,-0!+*-234!18,!D,-2)+!=,!-.3!18,!.+!5.7D.243B!Q1!
.(0)!,J5(*+,0!18,!)3,0!=8)!.-,!.(-,.+A!5()0,!1)!)*-!5*--,31!>.52(21AG!.0!=,((!.0!5*01)7,-0!
23!18,!-,7)1,!01.1,0!)>!M(.0I.!.3+!@.=.22!9=8,-,!)-+,-0!.-,!082DD,+!FA!.2-D(.3,<B!C8,!
e24!X.123!5)+,!`)230!18,0,!5*01)7,-0/!qQe!5)+,0!=218!18,!(.121*+,o()3421*+,!+.1.!0,1!
*D().+,+!23!18,!D-,62)*0!01,DG!18,3!=-21,0!18)0,!18-,,!5)(*730!9qQe!5)+,G!(.121*+,G!.3+!
()3421*+,<!.0!18,!-,0*(1B!!KJ.723,!18,!05-2D1!1)!0,,!8)=!21!=)-I0G!.3+!18,3!-*3!21!1)!
5-,.1,!18,!5*01)7,-!()5.12)3!+.1.!23!@?ZHL!
$ pig create_cust_location_data.pig
D,' N)*!=2((!*0,!18,!HaversineDistInMiles!>*3512)3!1)!5.(5*(.1,!18,!+201.35,!>-)7!
,.58!5*01)7,-!1)!,.58!)>!18,!18-,,!D-)D)0,+!=.-,8)*0,!()5.12)30B!C820!>*3512)3!
-,S*2-,0!*0!1)!0*DD(A!18,!(.121*+,!.3+!()3421*+,!)>!F)18!18,!5*01)7,-!.3+!18,!
=.-,8)*0,B!T82(,!18,!05-2D1!A)*!`*01!,J,5*1,+!5-,.1,+!18,!(.121*+,!.3+!()3421*+,!>)-!
,.58!5*01)7,-G!A)*!7*01!5-,.1,!.!+.1.!0,1!5)31.23234!18,!qQe!5)+,G!(.121*+,G!.3+!
()3421*+,!>)-!18,0,!=.-,8)*0,0B!?)!1820!FA!-*33234!18,!>)(()=234!OPQR!5)77.3+L!
$ egrep '^02118|^63139|^78237' \
$ADIR/data/latlon.tsv > warehouses.tsv
N,' P,J1G!.++!1820!>2(,!1)!@?ZHL!
$ hadoop fs -put warehouses.tsv /dualcore/distribution
! !
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
38
>,' K+21!18,!calc_average_distances.pig!>2(,B!C8,!O?Z!20!.(-,.+A!-,4201,-,+!.3+!.3!
.(2.0!>)-!1820!>*3512)3!3.7,+!DIST!20!+,>23,+!.1!18,!1)D!)>!18,!05-2D1G!`*01!F,>)-,!18,!1=)!
+.1.!0,10!A)*!=2((!*0,!.-,!().+,+B!N)*!3,,+!1)!5)7D(,1,!18,!-,01!)>!1820!05-2D1L!!
.B '-,.1,!.!-,5)-+!>)-!,6,-A!5)7F23.12)3!)>!5*01)7,-!.3+!D-)D)0,+!+201-2F*12)3!
5,31,-!()5.12)3!
FB O0,!18,!>*3512)3!1)!5.(5*(.1,!18,!+201.35,!>-)7!18,!5*01)7,-!1)!18,!=.-,8)*0,!
5B '.(5*(.1,!18,!.6,-.4,!+201.35,!>)-!.((!5*01)7,-0!1)!,.58!=.-,8)*0,!
+B ?20D(.A!18,!-,0*(1!1)!18,!05-,,3!
V,' M>1,-!A)*!8.6,!>23208,+!27D(,7,31234!18,!e24!X.123!5)+,!+,05-2F,+!.F)6,G!-*3!18,!05-2D1L!
$ pig calc_average_distances.pig
RP"+*5)#L!T8258!)>!18,0,!18-,,!D-)D)0,+!qQe!5)+,0!8.0!18,!()=,01!.6,-.4,!72(,.4,!1)!
)*-!5*01)7,-0g
This is the end of the Exercise
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
39
Hands-On Exercise: Running Hive
Queries from the Shell, Scripts, and Hue!
8#'*;5+'"3"$45+"'F)P'J5&&'J$5*"'.5Q"RB'aP"$5"+'*)'%#%&FG"'/%*%'5#'.5Q"'*%Z&"+'*;%*';%Q"'
Z""#'<)<P&%*"/'J5*;'/%*%'F)P'<&%4"/'5#'.7MK'/P$5#9'"%$&5"$'"3"$45+"+,''
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!D-,62)*0!)3,0B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!.3A!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
Step #1: Running a Query from the Hive Shell
?*.(5)-,!-.3!.!5)31,01!23!=8258!5*01)7,-0!D)01,+!62+,)0!)>!231,-,01234!=.A0!1)!*0,!18,2-!
3,=!1.F(,10B!M!bdG###!D-2U,!=2((!F,!.=.-+,+!1)!18,!5*01)7,-!=8)0,!62+,)!-,5,26,+!18,!
8248,01!-.1234B!
@)=,6,-G!18,!-,4201-.12)3!+.1.!=.0!()01!+*,!1)!.3![?h;H!5-.08G!.3+!18,!)3(A!23>)-7.12)3!
=,!8.6,!20!>-)7!18,!62+,)0B!C8,!=233234!5*01)7,-!231-)+*5,+!8,-0,(>!)3(A!.0!Vh-2+4,1!>-)7!
k.30.0!'21AW!23!8,-!62+,)B!
N)*!=2((!3,,+!1)!-*3!.!@26,!S*,-A!18.1!2+,312>2,0!18,!=233,-/0!-,5)-+!23!)*-!5*01)7,-!
+.1.F.0,!0)!18.1!=,!5.3!0,3+!8,-!18,!bdG###!D-2U,B!
C,' '8.34,!1)!18,!+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd $ADIR/exercises/analyzing_sales
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
40
-,' H1.-1!@26,L!
$ hive
Hive Prompt
To make it easier to copy queries and paste them into your terminal window, we do not
show the hive> prompt in subsequent steps. Steps prefixed with $ should be executed
on the UNIX command line; the rest should be run in Hive unless otherwise noted.
D,' ;.I,!18,!S*,-A!-,0*(10!,.02,-!1)!-,.+!FA!0,11234!18,!D-)D,-1A!18.1!=2((!7.I,!@26,!08)=!
5)(*73!8,.+,-0L!
set hive.cli.print.header=true;
N,' M((!A)*!I3)=!.F)*1!18,!=233,-!20!18.1!8,-!3.7,!20!h-2+4,1!.3+!08,!(26,0!23!k.30.0!'21AB!
O0,!@26,p0!XQkK!)D,-.1)-!1)!+)!.!=2(+5.-+!0,.-58!>)-!3.7,0!0*58!.0!rh-2+4,1rG!
rh-2+4,11,r!)-!rh-2+4211,rB![,7,7F,-!1)!>2(1,-!)3!18,!5*01)7,-p0!521AB!
RP"+*5)#L!T8258!5*01)7,-!+2+!A)*-!S*,-A!2+,312>A!.0!18,!=233,-!)>!18,!bdG###!D-2U,g!
Step #2: Running a Query Directly from the Command Line
N)*!=2((!3)=!-*3!.!1)DEP!S*,-A!1)!2+,312>A!18,!18-,,!7)01!,JD,3026,!D-)+*510!18.1!?*.(5)-,!
5*--,31(A!)>>,-0B!
C,' KJ21!18,!@26,!08,((!.3+!-,1*-3!1)!18,!5)77.3+!(23,L!
quit;
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
41
-,' M(18)*48!@26,aX!01.1,7,310!.-,!1,-723.1,+!FA!0,725)()30!23!18,!@26,!08,((G!21!20!3)1!
3,5,00.-A!1)!+)!1820!=8,3!-*33234!.!0234(,!S*,-A!>-)7!18,!5)77.3+!(23,!*0234!18,!-e!
)D12)3B![*3!18,!>)(()=234!5)77.3+!1)!,J,5*1,!18,!S*)1,+!@26,aX!01.1,7,31L!
$ hive -e 'SELECT price, brand, name FROM PRODUCTS ORDER BY
price DESC LIMIT 3'
RP"+*5)#L!T8258!18-,,!D-)+*510!.-,!18,!7)01!,JD,3026,g!
Step #3: Running a HiveQL Script
C8,!-*(,0!>)-!18,!5)31,01!+,05-2F,+!,.-(2,-!-,S*2-,!18.1!18,!=233,-!F)*481!18,!.+6,-120,+!
1.F(,1!>-)7!?*.(5)-,!F,1=,,3!;.A!$G!"#$&!.3+!;.A!&$G!"#$&B!h,>)-,!=,!5.3!.*18)-2U,!)*-!
.55)*31234!+,D.-17,31!1)!D.A!18,!bdG###!D-2U,G!A)*!7*01!,30*-,!18.1!h-2+4,1!20!,(242F(,B!
H235,!1820!S*,-A!236)(6,0!`)23234!+.1.!>-)7!0,6,-.(!1.F(,0G!21/0!.!D,->,51!5.0,!>)-!-*33234!21!.0!
.!@26,!05-2D1B!
C,' H1*+A!18,!@26,aX!5)+,!>)-!18,!S*,-A!1)!(,.-3!8)=!21!=)-I0L!
$ cat verify_tablet_order.hql
-,' KJ,5*1,!18,!@26,aX!05-2D1!*0234!18,!826,!5)77.3+/0!-f!)D12)3L!
$ hive -f verify_tablet_order.hql
RP"+*5)#L!?2+!h-2+4,1!)-+,-!18,!.+6,-120,+!1.F(,1!23!;.Ag!
Step #4: Running a Query Through Hue and Beeswax
M3)18,-!=.A!1)!-*3!@26,!S*,-2,0!20!18-)*48!A)*-!T,F!F-)=0,-!*0234!@*,/0!h,,0=.J!
.DD(25.12)3B!C820!20!,0D,52.((A!5)36,32,31!2>!A)*!*0,!7)-,!18.3!)3,!5)7D*1,-!j!)-!2>!A)*!*0,!
.!+,625,!90*58!.0!.!1.F(,1<!18.1!203/1!5.D.F(,!)>!-*33234!@26,!210,(>!j!F,5.*0,!21!+),0!3)1!
-,S*2-,!.3A!0)>1=.-,!)18,-!18.3!.!F-)=0,-B!!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
42
C,' H1.-1!18,!Z2-,>)J!T,F!F-)=0,-!FA!5(25I234!18,!)-.34,!.3+!F(*,!25)3!3,.-!18,!1)D!)>!18,!
:;!=23+)=G!`*01!1)!18,!-2481!)>!18,!HA01,7!7,3*B!Y35,!Z2-,>)J!01.-10G!1AD,!
http://localhost:8888/!231)!18,!.++-,00!F.-G!.3+!18,3!821!18,!,31,-!I,AB!
-,' M>1,-!.!>,=!0,5)3+0G!A)*!08)*(+!0,,!@*,/0!()423!05-,,3B!!K31,-!training!23!F)18!18,!
*0,-3.7,!.3+!D.00=)-+!>2,(+0G!.3+!18,3!5(25I!18,!VH243!Q3W!F*11)3B!Q>!D-)7D1,+!1)!
-,7,7F,-!18,!D.00=)-+G!+,5(23,!FA!8211234!18,!KH'!I,A!0)!A)*!5.3!D-.5125,!1820!01,D!
.4.23!(.1,-!2>!A)*!58))0,B!!
M(18)*48!0,6,-.(!@*,!.DD(25.12)30!.-,!.6.2(.F(,!18-)*48!18,!25)30!.1!18,!1)D!)>!18,!D.4,G!
18,!h,,0=.J!S*,-A!,+21)-!20!08)=3!FA!+,>.*(1B!!
D,' H,(,51!default!>-)7!18,!+.1.F.0,!(201!)3!18,!(,>1!02+,!)>!18,!D.4,B!
N,' T-21,!.!S*,-A!23!18,!1,J1!.-,.!18.1!=2((!5)*31!18,!3*7F,-!)>!-,5)-+0!23!18,!customers!
1.F(,G!.3+!18,3!5(25I!18,!VKJ,5*1,W!F*11)3B!
RP"+*5)#L!@)=!7.3A!5*01)7,-0!+),0!?*.(5)-,!0,-6,g!
>,' '(25I!18,!Va*,-A!K+21)-W!(23I!23!18,!*DD,-!(,>1!5)-3,-G!.3+!18,3!=-21,!.3+!-*3!.!S*,-A!1)!
>23+!18,!1,3!01.1,0!=218!18,!7)01!5*01)7,-0B!
RP"+*5)#L!T8258!01.1,!8.0!18,!7)01!5*01)7,-0g!
Bonus Exercise #1: Calculating Revenue and Profit
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!,.-(2,-!01,D0!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!5)3123*,!
=218!1820!)D12)3.(!F)3*0!,J,-520,B!!
H,6,-.(!7)-,!S*,012)30!.-,!+,05-2F,+!F,()=!.3+!A)*!=2((!3,,+!1)!=-21,!18,!@26,aX!5)+,!1)!
.30=,-!18,7B!N)*!5.3!*0,!=8258,6,-!7,18)+!A)*!(2I,!F,01G!235(*+234!@26,!08,((G!@26,!H5-2D1G!
)-!@*,G!1)!-*3!A)*-!S*,-2,0B!!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
43
T8258!1)D!18-,,!D-)+*510!8.0!?*.(5)-,!0)(+!7)-,!)>!18.3!.3A!)18,-g!
.5#*L![,7,7F,-!18.1!2>!A)*!*0,!.!GROUP BY!5(.*0,!23!@26,G!A)*!7*01!4-)*D!FA!.((!>2,(+0!
(201,+!23!18,!SELECT!5(.*0,!18.1!.-,!3)1!D.-1!)>!.3!.44-,4.1,!>*3512)3B!
T8.1!=.0!?*.(5)-,/0!1)1.(!-,6,3*,!23!;.AG!"#$&g!!
T8.1!=.0!?*.(5)-,/0!4-)00!D-)>21!90.(,0!D-25,!723*0!5)01<!23!;.AG!"#$&g!
C8,!-,0*(10!)>!18,!.F)6,!S*,-2,0!.-,!08)=3!23!5,310B![,=-21,!18,!4-)00!D-)>21!S*,-A!1)!
>)-7.1!18,!6.(*,!23!+)((.-0!.3+!5,310!9,B4BG!b"######B##<B!C)!+)!1820G!A)*!5.3!+262+,!18,!
D-)>21!FA!$##!.3+!>)-7.1!18,!-,0*(1!*0234!18,!PRINTF!>*3512)3!.3+!18,!>)-7.1!01-234!
"$%.2f"B!
C8,-,!.-,!0,6,-.(!=.A0!A)*!5)*(+!=-21,!,.58!S*,-AG!.3+!A)*!5.3!>23+!)3,!0)(*12)3!>)-!,.58!
D-)F(,7!23!18,!bonus_01/sample_solution/!+2-,51)-AB!
This is the end of the Exercise
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
44
Hands-On Exercise: Data Management
with Hive!
8#'*;5+'"3"$45+"'F)P'J5&&'<$%4*54"'P+5#9'+"Q"$%&'4)II)#'*"4;#5aP"+'A)$'4$"%*5#9'%#/'
<)<P&%*5#9'.5Q"'*%Z&"+,'[)P'J5&&'%&+)'4$"%*"'%#/'aP"$F'%'*%Z&"'4)#*%5#5#9'"%4;')A'*;"'
4)I<&"3'A5"&/'*F<"+'J"'+*P/5"/6'%$$%FS'I%<S'%#/'+*$P4*,''
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!D-,62)*0!)3,0B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!.3A!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
M++212)3.((AG!7.3A!)>!18,!5)77.3+0!A)*!=2((!-*3!*0,!,362-)37,31.(!6.-2.F(,0!.3+!-,(.126,!
>2(,!D.180B!Q1!20!27D)-1.31!18.1!A)*!*0,!18,!@26,!08,((G!-.18,-!18.3!@*,!)-!.3)18,-!231,->.5,G!.0!
A)*!=)-I!18-)*48!18,!01,D0!18.1!>)(()=B!!
Step #1: Use Sqoop’s Hive Import Option to Create a Table
N)*!*0,+!HS))D!23!.3!,.-(2,-!,J,-520,!1)!27D)-1!+.1.!>-)7!;AHaX!231)!@?ZHB!HS))D!5.3!.(0)!
5-,.1,!.!@26,!1.F(,!=218!18,!0.7,!>2,(+0!.0!18,!0)*-5,!1.F(,!23!.++212)3!1)!27D)-1234!18,!
-,5)-+0G!=8258!0.6,0!A)*!>-)7!8.6234!1)!=-21,!.!CREATE TABLE!01.1,7,31B!
C,' '8.34,!1)!18,!+2-,51)-A!>)-!1820!@.3+0EY3!KJ,-520,L!
$ cd $ADIR/exercises/data_mgmt
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
45
-,' KJ,5*1,!18,!>)(()=234!5)77.3+!1)!27D)-1!18,!suppliers!1.F(,!>-)7!;AHaX!.0!.!3,=!
@26,E7.3.4,+!1.F(,L!
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--table suppliers \
--hive-import
D,' H1.-1!@26,L!
$ hive
N,' Q1!20!.(=.A0!.!4))+!2+,.!1)!6.(2+.1,!+.1.!.>1,-!.++234!21B!KJ,5*1,!18,!@26,!S*,-A!08)=3!
F,()=!1)!5)*31!18,!3*7F,-!)>!0*DD(2,-0!23!C,J.0L!
SELECT COUNT(*) FROM suppliers WHERE state='TX';
C8,!S*,-A!08)*(+!08)=!18.1!323,!-,5)-+0!7.158B!
Step #2: Create an External Table in Hive
N)*!27D)-1,+!+.1.!>-)7!18,!employees!1.F(,!23!;AHaX!23!.3!,.-(2,-!,J,-520,G!F*1!21!=)*(+!
F,!5)36,32,31!1)!F,!.F(,!1)!S*,-A!1820!>-)7!@26,B!H235,!18,!+.1.!.(-,.+A!,J2010!23!@?ZHG!1820!
20!.!4))+!)DD)-1*321A!1)!*0,!.3!,J1,-3.(!1.F(,B!!
! !
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
46
C,' T-21,!.3+!,J,5*1,!.!@26,aX!01.1,7,31!1)!5-,.1,!.3!,J1,-3.(!1.F(,!>)-!18,!1.FE+,(2721,+!
-,5)-+0!23!@?ZH!.1!/dualcore/employeesB!C8,!+.1.!>)-7.1!20!08)=3!F,()=L!
!
Field&Name&
Field&Type&
,7Dl2+!
STRING
>3.7,!
STRING
(3.7,!
STRING
.++-,00!
STRING
521A!
STRING
01.1,!
STRING
U2D5)+,!
STRING
`)Fl121(,!
STRING
,7.2(!
STRING
.5126,!
STRING
0.(.-A!
INT
-,' [*3!18,!>)(()=234!@26,!S*,-A!1)!6,-2>A!18.1!A)*!8.6,!5-,.1,+!18,!1.F(,!5)--,51(AB!!
SELECT job_title, COUNT(*) AS num
FROM employees
GROUP BY job_title
ORDER BY num DESC
LIMIT 3;
Q1!08)*(+!08)=!18.1!H.(,0!M00)52.1,G!'.082,-G!.3+!M00201.31!;.3.4,-!.-,!18,!18-,,!7)01!
5)77)3!`)F!121(,0!.1!?*.(5)-,B!
Step #3: Create and Load a Hive-Managed Table
P,J1G!A)*!=2((!5-,.1,!.3+!18,3!().+!.!@26,E7.3.4,+!1.F(,!=218!D-)+*51!-.12340!+.1.B!
C,' '-,.1,!.!1.F(,!3.7,+!ratings!>)-!01)-234!1.FE+,(2721,+!-,5)-+0!*0234!1820!01-*51*-,L!
!!
Field&Name&
Field&Type&
D)01,+!
TIMESTAMP
5*01l2+!
INT
D-)+l2+!
INT
-.1234!
TINYINT
7,00.4,!
STRING
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
47
-,' H8)=!18,!1.F(,!+,05-2D12)3!.3+!6,-2>A!18.1!210!>2,(+0!8.6,!18,!5)--,51!)-+,-G!3.7,0G!.3+!
1AD,0L!
DESCRIBE ratings;
D,' P,J1G!)D,3!.!0,D.-.1,!1,-723.(!=23+)=!9Z2(,!Es!!YD,3!C,-723.(<!0)!A)*!5.3!-*3!18,!
>)(()=234!08,((!5)77.3+B!C820!=2((!D)D*(.1,!18,!1.F(,!+2-,51(A!FA!*0234!18,!hadoop fs!
5)77.3+!1)!5)DA!D-)+*51!-.12340!+.1.!>-)7!"#$"!1)!18.1!+2-,51)-A!23!@?ZHL!
$ hadoop fs -put $ADIR/data/ratings_2012.txt \
/user/hive/warehouse/ratings
X,.6,!18,!=23+)=!)D,3!.>1,-=.-+0!0)!18.1!A)*!5.3!,.02(A!0=2158!F,1=,,3!@26,!.3+!18,!
5)77.3+!D-)7D1B!
N,' P,J1G!6,-2>A!18.1!@26,!5.3!-,.+!18,!+.1.!=,!`*01!.++,+B![*3!18,!>)(()=234!S*,-A!23!@26,!
1)!5)*31!18,!3*7F,-!)>!-,5)-+0!23!1820!1.F(,!918,!-,0*(1!08)*(+!F,!%]%<L!
SELECT COUNT(*) FROM ratings;
>,' M3)18,-!=.A!1)!().+!+.1.!231)!.!@26,!1.F(,!20!18-)*48!18,!LOAD DATA!5)77.3+B!C8,!
3,J1!>,=!5)77.3+0!=2((!(,.+!A)*!18-)*48!18,!D-)5,00!)>!5)DA234!.!()5.(!>2(,!1)!@?ZH!
.3+!().+234!21!231)!@26,B!Z2-01G!5)DA!18,!"#$&!-.12340!+.1.!1)!@?ZHL!
$ hadoop fs -put $ADIR/data/ratings_2013.txt /dualcore
V,' :,-2>A!18.1!18,!>2(,!20!18,-,L!
$ hadoop fs -ls /dualcore/ratings_2013.txt
W,' O0,!18,!LOAD DATA!01.1,7,31!23!@26,!1)!().+!18.1!>2(,!231)!18,!ratings!1.F(,L!
LOAD DATA INPATH '/dualcore/ratings_2013.txt' INTO TABLE
ratings;
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
48
L,' C8,!LOAD DATA INPATH!5)77.3+!moves!18,!>2(,!1)!18,!1.F(,/0!+2-,51)-AB!:,-2>A!18.1!
18,!>2(,!20!3)!()34,-!D-,0,31!23!18,!)-2423.(!+2-,51)-AL!
$ hadoop fs -ls /dualcore/ratings_2013.txt
T,' :,-2>A!18.1!18,!>2(,!20!08)=3!.()3402+,!18,!"#$"!-.12340!+.1.!23!18,!1.F(,/0!+2-,51)-AL!
$ hadoop fs -ls /user/hive/warehouse/ratings
CX,' Z23.((AG!5)*31!18,!-,5)-+0!23!18,!-.12340!1.F(,!1)!,30*-,!18.1!.((!"$G^^c!.-,!.6.2(.F(,L!
SELECT COUNT(*) FROM ratings;
Step #4: Create, Load, and Query a Table with Complex Fields
?*.(5)-,!-,5,31(A!01.-1,+!.!()A.(1A!D-)4-.7!1)!-,=.-+!)*-!F,01!5*01)7,-0B!M!5)((,.4*,!8.0!
.(-,.+A!D-)62+,+!*0!=218!.!0.7D(,!)>!18,!+.1.!18.1!5)31.230!23>)-7.12)3!.F)*1!5*01)7,-0!
=8)!8.6,!0243,+!*D!>)-!18,!D-)4-.7G!235(*+234!18,2-!D8)3,!3*7F,-0!9.0!.!7.D<G!.!(201!)>!D.01!
)-+,-!Q?0!9.0!.3!.--.A<G!.3+!.!01-*51!18.1!0*77.-2U,0!18,!72327*7G!7.J27*7G!.6,-.4,G!.3+!
1)1.(!6.(*,!)>!D.01!)-+,-0B!N)*!=2((!5-,.1,!18,!1.F(,G!D)D*(.1,!21!=218!18,!D-)62+,+!+.1.G!.3+!
18,3!-*3!.!>,=!S*,-2,0!1)!D-.5125,!-,>,-,35234!18,0,!1AD,0!)>!>2,(+0B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
49
C,' [*3!18,!>)(()=234!01.1,7,31!23!@26,!1)!5-,.1,!18,!1.F(,L!
CREATE TABLE loyalty_program
(cust_id INT,
fname STRING,
lname STRING,
email STRING,
level STRING,
phone MAP<STRING, STRING>,
order_ids ARRAY<INT>,
order_value STRUCT<min:INT,
max:INT,
avg:INT,
total:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
-,' KJ.723,!18,!+.1.!23!loyalty_data.txt!1)!0,,!8)=!21!5)--,0D)3+0!1)!18,!>2,(+0!23!18,!
1.F(,!.3+!18,3!().+!21!231)!@26,L!
LOAD DATA LOCAL INPATH 'loyalty_data.txt' INTO TABLE
loyalty_program;
D,' [*3!.!S*,-A!1)!0,(,51!18,!HOME!D8)3,!3*7F,-!98231L!7.D!I,A0!.-,!5.0,E0,302126,<!>)-!
5*01)7,-!Q?!$"##_]]B!N)*!08)*(+!0,,!%#_EdddE%^$%!.0!18,!-,0*(1B!
N,' H,(,51!18,!182-+!,(,7,31!>-)7!18,!order_ids!.--.A!>)-!5*01)7,-!Q?!$"##_]]!98231L!
,(,7,310!.-,!23+,J,+!>-)7!U,-)<B!C8,!S*,-A!08)*(+!-,1*-3!d"c_d#dB!
>,' H,(,51!18,!total!.11-2F*1,!>-)7!18,!order_value!01-*51!>)-!5*01)7,-!Q?!$"##_]]B!
C8,!S*,-A!08)*(+!-,1*-3!%#$_c%B!
!
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
50
Bonus Exercise #1: Alter and Drop a Table
Q>!A)*!8.6,!0*55,00>*((A!>23208,+!18,!7.23!,J,-520,!.3+!012((!8.6,!127,G!>,,(!>-,,!1)!5)3123*,!
=218!1820!F)3*0!,J,-520,B!N)*!5.3!5)7D.-,!A)*-!=)-I!.4.2301!18,!>2(,0!>)*3+!23!18,!
bonus_01/sample_solution/!0*F+2-,51)-AB!
C,' O0,!ALTER TABLE!1)!-,3.7,!18,!level!5)(*73!1)!statusB!
-,' O0,!18,!DESCRIBE!5)77.3+!)3!18,!loyalty_program!1.F(,!1)!6,-2>A!18,!58.34,B!
D,' O0,!ALTER TABLE!1)!-,3.7,!18,!,312-,!1.F(,!1)!reward_programB!
N,' M(18)*48!18,!ALTER TABLE!5)77.3+!)>1,3!-,S*2-,0!18.1!=,!7.I,!.!5)--,0D)3+234!
58.34,!1)!18,!+.1.!23!@?ZHG!-,3.7234!.!1.F(,!)-!5)(*73!+),0!3)1B!N)*!5.3!6,-2>A!1820!FA!
-*33234!.!S*,-A!)3!18,!1.F(,!*0234!18,!3,=!3.7,0!918,!-,0*(1!08)*(+!F,!VHQX:K[W<L!
SELECT status FROM reward_program WHERE cust_id = 1200866;
>,' M0!0)7,127,0!8.DD,30!23!18,!5)-D)-.1,!=)-(+G!D-2)-212,0!8.6,!082>1,+!.3+!18,!D-)4-.7!
20!3)=!5.35,(,+B!?-)D!18,!reward_program!1.F(,B!
This is the end of the Exercise
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
51
Optional Hands-On Exercise: Gaining
Insight with Sentiment Analysis!
8#'*;5+')<*5)#%&'"3"$45+"S'F)P'J5&&'P+"'.5Q"c+'*"3*'<$)4"++5#9'A"%*P$"+'*)'%#%&FG"'
4P+*)I"$+^'4)II"#*+'%#/'<$)/P4*'$%*5#9+,'[)P'J5&&'P#4)Q"$'<$)Z&"I+'%#/'<$)<)+"'
<)*"#*5%&'+)&P*5)#+,'
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!D-,62)*0!)3,0B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!.3A!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
Background Information
'*01)7,-!-.12340!.3+!>,,+F.5I!.-,!4-,.1!0)*-5,0!)>!23>)-7.12)3!>)-!F)18!5*01)7,-0!.3+!
-,1.2(,-0!(2I,!?*.(5)-,B!
@)=,6,-G!5*01)7,-!5)77,310!.-,!1AD25.((A!>-,,E>)-7!1,J1!.3+!7*01!F,!8.3+(,+!+2>>,-,31(AB!
Z)-1*3.1,(AG!@26,!D-)62+,0!,J1,3026,!0*DD)-1!>)-!1,J1!D-)5,00234B!
Step #1: Analyze Numeric Product Ratings
h,>)-,!+,(6234!231)!1,J1!D-)5,00234G!A)*/((!F,423!FA!.3.(AU234!18,!3*7,-25!-.12340!5*01)7,-0!
8.6,!.00243,+!1)!6.-2)*0!D-)+*510B!
C,' '8.34,!1)!18,!+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd $ADIR/exercises/sentiment
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
52
-,' H1.-1!@26,!.3+!*0,!18,!DESCRIBE!5)77.3+!1)!-,723+!A)*-0,(>!)>!18,!1.F(,/0!01-*51*-,B!!
D,' T,!=.31!1)!>23+!18,!D-)+*51!18.1!5*01)7,-0!(2I,!7)01G!F*1!7*01!4*.-+!.4.2301!F,234!
720(,+!FA!D-)+*510!18.1!8.6,!>,=!-.12340!.00243,+B![*3!18,!>)(()=234!S*,-A!1)!>23+!18,!
D-)+*51!=218!18,!8248,01!.6,-.4,!.7)34!.((!18)0,!=218!.1!(,.01!d#!-.12340L!
SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating DESC
LIMIT 1;
N,' [,=-21,G!.3+!18,3!,J,5*1,G!18,!S*,-A!.F)6,!1)!>23+!18,!D-)+*51!=218!18,!lowest!.6,-.4,!
.7)34!D-)+*510!=218!.1!(,.01!d#!-.12340B!N)*!08)*(+!0,,!18.1!18,!-,0*(1!20!D-)+*51!Q?!
$"c%]c&!=218!.3!.6,-.4,!-.1234!)>!$B$#B!
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
53
Step #2: Analyze Rating Comments
T,!)F0,-6,+!,.-(2,-!18.1!5*01)7,-0!.-,!6,-A!+200.120>2,+!=218!)3,!)>!18,!D-)+*510!=,!0,((B!
M(18)*48!3*7,-25!-.12340!5.3!8,(D!2+,312>A!which!D-)+*51!18.1!20G!18,A!+)3/1!1,((!*0!why!
5*01)7,-0!+)3/1!(2I,!18,!D-)+*51B!M(18)*48!=,!5)*(+!027D(A!-,.+!18-)*48!.((!18,!5)77,310!
.00)52.1,+!=218!18.1!D-)+*51!1)!(,.-3!1820!23>)-7.12)3G!18.1!.DD-).58!+),03/1!05.(,B!P,J1G!A)*!
=2((!*0,!@26,/0!1,J1!D-)5,00234!0*DD)-1!1)!.3.(AU,!18,!5)77,310B!
C,' C8,!>)(()=234!S*,-A!3)-7.(2U,0!.((!5)77,310!)3!18.1!D-)+*51!1)!()=,-5.0,G!F-,.I0!
18,7!231)!23+262+*.(!=)-+0!*0234!18,!SENTENCES!>*3512)3G!.3+!D.00,0!18)0,!1)!18,!
NGRAMS!>*3512)3!1)!>23+!18,!>26,!7)01!5)77)3!F24-.70!91=)E=)-+!5)7F23.12)30<B![*3!
18,!S*,-A!23!@26,L!
SELECT EXPLODE(NGRAMS(SENTENCES(LOWER(message)), 2, 5))
AS bigrams
FROM ratings
WHERE prod_id = 1274673;
-,' ;)01!)>!18,0,!=)-+0!.-,!1))!5)77)3!1)!D-)62+,!7*58!2302481G!18)*48!18,!=)-+!
V,JD,3026,W!+),0!01.3+!)*1!23!18,!(201B!;)+2>A!18,!D-,62)*0!S*,-A!1)!>23+!18,!>26,!7)01!
5)77)3!trigrams!918-,,E=)-+!5)7F23.12)30<G!.3+!18,3!-*3!18.1!S*,-A!23!@26,B!
D,' M7)34!18,!D.11,-30!A)*!0,,!23!18,!-,0*(1!20!18,!D8-.0,!V1,3!127,0!7)-,BW!C820!72481!F,!
-,(.1,+!1)!18,!5)7D(.2310!18.1!18,!D-)+*51!20!1))!,JD,3026,B!P)=!18.1!A)*/6,!2+,312>2,+!.!
0D,52>25!D8-.0,G!())I!.1!.!>,=!5)77,310!18.1!5)31.23!21!FA!-*33234!1820!S*,-AL!
SELECT message
FROM ratings
WHERE prod_id = 1274673
AND message LIKE '%ten times more%'
LIMIT 3;
N)*!08)*(+!0,,!18-,,!5)77,310!18.1!0.AG!VT8A!+),0!18,!-,+!)3,!5)01!1,3!127,0!7)-,!
18.3!18,!)18,-0gW!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
54
N,' T,!5.3!23>,-!18.1!5*01)7,-0!.-,!5)7D(.23234!.F)*1!18,!D-25,!)>!1820!21,7G!F*1!18,!
5)77,31!.()3,!+),03/1!D-)62+,!,3)*48!+,1.2(B!Y3,!)>!18,!=)-+0!9V-,+W<!23!18.1!
5)77,31!=.0!.(0)!>)*3+!23!18,!(201!)>!1-24-.70!>-)7!18,!,.-(2,-!S*,-AB!!T-21,!.3+!
,J,5*1,!.!S*,-A!18.1!=2((!>23+!.((!+2012351!5)77,310!5)31.23234!18,!=)-+!V-,+W!18.1!.-,!
.00)52.1,+!=218!D-)+*51!Q?!$"c%]c&B!
>,' C8,!D-,62)*0!01,D!08)*(+!8.6,!+20D(.A,+!1=)!5)77,310L!
VT8.1!20!0)!0D,52.(!.F)*1!-,+gW!
VT8A!+),0!18,!-,+!)3,!5)01!1,3!127,0!7)-,!18.3!18,!)18,-0gW!
C8,!0,5)3+!5)77,31!27D(2,0!18.1!1820!D-)+*51!20!)6,-D-25,+!-,(.126,!1)!0272(.-!D-)+*510B!
T-21,!.3+!-*3!.!S*,-A!18.1!=2((!+20D(.A!18,!-,5)-+!>)-!D-)+*51!Q?!$"c%]c&!23!18,!
products!1.F(,B!
V,' N)*-!S*,-A!08)*(+!8.6,!08)=3!18.1!18,!D-)+*51!=.0!.!V$]ih!OHh!Z(.08!?-26,!9[,+<W!
>-)7!18,!VY-2)3W!F-.3+B!P,J1G!-*3!1820!S*,-A!1)!2+,312>A!0272(.-!D-)+*510L!
SELECT *
FROM products
WHERE name LIKE '%16 GB USB Flash Drive%'
AND brand='Orion';
C8,!S*,-A!-,0*(10!08)=!18.1!=,!8.6,!18-,,!.(7)01!2+,3125.(!D-)+*510G!F*1!18,!D-)+*51!
=218!18,!3,4.126,!-,62,=0!918,!-,+!)3,<!5)010!.F)*1!1,3!127,0!.0!7*58!.0!18,!)18,-0G!`*01!
.0!0)7,!)>!18,!5)77,310!0.2+B!!
h.0,+!)3!18,!5)01!.3+!D-25,!5)(*730G!21!.DD,.-0!18.1!+)234!1,J1!D-)5,00234!)3!18,!
D-)+*51!-.12340!8.0!8,(D,+!*0!*35)6,-!.!D-25234!,--)-B!
This is the end of the Exercise
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
55
Hands-On Exercise: Data Transformation
with Hive!
8#'*;5+'"3"$45+"'F)P'J5&&'4$"%*"'%#/'<)<P&%*"'%'*%Z&"'J5*;'&)9'/%*%'A$)I'7P%&4)$"^+'
:"Z'+"$Q"$,'RP"$5"+')#'*;%*'/%*%'J5&&'$"Q"%&'*;%*'I%#F'4P+*)I"$+'%Z%#/)#'*;"5$'
+;)<<5#9'4%$*+'Z"A)$"'4)I<&"*5#9'*;"'4;"4\)P*'<$)4"++,'[)P'J5&&'4$"%*"'+"Q"$%&'
%//5*5)#%&'*%Z&"+S'P+5#9'/%*%'A$)I'%'TRANSFORM'+4$5<*'%#/'%'+P<<&5"/'?7MS'J;54;'F)P'
J5&&'P+"'&%*"$'*)'%#%&FG"';)J'7P%&4)$"'4)P&/'*P$#'*;5+'<$)Z&"I'5#*)'%#')<<)$*P#5*F,''
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!D-,62)*0!)3,0B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!.3A!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
Step #1: Create and Populate the Web Logs Table
CAD25.(!()4!>2(,!>)-7.10!.-,!3)1!+,(2721,+G!0)!A)*!=2((!3,,+!1)!*0,!18,![,4,JH,-?,!.3+!
0D,52>A!.!D.11,-3!@26,!5.3!*0,!1)!D.-0,!(23,0!231)!23+262+*.(!>2,(+0!A)*!5.3!18,3!S*,-AB!
C,' '8.34,!1)!18,!+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd $ADIR/exercises/transform
-,' KJ.723,!18,!create_web_logs.hql!05-2D1!1)!4,1!.3!2+,.!)>!8)=!21!*0,0!.!
[,4,JH,-?,!1)!D.-0,!(23,0!23!18,!()4!>2(,!9.3!,J.7D(,!()4!(23,!20!08)=3!23!18,!5)77,31!.1!
18,!1)D!)>!18,!>2(,<B!T8,3!A)*!8.6,!,J.723,+!18,!05-2D1G!-*3!21!1)!5-,.1,!18,!1.F(,!23!@26,L!
$ hive -f create_web_logs.hql
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
56
D,' e)D*(.1,!18,!1.F(,!FA!.++234!18,!()4!>2(,!1)!18,!1.F(,/0!+2-,51)-A!23!@?ZHL!
$ hadoop fs -put $ADIR/data/access.log /dualcore/web_logs
N,' H1.-1!18,!@26,!08,((!23!.3)18,-!1,-723.(!=23+)=!
>,' :,-2>A!18.1!18,!+.1.!20!().+,+!5)--,51(A!FA!-*33234!1820!S*,-A!1)!08)=!18,!1)D!18-,,!
21,70!*0,-0!0,.-58,+!>)-!)3!)*-!T,F!021,L!
SELECT term, COUNT(term) AS num FROM
(SELECT LOWER(REGEXP_EXTRACT(request,
'/search\\?phrase=(\\S+)', 1)) AS term
FROM web_logs
WHERE request REGEXP '/search\\?phrase=') terms
GROUP BY term
ORDER BY num DESC
LIMIT 3;
N)*!08)*(+!0,,!18.1!21!-,1*-30!1.F(,1!9&#&<G!-.7!9$d&<!.3+!=2>2!9$%_<B!
()*"L!C8,!REGEXP!)D,-.1)-G!=8258!20!.6.2(.F(,!23!0)7,!HaX!+2.(,510G!20!0272(.-!1)!LIKEG!
F*1!*0,0!-,4*(.-!,JD-,002)30!>)-!7)-,!D)=,->*(!D.11,-3!7.158234B!C8,!REGEXP!
)D,-.1)-!20!0A3)3A7)*0!=218!18,!RLIKE!)D,-.1)-B!
!
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
57
Step #2: Analyze Customer Checkouts
N)*/6,!`*01!S*,-2,+!18,!()40!1)!0,,!=8.1!*0,-0!0,.-58!>)-!)3!?*.(5)-,/0!T,F!021,G!F*1!3)=!
A)*/((!-*3!0)7,!S*,-2,0!1)!(,.-3!=8,18,-!18,A!F*AB!M0!)3!7.3A!T,F!021,0G!5*01)7,-0!.++!
D-)+*510!1)!18,2-!08)DD234!5.-10!.3+!18,3!>)(()=!.!V58,5I)*1W!D-)5,00!1)!5)7D(,1,!18,2-!
D*-58.0,B!H235,!,.58!D.-1!)>!1820!>)*-E01,D!D-)5,00!5.3!F,!2+,312>2,+!FA!210!O[X!23!18,!()40G!
=,!5.3!*0,!.!-,4*(.-!,JD-,002)3!1)!,.02(A!2+,312>A!18,7L!
!
Step&
Request&URL&
Description&
$!
/cart/checkout/step1-viewcart
:2,=!(201!)>!21,70!.++,+!1)!5.-1!
"!
/cart/checkout/step2-shippingcost
P)12>A!5*01)7,-!)>!082DD234!5)01!
&!
/cart/checkout/step3-payment
i.18,-!D.A7,31!23>)-7.12)3!
%!
/cart/checkout/step4-receipt
H8)=!-,5,2D1!>)-!5)7D(,1,+!)-+,-!
C,' [*3!18,!>)(()=234!S*,-A!23!@26,!1)!08)=!18,!3*7F,-!)>!-,S*,010!>)-!,.58!01,D!)>!18,!
58,5I)*1!D-)5,00L!
SELECT COUNT(*), request
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY request;
C8,!-,0*(10!)>!1820!S*,-A!8248(2481!.!7.`)-!D-)F(,7B!MF)*1!)3,!)*1!)>!,6,-A!18-,,!
5*01)7,-0!.F.3+)3!18,2-!5.-1!.>1,-!18,!0,5)3+!01,DB!C820!72481!7,.3!72((2)30!)>!+)((.-0!
23!()01!-,6,3*,G!0)!(,1/0!0,,!2>!=,!5.3!+,1,-723,!18,!5.*0,B!
-,' C8,!()4!>2(,/0!cookie!>2,(+!01)-,0!.!6.(*,!18.1!*32S*,(A!2+,312>2,0!,.58!*0,-!0,002)3B!
H235,!3)1!.((!0,002)30!236)(6,!58,5I)*10!.1!.((G!5-,.1,!.!3,=!1.F(,!5)31.23234!18,!0,002)3!
Q?!.3+!3*7F,-!)>!58,5I)*1!01,D0!5)7D(,1,+!>)-!`*01!18)0,!0,002)30!18.1!+)L!
CREATE TABLE checkout_sessions AS
SELECT cookie, ip_address, COUNT(request) AS steps_completed
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY cookie, ip_address;
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
58
D,' [*3!1820!S*,-A!1)!08)=!18,!3*7F,-!)>!D,)D(,!=8)!.F.3+)3,+!18,2-!5.-1!.>1,-!,.58!01,DL!
SELECT steps_completed, COUNT(cookie) AS num
FROM checkout_sessions
GROUP BY steps_completed;
N)*!08)*(+!0,,!18.1!7)01!5*01)7,-0!=8)!.F.3+)3,+!18,2-!)-+,-!+2+!0)!.>1,-!18,!0,5)3+!
01,DG!=8258!20!=8,3!18,A!>2-01!(,.-3!8)=!7*58!21!=2((!5)01!1)!082D!18,2-!)-+,-B!!!
Step #3: Use TRANSFORM for IP Geolocation
h.0,+!)3!=8.1!A)*p6,!`*01!0,,3G!21!0,,70!(2I,(A!18.1!5*01)7,-0!.F.3+)3!18,2-!5.-10!+*,!1)!
8248!082DD234!5)010B!C8,!082DD234!5)01!20!F.0,+!)3!18,!5*01)7,-p0!()5.12)3!.3+!18,!=,2481!)>!
18,!21,70!18,Ap6,!)-+,-,+B!M(18)*48!1820!23>)-7.12)3!203p1!23!18,!+.1.F.0,!90235,!18,!)-+,-!
=.03p1!5)7D(,1,+<G!=,!5.3!4.18,-!,3)*48!+.1.!>-)7!18,!()40!1)!,0127.1,!18,7B!!
T,!+)3p1!8.6,!18,!5*01)7,-p0!.++-,00G!F*1!=,!5.3!*0,!.!D-)5,00!I3)=3!.0!rQe!4,)()5.12)3r!
1)!7.D!18,!5)7D*1,-p0!Qe!.++-,00!23!18,!()4!>2(,!1)!.3!.DD-)J27.1,!D8A025.(!()5.12)3B!H235,!
1820!203p1!.!F*2(1E23!5.D.F2(21A!)>!@26,G!A)*p((!*0,!.!D-)62+,+!eA18)3!05-2D1!1)!TRANSFORM!18,!
ip_address!>2,(+!>-)7!18,!checkout_sessions!1.F(,!1)!.!qQe!5)+,G!.0!D.-1!)>!@26,aX!
01.1,7,31!18.1!5-,.1,0!.!3,=!1.F(,!5.((,+!cart_zipcodesB!
Regarding TRANSFORM and UDF Examples in this Exercise
During this exercise, you will use a Python script for IP geolocation and a UDF to
calculate shipping costs. Both are implemented merely as a simulation compatible with
the fictitious data we use in class and intended to work even when Internet access is
unavailable. The focus of these exercises is on how to use external scripts and UDFs,
rather than how the code for the examples works internally.
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
59
C,' KJ.723,!18,!create_cart_zipcodes.hql!05-2D1!.3+!)F0,-6,!18,!>)(()=234L!
.B Q1!5-,.1,0!.!3,=!1.F(,!5.((,+!cart_zipcodes!F.0,+!)3!0,(,51!01.1,7,31B!
FB C8.1!0,(,51!01.1,7,31!1-.30>)-70!18,!ip_addressG!cookieG!.3+!
steps_completed!>2,(+0!>-)7!18,!checkout_sessions!1.F(,!*0234!.!
eA18)3!05-2D1B!
5B C8,!3,=!1.F(,!5)31.230!18,!qQe!5)+,!2301,.+!)>!.3!Qe!.++-,00G!D(*0!18,!)18,-!
1=)!>2,(+0!>-)7!18,!)-2423.(!1.F(,B!
-,' KJ.723,!18,!ipgeolocator.py!05-2D1!.3+!)F0,-6,!18,!>)(()=234L!
.B [,5)-+0!.-,!-,.+!>-)7!@26,!)3!01.3+.-+!23D*1B!
FB C8,!05-2D1!0D(210!18,7!231)!23+262+*.(!>2,(+0!*0234!.!1.F!+,(2721,-B!
5B C8,!ip_addr!>2,(+!20!5)36,-1,+!1)!zipcodeG!F*1!18,!cookie!.3+!
steps_completed!>2,(+0!.-,!D.00,+!18-)*48!*37)+2>2,+B!
+B C8,!18-,,!>2,(+0!23!,.58!)*1D*1!-,5)-+!.-,!+,(2721,+!=218!1.F0!.-,!D-231,+!1)!
01.3+.-+!)*1D*1B!
D,' [*3!18,!05-2D1!1)!5-,.1,!18,!cart_zipcodes!1.F(,L!
$ hive -f create_cart_zipcodes.hql
Step #4: Extract List of Products Added to Each Cart
M0!+,05-2F,+!,.-(2,-G!,0127.1234!18,!082DD234!5)01!.(0)!-,S*2-,0!.!(201!)>!21,70!23!18,!
5*01)7,-/0!5.-1B!N)*!5.3!2+,312>A!D-)+*510!.++,+!1)!18,!5.-1!0235,!18,!-,S*,01!O[X!())I0!(2I,!
1820!9)3(A!18,!D-)+*51!Q?!58.34,0!>-)7!)3,!-,5)-+!1)!18,!3,J1<L!
!
!!!!!!!!!/cart/additem?productid=1234567!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
60
C,' T-21,!.!@26,aX!01.1,7,31!1)!5-,.1,!.!1.F(,!5.((,+!cart_items!=218!1=)!>2,(+0L!
cookie!.3+!prod_id!F.0,+!)3!+.1.!0,(,51,+!18,!web_logs!1.F(,B!k,,D!18,!>)(()=234!
23!723+!=8,3!=-21234!A)*-!01.1,7,31L!
.B C8,!prod_id!>2,(+!08)*(+!5)31.23!)3(A!18,!0,6,3E+2421!D-)+*51!Q?!98231L!*0,!
18,!REGEXP_EXTRACT!>*3512)3<!
FB M++!.!WHERE!5(.*0,!=218!REGEXP!*0234!18,!0.7,!-,4*(.-!,JD-,002)3!.0!.F)6,!
0)!18.1!A)*!)3(A!235(*+,!-,5)-+0!=8,-,!5*01)7,-0!.-,!.++234!21,70!1)!18,!5.-1B!
5B Q>!A)*!3,,+!.!8231!)3!8)=!1)!=-21,!18,!01.1,7,31G!())I!.1!18,!
sample_solution/create_cart_items.hql!>2(,B!
-,' KJ,5*1,!18,!@26,aX!01.1,7,31!>-)7!A)*!`*01!=-)1,B!
D,' :,-2>A!18,!5)31,310!)>!18,!3,=!1.F(,!FA!-*33234!1820!S*,-AL!
SELECT COUNT(DISTINCT cookie) FROM cart_items WHERE
prod_id=1273905;
Q>!1820!+),03/1!-,1*-3!%cG!18,3!5)7D.-,!A)*-!01.1,7,31!1)!18,!
sample_solution/create_cart_items.hql!>2(,G!7.I,!18,!3,5,00.-A!
5)--,512)30G!.3+!18,3!-,E-*3!A)*-!01.1,7,31!9.>1,-!+-)DD234!18,!cart_items!1.F(,<B!
Step #5: Create Tables to Join Web Logs with Product Data
N)*!3)=!8.6,!1.F(,0!-,D-,0,31234!18,!qQe!5)+,0!.3+!D-)+*510!.00)52.1,+!=218!58,5I)*1!
0,002)30G!F*1!A)*p((!3,,+!1)!`)23!18,0,!=218!18,!D-)+*510!1.F(,!1)!4,1!18,!=,2481!)>!18,0,!
21,70!F,>)-,!A)*!5.3!,0127.1,!082DD234!5)010B!Q3!)-+,-!1)!+)!0)7,!7)-,!.3.(A020!(.1,-G!=,/((!
.(0)!235(*+,!1)1.(!0,((234!D-25,!.3+!1)1.(!=8)(,0.(,!5)01!23!.++212)3!1)!18,!1)1.(!082DD234!
=,2481!>)-!.((!21,70!23!18,!5.-1B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
61
C,' [*3!18,!>)(()=234!@26,aX!1)!5-,.1,!.!1.F(,!5.((,+!cart_orders!=218!18,!23>)-7.12)3L!
CREATE TABLE cart_orders AS
SELECT z.cookie, steps_completed, zipcode,
SUM(shipping_wt) as total_weight,
SUM(price) AS total_price,
SUM(cost) AS total_cost
FROM cart_zipcodes z
JOIN cart_items i
ON (z.cookie = i.cookie)
JOIN products p
ON (i.prod_id = p.prod_id)
GROUP BY z.cookie, zipcode, steps_completed;
Step #6: Create a Table Using a UDF to Estimate Shipping
Cost
T,!>23.((A!8.6,!.((!18,!23>)-7.12)3!=,!3,,+!1)!,0127.1,!18,!082DD234!5)01!>)-!,.58!
.F.3+)3,+!)-+,-B!!Y3,!)>!18,!+,6,()D,-0!)3!)*-!1,.7!8.0!.(-,.+A!=-211,3G!5)7D2(,+G!.3+!
D.5I.4,+!.!@26,!O?Z!18.1!=2((!5.(5*(.1,!18,!082DD234!5)01!426,3!.!qQe!5)+,!.3+!18,!1)1.(!
=,2481!)>!.((!21,70!23!18,!)-+,-B!!
C,' h,>)-,!A)*!5.3!*0,!.!O?ZG!A)*!7*01!.++!21!1)!@26,/0!5(.00D.18B![*3!18,!>)(()=234!
5)77.3+!23!@26,!1)!+)!18.1L!
ADD JAR geolocation_udf.jar;
-,' P,J1G!A)*!7*01!-,4201,-!18,!>*3512)3!=218!@26,!.3+!D-)62+,!18,!3.7,!)>!18,!O?Z!5(.00!.0!
=,((!.0!18,!.(2.0!A)*!=.31!1)!*0,!>)-!18,!>*3512)3B![*3!18,!@26,!5)77.3+!F,()=!1)!
.00)52.1,!)*-!O?Z!=218!18,!.(2.0!CALC_SHIPPING_COSTL!
CREATE TEMPORARY FUNCTION CALC_SHIPPING_COST AS
'com.cloudera.hive.udf.UDFCalcShippingCost';
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
62
D,' P)=!5-,.1,!.!3,=!1.F(,!5.((,+!cart_shipping!18.1!=2((!5)31.23!18,!0,002)3!Q?G!
3*7F,-!)>!01,D0!5)7D(,1,+G!1)1.(!-,1.2(!D-25,G!1)1.(!=8)(,0.(,!5)01G!.3+!18,!,0127.1,+!
082DD234!5)01!>)-!,.58!)-+,-!F.0,+!)3!+.1.!>-)7!18,!cart_orders!1.F(,L!
CREATE TABLE cart_shipping AS
SELECT cookie, steps_completed, total_price, total_cost,
CALC_SHIPPING_COST(zipcode, total_weight) AS shipping_cost
FROM cart_orders;
N,' Z23.((AG!6,-2>A!A)*-!1.F(,!FA!-*33234!18,!>)(()=234!S*,-A!1)!58,5I!.!-,5)-+L!
SELECT * FROM cart_shipping WHERE cookie='100002920697';
C820!08)*(+!08)=!18.1!0,002)3!.0!8.6234!1=)!5)7D(,1,+!01,D0G!.!1)1.(!-,1.2(!D-25,!)>!
b"]&BccG!.!1)1.(!=8)(,0.(,!5)01!)>!b"&]B^_G!.3+!.!082DD234!5)01!)>!b^B#^B!
()*"L!C8,!total_priceG!total_costG!.3+!shipping_cost!5)(*730!23!18,!
cart_shipping!1.F(,!5)31.23!18,!3*7F,-!)>!5,310!.0!231,4,-0B!h,!0*-,!1)!+262+,!
-,0*(10!5)31.23234!7)3,1.-A!.7)*310!FA!$##!1)!4,1!+)((.-0!.3+!5,310B
This is the end of the Exercise
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
63
Hands-On Exercise: Interactive Analysis
with Impala!
8#'*;5+'"3"$45+"'F)P'J5&&'"3%I5#"'%Z%#/)#"/'4%$*'/%*%'P+5#9'*;"'*%Z&"+'4$"%*"/'5#'*;"'
<$"Q5)P+'"3"$45+",'[)P'J5&&'P+"'8I<%&%'*)'aP54\&F'/"*"$I5#"';)J'IP4;'&)+*'$"Q"#P"'
*;"+"'%Z%#/)#"/'4%$*+'$"<$"+"#*'%#/'P+"'+"Q"$%&'dJ;%*'5Ae'+4"#%$5)+'*)'/"*"$I5#"'
J;"*;"$'J"'+;)P&/')AA"$'A$""'+;5<<5#9'*)'"#4)P$%9"'4P+*)I"$+'*)'4)I<&"*"'*;"5$'
<P$4;%+"+,'
8U@1O=E(=L!C820!,J,-520,!F*2(+0!)3!D-,62)*0!)3,0B!Q>!A)*!=,-,!*3.F(,!1)!5)7D(,1,!.3A!
D-,62)*0!,J,-520,!)-!1823I!A)*!7.A!8.6,!7.+,!.!7201.I,G!-*3!18,!>)(()=234!5)77.3+!1)!
D-,D.-,!>)-!1820!,J,-520,!F,>)-,!5)3123*234L!
$ ~/scripts/analyst/catchup.sh
Step #1: Start the Impala Shell and Refresh the Cache
C,' !'8.34,!1)!18,!+2-,51)-A!>)-!1820!8.3+0E)3!,J,-520,L!
$ cd $ADIR/exercises/interactive
-,' Z2-01G!01.-1!18,!Q7D.(.!08,((L!
$ impala-shell
D,' H235,!A)*!5-,.1,+!1.F(,0!.3+!7)+2>2,+!+.1.!23!@26,G!Q7D.(./0!5.58,!)>!18,!7,1.01)-,!20!
)*1+.1,+B!N)*!7*01!-,>-,08!21!F,>)-,!5)3123*234!FA!,31,-234!18,!>)(()=234!5)77.3+!23!
18,!Q7D.(.!08,((L!
REFRESH;
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
64
Step #2: Calculate Lost Revenue
!
C,' Z2-01G!A)*/((!5.(5*(.1,!8)=!7*58!-,6,3*,!18,!.F.3+)3,+!5.-10!-,D-,0,31B![,7,7F,-G!
18,-,!.-,!>)*-!01,D0!23!18,!58,5I)*1!D-)5,00G!0)!)3(A!-,5)-+0!23!18,!cart_shipping!
1.F(,!=218!.!steps_completed!6.(*,!)>!>)*-!-,D-,0,31!.!5)7D(,1,+!D*-58.0,L!
SELECT SUM(total_price) AS lost_revenue
FROM cart_shipping
WHERE steps_completed < 4;
!
N)*!08)*(+!0,,!18.1!.F.3+)3,+!5.-10!7,.3!18.1!?*.(5)-,!20!D)1,312.((A!()0234!)*1!)3!
7)-,!18.3!b"!72((2)3!23!-,6,3*,t!'(,.-(A!21/0!=)-18!18,!,>>)-1!1)!+)!>*-18,-!.3.(A020B!
()*"6!C8,!total_priceG!total_costG!.3+!shipping_cost!5)(*730!23!18,!
cart_shipping!1.F(,!5)31.23!18,!3*7F,-!)>!5,310!.0!231,4,-0B!h,!0*-,!1)!+262+,!
-,0*(10!5)31.23234!7)3,1.-A!.7)*310!FA!$##!1)!4,1!+)((.-0!.3+!5,310B!
cart_shipping
cookie steps_completed total_price shipping_costtotal_cost
100054318085 46899 4256292
100060397203 419218 55217520
100062224714 27609 5567155
100064732105 253137 83950685
100107017704 144928 72044200
Lost Revenue From Abandoned Shipping Carts
Sum of total_price where steps_completed < 4
... ... ... ......
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
65
-,' C8,!3*7F,-!-,1*-3,+!FA!18,!D-,62)*0!S*,-A!20!-,6,3*,G!F*1!=8.1!5)*310!20!D-)>21B!T,!
5.(5*(.1,!4-)00!D-)>21!FA!0*F1-.51234!18,!5)01!>-)7!18,!D-25,B!T-21,!.3+!,J,5*1,!.!S*,-A!
0272(.-!1)!18,!)3,!.F)6,G!F*1!=8258!-,D)-10!18,!1)1.(!()01!D-)>21!>-)7!.F.3+)3,+!5.-10B!Q>!
A)*!3,,+!.!8231!)3!8)=!1)!=-21,!1820!S*,-AG!A)*!5.3!58,5I!18,!sample_solution/!
abandoned_checkout_profit.sql!>2(,B!
M>1,-!-*33234!A)*-!S*,-AG!A)*!08)*(+!0,,!18.1!=,!.-,!D)1,312.((A!()0234!
b$$$G#d_B^#!23!D-)>21!+*,!1)!5*01)7,-0!3)1!5)7D(,1234!18,!58,5I)*1!D-)5,00B!
D,' @)=!+),0!1820!5)7D.-,!1)!18,!.7)*31!)>!D-)>21!=,!-,5,26,!>-)7!5*01)7,-0!=8)!+)!
5)7D(,1,!18,!58,5I)*1!D-)5,00g!;)+2>A!A)*-!D-,62)*0!S*,-A!1)!5)302+,-!)3(A!18)0,!
-,5)-+0!=8,-,!steps_completed = 4G!.3+!18,3!,J,5*1,!21!23!18,!Q7D.(.!08,((B!'8,5I!
sample_solution/completed_checkout_profit.sql!2>!A)*!3,,+!.!8231B!
C8,!-,0*(1!08)*(+!08)=!18.1!=,!,.-3!.!1)1.(!)>!b$ccG^&"B^&!)3!5)7D(,1,+!)-+,-0G!
0)!.F.3+)3,+!5.-10!-,D-,0,31!.!0*F01.312.(!D-)D)-12)3!)>!.++212)3.(!D-)>210B!
N,' C8,!D-,62)*0!1=)!S*,-2,0!1)(+!*0!18,!total!D-)>21!>)-!.F.3+)3,+!.3+!5)7D(,1,+!)-+,-0G!
F*1!18,0,!.-,3/1!+2-,51(A!5)7D.-.F(,!F,5.*0,!18,-,!=,-,!+2>>,-,31!3*7F,-0!)>!,.58B!Q1!
72481!F,!18,!5.0,!18.1!)3,!20!7*58!7)-,!D-)>21.F(,!18.3!18,!)18,-!)3!.!D,-E)-+,-!F.020B!
T-21,!.3+!,J,5*1,!.!S*,-A!18.1!=2((!5.(5*(.1,!18,!average!D-)>21!F.0,+!)3!18,!3*7F,-!)>!
01,D0!5)7D(,1,+!+*-234!18,!58,5I)*1!D-)5,00B!Q>!A)*!3,,+!8,(D!=-21234!1820!S*,-AG!58,5I!
18,!sample_solution/checkout_profit_by_step.sql!>2(,B!
N)*!08)*(+!)F0,-6,!18.1!5.-10!.F.3+)3,+!.>1,-!01,D!1=)!-,D-,0,31!.3!,6,3!
8248,-!.6,-.4,!D-)>21!D,-!)-+,-!18.3!5)7D(,1,+!)-+,-0B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
66
Step #3: Calculate Cost/Profit for a Free Shipping Offer
N)*!8.6,!)F0,-6,+!18.1!7)01!5.-10!j!.3+!18,!7)01!profitable!5.-10!j!.-,!.F.3+)3,+!.1!18,!
D)231!=8,-,!=,!+20D(.A!18,!082DD234!5)01!1)!18,!5*01)7,-B!N)*!=2((!3)=!-*3!0)7,!S*,-2,0!1)!
+,1,-723,!=8,18,-!)>>,-234!>-,,!082DD234G!)3!.1!(,.01!0)7,!)-+,-0G!=)*(+!.51*.((A!F-234!23!
7)-,!-,6,3*,!.00*7234!1820!)>>,-!D-)7D1,+!7)-,!5*01)7,-0!1)!>23208!18,!58,5I)*1!D-)5,00B!
C,' [*3!18,!>)(()=234!S*,-A!1)!5)7D.-,!18,!.6,-.4,!082DD234!5)01!>)-!)-+,-0!.F.3+)3,+!
.>1,-!18,!0,5)3+!01,D!6,-0*0!5)7D(,1,+!)-+,-0L!
SELECT steps_completed, AVG(shipping_cost) AS ship_cost
FROM cart_shipping
WHERE steps_completed = 2 OR steps_completed = 4
GROUP BY steps_completed;
N)*!=2((!0,,!18.1!18,!082DD234!5)01!)>!.F.3+)3,+!)-+,-0!=.0!.(7)01!$#u!8248,-!
18.3!>)-!5)7D(,1,+!D*-58.0,0B!Y>>,-234!>-,,!082DD234G!.1!(,.01!>)-!0)7,!)-+,-0G!
72481!.51*.((A!F-234!23!7)-,!7)3,A!18.3!D.00234!)3!18,!5)01!.3+!-20I234!
.F.3+)3,+!)-+,-0B!
cart_shipping
cookie steps_completed total_price shipping_costtotal_cost
100054318085 46899 4256292
100060397203 419218 55217520
100062224714 27609 5567155
100064732105 253137 83950685
100107017704 144928 72044200
Average Shipping Cost for Carts Abandoned After Steps 2 and 4
Average of shipping_cost where steps_completed = 2 or 4
... ... ... ......
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
67
-,' [*3!18,!>)(()=234!S*,-A!1)!+,1,-723,!18,!.6,-.4,!D-)>21!D,-!)-+,-!)6,-!18,!,312-,!
7)318!>)-!18,!+.1.!A)*!.-,!.3.(AU234!23!18,!()4!>2(,B!C820!=2((!8,(D!A)*!1)!+,1,-723,!
=8,18,-!=,!5)*(+!.F0)-F!18,!5)01!)>!)>>,-234!>-,,!082DD234L!
SELECT AVG(price - cost) AS profit
FROM products p
JOIN order_details d
ON (d.prod_id = p.prod_id)
JOIN orders o
ON (d.order_id = o.order_id)
WHERE YEAR(order_date) = 2013
AND MONTH(order_date) = 05;
N)*!08)*(+!0,,!18.1!18,!.6,-.4,!D-)>21!>)-!.((!)-+,-0!+*-234!;.A!=.0!bcB_#B!M3!
,.-(2,-!S*,-A!A)*!-.3!08)=,+!18.1!18,!.6,-.4,!082DD234!5)01!=.0!b_B_&!>)-!
5)7D(,1,+!)-+,-0!.3+!b^B]]!>)-!.F.3+)3,+!)-+,-0G!0)!5(,.-(A!=,!=)*(+!()0,!
7)3,A!FA!)>>,-234!>-,,!082DD234!)3!.((!)-+,-0B!@)=,6,-G!21!72481!012((!F,!
=)-18=82(,!1)!)>>,-!>-,,!082DD234!)3!)-+,-0!)6,-!.!5,-1.23!.7)*31B!
Average Profit per Order, May 2013
Average the profit...
products
prod_id price cost
1273641 1839 1275
1273642 1949 721
1273643 2149 845
1273644 2029 763
1273645 1909 1234
... ... ...
orders
order_id order_date
6547914 2013-05-01 00:02:08
2013-05-01 00:02:55
2013-05-01 00:06:15
6547917 2013-06-12 00:10:41
6547918 2013-06-12 00:11:30
... ...
order_details
order_id product_id
6547914 1273641
6547914 1273644
6547914 1273645
6547915 1273645
6547916 1273641
... ...
... on orders made in
May, 2013
6547915
6547916
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
68
D,' [*3!18,!>)(()=234!S*,-AG!=8258!20!.!0(2481(A!-,620,+!6,-02)3!)>!18,!D-,62)*0!)3,G!1)!
+,1,-723,!=8,18,-!)>>,-234!>-,,!082DD234!)3(A!)3!)-+,-0!)>!b$#!)-!7)-,!=)*(+!F,!.!
4))+!2+,.L!
SELECT AVG(price - cost) AS profit
FROM products p
JOIN order_details d
ON (d.prod_id = p.prod_id)
JOIN orders o
ON (d.order_id = o.order_id)
WHERE YEAR(order_date) = 2013
AND MONTH(order_date) = 05
AND PRICE >= 1000;
N)*!08)*(+!0,,!18.1!)*-!.6,-.4,!D-)>21!)3!)-+,-0!)>!b$#!)-!7)-,!=.0!b^B#^G!0)!
.F0)-F234!18,!5)01!)>!082DD234!=)*(+!(,.6,!6,-A!(211(,!D-)>21B!!
N,' [,D,.1!18,!D-,62)*0!S*,-AG!7)+2>A234!21!0(2481(A!,.58!127,!1)!>23+!18,!.6,-.4,!D-)>21!)3!
)-+,-0!)>!.1!(,.01!bd#G!b$##G!.3+!bd##B!
N)*!08)*(+!0,,!18.1!18,-,!20!.!8*4,!0D2I,!23!18,!.7)*31!)>!D-)>21!>)-!)-+,-0!)>!
bd##!)-!7)-,!9=,!7.I,!b$$$B#d!)3!.6,-.4,!>)-!18,0,!)-+,-0<B!!
>,' @)=!7*58!+),0!082DD234!5)01!)3!.6,-.4,!>)-!)-+,-0!1)1.(234!bd##!)-!7)-,g!T-21,!.3+!
-*3!.!S*,-A!1)!>23+!)*1!9sample_solution/avg_shipping_cost_50000.sql!
5)31.230!18,!0)(*12)3G!23!5.0,!A)*!3,,+!.!8231<B!
N)*!08)*(+!0,,!18.1!18,!.6,-.4,!082DD234!5)01!20!b$"B"_G!=8258!8.DD,30!1)!F,!
.F)*1!$$u!)>!18,!D-)>21!=,!F-234!23!)3!18)0,!)-+,-0B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
69
V,' H235,!=,!=)3/1!I3)=!23!.+6.35,!=8)!=2((!.F.3+)3!18,2-!5.-1G!=,!=)*(+!8.6,!1)!.F0)-F!
18,!b$"B"_!.6,-.4,!5)01!)3!all!)-+,-0!)>!.1!(,.01!bd##B!T)*(+!18,!,J1-.!7)3,A!=,!72481!
F-234!23!>-)7!.F.3+)3,+!5.-10!)>>0,1!18,!.++,+!5)01!)>!>-,,!082DD234!>)-!5*01)7,-0!=8)!
=)*(+!8.6,!5)7D(,1,+!18,2-!D*-58.0,0!.3A=.Ag![*3!18,!>)(()=234!S*,-A!1)!0,,!18,!1)1.(!
D-)>21!)3!5)7D(,1,+!D*-58.0,0L!
SELECT SUM(total_price - total_cost) AS total_profit
FROM cart_shipping
WHERE total_price >= 50000
AND steps_completed = 4;
M>1,-!-*33234!1820!S*,-AG!A)*!08)*(+!0,,!18.1!18,!1)1.(!D-)>21!>)-!5)7D(,1,+!
)-+,-0!20!b$#cGd_"B^cB!
W,' P)=G!-*3!18,!>)(()=234!S*,-A!1)!>23+!18,!D)1,312.(!D-)>21G!.>1,-!0*F1-.51234!082DD234!
5)010G!2>!.((!5*01)7,-0!5)7D(,1,+!18,!58,5I)*1!D-)5,00L!
SELECT gross_profit - total_shipping_cost AS potential_profit
FROM (SELECT
SUM(total_price - total_cost) AS gross_profit,
SUM(shipping_cost) AS total_shipping_cost
FROM cart_shipping
WHERE total_price >= 50000) large_orders;
H235,!18,!-,0*(1!)>!b$"#G&ddB"]!20!4-,.1,-!18.3!18,!5*--,31!D-)>21!)>!b$#cGd_"B^c!=,!
5*--,31(A!,.-3!>-)7!5)7D(,1,+!)-+,-0G!21!.DD,.-0!18.1!=,!5)*(+!,.-3!3,.-(A!b$&G###!
7)-,!FA!)>>,-234!>-,,!082DD234!>)-!.((!)-+,-0!)>!.1!(,.01!bd##B!
')34-.1*(.12)30t!N)*-!8.-+!=)-I!.3.(AU234!.!6.-2,1A!)>!+.1.!=218!@.+))D/0!1))(0!8.0!
8,(D,+!7.I,!?*.(5)-,!7)-,!D-)>21.F(,!18.3!,6,-B
This is the end of the Exercise
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
70
Data Model Reference
Tables Imported from MySQL
C8,!>)(()=234!+,D2510!18,!01-*51*-,!)>!18,!;AHaX!1.F(,0!27D)-1,+!231)!@?ZH!*0234!HS))DB!
C8,!D-27.-A!I,A!5)(*73!>-)7!18,!+.1.F.0,G!2>!.3AG!20!+,3)1,+!FA!F)(+!1,J1L
!
4P+*)I"$+L!"#$G&cd!-,5)-+0!927D)-1,+!1)!/dualcore/customers<!
!
Index&
Field&
Description&
Example&
#!
4P+*f5/'
'*01)7,-!Q?!
1846532
$!
>3.7,!
Z2-01!3.7,!
Sam
"!
(3.7,!
X.01!3.7,!
Jones
&!
.++-,00!
M++-,00!)>!-,02+,35,!
456 Clue Road
%!
521A!
'21A!!
Silicon Sands
d!
01.1,!
H1.1,!
CA
]!
U2D5)+,!
e)01.(!5)+,!
94306
'
"I<&)F""+L!]$Gc$"!-,5)-+0!927D)-1,+!1)!/dualcore/employees!.3+!(.1,-!*0,+!
.0!.3!,J1,-3.(!1.F(,!23!@26,<!
!
Index&
Field&
Description&
Example&
#!
"I<f5/'
K7D()A,,!Q?!
BR5331404
$!
>3.7,!
Z2-01!3.7,!
Betty
"!
(3.7,!
X.01!3.7,!
Richardson
&!
.++-,00!
M++-,00!)>!-,02+,35,!
123 Shady Lane
%!
521A!
'21A!!
Anytown
d!
01.1,!
H1.1,!
CA
]!
U2D5)+,!
e)01.(!')+,!
90210
c!
`)Fl121(,!
K7D()A,,/0!`)F!121(,!
Vice President
_!
,7.2(!
,E7.2(!.++-,00!
br5331404@example.com
^!
.5126,!
Q0!.5126,(A!,7D()A,+g!
Y
$#!
0.(.-A!
M33*.(!D.A!923!+)((.-0<!
136900
!
!
)$/"$+L!$G]]"G^d$!-,5)-+0!927D)-1,+!1)!/dualcore/orders<!
!
Index&
Field&
Description&
Example&
#!
)$/"$f5/'
Y-+,-!Q?!
3213254
$!
5*01l2+!
'*01)7,-!Q?!
1846532
"!
)-+,-l+.1,!
?.1,o127,!)>!)-+,-!
2013-05-31 16:59:34
'
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
71
)$/"$f/"*%5&+L!&G&&&G"%%!-,5)-+0!927D)-1,+!1)!/dualcore/order_details<!
!
Index&
Field&
Description&
Example&
#!
)-+,-l2+!
Y-+,-!Q?!
3213254
$!
D-)+l2+!
e-)+*51!Q?!
1754836
'
<$)/P4*+L!$G$$%!-,5)-+0!927D)-1,+!1)!/dualcore/products<!
!
Index&
Field&
Description&
Example&
#!
<$)/f5/'
e-)+*51!Q?!
1273641
$!
F-.3+!
h-.3+!3.7,!
Foocorp
"!
3.7,!
P.7,!)>!D-)+*51!
4-port USB Hub
&!
D-25,!
[,1.2(!0.(,0!D-25,G!23!5,310!
1999
%!
5)01!
T8)(,0.(,!5)01G!23!5,310!
1463
d!
082DD234l=1!
H82DD234!=,2481!923!D)*3+0<!
1
'
+P<<&5"$+L!]]!-,5)-+0!927D)-1,+!1)!/dualcore/suppliers<!
!
Index&
Field&
Description&
Example&
#!
+P<<f5/'
H*DD(2,-!Q?!
1000
$!
>3.7,!
Z2-01!3.7,!
ACME Inc.
"!
(3.7,!
X.01!3.7,!
Sally Jones
&!
.++-,00!
M++-,00!)>!)>>25,!
123 Oak Street
%!
521A!
'21A!!
New Athens
d!
01.1,!
H1.1,!
IL
]!
U2D5)+,!
e)01.(!5)+,!
62264
c!
D8)3,!
Y>>25,!D8)3,!3*7F,-!
(618) 555-5914
'
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
72
Hive Tables
C8,!>)(()=234!20!.!-,5)-+!5)*31!>)-!@26,!1.F(,0!18.1!.-,!5-,.1,+!)-!S*,-2,+!+*-234!18,!8.3+0E
)3!,J,-520,0B!O0,!18,!DESCRIBE tablename!!5)77.3+!23!@26,!1)!0,,!18,!1.F(,!01-*51*-,B!
!
Table&Name&
Record&Count&
cart_items
&&G_$"!'
cart_orders
$"G^dd!!
cart_shipping
$"G^dd!!
cart_zipcodes
$"G^dd!!
checkout_sessions
$"G^dd!!
customers
"#$G&cd!!
employees
]$Gc$"!!
order_details
&G&&&G"%%!!
orders
$G]]"G^d$!!
products
$G$$%!!
ratings
"$G^^c!!
web_logs
%$"G_]#!!
'
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
73
Other Data Added to HDFS
C8,!>)(()=234!+,05-2F,0!18,!01-*51*-,!)>!)18,-!27D)-1.31!+.1.!0,10!.++,+!1)!@?ZHB!
!
H)IZ5#"/'E/'H%I<%59#'7%*%L!9c__G^d"!-,5)-+0!1)1.(<G!01)-,+!23!1=)!+2-,51)-2,0L!
/dualcore/ad_data1!9%&_G&_^!-,5)-+0<!!
/dualcore/ad_data2!9&d#Gd]&!-,5)-+0<B!
!
Index&
Field&
Description&
Example&
#!
5.7D.243l2+!
O32S*,(A!2+,312>2,0!)*-!.+!
A3
$!
+.1,!
?.1,!)>!.+!+20D(.A!
05/23/2013
"!
127,!
C27,!)>!.+!+20D(.A!
15:39:26
&!
I,A=)-+!
k,A=)-+!18.1!1-244,-,+!.+!
tablet
%!
+20D(.Al021,!
?)7.23!=8,-,!.+!08)=3!
news.example.com
d!
D(.5,7,31!
X)5.12)3!)>!.+!)3!T,F!D.4,!
INLINE
]!
=.0l5(25I,+!
T8,18,-!.+!=.0!5(25I,+!
1
c!
5D5!
')01!!D,-!5(25IG!23!5,310!
106
'
'
%44"++,&)9L!%$"G_]#!-,5)-+0!9*D().+,+!1)!/dualcore/access.log<!
C820!>2(,!20!*0,+!1)!D)D*(.1,!18,!web_logs!1.F(,!23!@26,B!P)1,!18.1!18,![Z'!^&$!.3+!
O0,-3.7,!>2,(+0!.-,!0,(+)7!D)D*(.1,+!23!()4!>2(,0!>)-!7)+,-3!D*F(25!T,F!021,0!.3+!
.-,!243)-,+!23!)*-![,4,JH,-?,B!
!
!
Index&
Field&/&Description&
Example&
#!
Qe!.++-,00!
192.168.1.15!
$!
[Z'!^&$!9Q+,31<!
-!
"!
O0,-3.7,!
-!
&!
?.1,oC27,!
[22/May/2013:15:01:46 -0800]!
%!
[,S*,01!
"GET /foo?bar=1 HTTP/1.1"!
d!
H1.1*0!5)+,!
200!
]!
hA1,0!1-.30>,--,+!
762!
c!
[,>,-,-!
"http://dualcore.com/"!
_!
O0,-!.4,31!9F-)=0,-<!
"Mozilla/4.0 [en] (WinNT; I)"!
^!
'))I2,!90,002)3!Q?<!
"SESSION=8763723145"!
!
!
! !
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
74
Regular Expression Reference
C8,!>)(()=234!20!.!F-2,>!1*1)-2.(!231,3+,+!>)-!18,!5)36,32,35,!)>!01*+,310!=8)!+)3/1!8.6,!
,JD,-2,35,!*0234!-,4*(.-!,JD-,002)30!)-!7.A!3,,+!.!-,>-,08,-B!M!7)-,!5)7D(,1,!-,>,-,35,!
5.3!F,!>)*3+!23!18,!+)5*7,31.12)3!>)-!f.6./0!e.11,-3!5(.00L!!!
!!!!!!http://tiny.cloudera.com/dae9a
Introduction to Regular Expressions
[,4*(.-!,JD-,002)30!.-,!*0,+!>)-!D.11,-3!7.158234B!C8,-,!.-,!1=)!I23+0!)>!D.11,-30!23!
-,4*(.-!,JD-,002)30L!(21,-.(0!.3+!7,1.58.-.51,-0B!!X21,-.(!6.(*,0!.-,!*0,+!1)!7.158!D-,520,!
D.11,-30!=82(,!7,1.58.-.51,-0!8.6,!0D,52.(!7,.3234m!>)-!,J.7D(,G!.!+)1!=2((!7.158!.3A!
0234(,!58.-.51,-B!@,-,p0!18,!5)7D(,1,!(201!)>!7,1.58.-.51,-0G!>)(()=,+!FA!,JD(.3.12)30!)>!
18)0,!18.1!.-,!5)77)3(A!*0,+L!
< ( [ { \ ^ - = $ ! | ] } ) ? * + . >
X21,-.(!58.-.51,-0!.-,!.3A!58.-.51,-0!3)1!(201,+!.0!.!7,1.58.-.51,-B!C8,Ap-,!7.158,+!,J.51(AG!
F*1!2>!A)*!=.31!1)!7.158!.!7,1.58.-.51,-G!A)*!7*01!,05.D,!21!=218!.!F.5I0(.08B!H235,!.!
F.5I0(.08!20!210,(>!.!7,1.58.-.51,-G!21!7*01!.(0)!F,!,05.D,+!=218!.!F.5I0(.08B!Z)-!,J.7D(,G!
A)*!=)*(+!*0,!18,!D.11,-3!\\.!1)!7.158!.!(21,-.(!+)1B!
[,4*(.-!,JD-,002)30!0*DD)-1!D.11,-30!7*58!7)-,!>(,J2F(,!18.3!027D(A!*0234!.!+)1!1)!7.158!
.3A!58.-.51,-B!C8,!>)(()=234!,JD(.230!8)=!1)!*0,!character4classes!1)!-,01-251!=8258!
58.-.51,-0!.-,!7.158,+B!
Character Classes
[057]!;.158,0!.3A!0234(,!+2421!18.1!20!,218,-!#G!dG!)-!c!
[0-9]! ;.158,0!.3A!0234(,!+2421!F,1=,,3!#!.3+!^!
[3-6] ;.158,0!.3A!0234(,!+2421!F,1=,,3!&!.3+!]!
[a-z]! ;.158,0!.3A!0234(,!()=,-5.0,!(,11,-!
[C-F]! ;.158,0!.3A!0234(,!*DD,-5.0,!(,11,-!F,1=,,3!'!.3+!Z!
Z)-!,J.7D(,G!18,!D.11,-3![C-F][3-6]!=)*(+!7.158!18,!01-234!D3!)-!F5!F*1!=)*(+!>.2(!1)!
7.158!G3!)-!C7B!
C8,-,!.-,!.(0)!0)7,!F*2(1E23!58.-.51,-!5(.00,0!18.1!.-,!08)-15*10!>)-!5)77)3!0,10!)>!
58.-.51,-0B!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
75
Predefined Character Classes
\\d! ;.158,0!.3A!0234(,!+2421!
\\w! ;.158,0!.3A!=)-+!58.-.51,-!9(,11,-0!)>!.3A!5.0,G!D(*0!+24210!)-!*3+,-05)-,<!
\\s! ;.158,0!.3A!=821,0D.5,!58.-.51,-!90D.5,G!1.FG!3,=(23,G!,15B<!
Z)-!,J.7D(,G!18,!D.11,-3!\\d\\d\\d\\w!=)*(+!7.158!18,!01-234!314d!)-!934X!F*1!=)*(+!
>.2(!1)!7.158!93X!)-!Z871B!
H)7,127,0!21p0!,.02,-!1)!58))0,!=8.1!A)*!+)3p1!=.31!1)!7.158!2301,.+!)>!=8.1!A)*!+)!=.31!
1)!7.158B!C8,0,!18-,,!5.3!F,!3,4.1,+!FA!*0234!.3!*DD,-5.0,!(,11,-!2301,.+B!
Negated Predefined Character Classes
\\D! ;.158,0!.3A!0234(,!3)3E+2421!58.-.51,-!
\\W!! ;.158,0!.3A!3)3E=)-+!58.-.51,-!!
\\S!! ;.158,0!.3A!3)3E=821,0D.5,!58.-.51,-!
Z)-!,J.7D(,G!18,!D.11,-3!\\D\\D\\W!=)*(+!7.158!18,!01-234!ZX#!)-!@ P!F*1!=)*(+!>.2(!1)!
7.158!93X!)-!36_B!
C8,!7,1.58.-.51,-0!08)=3!.F)6,!7.158!,.58!,J.51(A!)3,!58.-.51,-B!N)*!5.3!0D,52>A!18,7!
7*(12D(,!127,0!1)!7.158!7)-,!18.3!)3,!58.-.51,-G!F*1!-,4*(.-!,JD-,002)30!0*DD)-1!18,!*0,!)>!
S*.312>2,-0!1)!,(2723.1,!1820!-,D,1212)3B!!
Matching Quantifiers
{5} e-,5,+234!58.-.51,-!7.A!)55*-!,J.51(A!>26,!127,0!
{0,6}! e-,5,+234!58.-.51,-!7.A!)55*-!F,1=,,3!U,-)!.3+!02J!127,0!
?! e-,5,+234!58.-.51,-!20!)D12)3.(!97.A!)55*-!U,-)!)-!)3,!127,0<!
+! e-,5,+234!58.-.51,-!7.A!)55*-!)3,!)-!7)-,!127,0!
* e-,5,+234!58.-.51,-!7.A!)55*-!U,-)!)-!7)-,!127,0!
hA!+,>.*(1G!S*.312>2,-0!1-A!1)!7.158!.0!7.3A!58.-.51,-0!.0!D)002F(,B!Q>!A)*!*0,+!18,!D.11,-3!
ore.+a!)3!18,!01-234!Dualcore has a store in FloridaG!A)*!72481!F,!0*-D-20,+!
1)!(,.-3!18.1!21!7.158,0!ore has a store in Florida!-.18,-!18.3!ore ha!)-!ore
in Florida!.0!A)*!72481!8.6,!,JD,51,+B!C820!20!F,5.*0,!7.158,0!.!r4-,,+Ar!FA!+,>.*(1B!
M++234!.!S*,012)3!7.-I!7.I,0!18,!S*.312>2,-!7.158!.0!>,=!58.-.51,-0!.0!D)002F(,!2301,.+G!0)!
18,!D.11,-3!ore.+?a!)3!1820!01-234!=)*(+!7.158!ore haB!
!
Copyright © 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
76
Z23.((AG!18,-,!.-,!1=)!0D,52.(!7,1.58.-.51,-0!18.1!7.158!U,-)!58.-.51,-0B!C8,A!.-,!*0,+!1)!
,30*-,!18.1!.!01-234!7.158,0!.!D.11,-3!)3(A!=8,3!21!)55*-0!.1!18,!F,4233234!)-!,3+!)>!.!01-234B!
Boundary Matching Metacharacters
^ ;.158,0!)3(A!.1!18,!F,4233234!)>!.!01-234
$ ;.158,0!)3(A!.1!18,!,3+234!)>!.!01-234!
PYCKL!T8,3!*0,+!2302+,!0S*.-,!F-.5I,10!9=8258!+,3)1,!.!58.-.51,-!5(.00<G!18,!^!58.-.51,-!
20!231,-,D-,1,+!+2>>,-,31(AB!Q3!18.1!5)31,J1G!21!3,4.1,0!18,!7.158B!C8,-,>)-,G!0D,52>A234!18,!
D.11,-3![^0-9]!20!,S*26.(,31!1)!*0234!18,!D-,+,>23,+!58.-.51,-!5(.00!\\d!+,05-2F,+!,.-(2,-B!

Navigation menu