Note: this tutorial was tested with Hortonworks Data Platform (HDP®) 2.6.4 and 2.6.5 on the Hortonworks Sandbox, both on a Mac using Docker and on a c5.2xlarge instance (8 cores, 16 GB RAM) in AWS using Docker.
Get Hortonworks Data Platform on Hortonworks Sandbox
- get the VM
- follow the installation guide
- on a Mac
- make sure you give the VM enough RAM (at least 8GB)
- if you have 8 cores on your machine, consider giving 4 cores to the VM (you will be able to see parallelization in action)
- on AWS
- as explained on Hortonworks Community Connection, correct lines 28 and 30 of the script docker-deploy-hdp265.sh, putting = instead of == (see the sketch after this list)
- on a Mac
- set up the environment
- in particular, understand which IP your HDP is on and map it to your desired hostname in the hosts file (see the sketch after this list)
- reset the Ambari admin password from the Web shell using the ambari-admin-password-reset script
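- optional: the setup steps above can also be done from a terminal; a minimal sketch, assuming the AWS script fix described above and a placeholder sandbox IP (check the IP Docker actually assigned before copying the /etc/hosts line):
# on AWS: replace == with = on lines 28 and 30 of the deploy script
sed -i '28s/==/=/' docker-deploy-hdp265.sh
sed -i '30s/==/=/' docker-deploy-hdp265.sh
# on a Mac: map the sandbox IP (placeholder value!) to the hostname used in this tutorial
echo '192.168.99.100  sandbox-hdp.hortonworks.com' | sudo tee -a /etc/hosts
# in the Web shell: reset the Ambari admin password
ambari-admin-password-reset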
Hands-on HDFS (ingestion)
- open Ambari
- go to HDFS view
- create data folder in /user/maria_dev/
- change permission of data folder
- upload geolocation.csv included in geolocations.zip (a command-line alternative is sketched after this list)
- for more information
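- optional: the same ingestion can be done from the Web shell with the HDFS CLI instead of the Files view; a minimal sketch, assuming geolocation.csv was already extracted from geolocations.zip locally and that you run it as a user allowed to write under /user/maria_dev (the 777 mode mirrors the permission change above):
hdfs dfs -mkdir /user/maria_dev/data
hdfs dfs -chmod 777 /user/maria_dev/data
hdfs dfs -put geolocation.csv /user/maria_dev/data/
hdfs dfs -ls /user/maria_dev/data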
Hands-on Hive (ingestion & wrangling)
- go to the Hive 2.0 view
- create new table with the wizard
- referring to the HDFS file
/user/maria_dev/data/geolocation.csv
- have a look at the jobs! This is a full traditional ingestion + wrangling procedure!
- learn some details (also runnable via Beeline, see the sketch after this section)
- describe geolocation
- show create table geolocation
- describe formatted geolocation
- for more information
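- optional: the three inspection commands can also be run from the Web shell through Beeline; a minimal sketch, assuming the ZooKeeper JDBC URL used later in the Oozie section and the maria_dev user (add -p and the password if prompted):
URL='jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2'
beeline -u "$URL" -n maria_dev -e 'DESCRIBE geolocation;'
beeline -u "$URL" -n maria_dev -e 'SHOW CREATE TABLE geolocation;'
beeline -u "$URL" -n maria_dev -e 'DESCRIBE FORMATTED geolocation;'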
Hands-on Zeppelin (exploration – part I)
- open Zeppelin UI
- create a new notebook using the jdbc interpreter
- explore the content of geolocation table
%jdbc(hive)
SELECT city, event, count(event) as cnt FROM geolocation WHERE event != "normal" GROUP BY city, event
- configure the bar chart as follows
- keys: city
- groups: event
- values: cnt
Hands-on Tez (admin role)
- open the Tez view
- have a look at the visual explanation of the plan
- see the difference with the underlying Tez plan (EXPLAIN prints it too, see the sketch below)
- show the timeline in query details
- open the DAG and show the Graph View and the Vertex Swimlane
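- optional: to compare the Tez view with the textual plan, you can ask Hive to print it directly; a minimal sketch, assuming the same JDBC URL and user as in the Beeline sketch above, running EXPLAIN on the exploration query from the Zeppelin section (quotes adapted for the shell):
URL='jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2'
beeline -u "$URL" -n maria_dev -e "EXPLAIN SELECT city, event, count(event) as cnt FROM geolocation WHERE event != 'normal' GROUP BY city, event;"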
Hands-on Sqoop (ingestion & wrangling)
- configure Sqoop
- go to the Web shell
- log in as root
- get the jdbc driver for postgresql
curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar
- copy the driver to the Sqoop libs
cp postgresql-9.2-1002.jdbc4.jar /usr/hdp/current/sqoop-client/lib/
- test Sqoop
- check the connection
sqoop list-tables --connect jdbc:postgresql://95.110.236.35:5432/sqooptest --username sqoop --password t3stsq00p
- ingest the table in Hive
- run the import (it creates the trucks table in Hive)
sqoop import --connect jdbc:postgresql://95.110.236.35:5432/sqooptest --username sqoop --password t3stsq00p --table trucks --hive-import --create-hive-table -m 1 --direct
- NOTE: the command prints a permission error, but the import completes successfully (see the verification sketch below)
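- optional: to verify that the import really worked despite the permission error, query the new table from the Web shell; a minimal sketch, assuming the same JDBC URL and user as in the earlier Beeline sketches:
URL='jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2'
beeline -u "$URL" -n maria_dev -e 'SHOW TABLES;'
beeline -u "$URL" -n maria_dev -e 'SELECT COUNT(*) FROM trucks;'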
Hands-on Zeppelin (exploration – part II)
- go back to Zeppelin for some more exploration
%jdbc(hive)
SELECT model, sum(jun13_miles)/sum(jun12_miles) as milesIndex, sum(jun13_gas)/sum(jun12_gas) as gasIndex FROM trucks GROUP BY model
- configure the bar chart as follows
- keys: model
- values: milesIndex, gasIndex
Hands-on Oozie and Ambari Workflow Manager (data view)
- go to Ambari Workflow Manager
- build the Oozie workflow
- add a fork
- on a branch add two Hive2 actions in a row
- note: the jdbc connection URL is
jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
- TruckMiliage:
CREATE TABLE TruckMiliage STORED AS ORC AS
SELECT truckid, driverid, rdate, miles, gas, miles / gas mpg FROM trucks
LATERAL VIEW stack(54,
'jun13',jun13_miles,jun13_gas,'may13',may13_miles,may13_gas,'apr13',apr13_miles,apr13_gas,'mar13',mar13_miles,mar13_gas,'feb13',feb13_miles,feb13_gas,'jan13',jan13_miles,jan13_gas,
'dec12',dec12_miles,dec12_gas,'nov12',nov12_miles,nov12_gas,'oct12',oct12_miles,oct12_gas,'sep12',sep12_miles,sep12_gas,'aug12',aug12_miles,aug12_gas,'jul12',jul12_miles,jul12_gas,'jun12',jun12_miles,jun12_gas,'may12',may12_miles,may12_gas,'apr12',apr12_miles,apr12_gas,'mar12',mar12_miles,mar12_gas,'feb12',feb12_miles,feb12_gas,'jan12',jan12_miles,jan12_gas,
'dec11',dec11_miles,dec11_gas,'nov11',nov11_miles,nov11_gas,'oct11',oct11_miles,oct11_gas,'sep11',sep11_miles,sep11_gas,'aug11',aug11_miles,aug11_gas,'jul11',jul11_miles,jul11_gas,'jun11',jun11_miles,jun11_gas,'may11',may11_miles,may11_gas,'apr11',apr11_miles,apr11_gas,'mar11',mar11_miles,mar11_gas,'feb11',feb11_miles,feb11_gas,'jan11',jan11_miles,jan11_gas,
'dec10',dec10_miles,dec10_gas,'nov10',nov10_miles,nov10_gas,'oct10',oct10_miles,oct10_gas,'sep10',sep10_miles,sep10_gas,'aug10',aug10_miles,aug10_gas,'jul10',jul10_miles,jul10_gas,'jun10',jun10_miles,jun10_gas,'may10',may10_miles,may10_gas,'apr10',apr10_miles,apr10_gas,'mar10',mar10_miles,mar10_gas,'feb10',feb10_miles,feb10_gas,'jan10',jan10_miles,jan10_gas,
'dec09',dec09_miles,dec09_gas,'nov09',nov09_miles,nov09_gas,'oct09',oct09_miles,oct09_gas,'sep09',sep09_miles,sep09_gas,'aug09',aug09_miles,aug09_gas,'jul09',jul09_miles,jul09_gas,'jun09',jun09_miles,jun09_gas,'may09',may09_miles,may09_gas,'apr09',apr09_miles,apr09_gas,'mar09',mar09_miles,mar09_gas,'feb09',feb09_miles,feb09_gas,'jan09',jan09_miles,jan09_gas
) dummyalias AS rdate, miles, gas;
- DriverMileage:
CREATE TABLE DriverMileage STORED AS ORC AS SELECT driverid, sum(miles) totmiles FROM TruckMiliage GROUP BY driverid;
- on another branch add another Hive2 action
- GeolocationAgg:
CREATE TABLE GeolocationAgg STORED AS ORC AS SELECT driverid, count(driverid) occurance FROM geolocation WHERE event != 'normal' GROUP BY driverid
- after the join action add two more Hive2 actions in a row
- joined:
CREATE TABLE joined STORED AS ORC AS SELECT a.driverid, a.occurance, b.totmiles FROM GeolocationAgg a, DriverMileage b WHERE a.driverid=b.driverid
- riskfactor:
CREATE TABLE riskfactor STORED AS ORC AS SELECT driverid, occurance, totmiles, totmiles/occurance riskfactor FROM joined
- validate the workflow and save it in
/user/maria_dev/data/wf/riskanalysis.xml
- submit and run it on submission (choose overwrite); a CLI alternative is sketched at the end of this section
- follow the progress on
- Solution
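- optional: the saved workflow can also be submitted from the Web shell with the Oozie CLI instead of the Workflow Manager; a minimal sketch, where all host names, ports, and property values are assumptions based on sandbox defaults and may need adapting:
# hypothetical job.properties pointing at the workflow definition saved above
cat > job.properties <<'EOF'
nameNode=hdfs://sandbox-hdp.hortonworks.com:8020
jobTracker=sandbox-hdp.hortonworks.com:8050
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/maria_dev/data/wf/riskanalysis.xml
EOF
# submit and start the workflow, then poll its status with the returned job id
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -config job.properties -run
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -info <job-id>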
Hands-on Zeppelin (visualize the results of the analysis)
- go back to Zeppelin
%jdbc(hive)
SELECT a.driverid, a.riskfactor, b.city
FROM riskfactor a, geolocation b
WHERE a.driverid=b.driverid AND b.state = "California"
ORDER BY a.riskfactor DESC
- configure the scatter chart as follows
- xAxis: city
- yAxis: driverid
- size: riskfactor
- Solution