Hands on Hortonworks Data Platform

Note: this tutorial was tested for Hortonworks Data Platform (HDP®) 2.6.4 and 2.6.5 on Hortonworks Sandbox on a Mac using Docker and on a c5.2xlarge (8 cores, 16GB) in AWS using Docker.
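
If you take the Docker route, deployment boils down to pulling the sandbox image and running the deploy script from the installation guide; a sketch, assuming the HDP 2.6.5 image tag and script name:

  docker pull hortonworks/sandbox-hdp:2.6.5   # the sandbox image
  sh docker-deploy-hdp265.sh                  # deploy script from the installation guide
  docker ps                                   # Ambari on 8080, Web Shell on 4200, SSH on 2222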

Get Hortonworks Data Platform on Hortonworks Sandbox

  • get the VM
  • follow the installation guide
    • on a Mac
      • make sure you give the VM enough RAM (at least 8GB)
      • if you have 8 cores on your machine, consider giving 4 cores to the VM (you will be able to see parallelization in action)
    • on AWS
  • set up the environment
    • in particular, find out which IP your HDP instance is on and map it to your desired hostname in the hosts file
    • reset the admin password using the Web Shell
    • reset the Ambari admin password using the ambari-admin-password-reset script (see the sketch below)
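  • a minimal sketch of those last steps, assuming the default sandbox hostname sandbox-hdp.hortonworks.com and the Docker sandbox ports (4200 for the Web Shell, 2222 for SSH):
    # map the sandbox IP to a hostname (127.0.0.1 when Docker runs locally)
    echo '127.0.0.1 sandbox-hdp.hortonworks.com' | sudo tee -a /etc/hosts
    # log in to the sandbox (the Web Shell on port 4200 works as well)
    ssh -p 2222 root@sandbox-hdp.hortonworks.com
    # inside the sandbox, reset the Ambari admin password
    ambari-admin-password-reset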

Hands-on HDFS (ingestion)
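
  • the tutorial data (geolocation.csv, used by the Hive section below) has to land in HDFS first; a minimal sketch from the Web Shell, assuming the CSV is on the sandbox's local disk (the Ambari Files View achieves the same through the UI):
    hdfs dfs -mkdir -p /user/maria_dev/data
    hdfs dfs -put geolocation.csv /user/maria_dev/data/
    hdfs dfs -ls /user/maria_dev/data   # verify the upload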

Hands-on Hive (ingestion & wrangling)

  • go to the Hive View 2.0
  • create new table with the wizard
    • referring to the HDFS file /user/maria_dev/data/geolocation.csv
    • have a look at the jobs! This is a full traditional ingestion + wrangling procedure!
  • learn some details
    • describe geolocation
    • show create table geolocation
    • describe formatted geolocation
  • for more information
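  • for reference, the wizard boils down to a CREATE TABLE plus a data load; a rough hand-written equivalent (the column list here is assumed from the queries used later, the real file has more fields):
    CREATE TABLE geolocation (
      truckid STRING, driverid STRING, event STRING,
      latitude DOUBLE, longitude DOUBLE, city STRING, state STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES ('skip.header.line.count'='1');
    LOAD DATA INPATH '/user/maria_dev/data/geolocation.csv' INTO TABLE geolocation;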

Hands-on Zeppelin (exploration – part I)

  • open the Zeppelin UI
  • create a new JDBC notebook
  • explore the content of the geolocation table:
%jdbc(hive)
SELECT city, event, count(event) as cnt FROM geolocation WHERE event != "normal" GROUP BY city, event
  • configure the bar chart as follows
    • keys: city
    • groups: event
    • values: cnt

Hands-on Tez (admin role)

  • open the Tez view
  • have a look at the visual explanation of the plan
  • see how it differs from the underlying Tez plan
    • show the timeline in the query details
    • open the DAG and show the Graph View and the Vertex Swimlane
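  • you can also ask Hive directly for the plan it hands to Tez; e.g., for the query from the Zeppelin section above (a side exercise, not one of the original steps):
%jdbc(hive)
EXPLAIN SELECT city, event, count(event) as cnt FROM geolocation WHERE event != "normal" GROUP BY city, event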

Hands-on Sqoop (ingestion & wrangling)

  • configure Sqoop
    • go to the Web Shell
    • login as root
    • get the JDBC driver for PostgreSQL
      curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar
    • copy the driver to the Sqoop libs
      cp postgresql-9.2-1002.jdbc4.jar /usr/hdp/current/sqoop-client/lib/
  • test Sqoop
    • check the connection
      sqoop list-tables --connect jdbc:postgresql://95.110.236.35:5432/sqooptest --username sqoop --password t3stsq00p
  • ingest the table in Hive
      sqoop import --connect jdbc:postgresql://95.110.236.35:5432/sqooptest --username sqoop --password t3stsq00p --table trucks --hive-import --create-hive-table -m 1 --direct
    • NOTE: it reports a permission error, but the import succeeds
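  • a quick sanity check from the Hive view or Zeppelin, assuming the import landed in the default database:
%jdbc(hive)
SELECT COUNT(*) FROM trucks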

Hands-on Zeppelin (exploration – part II)

  • go back to Zeppelin for some more exploration:
%jdbc(hive) 
SELECT model, sum(jun13_miles)/sum(jun12_miles) as milesIndex, sum(jun13_gas)/sum(jun12_gas) as gasIndex  FROM trucks GROUP BY model
  • configure the bar chart as follows
    • keys: model
    • values: milesIndex, gasIndex

Hands-on Oozie and Ambari Workflow Manager (data view)

  • go to Ambari Workflow Manager
  • build the Oozie workflow
    • add a fork
    • on a branch add two Hive2 actions in a row
      • note: the jdbc connection URL is
        jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
      • TruckMiliage (the stack() UDTF unpivots the 54 per-month miles/gas column pairs into one row per month):
        CREATE TABLE TruckMiliage STORED AS ORC AS SELECT truckid, driverid, rdate, miles, gas, miles / gas mpg FROM trucks LATERAL VIEW stack(54, 'jun13',jun13_miles,jun13_gas,'may13',may13_miles,may13_gas,'apr13',apr13_miles,apr13_gas,'mar13',mar13_miles,mar13_gas,'feb13',feb13_miles,feb13_gas,'jan13',jan13_miles,jan13_gas,'dec12',dec12_miles,dec12_gas,'nov12',nov12_miles,nov12_gas,'oct12',oct12_miles,oct12_gas,'sep12',sep12_miles,sep12_gas,'aug12',aug12_miles,aug12_gas,'jul12',jul12_miles,jul12_gas,'jun12',jun12_miles,jun12_gas,'may12',may12_miles,may12_gas,'apr12',apr12_miles,apr12_gas,'mar12',mar12_miles,mar12_gas,'feb12',feb12_miles,feb12_gas,'jan12',jan12_miles,jan12_gas,'dec11',dec11_miles,dec11_gas,'nov11',nov11_miles,nov11_gas,'oct11',oct11_miles,oct11_gas,'sep11',sep11_miles,sep11_gas,'aug11',aug11_miles,aug11_gas,'jul11',jul11_miles,jul11_gas,'jun11',jun11_miles,jun11_gas,'may11',may11_miles,may11_gas,'apr11',apr11_miles,apr11_gas,'mar11',mar11_miles,mar11_gas,'feb11',feb11_miles,feb11_gas,'jan11',jan11_miles,jan11_gas,'dec10',dec10_miles,dec10_gas,'nov10',nov10_miles,nov10_gas,'oct10',oct10_miles,oct10_gas,'sep10',sep10_miles,sep10_gas,'aug10',aug10_miles,aug10_gas,'jul10',jul10_miles,jul10_gas,'jun10',jun10_miles,jun10_gas,'may10',may10_miles,may10_gas,'apr10',apr10_miles,apr10_gas,'mar10',mar10_miles,mar10_gas,'feb10',feb10_miles,feb10_gas,'jan10',jan10_miles,jan10_gas,'dec09',dec09_miles,dec09_gas,'nov09',nov09_miles,nov09_gas,'oct09',oct09_miles,oct09_gas,'sep09',sep09_miles,sep09_gas,'aug09',aug09_miles,aug09_gas,'jul09',jul09_miles,jul09_gas,'jun09',jun09_miles,jun09_gas,'may09',may09_miles,may09_gas,'apr09',apr09_miles,apr09_gas,'mar09',mar09_miles,mar09_gas,'feb09',feb09_miles,feb09_gas,'jan09',jan09_miles,jan09_gas ) dummyalias AS rdate, miles, gas;
      • DriverMileage:
        CREATE TABLE DriverMileage STORED AS ORC AS SELECT driverid, sum(miles) totmiles FROM TruckMiliage GROUP BY driverid;
    • on another branch add another Hive2 action
      • GeolocationAgg:
        CREATE TABLE GeolocationAgg STORED AS ORC AS SELECT driverid, count(driverid) occurance FROM geolocation WHERE event != 'normal' GROUP BY driverid
    • after the join action add two more Hive2 actions in a row
      • joined:
        CREATE TABLE joined STORED AS ORC AS SELECT a.driverid, a.occurance, b.totmiles FROM GeolocationAgg a, DriverMileage b WHERE a.driverid=b.driverid
      • riskfactor:
        CREATE TABLE riskfactor STORED AS ORC AS SELECT driverid, occurance, totmiles, totmiles/occurance riskfactor FROM joined
    • validate the workflow, saving it in /user/maria_dev/data/wf/riskanalysis.xml
    • submit it, selecting run on submission (and overwrite)
    • follow the progress on
  • Solution
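  • once the workflow completes, a quick check that the final table is populated (a sanity query, not part of the original workflow):
%jdbc(hive)
SELECT driverid, riskfactor FROM riskfactor ORDER BY riskfactor DESC LIMIT 10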

Hands-on Zeppelin (visualize the results of the analysis)

  • go back to Zeppelin:
%jdbc(hive)
SELECT a.driverid, a.riskfactor, b.city
FROM riskfactor a, geolocation b 
WHERE  a.driverid=b.driverid AND b.state = "California"
ORDER BY a.riskfactor DESC
  • configure the scatter chart as follows
    • xAxis: city
    • yAxis: driverid
    • size: riskfactor
  • Solution

Where to learn more
