Data Streams, Big Data and Analytics (3 days) [Q4, 2018]

An introduction to Big Data technologies (15.11.2018)

  • Introduction [slide]
    • Data-driven Decision Making for Data-driven Organizations
    • What are Big Data, Data Science and AI?
    • How do Big Data, Data Science and AI support Data-driven Decision Making?
    • Who is using Big Data, Data Science and AI?
  • Netflix and Big Data, a 20-year success story [slide]
  • Scaling Storage Horizontally with HDFS [whiteboard]
  • Scaling Processing Horizontally with Hadoop [whiteboard,slide]
  • Vertical vs. Horizontal Scalability [slide]
  • Enabling data analytics [slide]
    • hiding the MapReduce details using Pig and Hive on Tez
    • choosing optimized file formats (ORC and Parquet)
  • Benchmarking speedups [slides]
  • Logical Architecture of a Big Data System [slides]
  • Horizontally Scalable Business Intelligence at Macy’s [slide]
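
The MapReduce details that Pig and Hive hide can be illustrated with a minimal word-count sketch. This is plain single-process Python, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit one (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the list of values for one key into a single count.
    return key, sum(values)


def word_count(lines):
    # Run the three phases sequentially over an in-memory list of lines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

Because each map call sees one line and each reduce call sees one key, the work can be split across machines without changing the result, which is the idea behind scaling processing horizontally.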

Spark: a unified engine to tame volume and velocity (26.11.2018)

  • Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide]
  • Overview of Spark internals
    • Cluster Architecture
    • How Spark schedules and executes jobs and tasks
    • The Catalyst query optimizer
  • In-depth presentation of Spark SQL and DataFrames:
    • reading in DataFrames from CSV with and without schema inference
    • writing DataFrames as Parquet and Tables
    • Spark SQL and the role of caching
    • Data Aggregation, Column Operations, date/time functions
    • Use of the Spark UI to analyze behavior and performance
  • Spark Structured Streaming
    • Sources and sinks
    • Structured Streaming APIs
    • Windowing & Aggregation
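
The idea behind windowed aggregation can be sketched without a cluster. The example below is plain Python, not the Structured Streaming API; tumbling_window_counts and its (timestamp, key) event format are assumptions made for illustration. It assigns each event to a fixed-size, non-overlapping (tumbling) window and counts events per window and key.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    # events: iterable of (timestamp_in_seconds, key) tuples (hypothetical format).
    counts = Counter()
    for timestamp, key in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(1, "a"), (4, "a"), (11, "b"), (12, "a")]
    for (start, key), n in sorted(tumbling_window_counts(sample, 10).items()):
        print(f"window [{start}, {start + 10}) key={key} count={n}")
```

A streaming engine performs the same grouping incrementally as events arrive and additionally has to deal with late data, which this sketch ignores.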

Kafka: a distributed streaming platform (4.12.2018 am)

  • Introduction to Kafka [slide]
    • messaging: topics (regular and compacted), partitions and replicas
    • stream processing: streams, tables and KSQL
  • Kafka Demo [slides][link]
  • An experience in using Kafka at small scale [slides by Matteo Ferroni]

Analytics with Spark ML (4.12.2018 am)

  • understanding sentiment analytics on IMDB data as a classification problem [whiteboard]
    • How to build machine learning pipelines for supervised learning
    • Use transformers to perform pre-processing on a dataset prior to training
    • Train analytical models with Spark ML’s DataFrame-based estimators including Logistic Regression and Decision Trees
    • Evaluate a classification model's performance using a confusion matrix
  • understanding predictive analysis of bike rental counts as a regression problem using Decision Trees [whiteboard]
    • hands-on building a pipeline
    • Evaluate a regression model's performance using error metrics such as RMSE
    • Model complexity, underfit and overfit
  • Learning to tune hyperparameters via cross-validation and grid search applying Random Forests to bike rental counts [whiteboard]
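
The confusion-matrix evaluation listed for the sentiment classification task can be sketched in a few lines. This is plain Python rather than Spark ML; confusion_matrix and accuracy are hypothetical helper names, and labels are assumed to be binary with 1 as the positive class.

```python
def confusion_matrix(actual, predicted):
    # Count the four outcomes of a binary classifier (1 = positive, 0 = negative).
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    # Fraction of predictions that match the actual label.
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


if __name__ == "__main__":
    cm = confusion_matrix([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
    print(cm, accuracy(cm))
```

Precision and recall can be derived from the same four counts, which is why the matrix is a common starting point for evaluating classifiers.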