Data Streams, Big Data and Analytics (3 days) [Q4, 2018]

An introduction to Big Data technologies (15.11.2018)

Introduction [slide]
- Data-driven Decision Making for Data-driven Organizations
- What are Big Data, Data Science and AI?
- How do Big Data, Data Science and AI support Data-driven Decision Making?
- Who is using Big Data, Data Science and AI?
Netflix and Big Data, a 20-year success story [slide]
Scaling Storage Horizontally with HDFS [whiteboard]
Scaling Processing Horizontally with Hadoop [whiteboard,slide]
Vertical vs. Horizontal Scalability [slide]
Enabling data analytics [slide]
- hiding the MapReduce details using Pig and Hive on Tez
- choosing optimized file formats (ORC and Parquet)
Benchmarking speedups [slides]
Logical Architecture of a Big Data System [slides]
Horizontally Scalable Business Intelligence at Macy’s [slide]
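
To make concrete what "the MapReduce details" are that Pig and Hive hide, here is a minimal single-process sketch of the map, shuffle and reduce phases for a word count. It is plain Python, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    # Map: emit one (word, 1) pair per word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the values of one key into a single result.
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big ideas", "data streams"])
print(counts)
```

In Hive the same computation would be a single GROUP BY query; the three phases above are what the engine generates on the user's behalf.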

Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide]
Overview of Spark internals
- Cluster Architecture
- How Spark schedules and executes jobs and tasks
- The Catalyst query optimizer
In-depth presentation of Spark SQL and DataFrames:
- reading in DataFrames from CSV with and without schema inference
- writing DataFrames as Parquet and Tables
- Spark SQL and the role of caching
- Data Aggregation, Column Operations, date/time functions
- Use of the Spark UI to analyze behavior and performance
Spark Structured Streaming
- Sources and sinks
- Structured Streaming APIs
- Windowing & Aggregation
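
The windowing and aggregation idea can be sketched without Spark: each event is assigned to a fixed-size (tumbling) window by its event time, and a count is kept per window and key. This is a plain-Python illustration of the concept only, not the Structured Streaming API; the function name tumbling_window_counts and the sample events are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key).

    events: iterable of (timestamp_in_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Integer division maps every timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "a"), (3, "a"), (9, "b"), (10, "a"), (14, "b"), (21, "a")]
for (window_start, key), count in sorted(tumbling_window_counts(events, 10).items()):
    print(window_start, key, count)
```

A streaming engine performs the same grouping incrementally and additionally has to decide when a window is complete, which is where watermarks and late data come in.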

Introduction to Kafka [slide]
- messaging: topics (regular and compacted), partitions and replicas
- stream processing: streams, tables and KSQL
Kafka Demo [slides][link]
An experience in using Kafka at small scale [slides by Matteo Ferroni]
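
The difference between a regular and a compacted topic can be illustrated with a small sketch: compaction keeps only the latest record per key, and a record with a null value (a tombstone) marks the key as deleted. This is a simplified plain-Python model, not Kafka client code; the function name compact is an assumption, and a real broker compacts segments in the background and keeps tombstones for a configurable retention period before dropping them.

```python
def compact(log):
    """Return the compacted view of a log.

    log: list of (key, value) records in offset order;
    a value of None is treated as a tombstone (delete marker).
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        # Later records overwrite earlier ones with the same key.
        latest[key] = (offset, value)
    # Keep survivors in their original offset order and drop tombstones.
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors if value is not None]


log = [("u1", "a"), ("u2", "b"), ("u1", "c"), ("u2", None), ("u3", "d")]
print(compact(log))
```

The compacted log is effectively the table view of the stream, which connects to the stream/table duality used by KSQL.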

Understanding sentiment analysis on IMDB data as a classification problem [whiteboard]
- How to build machine learning pipelines for supervised learning
- Use transformers to perform pre-processing on a dataset prior to training
- Train analytical models with Spark ML’s DataFrame-based estimators including Logistic Regression and Decision Trees
- Evaluate a classification model's performance using a confusion matrix
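
As a small illustration of the evaluation step, the sketch below builds a binary confusion matrix from true and predicted labels and derives accuracy, precision and recall from it. It is plain Python rather than Spark ML; the function names and the sample labels are assumptions made for the example.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    # Count the four outcomes of a binary classifier.
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive and truth == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def summary_metrics(cm):
    total = cm["tp"] + cm["fp"] + cm["tn"] + cm["fn"]
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total if total else 0.0,
        "precision": cm["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": cm["tp"] / actual_pos if actual_pos else 0.0,
    }


y_true = [1, 1, 0, 0, 1, 0]  # hypothetical labels: 1 = positive review
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical model output
cm = confusion_matrix(y_true, y_pred)
print(cm, summary_metrics(cm))
```
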
Understanding predictive analysis of bike rental counts as a regression problem using Decision Trees [whiteboard]
- hands-on building a pipeline
- Evaluate a regression model's performance using error metrics such as RMSE
- Model complexity, underfit and overfit
Learning to tune hyperparameters via cross-validation and grid search applying Random Forests to bike rental counts [whiteboard]
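
The mechanics of cross-validated grid search can be sketched in a few lines. To stay dependency-free, the example swaps in a tiny k-nearest-neighbour regressor for the Random Forest and uses synthetic data instead of the bike rental dataset; all function names and values are assumptions. The single hyperparameter k plays the role that, for example, tree depth or number of trees would play for a forest.

```python
def k_fold_indices(n, folds):
    # Interleaved folds: sample i goes to fold i % folds.
    # Real data would normally be shuffled first.
    for fold in range(folds):
        test = [i for i in range(n) if i % folds == fold]
        train = [i for i in range(n) if i % folds != fold]
        yield train, test


def knn_predict(train_x, train_y, x, k):
    # Average the targets of the k training points closest to x.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cv_rmse(xs, ys, k, folds=3):
    # Mean RMSE over the held-out folds for one hyperparameter value.
    fold_scores = []
    for train, test in k_fold_indices(len(xs), folds):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        squared = [(knn_predict(train_x, train_y, xs[i], k) - ys[i]) ** 2 for i in test]
        fold_scores.append((sum(squared) / len(squared)) ** 0.5)
    return sum(fold_scores) / len(fold_scores)


def grid_search(xs, ys, grid, folds=3):
    # Evaluate every candidate and keep the one with the lowest CV error.
    scores = {k: cv_rmse(xs, ys, k, folds) for k in grid}
    best = min(scores, key=scores.get)
    return best, scores


xs = list(range(12))       # synthetic feature
ys = [2 * x for x in xs]   # synthetic target
best_k, scores = grid_search(xs, ys, grid=[1, 2, 4])
print(best_k, scores)
```

The held-out error, not the training error, drives the choice, which is what guards against picking an overfit configuration.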