An introduction to Big Data technologies (15.11.2018)
- Introduction [slide]
- Data-driven Decision Making for Data-driven Organizations
- What are Big Data, Data Science and AI?
- How do Big Data, Data Science and AI support Data-driven Decision Making?
- Who is using Big Data, Data Science and AI?
- Netflix and Big Data, a 20-year success story [slide]
- Scaling Storage Horizontally with HDFS [whiteboard]
- Scaling Processing Horizontally with Hadoop [whiteboard,slide]
- Vertical vs. Horizontal Scalability [slide]
- Enabling data analytics [slide]
- hiding the MapReduce details using Pig and Hive on Tez
- choosing optimized file formats (ORC and Parquet)
- Benchmarking speedups [slides]
- Logical Architecture of a Big Data System [slides]
- Horizontally Scalable Business Intelligence at Macy’s [slide]
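As a companion to the MapReduce topics above, the map, shuffle and reduce phases can be sketched in plain Python on a single machine. This is only an illustration of the programming model, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel on data blocks stored in HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases sequentially over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling word_count on a list of strings returns a dictionary mapping each lower-cased word to its count. Pig and Hive generate jobs of this shape from higher-level scripts and queries.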
Spark: a unified engine to tame volume and velocity (26.11.2018)
- Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide]
- Overview of Spark internals
- Cluster Architecture
- How Spark schedules and executes jobs and tasks
- The Catalyst query optimizer
- In-depth presentation of Spark SQL and DataFrames:
- reading in DataFrames from CSV with and without schema inference
- writing DataFrames as Parquet and Tables
- Spark SQL and the role of caching
- Data Aggregation, Column Operations, date/time functions
- Use of the Spark UI to analyze behavior and performance
- Spark Structured Streaming
- Sources and sinks
- Structured Streaming APIs
- Windowing & Aggregation
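The windowing and aggregation topic can be illustrated without a cluster. The sketch below counts events per fixed, non-overlapping (tumbling) window in plain Python. It is an assumption-laden toy, not the Structured Streaming API: the function name tumbling_window_counts and the (timestamp, key) event shape are hypothetical, and it ignores late data and watermarks.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) for fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, key) pairs, in any order.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Integer division maps every timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)
```

Because each event is assigned to a window by its own timestamp rather than by arrival order, out-of-order input yields the same counts.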
Kafka: a distributed streaming platform (4.12.2018 am)
- Introduction to Kafka [slide]
- messaging: topics (regular and compacted), partitions and replicas
- stream processing: streams, tables and KSQL
- Kafka Demo [slides][link]
- An experience in using Kafka at small scale [slides by Matteo Ferroni]
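To make the difference between regular and compacted topics concrete, the following plain-Python sketch mimics what compaction keeps: the latest value for each key. It is a simplification under stated assumptions, not Kafka code. The function name compact is hypothetical, the log is a list of (key, value) pairs in offset order, and a None value stands for a delete marker (tombstone). In a real broker compaction runs in the background and tombstones are retained for a while before removal.

```python
def compact(log):
    """Keep only the latest record per key; a None value deletes the key."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        # A later record for the same key overwrites the earlier one.
        latest[key] = (offset, value)
    # Keep surviving records in their original offset order, drop tombstones.
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]
```

The compacted log is what lets a topic be read as a table: replaying it from the start rebuilds the current value for every key.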
Analytics with Spark ML (4.12.2018 am)
- understanding sentiment analytics on IMDB data as a classification problem [whiteboard]
- How to build machine learning pipelines for supervised learning
- Use transformers to perform pre-processing on a dataset prior to training
- Train analytical models with Spark ML’s DataFrame-based estimators including Logistic Regression and Decision Trees
- Evaluate a classification model's performance using a confusion matrix
- understanding predictive analysis of bike rental counts as a regression problem using Decision Trees [whiteboard]
- hands-on building a pipeline
- Evaluate a regression model's performance using an error metric such as RMSE
- Model complexity, underfit and overfit
- Learning to tune hyperparameters via cross-validation and grid search applying Random Forests to bike rental counts [whiteboard]
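The two evaluation styles in this lecture can be sketched in plain Python. A confusion matrix counts label pairs and so applies to classifiers such as the sentiment model. A numeric target such as bike rental counts is commonly scored with an error metric like root mean squared error (RMSE) instead. The function names below are hypothetical helpers for illustration and are not the Spark ML evaluator API.

```python
import math
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual_label, predicted_label) pairs for a classifier."""
    return Counter(zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error for a regressor's numeric predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Cells that never occur, for example a negative review predicted as positive when the model makes no such mistake, read as 0 because Counter returns 0 for missing keys.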