Objectives
The course provides the foundational concepts, methods, languages, and systems for ingesting, processing, and analyzing data that flows, in order to enable real-time decisions. The course aims to tame the velocity dimension of Big Data without forgetting the volume and variety dimensions.
Topics covered and tentative scheduling OUTDATED
- 18/09/2024 – 01 – Administrative items and introduction to the course
Streaming Data Engineering
- 19/09/2024 – 02 – Intro to Streaming Data Engineering
- 25/09/2024 – 03 – Languages for DSMS illustrated via EPL
- 26/09/2024 – 04 – Fire Alarm case study – part 1
- 02/10/2024 – 05 – Languages for CEP illustrated via EPL & Fire Alarm case study – part 2
- 03/10/2024 – 08 – Scaling stream ingestion with Apache Kafka
- 09/10/2024 – 06 – Advanced EPL – Join Semantics using Ad case study & contexts using Bocce game case study
- 16/10/2024 – 07 – A complete example of a realistic problem solved using EPL
- 17/10/2024 – 09 – Scaling stream processing with Apache Spark Structured Streaming
- 23/10/2024 – 10 – Kafka & Spark Structured Streaming in practice
- 24/10/2024 – 11 – Ingesting and analysing data of a global retailer
- 30/10/2024 – 12 – Wrap up and preview of the exam’s questions about Streaming Data Engineering
Streaming Data Science
This second part of the course covers Time-series Analytics (TSA), Streaming Machine Learning (SML), and Continual Artificial Intelligence (Continual AI).
- 31/10/2024 – 13 – Introduction to Streaming Data Science
- 06/11/2024 – 14 – TSA Modeling – Introduction to Time-series Analytics and the key concept of Stationarity
- 07/11/2024 – 15 – TSA Modeling – Decomposing and detrending time-series with and without seasonality
- 14/11/2024 – 16 – TSA Forecasting – Time-series forecasting baselines
- 20/11/2024 – 17 – TSA Forecasting – Temporal Dependence, ARMA/ARIMA/SARIMA/SARIMAX models & order estimation
- 21/11/2024 – 18 – TSA Forecasting – Meta’s Prophet & Deep Learning for Time Series
- 27/11/2024 – 19 – SML Taming data streams – Foundations: learning one sample at a time, prequential evaluation, and Concept drift
- 28/11/2024 – 20 – SML Predicting data streams – Methods for streaming classification via River
- 04/12/2024 – 21 – SML Predicting data streams – Ensemble methods for streaming classification and methods for streaming regression (a.k.a., forecasting) via River
- 05/12/2024 – 22 – Continual AI Theory – An introduction
- 12/12/2024 – 23 – Continual AI Practice – Hands on Avalanche
- 18/12/2024 – 24 – SML vs CL
- 19/12/2024 – 25 – Preview of the questions and the exercises about Streaming Data Science
NOTE: for the material, refer to the GitHub repository; for the recordings, refer to the Webeep page of the course.
Thesis
Please complete this form if you want a thesis on the course topic.
Prerequisites
Students are expected to know the basics of database management, SQL, and Machine Learning.
For a refresher on SQL, I recommend https://www.w3schools.com/sql/. It is simple and comprehensive.
For a gentle introduction to Machine Learning, I recommend watching the following two videos by Luis Serrano:
- A Friendly Introduction to Machine Learning – https://youtu.be/IpGxLWOIZy4
- Machine Learning: Testing and Error Metrics – https://youtu.be/aDW44NPhNw0
I also recommend reading the visual introduction to Decision Trees by R2D3.
Expected learning outcomes
Knowledge and understanding | Students will learn how to identify problems that can be addressed with big data techniques tailored for velocity and how to apply stream data analysis technologies to solve real-world problems |
Applying knowledge and understanding | Given specific project cases, students will be able to define and implement a streaming data analysis solution for the problem, and apply it to real data streams from social media and IoT sensors |
Making judgements | Given specific project cases, students will learn how to decide which streaming data analysis solution to apply and how to evaluate this decision |
Communication | Students will learn to write a report on a project describing and motivating the decisions taken and the results obtained, and to present their work in front of their colleagues and teachers |
Lifelong learning skills | Students will learn how to develop a realistic streaming data analysis project in all its phases |
Evaluation
The exam consists of a theoretical part (a written exam, with a possible oral discussion if the instructor deems it necessary) and an optional practical part (project work with an oral presentation). Further contributions to the mark may come from optional continuous evaluations during the course, using in-person quizzes during the lessons and other interactive modalities in class (max one mark).
The written exam comprises a mix of theoretical questions on any course subject and exercises on the technical content and how to apply it in practice. Students can get up to 30 in the written test. The whole exam is a closed-book evaluation. A minimum score on each part is required.
The optional practical project requires using one or more of the technologies presented in the lectures. It must solve a realistic streaming data analysis problem based on real or realistic datasets that are publicly available or provided by the teachers. Only students who get at least 27/30 in the written exam can opt for it. The maximum increment for the optional project is three marks.
The final grade is computed as follows: written test result + optional continuous evaluation result + optional practical project result. E.g., written test 27 + optional continuous evaluation 1 + optional practical project 3 = 30L
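For students who want to check their own arithmetic, the grading rule above can be sketched in a few lines of Python. This is only an illustration, not an official calculator: the function name `final_grade` and its parameters are hypothetical, marks are assumed to be whole numbers, and the mapping of any total above 30 to "30L" is inferred from the single example given above. The minimum score required on each part of the written exam is not modelled.

```python
def final_grade(written: int, continuous: int = 0, project: int = 0) -> str:
    """Combine the three grade components described in the Evaluation section.

    Hypothetical helper for illustration only.
    """
    if not 0 <= written <= 30:
        raise ValueError("the written test result must be between 0 and 30")
    if not 0 <= continuous <= 1:
        raise ValueError("continuous evaluation adds at most one mark")
    if not 0 <= project <= 3:
        raise ValueError("the optional project adds at most three marks")
    if project > 0 and written < 27:
        raise ValueError("the optional project requires at least 27/30 in the written exam")

    total = written + continuous + project
    # Assumption: a total above 30 is reported as "30L",
    # as in the 27 + 1 + 3 example from the text.
    return "30L" if total > 30 else str(total)


if __name__ == "__main__":
    print(final_grade(27, continuous=1, project=3))
    print(final_grade(25, continuous=1))
```

The first call reproduces the example from the text; the second shows a student who did not reach the 27/30 threshold and therefore has no project component.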
Bibliography
- Kreps, Jay, I Love Logs: Event Data, Stream Processing, and Data Integration, O’Reilly Media, Inc., 2014
- Event Processing Language (EPL)
- ksqlDB Documentation
- Spark Structured Streaming
- Flux language
- Geoff Holmes, Ricard Gavaldà, Albert Bifet, Bernhard Pfahringer, Machine Learning for Data Streams: With Practical Examples in MOA, MIT Press, 2018