Streaming Data Analytics 2022-23

Objectives

The course provides the foundational concepts, methods, languages, and systems for ingesting, processing, and analyzing flowing data to enable real-time decisions. The course aims to tame the velocity dimension of Big Data without forgetting the volume and variety dimensions.

Topics covered and tentative scheduling

  • 14/09/2022 – Administrative items and course introduction

Streaming Data Engineering

  • 15/09/2022 – From the foundations of streaming algorithms to real-world languages and systems
  • 22/09/2022 – Languages for Data Stream Management Systems (DSMS) illustrated via EPL
  • 28/09/2022 – DSMS Practice – Fire Alarm case study in EPL
  • 29/09/2022 – Languages for Complex Event Processing (CEP) illustrated with the every clause and the pattern guards in EPL
  • 5/10/2022 – CEP Practice – Robotic Arm case study in EPL
  • 12/10/2022 – Horizontally scalable systems illustrated via Apache Kafka and Apache Spark
  • 13/10/2022 – Spark Structured Streaming as an excuse to talk about handling late arrivals and streaming join semantics
  • 20/10/2022 – Practice with Spark Structured Streaming
  • 26/10/2022 – ksqlDB: cutting all from the same mold
  • 27/10/2022 – Practice with ksqlDB
  • 02/11/2022 – Wrap up and preview of the questions and the exercises about Streaming Data Engineering

Streaming Data Science

The second part of the course covers Time-series Analytics (TSA), Streaming Machine Learning (SML), and Recurrent Neural Networks (RNN).

  • 09/11/2022 – SML Theory/Practice – Foundations: learning one sample at a time, prequential evaluation, and concept drift
  • 10/11/2022 – SML Theory/Practice – Methods for streaming classification illustrated via River
  • 16/11/2022 – SML Theory/Practice – Ensemble methods for streaming machine learning via River
  • 17/11/2022 – Wrap up and preview of the questions and the exercises about SML
  • 23/11/2022 – TSA Theory/Practice – Decomposing and detrending time-series without seasonality
  • 24/11/2022 – TSA Theory/Practice – Stationary time-series, decomposition in the presence of seasonality, and forecasting (part 1 – baseline methods)
  • 30/11/2022 – TSA Theory/Practice – Time-series forecasting (part 2 – exponential smoothing & SARIMA)
  • 1/12/2022 – TSA Theory/Practice – Time-series analysis using SARIMA
  • 14/12/2022 – TSA Theory/Practice – Recurrent Neural Networks
  • 15/12/2022 – Wrap up and preview of the questions and the exercises about TSA
  • 21/12/2022 – Combining TinyML and SML/TSA for adaptive machine learning at the Edge
  • 22/12/2022 – Stream Reasoning: Artificial Intelligence + Stream Processing (BONUS LECTURE)

NOTE: for the material and the recordings, refer to the Webeep page of the course.

Prerequisites

Students are expected to know the basics of database management, SQL, and Machine Learning.

For a refresher on SQL, I recommend https://www.w3schools.com/sql/. It is simple and comprehensive.

For a gentle introduction to Machine Learning, I recommend watching the following two videos by Luis Serrano:

I also recommend reading the visual introduction to Decision Trees by R2D3.

Expected learning outcomes

Knowledge and understanding: Students will learn how to identify problems that can be addressed with big data techniques tailored for velocity and how to apply stream data analysis technologies to solve real-world problems.
Applying knowledge and understanding: Given specific project cases, students will be able to define and implement a streaming data analysis solution for the problem and apply it to real data streams from social media and IoT sensors.
Making judgements: Given specific project cases, students will learn how to decide which streaming data analysis solution to apply and how to evaluate this decision.
Communication: Students will learn to write a project report describing and motivating the decisions taken and the results obtained, and to present their work in front of their colleagues and teachers.
Lifelong learning skills: Students will learn how to develop a realistic streaming data analysis project in all its phases.

Evaluation

The exam consists of a theoretical part (a written exam, with a possible oral discussion if the instructor deems it necessary) and an optional practical part (project work with an oral presentation). Further contributions to the mark may come from optional continuous evaluations during the course, using in-person quizzes during the lessons and other interactive modalities in class (max two marks).

The written exam comprises a mix of theoretical questions on any course subject and exercises on the technical content and how to apply it in practice. Students can get up to 30 in the written test. The whole exam is a closed-book evaluation. A minimum score on each part is required.
The optional practical project requires using one or more of the technologies presented in the lectures. It addresses a realistic streaming data analysis problem based on real or realistic datasets that are publicly available or provided by the teachers. Only students who get at least 27/30 in the written exam can opt for it. The maximum increment for the optional project is three marks.

The final grade is computed as follows: written test result + optional continuous evaluation result + optional practical project result. E.g., written test 27 + optional continuous evaluation 1 + optional practical project 3 = 30L
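
The grading rule can be sketched as a small Python function. This is only an illustration, not an official calculator: the function name final_grade and its parameters are hypothetical, and the mapping of any total above 30 to "30L" is an assumption inferred from the single example above. The per-part minimum score of the written exam is not modelled.

```python
def final_grade(written, continuous=0, project=0):
    """Combine the three components described in the Evaluation section."""
    if not 0 <= written <= 30:
        raise ValueError("the written test result must be between 0 and 30")
    if not 0 <= continuous <= 2:
        raise ValueError("continuous evaluation adds at most two marks")
    if not 0 <= project <= 3:
        raise ValueError("the optional project adds at most three marks")
    if project > 0 and written < 27:
        raise ValueError("the project requires at least 27/30 in the written test")
    total = written + continuous + project
    # Assumption: any total above 30 is recorded as "30L",
    # as suggested by the 27 + 1 + 3 example in the text.
    return "30L" if total > 30 else str(total)


print(final_grade(27, continuous=1, project=3))  # 30L
print(final_grade(25, continuous=2))             # 27
```

The eligibility check mirrors the rule that only students with at least 27/30 in the written exam can opt for the project.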

Bibliography