This is a copy of the official page of the PhD course 055067 on “Data Management For Large-scale Analytics”, organized for the PhD program in Data Analytics and Decision Sciences by prof. Marco Brambilla and prof. Emanuele Della Valle in collaboration with prof. Stefano Ceri and prof. Danilo Ardagna.
Abstract
Large-scale data analytics is everywhere, and researchers from all disciplines are addressing the topic from their own perspective, producing excellent vertical experiments but often losing the wider picture. This course aims to provide the principles, practices and technologies that enable large-scale data analytics, and thus to foster practice and academic debate around data science.
Contents
- Part 1: INTRO. Grand challenges of Data Analytics
  - Introduction to large-scale analytics
  - Opportunities for social, environmental and economic problems
  - Problems in current research in big data and data science
  - Data access and quality issues
- Part 2: DATA. Data models and their implementations
  - Traditional ER and relational data models, SQL
  - Transactional and active databases
  - NoSQL data models: document, graph, column-based and key-value models
  - NoSQL platforms and technologies
  - Main memory large-scale databases
- Part 3: FEATURES. Taming data volume, velocity, variety, and veracity
  - Volume: Scaling computation and storage horizontally
  - Map Reduce from Apache Hadoop to Apache Spark and Apache Flink
  - Velocity: Information flow processing principles, approaches and tools
  - Hands-on Apache Spark to tame volume and velocity in data analytics
  - Veracity: data quality and data wrangling
  - Variety: web data extraction and data integration
- Part 4: Project work
Calendar
Topic | Date | Start Time | End Time | Hours | Instructor | Room |
Part 1: INTRO. Grand challenges of Data Analytics | ||||||
Introduction to large-scale analytics and opportunities for social, environmental and economic problems. | Feb 7 | 14:30 | 16:30 | 1 | E. Della Valle | PT1 – DEIB – Building 20 |
Problems in current research. Data access and quality issues | Feb 10 | 13:30 | 14:30 | 1 | M. Brambilla | PT1 – DEIB – Building 20 |
Part 2: DATA. Data models and their implementations | ||||||
Traditional ER and relational data models and SQL | Feb 10 | 14:30 | 16:30 | 2 | M. Brambilla | PT1 – DEIB – Building 20 |
Architectural and transactional aspects of databases | Feb 10 | 16:30 | 18:30 | 2 | S. Ceri | PT1 – DEIB – Building 20 |
NoSQL data models, platforms and technologies. Main memory large-scale databases | Feb 11 | 10:00 | 14:00 | 2 | M. Brambilla | BIO1 – Building 21 – First Floor |
Part 3: FEATURES. Taming data volume, velocity, variety, and veracity | ||||||
Volume: Scaling computation and storage horizontally | Feb 18 | 10:00 | 11:00 | 1 | D. Ardagna | PT1 – DEIB – Building 20 |
Map Reduce from Apache Hadoop to Apache Spark and Apache Flink | Feb 18 | 11:00 | 13:00 | 2 | D. Ardagna | PT1 – DEIB – Building 20 |
Velocity: Information flow processing principles, approaches and tools | Feb 18 | 14:00 | 15:00 | 1 | E. Della Valle | PT1 – DEIB – Building 20 |
Hands-on Apache Spark and Apache Kafka | Feb 21 | 13:00 | 17:00 | 2 | E. Della Valle | PT1 – DEIB – Building 20 |
Veracity: data quality and data wrangling | Feb 26 | 14:00 | 16:00 | 2 | M. Brambilla | PT1 – DEIB – Building 20 |
Variety: web data extraction and data integration | Feb 26 | 16:00 | 17:00 | 1 | M. Brambilla | PT1 – DEIB – Building 20 |
Part 4: Project Work | ||||||
Support for project work | TBD | TBD | TBD | 3 | M. Brambilla + D. Ardagna | TBD |
Evaluation of project work | TBD | TBD | TBD | 3 | M. Brambilla + E. Della Valle | TBD |
Exam
Students will be required to build a research case, identifying business value, data and methods, using the tools to analyze and visualize data, critically analyzing pitfalls, and highlighting their contributions.
The evaluation will be based on a concrete implementation of a case proposed by the instructors, where students will be asked to implement the data management phases discussed in class on a practical example, using cloud-based large-scale data management platforms and technologies.