Introductory Lectures

Big Data Analytics [18-05-2018]

Introduction [slide]
- What’s Big Data? Volume, Velocity, Variety without forsaking Veracity
- Why? It’s all about creating value
- How? Paradigmatic shifts enabled by Big Data
Netflix and Big Data, a 20-year-long success story [slide]
Vertical vs. Horizontal Scalability [slide]
Horizontally Scalable Business Intelligence at Macy’s [slide]

Taming Data Volume [25-05-2018]

Scaling Storage Horizontally with NoSQL [slide]
Scaling Processing Horizontally with Hadoop [slide]
Hiding the details with declarative abstractions (Pig and Hive) [slide]
Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide]
Big Data Use Cases [slide]

Taming Data Velocity [01-06-2018]

It’s a Streaming World! [slide]
Taming Velocity with Stream and Complex Event Processing [slide]
Kafka as an example of Stream Processing [slides]
- Technology perspective
  - The Death and Rebirth of the Event Driven Architecture [Jay Kreps’ keynote at Kafka Summit 2018]
  - The Present and Future of the Streaming Platform [Neha Narkhede’s keynote at Kafka Summit 2018]
- Kafka success stories
  - CERN [Luca Magnoni’s talk at Kafka Summit 2018]
  - ING [Timor Timuri + Richard Bras at Kafka Summit 2018]

Data Science [06-06-2018]

Introduction [slide]
- Why? Get rid of the HiPPO and embrace data-driven decision making!
- What? Big Data is crude oil, Data Science is refining it!
- Who? Hacking skills + Math & statistics knowledge + Substantive Expertise
- How? Statistics + Machine Learning + Visualizations in an agile way
- Where? Public safety, Swisscom, ENI
Machine Learning overview [by Brooke Wenig] [slides]
- What? Supervised vs. unsupervised ML
- All models are wrong, but some are useful
- K-means as an example of unsupervised ML [slides]
- Examples of supervised ML methods
  - Logistic Regression for Classification of text as positive vs. negative
  - Decision Trees for Regression problems (e.g. predicting bike sharing usage)
  - Random Forests to train models with low bias and low variance
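
As a rough illustration of the K-means item above, here is a minimal pure-Python sketch of the assignment/update loop (Lloyd's algorithm). It is not taken from the course slides: the function names `nearest` and `kmeans`, the toy 2-D points, and the choice to pass the initial centroids explicitly (instead of random or k-means++ initialisation) are all assumptions made for the example.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2 + (point[1] - centroids[i][1]) ** 2,
    )


def kmeans(points, initial_centroids, max_iterations=100):
    """Cluster 2-D points; stops when the centroids no longer move."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        # Assignment step: group the points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                updated.append((mean_x, mean_y))
            else:
                updated.append(old)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical toy data: two well-separated groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9), (9, 10), (10, 9), (10, 10)]
centroids = kmeans(points, initial_centroids=[points[0], points[-1]])
labels = [nearest(p, centroids) for p in points]
```

No labels are supplied at any point, which is what makes the method unsupervised. The result depends on the starting centroids, so practical implementations usually run several initialisations and keep the best one.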

Taming Data Variety [26-06-2018]

Introduction
- The interoperability problem
- The standardisation dilemma
- Variety cannot be avoided
- Embrace variety-proof technologies with smart data & smart machines
The long-wave of smart data technologies adoption
- The early days of the Semantic Web
- The “happy” days of Linked Open Data
- The success story of schema.org & Google Knowledge Graph
The disruptions of smart machines
- From dumb machines to smart machines
- Deep Learning and its applications
Long live Cognitive Computing!
- How smart data and smart machines made Watson win Jeopardy
- Success stories in Cognitive Computing

Technical Lectures

A deep dive into Hadoop [14-06-2018]

Hadoop ecosystem essentials [slides]
Hadoop key blocks: HDFS, YARN, MapReduce, Tez, Pig and Hive [slides]
Logical Architecture of a Big Data System [slides]
Hands-on Hortonworks Data Platform [my blog post]
More on Technologies:
- Core Components: HDFS 2, YARN, Tez, Slider
- Data Access: Hive 2, Stinger.next
- Data Formats: ORC, Parquet, Avro
- Data Ingestion: Sqoop
- Orchestration: Oozie and Ambari workflow editor
Benchmarks: Tez Improvements [slides] and Impact of File formats
Success Stories for Pig at Twitter and Hive at Facebook
Conclusions: many distributions, even more components, learn them and use the right combination for your use case [slides]

Apache Spark overview analysing Wikipedia logs [28-06-2018]

Spark as a unifying platform for Data Engineering, Data Analysis and Data Science
How to implement simple use cases with the Spark core APIs on Wikipedia logs
How to build data pipelines and query large data sets with Spark SQL and DataFrames on English Wikipedia logs
Learn about the internals of the Catalyst Optimizer and Tungsten
Understand how Spark Structured Streaming can analyse Wikipedia edits in real time
Understand how GraphFrames can analyse the Wikipedia graph (users that edit pages that link to other pages)
MATERIAL: notebooks for Databricks

Data science with Apache Spark [05-07-2018]

SparkML: assemble processing, model-building, and evaluation pipelines
How to build a simple sentiment mining solution using Logistic Regression and Amazon reviews
How to predict bike rental counts using Decision Trees
Learn how to perform hyperparameter tuning using Random Forests to improve the prediction of bike rental counts
Check if you understood by working with Gradient Boosted Decision Trees
MATERIAL: notebooks for Databricks and slides by Brooke Wenig

Docker [11-07-2018 morning]

Container Basics
Docker Images
- Architecture
- Dockerfile
- Docker Hub
Docker Services
- Compose (single machine)
- Swarm Mode (brief overview)
MATERIAL:
- we used a subset of the Container Training.
- we also used a gitter.im channel

Deep Learning Theory and Practice with Keras and TensorFlow [11-07-2018 afternoon]

Introduction to Deep Learning: how, why and when it works
MNIST Digits Dataset: the dataset for the hands-on
Keras: High-Level API for Neural Networks and Deep Learning
Hands-on Neural Network with Keras building a “Dense Feed-Forward Shallow” Network
Understanding Training using Gradient Descent and Back Propagation
Introducing non-linearity: from Sigmoid to ReLU
Convolutional Neural Networks: from intuition to a working network for MNIST
Recurrent Neural Networks: Networks for Understanding Time-Oriented Patterns in Data
Transfer Learning
MATERIAL:
- notebooks for Databricks
- playing with Deep Learning in your browser
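
As a small aside on the "from Sigmoid to ReLU" item above, the sketch below compares the two activations and their derivatives in plain Python. It is an illustration under stated assumptions rather than material from the course notebooks: the helper names are made up for the example, and the derivative of ReLU at exactly 0 is taken to be 0 by convention.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; it peaks at x = 0 and shrinks towards 0 for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    """Rectified Linear Unit: passes positive inputs through, clips negatives to 0."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


# Compare the gradients that back propagation would multiply through a layer.
for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

For inputs far from zero the sigmoid's derivative is close to 0, so gradients passed back through many such layers tend to shrink, while ReLU keeps a derivative of 1 for every positive input. This is one commonly cited reason for preferring ReLU in deeper networks.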

Urban Data Science Hackathon [12-07-2018]

The datasets available: people flows from counting sensors, people presence and demographics from mobile telecom data, free parking, weather, and social media
Learning to formulate an Urban Data Science problem
Setting an Urban Data Science problem
Using methods and techniques learnt in previous days to solve the set problem
Presentation of the solutions
MATERIAL: introduction

From SQL to noSQL and back to newSQL

[06-09-2018]

Brainstorm on the requirements of a Database Management System (DBMS), how relational DBMSs have addressed those needs since the ’90s, and whether a different approach was possible in the 2000s and is mandatory in the 2010s [whiteboard]
A running example we will use across the lecture [Entity Relationship diagram, example data]
The impedance mismatch problem [class diagram of the running example]
Documents as transaction boundaries [json representation of the running example]
Numbers to know [slide]
Key-value stores such as memcached and Redis
- memcached [slides]
- hands-on memcached [notes]
- Redis [slides]
- hands-on Redis [notes][cheatsheet][running example]

[18-09-2018 (am)]

Document stores
- MongoDB [slides]
- hands-on MongoDB [notes]
Scalable column stores such as Google Bigtable, HBase and Cassandra [slides]
Graph databases [slides]
- Neo4j and Cypher [slides]
- hands-on Neo4j and Cypher using the Neo4j Web interface [notes]
- RDF and SPARQL [slides]
- hands-on SPARQL using the DBpedia public SPARQL endpoint [notes]

[19-09-2018 (pm)]

Things left out of the original plan
- Back to SQL with newSQL (e.g., VoltDB): it’s always a matter of trade-offs
- Conclusion: how to choose and what to avoid doing [slides]
Alternatively, I can teach more deep learning

Big Data Analytics and Data Science [Q2, 2018]
