Big Data Analytics and Data Science [Q3, 2018]

Contents

Introductory Lectures

Big Data Analytics [05-09-2018 (am)]

  • Introduction [slide]
    • Data-driven Decision Making for Data-driven Organizations
    • What are Big Data, Data Science and AI?
    • How do Big Data, Data Science and AI support Data-driven Decision Making?
    • Who is using Big Data, Data Science and AI?
  • Netflix and Big Data, a 20-year-long success story [slide]

Taming Data Volume [05-09-2018 (pm)]

  • Vertical vs. Horizontal Scalability [slide]
  • Scaling Storing Horizontally with NoSQL [slide]
  • Scaling Processing Horizontally with Hadoop [slide]
  • Hiding the details with declarative abstractions (Pig and Hive) [slide]
  • Spark: Efficient (up to 100x faster) and Usable (up to 5x less code) Big Data [slide] (see the sketch after this list)
  • Big Data Use Cases [slide]
  • Horizontally Scalable Business Intelligence at Macy’s [slide]
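
A minimal, self-contained illustration of the “usable at scale” point: the PySpark sketch below counts words in a plain-text file with the DataFrame API. It is not part of the course material; the input path is a placeholder, and the same code runs unchanged on a laptop or on a cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, lower, split

    # Start (or reuse) a Spark session; on a cluster the same code scales out unchanged.
    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    # Placeholder path: any line-oriented text file works.
    lines = spark.read.text("data/sample.txt")

    word_counts = (lines
        .select(explode(split(lower(col("value")), r"\s+")).alias("word"))
        .where(col("word") != "")
        .groupBy("word")
        .count()
        .orderBy(col("count").desc()))

    word_counts.show(10)

The equivalent hand-written MapReduce job takes dozens of lines of Java, which is roughly the gap the “5x less code” claim points at.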

Taming Data Velocity [10-09-2018 (am)]

  • Introduction [slide]
    • How did we get to technologies able to tame velocity?
    • The database perspective: from RDBMS to DSMS
    • The event processing perspective: from Publish-Subscribe middleware to Complex Event Processing
    • The IT architectural perspective: from Service-Oriented Architecture to Event-driven Architecture
    • Is this only hype?
  • The journey to an event-driven enterprise [slides][link to Kafka Summit 2018 keynote]
    • 1st – streaming awareness and pilot
    • 2nd – early production
    • 3rd – mission critical production
    • 4th – global streaming
    • 5th – central nervous system
  • Introduction to Kafka [slide]
    • messaging: topics (regular and compacted), partitions and replicas (see the producer/consumer sketch after this list)
    • stream processing: stream, tables and KSQL
  • Kafka Demo [slides][link]
  • An experience in using Kafka at small scale [slides by Matteo Ferroni]
  • Interesting links to check out
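
To make the messaging vocabulary above concrete, here is a minimal producer/consumer sketch. It assumes a local broker on localhost:9092 and uses the kafka-python client, which is not prescribed by the lecture (the demo relies on Kafka’s own tooling and KSQL); the topic name and payloads are made up.

    from kafka import KafkaProducer, KafkaConsumer

    # Produce a few events to a topic; records with the same key land in the same partition.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for page in ("/home", "/cart", "/checkout"):
        producer.send("clicks", key=b"user-42", value=page.encode("utf-8"))
    producer.flush()

    # Consume them back, starting from the earliest available offset.
    consumer = KafkaConsumer("clicks",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for record in consumer:
        print(record.partition, record.offset, record.key, record.value)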

Taming Data Variety [10-09-2018 (pm)]

  • Introduction [slides]
    • The interoperability problem
      • The standardisation dilemma
      • Variety cannot be avoided
    • Embrace variety-proof technologies with smart data & smart machines
    • The long wave of smart data technology adoption
      • The early days of the Semantic Web
      • The “happy” days of Linked Open Data
      • The success story of schema.org & Google Knowledge Graph
  • An overview of existing knowledge graphs [slides] (see the sketch after this list)
  • Demo of Ontop [slides]
    • ingest a stream
    • ingest a table
    • enrich a stream with a table
    • continuously analyse the enriched stream
    • connect Elasticsearch and Kibana to visualize the analyses in real time
  • My own research: Stream Reasoning [slides]
  • Interesting links to check out:
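
As a tiny illustration of the “smart data” idea behind schema.org and knowledge graphs, the sketch below builds a two-entity graph with rdflib and queries it with SPARQL. The library choice, URIs and triples are mine, not the course’s.

    from rdflib import Graph

    # A few illustrative triples in Turtle, using schema.org terms.
    ttl = """
    @prefix schema: <http://schema.org/> .
    <http://example.org/course/bigdata>
        a schema:Course ;
        schema:name "Big Data Analytics and Data Science" ;
        schema:provider <http://example.org/org/university> .
    <http://example.org/org/university>
        a schema:Organization ;
        schema:name "Example University" .
    """

    g = Graph()
    g.parse(data=ttl, format="turtle")

    # SPARQL: which organization provides which course?
    q = """
    PREFIX schema: <http://schema.org/>
    SELECT ?course ?org WHERE {
        ?c a schema:Course ; schema:name ?course ; schema:provider ?p .
        ?p schema:name ?org .
    }
    """
    for row in g.query(q):
        print(row.course, "-", row.org)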

Data Science [19-09-2018 (am)]

  • Introduction [slide]
    • Why? Get rid of the HiPPO and embrace data-driven decision making!
    • What? Big Data is crude oil, Data Science is refining it!
    • Who? Hacking skills + Math & statistics knowledge + Substantive Experience
    • How? Statistics + Machine Learning + Visualizations in an agile way
    • Where? Public safety, Swisscom, ENI
  • Machine Learning overview [by Brooke Wenig] [slides]
    • What? Supervised vs. unsupervised ML
    • All models are wrong, but some are useful
    • K-means as an example of unsupervised ML [slides] (see the sketch after this list)
    • Examples of supervised ML methods
      • Logistic Regression for Classification of text as positive vs. negative
      • Decision Trees for Regression problems (e.g. predicting bike sharing usage)
      • Random Forests to train models with low bias and low variance
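
A minimal K-means sketch to accompany the unsupervised ML example; it uses scikit-learn on synthetic data purely for brevity (the hands-on lectures later use Databricks/Spark), so the data and parameters are illustrative only.

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic 2-D points drawn around three different centres.
    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
        rng.normal(loc=(5.0, 5.0), scale=0.5, size=(100, 2)),
        rng.normal(loc=(0.0, 5.0), scale=0.5, size=(100, 2)),
    ])

    # Fit K-means with k=3; each point is assigned to the nearest centroid.
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.cluster_centers_)   # learned centroids
    print(km.labels_[:10])       # cluster assignment of the first 10 points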

Technical Lectures

A deep dive into Hadoop [11-10-2018]

Apache Spark overview analysing Wikipedia logs [18-10-2018]

  • Spark as a unifying platform for Data Engineering, Data Analysis and Data Science
  • How to implement simple use cases with Spark’s core APIs on the Wikipedia logs
  • How to build data pipelines and query large datasets with Spark SQL and DataFrames on the English Wikipedia logs (see the sketch after this list)
  • Learn about the internals of the Catalyst optimizer and Tungsten
  • Understand how Spark Structured Streaming can analyse Wikipedia edits in real time
  • Understand how GraphFrames can analyse the Wikipedia graph (users that edit pages that link to other pages)
  • MATERIAL: notebooks for Databricks
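
The actual Databricks notebooks define their own datasets; as a stand-in, the sketch below assumes a Parquet file of Wikipedia pageview counts with columns project, page and requests (an assumption, not the notebooks’ schema) and answers the same question twice, once with the DataFrame API and once with Spark SQL.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, desc

    spark = SparkSession.builder.appName("wikipedia-logs-sketch").getOrCreate()

    # Assumed layout: columns `project`, `page`, `requests`.
    pageviews = spark.read.parquet("data/wikipedia_pageviews.parquet")

    # DataFrame API: top 10 most requested English Wikipedia pages.
    (pageviews
        .filter(col("project") == "en")
        .groupBy("page")
        .sum("requests")
        .withColumnRenamed("sum(requests)", "total_requests")
        .orderBy(desc("total_requests"))
        .show(10, truncate=False))

    # The same query through Spark SQL; both paths go through the Catalyst optimizer.
    pageviews.createOrReplaceTempView("pageviews")
    spark.sql("""
        SELECT page, SUM(requests) AS total_requests
        FROM pageviews
        WHERE project = 'en'
        GROUP BY page
        ORDER BY total_requests DESC
        LIMIT 10
    """).show(truncate=False)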

Data science with Apache Spark [23-10-2018 (am) + 25-10-2018 (am)]

  • SparkML: assemble processing, model-building, and evaluation pipelines (see the sketch after this list)
  • How to build a simple sentiment mining solution using Logistic Regression and Amazon reviews
  • How to predict bike rental counts using Decision Trees
  • Learn how to perform hyperparameter tuning using Random Forests to improve the prediction of bike rental counts
  • Check if you understood by working with Gradient Boosted Decision Trees
  • MATERIAL: notebooks for Databricks and slides by Brooke Wenig
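
A minimal sketch of the SparkML pipeline idea applied to sentiment classification with Logistic Regression. The toy reviews stand in for the Amazon data used in class, and the feature choices (Tokenizer + HashingTF) are mine; the hyperparameter-tuning bullet builds on the same Pipeline object via ParamGridBuilder and CrossValidator.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("sentiment-sketch").getOrCreate()

    # Toy stand-in for the Amazon reviews: text plus a binary label (1.0 = positive).
    reviews = spark.createDataFrame([
        ("great product, works perfectly", 1.0),
        ("terrible quality, broke after a day", 0.0),
        ("absolutely love it", 1.0),
        ("waste of money", 0.0),
    ], ["text", "label"])

    # Pipeline = feature extraction + model, fitted and applied as a single unit.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1024)
    lr = LogisticRegression(maxIter=20)
    pipeline = Pipeline(stages=[tokenizer, tf, lr])

    model = pipeline.fit(reviews)
    model.transform(reviews).select("text", "probability", "prediction").show(truncate=False)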

Deep Learning Theory and Practice with Keras and TensorFlow [6-11-2018]

  • Introduction to Deep Learning: how, why and when it works
  • MNIST Digits Dataset: the dataset for the hands-on
  • Keras: High-Level API for Neural Networks and Deep Learning
  • Hands-on Neural Network with Keras building a “Dense Feed-Forward Shallow” Network (see the sketch after this list)
  • Understanding Training using Gradient Descent and Backpropagation
  • Introducing non-linearity: from Sigmoid to ReLU
  • Convolutional Neural Networks: from intuition to a working network for MNIST
  • Recurrent Neural Networks: Networks for Understanding Time-Oriented Patterns in Data
  • Transfer Learning
  • MATERIAL:
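
A minimal sketch of the “Dense Feed-Forward Shallow” network from the hands-on, written against the Keras API bundled with TensorFlow; the layer sizes and number of epochs are illustrative choices, not the lecture’s.

    from tensorflow import keras

    # MNIST: 60,000 training and 10,000 test images of handwritten digits (28x28, grayscale).
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

    # Flatten -> one hidden ReLU layer -> softmax over the 10 digit classes.
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # Training = gradient descent (here Adam) minimising cross-entropy via backpropagation.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(x_train, y_train, epochs=5, validation_split=0.1)
    print(model.evaluate(x_test, y_test))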

Urban Data Science Hackathon [14-11-2018]

  • The datasets available: people flows from counting sensors, people presence and demographics from mobile telecom data, free parking, weather, and social media
  • Learning to formulate an Urban Data Science problem
  • Setting an Urban Data Science problem
  • Using methods and techniques learnt in previous days to solve the set problem
  • Presentation of the solutions
  • MATERIAL: introduction

Docker

From SQL to NoSQL and back to NewSQL

  • Brainstorm on what an RDBMS is and which requirements it satisfies
  • Theoretical limits on Consistency, Availability and Partition tolerance
  • The NoSQL landscape
    • key-value stores such as memcached and Redis
      • hands-on the memcached textual protocol using telnet
      • hands-on Redis using redis-cli (see the sketch after this list)
    • scalable column stores such as Google BigTable, HBase and Cassandra
      • hands-on Cassandra using CQL-CLI
    • document stores such as MongoDB and Elasticsearch
      • hands-on MongoDB using mongo-cli
    • graph databases such as Neo4j
      • hands-on Neo4j and Cypher using the Neo4j Web interface
  • Back to SQL with NewSQL (e.g., VoltDB): it’s always a matter of trade-offs
  • Conclusion: how to choose and what to avoid doing
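
The hands-ons above use the command-line clients; as a complementary sketch, here is the same key-value idea from Python with the redis client, assuming a Redis server on the default local port (the library choice and key names are mine).

    import redis

    # Connect to a local Redis instance (default host/port assumed).
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Plain key-value operations, the same ones exercised with redis-cli.
    r.set("session:42", "alice")
    print(r.get("session:42"))                            # -> alice

    # A key with a time-to-live: the typical caching pattern (memcached's core use case).
    r.set("cache:homepage", "<html>...</html>", ex=30)    # expires after 30 seconds
    print(r.ttl("cache:homepage"))

    # An atomically incremented counter.
    r.incr("pageviews")
    print(r.get("pageviews"))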