Contents
Introductory Lectures
Big Data Analytics [18-05-2018]
Introduction [slide]
What’s Big Data? Volume, Velocity, Variety without forsaking Veracity
Why? It’s all about creating value
How? Paradigmatic shifts enabled by Big Data
Netflix and Big Data, a 20-year-long success story [slide]
Vertical vs. Horizontal Scalability [slide]
Horizontal Scalability: Business Intelligence at Macy’s [slide]
Taming Data Volume [25-05-2018]
Scaling Storage Horizontally with NoSQL [slide]
Scaling Processing Horizontally with Hadoop [slide]
Hiding the details with declarative abstractions (Pig and Hive) [slide]
Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide]
Big Data Use Cases [slide]
Taming Data Velocity [01-06-2018]
It’s a Streaming World! [slide]
Taming Velocity with Stream and Complex Event Processing [slide]
Kafka as an example of Stream Processing [slides]
Technology perspective
Kafka success stories
CERN [Luca Magnoni’s talk at Kafka Summit 2018]
ING [Timor Timuri + Richard Bras at Kafka Summit 2018]
Data Science [06-06-2018]
Introduction [slide]
Why? Get rid of the HiPPO and embrace data-driven decision making!
What? Big Data is crude oil, Data Science is refining it!
Who? Hacking skills + Math & statistics knowledge + Substantive Expertise
How? Statistics + Machine Learning + Visualizations in an agile way
Where? Public safety, Swisscom, ENI
Machine Learning overview [by Brooke Wenig] [slides]
What? Supervised vs. unsupervised ML
All models are wrong, but some are useful
K-means as an example of unsupervised ML [slides]
Examples of supervised ML methods
Logistic Regression for Classification of text as positive vs. negative
Decision Trees for Regression problems (e.g. predicting bike sharing usage)
Random Forests to train models with low bias and low variance
Taming Data Variety [26-06-2018]
Introduction
The interoperability problem
The standardisation dilemma
Variety cannot be avoided
Embrace variety-proof technologies with smart data & smart machines
The long-wave of smart data technologies adoption
The early days of the Semantic Web
The “happy” days of Linked Open Data
The success story of schema.org & Google Knowledge Graph
The disruptions of smart machines
From dumb machines to smart machines
Deep Learning and its applications
Long live Cognitive Computing!
How smart data and smart machines made Watson win Jeopardy
Success stories in Cognitive Computing
Technical Lectures
A deep dive into Hadoop [14-06-2018]
Hadoop ecosystem essentials [slides]
Hadoop key blocks: HDFS, YARN, MapReduce, Tez, Pig and Hive [slides]
Logical Architecture of a Big Data System [slides]
Hands on Hortonworks Data Platform [my blog post]
More on Technologies:
Benchmarks: Tez Improvements [slides] and Impact of File formats
Success Stories for Pig at Twitter and Hive at Facebook
Conclusions: many distributions, even more components, learn them and use the right combination for your use case [slides]
Apache Spark overview analysing Wikipedia logs [28-06-2018]
Spark as a unifying platform for Data Engineering, Data Analysis and Data Science
How to implement simple Spark use cases with the core APIs on Wikipedia logs
How to build data pipelines and query large data sets with Spark SQL and DataFrames on English Wikipedia logs
Learn about the internals of the Catalyst Optimizer and Tungsten
Understand how Spark Structured Streaming can analyse Wikipedia edits in real time
Understand how GraphFrames can analyse the Wikipedia graph (users that edit pages that link to other pages)
MATERIAL: notebooks for Databricks
Data science with Apache Spark [05-07-2018]
SparkML: assemble processing, model-building, and evaluation pipelines
How to build a simple sentiment mining solution using Logistic Regression and Amazon reviews
How to predict bike rental counts using Decision Trees
Learn how to perform hyperparameter tuning using Random Forests to improve the prediction of bike rental counts
Check if you understood by working with Gradient Boosted Decision Trees
MATERIAL: notebooks for Databricks and slides by Brooke Wenig
Docker [11-07-2018 morning]
Container Basics
Docker Images
Architecture
Dockerfile
Docker Hub
Docker Services
Compose (single machine)
Swarm Mode (brief overview)
MATERIAL:
Deep Learning Theory and Practice with Keras and TensorFlow [11-07-2018 afternoon]
Introduction to Deep Learning: how, why and when it works
MNIST Digits Dataset: the dataset for the hands-on
Keras: High-Level API for Neural Networks and Deep Learning
Hands-on Neural Network with Keras building a “Dense Feed-Forward Shallow” Network
Understanding Training using Gradient Descent and Back Propagation
Introducing non-linearity: from Sigmoid to ReLU
Convolutional Neural Networks: from intuition to a working network for MNIST
Recurrent Neural Networks: Networks for Understanding Time-Oriented Patterns in Data
Transfer Learning
MATERIAL:
Urban Data Science Hackathon [12-07-2018]
The datasets available: people flows from counting sensors, people presence and demographics from mobile telecom data, free parking, weather, and social media
Learning to formulate an Urban Data Science problem
Setting an Urban Data Science problem
Using methods and techniques learnt in previous days to solve the set problem
Presentation of the solutions
MATERIAL: introduction
From SQL to NoSQL and back to NewSQL
[06-09-2018]
[18-09-2018 (am)]
document stores
MongoDB [slides]
hands-on MongoDB [notes]
scalable column stores such as Google Bigtable, HBase and Cassandra [slides]
graph databases [slides]
Neo4j and Cypher [slides]
hands-on Neo4j and Cypher using the Neo4j Web interface [notes]
RDF and SPARQL [slides]
hands-on SPARQL using the DBpedia public SPARQL endpoint [notes]
[19-09-2018 (pm)]
things left behind from the original plan
back to SQL with NewSQL (e.g., VoltDB): it’s always a matter of trade-offs
Conclusion: how to choose and what to avoid doing [slides]
alternatively I can teach more deep learning