		Contents 
Introduction [slide ]
What’s Big Data? Volume, Velocity, Variety without forsaking Veracity
Why? It’s all about creating value  
How? Paradigmatic shifts enabled by Big Data 
 
 
Netflix and Big Data, a 20-year-long success story [slide ] 
Vertical vs. Horizontal Scalability [slide ] 
Horizontal Scalability Business Intelligence in Macy’s [slide ] 
 
Scaling Storing Horizontally with NoSQL [slide ] 
Scaling Processing Horizontally with Hadoop [slide ] 
Hiding the details with declarative abstractions (PIG and HIVE) [slide ] 
Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide ] 
Big Data Use Cases [slide ] 
 
It’s a Streaming World! [slide ] 
Taming Velocity with Stream and Complex Event Processing [slide ] 
Kafka as an example of Stream Processing [slides ]
Technology perspective
 
Kafka success stories
CERN [Luca Magnoni’s talk at Kafka Summit 2018]
ING [Timor Timuri + Richard Bras at Kafka Summit 2018]
 
 
 
 
Introduction [slide]
Why? Get rid of the HiPPO and embrace data-driven decision making! 
What? Big Data is crude oil, Data Science is refining it! 
Who? Hacking skills + Math & statistics knowledge + Substantive Experience 
How? Statistics + Machine Learning + Visualizations in an agile way 
Where? Public safety, Swisscom, ENI 
 
 
Machine Learning overview [by Brooke Wenig ] [slides ]
What? Supervised vs. unsupervised ML 
All models are wrong, but some are useful 
K-means as an example of unsupervised ML [slides ] 
Examples of supervised ML methods
Logistic Regression for Classification of text as positive vs. negative 
Decision Trees for Regression problems (e.g. predicting bike sharing usage) 
Random Forests to train models with low bias and low variance 
 
 
 
 
 
[26-06-2018]
Introduction
The interoperability problem 
The standardisation dilemma 
Variety cannot be avoided 
Embrace variety-proof technologies with smart data & smart machines 
 
 
The long-wave of smart data technologies adoption
The early days of the Semantic Web 
The “happy” days of Linked Open Data 
The success story of schema.org & Google Knowledge Graph 
 
 
The disruptions of smart machines
From dumb machines to smart machines 
Deep Learning and its applications 
 
 
Long live Cognitive Computing!
How smart data and smart machines made Watson win Jeopardy 
Success stories in Cognitive Computing 
 
 
 
Hadoop ecosystem essentials [slides ] 
Hadoop key blocks: HDFS, YARN, MapReduce, Tez, Pig and Hive [slides ] 
Logical Architecture of a Big Data System [slides ] 
Hands-on Hortonworks Data Platform [my blog post ] 
More on Technologies:
 
Benchmarks: Tez Improvements [slides ] and Impact of File formats  
Success Stories for Pig at Twitter  and Hive at Facebook  
Conclusions: many distributions, even more components, learn them and use the right combination for your use case [slides ] 
 
Spark as a unifying platform for Data Engineering, Data Analysis and Data Science 
How to implement simple Spark use cases with the core APIs on Wikipedia logs 
How to build data pipelines and query large data sets with Spark SQL and DataFrames on English Wikipedia logs 
Learn about the internals of the Catalyst Optimizer and Tungsten 
Understand how Spark structured streaming can analyse in real-time Wikipedia edits 
Understand how GraphFrames can analyse Wikipedia graph (users that edit pages that link other pages) 
MATERIAL: notebooks  for Databricks  
 
SparkML: assemble processing, model-building, and evaluation pipelines 
How to build a simple sentiment mining solution using Logistic Regression and Amazon reviews 
How to predict bike rental counts using Decision Trees 
Learn how to perform hyperparameter tuning using Random Forests to improve the prediction of bike rental counts 
Check if you understood by working with Gradient Boosted Decision Trees 
MATERIAL: notebooks for Databricks and slides by Brooke Wenig 
 
Container Basics 
Docker Images
Architecture 
Dockerfile 
Docker Hub 
 
 
Docker Services
Compose (single machine) 
Swarm Mode (overview) 
 
 
MATERIAL:
 
 
Introduction to Deep Learning: how, why and when it works 
MNIST Digits Dataset: the dataset for the hands-on 
Keras: High-Level API for Neural Networks and Deep Learning 
Hands-on Neural Network with Keras building a “Dense Feed-Forward Shallow” Network 
Understanding Training using Gradient Descent and Back Propagation 
Introducing non-linearity: from Sigmoid to ReLU 
Convolutional Neural Networks: from intuition to a working network for MNIST 
Recurrent Neural Networks: Networks for Understanding Time-Oriented Patterns in Data 
Transfer Learning 
MATERIAL:
 
 
The datasets available: people flows from counting sensors, people presence and demographics from mobile telecom data, free parking, weather, and social media 
Learning to formulate an Urban Data Science problem 
Setting an Urban Data Science problem 
Using methods and techniques learnt in previous days to solve the set problem 
Presentation of the solutions 
MATERIAL: introduction 
 
From SQL to noSQL and back to newSQL 
[06-09-2018] 
[18-09-2018 (am)] 
document stores
MongoDB [slides ] 
hands-on MongoDB [notes] 
 
 
scalable column stores such as Google Bigtable, HBase and Cassandra [slides ] 
graph database [slides ]
Neo4j and Cypher [slides ] 
hands-on Neo4j and Cypher using the Neo4j Web interface [notes] 
RDF and SPARQL [slides] 
hands-on SPARQL using the DBpedia public SPARQL endpoint [notes] 
 
 
 
[19-09-2018 (pm)] 
things left behind from the original plan
back to SQL with newSQL (e.g., VoltDB): it’s always a matter of trade-off 
Conclusion: how to choose and what to avoid doing [slides ] 
 
 
alternatively I can teach more deep learning