Contents
- Introductory Lectures
- Technical Lectures
Introductory Lectures
Big Data Analytics [05-09-2018 (am)]
- Introduction [slide]
- Data-driven Decision Making for Data-driven Organizations
- What are Big Data, Data Science and AI?
- How do Big Data, Data Science and AI support Data-driven Decision Making?
- Who is using Big Data, Data Science and AI?
- Netflix and Big Data, a 20-year-long success story [slide]
Taming Data Volume [05-09-2018 (pm)]
- Vertical vs. Horizontal Scalability [slide]
- Scaling Storage Horizontally with NoSQL [slide]
- Scaling Processing Horizontally with Hadoop [slide]
- Hiding the details with declarative abstractions (Pig and Hive) [slide]
- Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide]
- Big Data Use Cases [slide]
- Horizontally Scalable Business Intelligence at Macy’s [slide]
Taming Data Velocity [10-09-2018 (am)]
- Introduction [slide]
- How did we get to technologies able to tame velocity?
- The database perspective: from RDBMS to DSMS
- The event processing perspective: from Publish-Subscribe middleware to Complex Event Processing
- The IT architectural perspective: from Service-Oriented Architecture to Event-driven Architecture
- Is this only hype?
- The journey to an event-driven enterprise [slides][link to Kafka Summit 2018 keynote]
- 1st – streaming awareness and pilot
- 2nd – early production
- 3rd – mission critical production
- 4th – global streaming
- 5th – central nervous system
- Introduction to Kafka [slide]
- messaging: topics (regular and compacted), partitions and replicas
- stream processing: streams, tables and KSQL
- Kafka Demo [slides][link]
- An experience in using Kafka at small scale [slides by Matteo Ferroni]
- Interesting links to check out
Taming Data Variety [10-09-2018 (pm)]
- Introduction [slides]
- The interoperability problem
- The standardisation dilemma
- Variety cannot be avoided
- Embrace variety-proof technologies with smart data & smart machines
- The long wave of smart data technologies adoption
- The early days of the Semantic Web
- The “happy” days of Linked Open Data
- The success story of schema.org & Google Knowledge Graph
- The interoperability problem
- An overview of existing knowledge graphs [slides]
- Demo of Ontop [slides]
- ingest a stream
- ingest a table
- enrich a stream with a table
- continuously analyse the enriched stream
- connect Elasticsearch and Kibana to visualize the analyses in real time
- My own research: Stream Reasoning [slides]
- Interesting links to check out:
- Companies in this area: Capsenta, Spazio Dati, TopQuadrant, Cambridge Semantics, AtScale
Data Science [19-09-2018 (am)]
- Introduction [slide]
- Why? Get rid of the HiPPO and embrace data-driven decision making!
- What? Big Data is crude oil, Data Science is refining it!
- Who? Hacking skills + Math & statistics knowledge + Substantive Expertise
- How? Statistics + Machine Learning + Visualizations in an agile way
- Where? Public safety, Swisscom, ENI
- The main Machine Learning algorithms [white board]
- Predicting house prices using Linear Regression and Gradient Descent
- Detecting spam emails using Naive Bayes Algorithm
- Recommending Apps based on Decision Trees [animation]
- Finding the best location for a shop based on K-means clustering or Hierarchical Clustering
- Deciding to accept students at a university based on Logistic Regression and Gradient Descent with Log-loss function
- When a line is not enough … or the kernel trick of Support Vector Machines
- Advanced Analytics with Deep Learning
- Hands-on with the Linear Perceptron and the linear classification problems it can solve [screenshot]
- Hands-on with the Linear Perceptron and the simple non-linear classification problems it can solve [screenshot]
- Hands-on with Deep Learning and the complex non-linear classification problems it can solve using sigmoid [screenshot], ReLU [screenshot] and deep NN [screenshot]
- Demo: how Keras + TensorFlow can correctly classify the MNIST Digits Dataset
- Demo: how Keras + TensorFlow can understand images using the Inception V3 model
Technical Lectures
A deep dive into Hadoop [11-10-2018]
- Hadoop ecosystem essentials [slides]
- Hadoop key blocks: HDFS, YARN, MapReduce, Tez, Pig and Hive [slides]
- Logical Architecture of a Big Data System [slides]
- Hands-on Hortonworks Data Platform [my blog post]
- More on Technologies:
- Benchmarks: Tez Improvements [slides] and Impact of File formats
- Success Stories for Pig at Twitter and Hive at Facebook
- Conclusions: many distributions, even more components, learn them and use the right combination for your use case [slides]
Apache Spark overview analysing Wikipedia logs [18-10-2018]
- Spark as a unifying platform for Data Engineering, Data Analysis and Data Science
- How to implement simple Spark use cases with the core APIs on Wikipedia logs
- How to build data pipelines and query large data sets with Spark SQL and DataFrames on English Wikipedia logs
- Learn about the internals of the Catalyst Optimizer and Tungsten
- Understand how Spark Structured Streaming can analyse Wikipedia edits in real time
- MATERIAL: temporary link to the notebooks for Databricks + this other file
Data science with Apache Spark [23-10-2018 (am) + 25-10-2018 (am)]
- SparkML: assemble processing, model-building, and evaluation pipelines
- How to build a simple sentiment mining solution using Logistic Regression and Amazon reviews [white board]
- How to predict bike rental counts using Decision Trees
- Learn how to perform hyperparameter tuning using Random Forests to improve the prediction of bike rental counts
- Check if you understood by working with Gradient Boosted Decision Trees
- Demo of K-means|| to cluster the Iris flower data set [notebook]
- Demo of how to build a movie recommendation engine using the MovieLens dataset and Alternating Least Squares (ALS) [notebook][movie of Mona Lisa and more scientific insights]
- MATERIAL: notebooks for Databricks and slides by Brooke Wenig
Deep Learning Theory and Practice with Keras and TensorFlow [06-11-2018]
- Demystifying Deep Learning by understanding its core as a simple classifier/regressor
- Intro to Neural Networks with Keras: Sequential Models, Compilation, Epochs, Loss Visualization, Activation Functions, Loss functions, Metrics, Optimizers and Batch Size
- Understanding Training using Gradient Descent and Back Propagation
- Introducing non-linearity: from Sigmoid to ReLU
- Hands-on Neural Network with Keras building a “Dense Feed-Forward Shallow” Network to predict the house price on the Boston Housing dataset.
- More on Neural Networks with Keras: Data Normalization (e.g., standard scaler), Custom Metrics, Validation data, callbacks (e.g., checkpointing and early stopping), and Saving Models
- Hands-on Neural Network with Keras optimizing the “Dense Feed-Forward Shallow” Network built in the previous hands-on to predict the house price on the Boston Housing dataset.
- Convolutional Neural Networks and ImageNet: from intuition [whiteboard] to a working network (VGG16 model)
- Transfer Learning: using Deep Learning Pipelines (Inception V3 and Spark DeepImageFeaturizer) and Spark Logistic Regression
- MATERIAL:
- notebooks for Databricks. Make sure you set up the cluster correctly!
- slides [16-48,58-68,88-91] from Brooke Wenig
- playing with Deep Learning in your browser
Urban Data Science Hackathon [14-11-2018]
- The datasets available: people flows from counting sensors, people presence and demographics from mobile telecom data, free parking, weather, and social media
- Learning to formulate an Urban Data Science problem
- Setting an Urban Data Science problem
- Using methods and techniques learnt in the previous days to solve the set problem
- Presentation of the solutions
- MATERIAL: introduction & link to gitter channel
Docker [22-11-2018 (am)]
- Container Basics
- Docker Images
- Architecture
- Dockerfile
- Docker Hub
- Docker Services
- Compose (single machine)
- Swarm Mode (brief overview)
- MATERIAL:
- we used a subset [pdf] of the Container Training.
- if you cannot install Docker, you can try Docker for Beginners (registration is required)
From SQL to noSQL and back to newSQL [22-11-2018 (pm) + 27-11-2018 (am) + 29-11-2018 (am)]
- Gitter channel used during the lecture. It contains all the raw examples.
- Brainstorm on the requirements of a Database Management System (DBMS), how Relational DBMSs have addressed those needs since the ’90s, and whether a different approach was possible in the 2000s and is mandatory in the 2010s [whiteboard]
- A running example we will use across the lecture [Entity Relationship diagram, example data]
- Document stores
- MongoDB [slides]
- MongoDB Terminal Online
- hands-on MongoDB [blog post]
- Graph stores
- Neo4j and Cypher [slides]
- hands-on Neo4j
- Key-value stores
- Redis [slides][cheatsheet]
- Wide column stores
- Cassandra [slides]
- Wrap-up
- Positioning of various noSQL solutions compared to traditional SQL and newSQL [whiteboard]
- How to choose and what to avoid doing [slides]