Big Data Analytics and Data Science [Q2, 2018]

Contents

Introductory Lectures

Big Data Analytics [18-05-2018]

  • Introduction [slide]
    • What’s Big Data? Volume, Velocity, Variety without forsaking Veracity
    • Why? It’s all about creating value
    • How? Paradigmatic shifts enabled by Big Data
  • Netflix and Big Data, a 20-year-long success story [slide]
  • Vertical vs. Horizontal Scalability [slide]
  • Horizontal Scalability: Business Intelligence at Macy’s [slide]

Taming Data Volume [25-05-2018]

  • Scaling Storage Horizontally with NoSQL [slide]
  • Scaling Processing Horizontally with Hadoop [slide]
  • Hiding the details with declarative abstractions (Pig and Hive) [slide]
  • Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide]
  • Big Data Use Cases [slide]
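
The map, shuffle and reduce steps behind horizontally scaled processing in Hadoop can be sketched on a single machine. This is a minimal illustration in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for the example, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases sequentially over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big value", "data beats opinion"]
    print(word_count(sample))
```

Because each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, the work can be split across machines, which is the property that makes the model scale horizontally.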

Taming Data Velocity [01-06-2018]

Data Science [06-06-2018]

  • Introduction [slide]
    • Why? Get rid of the HiPPO and embrace data-driven decision making!
    • What? Big Data is crude oil, Data Science is refining it!
    • Who? Hacking skills + Math & statistics knowledge + Substantive Expertise
    • How? Statistics + Machine Learning + Visualizations in an agile way
    • Where? Public safety, Swisscom, ENI
  • Machine Learning overview [by Brooke Wenig] [slides]
    • What? Supervised vs. unsupervised ML
    • All models are wrong, but some are useful
    • K-means as an example of unsupervised ML [slides]
    • Examples of supervised ML methods
      • Logistic Regression for Classification of text as positive vs. negative
      • Decision Trees for Regression problems (e.g. predicting bike sharing usage)
      • Random Forests to train a model with low bias and low variance
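
As a small illustration of the unsupervised example listed above, K-means can be sketched in a few lines of plain Python. This is a simplified version of Lloyd's algorithm under stated assumptions, not the implementation used in the lecture notebooks: the function name `kmeans`, the choice of the first `k` points as initial centroids, and the sample points are all made up for the example (libraries typically use smarter seeding such as k-means++).

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster n-dimensional points with Lloyd's algorithm.

    Initial centroids are the first k points (an illustrative simplification).
    Returns the final centroids and the cluster index assigned to each point.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
    centres, labels = kmeans(data, k=2)
    print(centres, labels)
```

No labels are supplied to the algorithm; the grouping emerges only from the distances between points, which is what makes it unsupervised.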

Taming Data Variety [26-06-2018]

  • Introduction
    • The interoperability problem
    • The standardisation dilemma
    • Variety cannot be avoided
    • Embrace variety-proof technologies with smart data & smart machines
  • The long wave of smart data technology adoption
    • The early days of the Semantic Web
    • The “happy” days of Linked Open Data
    • The success story of schema.org & Google Knowledge Graph
  • The disruptions of smart machines
    • From dumb machines to smart machines
    • Deep Learning and its applications
  • Long live Cognitive Computing!
    • How smart data and smart machines made Watson win Jeopardy!
    • Success stories in Cognitive Computing

Technical Lectures

A deep dive into Hadoop [14-06-2018]

Apache Spark overview: analysing Wikipedia logs [28-06-2018]

  • Spark as a unifying platform for Data Engineering, Data Analysis and Data Science
  • How to implement simple use cases with the Spark core APIs on Wikipedia logs
  • How to build data pipelines and query large data sets with Spark SQL and DataFrames on English Wikipedia logs
  • Learn about the internals of the Catalyst Optimizer and Tungsten
  • Understand how Spark Structured Streaming can analyse Wikipedia edits in real time
  • Understand how GraphFrames can analyse the Wikipedia graph (users that edit pages that link to other pages)
  • MATERIAL: notebooks for Databricks

Data science with Apache Spark [05-07-2018]

  • SparkML: assemble processing, model-building, and evaluation pipelines
  • How to build a simple sentiment mining solution using Logistic Regression and Amazon reviews
  • How to predict bike rental counts using Decision Trees
  • Learn how to perform hyperparameter tuning using Random Forests to improve the prediction of bike rental counts
  • Check if you understood by working with Gradient Boosted Decision Trees
  • MATERIAL: notebooks for Databricks and slides by Brooke Wenig

Docker [11-07-2018 morning]

Deep Learning Theory and Practice with Keras and TensorFlow [11-07-2018 afternoon]

  • Introduction to Deep Learning: how, why and when it works
  • MNIST Digits Dataset: the dataset for the hands-on
  • Keras: High-Level API for Neural Networks and Deep Learning
  • Hands-on Neural Network with Keras building a “Dense Feed-Forward Shallow” Network
  • Understanding Training using Gradient Descent and Back Propagation
  • Introducing non-linearity: from Sigmoid to ReLU
  • Convolutional Neural Networks: from intuition to a working network for MNIST
  • Recurrent Neural Networks: Networks for Understanding Time-Oriented Patterns in Data
  • Transfer Learning
  • MATERIAL:
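
The following is an illustrative sketch rather than the course material. It shows the training idea named in the outline above (gradient descent on a unit with a sigmoid non-linearity) for a single neuron in plain Python, with a ReLU function included only for comparison. The names `train_neuron` and `predict`, the learning rate, the epoch count and the logical-OR data are assumptions chosen for the example; the hands-on sessions use Keras, which is not used here.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)


def predict(weights, bias, inputs):
    """Forward pass of one sigmoid neuron."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, learning_rate=0.5, epochs=2000):
    """Fit one sigmoid neuron with batch gradient descent on cross-entropy loss."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for inputs, target in samples:
            # For sigmoid + cross-entropy, dLoss/dz is (prediction - target).
            error = predict(weights, bias, inputs) - target
            for i, x in enumerate(inputs):
                grad_w[i] += error * x
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


if __name__ == "__main__":
    # Logical OR: linearly separable, so a single neuron can learn it.
    or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
    w, b = train_neuron(or_data)
    for inputs, target in or_data:
        print(inputs, target, round(predict(w, b, inputs), 3))
```

Back propagation generalises the `error` computation here to networks with several layers by applying the chain rule layer by layer.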

Urban Data Science Hackathon [12-07-2018]

  • The datasets available: people flows from counting sensors, people presence and demographics from mobile telecom data, free parking, weather, and social media
  • Learning to formulate an Urban Data Science problem
  • Setting an Urban Data Science problem
  • Using methods and techniques learnt in previous days to solve the set problem
  • Presentation of the solutions
  • MATERIAL: introduction

From SQL to NoSQL and back to NewSQL

[06-09-2018]

[18-09-2018 (am)]

  • document stores
    • MongoDB [slides]
    • hands-on MongoDB [notes]
  • scalable column stores such as Google Bigtable, HBase and Cassandra [slides]
  • graph databases [slides]
    • Neo4j and Cypher [slides]
    • hands-on Neo4j and Cypher using the Neo4j Web interface [notes]
    • RDF and SPARQL [slides]
    • hands-on SPARQL using the DBpedia public SPARQL endpoint [notes]

TENTATIVE [19-09-2018 (pm)]

  • things left behind from the original plan
    • back to SQL with NewSQL (e.g., VoltDB): it’s always a matter of trade-offs
    • Conclusion: how to choose and what to avoid doing [slides]
  • alternatively, I can teach more deep learning