Big Data Analytics and Data Science [Q3, 2018]

Contents

Introductory Lectures

Big Data Analytics [05-09-2018 (am)]

  • Introduction [slide]
    • Data-driven Decision Making for Data-driven Organizations
    • What are Big Data, Data Science and AI?
    • How do Big Data, Data Science and AI support Data-driven Decision Making?
    • Who is using Big Data, Data Science and AI?
  • Netflix and Big Data, a 20-year-long success story [slide]

Taming Data Volume [05-09-2018 (pm)]

  • Vertical vs. Horizontal Scalability [slide]
  • Scaling Storing Horizontally with NoSQL [slide]
  • Scaling Processing Horizontally with Hadoop [slide]
  • Hiding the details with declarative abstractions (Pig and Hive) [slide]
  • Spark: Efficient (up to 100x faster) and Usable (up to 5x less code) Big Data [slide] (see the sketch after this list)
  • Big Data Use Cases [slide]
  • Horizontally Scalable Business Intelligence at Macy’s [slide]
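
A minimal, self-contained illustration of the “usable at scale” point: the PySpark sketch below counts words in a plain-text file with the DataFrame API. It is not part of the course material; the input path is a placeholder, and the same code runs unchanged on a laptop or on a cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, lower, split

    # Start (or reuse) a Spark session; on a cluster the same code scales out unchanged.
    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

    # Placeholder path: any line-oriented text file works.
    lines = spark.read.text("data/sample.txt")

    word_counts = (lines
        .select(explode(split(lower(col("value")), r"\s+")).alias("word"))
        .where(col("word") != "")
        .groupBy("word")
        .count()
        .orderBy(col("count").desc()))

    word_counts.show(10)

The equivalent hand-written MapReduce job takes dozens of lines of Java, which is roughly the gap the “5x less code” claim points at.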

Taming Data Velocity [10-09-2018 (am)]

  • Introduction [slide]
    • How did we get to technologies able to tame velocity?
    • The database perspective: from RDBMS to DSMS
    • The event processing perspective: from Publish-Subscribe middleware to Complex Event Processing
    • The IT architectural perspective: from Service-Oriented Architecture to Event-driven Architecture
    • Is this only hype?
  • The journey to an event-driven enterprise [slides][link to Kafka Summit 2018 keynote]
    • 1st – streaming awareness and pilot
    • 2nd – early production
    • 3rd – mission critical production
    • 4th – global streaming
    • 5th – central nervous system
  • Introduction to Kafka [slide]
    • messaging: topics (regular and compacted), partitions and replicas (see the producer/consumer sketch after this list)
    • stream processing: stream, tables and KSQL
  • Kafka Demo [slides][link]
  • An experience in using Kafka at small scale [slides by Matteo Ferroni]
  • Interesting links to check out
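
To make the messaging vocabulary above concrete, here is a minimal producer/consumer sketch. It assumes a local broker on localhost:9092 and uses the kafka-python client, which is not prescribed by the lecture (the demo relies on Kafka’s own tooling and KSQL); the topic name and payloads are made up.

    from kafka import KafkaProducer, KafkaConsumer

    # Produce a few events to a topic; records with the same key land in the same partition.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for page in ("/home", "/cart", "/checkout"):
        producer.send("clicks", key=b"user-42", value=page.encode("utf-8"))
    producer.flush()

    # Consume them back, starting from the earliest available offset.
    consumer = KafkaConsumer("clicks",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for record in consumer:
        print(record.partition, record.offset, record.key, record.value)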

Taming Data Variety [10-09-2018 (pm)]

  • Introduction [slides]
    • The interoperability problem
      • The standardisation dilemma
      • Variety cannot be avoided
    • Embrace variety-proof technologies with smart data & smart machines
    • The long wave of smart data technology adoption
      • The early days of the Semantic Web
      • The “happy” days of Linked Open Data
      • The success story of schema.org & Google Knowledge Graph
  • An overview of existing knowledge graphs [slides] (see the sketch after this list)
  • Demo of Ontop [slides]
    • ingest a stream
    • ingest a table
    • enrich a stream with a table
    • continuously analyse the enriched stream
    • connect Elasticsearch and Kibana to visualize the analyses in real time
  • My own research: Stream Reasoning [slides]
  • Interesting links to check out:
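
As a tiny illustration of the “smart data” idea behind schema.org and knowledge graphs, the sketch below builds a two-entity graph with rdflib and queries it with SPARQL. The library choice, URIs and triples are mine, not the course’s.

    from rdflib import Graph

    # A few illustrative triples in Turtle, using schema.org terms.
    ttl = """
    @prefix schema: <http://schema.org/> .
    <http://example.org/course/bigdata>
        a schema:Course ;
        schema:name "Big Data Analytics and Data Science" ;
        schema:provider <http://example.org/org/university> .
    <http://example.org/org/university>
        a schema:Organization ;
        schema:name "Example University" .
    """

    g = Graph()
    g.parse(data=ttl, format="turtle")

    # SPARQL: which organization provides which course?
    q = """
    PREFIX schema: <http://schema.org/>
    SELECT ?course ?org WHERE {
        ?c a schema:Course ; schema:name ?course ; schema:provider ?p .
        ?p schema:name ?org .
    }
    """
    for row in g.query(q):
        print(row.course, "-", row.org)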

Data Science [19-09-2018 (am)]

  • Introduction [slide]
    • Why? Get rid of the HiPPO and embrace data-driven decision making!
    • What? Big Data is crude oil, Data Science is refining it!
    • Who? Hacking skills + Math & statistics knowledge + Substantive Experience
    • How? Statistics + Machine Learning + Visualizations in an agile way
    • Where? Public safety, Swisscom, ENI
  • Machine Learning overview [by Brooke Wenig] [slides]
    • What? Supervised vs. unsupervised ML
    • All models are wrong, but some are useful
    • K-means as an example of unsupervised ML [slides] (see the sketch after this list)
    • Examples of supervised ML methods
      • Logistic Regression for Classification of text as positive vs. negative
      • Decision Trees for Regression problems (e.g. predicting bike sharing usage)
      • Random Forests to train models with low bias and low variance
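
A minimal K-means sketch to accompany the unsupervised ML example; it uses scikit-learn on synthetic data purely for brevity (the hands-on lectures later use Databricks/Spark), so the data and parameters are illustrative only.

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic 2-D points drawn around three different centres.
    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
        rng.normal(loc=(5.0, 5.0), scale=0.5, size=(100, 2)),
        rng.normal(loc=(0.0, 5.0), scale=0.5, size=(100, 2)),
    ])

    # Fit K-means with k=3; each point is assigned to the nearest centroid.
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.cluster_centers_)   # learned centroids
    print(km.labels_[:10])       # cluster assignment of the first 10 points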

Technical Lectures

A deep dive into Hadoop [11-10-2018]

Apache Spark overview analysing Wikipedia logs [18-10-2018]

  • Spark as a unifying platform for Data Engineering, Data Analysis and Data Science
  • How to implement simple use cases with Spark’s core APIs on the Wikipedia logs
  • How to build data pipelines and query large datasets with Spark SQL and DataFrames on the English Wikipedia logs (see the sketch after this list)
  • Learn about the internals of the Catalyst optimizer and Tungsten
  • Understand how Spark Structured Streaming can analyse Wikipedia edits in real time
  • Understand how GraphFrames can analyse the Wikipedia graph (users that edit pages that link to other pages)
  • MATERIAL: notebooks for Databricks
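
The actual Databricks notebooks define their own datasets; as a stand-in, the sketch below assumes a Parquet file of Wikipedia pageview counts with columns project, page and requests (an assumption, not the notebooks’ schema) and answers the same question twice, once with the DataFrame API and once with Spark SQL.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, desc

    spark = SparkSession.builder.appName("wikipedia-logs-sketch").getOrCreate()

    # Assumed layout: columns `project`, `page`, `requests`.
    pageviews = spark.read.parquet("data/wikipedia_pageviews.parquet")

    # DataFrame API: top 10 most requested English Wikipedia pages.
    (pageviews
        .filter(col("project") == "en")
        .groupBy("page")
        .sum("requests")
        .withColumnRenamed("sum(requests)", "total_requests")
        .orderBy(desc("total_requests"))
        .show(10, truncate=False))

    # The same query through Spark SQL; both paths go through the Catalyst optimizer.
    pageviews.createOrReplaceTempView("pageviews")
    spark.sql("""
        SELECT page, SUM(requests) AS total_requests
        FROM pageviews
        WHERE project = 'en'
        GROUP BY page
        ORDER BY total_requests DESC
        LIMIT 10
    """).show(truncate=False)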

Data science with Apache Spark [23-10-2018 (am) + 25-10-2018 (am)]

  • SparkML: assemble processing, model-building, and evaluation pipelines (see the sketch after this list)
  • How to build a simple sentiment mining solution using Logistic Regression and Amazon reviews
  • How to predict bike rental counts using Decision Trees
  • Learn how to perform hyperparameter tuning using Random Forests to improve the prediction of bike rental counts
  • Check if you understood by working with Gradient Boosted Decision Trees
  • MATERIAL: notebooks for Databricks and slides by Brooke Wenig
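
A minimal sketch of the SparkML pipeline idea applied to sentiment classification with Logistic Regression. The toy reviews stand in for the Amazon data used in class, and the feature choices (Tokenizer + HashingTF) are mine; the hyperparameter-tuning bullet builds on the same Pipeline object via ParamGridBuilder and CrossValidator.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("sentiment-sketch").getOrCreate()

    # Toy stand-in for the Amazon reviews: text plus a binary label (1.0 = positive).
    reviews = spark.createDataFrame([
        ("great product, works perfectly", 1.0),
        ("terrible quality, broke after a day", 0.0),
        ("absolutely love it", 1.0),
        ("waste of money", 0.0),
    ], ["text", "label"])

    # Pipeline = feature extraction + model, fitted and applied as a single unit.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1024)
    lr = LogisticRegression(maxIter=20)
    pipeline = Pipeline(stages=[tokenizer, tf, lr])

    model = pipeline.fit(reviews)
    model.transform(reviews).select("text", "probability", "prediction").show(truncate=False)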

Deep Learning Theory and Practice with Keras and TensorFlow [6-11-2018]

  • Introduction to Deep Learning: how, why and when it works
  • MNIST Digits Dataset: the dataset for the hands-on
  • Keras: High-Level API for Neural Networks and Deep Learning
  • Hands-on Neural Network with Keras building a “Dense Feed-Forward Shallow” Network (see the sketch after this list)
  • Understanding Training using Gradient Descent and Backpropagation
  • Introducing non-linearity: from Sigmoid to ReLU
  • Convolutional Neural Networks: from intuition to a working network for MNIST
  • Recurrent Neural Networks: Networks for Understanding Time-Oriented Patterns in Data
  • Transfer Learning
  • MATERIAL:
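
A minimal sketch of the “Dense Feed-Forward Shallow” network from the hands-on, written against the Keras API bundled with TensorFlow; the layer sizes and number of epochs are illustrative choices, not the lecture’s.

    from tensorflow import keras

    # MNIST: 60,000 training and 10,000 test images of handwritten digits (28x28, grayscale).
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

    # Flatten -> one hidden ReLU layer -> softmax over the 10 digit classes.
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # Training = gradient descent (here Adam) minimising cross-entropy via backpropagation.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(x_train, y_train, epochs=5, validation_split=0.1)
    print(model.evaluate(x_test, y_test))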

Urban Data Science Hackathon [14-11-2018]

  • The datasets available: people flows from counting sensors, people presence and demographics from mobile telecom data, free parking, weather, and social media
  • Learning to formulate an Urban Data Science problem
  • Setting an Urban Data Science problem
  • Using methods and techniques learnt in previous days to solve the set problem
  • Presentation of the solutions
  • MATERIAL: introduction

Docker

From SQL to NoSQL and back to NewSQL

  • Brainstorm on what an RDBMS is and which requirements it satisfies
  • Theoretical limits on Consistency, Availability and Partition tolerance
  • The NoSQL landscape
    • key-value stores such as memcached and Redis
      • hands-on the memcached textual protocol using telnet
      • hands-on Redis using redis-cli (see the sketch after this list)
    • scalable column stores such as Google BigTable, HBase and Cassandra
      • hands-on Cassandra using CQL-CLI
    • document stores such as MongoDB and Elasticsearch
      • hands-on MongoDB using mongo-cli
    • graph databases such as Neo4j
      • hands-on Neo4j and Cypher using the Neo4j Web interface
  • Back to SQL with NewSQL (e.g., VoltDB): it’s always a matter of trade-offs
  • Conclusion: how to choose and what to avoid doing
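
The hands-ons above use the command-line clients; as a complementary sketch, here is the same key-value idea from Python with the redis client, assuming a Redis server on the default local port (the library choice and key names are mine).

    import redis

    # Connect to a local Redis instance (default host/port assumed).
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Plain key-value operations, the same ones exercised with redis-cli.
    r.set("session:42", "alice")
    print(r.get("session:42"))                            # -> alice

    # A key with a time-to-live: the typical caching pattern (memcached's core use case).
    r.set("cache:homepage", "<html>...</html>", ex=30)    # expires after 30 seconds
    print(r.ttl("cache:homepage"))

    # An atomically incremented counter.
    r.incr("pageviews")
    print(r.get("pageviews"))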