Big Data Analytics and Data Science [Q3, 2018]

Contents

Introductory Lectures

Big Data Analytics [05-09-2018 (am)]

  • Introduction [slide]
    • Data-driven Decision Making for Data-driven Organizations
    • What are Big Data, Data Science and AI?
    • How do Big Data, Data Science and AI support Data-driven Decision Making?
    • Who is using Big Data, Data Science and AI?
  • Netflix and Big Data, a 20-year success story [slide]

Taming Data Volume [05-09-2018 (pm)]

  • Vertical vs. Horizontal Scalability [slide]
  • Scaling Storage Horizontally with NoSQL [slide]
  • Scaling Processing Horizontally with Hadoop [slide]
  • Hiding the details with declarative abstractions (Pig and Hive) [slide]
  • Spark: Efficient (100x faster) and Usable (5x less code) Big Data [slide]
  • Big Data Use Cases [slide]
  • Horizontally Scalable Business Intelligence at Macy’s [slide]

Taming Data Velocity [10-09-2018 (am)]

  • Introduction [slide]
    • How did we get to technologies able to tame velocity?
    • The database perspective: from RDBMS to DSMS
    • The event processing perspective: from Publish-Subscribe middleware to Complex Event Processing
    • The IT architectural perspective: from Service-Oriented Architecture to Event-driven Architecture
    • Is this only hype?
  • The journey to an event-driven enterprise [slides][link to Kafka Summit 2018 keynote]
    • 1st – streaming awareness and pilot
    • 2nd – early production
    • 3rd – mission critical production
    • 4th – global streaming
    • 5th – central nervous system
  • Introduction to Kafka [slide]
    • messaging: topics (regular and compacted), partitions and replicas
    • stream processing: streams, tables and KSQL
  • Kafka Demo [slides][link]
  • An experience in using Kafka at small scale [slides by Matteo Ferroni]
  • Interesting links to check out
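
As a rough illustration of the regular vs. compacted topics mentioned in the Kafka introduction above: a regular topic retains every record (until retention expires), while a compacted topic eventually keeps only the most recent record for each key. The sketch below simulates that idea in plain Python. It is not Kafka code: the function name `compact` and the sample records are hypothetical, and real Kafka compaction runs in the background and also handles details (such as tombstone records) that are ignored here.

```python
def compact(log):
    """Simulate log compaction: keep only the latest record for each key.

    `log` is a list of (key, value) pairs; the list index plays the role
    of the record offset. Surviving records keep their original offsets.
    """
    # First pass: remember the highest offset seen for every key.
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset

    # Second pass: keep a record only if it is the latest one for its key.
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


# Hypothetical sample: two updates for user "u1", one for user "u2".
records = [("u1", "Milan"), ("u2", "Rome"), ("u1", "Turin")]
compacted = compact(records)
```

After compaction only the last value for "u1" survives, which is why compacted topics are a natural fit for the "table" view of a stream.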

Taming Data Variety [10-09-2018 (pm)]

  • Introduction [slides]
    • The interoperability problem
      • The standardisation dilemma
      • Variety cannot be avoided
    • Embrace variety-proof technologies with smart data & smart machines
    • The long wave of smart data technology adoption
      • The early days of the Semantic Web
      • The “happy” days of Linked Open Data
      • The success story of schema.org & Google Knowledge Graph
  • An overview of existing knowledge graphs [slides]
  • Demo of Ontop [slides]
    • ingest a stream
    • ingest a table
    • enrich a stream with a table
    • continuously analyse the enriched stream
    • connect Elasticsearch and Kibana to visualize the analyses in real time
  • My own research: Stream Reasoning [slides]
  • Interesting links to check out:

Data Science [19-09-2018 (am)]

  • Introduction [slide]
    • Why? Get rid of the HiPPO and embrace data-driven decision making!
    • What? Big Data is crude oil, Data Science is refining it!
    • Who? Hacking skills + Math & statistics knowledge + Substantive expertise
    • How? Statistics + Machine Learning + Visualizations in an agile way
    • Where? Public safety, Swisscom, ENI
  • The main Machine Learning algorithms [white board]
    • Predicting house prices using Linear Regression and Gradient Descent
    • Detecting spam emails using Naive Bayes Algorithm
    • Recommending Apps based on Decision Trees [animation]
    • Finding the best location for a shop based on K-means clustering or Hierarchical Clustering
    • Deciding to accept students at a university based on Logistic Regression and Gradient Descent with Log-loss function
    • When a line is not enough … or the kernel trick of Support Vector Machines
  • Advanced Analytics with Deep Learning
    • Hands-on with the Linear Perceptron and the linear classification problems it can solve [screenshot]
    • Hands-on with the Linear Perceptron and the simple non-linear classification problems it can solve [screenshot]
    • Hands-on with Deep Learning and the complex non-linear classification problems it can solve using sigmoid [screenshot], ReLU [screenshot] and deep NN [screenshot]
    • Demo: how Keras + TensorFlow can correctly classify the MNIST Digits Dataset
    • Demo: how Keras + TensorFlow can understand images using the Inception V3 model
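
As a minimal sketch of the first whiteboard example listed above (predicting house prices using Linear Regression and Gradient Descent), the snippet below fits a line y ≈ w*x + b by batch gradient descent on the mean squared error. It is plain Python with no libraries. The function name `fit_line`, the learning rate, the number of epochs and the toy data are all assumptions made for illustration, not material from the lecture.

```python
def fit_line(xs, ys, lr=0.1, epochs=5000):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals of the current model on every training point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: house size (in units of 100 m^2) and a price
# generated exactly from price = 2 * size + 1, so the fit is checkable.
sizes = [0.5, 0.8, 1.0, 1.2, 1.5]
prices = [2.0 * s + 1.0 for s in sizes]

w, b = fit_line(sizes, prices)
predicted = w * 1.1 + b  # predicted price for a house of size 1.1
```

On this noise-free data the fitted slope and intercept approach 2 and 1. With real data the features usually need scaling and the learning rate needs tuning for the descent to converge.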

Technical Lectures

A deep dive into Hadoop [11-10-2018]

Apache Spark overview analysing Wikipedia logs [18-10-2018]

  • Spark as a unifying platform for Data Engineering, Data Analysis and Data Science
  • How to implement simple Spark use cases with the core APIs on Wikipedia logs
  • How to build data pipelines and query large data sets with Spark SQL and DataFrames on English Wikipedia logs
  • Learn about the internals of the Catalyst Optimizer and Tungsten
  • Understand how Spark Structured Streaming can analyse Wikipedia edits in real time
  • MATERIAL: temporary link to the notebooks for Databricks + this other file

Data science with Apache Spark [23-10-2018 (am) + 25-10-2018 (am)]

Deep Learning Theory and Practice with Keras and TensorFlow [6-11-2018]

Urban Data Science Hackathon [14-11-2018]

  • The datasets available: people flows from counting sensors, people presence and demographics from mobile telecom data, free parking, weather, and social media
  • Learning to formulate an Urban Data Science problem
  • Setting an Urban Data Science problem
  • Using methods and techniques learnt in previous days to solve the set problem
  • Presentation of the solutions
  • MATERIAL: introduction & link to Gitter channel

Docker [22-11-2018 (am)]

From SQL to noSQL and back to newSQL [22-11-2018 (pm) + 27-11-2018 (am) + 29-11-2018 (am)]

  • Brainstorm on what an RDBMS is and which requirements it satisfies
  • Theoretical limits on Consistency, Availability and Partition tolerance
  • The noSQL landscape
  • key-value stores such as memcached and Redis
  • hands-on memcached textual protocol using telnet
  • hands-on Redis using redis-cli
  • scalable column stores such as Google BigTable, HBase and Cassandra
  • hands-on Cassandra using CQL-CLI
  • document stores such as MongoDB and Elasticsearch
  • hands-on MongoDB using mongo-cli
  • graph databases such as Neo4j
  • hands-on Neo4j and Cypher using the Neo4j Web interface
  • back to SQL with newSQL (e.g., VoltDB): it’s always a matter of trade-offs
  • Conclusion: how to choose and what to avoid doing
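
To give a flavour of the memcached textual protocol hands-on listed above, the sketch below builds the raw bytes one would type into a telnet session for a `set` and a `get`. In the text protocol a storage command has the form `set <key> <flags> <exptime> <bytes>` followed by the data block, with each line terminated by CRLF. The helper names `memcached_set` and `memcached_get` are hypothetical, and the code only formats requests; it does not open a connection.

```python
def memcached_set(key, value, flags=0, exptime=0):
    """Build a memcached text-protocol 'set' request as raw bytes."""
    data = value.encode("utf-8")
    # Header line: command, key, flags, expiry in seconds, payload length.
    header = f"set {key} {flags} {exptime} {len(data)}\r\n".encode("ascii")
    # The data block follows on its own line, also terminated by CRLF.
    return header + data + b"\r\n"


def memcached_get(key):
    """Build a memcached text-protocol 'get' request as raw bytes."""
    return f"get {key}\r\n".encode("ascii")


set_request = memcached_set("greeting", "hello")
get_request = memcached_get("greeting")
```

Sent to a running server, the first request is normally answered with `STORED` and the second with a `VALUE greeting 0 5` line, the data, and `END`.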