📚 Course Information

  • Professor: G. Fourny
  • Credits: 10 ECTS
  • Semester: Fall 2025
  • Status: In Progress

Big Data is a portfolio of technologies designed to store, manage, and analyze data that is too large to fit on a single machine, while coping with the growing discrepancy between capacity, throughput, and latency.

Welcome to my journey through ETH's Big Data course! This is where theory meets reality, and where I'm learning that handling petabytes of data requires fundamentally rethinking everything we know about databases and systems.

The Big Picture Problem

Here's the core challenge that blew my mind in the first lecture:

How do we deal with BIG data? Traditionally, a DBMS fits on a single machine. But petabytes of data do not fit on a single machine.

As a consequence, in this course we will have to rebuild the entire technology stack, bottom to top, with the same concepts and insights gained over the past decades, but on a cluster of machines rather than on a single machine.

[Figure: Big Data Technology Stack - the evolution from single-machine to distributed systems]

The Three Vs of Big Data

Everything in big data revolves around three fundamental challenges:

1. Volume 📏

The scale is just insane. We're talking about prefixes that sound like science fiction:

  • Terabytes (10¹²) - This used to be "big"
  • Petabytes (10¹⁵) - Now we're talking
  • Exabytes (10¹⁸) - Google-scale stuff
  • Zettabytes (10²¹) - Global internet traffic
  • Yottabytes (10²⁴) - Theoretical for now... right?

2. Variety 🌈

Data isn't just tables anymore. Modern systems need to handle five different shapes (a tiny sketch of each follows the list):

  1. Tables - The classic relational model
  2. Trees - XML, JSON, Parquet, Avro formats
  3. Graphs - Think Neo4j, social networks
  4. Cubes - Business analytics and OLAP
  5. Vectors - Embeddings for unstructured data (text, images, audio)
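
To make these shapes concrete, here is a tiny, purely illustrative Python sketch (all names and numbers are invented, not from the course) showing how the same small domain looks in each of the five shapes:

```python
# The same small domain (users and their cities), expressed in five shapes.
# Everything here is toy data for illustration only.

# 1. Table: flat rows with a fixed schema
table = [
    {"user_id": 1, "name": "Alice", "city": "Zurich"},
    {"user_id": 2, "name": "Bob",   "city": "Basel"},
]

# 2. Tree: nested documents, as you would serialize them to JSON
tree = {"users": [{"name": "Alice", "address": {"city": "Zurich"}},
                  {"name": "Bob",   "address": {"city": "Basel"}}]}

# 3. Graph: nodes connected by labeled edges
edges = [("Alice", "FOLLOWS", "Bob")]

# 4. Cube: a measure (active users) indexed by dimensions (city x month)
cube = {("Zurich", "2025-09"): 120, ("Basel", "2025-09"): 80}

# 5. Vector: a fixed-length embedding standing in for unstructured data
vector = [0.12, -0.48, 0.33, 0.91]

print(table[0]["city"], tree["users"][0]["address"]["city"],
      edges[0], cube[("Zurich", "2025-09")], len(vector))
```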

3. Velocity ⚡

This is where things get really interesting. There's a growing discrepancy between three factors (a quick back-of-the-envelope calculation follows the list):

  • Capacity - how much data we can store on a single device
  • Throughput - how many bytes we can read per unit of time
  • Latency - how long we wait until the first bytes start arriving
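
To feel how badly capacity has outpaced throughput, here is a quick back-of-the-envelope calculation in Python. The drive numbers are rough assumptions of mine, not figures from the course:

```python
# Assumed numbers: a ~20 TB hard drive that reads sequentially at ~200 MB/s.
capacity_bytes = 20 * 10**12   # what the drive can store
throughput_bps = 200 * 10**6   # how fast we can read it, bytes per second

seconds = capacity_bytes / throughput_bps
print(f"Scanning one full drive: {seconds / 3600:.1f} hours")       # ~27.8 hours

# The same data spread over 100 drives scanned in parallel:
print(f"Spread over 100 drives: {seconds / 100 / 60:.1f} minutes")  # ~16.7 minutes
```

That is exactly why the rest of the course keeps coming back to parallelism.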

The Solutions: Parallel and Batch

The course teaches us two fundamental approaches to tackle these challenges:

Parallelization 🔄

To bridge the gap between capacity and throughput, we need to (a minimal sketch of the idea follows the list):

  • Exploit sequential access & parallelism (batch jobs, scans, partitioning)
  • Use distribution to overcome single-node throughput limits
  • Hide latency with caching, replication, and in-memory techniques
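
Here is a minimal sketch of the partition-and-scan idea using only the Python standard library. The partitions and the work done on them are invented; in a real cluster each partition would live on a different machine:

```python
from concurrent.futures import ProcessPoolExecutor

# Fake "partitions" of a large dataset; in practice these would be
# blocks of a distributed file, one per machine.
partitions = [list(range(i * 1_000_000, (i + 1) * 1_000_000)) for i in range(8)]

def scan_partition(chunk):
    """Sequentially scan one partition and return a partial result."""
    return sum(x % 7 == 0 for x in chunk)   # e.g. count multiples of 7

if __name__ == "__main__":
    # Each partition is scanned sequentially (cheap, high throughput),
    # and all partitions are scanned in parallel (distribution).
    with ProcessPoolExecutor() as pool:
        partial_counts = list(pool.map(scan_partition, partitions))
    print(sum(partial_counts))   # combine the partial results
```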

Batch Processing 📦

To handle the throughput vs. latency gap, we move from real-time processing to batch jobs that run automatically and process large chunks of data efficiently.

[Figure: Big Data Architecture Overview - modern big data architecture, from raw data to insights]

Key Technologies and Concepts

The Foundation: HDFS and MapReduce

Pro tip from my notes: Pay close attention to HDFS and MapReduce - they're the building blocks everything else rests on.
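
Since everything later builds on the MapReduce programming model, here is a toy word count written in the map/shuffle/reduce pattern in plain Python. This only mimics the model on one machine; real MapReduce runs the same three phases over HDFS blocks across a cluster:

```python
from collections import defaultdict

documents = ["big data is big", "data about data"]   # toy input

def map_phase(doc):
    """Map: emit a (key, value) pair for every word in a document."""
    for word in doc.split():
        yield (word, 1)

# Shuffle: group all emitted values by key
groups = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        groups[key].append(value)

def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts)   # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```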

The Evolution: From NoSQL to Modern Systems

The course covers the evolution of data storage beyond traditional RDBMS (a toy comparison of the four models follows the list):

  1. Wide Column Stores - Like Cassandra
  2. Document Stores - MongoDB and friends
  3. Key-Value Stores - Redis, DynamoDB
  4. Graph Stores - Neo4j, Amazon Neptune
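
As a toy comparison (all keys and values are invented), here is roughly how the same user profile is shaped in each of the four models, using plain Python structures:

```python
# Key-value store (Redis/DynamoDB style): an opaque value behind one key
kv = {"user:42": '{"name": "Alice", "city": "Zurich"}'}

# Document store (MongoDB style): the value is a nested, queryable document
doc = {"_id": 42, "name": "Alice", "address": {"city": "Zurich", "zip": "8092"}}

# Wide column store (Cassandra/HBase style): rows grouped into column families
wide = {"42": {"profile": {"name": "Alice", "city": "Zurich"},
               "activity": {"last_login": "2025-10-01"}}}

# Graph store (Neo4j/Neptune style): nodes plus labeled relationships
graph = {"nodes": {42: {"label": "User", "name": "Alice"}},
         "edges": [(42, "FOLLOWS", 43)]}

print(doc["address"]["city"], wide["42"]["profile"]["name"], graph["edges"][0])
```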

Data Independence: The Guiding Principle

Data independence means that the logical view of the data is cleanly separated, decoupled, from its physical storage.

This concept, introduced by Edgar Codd in 1970, is still the foundation of everything we do. Even in distributed systems, we want to hide physical complexity and expose simple, clean models.
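
A minimal sketch of what this buys us in practice, assuming PySpark is available and using made-up paths and column names: the query below is purely logical, and it stays exactly the same whether the data is one small file on a laptop or thousands of Parquet partitions spread across a cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-independence-demo").getOrCreate()

# Physical layer: could be a single local file or a heavily partitioned
# dataset in a distributed file system - the code below doesn't care.
events = spark.read.parquet("/data/events")   # hypothetical path

# Logical layer: we only talk to a clean, declarative view of the data.
events.createOrReplaceTempView("events")
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
daily.show()
```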

The Red Thread 🧵

Here's the conceptual flow that ties everything together:

Raw Data (Objects/Blocks) → Data Model → Processing → Declarative Language

When the course talks about normalization, models, or declarative queries, I immediately connect it to Spark SQL and JSONiq. This mental framework has been incredibly helpful.

Learning Strategy

Some practical advice for anyone taking this course:

  • Focus intensely on HDFS and MapReduce - Everything builds on these
  • Connect concepts to Spark SQL/JSONiq - Especially when discussing normalization and declarative models
  • Follow the red thread - Always trace from raw data to final query language
  • Study 3-4 hours per week - The material is dense but manageable

[Figure: Course Study Plan - my study schedule and approach for the course]

What's Coming Next

As the course progresses, I'll be diving deeper into:

  • Distributed file systems and their trade-offs
  • MapReduce programming paradigms
  • Spark and modern distributed computing
  • Data modeling for different shapes (trees, graphs, cubes)
  • Query optimization in distributed systems

Final Thoughts

This course is fundamentally changing how I think about data systems. The realization that we need to "rebuild the entire technology stack" for distributed systems is both daunting and exciting.

It's fascinating to see how the principles from traditional databases (like data independence) still apply, but require completely different implementations when you're dealing with clusters instead of single machines.

The 💽 emoji in my course notes isn't just decoration - it represents this massive shift from single disks to distributed storage that's reshaping the entire field of data management.

More updates to come as I progress through HDFS, MapReduce, and beyond! 🚀