Study guide - Designing Data-Intensive Applications Notes

This study guide will help you navigate Designing Data-Intensive Applications effectively, whether you’re learning for the first time or reviewing key concepts.

Alternative reading paths

For practitioners building systems

Immediate needs

Quick practical path:

Chapter 1 (overview)
Chapter 5 (replication basics)
Chapter 7 (transactions)
Whichever chapters your current project needs
Return later to fill the gaps

Backend engineers

Focus on:

Chapters 2-3 (storage fundamentals)
Chapter 5 (replication)
Chapter 6 (partitioning)
Chapter 7 (transactions)
Chapter 11 (streaming for real-time systems)

Data engineers

Focus on:

Chapter 3 (storage, especially OLAP)
Chapter 10 (batch processing, MapReduce)
Chapter 11 (stream processing)
Chapter 12 (data integration)

Distributed systems engineers

Focus on:

Chapters 5-9 (all distributed data chapters)
Pay special attention to:
- Chapter 8 (failure modes)
- Chapter 9 (consensus)

Key concepts to master

Part 1: Foundations of data systems

Chapter 1: Core principles

Critical concepts:

Reliability: Faults vs. failures, fault tolerance strategies
Scalability: Load parameters, describing performance with percentiles
Maintainability: Operability, simplicity, evolvability

Key insight: Describe system behavior with concrete metrics, not vague terms like “fast” or “scalable”.

Practical exercise: For a system you know, identify its load parameters and measure p50, p95, and p99 response times.
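The percentile measurement in the exercise above can be sketched as follows. This is a minimal illustration, assuming the nearest-rank definition of a percentile (one of several in common use). The function name `percentile` and the sample latencies are made up for the example.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds (illustration data, not measurements).
response_times_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
```

With only ten samples the tail percentiles collapse onto the slowest request, which is why tail latencies need a large sample to be meaningful.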

Chapter 2: Choosing data models

Critical concepts:

Relational model: Normalized data, joins, ACID transactions
Document model: Schema flexibility, data locality, embedded documents
Graph model: Many-to-many relationships, traversals, pattern matching

Key insight: Data model choice affects how you think about problems and write code.

Practical exercise: Model a social network in relational, document, and graph databases. Compare query complexity.

Chapter 3: Storage engines

Critical concepts:

Log-structured storage: LSM-trees, SSTables, compaction
B-trees: In-place updates, balanced tree, fixed-size pages
OLTP vs. OLAP: Different workload patterns need different storage
Column storage: Compression, vectorized processing

Key insight: LSM-trees generally favor write throughput, while B-trees generally favor read performance. Choose the storage engine that matches your workload.

Practical exercise: Implement a simple key-value store with a hash index and log-structured storage.

Chapter 4: Data encoding

Critical concepts:

Schema evolution: Adding/removing fields, changing types
Compatibility: Backward (new reads old) and forward (old reads new)
Encoding formats: JSON vs. Thrift vs. Protocol Buffers vs. Avro

Key insight: Plan for schema changes from day one. Old and new code will coexist during deployments.

Practical exercise: Define a Protocol Buffers schema, evolve it by adding optional fields, and test compatibility.
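The two compatibility directions can be illustrated without any schema tooling. The sketch below uses plain JSON rather than Protocol Buffers, and the schemas `SCHEMA_V1` and `SCHEMA_V2` and the `decode` helper are hypothetical names introduced only for this example. It assumes the reader drops fields it does not know and fills in defaults for fields that are missing.

```python
import json

# Hypothetical schemas: each maps a field name to its default value.
# Version 2 adds an optional "email" field with a default.
SCHEMA_V1 = {"id": None, "name": None}
SCHEMA_V2 = {"id": None, "name": None, "email": ""}

def decode(payload, schema):
    """Decode JSON, ignoring unknown fields and defaulting missing ones."""
    raw = json.loads(payload)
    return {field: raw.get(field, default) for field, default in schema.items()}

old_record = json.dumps({"id": 1, "name": "Ada"})
new_record = json.dumps({"id": 2, "name": "Lin", "email": "lin@example.com"})

# Backward compatibility: new code reads data written by old code.
print(decode(old_record, SCHEMA_V2))
# Forward compatibility: old code reads data written by new code.
print(decode(new_record, SCHEMA_V1))
```

Adding a field stays compatible in both directions here only because the new field has a default. A new required field would break the backward direction.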

Part 2: Distributed data

Chapter 5: Replication fundamentals

Critical concepts:

Leader-based replication: Synchronous vs. asynchronous, failover
Multi-leader replication: Write conflicts, conflict resolution
Leaderless replication: Quorums, read repair, anti-entropy
Consistency issues: Read-after-write, monotonic reads, consistent prefix

Key insight: Replication lag is unavoidable with async replication. Design your application to handle it.

Practical exercise: Set up PostgreSQL with streaming replication. Observe replication lag. Test failover scenarios.
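Before setting up a real database, replication lag and read-after-write consistency can be explored with an in-memory model. This is a toy sketch, not how PostgreSQL replicates. The classes `Leader` and `Follower` and the function `read_own_write` are hypothetical, and the follower applies log entries only when `catch_up` is called explicitly.

```python
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def lag(self):
        """Replication lag, measured in unapplied log entries."""
        return len(self.leader.log) - self.applied

    def catch_up(self, max_entries=None):
        """Apply pending entries in log order, optionally only the first few."""
        pending = self.leader.log[self.applied:]
        if max_entries is not None:
            pending = pending[:max_entries]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)

def read_own_write(key, leader, follower, wrote_recently):
    # Read-after-write: send a user's reads to the leader if they wrote recently.
    source = leader if wrote_recently else follower
    return source.data.get(key)
```

A read from the follower before `catch_up` returns stale or missing data, while routing the writer's own reads to the leader hides the lag from that user.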

Chapter 6: Partitioning strategies

Critical concepts:

Partitioning by key range: Efficient range queries, risk of hot spots
Partitioning by hash: Even distribution, but no efficient range queries
Secondary indexes: Document-partitioned (local) vs. term-partitioned (global)
Rebalancing: Fixed partitions, dynamic partitioning, proportional to nodes

Key insight: Partitioning is for scalability. Replication is for fault tolerance. Use both together.

Practical exercise: Partition a dataset by hash. Measure load distribution. Identify and fix hot spots.
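A rough sketch of the load-measurement part of this exercise, assuming string keys and a fixed number of partitions. The helper names and the skewed workload are invented for illustration. SHA-256 is used because Python's built-in `hash` is randomized per process for strings.

```python
import hashlib
from collections import Counter

def partition_for(key, num_partitions):
    """Map a key to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def load_by_partition(keys, num_partitions):
    """Count how many requests land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Skewed workload: one hypothetical "celebrity" key receives most requests.
requests = [f"user{i}" for i in range(1000)] + ["celebrity"] * 5000
load = load_by_partition(requests, 4)
print(load, "max/mean =", max(load) / (sum(load) / len(load)))
```

Hashing spreads distinct keys evenly but cannot split a single hot key. One common mitigation is to append a small random suffix to the hot key on writes, at the cost of having to read and combine all the suffixed keys.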

Chapter 7: Transaction isolation

Critical concepts:

ACID properties: Atomicity, consistency, isolation, durability
Isolation levels: Read committed, repeatable read, serializable
Concurrency problems: Dirty reads, dirty writes, lost updates, write skew, phantoms
Implementing serializability: Actual serial execution, 2PL, SSI

Key insight: Weak isolation levels have subtle edge cases. Understand what guarantees you actually need.

Practical exercise: Create race conditions (lost update, write skew) in a database. Fix them with an appropriate isolation level.
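The lost update anomaly can be shown without a database by writing out the interleaving by hand. The sketch below is a deterministic simulation, not real concurrency. `Register` is a hypothetical stand-in for a database row with a version column, and the fix shown is compare-and-set, one of several ways to prevent lost updates.

```python
class Register:
    """A single value with a version number, standing in for a database row."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, value):
        self.value = value
        self.version += 1

    def compare_and_set(self, expected_version, value):
        """Write only if nobody else has written since we read."""
        if self.version != expected_version:
            return False
        self.write(value)
        return True

def unsafe_increments(reg):
    """Two clients interleave read-modify-write cycles; one increment is lost."""
    a, _ = reg.read()   # client A reads
    b, _ = reg.read()   # client B reads the same value
    reg.write(a + 1)    # A writes
    reg.write(b + 1)    # B overwrites A's increment

def safe_increment(reg):
    """Retry the read-modify-write cycle until the compare-and-set succeeds."""
    while True:
        value, version = reg.read()
        if reg.compare_and_set(version, value + 1):
            return
```

After `unsafe_increments` the register holds 1 rather than 2. With compare-and-set, the second writer's stale version is rejected and it must re-read before retrying.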

Chapter 8: Distributed system realities

Critical concepts:

Unreliable networks: Packet loss, delays, partitions
Unreliable clocks: Clock skew, monotonic vs. time-of-day
Partial failures: Cannot distinguish crashed vs. slow
Timeouts and retries: Exponential backoff, idempotency

Key insight: In distributed systems, you often can’t tell what happened. Design for uncertainty.

Practical exercise: Use tc (traffic control) to inject network latency/loss. Observe how applications behave.
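Timeouts, retries, and idempotency fit together as sketched below. This is a minimal illustration, assuming full-jitter exponential backoff and a server that deduplicates by a client-supplied key. `retry_with_backoff` and `PaymentService` are hypothetical names, and the `sleep` parameter exists only so the delay can be stubbed out.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Retry on TimeoutError, sleeping a random time up to an exponentially growing cap."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter spreads out retry storms

class PaymentService:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self):
        self.processed = {}
        self.charges = 0

    def charge(self, idempotency_key, amount):
        # A retried request with the same key returns the stored result
        # instead of charging again.
        if idempotency_key not in self.processed:
            self.charges += 1
            self.processed[idempotency_key] = f"charged {amount}"
        return self.processed[idempotency_key]
```

The idempotency key matters because a timeout does not tell the client whether the request was processed. Without it, a retry after a lost response would charge twice.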

Chapter 9: Consistency and consensus

Critical concepts:

Linearizability: Strongest consistency, appears as single copy
Causality: Happens-before relationship, causal consistency
Consensus: Getting nodes to agree, Paxos, Raft, ZAB
Total order broadcast: Equivalent to consensus

Key insight: Consensus is expensive but necessary for coordination problems like leader election.

Practical exercise: Set up an etcd or Consul cluster. Observe leader election. Test partition scenarios.
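The happens-before relationship listed under causality can be made concrete with a Lamport clock. The sketch below is a bare-bones version: the class name `LamportClock` is illustrative, and message passing is reduced to handing a timestamp from one clock to another.

```python
class LamportClock:
    """Logical clock whose timestamps respect the happens-before order."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, message_time):
        """Jump past the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                 # local event on node A
sent = a.send()          # A sends a message
received = b.receive(sent)
print(sent, received)
```

If one event happens before another, its Lamport timestamp is smaller. The converse does not hold, so the timestamps alone cannot tell you whether two events were concurrent.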

Part 3: Derived data

Chapter 10: Batch processing patterns

Critical concepts:

MapReduce: Map phase, shuffle, reduce phase
Distributed joins: Broadcast join, partitioned join, map-side join
Dataflow engines: Beyond MapReduce (Spark, Flink)
Graph processing: Pregel model, bulk synchronous parallel

Key insight: Batch processing is about high throughput on large datasets, not low latency.

Practical exercise: Write a MapReduce job to analyze logs. Implement a join between two large datasets.
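The map, shuffle, and reduce phases can be sketched in a single process before moving to a real framework. This is a toy model that keeps everything in memory and assumes whitespace-separated log lines. The function names and the sample log are invented for the example.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Call the reducer once per key, in sorted key order."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def word_mapper(line):
    for word in line.lower().split():
        yield word, 1

def sum_reducer(_key, values):
    return sum(values)

# Hypothetical log lines.
logs = ["GET /home", "GET /about", "POST /home"]
counts = reduce_phase(shuffle(map_phase(logs, word_mapper)), sum_reducer)
print(counts)
```

In a distributed implementation the shuffle is the expensive step, because it moves every value for a given key to the same reducer across the network.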

Chapter 11: Stream processing

Critical concepts:

Event streams: Messages vs. events, Kafka, Kinesis
Change data capture: Streaming database changes
Event sourcing: Immutable event log, deriving state
Stream processing: Windowing, joins, fault tolerance

Key insight: Streams bridge batch and request-response. Lower latency than batch, more fault-tolerant than services.

Practical exercise: Set up Kafka. Implement CDC from a database. Build a stream processor with windowed aggregations.
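A windowed aggregation can be prototyped over an in-memory list before involving Kafka. The sketch assumes tumbling (fixed, non-overlapping) windows keyed by event time in whole seconds. `tumbling_window_counts` is a hypothetical helper, and it ignores the hard part of real stream processors: deciding when a window is complete.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key); events are (timestamp, key) pairs."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align each event to the start of its fixed-size window.
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical events, deliberately out of order to mimic late arrival.
events = [(0, "click"), (30, "click"), (60, "click"), (59, "view"), (125, "click")]
print(tumbling_window_counts(events, 60))
```

Because windows are assigned by event time rather than arrival time, the late `(59, "view")` event still lands in the first window.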

Chapter 12: Integration patterns

Critical concepts:

Unbundling databases: Separate specialized systems
Dataflow architectures: Event log as integration backbone
Derived data: System of record vs. derived views
Lambda vs. Kappa: Batch + stream vs. stream only

Key insight: Modern applications are composed of multiple specialized databases, integrated via event streams.

Practical exercise: Design a system with an OLTP database, cache, search index, and analytics warehouse. Use CDC for integration.

Common pitfalls and misconceptions

Avoid these common mistakes when learning and applying concepts:

“NoSQL is always better than SQL”
- Wrong. Choose based on data model and access patterns
- Relational databases still excel for many use cases
“Eventual consistency is good enough”
- Maybe, but understand the anomalies your application can tolerate
- Some use cases require strong consistency
“Distributed transactions are impossible”
- They’re possible but expensive and limit availability
- Often better to avoid them, but understand the trade-offs
“CAP theorem means choose 2 of 3”
- Misleading. Partitions are not something you choose to tolerate: they are faults that happen, even if rarely
- During a partition, you must choose between consistency and availability
- The rest of the time, you can have both
“Microservices solve all problems”
- They introduce distributed systems challenges
- Benefits come with complexity costs
“Schema changes require downtime”
- Not with proper schema evolution techniques
- Backward and forward compatibility enable zero-downtime deployments

Practical exercises and projects

Beginner level

Build a key-value store

Learning goals: Storage engines, indexing

Implement:

Hash index with log-structured storage
Compaction to prevent infinite growth
Crash recovery

Tech: Python, file I/O
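One possible starting point for this project is sketched below. It assumes a single writer, string keys and values that contain no tab or newline characters, and a whole log that fits comfortably on disk. The class name `LogKV` and the record format are choices made for the example.

```python
import os

class LogKV:
    """Append-only log file plus an in-memory hash index of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()  # create the log file if it does not exist
        self._rebuild_index()

    def _rebuild_index(self):
        # Crash recovery: rescan the log from the start; later records win.
        with open(self.path, "rb") as f:
            offset = 0
            for line in f:
                key, _, _ = line.rstrip(b"\n").partition(b"\t")
                self.index[key.decode()] = offset
                offset += len(line)

    def set(self, key, value):
        record = f"{key}\t{value}\n".encode()
        offset = os.path.getsize(self.path)  # the new record starts at the current end
        with open(self.path, "ab") as f:
            f.write(record)
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            _, _, value = f.readline().rstrip(b"\n").partition(b"\t")
        return value.decode()

    def compact(self):
        # Rewrite the log with only the latest value per key, then swap files.
        tmp = self.path + ".compact"
        live = {k: self.get(k) for k in self.index}
        with open(tmp, "wb") as f:
            for k, v in live.items():
                f.write(f"{k}\t{v}\n".encode())
        os.replace(tmp, self.path)
        self.index.clear()
        self._rebuild_index()
```

The sketch leaves out deletes (tombstones), checksums for partially written records, and segment files, which are natural next steps for the project.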

Set up replication

Learning goals: Replication, failover

Implement:

PostgreSQL primary with 2 replicas
Streaming replication
Test failover manually

Tech: PostgreSQL, Docker

Compare data models

Learning goals: Data modeling trade-offs

Model the same domain in:

Relational (PostgreSQL)
Document (MongoDB)
Graph (Neo4j)

Compare query complexity and performance

Explore consistency

Learning goals: Replication lag, consistency

Experiment with:

Read from leader vs. follower
Measure replication lag
Observe eventual consistency

Tech: PostgreSQL replication

Intermediate level

Build a distributed cache

Learning goals: Partitioning, consistent hashing

Implement:

Consistent hashing ring
Partition assignment
Handle node additions/removals

Tech: Python/Go, Redis
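A consistent hashing ring for this project might start out like the sketch below. It assumes string node names and keys, uses virtual nodes to smooth the distribution, and leaves out replication and the actual Redis connections. `HashRing` and its method names are illustrative.

```python
import bisect
import hashlib

def _hash(value):
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per physical node
        self._ring = []       # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key):
        """Walk clockwise from the key's position to the next virtual node."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

The property to verify in the project is that adding a node moves keys only onto the new node, and removing a node moves only the keys that node owned.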

Implement MapReduce

Learning goals: Batch processing

Implement:

Simple MapReduce framework
Word count, join operations
Fault tolerance

Tech: Python, multiprocessing

Build event sourcing system

Learning goals: Event logs, derived state

Implement:

Event store
State reconstruction from events
Multiple projections

Tech: Kafka, any language
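State reconstruction and multiple projections can be prototyped with a plain list standing in for the event store. The sketch below does not use Kafka. The event shapes and the functions `apply_balance`, `apply_activity`, and `project` are hypothetical.

```python
from functools import reduce

# Hypothetical immutable event log for a toy banking domain.
events = [
    {"type": "Deposited", "account": "alice", "amount": 100},
    {"type": "Withdrawn", "account": "alice", "amount": 30},
    {"type": "Deposited", "account": "bob", "amount": 50},
]

def apply_balance(balances, event):
    """Projection 1: current balance per account."""
    delta = event["amount"] if event["type"] == "Deposited" else -event["amount"]
    updated = dict(balances)  # copy, so earlier states are never mutated
    updated[event["account"]] = updated.get(event["account"], 0) + delta
    return updated

def apply_activity(counts, event):
    """Projection 2: number of events per type."""
    updated = dict(counts)
    updated[event["type"]] = updated.get(event["type"], 0) + 1
    return updated

def project(event_log, apply, initial=None):
    """Rebuild derived state by folding the apply function over the log."""
    return reduce(apply, event_log, initial or {})

print(project(events, apply_balance))
print(project(events, apply_activity))
```

Both views are derived from the same log, so a new projection can be added later and built by replaying history from the beginning.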

Transaction isolation levels

Learning goals: Concurrency, isolation

Demonstrate:

Lost updates with read committed
Write skew with repeatable read
Fix with serializable isolation

Tech: PostgreSQL

Advanced level

Consensus implementation

Learning goals: Distributed consensus

Implement:

Simplified Raft consensus
Leader election
Log replication

Tech: Go/Rust, networking

Streaming platform

Learning goals: Stream processing

Build:

CDC from database
Stream processing pipelines
Windowed aggregations

Tech: Kafka, Flink/Spark Streaming

Multi-datacenter architecture

Learning goals: Geo-distribution, consistency

Design:

Multi-region deployment
Conflict resolution
Latency optimization

Tech: Cloud providers, distributed DB

Data integration platform

Learning goals: System composition

Integrate:

OLTP database
Search index
Analytics warehouse
Cache layer

All synchronized via CDC and event streams

Discussion questions

Use these questions to deepen your understanding:

Why do we need so many different databases?
- Consider: different data models, workload patterns, CAP trade-offs
When is eventual consistency acceptable?
- Think about: user expectations, business requirements, error handling
What makes distributed systems hard?
- Examine: partial failures, network unreliability, asynchronous execution
How do you choose between batch and stream processing?
- Consider: latency requirements, data volumes, complexity tolerance
Is microservices architecture worth the complexity?
- Weigh: team independence, deployment flexibility vs. distributed system challenges
How important is backward compatibility?
- Think about: rolling deployments, mobile apps, third-party integrations

Further resources

After completing this book, continue learning with:

Academic papers

Read the original research papers referenced throughout the book. Start with:

Bigtable, Dynamo, Spanner
Paxos, Raft consensus algorithms
Dremel (columnar storage)

Open source projects

Study implementations of concepts:

PostgreSQL (B-trees, MVCC, replication)
Cassandra (leaderless replication, LSM-trees)
Kafka (event log, partitioning)
etcd (Raft consensus)

System design practice

Apply your knowledge:

Practice system design interviews
Design real-world systems
Read architecture blogs (Netflix, Uber, LinkedIn)

Related books

Deepen specific areas:

“Database Internals” by Alex Petrov
“Streaming Systems” by Tyler Akidau
“Designing Distributed Systems” by Brendan Burns

Retention strategies

Active reading

Don’t just read passively. For each chapter:

Take notes in your own words
Draw diagrams of concepts
Explain concepts to a colleague

Hands-on practice

Theory alone isn’t enough:

Complete practical exercises
Set up actual systems
Break things and fix them

Spaced repetition

Review periodically:

Week 1: Review all chapters
Month 1: Review key concepts
Month 3: Review challenging topics
Month 6: Full review

Apply to real work

The best way to learn:

Apply concepts to your projects
Evaluate existing systems with new knowledge
Share learnings with your team

Quick reference

When to use what

Use case	Best choice	Why
Transactional workload	Relational DB	ACID, joins, constraints
Hierarchical data	Document DB	Schema flexibility, locality
Highly connected data	Graph DB	Relationship traversal
High write throughput	LSM-tree storage	Sequential writes
Analytics queries	Column-oriented DB	Scan efficiency
Strong consistency needed	Single-leader replication	Can be linearizable when reads go to the leader
Multi-datacenter writes	Multi-leader or leaderless	Availability during partitions
Event-driven architecture	Event streaming (Kafka)	Decoupling, scalability
Large batch analytics	Hadoop/Spark	High throughput
Real-time analytics	Stream processing	Low latency

Trade-off cheat sheet

Trade-off	Choose A if…	Choose B if…
Consistency vs. Availability	Correctness critical (banking)	Uptime critical (social media)
Normalization vs. Denormalization	Write-heavy, need consistency	Read-heavy, can tolerate staleness
B-tree vs. LSM-tree	Read-heavy workload	Write-heavy workload
Batch vs. Stream	Can tolerate hours of latency	Need latency of seconds or minutes
Vertical vs. Horizontal scaling	Simpler operations	Load exceeds what one machine can handle
Microservices vs. Monolith	Independent team scaling	Simpler operations
