This study guide will help you navigate Designing Data-Intensive Applications effectively, whether you’re learning for the first time or reviewing key concepts.
Recommended reading order
The book is designed to be read sequentially, but you can adapt the order to your goals and experience level.
Foundation: Chapters 1-4
- Understand the three pillars of data systems (reliability, scalability, maintainability)
- Learn about fault tolerance vs. fault prevention
- Study vertical vs. horizontal scaling
- Compare relational, document, and graph models
- Understand when to use each data model
- Learn declarative vs. imperative queries
- Master log-structured vs. update-in-place storage
- Understand B-trees and LSM-trees
- Learn column-oriented storage for analytics
- Study schema evolution techniques
- Compare JSON, Thrift, Protocol Buffers, and Avro
- Understand backward and forward compatibility
Distributed data: Chapters 5-9
- Master leader-based, multi-leader, and leaderless replication
- Understand replication lag and consistency issues
- Study read-after-write and monotonic read guarantees
- Learn key-range vs. hash partitioning
- Understand secondary indexes in partitioned databases
- Study rebalancing strategies
- Master ACID properties
- Understand isolation levels (read committed, repeatable read, serializable)
- Learn about concurrency problems (dirty reads, lost updates)
- Understand partial failures and network issues
- Study unreliable clocks and their implications
- Learn about detecting faults with timeouts
- Master linearizability vs. serializability
- Understand causality and ordering guarantees
- Study consensus algorithms (Paxos, Raft)
Derived data: Chapters 10-12
- Understand MapReduce and distributed filesystems
- Learn join algorithms in batch processing
- Study dataflow engines beyond MapReduce
- Master event streams and message brokers
- Understand change data capture (CDC)
- Learn stream processing frameworks
- Study data integration patterns
- Understand unbundling databases
- Learn lambda and kappa architectures
Alternative reading paths
For practitioners building systems
Immediate needs
- Chapter 1 (overview)
- Chapter 5 (replication basics)
- Chapter 7 (transactions)
- The chapters you need for your current project
- Return later to fill the gaps
Backend engineers
- Chapters 2-3 (storage fundamentals)
- Chapter 5 (replication)
- Chapter 6 (partitioning)
- Chapter 7 (transactions)
- Chapter 11 (streaming for real-time systems)
Data engineers
- Chapter 3 (storage, especially OLAP)
- Chapter 10 (batch processing, MapReduce)
- Chapter 11 (stream processing)
- Chapter 12 (data integration)
Distributed systems engineers
- Chapters 5-9 (all distributed data chapters)
- Pay special attention to:
  - Chapter 8 (failure modes)
  - Chapter 9 (consensus)
Key concepts to master
Part 1: Foundations of data systems
Chapter 1: Core principles
- Reliability: Faults vs. failures, fault tolerance strategies
- Scalability: Load parameters, describing performance with percentiles
- Maintainability: Operability, simplicity, evolvability
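Describing performance with percentiles can be made concrete with a few lines of code. The sketch below is only an illustration: the function name `percentile` and the sample response times are assumptions, and the nearest-rank method is one of several common percentile definitions.

```python
import math

def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in milliseconds, including one slow outlier.
response_times_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 14]

p50 = percentile(response_times_ms, 50)    # median: what a typical request sees
p99 = percentile(response_times_ms, 99)    # tail latency: dominated by the outlier
mean = sum(response_times_ms) / len(response_times_ms)

print(f"p50={p50} ms, p99={p99} ms, mean={mean} ms")
```

With this sample the median is 14 ms while the mean is 37.5 ms, which shows why a mean can misrepresent what most users experience.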
Chapter 2: Choosing data models
- Relational model: Normalized data, joins, ACID transactions
- Document model: Schema flexibility, data locality, embedded documents
- Graph model: Many-to-many relationships, traversals, pattern matching
Chapter 3: Storage engines
- Log-structured storage: LSM-trees, SSTables, compaction
- B-trees: In-place updates, balanced tree, fixed-size pages
- OLTP vs. OLAP: Different workload patterns need different storage
- Column storage: Compression, vectorized processing
Chapter 4: Data encoding
- Schema evolution: Adding/removing fields, changing types
- Compatibility: Backward (new code reads data written by old code) and forward (old code reads data written by new code)
- Encoding formats: JSON vs. Thrift vs. Protocol Buffers vs. Avro
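The two compatibility directions can be sketched with plain JSON records. This is a minimal illustration, not how Thrift, Protocol Buffers, or Avro work internally: the field names, the default values, and the `decode` helper are all assumptions made for the example.

```python
import json

# Hypothetical schema versions: v2 adds an optional "email" field with a default.
V1_FIELDS = {"id": None, "name": ""}
V2_FIELDS = {"id": None, "name": "", "email": "unknown"}

def decode(raw, known_fields):
    """Decode a JSON record, ignoring unknown fields and defaulting missing ones."""
    data = json.loads(raw)
    return {field: data.get(field, default) for field, default in known_fields.items()}

old_record = json.dumps({"id": 1, "name": "Ada"})                            # written by v1 code
new_record = json.dumps({"id": 2, "name": "Bob", "email": "b@example.com"})  # written by v2 code

# Backward compatibility: new code reads data written by old code.
backward = decode(old_record, V2_FIELDS)

# Forward compatibility: old code reads data written by new code.
forward = decode(new_record, V1_FIELDS)

print(backward)
print(forward)
```

The new reader fills in the default for the missing `email` field, and the old reader ignores the field it does not know about. Note that an old reader that writes the record back this way would drop the unknown field.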
Part 2: Distributed data
Chapter 5: Replication fundamentals
- Leader-based replication: Synchronous vs. asynchronous, failover
- Multi-leader replication: Write conflicts, conflict resolution
- Leaderless replication: Quorums, read repair, anti-entropy
- Consistency issues: Read-after-write, monotonic reads, consistent prefix
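The quorum condition for leaderless replication is short enough to write down directly. In the sketch below, `n` is the number of replicas, `w` the number of nodes that must confirm a write, and `r` the number of nodes queried per read; the function names are assumptions for illustration.

```python
def quorum_overlaps(n, w, r):
    """True if every set of r read nodes must share a node with every set of w write nodes."""
    return w + r > n

def tolerated_unavailable_nodes(n, w, r):
    """How many nodes can be down while both reads and writes can still reach their quorum."""
    return n - max(w, r)

for n, w, r in [(3, 2, 2), (5, 3, 3), (3, 1, 1)]:
    print(n, w, r, quorum_overlaps(n, w, r), tolerated_unavailable_nodes(n, w, r))
```

With n = 3, w = 2, r = 2 the read and write sets always overlap and one node may be unavailable. With w = 1, r = 1 there is no overlap guarantee, so a read may miss the latest write.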
Chapter 6: Partitioning strategies
- Partitioning by key range: Efficient range queries, risk of hot spots
- Partitioning by hash: Even distribution, but no efficient range queries
- Secondary indexes: Document-partitioned (local) vs. term-partitioned (global)
- Rebalancing: Fixed number of partitions, dynamic partitioning, partitions proportional to the number of nodes
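The difference between key-range and hash partitioning shows up clearly in a small sketch. The boundaries, the partition count, and the function names are assumptions. Taking the hash modulo the partition count is used here only for brevity: it forces most keys to move when the count changes, which is the problem the rebalancing strategies address.

```python
import bisect
import hashlib

# Hypothetical boundaries: partition 0 holds keys < "g", partition 1 holds "g" up to "n", and so on.
RANGE_BOUNDARIES = ["g", "n", "t"]

def key_range_partition(key):
    """Assign a key to the partition whose sorted key range contains it."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)

def hash_partition(key, num_partitions=4):
    """Assign a key by hashing it, which scatters adjacent keys across partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

for key in ["apple", "apricot", "grape", "zebra"]:
    print(key, key_range_partition(key), hash_partition(key))
```

Adjacent keys such as `apple` and `apricot` always share a range partition, so a range scan touches one partition, while their hash partitions are unrelated.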
Chapter 7: Transaction isolation
- ACID properties: Atomicity, consistency, isolation, durability
- Isolation levels: Read committed, repeatable read, serializable
- Concurrency problems: Dirty reads, dirty writes, lost updates, write skew, phantoms
- Implementing serializability: Actual serial execution, 2PL, SSI
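A lost update is easiest to understand by stepping through the interleaving by hand. The sketch below simulates two clients incrementing a counter without real concurrency; the `Register` class is a hypothetical in-memory stand-in for a database row, and compare-and-set is shown as one way to prevent the anomaly, not the only one.

```python
class Register:
    """A toy single-value store with a version number (hypothetical, in-memory)."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, value):
        self.value = value
        self.version += 1

    def compare_and_set(self, expected_version, value):
        """Write only if nobody else has written since expected_version was read."""
        if self.version != expected_version:
            return False
        self.write(value)
        return True

# Unsafe read-modify-write: both clients read 0, then both write 1.
counter = Register()
a_value, _ = counter.read()
b_value, _ = counter.read()
counter.write(a_value + 1)
counter.write(b_value + 1)
lost_update_result = counter.value  # one of the two increments is lost

# Compare-and-set: the second writer detects the conflict and retries.
counter = Register()
a_value, a_version = counter.read()
b_value, b_version = counter.read()
a_ok = counter.compare_and_set(a_version, a_value + 1)
b_ok = counter.compare_and_set(b_version, b_value + 1)  # fails: version has changed
if not b_ok:
    b_value, b_version = counter.read()
    b_ok = counter.compare_and_set(b_version, b_value + 1)
cas_result = counter.value

print(lost_update_result, cas_result)
```

The unsafe interleaving ends at 1 even though two increments ran; the compare-and-set version ends at 2.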
Chapter 8: Distributed system realities
- Unreliable networks: Packet loss, delays, partitions
- Unreliable clocks: Clock skew, monotonic vs. time-of-day
- Partial failures: A crashed node cannot be reliably distinguished from a slow one
- Timeouts and retries: Exponential backoff, idempotency
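Retries are only safe when the operation is idempotent, because a timeout does not tell the client whether the request was applied. The following sketch combines exponential backoff with a deduplicating server. Everything in it is an assumption for illustration: the `retry_with_backoff` helper, the `IdempotentCounter` class, the request ID scheme, and the use of random jitter on the delay.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds, waiting longer after each timeout."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Jitter: wait a random fraction of the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt) * rng()
            sleep(delay)

class IdempotentCounter:
    """Hypothetical server that deduplicates requests by a client-supplied ID."""

    def __init__(self):
        self.total = 0
        self.seen = set()

    def add(self, request_id, amount):
        if request_id not in self.seen:   # apply each request at most once
            self.seen.add(request_id)
            self.total += amount
        return self.total

server = IdempotentCounter()
calls = {"n": 0}

def flaky_add():
    calls["n"] += 1
    result = server.add("req-1", 10)          # the write is applied on every attempt...
    if calls["n"] < 3:
        raise TimeoutError("response lost")   # ...but the first two replies are lost
    return result

delays = []  # record delays instead of sleeping, so the example runs instantly
final_total = retry_with_backoff(flaky_add, sleep=delays.append, rng=lambda: 1.0)
print(final_total, delays)
```

Although the request reached the server three times, the total is 10 rather than 30 because the server recognizes the repeated request ID.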
Chapter 9: Consistency and consensus
- Linearizability: A strong consistency guarantee; the system behaves as if there were a single copy of the data
- Causality: Happens-before relationship, causal consistency
- Consensus: Getting nodes to agree, Paxos, Raft, ZAB
- Total order broadcast: Equivalent to consensus
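One way to produce an ordering that is consistent with the happens-before relationship is a Lamport timestamp: a (counter, node ID) pair in which every node advances its counter past any value it has seen. The class below is a minimal sketch with assumed names, not a production logical clock.

```python
class LamportClock:
    """Logical clock whose timestamps respect causality (hypothetical minimal version)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Record a local event or a message send."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, timestamp):
        """Record a message receipt: jump past the sender's counter."""
        self.counter = max(self.counter, timestamp[0]) + 1
        return (self.counter, self.node_id)

node_a = LamportClock("A")
node_b = LamportClock("B")

send_ts = node_a.tick()            # A sends a message
local_ts = node_b.tick()           # B does unrelated local work
recv_ts = node_b.receive(send_ts)  # B receives A's message

print(send_ts, local_ts, recv_ts)
```

The send always gets a smaller timestamp than the matching receive. The converse does not hold: two timestamps being ordered does not prove that one event caused the other.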
Part 3: Derived data
Chapter 10: Batch processing patterns
- MapReduce: Map phase, shuffle, reduce phase
- Distributed joins: Sort-merge (reduce-side) joins, broadcast hash joins, partitioned hash joins (both map-side)
- Dataflow engines: Beyond MapReduce (Spark, Flink)
- Graph processing: Pregel model, bulk synchronous parallel
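The map, shuffle, and reduce phases listed above can be imitated in a single process to see how data flows between them. This is a teaching sketch with assumed function names; a real framework distributes each phase across machines and handles failures.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key and sort the keys, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

Only `map_phase` and `reduce_phase` contain application logic; the shuffle is generic, which is what lets a framework run it for any job.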
Chapter 11: Stream processing
- Event streams: Messages vs. events, Kafka, Kinesis
- Change data capture: Streaming database changes
- Event sourcing: Immutable event log, deriving state
- Stream processing: Windowing, joins, fault tolerance
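Windowing can be illustrated with a tumbling window, which assigns every event to exactly one fixed-length, non-overlapping interval. The sketch below assumes events are `(timestamp_in_seconds, payload)` tuples and groups by event time; the function name and sample events are made up for the example.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size, non-overlapping window, keyed by window start time."""
    counts = Counter()
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical events as (event time in seconds, payload).
events = [(0, "a"), (12, "b"), (59, "c"), (60, "d"), (125, "e")]

per_minute = tumbling_window_counts(events, 60)
print(per_minute)
```

A real stream processor additionally has to decide when a window is complete, since events can arrive late; this batch-style sketch sidesteps that question by seeing all events at once.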
Chapter 12: Integration patterns
- Unbundling databases: Separate specialized systems
- Dataflow architectures: Event log as integration backbone
- Derived data: System of record vs. derived views
- Lambda vs. Kappa: Batch + stream vs. stream only
Common pitfalls and misconceptions
Practical exercises and projects
Beginner level
Build a key-value store
- Hash index with log-structured storage
- Compaction to prevent infinite growth
- Crash recovery
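As a starting point for this exercise, the three bullets above can be sketched in one small class. It is a deliberately simplified sketch under stated assumptions: the class name `LogKV` and the tab-separated record format are made up, keys and values must not contain tabs or newlines, there are no deletes, and nothing is fsynced.

```python
import os
import tempfile

class LogKV:
    """Append-only log file plus an in-memory hash index of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()   # create the log if it does not exist yet
        self._rebuild_index()      # crash recovery: replay the log from the start

    def _rebuild_index(self):
        self.index.clear()
        offset = 0
        with open(self.path, "rb") as f:
            for line in f:
                key = line.split(b"\t", 1)[0].decode("utf-8")
                self.index[key] = offset   # later records for a key win
                offset += len(line)

    def set(self, key, value):
        record = f"{key}\t{value}\n".encode("utf-8")
        offset = os.path.getsize(self.path)   # the new record starts at the end of the log
        with open(self.path, "ab") as f:
            f.write(record)
        self.index[key] = offset

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])
            line = f.readline().decode("utf-8")
        return line.rstrip("\n").split("\t", 1)[1]

    def compact(self):
        """Rewrite the log so that it keeps only the latest value of each key."""
        live = {key: self.get(key) for key in self.index}
        tmp_path = self.path + ".compact"
        with open(tmp_path, "wb") as f:
            for key, value in live.items():
                f.write(f"{key}\t{value}\n".encode("utf-8"))
        os.replace(tmp_path, self.path)   # swap the compacted file into place
        self._rebuild_index()

# Usage sketch in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.log")
    db = LogKV(path)
    db.set("user:1", "Ada")
    db.set("user:2", "Bob")
    db.set("user:1", "Ada Lovelace")   # an overwrite is just another append
    size_before = os.path.getsize(path)
    db.compact()
    size_after = os.path.getsize(path)
    recovered = LogKV(path)            # simulates a restart: the index is rebuilt
    value_after_restart = recovered.get("user:1")
    missing_value = recovered.get("user:3")

print(size_before, size_after, value_after_restart)
```

Good next steps are splitting the log into segments so compaction can run in the background, and adding tombstone records for deletes.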
Set up replication
- PostgreSQL primary with 2 replicas
- Streaming replication
- Test failover manually
Compare data models
- Relational (PostgreSQL)
- Document (MongoDB)
- Graph (Neo4j)
Explore consistency
- Read from leader vs. follower
- Measure replication lag
- Observe eventual consistency
Intermediate level
Build a distributed cache
- Consistent hashing ring
- Partition assignment
- Handle node additions/removals
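A consistent hashing ring for this exercise might start from the sketch below. The class name `HashRing`, the node names, and the choice of 100 virtual nodes per physical node are assumptions; the ring is a sorted list of `(position, node)` pairs and a key belongs to the first position clockwise from its hash.

```python
import bisect
import hashlib

def _hash(value):
    """Map a string to a position on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per physical node, to even out load
        self._ring = []        # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring has no nodes")
        # First ring position at or after the key's hash, wrapping around at the end.
        index = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[index % len(self._ring)][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]

before = {key: ring.get_node(key) for key in keys}
ring.add_node("cache-d")
after = {key: ring.get_node(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Every key that moves ends up on the new node, and the rest stay where they were, which is the property that makes adding or removing a cache node cheap.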
Implement MapReduce
- Simple MapReduce framework
- Word count, join operations
- Fault tolerance
Build event sourcing system
- Event store
- State reconstruction from events
- Multiple projections
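The core idea of this exercise, deriving state by replaying an immutable event log, fits in a few lines. The event types, account names, and projection functions below are hypothetical; an in-memory list stands in for the event store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)   # frozen: events are immutable once recorded
class Event:
    kind: str      # "deposited" or "withdrawn" (hypothetical event types)
    account: str
    amount: int

def balances(events):
    """Projection 1: current balance per account, rebuilt from the full log."""
    state = {}
    for event in events:
        sign = 1 if event.kind == "deposited" else -1
        state[event.account] = state.get(event.account, 0) + sign * event.amount
    return state

def total_withdrawn(events):
    """Projection 2: total amount withdrawn across all accounts."""
    return sum(event.amount for event in events if event.kind == "withdrawn")

event_log = [
    Event("deposited", "alice", 100),
    Event("withdrawn", "alice", 30),
    Event("deposited", "bob", 50),
]

current_balances = balances(event_log)
withdrawn = total_withdrawn(event_log)
print(current_balances, withdrawn)
```

Both projections read the same log, so a new view can be added later and populated by replaying history without changing how events are written.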
Transaction isolation levels
- Lost updates with read committed
- Write skew with repeatable read
- Fix with serializable isolation
Advanced level
Consensus implementation
- Simplified Raft consensus
- Leader election
- Log replication
Streaming platform
- CDC from database
- Stream processing pipelines
- Windowed aggregations
Multi-datacenter architecture
- Multi-region deployment
- Conflict resolution
- Latency optimization
Data integration platform
- OLTP database
- Search index
- Analytics warehouse
- Cache layer
Discussion questions
Use these questions to deepen your understanding:
- Why do we need so many different databases?
  - Consider: different data models, workload patterns, CAP trade-offs
- When is eventual consistency acceptable?
  - Think about: user expectations, business requirements, error handling
- What makes distributed systems hard?
  - Examine: partial failures, network unreliability, asynchronous execution
- How do you choose between batch and stream processing?
  - Consider: latency requirements, data volumes, complexity tolerance
- Is a microservices architecture worth the complexity?
  - Weigh: team independence, deployment flexibility vs. distributed system challenges
- How important is backward compatibility?
  - Think about: rolling deployments, mobile apps, third-party integrations
Further resources
After completing this book, continue learning with:
Academic papers
- Bigtable, Dynamo, Spanner
- Paxos, Raft consensus algorithms
- Dremel (columnar storage)
Open source projects
- PostgreSQL (B-trees, MVCC, replication)
- Cassandra (leaderless replication, LSM-trees)
- Kafka (event log, partitioning)
- etcd (Raft consensus)
System design practice
- Practice system design interviews
- Design real-world systems
- Read architecture blogs (Netflix, Uber, LinkedIn)
Related books
- “Database Internals” by Alex Petrov
- “Streaming Systems” by Tyler Akidau, Slava Chernyak, and Reuven Lax
- “Designing Distributed Systems” by Brendan Burns
Retention strategies
Active reading
- Take notes in your own words
- Draw diagrams of concepts
- Explain concepts to a colleague
Hands-on practice
- Complete practical exercises
- Set up actual systems
- Break things and fix them
Spaced repetition
- Week 1: Review all chapters
- Month 1: Review key concepts
- Month 3: Review challenging topics
- Month 6: Full review
Quick reference
When to use what
| Use case | Best choice | Why |
|---|---|---|
| Transactional workload | Relational DB | ACID, joins, constraints |
| Hierarchical data | Document DB | Schema flexibility, locality |
| Highly connected data | Graph DB | Relationship traversal |
| High write throughput | LSM-tree storage | Sequential writes |
| Analytics queries | Column-oriented DB | Scan efficiency |
| Strong consistency needed | Single-leader replication or consensus | Can be linearizable if reads go to the leader or to synchronous followers |
| Multi-datacenter writes | Multi-leader or leaderless | Availability during partitions |
| Event-driven architecture | Event streaming (Kafka) | Decoupling, scalability |
| Large batch analytics | Hadoop/Spark | High throughput |
| Real-time analytics | Stream processing | Low latency |
Trade-off cheat sheet
| Trade-off | Choose A if… | Choose B if… |
|---|---|---|
| Consistency vs. Availability | Correctness critical (banking) | Uptime critical (social media) |
| Normalization vs. Denormalization | Write-heavy, need consistency | Read-heavy, can tolerate staleness |
| B-tree vs. LSM-tree | Read-heavy workload | Write-heavy workload |
| Batch vs. Stream | Can tolerate hours of latency | Need minute/second latency |
| Vertical vs. Horizontal scaling | Load fits on one machine; simpler operations | Load exceeds what a single machine can handle |
| Microservices vs. Monolith | Independent team scaling | Simpler operations |