Designing Data-Intensive Applications

Martin Kleppmann

ID: 217

Date added: May 13, 2025

Last modified: May 13, 2025

File sizes: PDF (23.3 MiB), EPUB (817.7 KiB)

Languages: English

Computers

Publisher: "O'Reilly Media, Inc."

Published: Mar 16, 2017

Core Thesis: This book argues that no single data system is right for all jobs and that the terms engineers rely on — 'ACID,' 'eventual consistency,' 'scalable,' 'reliable' — conceal a wide spectrum of actual behaviors and trade-offs. Drawing on more than 800 academic papers, engineering blog posts, and production folklore, Kleppmann synthesizes the principles and practicalities of how modern data systems really work, from single-machine storage engines up through distributed replication, transactions, and consensus, and into the derived-data pipelines (batch and stream) that increasingly tie systems together. A recurring second claim is that strong guarantees have inherent costs (linearizability requires network-latency-scale coordination, serializability limits throughput) and that good engineering means choosing consciously among them rather than accepting vendor marketing at face value. The closing chapter pushes this further, arguing that log-based derived data, end-to-end operation identifier

Description:

Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications. Peer under the hood of the systems you already use, and learn how to use and operate them more effectively. Make informed decisions by identifying the strengths and weaknesses of different tools. Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity. Understand the distributed systems research upon which modern databases are built. Peek behind the scenes of major online services, and learn from their architectures.

Review

Something is wrong with how the software industry talks about its databases. We toss around words like “ACID,” “scalable,” “consistent,” and “real-time” as if they were binary properties a system either possesses or lacks, but the reality is a swamp of partial guarantees, subtle failure modes, and trade-offs that vendor documentation is designed to obscure. Martin Kleppmann’s Designing Data-Intensive Applications is a 600-page corrective to that habit of mind. More than a textbook, less than a monograph, and vastly more honest than most technical books dare to be, it synthesizes over 800 papers, postmortems, and engineering blog posts into a coherent argument: that the language we use to reason about data systems is dangerously imprecise, and that the only responsible way to build software that touches data is to understand the actual mechanisms, not the marketing abstractions, and to choose your trade-offs consciously.

The book is not a reference manual for any particular database, nor is it a how-to for passing a system design interview. Kleppmann calls it “the missing guide to how modern data systems work under the hood,” and the key word is “modern”: he does not assume the reader will ever write a storage engine or implement Raft from scratch, but he insists that you cannot make competent architectural decisions without knowing what is happening inside the black boxes you are composing. This is a pragmatist’s book, in the philosophical sense: truth is what works, and what works depends on the problem you are solving. But it is also an analytic book, dismantling fuzzy concepts into their constituent parts with the patience of a good debugger, and an empiricist’s book, grounding every claim in a production incident or a peer-reviewed measurement. The result is one of the few works in software engineering that deserves the word “canonical” without irony.

The structure mirrors the layered architecture it describes. Part I, “Foundations of Data Systems,” starts from first principles that apply whether data lives on a single machine or a thousand. The opening chapter defines reliability, scalability, and maintainability not as platitudes but as design goals with operational teeth: reliability means “continuing to work correctly, even when things go wrong,” where “faults” include not just crashed processes but humans deploying bad code at 2 a.m. Scalability is illustrated through the Twitter fan-out problem — should a new tweet be written to every follower’s timeline at post time or merged at read time? — which becomes a running motif for the tension between write-time and read-time work that will resurface throughout the book. Maintainability is Kleppmann’s quiet political argument: systems that are opaque to their operators and resistant to change are the ones that accumulate technical debt and eventually fail in socially expensive ways.
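
The write-time versus read-time tension in the fan-out example is easier to see in code. What follows is a minimal in-memory sketch, not the book's or Twitter's implementation: the class name Timelines and its methods are hypothetical, and a single counter stands in for real timestamps.

```python
from collections import defaultdict


class Timelines:
    """Toy model of the two fan-out strategies (all names are illustrative)."""

    def __init__(self):
        self.following = defaultdict(set)   # user -> accounts they follow
        self.followers = defaultdict(set)   # account -> users following it
        self.tweets = defaultdict(list)     # author -> [(seq, text)]
        self.cache = defaultdict(list)      # user -> [(seq, author, text)]
        self.seq = 0                        # assumption: a counter replaces timestamps

    def follow(self, follower, followee):
        self.following[follower].add(followee)
        self.followers[followee].add(follower)

    def post(self, author, text):
        self.seq += 1
        self.tweets[author].append((self.seq, text))
        # Fan-out on write: one extra insert per follower, paid at post time.
        for follower in self.followers[author]:
            self.cache[follower].append((self.seq, author, text))

    def read_merged(self, user):
        # Fan-out on read: gather and merge every followee's tweets at read time.
        merged = [
            (seq, author, text)
            for author in self.following[user]
            for seq, text in self.tweets[author]
        ]
        return sorted(merged, reverse=True)

    def read_cached(self, user):
        # The precomputed timeline only needs ordering, newest first.
        return sorted(self.cache[user], reverse=True)
```

Both read paths return the same timeline; what differs is where the work lands. An account with millions of followers makes post expensive under the cached approach, while a user who follows many accounts makes read_merged expensive.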

From there the book descends into physical reality. Chapter 2 on data models opens with Wittgenstein: “The limits of my language mean the limits of my world.” Kleppmann uses this to frame the choice between relational, document, and graph models not as a tribal war but as a question of which relationships your application needs to express, and whether the impedance mismatch between in-memory objects and disk-bound tables is a price worth paying for the query flexibility of SQL. The historical tour from IMS’s hierarchical model through the CODASYL network model to Codd’s relational model is not academic nostalgia; it demonstrates that every “new” NoSQL idea has a predecessor, and that the document model is, in important respects, a return to the hierarchical thinking that the relational revolution overthrew. The point is not that one model is superior but that understanding the lineage makes you less susceptible to fashion.

Chapter 3, on storage and retrieval, is where the book’s method shines brightest. Kleppmann walks from the simplest possible append-only file with a hash index all the way up to B-trees, LSM-trees, and column-oriented warehouses, always with a concrete system in mind: Bitcask for the hash index, LevelDB and Cassandra for SSTables, Vertica for column stores. The comparative tables here are models of clarity: write amplification vs. read efficiency, compaction overhead vs. sustained throughput. This is not a theoretical survey; it is an engineer’s map of the territory, complete with notes on where the swamps are. The pivot to OLAP and data warehousing — star schemas, bitmap indexes, data cubes — connects the storage engine internals to the analytic workloads that dominate industry, and the chapter’s quiet insight is that the physical layout of bytes on disk directly shapes what kinds of questions you can ask efficiently, an insight too often obscured by SQL’s declarative abstraction.
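
The starting point of that walk, an append-only file plus an in-memory hash index, fits in a few lines. This is a rough sketch in the spirit of the design the chapter attributes to Bitcask, not Bitcask's actual format: the comma-separated record layout and the class name AppendOnlyStore are assumptions, keys must not contain commas or newlines, and an in-memory buffer stands in for a file on disk.

```python
import io


class AppendOnlyStore:
    """Append-only log with a hash index from key to byte offset (illustrative)."""

    def __init__(self):
        self.log = io.BytesIO()   # assumption: in-memory stand-in for a data file
        self.index = {}           # key -> offset of the most recent record

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        offset = self.log.seek(0, io.SEEK_END)  # always write at the end
        self.log.write(record)
        self.index[key] = offset  # older records for this key become garbage

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)     # one seek, one read
        line = self.log.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value
```

Overwriting a key never touches old bytes, so the log keeps growing until something compacts it. That is the overhead the compaction discussion is weighing against sequential write speed.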

Encoding and evolution (Chapter 4) could have been a dry catalogue of wire formats, but Kleppmann instead treats schema evolution as the central challenge of long-lived software. The epigraph — “Everything changes and nothing stands still” — sets the tone. The discussion of Thrift, Protocol Buffers, and Avro is less about which binary encoding is fastest and more about the social contract implied by field tags versus writer/reader schema resolution: backward compatibility is not a compiler feature; it is a promise to consumers of your data that your system will not break them, and it requires deliberate design. The dataflow section — databases, services, message-passing — connects encoding to the real operational problem of rolling upgrades, making clear that data outlives processes, and format choices are architecture choices.

Part II, “Distributed Data,” is the book’s intellectual center of gravity. It opens with a frank statement: you distribute data because you must — for scale, for fault tolerance, for latency — but distribution is a source of suffering, not a feature. Chapter 5 on replication unpacks the leader-based model and then walks, with remarkable patience, through the anomalies that replication lag introduces: reading your own writes, seeing things out of order, seeing things that don’t yet exist. Kleppmann’s treatment of multi-leader and leaderless replication (the Dynamo heritage) is especially valuable because it refuses to hide the conflict resolution mess: last-write-wins is easy but lossy; version vectors are correct but operationally expensive; CRDTs are theoretically elegant but often impractical. The implicit thesis is that replication is a spectrum of guarantees, not a binary “synchronous vs. asynchronous” switch, and that the right choice is always context-dependent.
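
The difference between "easy but lossy" and "correct but expensive" can be sketched briefly. The functions below are illustrative assumptions rather than any database's API: lww_merge keeps whichever write carries the higher timestamp, while compare inspects two version vectors (one counter per replica) and reports whether one write supersedes the other or the two are concurrent.

```python
def lww_merge(a, b):
    """Last-write-wins: a and b are (timestamp, value); the loser is silently dropped."""
    return a if a[0] >= b[0] else b


def compare(vv_a, vv_b):
    """Compare two version vectors given as dicts of replica -> counter."""
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"      # b has seen everything a has; b may overwrite a
    if b_le_a:
        return "b_before_a"
    return "concurrent"          # neither saw the other; both must be kept or merged
```

Given two shopping-cart writes made on different replicas, lww_merge returns one and discards the other without any signal, whereas compare returns "concurrent" and hands the resolution problem to the application.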

Chapter 6 on partitioning extends the same anti-absolute mindset to sharding: hash partitioning distributes load evenly but kills range queries; key-range partitioning enables range scans but risks hot spots. The discussion of secondary index partitioning — document-partitioned vs. term-partitioned — is a miniature masterpiece of distributed system design, showing how a seemingly local decision (where to put an index) cascades into global performance characteristics. The rebalancing strategies and request-routing mechanisms (ZooKeeper, gossip) are illustrated with the same production-first approach, drawing on Cassandra, Riak, and MongoDB.
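
A small sketch shows why range queries suffer under hash partitioning. The boundary list and function names are assumptions made for illustration, and taking a hash modulo the partition count is used only for brevity; it is not a recommendation for a system that needs to rebalance.

```python
import bisect
import hashlib


def hash_partition(key, num_partitions):
    """Spread keys evenly; adjacent keys land on unrelated partitions."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Assumed split points giving four partitions: [..g), [g..n), [n..t), [t..]
BOUNDARIES = ["g", "n", "t"]


def range_partition(key):
    """Keep keys in sorted order; adjacent keys usually share a partition."""
    return bisect.bisect_right(BOUNDARIES, key)


def partitions_for_range(low, high):
    """Partitions a scan over [low, high] must visit under key-range partitioning."""
    return list(range(range_partition(low), range_partition(high) + 1))
```

Under key-range partitioning a scan from "a" to "f" touches one partition. Under hash partitioning the same scan has to ask every partition, because the hash scatters neighbouring keys. The price of the range scheme is that a burst of writes to neighbouring keys all hit the same partition.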

Then come the three chapters that give the book its reputation for intellectual seriousness. Chapter 7 on transactions begins with a demolition of the ACID acronym: “Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application.” This is not pedantry; it is a reminder that the database cannot know what your application considers correct. The taxonomy of weak isolation levels (read committed, snapshot isolation) and of the anomalies that slip past them (lost updates, write skew, phantoms) is the most lucid I have encountered. Kleppmann provides a comparative table that lets you see at a glance which anomaly each level permits, and then walks through three distinct paths to serializability (literal serial execution, as in VoltDB; two-phase locking; and serializable snapshot isolation) with the intellectual honesty to admit that each has costs, and that the “correct” choice is an engineering trade-off, not a moral one.

Chapter 8, “The Trouble with Distributed Systems,” is where Kleppmann’s voice shifts from tutorial to warning. “In distributed systems, suspicion, pessimism, and paranoia pay off.” The chapter catalogues network faults with almost sadistic thoroughness: asymmetric links — “just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction” — shark-bitten undersea cables, queues that explode under congestion. The treatment of clocks is similarly unforgiving: time-of-day clocks that jump backward due to leap seconds or NTP misconfiguration, monotonic clocks that drift, garbage collection pauses that can freeze a process long enough for a lease to expire without the process knowing it has been presumed dead. The fencing token pattern — a monotonically increasing number that lets storage reject writes from zombie leaders — is presented not as an optimization but as a survival necessity. The Byzantine fault model, while acknowledged as less relevant to most datacenter systems, is explained clearly enough that the reader understands why it matters for cross-organizational protocols and high-assurance environments. The chapter’s formal turn — system models, safety vs. liveness properties — is the book’s most abstract section, and Kleppmann does not pretend otherwise, but the preceding flood of concrete failure modes makes the abstraction feel earned. If you have never seen a clock go backward or a perfectly healthy node be declared dead by a consensus algorithm, this chapter will make you feel as though you have.
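
The fencing token pattern is simple enough to sketch. The classes below are hypothetical stand-ins: LockService hands out a strictly increasing token with each grant, and FencedStorage remembers the highest token it has seen and refuses anything older. Lease expiry and networking are left out; the sketch assumes the first client's lease has already lapsed by the time the second client acquires the lock.

```python
class LockService:
    """Grants the lock together with a monotonically increasing token (illustrative)."""

    def __init__(self):
        self._last_token = 0

    def acquire(self):
        self._last_token += 1
        return self._last_token


class FencedStorage:
    """Rejects writes whose token is older than one already accepted."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False            # zombie writer: its lease was superseded
        self.highest_token = token
        self.data[key] = value
        return True
```

A client that holds token 1, stalls in a long garbage collection pause, and wakes up after another client has written with token 2 will have its write rejected. The check lives in the storage layer because the paused client has no way to know it was presumed dead.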

Chapter 9, “Consistency and Consensus,” is the climax: linearizability, the CAP theorem, causality, total order broadcast, two-phase commit, and consensus algorithms all wrapped into a unified argument that these concepts are deeply intertwined. Kleppmann’s treatment of CAP is refreshingly grumpy: the theorem is true but almost useless for engineering decisions because it frames partition tolerance as optional and ignores latency. “Linearizability is slow — and this is true all the time, not only during a network fault.” That single sentence, supported by a discussion of minority-quorum strategies and the inherent network delay lower bound, is worth more than a hundred conference talks about eventual consistency. The proof sketch showing that linearizable compare-and-swap, total order broadcast, and consensus are equivalent is the book’s theoretical core, but Kleppmann keeps it at the level of diagrams and intuitions, referring the formal-minded to Lamport’s papers. The practical conclusion — that coordination services like ZooKeeper, etcd, and Chubby use consensus to provide a small set of linearizable primitives for the rest of the system to build on — ties the abstraction back to the day-to-day reality of partition assignment and leader election.

Part III, “Derived Data,” is where the book moves from what is to what could be. Chapter 10 on batch processing is a love letter to the Unix philosophy, with Doug McIlroy’s 1964 vision of garden-hose I/O and the observation that the principles of automation, rapid prototyping, and incremental iteration “sound remarkably like the Agile and DevOps movements of today.” Kleppmann traces a direct line from sort | uniq -c | sort -rn to MapReduce and from MapReduce to dataflow engines like Spark and Flink that avoid materializing intermediate state. The join taxonomy — reduce-side sort-merge, map-side broadcast, partitioned, merge — is the same analytical clarity applied to a different problem, and the comparison between Hadoop and MPP databases is a case study in trade-off thinking: schema-on-read vs. schema-on-write, fault tolerance via recomputation vs. via replication, diversity of storage models vs. single tightly integrated engine.
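
The line from the shell pipeline to MapReduce becomes visible when the pipeline is written out as sort, group, and count. This is a hedged sketch in plain Python, not MapReduce itself; the function name top_counts is an assumption, and the input is taken to be an iterable of already-extracted keys such as URLs from a log.

```python
from itertools import groupby


def top_counts(keys):
    """Rough equivalent of `sort | uniq -c | sort -rn` over an iterable of keys."""
    # sort: bring identical keys next to each other (the "shuffle" step)
    ordered = sorted(keys)
    # uniq -c: collapse each run of identical keys into (count, key) (the "reduce" step)
    counted = [(len(list(run)), key) for key, run in groupby(ordered)]
    # sort -rn: highest counts first
    return sorted(counted, key=lambda pair: pair[0], reverse=True)
```

The sort in the middle does the same job as the shuffle between mappers and reducers: it guarantees that all records with the same key arrive together, so the counting step can work on one run at a time.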

Chapter 11 on stream processing is the book’s most forward-looking technical section. It opens with John Gall’s aphorism: “A complex system that works is invariably found to have evolved from a simple system that works.” The chapter argues that streams are the natural generalization of batch, not an exotic special case, and lays out the architectural choice between AMQP/JMS-style message brokers and log-based brokers like Kafka and Kinesis with the same care the earlier chapters gave to storage engines. The centerpiece is the passage on state, streams, and immutability:

Whenever you have state that changes, that state is the result of the events that mutated it over time. … The key idea is that mutable state and an append-only log of immutable events do not contradict each other: they are two sides of the same coin.

This is not a new idea — Pat Helland’s work on immutability and the event sourcing tradition predate it — but Kleppmann’s synthesis makes it accessible as a design principle rather than a database internals topic. The coverage of change data capture, event sourcing, and the three types of stream joins (stream-stream, stream-table, table-table) is dense but practical, and the discussion of fault tolerance — microbatching, checkpointing, idempotence, rebuilding state — is honest about the messy gap between exactly-once semantics as a user-visible guarantee and the retries and deduplication happening underneath.
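
The "two sides of the same coin" claim can be shown in a few lines: current state is a fold over the event log. The event shape (kind, item, quantity) and the function names are assumptions chosen for illustration, with a shopping cart standing in for any mutable state.

```python
from functools import reduce


def apply_event(state, event):
    """Return the cart state that results from applying one immutable event."""
    kind, item, quantity = event
    new_state = dict(state)                 # never mutate the previous state
    if kind == "added":
        new_state[item] = new_state.get(item, 0) + quantity
    elif kind == "removed":
        remaining = new_state.get(item, 0) - quantity
        if remaining > 0:
            new_state[item] = remaining
        else:
            new_state.pop(item, None)
    return new_state


def replay(events):
    """Derive the current state by folding over the whole log from empty."""
    return reduce(apply_event, events, {})
```

Replaying a prefix of the log yields the state as it stood at that point, which is what makes rebuilding state after a failure, or deriving a new view from old events, a matter of running the same function again.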

Chapter 12, “The Future of Data Systems,” is where the book stops being merely a great technical work and becomes something rarer: a normative argument written in the first person by an engineer who has thought seriously about the moral weight of the systems he designs. Kleppmann advances a cluster of related theses: that derived data pipelines built on event logs are the most promising strategy for integrating heterogeneous systems; that databases should be “unbundled” along their write and read paths, with application code serving as the derivation function; that end-to-end operation identifiers should carry correctness across system boundaries, applying the end-to-end argument to databases. He quotes Saltzer, Reed, and Clark directly:

The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible.

From this he distills the most provocative slogan in the book: “Violations of timeliness are ‘eventual consistency,’ whereas violations of integrity are ‘perpetual inconsistency.’ I am going to assert that in most applications, integrity is much more important than timeliness.” This distinction, which could have been a footnote in a distributed systems textbook, becomes the pivot for a broader claim about auditability, “trust but verify,” and the design of self-validating systems.
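
The end-to-end operation identifier idea amounts to making a retried request harmless. The following is a minimal sketch under stated assumptions: the Ledger class is hypothetical, the client is assumed to generate one identifier per logical operation and reuse it on every retry, and a real system would record the identifier and the balance change atomically in durable storage.

```python
class Ledger:
    """Applies each operation ID at most once, however often it is retried."""

    def __init__(self):
        self.balances = {}
        self.applied = set()    # assumption: stored atomically with the balances

    def transfer(self, op_id, source, target, amount):
        if op_id in self.applied:
            return False        # duplicate delivery; effect already recorded
        self.balances[source] = self.balances.get(source, 0) - amount
        self.balances[target] = self.balances.get(target, 0) + amount
        self.applied.add(op_id)
        return True
```

A retry after a timeout may arrive late, which is a timeliness problem, but it cannot move the money twice, which would be an integrity problem.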

Then the chapter takes a sharp turn. After nearly five hundred pages of storage engines and consensus algorithms, Kleppmann quotes Bruce Schneier:

Data is the pollution problem of the information age, and protecting privacy is the environmental challenge. Almost all computers produce information. It stays around, festering. How we deal with it — how we contain it and how we dispose of it — is central to the health of our information economy.

The section “Doing the Right Thing” is a brief but serious ethical meditation on predictive analytics, algorithmic bias (citing Cathy O’Neil’s Weapons of Math Destruction), surveillance capitalism (Shoshana Zuboff), and the responsibilities of practitioners. Kleppmann does not offer solutions; he offers the simpler and more difficult thing: a demand that the reader stop treating these concerns as optional or downstream, and recognize that the structure of a data pipeline — what is logged, how long it is kept, who can query it — is a political decision. For a book that begins with hash indexes and B-trees, this is a remarkable climax, and it is precisely the kind of ethical turn that the engineering literature too often lacks.

The book’s limitations are real, and honesty requires naming them. It was published in 2017, and while the foundational material remains authoritative, the tool landscape has moved. Kafka’s exactly-once semantics, which the book treats as future work, are now a production feature; Kafka Streams and ksqlDB have matured; Flink has become a dominant stream processor; the lakehouse paradigm (Delta Lake, Iceberg, Hudi) has emerged, as have data mesh, vector databases, and serverless query engines. The unbundled databases thesis in Chapter 12 has been substantially vindicated — you can see its fingerprints on products like Materialize and Confluent Cloud — but the book cannot guide you through the current ecosystem. Chapters 8 and 9 are dense, and readers without prior exposure to distributed systems concepts may find themselves rereading them multiple times; a few worked exercises would have helped solidify the abstractions. The ethical section, while welcome, is brief and feels like an appendix to the technical argument rather than a fully integrated dimension of the design process — a criticism that Kleppmann himself might accept, given that he frames it as a starting point rather than a conclusion. Nevertheless, these are limitations of scope, not of execution, and they do not diminish what the book achieves within the territory it chooses to cover.

The book occupies a peculiar position in its intellectual tradition. It is not a work of original research, and it does not advance a new theorem. Its contribution is synthesis: Kleppmann has read the primary sources (Lamport on clocks, Codd on the relational model, Dean and Ghemawat on MapReduce, DeCandia et al. on Dynamo, Stonebraker on shared-nothing architectures) and he has translated their insights into a coherent narrative that a working engineer can absorb and apply. This is the pragmatist tradition in its best light: the measure of knowledge is not novelty but usefulness, and to make a thousand research papers useful to a developer debugging a replication lag bug is a genuine intellectual achievement. The analytic tradition, too, is present in the careful dissection of concepts that the industry uses sloppily: the separation of timeliness (seeing up-to-date state) from integrity (the absence of corruption and contradiction), the splitting of “scalable” into load parameters and performance percentiles, the insistence that linearizability, serializability, and causal consistency are distinct properties that must not be conflated. And the empiricist tradition runs through the entire book in its relentless citation of real-world failures (GitHub’s MySQL outage, Cassandra’s last-write-wins data loss, the Chaos Monkey), which makes the theoretical material feel concrete rather than academic.

At the same time, the closing chapter pulls the book into territory that the standard technical canon rarely enters: technology and society, surveillance and privacy, the ethics of platforms. Kleppmann does not pretend to be a philosopher, but he recognizes that a book about building data pipelines cannot end with a diagram of a stream join. The systems it describes are the infrastructure of surveillance capitalism, and the people who build them have a responsibility to understand what they are enabling. This is not a position that follows logically from the earlier chapters — you cannot derive an ethical obligation from a B-tree’s write amplification — but it is an honest acknowledgment of the world the book lives in. In the vocabulary of this platform’s canonical map, the book sits at the intersection of the analytic, the empiricist, and the pragmatist, but it also reaches into the territory of ethics and the critique of platform capitalism, though it does not stay there long enough to map the terrain fully.

Who should read Designing Data-Intensive Applications? The obvious audience is backend engineers, data engineers, and software architects who work with databases, message queues, and stream processors and who have begun to suspect that the abstractions they trust are leaking. The book will not teach you how to configure a specific PostgreSQL cluster or write a Kafka Streams topology, but it will give you a mental model that makes those tasks intelligible. It is also a valuable corrective for engineers who have been taught to optimize for interview performance — for the “design a URL shortener” problems that reduce distributed systems to a set of flash-card solutions — and who need to encounter the actual mess of partial failure, clock uncertainty, and conflicting guarantees. The prerequisite knowledge is modest: you need to know SQL, understand what a hash table is, and have written enough code to feel the pain of schema migrations. Kleppmann builds the rest from the ground up.

The book’s deepest value, though, is cultural. It models a way of thinking about software that is simultaneously rigorous, humble, and ethically awake. It refuses to let the reader off the hook with a slide deck of best practices; it demands that you understand why a particular choice works in this context and what the hidden costs are. The prose is clear, occasionally dry, and punctuated with epigraphs that signal intellectual ambition — Wittgenstein, Heraclitus, Aquinas, Gall — without ever losing sight of the practical goal. At a time when the infrastructure industry runs on marketing and conference keynotes, Designing Data-Intensive Applications is a quiet insistence that building systems that work requires knowing how they work, and that knowing how they work is a moral as much as a technical project. That is a position worth defending, and this book defends it with the kind of thoroughness that makes you wonder why anyone would ever settle for less.
