Mercy Markus

Fiddling with code 👩🏾‍💻

23 Mar 2025

Designing Data-Intensive Applications by Martin Kleppmann: Part I

A friend and colleague recommended Designing Data-Intensive Applications (DDIA) to me back in 2022. He found it invaluable while building and maintaining systems that needed to scale whilst being highly available and reliable. Although I attempted to start the book several times, I never made it past the first chapter. Over the weekend, I finally committed to diving into the book.

Why I Picked It Up Again

I spent most of 2024 on a sabbatical, experiencing life and myself outside the confines of a job. In between, I reflected on where to focus my energy when I was ready to return to work. During this time, I explored Systems Engineering, delving into the Linux Kernel and experimenting with the Rust programming language. While fascinating, it was also overwhelming. The steep learning curve left me feeling impatient and unsure of this path.

Conversations with a friend who works as a Data Engineer reminded me of my initial plan to transition to data engineering in late 2020. That plan had been set aside when I pursued an exciting Software Engineering role at Microsoft, where I had the opportunity to work on the developer platform and tooling for the Microsoft Mesh product. This ignited my interest in automating software integration and delivery, and in the role data plays in it. Revisiting DDIA and the Data Engineering path felt like a natural step, given my background in data science and experience as a software engineer.

This time, the goal is to approach the book with patience and consistency. I plan to document my learnings and progress here while applying them to the projects I work on. Let’s see how far I get!

Chapter 1: Reliable, Scalable, and Maintainable Applications

Chapter 1 provides a gentle introduction to reasoning about data-intensive applications. Martin Kleppmann explains what makes an application data-intensive rather than compute-intensive and outlines three primary concerns of most software systems: reliability, scalability, and maintainability. Here’s what stood out for me:

  1. Reliability: Reliability ensures applications function as intended, even when faults occur. Faults are inevitable, but systems must anticipate and recover from them so that they don’t turn into failures that disrupt users. The key question is: how do we bounce back and ensure minimal disruption when things go wrong?

  2. Scalability: Scalability addresses how well an application handles increased load. Identifying meaningful load parameters is crucial, e.g. tracking checkout completion times may be more valuable for an e-commerce store than overall site visits. This section also highlighted that measurements are inherently approximate, since the tools used for monitoring are based on estimations. My takeaway is to focus on the metrics that actually inform scaling decisions.

  3. Maintainability: Building a reliable and scalable system is just the beginning. Maintainability ensures that the system remains operable, efficient, and developer-friendly over time. Poorly maintained systems can frustrate the engineers working on them, leading to burnout or high turnover. The goal is to design systems that are straightforward to troubleshoot, update, and extend.

Chapter 2: Data Models and Query Languages

Data models form the foundation of any software system, providing a structured way to represent real-world entities and their relationships. Choosing the right data model is critical for ensuring the system is performant, scalable, and aligned with business needs.

Data Models

  1. Relational: Data is organized into tables (relations) with rows (tuples) and columns. Each table represents a specific entity, and relationships between entities are modeled through foreign keys and joins.

    • Strengths: Well-suited for use cases requiring robust relationships (one-to-one, one-to-many, many-to-many) and complex queries. SQL’s widespread adoption and mature ecosystem make it highly versatile.
    • Challenges: Modeling complex many-to-many relationships often requires intricate joins, which can impact performance at scale.
  2. Document: Data is stored in a nested, self-contained format, often resembling JSON objects. This model prioritizes data locality, reducing the need for joins by embedding related data.

    • Strengths: Flexible schema design, better performance for specific read-heavy workloads, and alignment with modern application code structures.
    • Challenges: Handling many-to-many relationships typically requires multiple queries or denormalization, which can complicate data consistency.
  3. Network: Data is represented as a graph-like structure with nodes and edges, allowing records to have multiple parents and children. Access paths are predefined, so records are reached by traversing links much as in a hierarchy.

    • Strengths: Efficient at representing many-to-one and many-to-many relationships.
    • Challenges: Developers must remember access paths, and the rigidity of predefined paths can limit flexibility.
  4. Graph: Designed for highly interconnected data, graph models use nodes to represent entities and edges for relationships.

    • Strengths: Intuitive for complex relationships and queries involving networks, such as social connections or recommendation systems.
    • Challenges: Performance can degrade with large-scale graphs, and specialized databases are often required.

Understanding these models and their trade-offs helps in selecting the best fit for the application, ensuring efficient data handling and system performance.
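
To make the relational vs. document trade-off concrete, here’s a small sketch in Python (all table and field names are invented): the same profile modeled as normalized tables that need join-like lookups to reassemble, and as a single self-contained document.

```python
# A toy illustration (all names are invented) of the same profile data
# modeled two ways: normalized relational tables vs. a single nested document.

# Relational: entities live in separate "tables"; relationships use foreign keys.
users = [{"user_id": 1, "name": "Ada"}]
positions = [{"position_id": 10, "user_id": 1, "title": "Engineer", "org_id": 100}]
organizations = [{"org_id": 100, "org_name": "Initech"}]

# Reassembling Ada's profile requires join-like lookups across the tables.
def relational_profile(user_id):
    user = next(u for u in users if u["user_id"] == user_id)
    jobs = [
        {"title": p["title"],
         "org": next(o["org_name"] for o in organizations if o["org_id"] == p["org_id"])}
        for p in positions if p["user_id"] == user_id
    ]
    return {**user, "positions": jobs}

# Document: the same profile as one nested, self-contained object, so a single
# read returns everything (good data locality, no joins).
document_profile = {
    "user_id": 1,
    "name": "Ada",
    "positions": [{"title": "Engineer", "org_name": "Initech"}],
}

print(relational_profile(1))
print(document_profile)
```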

Query Languages

  1. SQL: A declarative language that lets you specify what data you want without spelling out how to retrieve it. You can define conditions (e.g. WHERE, LIKE) and transformations (e.g. GROUP BY, ORDER BY) without worrying about execution order. The database’s query optimizer takes care of these details, making SQL both powerful and user-friendly (see the sketch after this list).

  2. MapReduce: A programming model for processing large datasets across distributed systems. Though not a declarative or imperative language, it offers a framework for performing read-only queries and transformations over vast amounts of data. Some NoSQL data stores (e.g. MongoDB & CouchDB) incorporate a simplified form of MapReduce for query purposes.

  3. Cypher: A declarative query language for property graphs, designed specifically for Neo4j. It enables intuitive queries on graph structures, focusing on relationships between nodes and edges.

  4. SPARQL: An RDF query language that retrieves and manipulates data stored in triple stores which use the RDF (Resource Description Framework) data model. SPARQL is particularly suited for semantic web applications, where data from different sources is linked and combined into a unified “web of data”.
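
As a rough illustration of the declarative vs. MapReduce styles, here’s a toy sketch using Python’s built-in sqlite3 module and invented data. It isn’t how any particular NoSQL store implements MapReduce, just the shape of the idea: you supply a map function and a reduce function, and the framework handles the rest.

```python
# A toy contrast between a declarative SQL query and a hand-rolled
# MapReduce-style equivalent (data and table name are invented).
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (animal TEXT, num INTEGER)")
conn.executemany("INSERT INTO sightings VALUES (?, ?)",
                 [("shark", 3), ("shark", 4), ("octopus", 1)])

# Declarative: say WHAT you want; the query optimizer decides how to get it.
rows = conn.execute(
    "SELECT animal, SUM(num) FROM sightings GROUP BY animal ORDER BY animal"
).fetchall()
print(rows)  # [('octopus', 1), ('shark', 7)]

# MapReduce-style: emit key/value pairs in map, combine values per key in reduce.
def map_fn(row):
    animal, num = row
    yield animal, num

def reduce_fn(key, values):
    return key, sum(values)

grouped = defaultdict(list)
for row in conn.execute("SELECT animal, num FROM sightings"):
    for key, value in map_fn(row):
        grouped[key].append(value)

print(sorted(reduce_fn(k, v) for k, v in grouped.items()))  # [('octopus', 1), ('shark', 7)]
```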

Each query language is optimized for specific use cases, making it essential to understand the trade-offs when selecting a database and its associated querying mechanisms.

Chapter 3: Storage and Retrieval

This chapter explains how databases store and retrieve data, highlighting why different workloads, transactional (OLTP) versus analytical (OLAP), require different storage engines, and how to choose the right one.

A key concept in data retrieval is the index: an additional data structure, derived from the primary data, that keeps track of where to find the values for certain keys (not necessarily the primary or secondary keys). Indexes can be kept in memory or on disk to speed up queries. However, they come at a cost: writes become slower, because the index must also be updated whenever data is modified.

Choosing the right index depends on the application’s query patterns and usage needs.

Types of Indexes

  1. Hash Index: A hash function maps each key to the location of its value, allowing fast lookups. The hash table is kept in memory, so this approach works best when all the keys fit in RAM. Its downsides are that range queries are inefficient and each key must be looked up individually.
  2. Sorted String Tables (SSTables): writes go first to an in-memory balanced tree, a.k.a. a memtable (e.g. a red-black tree), which keeps them sorted; when the memtable grows large enough, its contents are written to disk as a sorted segment file. The advantage over hash indexes is that not all keys need to fit in memory: because the on-disk data is sorted, a sparse in-memory index is enough, and a key that isn’t in the sparse index can be found by jumping to the nearest indexed key and scanning forward from there (a minimal sketch of this idea follows the list).
  3. Log-Structured Merge-Trees (LSM-Trees): build on memtables and SSTables by merging and compacting multiple sorted segments in the background. A downside of the LSM-Tree approach is that looking up a non-existent key can be slow and expensive, because the memtable and then every segment on disk must be checked; Bloom filters mitigate this by approximating the contents of a set and cheaply reporting when a key is definitely not in the database. Another downside is that compaction can hog system resources, leaving other operations waiting. On the plus side, because data is stored in sorted order, range queries are efficient, and because disk writes are compact and sequential, LSM-Trees can support very high write throughput.
  4. B-Trees: The most widely used index, found in almost all relational databases and many non-relational ones. Like SSTables, B-Trees keep key-value pairs sorted, which allows efficient lookups and range queries. But while log-structured indexes use variable-sized segments, B-Trees break the database into fixed-size blocks or pages (traditionally 4 KB) and read or write one page at a time. Each page has an address and can reference other pages; a root page is the entry point to the tree and contains keys that point to the other pages. Updating the value of a key means finding the page that contains the key, changing the value, and writing the page back to disk. Adding a new key to a page without enough spare space means splitting the page into two, moving half of the keys to the new page, and updating the parent page to point to both halves; this keeps the tree balanced. Because a crash can occur mid-write, B-Trees use a write-ahead log (WAL), an on-disk log to which every change is written before it is applied, to ensure durability, and locks prevent multiple writers from modifying the same page at the same time. Reads are generally thought to be faster on B-Trees than on LSM-Trees, and an advantage of B-Trees is that each key exists in exactly one place in the index.
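
To make the memtable/SSTable/sparse-index idea from items 2 and 3 concrete, here’s a deliberately simplified Python sketch. The file format, flush logic, and index spacing are all assumptions for illustration; real engines such as LevelDB or RocksDB are far more involved.

```python
# A minimal sketch: an in-memory memtable is flushed to disk as a sorted
# segment (SSTable), and lookups go through a sparse in-memory index.
import bisect
import json
import os
import tempfile

memtable = {}  # stand-in for a balanced tree; we simply sort on flush

def put(key, value):
    memtable[key] = value

def flush_to_sstable(path, sparse_every=2):
    """Write sorted key/value lines to disk, indexing every Nth key's byte offset."""
    sparse_index = []  # list of (key, byte_offset), sorted by key
    with open(path, "wb") as f:
        for i, key in enumerate(sorted(memtable)):
            if i % sparse_every == 0:
                sparse_index.append((key, f.tell()))
            f.write((json.dumps([key, memtable[key]]) + "\n").encode())
    memtable.clear()
    return sparse_index

def get(path, sparse_index, key):
    """Jump to the nearest indexed key at or before `key`, then scan forward."""
    keys = [k for k, _ in sparse_index]
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None  # smaller than the smallest stored key: definitely absent
    with open(path, "rb") as f:
        f.seek(sparse_index[pos][1])
        for line in f:
            k, v = json.loads(line)
            if k == key:
                return v
            if k > key:
                return None  # keys are sorted, so we can stop scanning early
    return None

put("banana", 2); put("apple", 1); put("cherry", 3); put("date", 4)
fd, sstable_path = tempfile.mkstemp(suffix=".sstable")
os.close(fd)
index = flush_to_sstable(sstable_path)
print(get(sstable_path, index, "cherry"))  # 3
print(get(sstable_path, index, "mango"))   # None (the case a Bloom filter would short-circuit)
```

In a real LSM engine the memtable would be a balanced tree, segments would be merged and compacted in the background, and a Bloom filter would avoid scanning segments for keys that don’t exist.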

Other Indexing Structures

Storing values within the index itself can be useful for queries that access the same data frequently. A clustered index stores the actual row data in the index rather than just a reference to the data’s location, while a covering index stores a subset of a table’s columns in the index so that some queries can be answered from the index alone. Despite speeding up reads, both approaches slow down writes, because the index must also be updated when data is modified, and they require additional storage.
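
Here’s a small sqlite3 sketch of the covering-index idea (table and column names are invented): an index on (customer_id, total) can answer a query that only touches those columns without reading the table’s rows.

```python
# A toy covering-index example using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 42, 10.0), (2, 42, 15.0), (3, 7, 3.5)])

# Index on (customer_id, total): queries needing only these columns are "covered".
conn.execute("CREATE INDEX idx_orders_customer_total ON orders (customer_id, total)")

# Inspect the plan: SQLite should report that it searched via the index, and since
# every selected column lives in the index, it never has to touch the table rows.
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = 42"
):
    print(row)

print(conn.execute(
    "SELECT SUM(total) FROM orders WHERE customer_id = 42"
).fetchone())  # (25.0,)
```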

Multi-dimensional indexes such as R-Trees are used to store spatial data and can filter on multiple dimensions simultaneously, which reduces the number of passes needed to retrieve specific data.

Full-text search engines and fuzzy indexes allow you to search for synonyms, grammatical variations, and occurrences of words near each other in the same document, whereas most indexes don’t let you search for similar keys, such as misspelled words. Lucene uses an SSTable-like structure for its term dictionary; a small in-memory index tells queries the offset in the sorted file at which to look for a key.

In-memory databases are becoming more commonplace as RAM gets cheaper and durable alternatives are built. Some, such as Memcached, are used purely as caches and can tolerate losing the data in RAM on restart, since it can be repopulated from the underlying database. Others achieve durability through special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to another machine.

Key Benefits of In-Memory Databases over On-Disk Databases

  1. They can be faster because they avoid the overhead of encoding in-memory data structures into a form that can be written to disk.
  2. They can offer data models that are difficult to implement with disk-based indexes (e.g. the priority queues and sets offered by Redis).

Transaction Processing vs. Analytics

The term “transaction” processing was coined in the early days of business data processing, when most database writes corresponded to commercial transactions being made. The name stuck even after databases came to be used for far more, because the access pattern, now called Online Transaction Processing (OLTP), stayed similar; a transaction simply means a logical unit of reads and writes. Databases then also started being used for data analytics, which has a very different access pattern, Online Analytical Processing (OLAP): queries typically scan large numbers of records and calculate aggregates over a few fields rather than returning raw records to the user. The differences between OLTP and OLAP aren’t always clear-cut, but they have some differing characteristics:

| Property | OLTP | OLAP |
| --- | --- | --- |
| Purpose | Manage and process real-time transactions | Analyze large volumes of data to support decision making |
| Data source | Real-time, transactional data from a single source | Historical, aggregated data from multiple sources |
| Data model | Normalized or denormalized models | Star schema, snowflake schema, or other analytical models |
| Main read pattern | Small number of records per query, fetched by key | Aggregate over a large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| Primarily used by | End user/customer, via web application | Internal analyst, for decision support |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
| Response time | Shorter, typically milliseconds | Longer, typically seconds or minutes |
| Example applications | Processing payments, customer data management, order processing | Analyzing trends, predicting customer behavior, identifying profitability |

Data Warehousing

Previously, the same relational databases were used for both workloads. But as systems grew more complex, transactional databases needed to stay highly available and process transactions with low latency, so running analytic queries against them, which are often long-running operations that can hurt the performance of concurrent transactions, came to be frowned upon. This led to the rise of data warehouses. Despite this divergence, most data warehouses still use a relational model, thanks to SQL’s strengths at handling analytical queries.

A data warehouse is a separate database that holds a read-only copy of the data found in the OLTP databases, obtained either from periodic data dumps (batch processing) or from a continuous stream of updates, and transformed, cleaned, and loaded into the warehouse (Extract-Transform-Load; a toy sketch follows). An advantage of data warehousing over using OLTP databases for analytics is that the warehouse can be optimized for analytic access patterns; the indexing algorithms from the start of the chapter work well for OLTP but aren’t suited to analytic queries.
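
Here’s that toy ETL sketch, using Python’s sqlite3 with invented schemas: extract rows from an “OLTP” store, clean them up, and bulk-load them into a separate “warehouse” table.

```python
# A toy Extract-Transform-Load pipeline (schemas and values are invented).
import sqlite3

oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE sales (id INTEGER, product TEXT, amount TEXT, ts TEXT)")
oltp.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, " Widget ", "10.50", "2025-03-01T09:30:00"),
    (2, "gadget",   "7.25",  "2025-03-01T14:10:00"),
])

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_sales (sale_id INTEGER, product TEXT, amount REAL, sale_date TEXT)"
)

# Extract: read the raw rows from the transactional store.
rows = oltp.execute("SELECT id, product, amount, ts FROM sales").fetchall()

# Transform: normalize product names, cast amounts to numbers, keep only the date.
cleaned = [
    (sale_id, product.strip().lower(), float(amount), ts[:10])
    for sale_id, product, amount, ts in rows
]

# Load: bulk-insert into the warehouse's analysis-friendly schema.
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", cleaned)

print(warehouse.execute(
    "SELECT sale_date, SUM(amount) FROM fact_sales GROUP BY sale_date"
).fetchall())  # [('2025-03-01', 17.75)]
```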

Analytics Processing Schemas: Stars and Snowflakes

The data models highlighted in Chapter 2 are mainly used in OLTP systems, chosen according to the system’s needs. For OLAP, the options aren’t as diverse.

Star schema, a.k.a. dimensional modeling, is the more common option. It is characterized by a large fact table in the middle, linked via foreign keys to several surrounding dimension tables (hence the star shape). Fact tables are usually very large and consist of individual events or transactions (rows), because this gives analysts the most flexibility. The different dimensions represent the who, what, where, when, how, and why of the events in the fact table.

Snowflake schemas, an extension of the star schema in which dimensions are broken down further into sub-dimensions, are another option. They are more normalized (less repetition) because additional dimension tables are used, but analysts tend to prefer star schemas because snowflake schemas are more complex to work with (more joins are needed to extract information).
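
A minimal star-schema sketch, again with sqlite3 and invented tables: a central fact table holds foreign keys to its dimension tables, and an analytical query joins outward and aggregates.

```python
# A tiny star schema: fact_sales in the middle, dim_date and dim_product around it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT, weekday TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, sku TEXT, category TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, net_price REAL, quantity INTEGER);

INSERT INTO dim_date    VALUES (1, '2025-03-22', 'Saturday'), (2, '2025-03-23', 'Sunday');
INSERT INTO dim_product VALUES (10, 'SKU-APPLE', 'fruit'), (11, 'SKU-BREAD', 'bakery');
INSERT INTO fact_sales  VALUES (1, 10, 2.50, 3), (1, 11, 3.00, 1), (2, 10, 2.50, 5);
""")

# "Revenue by weekday and category": join the fact table out to its dimensions,
# then aggregate. Analysts slice the fact table along whichever dimensions they need.
query = """
SELECT d.weekday, p.category, SUM(f.net_price * f.quantity) AS revenue
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.weekday, p.category
ORDER BY revenue DESC
"""
for row in conn.execute(query):
    print(row)
```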

Column-Oriented Storage

OLTP databases store data in a row-oriented format, and this fits their use case: applications usually retrieve a small number of records per query, fetched by key (often via pagination). Most data warehouse tables, by contrast, are very wide, sometimes with hundreds of columns, and analytical queries usually touch only a subset of them. It therefore makes sense to store all the values from a column together, rather than loading entire rows into memory only to filter out most of their fields.

Columnar storage works by keeping the values in every column file in the same row order, so the fields of a row can be accurately stitched back together at query time. Query performance can be improved further, especially for columns with many repeated values, by compressing them with techniques like bitmap encoding and building indexes on top of that.

Bitmap Encoding Analogy: Think of it like a dictionary where each word has been assigned a key. All the keys are laid out and their positions are known, so if the word “apple” appears somewhere in a sentence/column, its position in the bitmap will contain a “1” and the non-apple words in the sentence/column will be “0”. Large bitmaps like this are usually sparse and can be compressed further by applying run-length encoding, e.g. the apple bitmap can be represented as 5, 1 (5 zeroes, 1 one, rest zeroes) as opposed to 0, 0, 0, 0, 0, 1, 0, 0, 0, 0.
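
Here’s a toy Python sketch of the same idea (the column values are invented): one bitmap per distinct value in a column, compressed with run-length encoding, plus a bitwise OR standing in for a multi-value filter.

```python
# Toy bitmap encoding for a column with repeated values.
from itertools import groupby

column = ["apple", "pear", "apple", "apple", "kiwi", "pear"]

# One bitmap per distinct value: 1 where the row holds that value, else 0.
bitmaps = {
    value: [1 if v == value else 0 for v in column]
    for value in sorted(set(column))
}
print(bitmaps["apple"])  # [1, 0, 1, 1, 0, 0]

# Run-length encoding: store (bit, run_length) pairs instead of every single bit.
def run_length_encode(bits):
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

print(run_length_encode(bitmaps["apple"]))  # [(1, 1), (0, 1), (1, 2), (0, 2)]

# Bitmaps make multi-value filters cheap: "apple OR kiwi" is a bitwise OR.
apple_or_kiwi = [a | k for a, k in zip(bitmaps["apple"], bitmaps["kiwi"])]
print(apple_or_kiwi)  # [1, 0, 1, 1, 1, 0]
```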

Note: Column storage doesn’t only apply to relational data models. Non-relational data, such as document data models, can also be stored in columnar form, e.g. Parquet, which is based on Google’s Dremel.

Besides reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles through a technique known as vectorized processing.

The primary load on a data warehouse is large read-only queries, which makes column-oriented storage, compression, and sorting great for optimizing reads, while writes become a lot more difficult. Updating in place, as B-Trees do, is not possible with compressed columns: inserting a row in the middle of a sorted table would most likely require rewriting all the column files, since rows are identified by their position within a column.

Thankfully, the LSM-tree approach discussed earlier handles this well: all writes first go to an in-memory store, where they are sorted, before being written to disk. Queries then need to check both the in-memory data and the data on disk and combine the two to get complete, accurate results; this implementation detail is hidden by the query optimizer.

Not every data warehouse uses columnar storage, although this is now becoming the norm. Traditional row-oriented databases and a few other architectures are also used.

Aggregation

OLAP queries tend to run aggregate functions such as COUNT, SUM, AVG, MIN, and MAX. If multiple queries use the same aggregates, it makes sense to cache the commonly used ones. A materialized view is one example of this: in a relational data model, it’s a table-like object whose contents are the actual results of some query, written to disk, whereas a virtual view (a.k.a. a view) is more like a shortcut for writing queries (e.g. frequently run ones). The contents of a materialized view need to be updated when the underlying data changes, which makes writes more expensive; this is why materialized views aren’t often used in OLTP databases.
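
A small sqlite3 sketch of the difference: SQLite has virtual views built in but no materialized views, so the materialized one is emulated here (purely as an illustration) with an ordinary table that has to be refreshed by hand.

```python
# Virtual view vs. an emulated "materialized view" (names and data invented).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("apple", 2.5), ("apple", 4.0), ("kiwi", 1.0)])

# Virtual view: just a saved query, re-executed against current data on every read.
conn.execute("CREATE VIEW sales_by_product_v AS "
             "SELECT product, SUM(amount) AS total FROM sales GROUP BY product")

# Emulated materialized view: the query's results written out as a real table.
conn.execute("CREATE TABLE sales_by_product_m AS "
             "SELECT product, SUM(amount) AS total FROM sales GROUP BY product")

conn.execute("INSERT INTO sales VALUES ('kiwi', 9.0)")  # a new write arrives

print(conn.execute("SELECT * FROM sales_by_product_v ORDER BY product").fetchall())  # reflects the new row
print(conn.execute("SELECT * FROM sales_by_product_m ORDER BY product").fetchall())  # stale until refreshed

# Refreshing the materialized copy is the extra cost that writes must pay for.
conn.execute("DELETE FROM sales_by_product_m")
conn.execute("INSERT INTO sales_by_product_m "
             "SELECT product, SUM(amount) FROM sales GROUP BY product")
print(conn.execute("SELECT * FROM sales_by_product_m ORDER BY product").fetchall())
```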

Data/OLAP cubes are a special case of materialized views which are made up of a grid of aggregates grouped by different dimensions.

For example, imagine that your fact table contains foreign keys to two dimension tables, say date_key and product_key. You can represent this as a 2-D cube of aggregates: a table whose columns are product_key values and whose rows are date_key values, with each cell containing an aggregate such as the sum of net_price. The advantage is that you can easily read off the sales by date across all products, or the sales by product for a range of dates.
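
Here’s a toy version of that cube in plain Python (the fact rows are invented): aggregates are precomputed into a date-by-product grid, and the row and column totals fall straight out of it.

```python
# A toy 2-D data cube: SUM(net_price) grouped by (date_key, product_key).
from collections import defaultdict

fact_rows = [  # (date_key, product_key, net_price)
    ("2025-03-22", "P1", 10.0),
    ("2025-03-22", "P2", 4.0),
    ("2025-03-23", "P1", 7.5),
    ("2025-03-23", "P1", 2.5),
]

# Build the grid of aggregates: cube[date][product] = SUM(net_price).
cube = defaultdict(lambda: defaultdict(float))
for date_key, product_key, net_price in fact_rows:
    cube[date_key][product_key] += net_price

# Sales by date across all products (sum along each row of the cube)...
print({date: sum(products.values()) for date, products in cube.items()})
# {'2025-03-22': 14.0, '2025-03-23': 10.0}

# ...and sales by product across all dates (sum down each column).
totals_by_product = defaultdict(float)
for products in cube.values():
    for product_key, total in products.items():
        totals_by_product[product_key] += total
print(dict(totals_by_product))  # {'P1': 20.0, 'P2': 4.0}
```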

The advantage of a materialized data cube is that certain queries become very fast because they’ve been precomputed. The disadvantage is that queries over fields that aren’t dimensions of the cube become impossible, which is why most data warehouses keep the raw data and use aggregates such as data cubes only as a performance boost for specific queries.

Chapter 4: Encoding and Evolution

WIP: To be continued

Final Thoughts

DDIA has often been referred to as the “holy grail” for designing and understanding data-intensive systems. Part 1 so far has provided valuable insights, and I’m excited to continue this journey.