Mercy Markus

Fiddling with code 👩🏾‍💻

06 Jan 2025

Designing Data-Intensive Applications by Martin Kleppmann: Part I

A friend and colleague recommended Designing Data-Intensive Applications (DDIA) to me back in 2022. He found it invaluable while building and maintaining systems that needed to be highly available and reliable. Although I attempted to start the book several times, I never made it past the first chapter. Over the weekend, I finally committed to diving into the book.

Why I Picked It Up Again

I spent most of 2024 on a sabbatical, experiencing life and myself outside the confines of a job. In between, I reflected on where to focus my energy when I was ready to return to work. During this time, I explored Systems Engineering, delving into the Linux Kernel and experimenting with the Rust programming language. While fascinating, it was also overwhelming. The steep learning curve left me feeling impatient and unsure of this path.

Conversations with a friend who works as a Data Engineer reminded me of my initial plan to transition to data engineering in late 2020. That plan had been set aside when I pursued an exciting Software Engineering role at Microsoft. Revisiting DDIA and the Data Engineering path felt like a natural step, given my background in data science and experience as a software engineer.

This time, the goal is to approach the book with patience and consistency. I plan to read about 100 pages weekly, documenting my learnings and progress here. Let’s see how far I get!

Chapter 1: Reliable, Scalable, and Maintainable Applications

Chapter 1 provides a gentle introduction to reasoning about data-intensive applications. Martin Kleppmann explains what makes an application data-intensive rather than compute-intensive and outlines three primary concerns of most software systems: reliability, scalability, and maintainability. Here’s what stood out to me:

  1. Reliability: Reliability ensures applications function as intended, even when faults occur. Failures are inevitable, but systems must anticipate and recover from them to minimize disruption. The key question is: How do we bounce back and ensure minimal disruption when failures occur?

  2. Scalability: Scalability addresses how well an application handles increased load. Identifying meaningful load parameters is crucial e.g. tracking checkout completion times may be more valuable for an e-commerce store than overall site visits. This section also highlighted how all measurements are inherently inaccurate, as the tools used for monitoring are based off estimations. The takeaway is to focus on the right metrics that help inform scaling decisions.

  3. Maintainability: Building a reliable and scalable system is just the beginning. Maintainability ensures that the system remains operable, efficient, and developer-friendly over time. Poorly maintained systems can frustrate the engineers working on them, leading to burnout or high turnover. The goal is to design systems that are straightforward to troubleshoot, update, and extend.

Chapter 2: Data Models and Query Languages

Data models form the foundation of any software system, providing a structured way to represent real-world entities and their relationships. Choosing the right data model is critical for ensuring the system is performant, scalable, and aligned with business needs.

Data Models

  1. Relational: Data is organized into tables (relations) with rows (tuples) and columns. Each table represents a specific entity, and relationships between entities are modeled through foreign keys and joins.

    • Strengths: Well-suited for use cases requiring robust relationships (one-to-one, one-to-many, many-to-many) and complex queries. SQL’s widespread adoption and mature ecosystem make it highly versatile.
    • Challenges: Modeling complex many-to-many relationships often requires intricate joins, which can impact performance at scale.
  2. Document: Data is stored in a nested, self-contained format, often resembling JSON objects. This model prioritizes data locality, reducing the need for joins by embedding related data.

    • Strengths: Flexible schema design, better performance for specific read-heavy workloads, and alignment with modern application code structures.
    • Challenges: Handling many-to-many relationships typically requires multiple queries or denormalization, which can complicate data consistency.
  3. Network: Data is represented as a graph-like structure with nodes and edges, allowing multiple parent-child relationships. Access paths are predefined and hierarchical.

    • Strengths: Efficient at representing many-to-one and many-to-many relationships.
    • Challenges: Developers must remember access paths, and the rigidity of predefined paths can limit flexibility.
  4. Graph: Designed for highly interconnected data, graph models use nodes to represent entities and edges for relationships.

    • Strengths: Intuitive for complex relationships and queries involving networks, such as social connections or recommendation systems.
    • Challenges: Performance can degrade with large-scale graphs, and specialized databases are often required.

Understanding these models and their trade-offs helps in selecting the best fit for the application, ensuring efficient data handling and system performance.

Query Languages

  1. SQL: A declarative language that enables you to specify what data you want, without the specific on how to retrieve it. You can define conditions (e.g. WHERE, LIKE) and transformations (e.g. GROUP BY, ORDER BY) without worrying about execution order. The database’s query optimizer takes care of these details, making SQL both powerful and user-friendly.

  2. MapReduce: A programming model for processing large datasets across distributed systems. Though not a declarative or imperative language, it offers a framework for performing read-only queries and transformations over vast amounts of data. Some NoSQL data stores incorporate a simplified form of MapReduce for query purposes.

  3. . Cypher: A declarative query language for property graphs, designed specifically for Neo4j. It enables intuitive queries on graph structures, focusing on relationships between nodes and edges.

  4. SPARQL: An RDF query language that retrieves and manipulates data stored in Resource Description Framework (RDF) databases. SPARQL is particularly suited for semantic web applications, where data from different sources is linked and combined into a unified “web of data”.

Each query language is optimized for specific use cases, making it essential to understand the trade-offs when selecting a database and its associated querying mechanisms.

Final Thoughts

DDIA has often been referred to as the “holy grail” for designing and understanding data-intensive systems. Part 1 so far has provided valuable insights, and I’m excited to continue this journey.

Weekly Progress Report

Week Start Page End Page
1: 6th Jan 2025 0 92
2: 13th Jan 2025 93 163
3: 20th Jan 2025 164
4: 27th Jan 2025
5: 3rd Feb 2025
6: 10th Feb 2025
7: 17th Feb 2025
8: 24th Feb 2025
9: 3rd Mar 2025
10: 10th Mar 2025