[Book Notes] Designing Data-Intensive Applications Ch1: Reliable, Scalable, and Maintainable Applications


Chapter 1. Reliable, Scalable, and Maintainable Applications

Reliability

Performs its functions as expected

Tolerates unexpected behavior, e.g. user mistakes or unexpected usage

Performance is good enough for the required use case

Prevents unauthorized access and abuse

Fault: one component of the system deviates from its spec

Failure: the system as a whole stops providing the required service to users

💡

If we turn down one service, what will users see? Will we lose any data?
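The fault/failure distinction can be sketched in a few lines. This is a minimal illustration, not a real client library: `ReplicaDown`, `read_with_failover`, and the two replica functions are hypothetical names made up for this note. One replica faulting is tolerated; only when every replica faults does the user see a failure.

```python
class ReplicaDown(Exception):
    """Raised when a single replica cannot serve a read (a fault)."""


def read_with_failover(key, replicas):
    """Try each replica in turn; fail only if every replica faults."""
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaDown:
            continue  # one component faulted; try the next one
    # Every replica faulted: the fault has become a user-visible failure.
    raise RuntimeError(f"all {len(replicas)} replicas failed for {key!r}")


def broken_replica(key):
    # Hypothetical replica that always faults.
    raise ReplicaDown("disk error")


def healthy_replica(key):
    # Hypothetical replica holding a single record.
    return {"user:1": "alice"}[key]


print(read_with_failover("user:1", [broken_replica, healthy_replica]))
```

With one broken and one healthy replica the read still succeeds, which is the answer to "what will users see?" for that particular fault.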

Hardware Faults

How to mitigate:

redundancy of hardware components

data backup

software fault-tolerance techniques

Software Errors

Such as:

A software bug triggered by bad input

A runaway process that uses up a shared resource (e.g. CPU time, memory, disk space, or network bandwidth)

A service that becomes unresponsive or returns corrupted responses

Cascading failures, where a fault in one component triggers a fault in another, which triggers further faults

Human Errors

The most common cause of outages

How to mitigate:

Consider “authorization” in system design

Well-designed abstractions, APIs, and admin interfaces

Non-production environments (release cycle, environment variables, smooth configuration)

Tests at all levels

Easy configuration (CI/CD) to roll back to old code or roll out new code

Monitoring and alerts

Infrastructure as code + Configuration as code
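One way to make rollouts and rollbacks cheap is to expose new code to a small percentage of users first. The sketch below is an assumption of how that could look, not something from the chapter: `in_rollout` and `handle_request` are hypothetical names, and hashing the user ID into 100 buckets is just one simple way to get a stable assignment.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < percent


def handle_request(user_id: str, rollout_percent: int) -> str:
    # Setting rollout_percent back to 0 acts as the rollback: no redeploy needed.
    if in_rollout(user_id, rollout_percent):
        return "new code path"
    return "old code path"


print(handle_request("alice", 0))
print(handle_request("alice", 100))
```

Because the bucket depends only on the user ID, a user who sees the new code at 10% keeps seeing it at 50%, so a bug affects a bounded, predictable subset of users.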

Scalability

If the system grows in a particular way, what are our options for coping with the growth?

Load parameters

Requests per second to a web server

The ratio of reads to writes in a DB

The number of simultaneously active users in a chat room

The hit rate on a cache

💡

Think in numbers, and always use numbers to make decisions. For example, how many tweets are written to the DB per second on average, and how many followers read each tweet on average? How do you distribute the load? What about users who have 1 million followers? What if there is a holiday and everyone tweets at once?
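The tweet example can be turned into back-of-the-envelope arithmetic. The rates below are the figures the book quotes for Twitter in 2012 (about 4.6k tweets/sec on average, over 12k at peak, about 75 followers per tweet on average); treat them as illustrative inputs. `fanout_writes_per_sec` is a hypothetical helper, and the 1 million follower account comes from the question above.

```python
AVG_TWEETS_PER_SEC = 4_600       # average post rate quoted in the book
PEAK_TWEETS_PER_SEC = 12_000     # peak post rate quoted in the book
AVG_FOLLOWERS = 75               # average deliveries per tweet quoted in the book
BIG_ACCOUNT_FOLLOWERS = 1_000_000  # the "1 million" case from these notes


def fanout_writes_per_sec(tweets_per_sec: int, followers_per_author: int) -> int:
    """Timeline writes per second if every tweet is copied to each follower."""
    return tweets_per_sec * followers_per_author


avg_writes = fanout_writes_per_sec(AVG_TWEETS_PER_SEC, AVG_FOLLOWERS)
peak_writes = fanout_writes_per_sec(PEAK_TWEETS_PER_SEC, AVG_FOLLOWERS)
one_big_tweet = fanout_writes_per_sec(1, BIG_ACCOUNT_FOLLOWERS)

print(avg_writes)     # 345000
print(peak_writes)    # 900000
print(one_big_tweet)  # 1000000
```

A single tweet from the large account costs more timeline writes than a full second of peak traffic from average users, which is why the distribution of followers per user is itself a load parameter.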

Performance

When load increases, how does performance change? Either keep resources fixed and see how performance degrades, or ask how much resources must grow to keep performance unchanged.

Throughput: the number of records, or the volume of data, we can process per second
Response time: the time between a client sending a request and receiving a response (what the user cares about)
Latency: the duration that a request waits to be handled (not seen directly by a user)

Not every request takes the same amount of time to complete. Sources of random extra delay:

Context switch to a background process
Loss of a network packet
TCP retransmission
A garbage collection pause
A page fault forcing a read from disk
Mechanical vibrations in the server rack

Metrics for evaluating and analyzing response time:

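Mean (average): does not tell how many users actually experienced that delay
Percentiles: e.g. p95, p99
Median (p50): half of requests are faster than this and the other half are slower; shows how long users typically have to wait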
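To see why the mean can mislead, here is a small sketch comparing mean, median, and high percentiles on made-up response times. The sample data and the `percentile` helper (a nearest-rank implementation) are assumptions for illustration; real monitoring systems often use approximations over rolling windows.

```python
import math
import statistics


def percentile(sorted_times, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    rank = math.ceil(p * len(sorted_times) / 100)
    return sorted_times[max(rank, 1) - 1]


# Ten made-up response times in milliseconds; one request is an outlier.
response_times_ms = sorted([12, 15, 11, 14, 13, 18, 16, 17, 19, 950])

mean_ms = statistics.mean(response_times_ms)
p50_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)

print(mean_ms)  # 108.5
print(p50_ms)   # 15
print(p95_ms)   # 950
```

The mean (108.5 ms) describes no request that actually happened: nine requests took under 20 ms and one took 950 ms. The median shows the typical wait and the high percentile shows the outlier.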

High percentiles of response time = tail latencies

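Tail latencies can matter most: the slowest requests often belong to the most valuable customers, e.g. users who purchase the most have the most data on their accounts
Even when backend calls run in parallel, the user request still has to wait for the slowest call to finish
When one request needs several backend services, a few slow calls are enough to slow down the whole response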
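The amplification effect can be quantified under a simplifying assumption: if each backend call is independently slow with probability p, a request that fans out to n calls is slow whenever at least one of them is, i.e. with probability 1 - (1 - p)^n. The function name and the 1% figure below are illustrative choices, not from the chapter.

```python
def prob_request_is_slow(p_slow_call: float, n_calls: int) -> float:
    """Chance that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** n_calls


# Each call is slow 1% of the time (i.e. slower than its own p99).
for n in (1, 10, 100):
    print(n, round(prob_request_is_slow(0.01, n), 3))
```

With 100 parallel calls, roughly 63% of user requests hit at least one slow call, even though each individual call is fast 99% of the time.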

Scaling up (a more powerful machine) is simpler than scaling out (spreading load across many machines), but scaling out becomes necessary when high-end machines get too costly or high availability is required

Maintainability

Fix bugs

Keep systems operational

Investigate failures

Adapt it to new platforms

Modify it for new use cases

Repay tech debt

Add new features

Operability: Make Life Easy for Operations

Monitor the health of the system

Restore service quickly if the system goes into a bad state

Track down the root cause of problems, e.g. system failures or degraded performance

Keep software and platform up-to-date

Keep track of how systems affect each other

Anticipate future problems and be able to solve them

Establish good practices and tools for deployment and configuration

Perform complex maintenance tasks such as migration

Maintain the security of the systems as configuration changes

Define processes for operations

Keep the production environment stable

Preserve the org’s knowledge about the system

Provide documentation and operational model or steps with expected behaviors

Provide visibility to the runtime behavior and internal states of the system

Avoid dependency on individual machines

Build self-healing systems that still give administrators manual control

Simplicity: Managing Complexity

Symptoms of complexity:

Explosion of the state space
Tight coupling of modules
Tangled dependencies
Inconsistent naming and terminology
Hacks aimed at solving performance problems
Special-casing to work around issues

It is easier to introduce a bug in a more complex system than a simpler one

How to remove accidental complexity: abstraction

Hide the implementation detail behind a clean, simple-to-understand facade
Can be reused by many different applications
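A small sketch of abstraction as a facade: callers depend only on `get` and `put`, while the storage internals can differ or change. All names here (`KeyValueStore`, `DictStore`, `AppendOnlyLogStore`, `save_session`, `load_session`) are hypothetical and chosen only for illustration.

```python
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    """The facade: callers only ever see get() and put()."""

    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...


class DictStore:
    """Simplest implementation: a hash map in memory."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


class AppendOnlyLogStore:
    """Different internals: writes are appended, reads scan from the newest entry."""

    def __init__(self) -> None:
        self._log = []

    def get(self, key: str) -> Optional[str]:
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None

    def put(self, key: str, value: str) -> None:
        self._log.append((key, value))


def save_session(store: KeyValueStore, user: str, token: str) -> None:
    # Application code: knows nothing about how the store keeps its data.
    store.put(f"session:{user}", token)


def load_session(store: KeyValueStore, user: str) -> Optional[str]:
    return store.get(f"session:{user}")


for store in (DictStore(), AppendOnlyLogStore()):
    save_session(store, "alice", "t1")
    save_session(store, "alice", "t2")
    print(type(store).__name__, load_session(store, "alice"))
```

Both stores return the latest token through the same two-method interface, so the application functions can be reused unchanged with either one.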

Evolvability: Making Change Easy

Agile: working patterns for adapting to a frequently changing environment

Test-driven development (TDD)
Refactoring