[Book Notes] Designing Data-Intensive Applications Ch1: Reliable, Scalable, and Maintainable Applications

Tags: Engineering, Distributed System
Created: Feb 5, 2022 08:02 PM
Edited: Feb 5, 2022
Description: This chapter mainly defines three terms: reliability, scalability, and maintainability.

Chapter 1. Reliable, Scalable, and Maintainable Applications

Reliability

  • Performs the function the user expects
  • Tolerates user mistakes and unexpected usage
  • Performs well enough for the required use case
  • Prevents unauthorized access and abuse
Fault: one component of the system deviates from its spec
Failure: the system as a whole stops providing the required service to users
💡
If we turn down one service, what will users see? Will we lose any data?

Hardware Faults

How to mitigate:
  • redundancy of hardware components
  • data backup
  • software fault-tolerance techniques
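A minimal sketch of how redundancy keeps a fault from becoming a failure: a read is tried against each redundant replica in turn, and the caller only sees an error if every replica is down. The names `read_with_failover`, `ReplicaDown`, `broken_replica`, and `healthy_replica` are hypothetical and exist only for this illustration; a real system would also need timeouts and health checks.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the request (a fault)."""


def read_with_failover(replicas, key):
    """Try each redundant replica in turn; fail only if all of them are down."""
    errors = []
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaDown as exc:
            errors.append(exc)  # this component is faulty; try the next one
    # Every redundant component failed: the fault has become a failure.
    raise RuntimeError(f"all {len(replicas)} replicas failed for key {key!r}")


def broken_replica(key):
    # Hypothetical replica whose disk has died.
    raise ReplicaDown("disk died")


def healthy_replica(key):
    # Hypothetical replica backed by an in-memory dict.
    return {"user:1": "alice"}[key]


value = read_with_failover([broken_replica, healthy_replica], "user:1")
print(value)  # alice
```

The user still gets an answer even though one component is faulty, which is the distinction between a fault and a failure drawn above.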

Software Errors

Such as:
  • A software bug triggered by bad input
  • A runaway process that uses up a shared resource (e.g. CPU time, memory, disk space, or network bandwidth)
  • A service that becomes unresponsive or returns corrupted responses
  • Cascading failures, where a small fault in one component triggers a fault in another, and so on

Human Errors

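  • The leading cause of outages: in one study of large internet services cited by the book, operator configuration errors caused more outages than hardware faults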
How to mitigate:
  • Design the system to minimize opportunities for error, including who is authorized to do what
    • well-designed abstractions
    • APIs
    • Admin interface
  • Non-production environments (release cycle, environment variables, smooth configuration)
  • Tests at all levels
  • Easy configuration (CI/CD) to roll back to old code or roll out new code
  • Monitoring and alerts
  • Infrastructure as code + Configuration as code

Scalability

If the system grows in a particular way, what are our options for coping with the growth?

Load parameters

  • Requests per second to a web server
  • The ratio of reads to writes in a DB
  • The number of simultaneously active users in a chat room
  • The hit rate on a cache
💡
Think in numbers, and always use numbers to make decisions. For example, what is the average rate of tweet writes to the DB, and how many tweets does each follower read on average? How do you distribute the load? What about users who have 1 million followers? What if there is a holiday and everyone is trying to tweet at once?
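A back-of-the-envelope sketch of what "use numbers" can look like for the tweet example. The input figures are the ones the book quotes for Twitter in 2012 and are only illustrative; the function name `fanout_writes_per_sec` is hypothetical. It assumes the fan-out-on-write approach, where each new tweet is copied into the cached home timeline of every follower.

```python
# Illustrative inputs (the book's 2012 Twitter figures; replace with your own measurements).
TWEETS_PER_SEC_AVG = 4_600
TWEETS_PER_SEC_PEAK = 12_000
TIMELINE_READS_PER_SEC = 300_000
AVG_FOLLOWERS = 75


def fanout_writes_per_sec(tweets_per_sec, followers_per_author):
    """Cache writes per second if every tweet is copied to each follower's timeline."""
    return tweets_per_sec * followers_per_author


avg_fanout = fanout_writes_per_sec(TWEETS_PER_SEC_AVG, AVG_FOLLOWERS)
peak_fanout = fanout_writes_per_sec(TWEETS_PER_SEC_PEAK, AVG_FOLLOWERS)
celebrity_tweet = fanout_writes_per_sec(1, 1_000_000)  # one tweet, 1M followers

print(avg_fanout)       # 345000
print(peak_fanout)      # 900000
print(celebrity_tweet)  # 1000000
print(round(TIMELINE_READS_PER_SEC / TWEETS_PER_SEC_AVG))  # timeline reads per tweet write
```

The averages look manageable, but a single tweet from an account with 1 million followers costs as much as roughly three seconds of average fan-out traffic, which is why the distribution of followers matters and not only the mean.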

Performance

  • When a load parameter increases and resources stay the same, how does performance change? How much do resources need to grow to keep performance unchanged?
    • Throughput: the number of records, or the volume of data, we can process per second
    • Response time: the time between a client sending a request and receiving a response (this is what the user cares about)
    • Latency: the duration that a request spends waiting to be handled (only one part of the response time the user sees)
  • Not every request takes the same amount of time to complete, because of:
    • Context switch to a background process
    • Loss of network packet
    • TCP retransmission
    • A garbage collection pause
    • A page fault forcing a read from disk
    • Mechanical vibrations in the server rack
  • Metrics for evaluating and analyzing response time:
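    • mean (average): does not tell you how many users actually experienced that delay
    • median (p50): half of requests are faster than this and half are slower, so it shows how long users typically have to wait
    • higher percentiles (p95, p99, p999): show how bad the outliers are
  • High percentiles of response time = tail latencies
    • The customers with the slowest requests are often the most valuable, e.g. users who purchase the most have the most data on their accounts
    • Even when backend calls run in parallel, the user still has to wait for the slowest one to finish
    • The more backend calls a request needs, the more likely it is that at least one of them is slow, so a few slow calls can slow down a large share of end-user requests
  • Scaling up is simpler than scaling out, but you still need to scale out when a bigger machine becomes too costly or high availability is required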
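A small sketch of the response-time metrics above, using made-up sample data. `percentile` implements the nearest-rank definition (one of several common definitions), and `chance_of_slow_request` assumes that backend calls are slow independently of each other, which is a simplification. Both function names and all sample values are hypothetical.

```python
import math


def percentile(sorted_times, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not sorted_times:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_times) / 100))
    return sorted_times[rank - 1]


def chance_of_slow_request(backend_calls, p_slow_call):
    """Probability that at least one of several independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** backend_calls


# Hypothetical response times in milliseconds for ten requests.
times_ms = sorted([12, 15, 11, 14, 13, 16, 12, 250, 13, 900])

mean_ms = sum(times_ms) / len(times_ms)
p50 = percentile(times_ms, 50)
p90 = percentile(times_ms, 90)
p99 = percentile(times_ms, 99)

print(mean_ms)  # 125.6
print(p50)      # 13
print(p90)      # 250
print(p99)      # 900

# If 1% of backend calls are slow and one user request needs 100 of them:
print(round(chance_of_slow_request(100, 0.01), 2))  # 0.63
```

The mean (125.6 ms) describes almost nobody here: the typical request takes 13 ms, while the tail takes hundreds of milliseconds. The last line shows why tail latencies add up: with 100 backend calls per request, well over half of user requests hit at least one slow call.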

Maintainability

  • Fix bugs
  • Keep systems operational
  • Investigate failures
  • Adapt it to new platforms
  • Modify it for new use cases
  • Repay tech debt
  • Add new features

Operability: Make Life Easy for Operations

  • Monitor the health of the system
  • Restore the system quickly if it goes into a bad state
  • Track down the root cause of problems, e.g. system failures or degraded performance
  • Keep software and platforms up to date
  • Keep track of how systems affect each other
  • Anticipate future problems and be able to solve them
  • Establish good practices and tools for deployment and configuration
  • Perform complex maintenance tasks such as migration
  • Maintain the security of the systems as configuration changes
  • Define processes for operations
  • Keep the production environment stable
  • Preserve the org’s knowledge about the system
  • Provide good documentation and an easy-to-understand operational model (if I do X, Y will happen)
  • Provide visibility into the runtime behavior and internal state of the system
  • Avoid dependency on individual machines
  • Build self-healing systems that still give administrators manual control when needed

Simplicity: Managing Complexity

  • Symptoms of complexity:
    • Explosion of the state space
    • Tight coupling of modules
    • Tangled dependencies
    • Inconsistent naming and terminology
    • Hacks aimed at solving performance problems
    • Special-casing to work around issues
  • It is easier to introduce a bug in a complex system than in a simple one
  • How to remove accidental complexity: abstraction
    • Hide implementation details behind a clean, simple-to-understand facade
    • A good abstraction can be reused by many different applications
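A minimal sketch of the facade idea, under the assumption of a tiny key-value store persisted as a JSON file. The class name `KeyValueStore` and its `get`/`put` methods are hypothetical. Callers only see `get` and `put`; the storage format could later be swapped for a database without changing any calling code.

```python
import json
import os
import tempfile


class KeyValueStore:
    """Small facade: callers see get/put, not the file format behind them."""

    def __init__(self, path):
        self._path = path

    def _load(self):
        # Implementation detail hidden from callers: data lives in one JSON file.
        if not os.path.exists(self._path):
            return {}
        with open(self._path, encoding="utf-8") as f:
            return json.load(f)

    def put(self, key, value):
        data = self._load()
        data[key] = value
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get(self, key, default=None):
        return self._load().get(key, default)


with tempfile.TemporaryDirectory() as tmp:
    store = KeyValueStore(os.path.join(tmp, "store.json"))
    store.put("theme", "dark")
    theme = store.get("theme")
    missing = store.get("language", "en")

print(theme, missing)  # dark en
```

Rewriting the whole file on every `put` is deliberately naive; the point is that such a choice stays behind the facade and can be changed without touching the applications that use it.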

Evolvability: Making Change Easy

  • Agile: working patterns for adapting to a frequently changing environment
    • Test-driven development (TDD)
    • Refactoring