Chapter 1. Reliable, Scalable, and Maintainable Applications
Reliability
- Performs the functions users expect
- Tolerates unexpected behavior (e.g. user mistakes or unexpected usage)
- Performance is good enough for the required use case
- Prevents unauthorized access and abuse
Fault: a component deviates from its spec
Failure: the system as a whole stops providing service to users
If we turn down one service, what will users see? Will we lose any data?
Hardware Faults
How to mitigate:
- redundancy of hardware components
- data backup
- software fault-tolerance techniques
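A minimal sketch of how redundancy keeps a fault from becoming a failure. The names `read_with_failover`, `ReplicaUnavailable`, `broken_replica`, and `healthy_replica` are hypothetical and exist only for illustration; real replicas would be network calls rather than local functions.

```python
from typing import Callable, Optional, Sequence


class ReplicaUnavailable(Exception):
    """Hypothetical error raised by a replica that has suffered a fault."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each redundant replica in turn; fail only if every replica is faulty."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaUnavailable as err:
            # A fault: one component deviated from its spec. Try the next one.
            last_error = err
    # Only when all replicas are faulty does the user see a failure.
    raise RuntimeError(f"all {len(replicas)} replicas failed") from last_error


def broken_replica(key: str) -> str:
    # Simulates a hardware fault on one machine.
    raise ReplicaUnavailable("disk died")


def healthy_replica(key: str) -> str:
    return f"value-for-{key}"


result = read_with_failover([broken_replica, healthy_replica], "user:42")
print(result)  # value-for-user:42
```

The first replica faults, but the caller still gets an answer, so no failure is visible to the user.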
Software Errors
Such as:
- Software bug caused by bad input
- A runaway process that uses up a shared resource (e.g. CPU time, memory, disk, or network bandwidth)
- A service that becomes unresponsive or returns corrupted responses
- Cascading failures, e.g. a fault in one component triggers a fault in another, and so on
Human Errors
- The most common cause of outages
How to mitigate:
- Consider “authorization” in system design
- Well-designed abstractions, APIs, and admin interfaces
- Non-production environments (release cycle, environment variables, smooth configuration)
- Tests at all levels
- Easy configuration (CI/CD) to roll back to old code or roll out new code
- Monitoring and alerts
- Infrastructure as code + Configuration as code
Scalability
If the system grows in a particular way, what are our options for coping with the growth?
Load parameters
- Requests per second to a web server
- The ratio of reads to writes in a DB
- The number of simultaneously active users in a chat room
- The hit rate on a cache
Think in numbers, and always use numbers to make decisions. For example, how many tweets are written to the DB per second on average, and how many tweets does each follower read? How do you distribute the load? What about users who have 1 million followers? What if there is a holiday and everyone is trying to tweet at once?
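A back-of-the-envelope sketch of the tweet questions above. Every figure here is an illustrative assumption, not a measurement from these notes, and the calculation assumes one particular design (each new tweet is copied into every follower's timeline at write time). The function name `fanout_writes_per_sec` is hypothetical.

```python
# Illustrative load figures (assumptions for the sake of the arithmetic).
TWEETS_PER_SEC_AVG = 4_600
TWEETS_PER_SEC_PEAK = 12_000      # e.g. a holiday spike
AVG_FOLLOWERS = 75
CELEBRITY_FOLLOWERS = 1_000_000


def fanout_writes_per_sec(tweets_per_sec: int, followers_per_author: int) -> int:
    """Timeline writes per second if each tweet is copied to every follower."""
    return tweets_per_sec * followers_per_author


avg_writes = fanout_writes_per_sec(TWEETS_PER_SEC_AVG, AVG_FOLLOWERS)
peak_writes = fanout_writes_per_sec(TWEETS_PER_SEC_PEAK, AVG_FOLLOWERS)
# A single tweet from one very popular account:
celebrity_writes = fanout_writes_per_sec(1, CELEBRITY_FOLLOWERS)

print(avg_writes)        # 345000
print(peak_writes)       # 900000
print(celebrity_writes)  # 1000000
```

Under these assumptions, the load parameter that matters is not tweets per second alone but tweets per second multiplied by followers per author, which is why a handful of very popular accounts can dominate the write load.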
Performance
- When load increases, how does performance change?
- Throughput: the number of records, or the volume of data, we can process per second
- Response time: the time between a client sending a request and receiving a response (this is what the user cares about)
- Latency: the duration a request spends waiting to be handled (not seen directly by the user)
- Not every request takes the same amount of time to complete. Sources of random extra delay:
  - A context switch to a background process
  - Loss of a network packet
  - TCP retransmission
  - A garbage collection pause
  - A page fault forcing a read from disk
  - Mechanical vibrations in the server rack
- Metrics for evaluating and analyzing response time:
  - mean (avg)
  - percentile
  - median (50th percentile): half of users experience a better response time than this and the other half a worse one, so it shows how long users typically have to wait
- High percentiles of response time = tail latencies
  - Tail latencies can matter the most, e.g. the users who purchase the most have more data and therefore tend to see the slowest requests
  - Even when backend calls run in parallel, the user still has to wait for the slowest one to finish
  - A few slow backend calls can make the entire response slow, because a request that needs many calls is more likely to hit a slow one
- Scaling up is simpler than scaling out, but scaling out is still needed when scaling up becomes too costly or high availability is required
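A small sketch of the response-time metrics above, plus the effect of waiting on the slowest of many backend calls. The sample data is made up, `percentile` uses the nearest-rank method (one of several common definitions), and `chance_of_slow_request` assumes backend calls are slow independently of each other, which real systems may not satisfy. Both function names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def percentile(sorted_times: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not sorted_times:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(p / 100 * len(sorted_times)))
    return sorted_times[rank - 1]


def chance_of_slow_request(p_slow_call: float, n_calls: int) -> float:
    """Probability that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** n_calls


# Made-up response times in milliseconds; a single outlier dominates the mean.
times_ms = sorted([12, 15, 14, 13, 250, 16, 14, 15, 13, 18])

mean_ms = statistics.mean(times_ms)
median_ms = percentile(times_ms, 50)
p95_ms = percentile(times_ms, 95)

print(mean_ms)    # 38
print(median_ms)  # 14
print(p95_ms)     # 250

# If 1% of backend calls are slow and a request needs 100 of them:
print(round(chance_of_slow_request(0.01, 100), 3))  # 0.634
```

In this sample the mean (38 ms) describes almost no actual request: the typical user waits about 14 ms, while the tail waits 250 ms. The last line shows why tail latency matters more as fan-out grows: with a 1% slow rate per call, roughly 63% of requests that make 100 calls would hit at least one slow call.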
Maintainability
- Fix bugs
- Keep systems operational
- Investigate failures
- Adapt the system to new platforms
- Modify the system for new use cases
- Repay tech debt
- Add new features
Operability: Make Life Easy for Operations
- Monitor the health of the system
- Restore service quickly if the system goes into a bad state
- Track down the root cause of problems, e.g. system failures or degraded performance
- Keep software and platforms up to date
- Keep track of how systems affect each other
- Anticipate future problems and be able to solve them
- Establish good practices and tools for deployment and configuration
- Perform complex maintenance tasks such as migration
- Maintain the security of the systems as configuration changes
- Define operational processes
- Keep the production environment stable
- Preserve the organization’s knowledge about the system
- Provide documentation and an easy-to-understand operational model (if I do X, Y will happen)
- Provide visibility into the runtime behavior and internal state of the system
- Avoid dependency on individual machines
- Build self-healing systems that still give administrators manual control
Simplicity: Managing Complexity
- Symptoms of complexity:
  - Explosion of the state space
  - Tight coupling of modules
  - Tangled dependencies
  - Inconsistent naming and terminology
  - Hacks aimed at solving performance problems
  - Special-casing to work around issues
- It is easier to introduce a bug in a more complex system than a simpler one
- How to remove accidental complexity: abstraction
  - Hide implementation details behind a clean, simple-to-understand facade
  - A good abstraction can be reused by different applications
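A minimal sketch of abstraction as a facade, assuming a made-up `KeyValueStore` class. The in-memory dict and append-only log inside it are placeholder implementation details chosen only to show that callers never touch them.

```python
from typing import Dict, List, Optional, Tuple


class KeyValueStore:
    """Hypothetical facade: callers see only put/get, not how data is stored."""

    def __init__(self) -> None:
        # Hidden implementation details. Either could be replaced (e.g. by a
        # file or a remote service) without changing any caller.
        self._index: Dict[str, str] = {}
        self._log: List[Tuple[str, str]] = []

    def put(self, key: str, value: str) -> None:
        self._log.append((key, value))  # record every write in order
        self._index[key] = value        # keep the latest value per key

    def get(self, key: str) -> Optional[str]:
        return self._index.get(key)


# Two unrelated uses share the same abstraction.
sessions = KeyValueStore()
sessions.put("session:1", "alice")

feature_flags = KeyValueStore()
feature_flags.put("dark_mode", "on")
feature_flags.put("dark_mode", "off")

print(sessions.get("session:1"))       # alice
print(feature_flags.get("dark_mode"))  # off
```

Neither caller needs to know about the log or the index, so that complexity stays in one place instead of leaking into every application that stores data.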
Evolvability: Making Change Easy
- Agile: working patterns for adapting to a frequently changing environment
  - Test-driven development (TDD)
  - Refactoring