🛰️

Site Reliability Engineering

Incorporate tools, workflows and responsibilities for operating the software into development teams

DevOps Principles

The DevOps Research and Assessment (DORA) program conducted statistical research on development practices, drawing on more than 23,000 survey responses from companies of all kinds. The results were published in 2018 in the book "Accelerate: Building and Scaling High Performing Technology Organizations", which became widely popular and recognized (e.g. by Google Cloud). The research identified four key metrics for development team performance:

  • 📈 Lead Time for Changes: The amount of time it takes a commit to get into production.
  • 📈 Deployment Frequency: How often an organization successfully releases to production.
  • 📈 Change Failure Rate: The percentage of deployments causing a failure in production.
  • 📈 Time to Restore Service: How long it takes an organization to recover from a failure in production.

The combination of these metrics shows the clear benefit of tightly integrating development (Dev) and operations (Ops). Treating development and operations as one integrated approach allows teams to see the bigger picture and optimize all four metrics at once.
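As a rough illustration, the four metrics can be derived from a log of deployments. The sketch below is not DORA's methodology: the `Deployment` record, its field names, and the use of the median are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical deployment record; field names are illustrative."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a non-empty list of deployments."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "lead_time_for_changes": median(lead_times),
        "deployments_per_day": len(deployments) / period_days,
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment has been restored in the period
        "time_to_restore_service": median(restore_times) if restore_times else None,
    }


# Example: four deployments within one week, one of which failed.
week = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 11)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 13),
               failed=True, restored_at=datetime(2024, 1, 2, 13, 30)),
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 10)),
    Deployment(datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 15)),
]
metrics = dora_metrics(week, period_days=7)
```

Because all four values come from the same records, a change in process (for example smaller, more frequent deployments) shows up in every metric at once.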

Alerts

Alerts define thresholds that indicate abnormal system behavior requiring human intervention. They are typically based on metrics. Alerts often link to runbooks that provide helpful first steps for incident response.
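A threshold-based alert with a runbook link might look like the following minimal sketch. The `AlertRule` type, the metric name `http_5xx_ratio`, the threshold, and the runbook URL are all hypothetical and not tied to any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical alert definition: fire when a metric exceeds a threshold."""
    name: str
    metric: str
    threshold: float
    runbook_url: str  # first steps for the responder


def evaluate(rule: AlertRule, samples: dict[str, float]) -> Optional[str]:
    """Return an alert message if the rule fires, otherwise None."""
    value = samples.get(rule.metric)
    if value is None or value <= rule.threshold:
        return None
    return (f"ALERT {rule.name}: {rule.metric}={value} exceeds "
            f"{rule.threshold}. Runbook: {rule.runbook_url}")


# Illustrative rule; the name, metric and URL are placeholders.
rule = AlertRule(
    name="HighErrorRate",
    metric="http_5xx_ratio",
    threshold=0.05,
    runbook_url="https://example.com/runbooks/high-error-rate",
)
firing = evaluate(rule, {"http_5xx_ratio": 0.07})
quiet = evaluate(rule, {"http_5xx_ratio": 0.01})
```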

On-Call Rotation

An On-Call Rotation schedules shifts for team members who take turns being available outside of regular working hours to respond to incidents.
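A simple round-robin rotation can be computed from a start date and a shift length. This is only a sketch: the function name, the weekly default, and the member names are assumptions, and real schedules usually also handle overrides, time zones, and handover times.

```python
from datetime import date


def on_call_for(day: date, members: list[str], rotation_start: date,
                shift_days: int = 7) -> str:
    """Return the member on call on `day` in a round-robin rotation."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - rotation_start).days // shift_days
    return members[shifts_elapsed % len(members)]


# Hypothetical team with weekly shifts starting on Monday, 2024-01-01.
team = ["alice", "bob", "carol"]
start = date(2024, 1, 1)
first_week = on_call_for(date(2024, 1, 1), team, start)
second_week = on_call_for(date(2024, 1, 8), team, start)
fourth_week = on_call_for(date(2024, 1, 22), team, start)
```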

Escalation Policy

An Escalation Policy defines an escalation path for dealing with incidents of different severity. It determines who should be notified based on the alerting circumstances and how potential escalations should be handled if more people or more expertise is needed to deal with the incident. Bigger incidents may also require additional people that manage actions and communication during incident response (e.g. Incident Commanders).
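One way to picture an escalation path is as a list of levels per severity, each notified after an alert has gone unacknowledged for a given time. The severities, roles, and timings below are invented for illustration and would differ in any real policy.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    notify: list[str]   # roles to notify at this level
    after_minutes: int  # minutes unacknowledged before this level is notified


# Hypothetical policy: higher severity brings in more people over time.
POLICY: dict[str, list[EscalationLevel]] = {
    "low": [
        EscalationLevel(["primary on-call"], 0),
    ],
    "high": [
        EscalationLevel(["primary on-call"], 0),
        EscalationLevel(["secondary on-call"], 15),
        EscalationLevel(["incident commander"], 30),
    ],
}


def responders_to_notify(severity: str, minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been notified by now."""
    notified: list[str] = []
    for level in POLICY[severity]:
        if minutes_unacknowledged >= level.after_minutes:
            notified.extend(level.notify)
    return notified
```

For example, `responders_to_notify("high", 20)` returns the primary and secondary on-call, while a low-severity alert never escalates beyond the primary on-call in this sketch.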

Incident Response Workflow

The incident response process deals with unexpected problems or outages in production and builds on alerts, escalation policies, on-call rotations and runbooks. When Alerts are triggered, Escalation Policies and On-Call Rotations determine which responders to notify. Responders are expected to follow an incident response workflow. Here is an example of such a workflow:

  1. Acknowledge Alert
    • Acknowledge the alert to prevent automated escalation
    • Open the incident
    • Join the designated communication channel
  2. Assess Severity
    • Assess and update the severity level
    • If necessary, escalate the incident according to the Escalation Policy to get more assistance
  3. Investigate and Mitigate
    • Find the root cause of the incident
    • Document important findings as comments
    • Document important actions taken as comments
  4. Monitor Recovery
    • Ensure that the system is functioning normally again and the incident does not recur
  5. Post-mortem Analysis
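
The example workflow above can be modeled as a small state machine that moves an incident through the five steps in order and records comments along the way. The `Incident` class and its methods are hypothetical and only sketch the idea; an incident management tool would track far more state.

```python
from enum import Enum


class Step(Enum):
    ACKNOWLEDGE_ALERT = 1
    ASSESS_SEVERITY = 2
    INVESTIGATE_AND_MITIGATE = 3
    MONITOR_RECOVERY = 4
    POST_MORTEM_ANALYSIS = 5


class Incident:
    """Hypothetical incident record that follows the workflow step by step."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.step = Step.ACKNOWLEDGE_ALERT
        self.comments: list[tuple[Step, str]] = []

    def comment(self, text: str) -> None:
        """Document a finding or an action taken, tagged with the current step."""
        self.comments.append((self.step, text))

    def advance(self) -> Step:
        """Move to the next workflow step; the last step cannot be advanced."""
        if self.step is Step.POST_MORTEM_ANALYSIS:
            raise ValueError("incident is already in post-mortem analysis")
        self.step = Step(self.step.value + 1)
        return self.step


# Illustrative run through the first three steps.
incident = Incident("Checkout latency spike")
incident.advance()  # -> ASSESS_SEVERITY
incident.advance()  # -> INVESTIGATE_AND_MITIGATE
incident.comment("Rolled back the latest deployment")
```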