The winding road to successful SRE: Adopting service tiers in your company

2020-10-22 SRE Reliability DevOps Managing change

Aaron Lee on Unsplash (@aaronhjlee)

The journey to implementing Site Reliability Engineering (SRE) can be a long one. As with any activity that requires technical awareness, adjustments to well-established mental models, and cultural shifts, the most pragmatic route may be to go after easy wins, take note of what cannot be implemented overnight, and make tweaks to the rulebook.

As your engineering teams become more attuned to the concepts, you can leverage some paradigms to help you score quick wins in adopting SRE principles, until they become second nature and an integral part of the engineering department’s culture.

Service tiers are one such paradigm: support agreements wrapped around the Service Level Objectives (SLOs) that matter to your business. They also provide a useful scale that separates the support of proofs of concept and throwaway experiments from that of highly critical products. Furthermore, they offer a terminology that the product team can easily rally around, even though its view is often focused on functionality rather than on individual resources.

The names of these tiers should suggest a logical scale, such as Bronze, Gold and Platinum, rather than obtuse names like Athena, Neptune and Apollo.

Assigning service tiers to your products will help you assess how important each product is to the business, and it will also highlight potential shortcomings in preparing for failures or incidents.

Assigning a service tier to a product can also improve your security posture by helping to patch up any systemic neglect of legacy resources in an enterprise environment.

Some of the qualifying criteria for each tier

To place each product in its correct tier, we should take into account:

  1. Whether product downtime may cause damage to your organisation’s reputation
  2. The level of the uptime/availability requirements
  3. Whether the product contains personal customer data (PII)
  4. To what extent the product is critical to your organisation’s competitive edge
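
As a rough sketch of how the four criteria above could feed a tier decision, the Python snippet below records them per product and applies a few rules. The `ProductProfile` and `suggest_tier` names, and the thresholds themselves, are assumptions made for illustration; the real cut-offs are something each organisation has to agree on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductProfile:
    """Answers to the four qualifying questions for one product (hypothetical)."""
    reputational_risk: bool   # would downtime damage the organisation's reputation?
    uptime_target: float      # required availability in percent, e.g. 99.9
    holds_pii: bool           # does the product contain personal customer data?
    competitive_edge: bool    # is it critical to the competitive edge?


def suggest_tier(profile: ProductProfile) -> str:
    """Suggest a tier name. The rules below are illustrative assumptions."""
    # Strict uptime needs, or reputational risk combined with PII, go to the top tier.
    if profile.uptime_target >= 99.9 or (profile.reputational_risk and profile.holds_pii):
        return "Platinum"
    # Any single sensitive criterion lands in the middle tier.
    if profile.reputational_risk or profile.holds_pii or profile.competitive_edge:
        return "Gold"
    # Everything else, such as internal tools with modest uptime needs.
    return "Silver"


checkout = ProductProfile(True, 99.9, True, True)
wiki = ProductProfile(False, 99.0, False, False)
print(suggest_tier(checkout), suggest_tier(wiki))
```

A function like this is best treated as a conversation starter for the tiering discussion rather than as the final word on where a product belongs.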

Conceiving the corresponding support metrics

The service tiers will also be used as part of a framework to assign the appropriate amount of effort and human resources to supporting your code and infrastructure.

These metrics can be defined during community of practice meetings, where teams share their projects’ best practices in order to establish what a minimal standard of monitoring, and the alerting that follows from it, should look like.

A useful outcome of these meetings could be checklists of the tools in use across scattered teams, mapped to their functions and packaged as readily available solutions to support your critical products. Their creation and maintenance would be co-owned by engineering and the delivery teams.

Further benefits

Because these labels are weighted, they allow ranking, for example in a dashboard or in communications, and tallying when treated as a point system. For the sake of argument, 15 points could be the maximum load an engineer should have to look after.
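
To make the point system concrete, here is a minimal sketch in Python. The 15-point ceiling comes from the paragraph above; the per-tier weights in `TIER_POINTS` and the function names are assumptions picked purely for illustration.

```python
# Assumed weights per tier; tune these to your own organisation.
TIER_POINTS = {"Platinum": 5, "Gold": 3, "Silver": 1}

# Maximum load per engineer, as suggested in the text.
MAX_POINTS_PER_ENGINEER = 15


def engineer_load(tiers: list[str]) -> int:
    """Sum the points of every product tier an engineer looks after."""
    return sum(TIER_POINTS[tier] for tier in tiers)


def over_budget(assignments: dict[str, list[str]]) -> list[str]:
    """Return the engineers whose load exceeds the ceiling, sorted by name."""
    return sorted(
        name
        for name, tiers in assignments.items()
        if engineer_load(tiers) > MAX_POINTS_PER_ENGINEER
    )


assignments = {
    "ana": ["Platinum", "Platinum", "Platinum", "Gold"],  # 5 + 5 + 5 + 3 = 18
    "ben": ["Gold", "Silver"],                            # 3 + 1 = 4
}
print(over_budget(assignments))
```

With these assumed weights, three Platinum products already reach the ceiling, which is the kind of signal a team lead could surface on a dashboard.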

Possible implementation

Platinum

(99.9%+ uptime, error budget of 43 minutes per month)

  • Downtime means a priority 1 incident
  • Observability across the entire stack
  • Set up properly in PagerDuty with 24/7 team on-call support
  • DevOps-reviewed high availability architecture and strategy
  • Super hot runbooks, reviewed and tested monthly
  • SLOs reviewed on a monthly basis
  • A large number of SLIs covering different aspects of the platform

Gold

  • Downtime means a priority 2 incident
  • Observability over most of the stack
  • Set up properly in PagerDuty with working hours support
  • Good runbooks
  • Review of runbooks every few months

Silver

(99% uptime, error budget of 7 hours 12 minutes per month)

  • Downtime means a priority 3 incident
  • Observability over most of the exposed endpoints (website, database, etc.)
  • Set up properly in PagerDuty with working hours support
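
The error budgets quoted for Platinum and Silver follow from simple arithmetic on the uptime target. The sketch below assumes a 30-day month, which is what the figures above appear to be based on; the function names are hypothetical.

```python
def error_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime for a given uptime target over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def format_budget(minutes: float) -> str:
    """Render a budget as hours and whole minutes, rounding down."""
    hours, mins = divmod(int(minutes), 60)
    return f"{hours}h {mins}m"


print(format_budget(error_budget_minutes(99.9)))  # Platinum: 0h 43m
print(format_budget(error_budget_minutes(99.0)))  # Silver: 7h 12m
```

Rounding down is a deliberate choice here: it is safer to understate the budget slightly than to promise downtime that the target does not allow.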

This has been useful in companies I have worked with. I hope this will be useful for you too!