SRE, an Enigma?

I was watching ‘The Playlist’, a Netflix series about how Spotify came into being. Remember the times when one would go to torrent sites to download movies and songs? People were resigned to dialling up to the internet to get music from dodgy sites that were not only down most of the time but also took an enormous amount of time to download content from a limited repository, and invited viruses onto your home computer.

This is a classic case of unreliable sites: high latency, low availability and poor security, all leading to high user dissatisfaction. Spotify began as a start-up idea that would bring the required download speed down to less than 0.6 Mbps, be always available and let users search for any music. This was the ‘Aha!’ moment for music lovers and the rest is history!

Google invented the concept of SRE (Site Reliability Engineering), as its search engine also had to be always available, return searches and load content at high speed. SRE rests on five simple foundations:

1)     Form clear SLOs: A Service Level Objective fixes a target % for a Service Level Indicator such as availability, latency or throughput. It also helps set error budgets, i.e. allow some tolerance for system failures while the focus stays on improvement.

2)     Set Alerting: Use alerting mechanisms that show where the problem is and notify service engineers so that they can rectify it.

3)     Monitoring (and Observability): If you can’t measure, you don’t know what to improve! Monitor the basic PIs (Performance Indicators), i.e. collect data, aggregate it and make it available in real time, and observe logs, metrics and traces.

4)     Reducing Toil: Automate manual tasks that are repetitive (e.g. using AIOps tools for support functions, CI/CD in DevSecOps etc.).

5)     Simplicity: Keep code, architecture and processes simple and uncomplicated, so that even a relatively new system is easy and quick to train people to operate.
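
To make the first two foundations concrete, here is a minimal Python sketch of how an availability SLO turns into an error budget, and how an alert could fire when that budget runs low. The names (SLO, error_budget_remaining, should_alert) and the 25% alert threshold are my own assumptions for illustration, not part of any standard tool.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A target percentage for one Service Level Indicator (hypothetical helper)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability(good: int, total: int) -> float:
    """SLI: fraction of requests served successfully."""
    return good / total if total else 1.0


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = (1.0 - slo.target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for any failure.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


def should_alert(slo: SLO, good: int, total: int, threshold: float = 0.25) -> bool:
    """Alert when less than `threshold` of the budget is left (assumed policy)."""
    return error_budget_remaining(slo, good, total) < threshold


if __name__ == "__main__":
    slo = SLO(name="availability", target=0.999)
    good, total = 999_500, 1_000_000  # made-up traffic numbers
    print(f"SLI: {availability(good, total):.4%}")
    print(f"Budget left: {error_budget_remaining(slo, good, total):.0%}")
    print(f"Alert: {should_alert(slo, good, total)}")
```

With a 99.9% target, one million requests allow about 1,000 failures; 500 failures leave roughly half the budget, so no alert fires yet. Real setups usually alert on how fast the budget is burning rather than on a single snapshot, so treat this as a starting point only.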

Most organisations I have come across don’t know how to start with a holistic approach … there is some automation achieved or planned, a certain level of dashboarding, and some restructuring of teams to include SREs, i.e. engineers who can not only do routine tasks but also code. We can do a complete maturity assessment and help create a roadmap.

What are you seeing? What are your views on how SRE can be made successful? I’ll be glad to hear them! Do drop me a message!
