Thomas Stringer’s Post

Staff Software Engineer / Site Reliability Engineer, Tech Lead | Linux | Kubernetes | Observability | Cloud

A lot of software engineers don't care enough about on-call when they're interviewing for a new job. On-call is one of the few items that can have a massive impact on your personal life, so you should understand what that impact will be before accepting an offer.

Some questions I like to ask about on-call...

🔥 How often are you on-call? Anything more frequent than 25% of the time can be tough.
🔥 How long are you on-call for? Are rotations a week long? Anything longer and you can get bad on-call fatigue. But too short means you're on-call more often.
🔥 How many people are on-call at the same time? Is there a primary and a backup?
🔥 What's the expected response time to acknowledge the incident/page?
🔥 What's the expected time to be actively working the incident? That's not always the same as ack'ing.
🔥 Is the rotation 24/7 or something else, like business hours only or no weekends?
🔥 How many pages does the primary on-call get on average? What's a bad rotation look like? What's a good rotation look like?
🔥 What does the business hours work look like for the on-call? Are they doing toil tasks, or still doing project work if there are no incidents?

There's more to on-call than a lot of people think. And all of the answers to the above questions have a direct impact on your personal life!

#softwareengineering #softwaredevelopment #sre #sitereliabilityengineering
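
As a rough illustration of the 25% guideline above, here is a minimal sketch of the arithmetic, assuming the rotation is shared evenly across the team. The function names and the `slots` parameter (how many people are on-call at once) are hypothetical, chosen only for this example.

```python
from fractions import Fraction


def on_call_fraction(team_size: int, slots: int = 1) -> Fraction:
    """Share of time each engineer spends on-call.

    Assumes an evenly shared rotation. `slots` is the number of people
    on-call at the same time (1 = primary only, 2 = primary plus backup).
    """
    if team_size < 1 or not 1 <= slots <= team_size:
        raise ValueError("need 1 <= slots <= team_size")
    return Fraction(slots, team_size)


def exceeds_threshold(team_size: int, slots: int = 1,
                      threshold: Fraction = Fraction(1, 4)) -> bool:
    """True when each engineer is on-call more often than the threshold."""
    return on_call_fraction(team_size, slots) > threshold
```

Under these assumptions, a primary-only rotation needs at least four people to stay at or under 25%, and a primary-plus-backup rotation needs at least eight.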

Dave Myler

Lead Architect | Driving Technical Direction, Growth Strategies

3mo

If on-call has an impact on your personal life then the problem isn't the on-call process, it's the reliability of the systems. The impact ought to be close to zero when averaged over any significant period of time.

Edward Morgan

I’m an experienced engineering leader and consultant. I help TA teams hire technical talent 2x faster with evidence-based practices and proven training methods.

3mo

Thomas Stringer you’re very generously suggesting that management actually has defined processes and SLAs for on-call and abides by them.

🚀 Matthew Ellison

I help entrepreneurs build software

3mo

Ya it sucks if a company isn't up front about this. Hopefully you are working for a good team that knows how to balance this, but I could totally see getting caught off guard by it. A couple of other thoughts to consider:

- Along these lines, it's good to ask about weekend work in general. You might not expect it at all based on your prior experiences, but all companies operate a little differently. This would naturally lead into on-call discussions as well, and also help cover any time-sensitive projects that could affect your personal time.
- On top of all this, I would imagine that if you focus on working for a good company with a good team, you can likely work around issues like this as they come up. On-call might not be a big deal now, but maybe they are planning to release a big new feature soon and things might get hectic for a few months. If you have a good team, you'll be able to figure out a system that works for everyone.

Matthew Byng-Maddick

Experienced Site Reliability Engineer

3mo

Remember that the corollary to “what’s the expected response time” is that the company is too cheap to pay people to go on shift. This also applies to, say, going out drinking with friends or colleagues (who hasn’t debugged a production system from the pub?). If they’re paying you almost the same for on-call time as for your normal time, then they can expect better responses and more sacrifice of your social life than if they pay much less than that, in which case they have chosen to take the risk.

Juri Barman

Engineering at Netflix

3mo

I believe there is no one-size-fits-all answer to these questions. It all depends on the urgency and business criticality of the user flows being impacted; acceptable SLAs for outages differ based on business impact and also team size. Bigger teams would have fewer rotations per engineer. It's important to balance outage resolution time and the team's well-being when coming up with an on-call strategy. If budget allows, I have seen companies with secondary support teams in different regions, which helps engineers rotate shifts. Primary and backup (secondary) on-call works well if the team is large (> 10 engineers); otherwise it can cause even more fatigue in smaller teams (< 6 engineers). Also, during capacity planning, it's important not to take on a lot of project work during on-call weeks. The on-call should not be expected to make progress on project work. In teams with lighter shifts, the engineer may take the liberty of spending some bandwidth on project work, but there should not be a set expectation, as things can change really quickly!

Mina Azib

Senior Software Engineer @ Microsoft

3mo

Great questions - I'll def keep these in mind going forward! On-call can be like the metaphor of the duck that's calm on the surface but kicking like crazy underwater. It can sneak up on you if you're not ready for it. Also, it's kind of a red flag if on-call is always busy/intense; to me it means there is likely a process problem OR not enough attention to tech debt/OpEx.

Chris Woodard

Senior iOS Developer, Online Course Author at Pluralsight

3mo

If the company has a lot of on-call incidents, it speaks to a lack of rigor in their planning and testing (or they’re cheaping out on staging infrastructure because their cloud bills are giving the green eyeshade mechanics heartburn). In software, there are a few universal rules: 1. There is no such thing as a free lunch. 2. You get what you pay for. 3. Writing software and building systems is like making love; if you rush it, it ends badly.

Krum Bakalsky 🇺🇦

Senior Backend Developer | x-Google | x-AWS

3mo

Thomas Stringer I assume the whole point of asking these questions is to avoid companies/teams with bruising on-call practices. However, they usually have high attrition, which in turn means that they usually hire en masse, which usually means that no-one will have any time to answer this nice question list you've prepared. It makes sense, but it is totally impractical, no company will ever want to spend the time to go through these questions with you. They would rather invest that time/energy for hiring another candidate with no similar questions in mind.

Patrick Davis

Software Developer at risual ltd (A Node4 Company)

3mo

All valid points, and perhaps my opinion will change in future but… At the moment, I think that the need for developers to be on call is a sign that there may be some underlying systemic issues at the company. Most places of work do just fine with competent multi-line support teams and pre-defined SLAs. At the very least, on call should be shift based and alternate between multiple people to avoid burnout.
