On-call roster
I recommend an on-call roster for any team with more than five engineers. It all boils down to work-life balance & context-switching!
Before I discovered on-call
I was part of a small, energetic team building value-added telecom services nearly two decades ago. We made new products, added features, and supported multiple production environments. The pace was frantic.
We managed this workload poorly. It showed in the quality of our output and in how stressed the team, and our whole environment, was.
We fixed the problem by hiring dedicated technical support engineers for each telecom customer and built debugging tools for these engineers.
I discovered on-call
I joined a decently big telecom infrastructure software startup and learned about the benefits of rotating engineers between feature development and production support responsibilities.
We had long release cycles (once every quarter); hence, the on-call engineers got three months to review technical debt deeply.
I learned a lot of good engineering practices from this startup, and the impact of on-call rotation stands out.
The first time I set up an on-call roster
Years later, I was again part of a small, young (I felt young, too) and energetic team. Our customers were in different time zones, which made production issues even harder to handle. If someone was going to start waking people at night, they had to know who was the best person to fix the issue.
I made my first proper on-call roster with this team. One developer from our group was available on call in case of any issues. They didn't stay up all night monitoring things, but they kept their phone off silent mode at night.
We traded uncertainty, for our customers and for everyone on the team, for a system that let everyone execute without context switches.
The on-call roster gives comprehensive benefits
I have continued using an on-call roster with weekly rotation (or multiple rosters in larger teams). Here is how I feel this adds value:
1. Removes context switches and increases productivity for engineers
2. Reliable and fast responses and resolutions because one person is responsible at any point. Fast resolutions for issues benefit customers and build trust with them.
In an operations-focused business, it becomes even more critical. Not every engineer should be getting support calls, and we shouldn't block the operations team. An on-call roster addresses both concerns.
3. Lots of learning for the engineer going on-call: business context, debugging tools, analytics, etc.
4. Our whole stack becomes debuggable and better documented
5. We often use the on-call engineer's time to fix technical or operational debt in our product. We often ask on-call engineers to create documentation, increase automation coverage, and write playbooks to make the next on-call job easier.
6. When the on-call system works well, the team pays attention to other teams' services, helping break information silos. The system also reduces individual dependency on projects and improves overall teamwork.
The handling of technical debt deserves its own post, but having on-call engineers contribute to paying it down is a natural solution, in my opinion.
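For readers who want something concrete, here is a minimal sketch of the weekly rotation mentioned above. It is an illustration under my own assumptions, not a tool any of these teams used: the function name `on_call_for`, the engineer names, and the Monday handover date are all hypothetical.

```python
from datetime import date

def on_call_for(day, engineers, rotation_start):
    """Return who is on call on `day` under a weekly round-robin rotation.

    Assumes the rotation hands over every 7 days, starting on `rotation_start`.
    """
    if not engineers:
        raise ValueError("the roster needs at least one engineer")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    # Wrap around to the first engineer once everyone has had a turn.
    return engineers[weeks_elapsed % len(engineers)]

# Hypothetical roster: three engineers, handover on Mondays.
engineers = ["asha", "bilal", "chen"]
rotation_start = date(2024, 1, 1)  # a Monday

print(on_call_for(date(2024, 1, 10), engineers, rotation_start))
```

A real roster also has to deal with holidays, swaps, and time zones, which is why most teams eventually move this into a scheduling tool rather than a script.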
Pitfalls
I plan to write another post soon about how to run an on-call system effectively. I do want to inform you about some pitfalls here:
1. Not everyone on the team is ready to be on-call. The system works when an on-call engineer can handle most of the production issues without involving other team members.
2. Onboarding new team members into on-call takes time because they need to be effective. Shadowing the current on-call engineer, followed by a reverse-shadow week, often helps.
3. You are sacrificing a good percentage of velocity by dedicating on-call bandwidth. The way I think about this bandwidth is that the team was probably already spending all this time across multiple engineers; you are only consolidating the effort into one engineer. By design, on-call engineers shouldn't get any sprint tasks.
4. Sometimes, bug triaging and resolution will be slower than in your current ad-hoc system. Overall, I still feel that detection and resolution times would decrease (which is a good thing) with a dedicated engineer. On-call engineers usually implement quick fixes and move the long-term solution to regular sprints.
5. On-call handover is critical. Ensuring all pending issues and alerts move to the next engineer when the on-call engineer changes is essential.
6. If you have an extensive product or an operations team, on-call doesn't mean you don't need a dedicated technical/customer support team.
7. A well-implemented on-call roster is a multi-level roster. Figuring out level 1 and level 2 is often tricky. I usually keep senior engineers on L1 and engineering managers on L2, but I have seen implementations where L1 and L2 schedules are just staggered by a week or two.
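The staggered variant in the last pitfall is easy to sketch. This is only one possible reading of "staggered by a week": I assume a single pool of engineers, a weekly handover, and that this week's L2 is whoever was L1 the week before. The function name `staggered_roster` and the engineer names are hypothetical.

```python
from datetime import date

def staggered_roster(day, engineers, rotation_start, l2_offset_weeks=1):
    """Return (l1, l2) for `day`, with L2 trailing the L1 schedule.

    With the default offset of one week, this week's L2 was last week's L1,
    so the escalation contact already knows the open issues.
    """
    n = len(engineers)
    if n < 2:
        raise ValueError("a two-level roster needs at least two engineers")
    if l2_offset_weeks % n == 0:
        raise ValueError("this offset would put the same person on L1 and L2")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    l1 = engineers[weeks_elapsed % n]
    # Python's % always returns a non-negative index here, even in week 0.
    l2 = engineers[(weeks_elapsed - l2_offset_weeks) % n]
    return l1, l2

# Hypothetical pool of four engineers, handover on Mondays.
pool = ["asha", "bilal", "chen", "divya"]
start = date(2024, 1, 1)  # a Monday

for week in range(3):
    monday = date(2024, 1, 1 + 7 * week)
    print(monday, staggered_roster(monday, pool, start))
```

One side effect of this design is that the outgoing L1 stays reachable as L2 for a week, which softens the handover problem from pitfall 5. It does not replace the senior-engineer and manager split I usually prefer.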
Do this now
If you are an engineer facing too many context switches because of production issues, talk to your seniors and see if this post resonates with them.
If you are a founder/engineering lead with a product that customers are using, and your team is spending time tackling live/production issues, talk to your team and see if an on-call roster makes sense for you.
Thank you friends
Thanks to my friends who helped me understand this topic better and reviewed the post. This post has gone through multiple drafts, and many of my friends have given me some great feedback on the post. Thank you, Abhishek Airan, Manish Singh, Pushpendra Sharma, Indrajit Rajtilak, Kalpesh Balar, Manideep Polireddi, Tanay Pratap, Ankush Dharkar, Aditya Shetty, Mahesh Sharma, Aditya Katare, and Akash Saxena. I have used your words and phrases liberally in this post.