What I wish I knew before my first on-call shift

I'm not ashamed to say that I didn't exactly know what an on-call shift was until I got my first job at a company that had one. This post offers a few tips to help those just getting started make the most of their shift. Big thanks to my friend Rafael Graunke for suggesting this as a good topic to write about.

📕 First off: what is on-call anyways?

According to the Cambridge online dictionary, the adjective on-call:

Describes a worker who is available at any time when needed

That is spot on. But in the context of software engineering, an on-call dev is someone who is available throughout their shift to deal with bugs and outages. Outages can have many causes, but most of the time it's bugs.

✉️ What will be expected of me?

While the expectations can vary widely from one company to another, from what I've observed, the general ones are:

Lead incident response
Do triage and prioritization of alerts
Document everything
Be available
Do preventative work

Again, those are general. But I think you can get the gist of what you need to do. Now let's go to how to do it!

📝 Tips on how to handle on-call shifts

Your first on-call shift might seem daunting, but fear not! You have to understand that as a newcomer you surely don't have a sharp intuition built up just yet. But as you get exposed to issues and notes, you slowly develop a sense of when something is seriously wrong. That's part of the process, so enjoy it.

1️⃣ Communicate!

Your best resource when you're starting out is your colleagues, as they have probably had far more exposure to the system. Look for code owners, responsible teams, anyone who might help you fill in the context you're probably missing. On top of that, acknowledging alerts in a timely manner and keeping management updated are also crucial.

2️⃣ Be proactive

And by this I mean:

Investigate alerts - dive into that code!
Open tickets to reduce the noisy alerts - nobody likes to be flooded with alerts
Monitor the dashboards in whatever services the company uses, including 3rd party status pages
If you touch anything, document it, and if it was code, leave it better than you found it
Prioritize alerts - ask yourself "what would block my customer?"
Work on things that improve monitoring in shady areas - the hardest bugs are those that aren't logged
Contribute to overall system reliability
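
To make the prioritization and noise-reduction points a bit more concrete, here is a minimal Python sketch of how I think about it. Everything in it is made up for illustration: the `Alert` fields, the alert names, and the `threshold` of 10 are assumptions, not something taken from a real alerting tool. The idea is simply that alerts that would block a customer go first, and alerts that fire a lot without blocking anyone are candidates for a noise-reduction ticket.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    blocks_customer: bool  # hypothetical flag: would this block a customer?
    occurrences: int       # hypothetical count: times it fired during the shift


def triage(alerts):
    """Order alerts: customer-blocking first, then by how often they fired."""
    return sorted(alerts, key=lambda a: (not a.blocks_customer, -a.occurrences))


def noisy(alerts, threshold=10):
    """Alerts that fire often without blocking anyone (ticket candidates)."""
    return [
        a for a in alerts
        if not a.blocks_customer and a.occurrences >= threshold
    ]


# Made-up alerts for illustration only.
alerts = [
    Alert("disk-usage-warning", blocks_customer=False, occurrences=42),
    Alert("checkout-5xx", blocks_customer=True, occurrences=3),
    Alert("cron-retry", blocks_customer=False, occurrences=2),
]

for alert in triage(alerts):
    print(alert.name)
```

In practice the "does this block a customer?" answer rarely comes from a boolean field. It usually comes from asking a colleague or reading a runbook, which is one more reason to communicate.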

3️⃣ Document everything

Documenting is very important. You can do it in many ways: opening work tickets or issues, or creating pages or documents. It is essential for keeping track of any changes made, and it serves as a shortcut if the same issue ever shows up again. As for myself, I usually write everything down as things are happening or right after. The reason: memory and details are still fresh. This also helps ensure that I actually studied the issue and its solution, and it helps other engineers whenever they need to read about it again. I usually add details about the people involved and who was responsible for the fix, along with logs, alerts, and chat discussions.
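
If you like keeping notes in a structured way, something as small as the sketch below can be enough. It is only an illustration of the habit described above (timestamped entries written as things happen, plus the people involved). The `IncidentNote` class and its fields are hypothetical, not part of any tool; a plain document or ticket works just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentNote:
    """Hypothetical running log for a single incident."""
    title: str
    people: list = field(default_factory=list)
    entries: list = field(default_factory=list)

    def log(self, text, when=None):
        # Default to "now" in UTC so entries are written while details are fresh.
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, text))

    def render(self):
        # Plain-text summary that can be pasted into a ticket or wiki page.
        lines = [self.title, "People involved: " + ", ".join(self.people), ""]
        for when, text in self.entries:
            lines.append(f"{when:%Y-%m-%d %H:%M} UTC  {text}")
        return "\n".join(lines)


# Made-up incident for illustration only.
note = IncidentNote("Checkout errors", people=["on-call dev", "code owner"])
note.log("Alert acknowledged")
note.log("Rolled back the latest deploy")
print(note.render())
```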


💭 Final Thought

I hope this short post gave you a better understanding of what on-call is and what people expect of you, along with a small head start to make your shift a success. If you're able to follow these simple rules, the last key element for developing a good intuition is exposure! So go for it: ask questions, and don't settle for being passive. You're very likely to be off to a good start.