Why does it matter?
Not to state the extremely obvious, but if the app is broken, that is a bad thing. Our users can’t access our products, which means we can’t help them with the problems they’ve trusted us to solve. Moreover, as a business, our ability to generate revenue and continue along our path to becoming a sustainable entity is compromised. The work that we deliver should, well, work.
So what exactly is a ‘production incident’?
As with any software product, there are a number of things that are broken at any given time. These range from the relatively trivial (perhaps we’ve formatted a date incorrectly, or a button isn’t taking the user anywhere after they’ve clicked it) all the way through to the full “oh-no-everything-is-on-fire-the-app-won’t-open” scenario.
Because these issues have different levels of severity, we’ll treat and therefore triage them differently.
The more trivial ones will usually find their way into a squad’s backlog, ready to be picked up by said squad as part of standard sprint work. This can happen in any number of ways - such as a bug report on Slack, a new error being picked up by a developer from one of our error-monitoring services, or a Product Owner playing about with the staging environment of our app.
The more serious and urgent ones, meanwhile, we classify as ‘production incidents’. We have two levels of production alert at Cleo: critical and high. For something to be classified as a ‘production alert’, it must have:
- A mechanism for actually doing said alerting, with an appropriate alert set up
- A runbook associated with the alert, describing the steps that can be taken to rectify the issue
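To make the two requirements above concrete, here is a minimal sketch of how an alert definition could be modelled. All the names are illustrative assumptions - this is not how our alerting is actually configured - but it captures the rules: every production alert carries a runbook, and only critical alerts page out of hours.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Severity(Enum):
    CRITICAL = auto()  # serious enough to wake someone up at 3 a.m.
    HIGH = auto()      # urgent, but handled in-hours only


@dataclass
class ProductionAlert:
    name: str
    severity: Severity
    runbook_url: str  # every production alert must have an associated runbook

    def pages_out_of_hours(self) -> bool:
        # Only critical alerts fire out of hours.
        return self.severity is Severity.CRITICAL


# Hypothetical example alert:
db_down = ProductionAlert(
    name="primary-db-unreachable",
    severity=Severity.CRITICAL,
    runbook_url="https://runbooks.example.com/db-unreachable",
)
print(db_down.pages_out_of_hours())  # True
```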
Because everything is a spectrum, there is no hard and fast rule for what constitutes a ‘critical’ alert, a ‘high’ alert, or neither. Our general rubric for critical alerts is, “is something so badly wrong that people should stop what they’re doing to go and fix this now? Is it worth waking someone up at 3a.m. for?” - as the critical alerts are the ones that fire out of hours.
Of course, not all the Things That Go Wrong In Production are things we have foreseen and conveniently written runbooks for. We might stumble across them in our error-reporting services, spot them while using the app ourselves, or hear about them from our customer support teams. When this happens, we’ll deal with them in the same way, only without the support of those handy runbooks.
How do we deal with these in the moment?
In-hours, these will be picked up by the ‘first responder’, a rotating role that every engineer is assigned to, but may opt out of. The first responder carries out that role for a week, and in that time they triage both our production alerts and the #rollbar and #bugsnag Slack channels (which our error-monitoring services automatically post to).
For regular bugs, they will attempt to notify the relevant squads, so that the bug can be ticketed up and appropriately prioritised - they are not expected to fix every new bug that pops up.
For a production incident, they will typically acknowledge the issue, tag in members of relevant squads, and coordinate a Zoom call where those involved can debug the issue. That may be by following the steps in the runbook, but they might also lean on the expertise of their colleagues if they have more context on what the issue may be.
Out-of-hours, the process looks the same, save for the fact that the on-call engineer’s colleagues won’t typically be around (and certainly won’t be expected to be). There are two caveats to this: a second line of support, who can be paged if need be, and potentially a shadow on-call engineer, who may be taking on the role to get some experience of what being on-call is like before being thrown in the deep end by themselves.
Given this, the on-call engineer will typically follow the steps spelled out in the runbook. They are also free, of course, to use their own intuition and debugging skills to resolve the issue. However, if the steps in the runbook do not suffice, the on-call engineer is not expected to be able to fix the issue.
We understand all out-of-hours work to take place on a best-effort basis. In practice, this means we ask on-call engineers to stop attempting to fix a given alert after two hours of working on it. If they cannot fix it in that time with the support of our runbooks, then, in line with our no-blame culture, we accept this as a collective failure of the engineering team to provide adequate support to the on-call engineer via said runbooks. In this case, the on-call engineer will leave the alert acknowledged until the start of the next working day, so that it can be resolved by the rest of the team in-hours.
How do we follow up on these?
After any production incident, we’ll follow up with a post-mortem. We’ve been running these for over four years now, using an ever-evolving template document to help us take reasonable steps to ensure the same thing does not happen again, as well as to share knowledge of what went wrong and why.
Typically, the post-mortem will be organised by the engineer who responded to the issue. Sometimes they will also facilitate the ceremony, but at other times it might be more helpful for them to be a participant and for someone else to facilitate: again, there is no hard and fast rule.
In the post-mortem, we’ll try to understand several things:
- The scale and impact of the issue
- The root cause of the problem (beyond merely its symptoms)
- How we were made aware of the problem
- How the problem was resolved
- What reasonable steps we could take to prevent this from happening again
This last step is critical, and there are no restrictions placed on what it looks like - it might come in the form of a new alert, an updated runbook, modifications to application code, or something else entirely.
What do we expect from the on-call engineer, and how do we compensate them?
Because being on-call is a) opt-in, and b) something we care about, we want to make it an attractive proposition for engineers at Cleo, rather than something overly onerous. To that end, we compensate on-call engineers with:
- 1 day’s pay, or 1 day’s time-off-in-lieu (TOIL) for being on-call for the week
- £50 for every hour actually worked
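For a concrete sense of the numbers, the compensation above can be sketched as a simple calculation. The £50/hour figure comes from the policy; the day rate used in the example is entirely made up.

```python
OUT_OF_HOURS_RATE_GBP = 50  # £50 for every hour actually worked


def on_call_compensation(day_rate_gbp: float, hours_worked: float) -> float:
    """One flat day's pay (or its TOIL equivalent) for the week on-call,
    plus the hourly rate for any hours actually worked."""
    return day_rate_gbp + OUT_OF_HOURS_RATE_GBP * hours_worked


# A hypothetical £300 day rate and a two-hour 3 a.m. incident:
print(on_call_compensation(300.0, 2))  # 400.0
```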
On top of that, it is not uncommon to see on-call engineers asking others to cover small parts of their shift for them, or to switch shifts. This is so that they can keep any out-of-hours commitments they already have - which can be anything from attending a friend’s wedding through to playing a five-a-side match. This operates on an honour system, rather than something that is rigorously tracked, leaning into our engineering principle that we help each other.
In return, Cleo asks that on-call engineers remain within reach of an internet connection, able to respond to any alert within an hour of it firing, and “competent” (which is to say, compos mentis and able to debug said alerts). This is, of course, not a hard and fast set of rules; at Cleo, we trust our engineers (and indeed, all our employees) to make sensible, rational decisions in good conscience. We have written a number of case studies of acceptable and unacceptable on-call scenarios to help engineers make these choices - describing situations such as being paged while out at dinner, or travelling at short notice.
At Cleo, our users are important to us, and so we need to be sure that we are able to meet their needs to the best of our ability: responding to production incidents effectively is critical to that, and requires policies that work not just for our users or the business, but also for those carrying out the support work.
We think our policies here are pretty great, but we also know they’re not perfect; they’ve been tweaked over time, and we’re always open to suggestions from anyone across the team around how we can improve them.