You are paged about a large spike in 500 errors on the search page.
(set: $time_elapsed to 0)(set: $stress to 0)(set: $baddata_exists to true)(set: $wiki_checked to false)(set: $job_rerun to false)(set: $time_modifier to 1+$stress/100)
<div class="flash center">ALERT: HTTP 5xx -- yelp-main: threshold exceeded</div>
What do you do?
[[Notify stakeholders|notifyTooEarly]]
[[Wait for errors to drop|wait]]
[[Check wiki page|wiki]]
[[Check recent code changes|checkCodeChanges]]
[[Check dashboards|checkDashboards]]
[[Escalate to a senior engineer|pageSenior]]

Unsure of what course of action to take, you decide to consult the internal wiki for a runbook to follow.
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
<div class="wiki"
><h1>Operations Runbook</h1>
Welcome to the operations runbook! Here are the steps you should take.
• Check operational dashboards.
• Look for recently deployed code changes that could be related.
</div>
• [[Check dashboards|checkDashboards]]
• [[Check recent code changes|checkCodeChanges]]

You notify stakeholders. Your stress level decreases. (set:$stress to (max:0,$stress-10))(set: $time_elapsed to it+0.5)
[[Check dashboards|checkDashboardsFinal]]
[[Escalate to a senior engineer|pageSenior]]

The dev follows your advice and releases the code without reviews or testing.
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
After the change hits production, the errors don't subside.
<div class="flash center">ALERT: HTTP 5xx -- yelp-main: threshold exceeded</div>
The error is still coming from the searchIndexer component but has a different traceback. Your head starts to spin as you realize that pushing a hotfix introduced another error to debug. Even worse, the original error may still be present in the code.
[[Shift traffic to us-west|shiftTraffic]]
[[Ask the search team to turn off the feature toggle|toggleOff]]
[[Ask the search team to revert both of the bad code changes|revert]]

The dev sends out a code review and writes unit tests for the change. While manually testing their change, they find a bug in the patch and send out a new diff.
You glance at the clock; more time has passed than you realized. You check in with the dev, who says the change is ready to be deployed. It will still take additional time to pass through the staging environment before it hits prod.
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Push the new code change|pushNewCode]]
[[Ask the search team to turn off the feature toggle|toggleOff]]

<link href="https://fonts.googleapis.com/css?family=Bitter:400,700|Lato|Ubuntu+Mono" rel="stylesheet"><div class="page"><section class="header">OnCall of Duty</section
><section class="content"></section><section class="footer"></section
><div class="stats center">Time: (print: $time_elapsed) hrs (+(print: $time_delta)) | Stress: (print: $stress)% ((if:$stress_delta>=0)[+](print: $stress_delta))</div
></div>
(if: $stress>100)[
(goto: "gameover")
]

<div class="center">Triage real-life incidents. Learn what it's like to be on-call.
Your goal is to mitigate the incident as soon as possible.
</div>
Scoring:
Stress:
Increases as the game progresses. Certain actions can decrease stress.
Time elapsed:
Each step you take adds to the elapsed in-game time. Some steps take longer than others.
Ready to begin?
[[I'm ready.|init]]

You wait for errors to drop.
The errors have now increased to critical levels.
<div class="flash center">ALERT: !!CRITICAL!! HTTP 5xx -- yelp-main: threshold exceeded</div>
Your stress level rises as you try to think of what to do. Meanwhile, the clock keeps ticking...
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Notify stakeholders|notifyTooEarly]]
[[Check wiki page|wiki]]
[[Check recent code changes|checkCodeChanges]]
[[Check dashboards|checkDashboards]]
[[Escalate to a senior engineer|pageSenior]]

You check the push records for recent code changes.
You discover that there was a recent deploy related to the searchIndexer component. This looks promising, but you may need more information...
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Notify stakeholders|notify]]
[[Check dashboards|checkDashboardsFinal]]
[[Escalate to a senior engineer|pageSenior]]

You check the operational dashboards.
The dashboards show that the error spike is due to a bunch of errors all containing the same traceback.
<div class="code">Traceback (most recent call last):
File "yelp/www/init.py", line 554, in start
self.run()
File "yelp/www/render.py", line 91, in run
decoratedCheckFunction(target)
File "yelp/www/util/retry.py", line 172, in function
return func(**args)
File "/yelp-main/yelp/component/search/indexer.py", line 114, in module
indexedResult = loadAndParseResultsFromIndexerContext(**args)
SearchIndexerException: internal error: unable to resolve resource</div>
They also indicate that 100% of users in us-east are affected. This classifies the incident as a p0, requiring immediate resolution.
You identify that the searchIndexer component of the main application is throwing errors. This looks promising, but you may need more information...
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Check recent code changes|checkCodeChangesFinal]]
[[Escalate to a senior engineer|pageSenior]]

Unsure of what course of action to take, you decide to ask a senior engineer on your team.
<section class="slack"
><div class="channel">#incidents</div
><div><span class="name" style="color: #4d4da2;">you</span>I'm having trouble with a spike in 500 errors on the search page.</div
><div><span class="name" style="color: #a24d4d;">sallyb</span>Have you checked the operational dashboards?</div
><div><span class="name" style="color: #a24d4d;">sallyb</span>Are there any recent code changes that may have affected the site?</div
></section>
(set: $time_elapsed to it+0.5)
[[Check recent code changes|checkCodeChanges]]
[[Check dashboards|checkDashboards]]

You post on the company's Twitter account that the site is down and engineers are aware of the outage. In addition, you send out a mass email to all users with the same information.
The CTO messages you, wanting to know the size, impact, and severity of the outage. She expects you to know what percentage of users are impacted and in which regions.
You realize you don't have this information yet, and you spend time trying to explain the situation to others. You probably should have assessed the impact of the incident before notifying stakeholders.
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Wait for errors to drop|wait]]
[[Check wiki page|wiki]]
[[Check recent code changes|checkCodeChanges]]
[[Check dashboards|checkDashboards]]
[[Escalate to a senior engineer|pageSenior]]

You check the push records for recent code changes.
You discover that there was a recent deploy related to the search page within the searchIndexer component.
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Ping #search on-point in Slack|onPoint]]
[[Page the #search on-call engineer|searchOncall]]
[[Follow the search team's runbook to debug the issue yourself|debugAlone]]

You ping the on-call person in the #search on-call channel and sit at your computer waiting for them to respond.
<section class="slack"
><div class="channel">#search</div
><div><span class="name" style="color: #4d4da2;">you</span>hi @oncall, we're seeing a spike in 500 errors on the search page coming from the searchIndexer component, can you take a look?</div
><div><span class="name" style="color: #4d4da2;">you</span>hello? is anyone here?</div
></section>
It looks like the search team went out for lunch and the on-call person isn't responding to your message. You regret not paging the on-call directly as you feel your stress levels rise.
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Page the #search on-call engineer|searchOncall]]
[[Follow the search team's runbook to debug the issue yourself|debugAlone]]

You page the on-call person for the search team and get an immediate response.
The search on-call says they're on their way back to the office and acknowledges that the recent code push to the searchIndexer component is likely causing the errors. They mention that the change is related to a new feature and there's a way to toggle the feature off. They say that this is probably the fastest way to stop the errors, but that it doesn't actually fix the bug.
They ask you for advice on what to do.
(set: $time_elapsed to it+0.5) (set:$stress to (max:0,$stress-10))
[[Shift traffic to us-west|shiftTraffic]]
[[Ask the search team to revert the code change|revert]]
[[Ask the search team to deploy a new code change that includes the fix|hotfix]]
[[Ask the search team to turn off the feature toggle|toggleOff]]

You check the internal wiki for the search team's runbooks.
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
<div class="wiki"
><h1>Search Indexer Runbook</h1>
Welcome to the Search Indexer runbook for search team on-calls! Here are the steps you should take.
• Check Lucene cluster instances are up and running.
• It might be a Java runtime exception. Use a debugger.
• ...</div>As you peruse the runbook on the searchIndexer component and read through the search team's codebase, you realize you're in way over your head.
[[Ping #search on-point in Slack|onPoint]]
[[Page the #search on-call engineer|searchOncall]]

You check the operational dashboards.
The dashboards show that the error spike is due to a bunch of errors all containing the same traceback.
<div class="code">Traceback (most recent call last):
File "yelp/www/init.py", line 554, in start
self.run()
File "yelp/www/render.py", line 91, in run
decoratedCheckFunction(target)
File "yelp/www/util/retry.py", line 172, in function
return func(**args)
File "/yelp-main/yelp/component/search/indexer.py", line 114, in module
indexedResult = loadAndParseResultsFromIndexerContext(**args)
SearchIndexerException: internal error: unable to resolve resource</div>
They also indicate that 100% of users in us-east are affected. This classifies the incident as a p0, requiring immediate resolution.
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Ping #search on-point in Slack|onPoint]]
[[Page the #search on-call engineer|searchOncall]]
[[Follow the search team's runbook to debug the issue yourself|debugAlone]]

You shift the traffic in us-east to us-west.
The errors subside. It looks like the feature was only available to users on the east coast. You breathe a sigh of relief as you think to yourself that you just bought yourself a bit of time.
However, you see that capacity in us-west is now cut in half and servers are degraded.
<div class="flash center">ALERT: Warning: low capacity in us-west: degraded performance</div>
(set: $time_elapsed to it+0.5) (set:$stress to it+10)
[[Ask the search team to revert the bad code change|revert]]
[[Ask the search team to deploy a new code change that includes the fix|hotfix]]
[[Ask the search team to turn off the feature toggle|toggleOff]]

You ask the search team's on-call to revert the bad code change.
They get back to the office and start to deploy a revert to production. As the revert hits production, the errors subside. You breathe a sigh of relief.
[[Write a postmortem|postmortem]]

You ask the search team to make a new code change that would prevent the error from being raised on the search page.
The search on-call comes up with a fix on their dev machine and asks if they can release it right away. How should they release it?
[[I like to live dangerously. Rush it!|rushrelease]]
[[Ask them to write unit tests, run manual tests, and send out a code review.|carefulrelease]]

You ask the search team's on-call to turn the feature off.
They get back to the office and deploy a config change to production. It hits production in a few minutes, and the errors subside. You breathe a sigh of relief.
What do you do next?
[[Write a postmortem|postmortem]]

You ask the search team's on-call to push the code fix.
They deploy the fix to production. As it hits production, the errors subside. You breathe a sigh of relief. It took a lot longer than you thought, but at least you resolved the issue.
[[Write a postmortem|postmortem]]

Congratulations!
You successfully mitigated the issue and the errors finally stopped. You issue an all-clear to the engineering organization as you start to think about sipping mai tais on warm, sandy beaches...
Guess it's time to write that postmortem.
[[Play again|init]]
(link:"Fill out feedback form")[(gotoURL:"https://docs.google.com/forms/d/e/1FAIpQLSest4tngyTeWZ9DBudZLUL5IUsyICTXhduuPellDXnny9LR2A/viewform")]