You got an email about a batch failure.
Note: Watch your stress level! It may affect your productivity...
[[Check wiki page|wiki]]
You opened the wiki page for the batch.
<h1>Rerunning a failed batch</h1>
This batch can sometimes be flaky. You can try rerunning the job on Tron first.
[[Rerun job on Tron|firstrerun]]
You asked a senior engineer on your team what to do.
<div><span class="name" style="color: #4d4da2;">you</span>I'm having trouble with a batch failure!</div>
<div><span class="name" style="color: #4d4da2;">you</span>What should I do?</div>
<div><span class="name" style="color: #a24d4d;">senioreng</span>Have you checked the runbook on the wiki page?</div>
<div><span class="name" style="color: #a24d4d;">senioreng</span>There should be some instructions there.</div>
You notified stakeholders. Your stress level decreased.
Time Elapsed: (print: $time_elapsed) hours
Oncall Stress: (print: $stress)%
What do you do next?
[[Rerun job on Tron|firstrerun]]
[[Check wiki page|wiki]]
[[Escalate to a senior engineer|seniorwiki]]
You reran the job on Tron.
The job failed after (print: $time_delta*60) minutes, so it doesn't look like a flake...
[[Check EMR logs|checkemr]]
[[Ask a senior engineer|senioremr]]
The EMR log shows a KeyError.
Looks like the input is corrupted and missing a required field.
[[Change the batch (data consumer) code|changeconsumer]]
[[Change the logging (data producer) code|changeproducer]]
[[Clean bad log line|cleanbadlog]]
You asked a senior engineer on your team what to do. They recommended you check the EMR log on S3.
<div><span class="name" style="color: #4d4da2;">you</span>I tried rerunning the job on Tron.</div>
<div><span class="name" style="color: #4d4da2;">you</span>But it's still failing.</div>
<div><span class="name" style="color: #a24d4d;">senioreng</span>So it's not a flake?</div>
<div><span class="name" style="color: #a24d4d;">senioreng</span>You could look in the EMR log for the traceback.</div>
You worked with ops to clean the bad log line from S3.
You came up with a code fix for the data producer and pushed out the change.
You came up with a code fix.
How should we release it?
[[I like to live dangerously. Rush it!|rushrelease]]
[[Write unit tests, run manual tests, and send out a code review.|carefulrelease]]
You released the code without reviews or testing. #agiledevelopment
You sent out a code review and got a shipit. It took a bit of time but this should work...
[[Rerun the batch|rerunfail]]
[[Rerun the batch|rerunfail]]
Congratulations!
The batch succeeded and you've closed a P0 case :megahappy:
Final stats
Time Elapsed: (print: $time_elapsed) hours
Oncall Stress: (print: $stress)%
[[Play again|init]]
You reran the batch but it still seems to be failing... Let's go back to investigation.
You asked a senior engineer what to do.
<div><span class="name" style="color: #4d4da2;">you</span>There is something wrong with the input log.</div>
<div><span class="name" style="color: #4d4da2;">you</span>What should I do about it?</div>
<div><span class="name" style="color: #a24d4d;">senioreng</span>You could either fix the input or the batch code.</div>
<div><span class="name" style="color: #a24d4d;">senioreng</span>I don't think updating the logging code will do anything.</div>
[[Clean bad log line|cleanbadlog]]
[[Change the batch (data consumer) code|changeconsumer]]
<section class="content"></section><section class="footer"></section>
Time: (print: (round:$time_elapsed*100)/100) hrs | Stress: (print: (round:$stress*100)/100)%
Slack snippet:
<div><span class="name" style="color: #4d4da2;">helplessdev</span>I have a problem!!!</div>
<div><span class="name" style="color: #a24d4d;">opsperson</span>Don't worry, I can help. What seems to be the problem?</div>
<div><span class="name" style="color: #4d4da2;">helplessdev</span>Search pages are serving 500 errors and I don't know where they're coming from</div>
goofy flashing panic message!
<div class="flash center">DAR-2152</div>
Code snippet:
>x = 67;
status = `_dfsk___ldfj`;
hope += 1;</div>
regular button: (click to see back button)
[[go to next step|example page]]
Fancy back button
<div class="back">[[back to style|Style use examples]]</div>
----Game Over----
Your manager has decided to ask someone else to be in charge of this incident.
Final stats
Time Elapsed: (print: $time_elapsed) hours
Oncall Stress: (print: $stress)%
[[Play again|init]]
(link:"Fill out feedback form")[(gotoURL:"")]