You got an email about a batch failure.
(set: $time_elapsed to 0)(set: $stress to 0)(set: $baddata_exists to true)(set: $wiki_checked to false)(set: $gameover to false)(set: $stress_increment to 10)(set:$time_increment to 1)
Note: Watch your stress level! It may affect your productivity...
[[Check wiki page|wiki]]
[[Ask a senior engineer|seniorwiki]]
You opened the wiki page for the batch.(set:$time_delta to $time_increment*0.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta)
<div class="wiki"
><h1>Rerunning a failed batch</h1>
This batch can sometimes be flaky. You can try rerunning the job on Tron first.
Use the command <span class="code"
>tronctl retry</span></div>
[[Rerun job on Tron|firstrerun]]
[[Ask a senior engineer|seniorwiki]]
You asked a senior engineer on your team what to do.(if:($stress-$stress_increment)<=0)[(set:$stress_delta to $stress*(-1))](else:)[(set:$stress_delta to $stress_increment*(-1))](set:$stress to it+$stress_delta)(set:$time_delta to $time_increment*.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) <!-- asking consumes less time -->
<section class="slack"
><div class="channel">#team-something</div
><div><span class="name" style="color: #4d4da2;">you</span>I'm having trouble with a batch failure!</div
><div><span class="name" style="color: #4d4da2;">you</span>What should I do?</div
><div><span class="name" style="color: #a24d4d;">senioreng</span>Have you checked the runbook on the wiki page?</div
><div><span class="name" style="color: #a24d4d;">senioreng</span>There should be some instructions there.</div
></section>
[[Check wiki page|wiki]]
You notified stakeholders. Your stress level decreased. (set:$stress to (max:0,$stress-1))(set: $time_elapsed to it+1)
-----
Time Elapsed: (print: $time_elapsed) hours
Oncall Stress: (print: $stress)%
-----
<!-- this is still in development
What do you do next?
(if:$job_rerun is false)[
[[Rerun job on Tron|firstrerun]]
](if:$wiki_checked is false)[
[[Check wiki page|wiki]]
]
[[Escalate to a senior engineer|seniorwiki]]
-->
You reran the job on Tron. (set:$time_delta to $time_increment*0.5*(1+$stress/100))(set: $time_elapsed to it+$time_delta)(set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta)
The job failed after (print: (round: $time_delta*60)) minutes, so it doesn't seem like a flake...
[[Check EMR logs|checkemr]]
[[Ask a senior engineer|senioremr]]
The EMR log shows a KeyError.
Looks like the input is corrupted and missing a required field.(set:$time_delta to $time_increment*0.5*(1+$stress/100))(set: $time_elapsed to it+$time_delta)(set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta)
[[Change the batch (data consumer) code|changeconsumer]]
[[Change the logging (data producer) code|changeproducer]]
[[Clean bad log line|cleanbadlog]]
[[Ask a senior engineer|seniordontchangeproducer]]
You asked a senior engineer on your team what to do. They recommended you check the EMR log on S3. (if:($stress-$stress_increment)<=0)[(set:$stress_delta to $stress*(-1))](else:)[(set:$stress_delta to $stress_increment*(-1))](set:$stress to it+$stress_delta)(set:$time_delta to $time_increment*.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) <!-- asking consumes less time -->
<section class="slack"
><div class="channel">#team-something</div
><div><span class="name" style="color: #4d4da2;">you</span>I tried rerunning the job on Tron.</div
><div><span class="name" style="color: #4d4da2;">you</span>But it's still failing.</div
><div><span class="name" style="color: #a24d4d;">senioreng</span>So it's not a flake?</div
><div><span class="name" style="color: #a24d4d;">senioreng</span>You could look into the EMR log for a traceback.</div
></section>
[[Check EMR log|checkemr]]
You worked with ops to clean the bad log line from S3.(set:$time_delta to $time_increment*2*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) <!-- 2x time-->
[[Rerun the batch|rerunsuccess]]
You came up with a code fix for the data producer and pushed out the change.(set:$time_delta to $time_increment*(1+$stress/100))(set: $time_elapsed to it+$time_delta)(set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta)
[[Rerun the batch|rerunfail]]
You came up with a code fix.(set:$time_delta to $time_increment*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta)
How should we release it?
[[I like to live dangerously. Rush it!|rushrelease]]
[[Write unit tests, run manual tests and send out a code review. |carefulrelease]]
You released the code without reviews or testing. #agiledevelopment (set:$time_delta to $time_increment*0.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta) <!-- this one takes a quarter of the time -->
[[Rerun the batch|rerunfail]]
You sent out a code review and got a shipit. It took a bit of time but this should work... <!-- this one consumes 3x amount of time -->
(set:$time_delta to $time_increment*3*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta)
[[Rerun the batch|rerunfail]]
Congratulations!
The batch succeeded and you've closed a p0 case :megahappy:
Final stats(set:$time_delta to $time_increment*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta)
-----
Time Elapsed: (print: $time_elapsed) hours
Oncall Stress: (print: $stress)%
-----
[[Play again|init]]
(link:"Fill out feedback form")[(gotoURL:"https://docs.google.com/forms/d/e/1FAIpQLSest4tngyTeWZ9DBudZLUL5IUsyICTXhduuPellDXnny9LR2A/viewform")]
You reran the batch but it still seems to be failing... Let's go back to investigation.(set:$time_delta to $time_increment*0.5*(1+$stress/100))(set: $time_elapsed to it+$time_delta) (set: $stress_delta to $stress_increment)(set:$stress to it+$stress_delta)
[[continue|checkemr]]
You asked a senior engineer what to do. (if:($stress-$stress_increment)<=0)[(set:$stress_delta to $stress*(-1))](else:)[(set:$stress_delta to $stress_increment*(-1))](set:$stress to it+$stress_delta)(set:$time_delta to $time_increment*.25*(1+$stress/100))(set: $time_elapsed to it+$time_delta) <!-- asking consumes less time -->
<section class="slack"
><div class="channel">#team-something</div
><div><span class="name" style="color: #4d4da2;">you</span>There is something wrong with the input log. </div
><div><span class="name" style="color: #4d4da2;">you</span>What should I do about it?</div
><div><span class="name" style="color: #a24d4d;">senioreng</span>You could either fix the input or the batch code.</div
><div><span class="name" style="color: #a24d4d;">senioreng</span>I don't think updating the logging code will do anything.</div
></section>
[[Clean bad log line|cleanbadlog]]
[[Change the batch (data consumer) code|changeconsumer]]
[[Change the logging (data producer) code anyway|changeproducer]]
<link href="https://fonts.googleapis.com/css?family=Bitter:400,700|Lato|Ubuntu+Mono" rel="stylesheet"><div class="page"><section class="header">OnCall of Duty</section
><section class="content"></section><section class="footer"></section
><div class="stats center">Time: (print: (round:$time_elapsed*100)/100) hrs (+(print: (round:$time_delta*100)/100)) | Stress: (print: (round:$stress*100)/100)% ((if:$stress_delta>=0)[+](print: (round:$stress_delta*100)/100))</div
></div>
(if: $stress>100)[(if: $gameover is false)[
(goto: "gameover")
]
]
Slack snippet:
<section class="slack"
><div><span class="name" style="color: #4d4da2;">helplessdev</span>I have a problem!!!</div
><div><span class="name" style="color: #a24d4d;">opsperson</span>Don't worry, I can help. What seems to be the problem?</div
><div><span class="name" style="color: #4d4da2;">helplessdev</span>Search pages are serving 500 errors and I don't know where they're coming from</div></section>
goofy flashing panic message!
<div class="flash center">DAR-2152</div>
Code snippet:
<div class="code"
>x = 67;
status = `_dfsk___ldfj`;
hope += 1;</div>
regular button: (click to see back button)
[[go to next step|example page]]
Fancy back button
<div class="back">[[back to style|Style use examples]]</div>
----Game Over----(set: $gameover to true)
Your manager has decided to ask someone else to be in charge of this incident.
Final stats
-----
Time Elapsed: (print: $time_elapsed) hours
Oncall Stress: (print: $stress)%
-----
[[Play again|init]]
(link:"Fill out feedback form")[(gotoURL:"https://docs.google.com/forms/d/e/1FAIpQLSest4tngyTeWZ9DBudZLUL5IUsyICTXhduuPellDXnny9LR2A/viewform")]