I'm going to visit a school for kids that can't write good

R Tyler Croy 2019-12-02 18:55:07 -08:00
parent 347fe071ca
commit 6588ecb886
GPG Key ID: E5C92681BEF6CEA2
2 changed files with 27 additions and 21 deletions

@ -51,3 +51,6 @@ dfeldman:
ugi:
name: Ugi Kutluoglu
hamiltonh:
name: Hamilton Hord

@ -32,10 +32,10 @@ the "night shift."
## Trying it out
Getting everyone on board with the day/night shifts was the easy part,
implementing the shifts in PagerDuty turned out far more difficult. TO begin, I created the
`Core-Platform` schedule, adding all of the team members. The schedule
was built using Pagerduty's "Restrict On-Call Times" in order to restrict the
schedule's activation, limiting it to 7:00-17:00 PST.
implementing the shifts in PagerDuty turned out far more difficult. To begin, I
created the `Core-Platform` schedule, adding all of the team members. The
schedule was built using PagerDuty's "Restrict On-Call Times" in order to
restrict the schedule's activation, limiting it to 7:00-17:00 PST.
Next I created an "Escalation Policy" with Core Platform as the first level,
and then configuring the existing Core Infrastructure primary schedule as the
@ -48,12 +48,13 @@ resolve the incident.
## Bumpy roads
Having wired the settings together for Core Platform's services as a prototype,
I shared the work with a developer we were working with from PagerDuty, it went
_okay_. I explained the desired end-goal, and walked through what I expected to
I shared my progress with a developer from PagerDuty; it went
_okay_. I explained the desired end-goal, and we walked through what I expected to
happen. Considering our settings, he explained that what would _actually_
happen was:
happen:
* During the day, Core Platform developers
* During the day, Core Platform developers would be notified when incidents
happened.
* Outside of the day shift, there would **always** be a 30 minute delay, dead
air, before anybody would be notified. After that 30 minute delay, the Core
Infrastructure team would receive the alert.
@ -63,20 +64,22 @@ Definitely not ideal.
## Hack-arounds
The PagerDuty and I switched gears and tried to find ways in which we could
arrive at something as close as possible to our desired end-state. We figured
out a couple options:
The PagerDuty developer and I switched gears and tried to find ways in which we
could arrive at something as close as possible to our desired end-state. We
figured out a couple options:
1. In the Core-Platform Schedule, Add a Secondary Layer built with the members of the Core-Infra Team
* We would get the desired effect of skipping the Core-Platform
Developers when outside of business hours.
* This option would also put the management of part of Core-Infra Team's
rotation in Core-Platform's hands, including managing overrides.
1. In the Core-Platform Escalation Policy, add the Core-Infra Schedule to the first notification in addition to Core-Platform
* This would require policy for Core-Infra to manually follow of only
picking up the escalation outside of business hours, with know way to
know if they should pick up or not.
1. In the `Core-Platform` Schedule, add a Secondary Layer built with the
members of the `Core-Infra` Team
* We would get the desired effect of skipping Core Platform developers when outside of business hours.
* This option would also put the management of part of Core
Infrastructure's rotation into Core Platform's hands, including the
management of explicit overrides.
1. In the `Core-Platform` Escalation Policy, add the `Core-Infra` Schedule to
the first notification in addition to the existing `Core-Platform` Schedule.
* This would require documented policy for engineers in Core Infrastructure
to only respond to incidents outside of the day shift, with no automated
way for them to know whether they can ignore an alert until they receive it.
1. Create duplicated Services for day time vs night time, with different
Escalation Policies to route to different groups. Then with event rules,
route alerts to the different services based on time of day.
@ -84,7 +87,7 @@ rotation in Core-Platform's hands, including managing overrides.
ever needed to change anything we'd have to change it on ALL services
running this style of escalations.
1. Keep the baked-in delayed response for night shift alerts.
* Obviously not a good choice for situations where every minute counts!
* Obviously not a good choice for situations where every minute counts!
1. Switch the Core Platform schedule to 24/7 by removing the restriction.
* Pushes developers into new and uncomfortable positions of being on-call
all the time, making team based escalations less appealing for adoption