Tom Dyer
Systems Engineer - L.I.V.E. - Solium
http://tomdyer.ca / @thomaswsdyer
Solium's Monitoring and Alerting System
Nagios watches the infrastructure.
Admiral Ackbar looks at behaviour:
We do this monitoring through "Traps".
An email to the entire team
Pre-formatted text for JIRA
Step-by-Step Resolution Instructions
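To make the three outputs of a trap concrete, here is a minimal sketch. It assumes a trap is a behaviour check that, when it fires, produces an email subject, text ready to paste into JIRA, and numbered resolution steps. The `Trap` class, its fields, and the payload keys are hypothetical and are not Admiral Ackbar's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trap:
    """Hypothetical model of a behavioural check (illustration only)."""
    name: str
    check: Callable[[], bool]  # returns True when the unwanted behaviour is seen
    resolution_steps: List[str] = field(default_factory=list)

    def spring(self) -> Optional[dict]:
        """Run the check; return a notification payload if the trap fired."""
        if not self.check():
            return None
        # Number the resolution steps so they read as step-by-step instructions.
        steps = "\n".join(
            f"{i}. {step}" for i, step in enumerate(self.resolution_steps, 1)
        )
        return {
            "email_subject": f"[TRAP] {self.name}",
            "jira_text": f"Summary: {self.name}\nResolution:\n{steps}",
            "resolution": steps,
        }


# Example trap with a stand-in check that always fires.
stuck_jobs = Trap(
    name="Jobs stuck in queue",
    check=lambda: True,
    resolution_steps=["Restart the worker", "Verify the queue drains"],
)
payload = stuck_jobs.spring()
```

Keeping the resolution steps on the trap itself means the person reading the email does not have to look them up elsewhere.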
OpsGenie for Alerting
Two parallel schedules for different alert "types".
"Production alerts"
3 people
"Business Hours"
6 people
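The two parallel schedules can be pictured as two independent rotations, selected by alert type. This is a rough sketch under assumptions: the responder names are placeholders, the round-robin rule is invented for illustration, and none of it uses OpsGenie's API.

```python
# Hypothetical rotations matching the sizes on the slide: 3 and 6 people.
SCHEDULES = {
    "production": ["prod-1", "prod-2", "prod-3"],
    "business_hours": ["bh-1", "bh-2", "bh-3", "bh-4", "bh-5", "bh-6"],
}


def on_call(alert_type: str, shift: int) -> str:
    """Return who is on call for this alert type during the given shift number."""
    members = SCHEDULES[alert_type]
    # Simple round-robin: each schedule rotates independently of the other.
    return members[shift % len(members)]
```

With three people on the production rotation, each person is on call every third shift, twice as often as anyone on the six-person business-hours rotation.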
Directed Alerts!
Auto Close alerts!
Links to Wiki playbooks!
OpsGenie!
Alert / Pager Fatigue for 3 people
Some alerts could only be handled by certain people
More schedules = More overhead
Wait...who's on call?!?
OpsGenie with ONE Schedule!
Alerts go to HipChat and Pager.
Every alert is actionable by everyone!
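A single schedule with two delivery channels might look like the sketch below. It assumes one shared rotation and treats HipChat and the pager as plain callables; the `dispatch` function and the channel names are hypothetical stand-ins, not real OpsGenie or HipChat calls.

```python
from typing import Callable, Dict, List, Tuple

Sender = Callable[[str, str], None]  # (responder, message) -> None


def dispatch(alert: str, rotation: List[str], shift: int,
             channels: Dict[str, Sender]) -> str:
    """Send one alert to every channel, addressed to the single on-call person."""
    responder = rotation[shift % len(rotation)]
    for send in channels.values():
        send(responder, alert)
    return responder


# Stand-in channels that record deliveries instead of calling real services.
sent: List[Tuple[str, str, str]] = []
channels: Dict[str, Sender] = {
    "hipchat": lambda who, msg: sent.append(("hipchat", who, msg)),
    "pager": lambda who, msg: sent.append(("pager", who, msg)),
}

who = dispatch("Queue backlog", ["alice", "bob", "carol"], 4, channels)
```

Because every alert is actionable by everyone, the routing no longer needs to know the alert type. The only question left is whose turn it is.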
Admiral Ackbar for multiple business units!
More Traps!
Automagic Remediation?
Questions?