Incident #003
Generated by MRX Admin on 2 Dec 2024 14:42. All timestamps are local to Europe/Budapest.
Incident Type: Default
Severity: Minor
No custom fields have been set for this incident.
Declared at: 2 Dec 2024 13:53
Resolved at: 2 Dec 2024 14:40
Identified at: 2 Dec 2024 14:23
Fixed at: 2 Dec 2024 14:29
Incident duration: 47 minutes
Incident Lead: MRX Admin
Reporter: MRX Admin
Active participants: MRX Admin
Summary
Problem: Our link shortener site was not updating the main routes. The issue was also reported to the Dub team.
Incident Timeline
2024-12-02
13:53:26
Incident reported by MRX Admin
MRX Admin reported the incident
Severity: Minor
Status: Investigating
14:23:29
Status changed from Investigating → Fixing
MRX Admin shared an update
Status: Investigating → Fixing
The Dub team is on its way to fix the main problem.
14:29:53
Status changed from Fixing → Monitoring
MRX Admin shared an update
Status: Fixing → Monitoring
The Dub team has found a fix for the issue, and we are now monitoring the results.
14:40:41
Incident resolved and entered the post-incident flow
MRX Admin shared an update
Status: Monitoring → Documenting
The problem was resolved by the Dub team.
Contributors
Outline any factors that played a role in this incident happening, or it being as bad as it was.
This could be technical (e.g., “the server’s disk filled up”), human (e.g., “Sara missed her first page”), or external (e.g., “this coincided with a marketing email being sent to our customers”).
Cover as many of the factors as you can without over-focusing on one “root” cause.
e.g. A recent deployment led to the authentication service failing to start. This led to users being unable to log in.
Mitigators
Outline any factors that reduced the incident's impact or prevented it from being worse than it was.
This might include external factors (e.g., “it was lucky it happened during work hours”), effective technical controls (e.g., ”our alerting caught this quickly” or ”our auto-rollback worked as expected”), or having the right person on call.
Highlighting these elements helps identify what's working well and what's worth reinforcing or scaling.
e.g. We recently deployed some alerting changes that meant we were paged within 30 seconds of failed deployment landing in production.
Learnings and risks
Capture anything we learned as a result of responding to or investigating this incident, and any risks that were revealed or highlighted.
Think about how to improve response next time, and consider any patterns pointing to broader issues, like “key person risk.”
e.g. We don’t have a reliable way to find and surface the right runbooks for incidents like this. There’s a risk the wrong actions are taken or we miss things that would otherwise help to resolve them more quickly.
Follow-ups
Last updated