Incident #003
Generated by MRX Admin on 2 Dec 2024 14:42. All timestamps are local to Europe/Budapest.
Incident Type: Default
Severity: Minor
No custom fields have been set for this incident.
Declared at: 2 Dec 2024 13:53
Resolved at: 2 Dec 2024 14:40
Identified at: 2 Dec 2024 14:23
Fixed at: 2 Dec 2024 14:29
Incident duration: 47 minutes
Incident Lead: MRX Admin
Reporter: MRX Admin
Active participants: MRX Admin
Summary
Problem: Our link shortener site was not updating the main routes. The issue was also reported to the Dub team.
Incident Timeline
2024-12-02
13:53:26
Incident reported by MRX Admin
MRX Admin reported the incident
Severity: Minor
Status: Investigating
14:23:29
Status changed from Investigating → Fixing
MRX Admin shared an update
Status: Investigating → Fixing
The Dub team is on its way to fix the main problem.
14:29:53
Status changed from Fixing → Monitoring
MRX Admin shared an update
Status: Fixing → Monitoring
The Dub team has found a fix for the issue, and we are now monitoring the results.
14:40:41
Incident resolved and entered the post-incident flow
MRX Admin shared an update
Status: Monitoring → Documenting
The problem was resolved by the Dub team.
Contributors
Outline any factors that played a role in this incident happening, or it being as bad as it was.
This could be technical (e.g., “the server’s disk filled up”), human (e.g., “Sara missed her first page”), or external (e.g., “this coincided with a marketing email being sent to our customers”).
Cover as many of the factors as you can without over-focusing on one “root” cause.
e.g. A recent deployment led to the authentication service failing to start. This led to users being unable to log in.
Mitigators
Outline any factors that reduced the incident's impact or prevented it from being worse than it was.
This might include external factors (e.g., “it was lucky it happened during work hours”), effective technical controls (e.g., ”our alerting caught this quickly” or ”our auto-rollback worked as expected”), or having the right person on call.
Highlighting these elements helps identify what's working well and what's worth reinforcing or scaling.
e.g. We recently deployed some alerting changes that meant we were paged within 30 seconds of failed deployment landing in production.
Learnings and risks
Capture anything we learned as a result of responding to or investigating this incident, and any risks that were revealed or highlighted.
Think about how to improve response next time, and consider any patterns pointing to broader issues, like “key person risk.”
e.g. We don’t have a reliable way to find and surface the right runbooks for incidents like this. There’s a risk the wrong actions are taken or we miss things that would otherwise help to resolve them more quickly.
Follow-ups
Last updated