The root cause of the incident was due to an inefficient database query included in a deployment aimed at resolving issues with the display of all-day meetings. The query caused significant performance degradation shortly after release. The issue was identified and resolved within one hour through a rollback and code reversion.
Impact:
A database outage caused significant platform performance degradation, this was caused by a deployment that included an inefficient query for loading the schedule.
The issue affected the performance of the platform, leading to significant degradation shortly after the release.
The incident started at approximately 10:00 AM, with the deployment initiated at 09:50 AM. The issue was resolved by 11:03 AM.
Resolution Steps:
Incident Detection: The performance issues were noticed around 10:00 AM, shortly after the deployment.
Rollback: The deployment was immediately rolled back to the previous stable version.
Code Reversion: The code changes introducing the inefficient query were reverted to prevent further impact.
Service Recovery: The rollback successfully restored the platform's performance by 11:03 AM.
Mitigations:
Query Performance Auditing:
Implement additional processes to review high impact database queries, particularly those involved in critical user-facing functionalities.
Expand Development and QA Databases to introduce further replication of production data size.
Conclusion:
The database outage resulted from an inefficient query deployed to production that required a greater breadth of production-like conditions during Development and QA processes. A quick resolution occurred by rolling back the code to a previously stable version. Moving forward, enhancements to the QA process and deployment procedures will be prioritized to prevent recurrence.