Incident Initiation
On February 13, 2025, at 1:27 AM CET, an issue was introduced that led to excessive CPU utilization caused by heavy database usage in West Europe (Production 1). The issue remained undetected for nearly 12 hours, until 1:18 PM CET the same day, when monitoring systems flagged abnormal resource consumption. The impact was significant: multiple tenants and a large number of users experienced degraded performance, which produced a high rate of exceptions and errors in system logs and hindered application functionality.
Investigation
The engineering team began investigating on February 13, 2025, at 1:36 PM CET, 18 minutes after the issue was detected. The initial hypothesis was that the issue stemmed from an inefficient database query or an increase in workload. Further analysis confirmed that heavy database usage was the root cause of the CPU saturation, which in turn degraded performance.
Mitigation and Resolution
To mitigate the incident, the engineering team scaled up the database resources at 1:40 PM CET to stabilize the application, then monitored system performance continuously to confirm stability. By 2:29 PM CET, the application had stabilized and error rates had dropped significantly. The team planned to revert the database resources to normal levels by the following morning.
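For reference, the intervals between the timestamps reported above can be worked out with a short script. This is only an illustrative sketch: the event labels and helper names are our own, and every time is taken from this report as a CET clock time on February 13, 2025.

```python
from datetime import datetime

INCIDENT_DAY = "2025-02-13"  # all reported events fall on this day (CET)


def at(clock: str) -> datetime:
    """Parse a 24-hour clock time on the incident day."""
    return datetime.strptime(f"{INCIDENT_DAY} {clock}", "%Y-%m-%d %H:%M")


# Event labels are illustrative; the times come from the report.
timeline = {
    "introduced": at("01:27"),
    "detected": at("13:18"),
    "investigation_started": at("13:36"),
    "scaled_up": at("13:40"),
    "stabilized": at("14:29"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


def fmt(minutes: int) -> str:
    """Render a minute count as hours and minutes."""
    return f"{minutes // 60} h {minutes % 60:02d} min"


time_to_detect = minutes_between("introduced", "detected")
detect_to_mitigation = minutes_between("detected", "scaled_up")
mitigation_to_stable = minutes_between("scaled_up", "stabilized")
total_duration = minutes_between("introduced", "stabilized")

print("Time to detect:          ", fmt(time_to_detect))
print("Detection to scale-up:   ", fmt(detect_to_mitigation))
print("Scale-up to stabilized:  ", fmt(mitigation_to_stable))
print("Introduced to stabilized:", fmt(total_duration))
```

Detection accounts for most of the elapsed time (11 h 51 min of 13 h 02 min in total), while the scale-up followed detection by 22 minutes.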
Impact and Scope
The incident impacted multiple tenants across various clusters, leading to performance degradation for affected users. The issue was widespread, affecting application responsiveness and generating increased system errors.
Post-Incident Actions
In response to this incident, the engineering team will conduct a detailed review of database query efficiency and workload distribution. The team will also improve its procedures to monitor resource consumption more closely and will introduce alerting mechanisms for early anomaly detection.
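As an illustration of what early anomaly alerting could look like, the sketch below raises an alert only when CPU utilization stays at or above a threshold for several consecutive samples, so that brief spikes are ignored. It is a minimal sketch under stated assumptions, not the team's actual implementation: the class name, the 85% threshold, and the five-sample window are all hypothetical values chosen for the example.

```python
from collections import deque
from typing import Iterable, Optional


class SustainedCpuAlert:
    """Fires when every sample in a sliding window meets the threshold.

    threshold_pct and window are hypothetical defaults for illustration.
    """

    def __init__(self, threshold_pct: float = 85.0, window: int = 5) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold_pct = threshold_pct
        self._recent: deque = deque(maxlen=window)

    def observe(self, cpu_pct: float) -> bool:
        """Record one CPU sample; return True if the alert condition holds."""
        self._recent.append(cpu_pct)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(
            sample >= self.threshold_pct for sample in self._recent
        )


def first_alert_index(
    samples: Iterable[float], threshold_pct: float = 85.0, window: int = 5
) -> Optional[int]:
    """Return the index of the first sample that triggers the alert, if any."""
    alert = SustainedCpuAlert(threshold_pct, window)
    for index, cpu_pct in enumerate(samples):
        if alert.observe(cpu_pct):
            return index
    return None


# A short spike does not alert; sustained load does.
spike = [40.0, 95.0, 42.0, 41.0, 39.0, 40.0]
sustained = [40.0, 90.0, 92.0, 95.0, 91.0, 93.0]
print(first_alert_index(spike))
print(first_alert_index(sustained))
```

Requiring a full window of high readings trades a short detection delay for fewer false alarms; with one sample per minute, for example, a five-sample window would flag sustained saturation within about five minutes.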
We sincerely apologize for the disruption caused by this incident. Our engineering team is committed to ensuring service reliability and stability. We appreciate our customers' patience and understanding as we continue to enhance our monitoring and mitigation processes.