Service degradation - Users encounter issues when accessing West Europe (Production 1)

Incident Report for Templafy

Postmortem

Incident Initiation

On February 13, 2025, at 1:27 AM CET, an issue was introduced that led to excessive CPU utilization caused by heavy database usage in West Europe (Production 1). The issue remained undetected for nearly 12 hours, until 1:18 PM CET the same day, when monitoring systems flagged abnormal resource consumption. The impact was significant, affecting multiple tenants and a large number of users. The degraded performance produced a high rate of exceptions and errors in system logs and hindered application functionality.

Investigation

The engineering team began investigating on February 13, 2025, at 1:36 PM CET. Their working hypothesis was that the issue stemmed from an inefficient database query or an increase in workload. Further analysis confirmed that heavy database usage was driving CPU utilization to its maximum, which caused the performance degradation.

Mitigation and Resolution

To mitigate the incident, the engineering team scaled up the database resources at 1:40 PM CET to stabilize the application, then monitored system performance continuously to confirm stability. By 2:29 PM CET, the application had stabilized and error rates had dropped significantly. The team planned to revert the database resources to normal levels by the following morning.
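
For readers who want the timeline as durations, the short sketch below derives them from the timestamps stated in this report. The variable and function names are illustrative only, and all times are treated as same-day CET values.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps as stated in this report (all CET, February 13, 2025).
introduced = datetime.strptime("2025-02-13 01:27", FMT)
detected = datetime.strptime("2025-02-13 13:18", FMT)
scaled_up = datetime.strptime("2025-02-13 13:40", FMT)
stabilized = datetime.strptime("2025-02-13 14:29", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


time_to_detect = minutes_between(introduced, detected)      # introduction -> monitoring flag
time_to_mitigate = minutes_between(detected, scaled_up)     # monitoring flag -> scale-up
time_to_stabilize = minutes_between(scaled_up, stabilized)  # scale-up -> stable application

print(f"Time to detect:    {time_to_detect} min")     # 711 min (11 h 51 min)
print(f"Time to mitigate:  {time_to_mitigate} min")   # 22 min
print(f"Time to stabilize: {time_to_stabilize} min")  # 49 min
```

Most of the total duration falls before detection, which is consistent with the focus on earlier anomaly detection in the post-incident actions.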

Impact and Scope

The incident impacted multiple tenants across various clusters, leading to performance degradation for affected users. The issue was widespread, affecting application responsiveness and generating increased system errors.

Post-Incident Actions

In response to this incident, the engineering team will conduct a detailed review of database query efficiency and workload distribution. Additional procedural improvements will be made to monitor resource consumption more closely and to introduce alerting mechanisms for early anomaly detection.
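
As a rough illustration of what early anomaly detection on resource consumption can look like, the sketch below flags CPU utilization that stays above a threshold for several consecutive samples. It is a generic example and not a description of Templafy's monitoring: the function name, the 90% threshold, the three-sample window, and the sample data are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def first_sustained_breach(
    samples: Sequence[float],
    threshold: float = 90.0,  # assumed CPU % threshold, illustrative only
    window: int = 3,          # assumed number of consecutive samples required
) -> Optional[int]:
    """Return the index of the sample that completes `window` consecutive
    readings above `threshold`, or None if that never happens."""
    consecutive = 0
    for index, value in enumerate(samples):
        if value > threshold:
            consecutive += 1
            if consecutive >= window:
                return index
        else:
            # A single reading at or below the threshold resets the streak,
            # so short spikes do not trigger an alert.
            consecutive = 0
    return None


# Hypothetical CPU utilization samples (percent), one per polling interval.
cpu_percent = [35.0, 42.0, 95.0, 60.0, 93.0, 97.0, 99.0, 98.0]
print(first_sustained_breach(cpu_percent))  # 6
```

Requiring a sustained breach rather than a single high reading is a common way to trade a small detection delay for fewer false alarms; the right threshold and window depend on the workload.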

We sincerely apologize for the disruption caused by this incident. Our engineering team is committed to ensuring service reliability and stability. We appreciate our customers' patience and understanding as we continue to enhance our monitoring and mitigation processes.

Posted Feb 20, 2025 - 15:51 CET

Resolved

The incident has been resolved, and further information will be provided in a postmortem shortly.

We apologize for the impact on affected customers.

Posted Feb 13, 2025 - 14:42 CET

Monitoring

The incident has been successfully mitigated, and our team is actively monitoring the situation to ensure ongoing stability and performance. We are observing the systems to prevent any further disruptions.

Posted Feb 13, 2025 - 13:39 CET

Identified

We have identified an issue that affects a subset of customers and are working towards a resolution.
Further updates will be posted here soon.

Posted Feb 13, 2025 - 13:10 CET

This incident affected: Templafy Hive (Library & Dynamics).