Investigation
The incident began on December 17, 2025, at 9:19 AM CET, when users experienced noticeable slowness while accessing the West Europe (Production 1) environment. The issue was detected at 9:28 AM CET through monitoring alerts indicating elevated SQL CPU utilization in the Templafy Hive environment.
The engineering team initiated an investigation immediately upon detection. Early analysis focused on recent application-level changes and database behavior that could influence query execution and performance. As part of this process, the team reviewed active feature flags and identified a recently enabled query-related optimization as a potential contributing factor.
At approximately 9:30 AM CET, the team disabled the identified feature flag as an initial mitigation step. However, SQL CPU utilization remained elevated, indicating that additional factors were contributing to the degradation.
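For illustration only, the sketch below shows how a query-related optimization of this kind is typically gated behind a feature flag so that it can be switched off at runtime without a deployment. The flag name, the feature_flags client, and the repository methods are hypothetical placeholders, not Templafy's actual implementation.

```python
# Hypothetical illustration of a flag-guarded query path: disabling the flag
# routes traffic back to the established query shape without a redeploy.
# Flag name and collaborators are placeholders, not Templafy's actual code.
def fetch_documents(feature_flags, repository, tenant_id):
    if feature_flags.is_enabled("optimized-document-query"):  # hypothetical flag name
        # New, flag-gated query path (the optimization under suspicion).
        return repository.get_documents_optimized(tenant_id)
    # Established query path used once the flag is disabled.
    return repository.get_documents(tenant_id)
```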
Further investigation into database telemetry and query execution behavior showed that, under certain data conditions, the SQL engine intermittently selected inefficient execution plans. These plans significantly increased CPU consumption and contributed to the observed performance impact.
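As a rough sketch of this kind of analysis (not the exact tooling used during the incident), the snippet below queries SQL Server's built-in sys.dm_exec_query_stats and sys.dm_exec_sql_text views to surface the statements consuming the most CPU, which is one way plan regressions like this can be spotted. The server, database, and authentication details are placeholder values.

```python
# Illustrative sketch: list the top CPU-consuming query statements from SQL
# Server's standard DMVs. Server, database, and credentials are placeholders.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=example-sql-server.database.windows.net;"  # placeholder server
    "DATABASE=example-db;"                             # placeholder database
    "Authentication=ActiveDirectoryDefault;"
)

TOP_CPU_QUERIES = """
SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
"""

with pyodbc.connect(CONN_STR) as conn:
    for row in conn.execute(TOP_CPU_QUERIES):
        print(f"{row.total_cpu_ms:>12} ms total  {row.execution_count:>8} execs  {row.statement_text[:80]}")
```

From there, the execution plans of the top offenders can be compared against earlier baselines to confirm whether the engine has switched to a less efficient plan for certain data distributions.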
Mitigation
When the initial feature flag rollback did not sufficiently reduce database load, the engineering team proceeded with infrastructure-level mitigation.
At 9:45 AM CET, the SQL Elastic Pool was scaled up to increase available capacity. This action produced a short-term improvement; however, CPU contention recurred shortly afterward.
To fully stabilize the environment, a second scaling operation was performed at 10:04 AM CET, increasing database and elastic pool capacity further. This action successfully alleviated CPU pressure and restored stable performance across the affected environment.
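The report does not state which tooling was used to perform the scaling operations; the sketch below shows one way such a scale-up could be carried out with the Azure SDK for Python (azure-identity and azure-mgmt-sql). The subscription, resource group, server, pool name, SKU, and target capacity are placeholder values, not those used in this incident.

```python
# Minimal sketch, assuming the Azure SDK for Python; all resource names,
# the SKU, and the capacity below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import ElasticPoolUpdate, Sku

subscription_id = "00000000-0000-0000-0000-000000000000"  # placeholder
client = SqlManagementClient(DefaultAzureCredential(), subscription_id)

# Scale the elastic pool to a larger vCore capacity and wait for the
# long-running operation to finish before re-checking CPU telemetry.
poller = client.elastic_pools.begin_update(
    resource_group_name="example-rg",
    server_name="example-sql-server",
    elastic_pool_name="example-pool",
    parameters=ElasticPoolUpdate(
        sku=Sku(name="GP_Gen5", tier="GeneralPurpose", capacity=16),
    ),
)
pool = poller.result()
print(f"Elastic pool now at {pool.sku.name}, capacity {pool.sku.capacity}")
```

Because begin_update returns a long-running-operation poller, waiting on result() before re-checking CPU utilization avoids declaring the mitigation complete while the scale-up is still in flight.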
Resolution
By 10:06 AM CET on December 17, 2025, platform responsiveness had returned to normal for users in West Europe (Production 1). Continued monitoring following the second scaling operation confirmed that SQL CPU utilization remained within healthy thresholds and no further degradation was observed.
The incident was considered fully resolved once sustained stability was verified.
Post-Incident Actions
The engineering team is taking several follow-up actions to reduce the likelihood and impact of similar incidents in the future.
These actions aim to strengthen system resilience and improve response efficiency.
Impact and Scope
This incident affected a subset of users accessing the West Europe (Production 1) cluster. Impacted users experienced intermittent slowness while interacting with the platform during the incident window. No data loss was observed, and the issue was isolated to this single production cluster.