Degraded performance detected on West Europe (Production 1)

Incident Report for Templafy

Postmortem

Investigation
The incident began on December 17, 2025, at 9:19 AM CET, when users experienced noticeable slowness while accessing the West Europe (Production 1) environment. The issue was detected at 9:28 AM CET through monitoring alerts indicating elevated SQL CPU utilization in the Templafy Hive environment.
The engineering team initiated an investigation immediately upon detection. Early analysis focused on recent application-level changes and database behavior that could influence query execution and performance. As part of this process, the team reviewed active feature flags and identified a recently enabled query-related optimization as a potential contributing factor.
At approximately 9:30 AM CET, the team disabled the identified feature flag as an initial mitigation step. However, SQL CPU utilization remained elevated, indicating that additional factors were contributing to the degradation.
Further investigation into database telemetry and query execution behavior showed that, under certain data conditions, the SQL engine intermittently selected inefficient execution plans. These plans significantly increased CPU consumption and contributed to the observed performance impact.

Mitigation
When the initial feature flag rollback did not sufficiently reduce database load, the engineering team proceeded with infrastructure-level mitigation.
At 9:45 AM CET, the SQL elastic pool was scaled up to increase available capacity. This action resulted in a short-term improvement; however, CPU contention recurred shortly afterward.
To fully stabilize the environment, a second scaling operation was performed at 10:04 AM CET, increasing database and elastic pool capacity further. This action successfully alleviated CPU pressure and restored stable performance across the affected environment.

Resolution
By 10:06 AM CET on December 17, 2025, platform responsiveness had returned to normal for users in West Europe (Production 1). Continued monitoring following the second scaling operation confirmed that SQL CPU utilization remained within healthy thresholds and no further degradation was observed.
The incident was considered fully resolved once sustained stability was verified.

Post-Incident Actions
The engineering team is taking several follow-up actions to reduce the likelihood and impact of similar incidents in the future:

  • Implementing database schema and query improvements to reduce the risk of inefficient execution plans under high-load scenarios.
  • Adjusting feature flag rollout practices to ensure more gradual exposure when testing query-related changes.
  • Evaluating controlled operational mechanisms, with appropriate auditability and approvals, for safely intervening when the database selects problematic execution plans.

These actions aim to strengthen system resilience and improve response efficiency.

Impact and Scope
This incident affected a subset of users accessing the West Europe (Production 1) cluster. Impacted users experienced intermittent slowness while interacting with the platform during the incident window. No data loss was observed, and the issue was isolated to this single production cluster.

Posted Dec 22, 2025 - 10:07 CET

Resolved

The incident has been resolved, and further information will be provided in a postmortem shortly.

We apologize for the impact to affected customers.
Posted Dec 17, 2025 - 16:04 CET

Monitoring

The incident has been mitigated, and our team is actively monitoring the affected systems to ensure ongoing stability and performance.
Posted Dec 17, 2025 - 10:06 CET

Identified

We have identified an issue that affects a subset of customers and are working towards a resolution.
Further updates will be posted here soon.
Posted Dec 17, 2025 - 09:19 CET
This incident affected: Templafy Hive (Account Management, Library & Dynamics, User Management).