Content Library Asset Insertion and Document Creation Actions Periodically Fail

Incident Report for Templafy

Postmortem

Incident Initiation:

On December 3, 2024, at 2:03 PM CET, a defect was introduced in the West Europe (Production 0) environment of our Hive platform, specifically affecting the Content Library functionality. Users started reporting that templates from the Content Library were not opening, and an incident was created at 5:22 PM CET. This issue obstructed users' daily work, affecting multiple customers.

Investigation:

The engineering team initiated an investigation at 5:30 PM CET on December 3, 2024. Initial analysis indicated an intermittent error loading the VSTO Add-in when creating a document. Further investigation revealed that the root cause was a race condition during the loading of the VSTO Add-in. The VSTO Add-in was failing to load because it could not establish a bridge with the Web-App, the connection needed for the VSTO Add-in to communicate with the Web-App. This issue was linked to a change in the order of loading dependencies following an upgrade of a frontend component responsible for managing dependencies during the application startup. The VSTO Add-in incorrectly assumed that shared dependencies were being preloaded in a specific order.
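As an illustration of this class of race condition, the sketch below shows one way a consumer can wait for a bridge function to be registered instead of assuming it was preloaded. It is a minimal sketch under stated assumptions, not the actual Templafy implementation: the `BridgeRegistry` class, its methods, and the `openTemplate` function name are all hypothetical.

```typescript
// Hypothetical sketch: BridgeRegistry and "openTemplate" are illustrative
// names, not part of any real Templafy API.
type BridgeFn = (...args: unknown[]) => unknown;

class BridgeRegistry {
  private readonly fns = new Map<string, BridgeFn>();
  private readonly waiters = new Map<string, Array<(fn: BridgeFn) => void>>();

  // Called by the provider side whenever its code happens to load.
  register(name: string, fn: BridgeFn): void {
    this.fns.set(name, fn);
    const pending = this.waiters.get(name) ?? [];
    this.waiters.delete(name);
    for (const resolve of pending) resolve(fn);
  }

  // Synchronous lookup: returns undefined if the function is not there yet.
  tryGet(name: string): BridgeFn | undefined {
    return this.fns.get(name);
  }

  // Called by the consumer side: resolves once the function exists,
  // regardless of load order, and rejects if it never appears in time.
  whenRegistered(name: string, timeoutMs: number): Promise<BridgeFn> {
    const existing = this.fns.get(name);
    if (existing) return Promise.resolve(existing);
    return new Promise<BridgeFn>((resolve, reject) => {
      const timer = setTimeout(
        () => reject(new Error(`bridge function "${name}" not registered within ${timeoutMs} ms`)),
        timeoutMs,
      );
      const list = this.waiters.get(name) ?? [];
      list.push((fn) => {
        clearTimeout(timer);
        resolve(fn);
      });
      this.waiters.set(name, list);
    });
  }
}

// Usage: the consumer asks first and the provider registers later;
// the call still succeeds because nothing depends on load order.
const registry = new BridgeRegistry();
registry
  .whenRegistered("openTemplate", 5000)
  .then((open) => console.log(open("template-1")));
registry.register("openTemplate", (id) => `opened:${id}`);
```

The design choice here is that the consumer states what it needs and waits for it, with a timeout that turns a silent load failure into an explicit error.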

Mitigation and Resolution:

To mitigate the incident, a workaround was provided to customers on December 3 at 8:31 PM CET, and further deployments were halted to limit additional impact. On December 4, at 8:15 AM CET, the engineering team continued working on a permanent solution. The team began by identifying the proper sequence for loading dependencies, ensuring that the function in the Web-App required by the VSTO Add-in was correctly registered.
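One general way to make a loading sequence explicit, rather than relying on whatever order a dependency manager happens to produce, is to declare each module's dependencies and derive the order from them. The following is a minimal sketch of that idea using a depth-first topological sort. The module names and the `loadOrder` function are hypothetical, and this is not necessarily how the permanent solution was implemented.

```typescript
// Hypothetical sketch: module names are invented for illustration and do not
// reflect Templafy's actual dependency graph.
type DependencyGraph = Record<string, string[]>;

// Returns module names ordered so that every module appears after all of
// the modules it depends on. Throws if the graph contains a cycle.
function loadOrder(graph: DependencyGraph): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string): void => {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`dependency cycle at "${name}"`);
    }
    state.set(name, "visiting");
    for (const dep of graph[name] ?? []) visit(dep);
    state.set(name, "done");
    order.push(name);
  };

  for (const name of Object.keys(graph)) visit(name);
  return order;
}

const exampleGraph: DependencyGraph = {
  "vsto-bridge-client": ["web-app-bridge"],
  "web-app-bridge": ["shared-runtime"],
  "shared-runtime": [],
};

// Prints: shared-runtime, web-app-bridge, vsto-bridge-client
console.log(loadOrder(exampleGraph).join(", "));
```

With the dependencies declared this way, an upgrade that changes the loader's default ordering cannot silently change the order in which these modules start.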

At the time deployments were halted, it was not yet clear which change was the root cause of the issue. As a result, a deployment of another system spread the issue to all production clusters: West Europe (Production 1), East US (Production 2), Australia East (Production 3), Canada East (Production 4) and West Europe (Production 5).

At 2:07 PM CET, the engineering team implemented and deployed the permanent solution, restoring full functionality to the Content Library. At 8:39 PM CET, we discovered that the change in the loading behavior for dependencies also caused an issue in another component, the Template Designer. A fix for this had already been developed and deployed to West Europe (Production 0). The engineering team quickly deployed the same fix to all other environments as well, fully resolving the issue at 9:34 PM CET.

Impact and Scope:

The incident affected multiple customers. It prevented users from opening templates from the Content Library and, later, prevented admins from using the Template Designer, disrupting their daily activities. The incident impacted a significant number of users in all Production environments.

Post-Incident Actions:

The engineering team will conduct a thorough review of the dependency loading processes and implement stricter testing and monitoring to detect such issues in the future. The upgrade procedures for the frontend component for dependency management will be revised to prevent similar issues from occurring, ensuring that updates to this component maintain compatibility with the VSTO Add-in. Additionally, our internal processes and tools will be changed so that faulty deployments are halted more consistently, reducing the chance of similar errors occurring in the future.

We sincerely apologize for the disruption this incident caused and reaffirm our commitment to providing reliable and stable services for our users.

Posted Dec 06, 2024 - 10:32 CET

Resolved

The incident has been resolved, and further information will be provided in a postmortem shortly.

We apologize for the impact to affected customers.

Posted Dec 04, 2024 - 14:22 CET

Monitoring

The incident has been successfully mitigated, and our team is actively monitoring the situation to ensure ongoing stability and performance. We are observing the systems to prevent any further disruptions.

Posted Dec 04, 2024 - 13:49 CET

Update

We are actively deploying a potential fix and continuing to investigate the root cause of the issue. While we anticipate this may provide partial mitigation, our team is committed to achieving a full resolution as quickly as possible.

Posted Dec 04, 2024 - 12:29 CET

Identified

Last night, we identified an issue affecting our West Europe (Production 0) environment and believed it was resolved using a workaround, moving the status to monitoring. However, as of this morning, we have observed that the issue has propagated to additional production environments, starting at 7:48 AM CET. As a result, we have moved the status back to identified and are actively working on resolving the broader impact.

Thank you for your patience.

Posted Dec 04, 2024 - 09:42 CET

Monitoring

Posted Dec 03, 2024 - 21:04 CET

Identified

We have identified an issue that affects a subset of customers and are working towards a resolution.
Further updates will be posted here soon.

Posted Dec 03, 2024 - 17:41 CET

This incident affected: Templafy Hive (Library & Dynamics).