Incident Initiation:
On December 3, 2024, at 2:03 PM CET, a defect was introduced in the West Europe (Production 0) environment of our Hive platform, specifically affecting the Content Library functionality. Users started reporting that templates from the Content Library were not opening, and an incident was created at 5:22 PM CET. The issue obstructed users' daily work and affected multiple customers.
Investigation:
The engineering team initiated an investigation at 5:30 PM CET on December 3, 2024. Initial analysis indicated an intermittent error when loading the VSTO Add-in during document creation. Further investigation revealed that the root cause was a race condition during the loading of the VSTO Add-in. The VSTO Add-in failed to load because it could not establish a bridge with the Web-App, which is the connection the VSTO Add-in needs to communicate with the Web-App. The issue was linked to a change in the order in which dependencies are loaded, following an upgrade of a frontend component responsible for managing dependencies during application startup. The VSTO Add-in incorrectly assumed that shared dependencies were preloaded in a specific order.
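As an illustration of this kind of race, the following is a minimal TypeScript sketch and not the actual Hive or VSTO Add-in code. The names BridgeRegistry, getNow, whenRegistered and openTemplate are hypothetical and chosen only for this example. It contrasts a lookup that assumes the Web-App function is already registered with one that waits for registration, so the result no longer depends on load order.

```typescript
// Hypothetical type for a function the Web-App exposes over the bridge.
type BridgeFn = (...args: unknown[]) => unknown;

// Hypothetical registry; illustrative only, not the real bridge implementation.
class BridgeRegistry {
  private readonly fns = new Map<string, BridgeFn>();
  private readonly waiters = new Map<string, Array<(fn: BridgeFn) => void>>();

  // Called by the Web-App once the dependency providing `name` has loaded.
  register(name: string, fn: BridgeFn): void {
    this.fns.set(name, fn);
    const pending = this.waiters.get(name) ?? [];
    this.waiters.delete(name);
    for (const notify of pending) {
      notify(fn);
    }
  }

  // Order-dependent lookup: throws if the caller runs before register().
  getNow(name: string): BridgeFn {
    const fn = this.fns.get(name);
    if (fn === undefined) {
      throw new Error(`bridge function "${name}" is not registered yet`);
    }
    return fn;
  }

  // Order-independent lookup: invokes the callback immediately if the
  // function is already registered, otherwise as soon as register() runs.
  whenRegistered(name: string, callback: (fn: BridgeFn) => void): void {
    const fn = this.fns.get(name);
    if (fn !== undefined) {
      callback(fn);
      return;
    }
    const list = this.waiters.get(name) ?? [];
    list.push(callback);
    this.waiters.set(name, list);
  }
}
```

With a registry like this, a consumer that calls whenRegistered before the Web-App has registered the function is simply notified later, whereas getNow only succeeds when the load order happens to match the consumer's assumption.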
Mitigation and Resolution:
To mitigate the incident, a workaround was provided to customers on December 3, at 8:31 PM CET, and further deployments were halted to limit additional impact. On December 4, at 8:15 AM CET, the engineering team continued working on a permanent solution. The team began by identifying the proper sequence for loading dependencies, ensuring that the function in the Web-App required by the VSTO Add-in was correctly registered.
Because it was not yet clear which change was the root cause when the deployment was halted, a deployment of another system spread the issue to all production clusters: West Europe (Production 1), East US (Production 2), Australia East (Production 3), Canada East (Production 4) and West Europe (Production 5).
At 2:07 PM CET, the engineering team implemented and deployed the permanent solution, restoring full functionality to the Content Library. At 8:39 PM CET, we discovered that the change in the loading behavior for dependencies had also caused an issue in another component, the Template Designer. A fix for this had already been developed and deployed to West Europe (Production 0). The engineering team quickly deployed the same fix to all other environments, fully resolving the issue at 9:34 PM CET.
Impact and Scope:
The incident affected multiple customers. It prevented users from opening templates from the Content Library and, later, prevented admins from using the Template Designer, which disrupted their daily activities. The incident impacted a significant number of users in all Production environments.
Post-Incident Actions:
The engineering team will conduct a thorough review of the dependency loading processes and implement stricter testing and monitoring to detect similar issues in the future. The upgrade procedures for the frontend component that manages dependencies will be revised so that updates to this component remain compatible with the VSTO Add-in. Additionally, our internal processes and tools will be changed so that faulty deployments are halted more consistently, reducing the chance of similar errors occurring in the future.
We sincerely apologize for the disruption this incident caused and reaffirm our commitment to providing reliable and stable services for our users.