On June 20, 2024, at 7:58 PM CET, an incident was initiated because uploaded presentation templates were not being fully processed due to the failure of slide preview generation. The issue was introduced at 2:05 PM CET and detected by a user at 4:50 PM CET while using the document creation public API. The engineering team began investigating at 8:05 PM CET. The incident was confirmed to affect more than one tenant and a large number of users, and it caused a high rate of exceptions and errors in the system logs. At 8:15 PM CET, the engineering team identified the recent release changes, specifically a minor version increase in an external library, as the potential cause and began reverting them. By 9:00 PM CET, the mitigation changes had been validated in our testing environment and prepared for production release. At 9:49 PM CET, the issue was resolved on West Europe (Production 0).
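This incident is a reminder of the importance of thorough testing and monitoring after release changes, including minor version upgrades of external libraries. The issue was first detected by a user, 2 hours and 45 minutes after it was introduced. Once the incident was initiated, the engineering team's prompt response and the revert of the release changes limited the impact, and the issue was resolved in under two hours.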