Website unavailable
Incident Report for Templafy
Postmortem

Summary of the incident:
On January 27th, 2020, the Templafy web app was unreachable. Tenants were unavailable, and users saw a blank page in the web browser when attempting to access it. Our automated alert systems notified our team of the degraded performance at 5:28 UTC.
Investigations were initiated immediately and the issue was quickly identified.

The Templafy web app was fully operational again at 6:15 UTC.

Details of the incident:
At 5:28 UTC our automated alert systems detected an issue.
Investigation showed that a number of servers were not serving incoming traffic as expected, which caused the remaining working servers to become overloaded by the unexpectedly heavy load.

A more detailed overview of the incident is described in the following breakdown:

  • At 5:00 UTC: our Azure automated workload balancer spun up additional servers as planned in order to account for the expected traffic coming in later in the day.
  • At 5:02 UTC: all additional servers were running, but due to a platform-related issue on Microsoft’s side, these servers did not serve incoming requests. We have reported the issue to Microsoft.
  • At 5:13 UTC: our initial running servers started to respond slowly under the heavy load, as the additional servers did not respond to traffic.
  • At 5:28 UTC: our automated systems detected there was a problem and investigations were initiated.
  • At 5:38 UTC: all servers were restarted.
  • At 5:48 UTC: traffic started to be served again.
  • At 6:15 UTC: performance was back to full. We have reported the slow return to full performance to Microsoft.
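
The intervals implied by the breakdown above can be worked out with a short script. This is only an illustrative sketch: the times are copied from the breakdown, while the short event labels and the helper names (minutes_between, elapsed_per_step) are assumptions made up for this example and are not part of any Templafy or Azure tooling.

```python
from datetime import datetime

# Event times in UTC on January 27th 2020, copied from the breakdown above.
# The short labels are paraphrases added for this sketch.
TIMELINE = [
    ("05:00", "additional servers spun up"),
    ("05:02", "additional servers running but not serving requests"),
    ("05:13", "initial servers responding slowly under load"),
    ("05:28", "automated alert, investigation started"),
    ("05:38", "all servers restarted"),
    ("05:48", "traffic started to be served"),
    ("06:15", "performance back to full"),
]


def minutes_between(start, end):
    """Whole minutes from start to end, both given as "HH:MM" on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_per_step(timeline):
    """Pair each event after the first with the minutes since the previous event."""
    return [
        (label, minutes_between(prev_time, time))
        for (prev_time, _), (time, label) in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for label, minutes in elapsed_per_step(TIMELINE):
        print(f"+{minutes:>2} min  {label}")
    print("fault to alert:", minutes_between("05:02", "05:28"), "min")
    print("alert to full performance:", minutes_between("05:28", "06:15"), "min")
```

With the times listed above, this gives 26 minutes between the additional servers failing to serve requests and the automated alert, and 47 minutes from the alert to full performance.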

Resolution:
A restart of all servers.

Root cause of incident:

All internal investigations so far point to problems on the Microsoft Azure platform, and we are in contact with Microsoft to further investigate and understand the two core problems found on their side:

  • Additional servers automatically added to workload did not serve requests
  • After restart, servers were slow at returning to full performance
Posted Jan 27, 2020 - 07:49 CET

Resolved
All systems are up. We are still investigating the problem and are in contact with Microsoft. The root cause is still unknown.
Posted Jan 27, 2020 - 07:21 CET
Update
We have confirmed that our website is down. Our operations team and engineers are investigating the cause.
Posted Jan 27, 2020 - 06:59 CET
Investigating
Our automated system has detected possible performance issues with one or more of our systems. We have acknowledged it and are looking into it.
Posted Jan 27, 2020 - 06:32 CET
This incident affected: WebApp.