Downtime
Incident Report for Templafy
Postmortem

On February 17th 2020, Templafy experienced downtime.

Templafy had intermittent access problems that affected 2-3% of requests to the Templafy platform.

The note below explains what happened in more detail.

Summary of the incident on February 17th 2020

During the incident, some users would receive an error message when attempting to access the solution. The error messages shown included error codes 502, 503 and 403 as well as “The service is unavailable”.

  • At 7:56 am CET, we started investigating after receiving reports that Templafy was unavailable.
  • Internal troubleshooting steps were carried out to no avail.
  • To continue the investigation, we escalated the issue to Microsoft.
  • During a lengthy joint investigation with the Microsoft Azure support team, we identified the problem as DNS-related; it caused 2-3% of all requests to fail.
  • At 10:06 am CET, we updated our DNS.
  • At 11:06 am CET, the DNS change was fully propagated.
  • We continued to monitor the situation.
  • At 11:26 am CET, we resolved the incident on the status page.

Root cause

The root cause proved to be related to the scheduled maintenance on Saturday, February 15th, during which new IP addresses were added to our app service.

As part of our maintenance, we upgraded our service plan and added a CNAME DNS record, as recommended by Microsoft, to mitigate performance instability caused by a bug in Microsoft Azure’s courtesy warm-up when scaling.

With this upgrade, our inbound and outbound IP addresses changed automatically, and the IP address configured in our DNS failed to respond to 2-3% of all requests.
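To illustrate the kind of mismatch described here, the following is a minimal sketch (not Templafy's actual tooling) that compares the addresses DNS publishes for a hostname with the inbound addresses an app service is expected to use. The function names, the hostname, and the IP addresses are all hypothetical; the addresses come from the documentation-only range 203.0.113.0/24.

```python
import socket


def resolve_ipv4(hostname: str) -> set[str]:
    """Return the IPv4 addresses that DNS currently reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def find_stale_addresses(resolved: set[str], expected: set[str]) -> set[str]:
    """Addresses published in DNS that are not in the expected inbound set."""
    return resolved - expected


def check_hostname(hostname: str, expected: set[str]) -> set[str]:
    """Resolve hostname and return any address that is no longer expected."""
    return find_stale_addresses(resolve_ipv4(hostname), expected)
```

Calling `check_hostname("app.example.com", {"203.0.113.20"})` after a plan change would return an empty set if DNS only publishes the expected address, and the leftover addresses otherwise. The expected set is assumed to be supplied by hand or from the hosting platform's own listing.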

Our availability monitors didn’t report the failed requests after the scheduled maintenance on Saturday, February 15th, because of the CNAME DNS record newly added on Microsoft’s recommendation.
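The report does not describe how the availability monitors work. As a general illustration only, a failure rate of 2-3% means a single probe succeeds most of the time, so one way to surface such failures is to aggregate many probe results and alert on the failure fraction. The sketch below assumes a hypothetical list of HTTP status codes collected by a prober; the function names and the 1% threshold are illustrative choices, not values from the incident.

```python
def failure_rate(status_codes: list[int]) -> float:
    """Fraction of probe responses counted as failures.

    Counts 5xx responses and 403, matching the error codes users
    reported seeing during the incident.
    """
    if not status_codes:
        return 0.0
    failed = sum(1 for code in status_codes if code >= 500 or code == 403)
    return failed / len(status_codes)


def should_alert(status_codes: list[int], threshold: float = 0.01) -> bool:
    """Alert when the failure fraction reaches the (assumed) threshold."""
    return failure_rate(status_codes) >= threshold
```

With 100 probes of which three return 502, 503 and 403, the failure rate is 3% and `should_alert` returns `True`, whereas a monitor that looks at one probe at a time would usually see a success.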

Resolution

To solve the issue, we downgraded and then upgraded our Azure service plan so that new IP addresses were automatically allocated to our app service. This allocation took 20 minutes to be reflected in Azure, after which we updated our DNS. The DNS change took one hour to fully propagate to all users.

Posted Feb 17, 2020 - 17:13 CET

Resolved
We have resolved the issue and will update the status page with a postmortem soon.
Posted Feb 17, 2020 - 11:26 CET
Update
We are still investigating the issue with Templafy being unavailable, and the issue has been escalated to Microsoft.
Some users may see error messages, either a 502 or 503 error or "Service is unavailable". The issue is currently intermittent, and we are investigating the root cause.
Posted Feb 17, 2020 - 09:29 CET
Update
We are continuing to investigate this issue.
Posted Feb 17, 2020 - 08:02 CET
Investigating
We are currently investigating an issue with Templafy being sporadically unavailable, which means that some users may see an error page when trying to access Templafy. We will update this status page with more information soon.
Posted Feb 17, 2020 - 08:02 CET
This incident affected: WebApp.