DIRECT and API authenticate issues - Outage
Incident Report for Currencycloud
Postmortem

Overview

On 31/03/2021, a failure in the authentication API impacted customers authenticating via the API and Currencycloud Direct.

Timeline

31/03/2021

11:56 UTC - Currencycloud monitoring detected a spike in errors on the /authenticate/api endpoint causing some requests to that endpoint to fail.

12:00 UTC - Currencycloud identified 5XX errors being returned on the authentication endpoint.

12:13 UTC - Currencycloud investigated a large number of requests to the /authenticate/api endpoint in a very short period of time, which had caused a row-level lock on the database. Rate limiting was introduced to reduce the impact on that endpoint.

12:48 UTC - Currencycloud attempted to terminate processes to free up database table locks.

12:52 UTC - All requests to /authenticate/api continued to fail after the previous attempt.

13:03 UTC - A rolling restart of api-v2 was attempted to clear the database locks.

13:33 UTC - Service restored.

13:40 UTC - Rate limiting mitigations removed.

15:40 UTC - Currencycloud performed an emergency code change to prevent the issue from recurring.

Resolution

A rolling restart of api-v2 cleared the database locks, and a code change was then deployed to prevent the issue from recurring.

Root Cause Analysis

A sudden increase in requests to the authentication API caused a row-level lock in the corresponding database table. Multiple authentication requests timed out and failed, causing a snowball effect of further authentication attempts. This exhausted the available thread pool, extending the problem to other customers.
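The snowball effect described above is a common pattern: when failed requests are retried immediately, the retries add load to a service that is already struggling. One widely used client-side mitigation is capped exponential backoff with jitter. The following is a minimal Python sketch of that technique, not the code change Currencycloud deployed (the report does not describe it). The names `call_with_backoff` and `backoff_delays`, the `authenticate` callable, and the default delays are all hypothetical.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay between 0 and the ceiling so that
        # many clients failing at the same moment do not retry in lockstep.
        yield rng() * ceiling


def call_with_backoff(authenticate, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call `authenticate` until it succeeds or `max_attempts` is reached.

    `authenticate` is a hypothetical zero-argument callable that raises an
    exception on failure. A real client would retry only on timeouts or
    5XX responses, not on every exception.
    """
    last_error = None
    delays = backoff_delays(max_attempts - 1, base, cap, rng)
    for _ in range(max_attempts):
        try:
            return authenticate()
        except Exception as exc:  # assumption: every failure is retryable
            last_error = exc
            delay = next(delays, None)
            if delay is None:
                break  # no retries left
            sleep(delay)
    raise last_error
```

Bounding the number of attempts matters as much as spacing them out: an unbounded retry loop keeps adding requests for as long as the outage lasts.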

Remediation Items

  • Implement monitoring on DML latency in DEMO and PROD - COMPLETED.
  • Implement a code change to prevent database locks on authentication requests - COMPLETED.
  • Review and implement rate limiting on the /authenticate/api endpoint - COMPLETED.
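
The report does not say how rate limiting on the endpoint was implemented. As an illustration only, a token bucket is one common approach: each caller may burst up to a fixed number of requests, and capacity refills at a steady rate. The Python sketch below is hypothetical, including the class name `TokenBucket` and its parameters.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True and consume one token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a sketch like this, a server would keep one bucket per client (for example, per API key) and answer a rejected request with HTTP 429 rather than letting it reach the database.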
Posted Apr 07, 2021 - 13:45 UTC

Resolved
Service has remained stable but the teams are continuing to monitor
Posted Mar 31, 2021 - 15:21 UTC
Monitoring
We have restored service but are going to monitor the system closely
Posted Mar 31, 2021 - 13:24 UTC
Update
Service looks to have failed - urgent action is being taken to resolve
Posted Mar 31, 2021 - 13:06 UTC
Investigating
We are currently investigating an issue with /authenticate/api

This will impact some customers authenticating on DIRECT and via the API

Teams are working to restore the services as quickly as possible
Posted Mar 31, 2021 - 12:56 UTC
This incident affected: API and Paydirect.io / Direct.