Major platform outage
Incident Report for Currencycloud
Postmortem

Overview

On the 20th January 2023, Currencycloud clients experienced issues accessing the production environment.

Timeline

11:36 UTC - Our monitoring systems alerted us to a large number of errors across multiple endpoints  

11:39 UTC - Investigations pointed towards a recent deployment to an internal service

11:43 UTC - Database hit high CPU load

11:52 UTC - The change was rolled back to the previous version

12:10 UTC - Issue resolved

Resolution

The change that caused the issue was reverted.

Root Cause Analysis

A MySQL upgrade performed on the weekend of the 14th January 2023 degraded the performance of an SQL query used by internal services. As part of the upgrade, the behaviour of the query optimiser changed, which slowed this query and in turn caused performance issues on the platform. On the evening of Thursday 19th January 2023, a configuration change was made to revert this optimiser change, which stabilised the service temporarily.

On the morning of Friday 20th January 2023, our development team prepared a fix intended to resolve this issue permanently by ensuring that queries were optimised correctly. Once deployed, this change had unintended consequences that significantly degraded the performance of an internal service database, causing high CPU load and degradation of service. The change was rolled back to resolve the issue.

Remediation Items

  • Temporary additional sign-off process for changes to this particular service
  • Performance testing being introduced in the pipeline to reduce the risk of performance degradation before production deployment
  • SQL query optimisation
  • Introduction of timeouts for slow SQL queries
  • Investigation into the viability of using search technology instead of databases for specific endpoints
  • Longer-term work to rewrite this internal service, to improve performance and reduce the blast radius of future changes
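
As an illustration of the timeout remediation item above, the following is a minimal sketch of one possible approach, not a description of the change actually made. It assumes MySQL's `MAX_EXECUTION_TIME` optimizer hint, which applies only to read-only SELECT statements and takes a limit in milliseconds. The helper name `with_timeout_hint` and the `payments` table in the usage example are hypothetical.

```python
import re

# Hypothetical helper for illustration only; it is not taken from the report.
# Matches a statement that begins with SELECT, ignoring case and leading spaces.
_SELECT_RE = re.compile(r"^\s*SELECT\b", re.IGNORECASE)


def with_timeout_hint(sql: str, timeout_ms: int) -> str:
    """Return sql with a MySQL MAX_EXECUTION_TIME optimizer hint added.

    MySQL honours this hint only on read-only SELECT statements, so any
    other statement is returned unchanged. Statements that already carry
    an optimizer hint are not handled by this sketch.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    match = _SELECT_RE.match(sql)
    if match is None:
        return sql
    hint = f" /*+ MAX_EXECUTION_TIME({timeout_ms}) */"
    # Insert the hint directly after the SELECT keyword.
    return sql[:match.end()] + hint + sql[match.end():]


# Usage: cap a (hypothetical) lookup at two seconds of execution time.
query = with_timeout_hint(
    "SELECT id FROM payments WHERE status = 'pending'", 2000
)
```

A per-statement hint keeps the limit visible next to the query it protects. A session-wide or server-wide limit (MySQL's `max_execution_time` system variable) is an alternative when every read query should share the same ceiling.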
Posted Jan 30, 2023 - 17:08 UTC

Resolved
The team is confident the system will remain stable. Resolving incident.

Downtime 11:35 - 12:12
Posted Jan 20, 2023 - 16:35 UTC
Monitoring
Service has been restored, but teams are continuing to monitor for any further issues.
Posted Jan 20, 2023 - 12:17 UTC
Update
All teams are on a major incident bridge to resolve the issue as quickly as possible.
Posted Jan 20, 2023 - 12:09 UTC
Investigating
We are investigating an issue that is impacting the platform. Teams are working to resolve it.
Posted Jan 20, 2023 - 11:50 UTC
This incident affected: API, Payments, Conversions, Paydirect.io / Direct, Notifications, Balances, and Other.