The platform remained available, but a messaging issue between internal services caused failures when creating conversions, creating payments or performing transfers. The platform's internal message broker service failed following a change to adjust its instance sizes. This prompted a manual shutdown of the conversions/create API to limit the impact of conversions being created inconsistently or with delays.
04/08/2020
16:33 UTC - Currencycloud alerting identified failures on the conversions/create and payments/create API endpoints
16:45 UTC - Investigations identified that some conversions and payments were failing and some conversions and transfers were not completing
16:57 UTC - The platform message broker service was observed to be erroring so the conversions/create endpoint was blocked to stop further issues at source
17:15 UTC - Deeper investigation identified that the platform’s internal messaging service was not configured correctly: it had degraded over the course of a scheduled change and eventually ceased functioning
17:22 UTC - One instance of the messaging service was restarted and messaging began to recover
17:24 UTC - The conversions/create endpoint was unblocked
17:33 UTC - Issues no longer being seen on payments but some conversions still failing
17:38 UTC - The conversions/create endpoint was blocked once more
17:54 UTC - All other message broker instances were restarted to ensure correct configuration
17:55 UTC - No more errors observed, system stabilised and monitored
18:05 UTC - The conversions/create endpoint was unblocked
18:25 UTC - Incident confirmed as resolved after a further period of monitoring
Resolution
Recreating the erroring message broker instances to remove the misconfiguration resolved the issue
Root Cause Analysis
A routine change to our message broker service caused instances to restart in a corrupted state. The change followed a documented process which, as designed, terminated each instance and reintroduced it in turn with the new desired configuration. However, the new service instances failed to sync correctly when reintroduced. The conversions/create endpoint was disabled to contain the impact whilst the issue was resolved.
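As an illustration only, the one-at-a-time replacement described above can be sketched with a sync check gating each step, so that a replacement which fails to sync halts the rollout before the remaining instances are touched. This is not Currencycloud's actual procedure or tooling: `rolling_replace`, `RolloutHalted`, and the `replace` and `is_synced` callbacks are hypothetical names, and how an instance is replaced or checked for sync depends on the broker in use.

```python
import time
from typing import Callable, List


class RolloutHalted(Exception):
    """Raised when a replacement instance does not report as synced in time."""


def rolling_replace(
    instance_ids: List[str],
    replace: Callable[[str], str],      # hypothetical: terminates an instance, returns the new ID
    is_synced: Callable[[str], bool],   # hypothetical: True once the new instance has synced
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
) -> List[str]:
    """Replace instances one at a time, halting on the first sync failure.

    Returns the IDs of the replacement instances that synced successfully.
    """
    replaced: List[str] = []
    for old_id in instance_ids:
        new_id = replace(old_id)
        deadline = time.monotonic() + timeout_s
        while not is_synced(new_id):
            if time.monotonic() >= deadline:
                # Stop here so the remaining original instances are left alone.
                untouched = len(instance_ids) - len(replaced) - 1
                raise RolloutHalted(
                    f"{new_id} did not sync within {timeout_s}s; "
                    f"{untouched} instance(s) left untouched"
                )
            time.sleep(poll_s)
        replaced.append(new_id)
    return replaced
```

The design choice in this sketch is that the rollout stops at the first instance that fails to sync, rather than continuing through the remaining instances.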
Remediation Items
• Review and implement changes to our conversions flow so that a message broker outage no longer requires blocking the conversions/create endpoint
• Review timing of routine changes to this critical platform component to ensure any impact from issues encountered is minimal