Overview
On 29/03/2021, a large number of payments attempted via all payment transfer channels could not be released.
Timeline
29/03/2021
08:00 UTC - Currencycloud detected that payments were not being released as expected when customers had the required balance.
08:15 UTC - Currencycloud identified that payments were not being marked releasable by the system.
08:35 UTC - Currencycloud identified that the number of consumers on one of our message brokers had dropped to zero. This caused messages between the services to queue up.
08:48 UTC - Due to the queue of messages, a number of SWIFT payments in the following currencies missed the cut-off time to be paid that day:
BGN, HRK, CZK, HUF, MXN, RON, PLN, TRY
Additionally, CNY, THB, UGX missed the cut-off time for a next-day payment.
Regular payments via local transfer channels were also impacted, but did not miss any cut-off times.
09:00 UTC - An application restart corrected the issue; all payments were then correctly marked as releasable and processed.
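
The condition identified at 08:35 UTC (a queue with no consumers while messages build up) is the kind of signal an automated check can surface early. The following is a minimal sketch only, not a description of Currencycloud's monitoring. The QueueStats type, the stalled_queues function and the queue names are all hypothetical, and the statistics are assumed to come from whatever monitoring interface the broker provides.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QueueStats:
    """Hypothetical snapshot of one queue, as reported by a broker's monitoring interface."""
    name: str
    consumers: int  # consumers currently attached to the queue
    backlog: int    # messages waiting to be delivered


def stalled_queues(stats: Iterable[QueueStats], min_consumers: int = 1) -> List[str]:
    """Return the names of queues that hold messages but have too few consumers."""
    return [
        s.name
        for s in stats
        if s.consumers < min_consumers and s.backlog > 0
    ]


if __name__ == "__main__":
    # Illustrative values only; queue names are assumptions.
    snapshot = [
        QueueStats("payments.releasable", consumers=0, backlog=1200),
        QueueStats("payments.audit", consumers=2, backlog=5),
    ]
    for name in stalled_queues(snapshot):
        print(f"ALERT: queue {name} has messages queued but too few consumers")
```

Requiring a non-empty backlog as well as a low consumer count avoids alerting on queues that are idle by design.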
Resolution
A restart of the application restored the missing consumers on the messaging service.
Root Cause Analysis
A network connectivity issue caused a node on the messaging service to disconnect from the cluster, resulting in a split-brain situation. Action was taken to remove the node from the cluster. At that time, several services were still connected to the split node; when it was removed from the target group, their connections were drained. An application restart was required for those services to reconnect to the new node.
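
The restart was needed because the affected services did not re-establish their connections on their own after being drained. One common way to avoid a manual restart is for the client to retry against the remaining nodes with a backoff. The sketch below illustrates that general technique under stated assumptions and is not the remediation Currencycloud adopted. The connect_with_failover function and its parameters are hypothetical, and connect stands in for whatever call the broker's client library uses to open a connection.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(
    hosts: Sequence[str],
    connect: Callable[[str], T],
    max_rounds: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each host in turn; back off exponentially between full rounds."""
    if not hosts:
        raise ValueError("hosts must not be empty")
    last_error = None
    for round_no in range(max_rounds):
        for host in hosts:
            try:
                # `connect` is a placeholder for the real client call (assumption).
                return connect(host)
            except ConnectionError as exc:  # node unreachable or removed
                last_error = exc
        if round_no < max_rounds - 1:
            # Delay doubles each round: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** round_no))
    raise ConnectionError(
        f"no broker node reachable after {max_rounds} rounds"
    ) from last_error
```

A service would call this whenever it detects that its connection has dropped, not only at startup, so that consumers return without operator intervention once a healthy node is reachable.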
Remediation Items