OVERVIEW
The platform experienced a significant outage on 3 September 2024, triggered by an upgrade to some microservices. The incident began with intermittent errors on the static data service and then escalated to a complete platform-wide outage lasting approximately 20 minutes.
ROOT CAUSE
The root cause of the incident was an issue with a microservice upgrade. Due to an error, the previous version was deleted before the upgrade process had fully completed and before all services had successfully migrated to the newer version.
Some services were still running the old version, as they had not been restarted to pick up the new version. Because the old version had been deleted, these services could no longer communicate with each other.
In summary, the root cause was a gap in the upgrade process: the old version was removed before all services had fully migrated to the new version. This led to a cascading failure that impacted a wide range of services and caused a significant platform outage.
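The ordering condition that was violated can be illustrated with a minimal sketch. It is not the platform's actual tooling: the ServiceStatus type, the service names, and the version strings are all hypothetical, and the sketch assumes that each service can report the version it is currently running. The idea is only that the old version is removed once no service still depends on it.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    name: str             # hypothetical service name
    running_version: str  # version the running process has actually loaded


def services_blocking_removal(statuses, old_version):
    """Return the names of services that are still running old_version."""
    return [s.name for s in statuses if s.running_version == old_version]


def safe_to_remove(statuses, old_version):
    """The old version may be deleted only when no service still runs it."""
    return not services_blocking_removal(statuses, old_version)


# Illustrative data only; these names and versions are invented.
statuses = [
    ServiceStatus("service-a", "v2"),
    ServiceStatus("service-b", "v1"),  # not yet restarted, still on the old version
    ServiceStatus("service-c", "v2"),
]

blocking = services_blocking_removal(statuses, "v1")
if blocking:
    print("Do not remove v1; still in use by:", ", ".join(blocking))
else:
    print("v1 can be removed")
```

With a check of this kind in place, the deletion step would be held back until every service had been restarted onto the new version.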
REMEDIATION
Restarting all impacted services allowed them to pick up the new version, which resolved the incident.
A process review has been conducted, and improvements are underway to simplify the upgrade process.