Symphony Talent experienced a significant service disruption at the legacy data center provider that hosted the Smartpost Job Distribution infrastructure. The service provider executed a service termination request on the wrong date. This caused a complete disruption of all Smartpost functions.
Due to historical service quality incidents with this provider, Symphony Talent was already in the final phase of a project to move Smartpost to our current strategic data center provider (AWS). The planned migration date for this activity was 01/19/18, and service components had already cleared quality tests ahead of that move. Real-time data replication was also already in place, keeping the new environment in sync.
When the service provider disruption occurred, the Smartpost Engineering Team began triaging and investigating the root cause. They quickly found that the issue was due to the early termination of services still in use. Communication with the legacy service provider revealed that restoring service in their facility would take longer than activating the new infrastructure in AWS. The business decision was made to focus on the immediate activation of services on the new infrastructure within AWS.
The root cause was the termination of services by the legacy service provider ahead of the scheduled activation of services with the new service provider (AWS).
-- Career website clients - existing jobs continued to display normally throughout the Smartpost incident, but any job updates (new jobs, changes, closed jobs) were delayed.
-- Job distribution/Smartpost clients - the Smartpost Web Interface was down and inaccessible. All job and media distribution functions were unavailable.
-- Programmatic distribution/M-Cloud clients - job updates were delayed until this issue was remediated.
Once the root cause was traced to the legacy service provider and the recovery time there was estimated to be significant, services were activated and brought online within the new environment.
01/08/18 2:22pm US EST - Alerting of service interruption began.
01/08/18 2:35pm US EST - Major Incident team assembled and investigating.
01/08/18 2:39pm US EST - Legacy vendor communications initiated to identify remediation options; ST team began investigating alternate recovery options.
01/08/18 2:45pm US EST - Client & stakeholder notifications initiated.
01/08/18 6:45pm US EST - Vendor still unable to provide restoration timeline; ST leadership decided to execute the backup plan and activate services in AWS.
01/08/18 6:50pm US EST - Recovery actions initiated to start bringing new environment online.
01/08/18 6:50pm - 01/09/18 6:00pm US EST - Disaster Recovery plans executed to restore services within AWS.
01/09/18 4:30am US EST - Core Smartpost and job distribution functionality restored.
01/09/18 4:30am - 6:00pm US EST - ST Engineering team monitoring system health and executing batch jobs to get data feeds caught up.
01/09/18 6:00pm US EST - Backlogged batch jobs caught up; all processed data current.
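The elapsed times implied by the timeline above can be worked out with a short script. This is a minimal sketch for illustration only: the function and constant names are hypothetical, the timestamps are copied from the timeline entries, and all times are assumed to be in the same time zone (US EST), so no time zone conversion is applied.

```python
from datetime import datetime

# Timestamp layout used in the timeline, e.g. "01/08/18 2:22pm".
FMT = "%m/%d/%y %I:%M%p"

def elapsed(start: str, end: str) -> str:
    """Return the time between two timeline entries as 'Hh MMm'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"

# Milestones copied from the timeline (hypothetical constant names).
ALERT = "01/08/18 2:22pm"            # alerting of service interruption began
DECISION = "01/08/18 6:45pm"         # decision to activate services in AWS
RECOVERY_START = "01/08/18 6:50pm"   # recovery actions initiated
CORE_RESTORED = "01/09/18 4:30am"    # core functionality restored
DATA_CURRENT = "01/09/18 6:00pm"     # backlogged batch jobs caught up

print("Alert to AWS decision:     ", elapsed(ALERT, DECISION))
print("Recovery start to restored:", elapsed(RECOVERY_START, CORE_RESTORED))
print("Alert to core restored:    ", elapsed(ALERT, CORE_RESTORED))
print("Alert to data current:     ", elapsed(ALERT, DATA_CURRENT))
```

Under these assumptions, core functionality was restored about 14 hours after the first alert, and all data feeds were current about 27.5 hours after the first alert.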
Smartpost M-Cloud Job Distribution to Career Websites and External Media.
Symphony Talent determined that our legacy vendor was not a good fit for us and has spent 18 months moving all services away from them due to issues such as this one. The Smartpost migration was a very difficult and time-consuming project, so it was the last application to move. Our corrective action began over 18 months ago, when we committed to using only top vendors to help us provide our services. Unfortunately, our legacy provider let us down again before we had the chance to complete our planned move away from them.