Smartpost job distribution issue identified
Incident Report for Symphony Talent
Postmortem

Smartpost Job Distribution Incident

What happened?

Symphony Talent experienced a significant service disruption with the legacy data center provider where the Smartpost Job Distribution infrastructure was hosted. The service provider executed a service termination request on the wrong date. This caused a complete disruption of all Smartpost functions.

Due to historical service quality incidents with this provider, Symphony Talent was already in the final phase of a project to move Smartpost to our current strategic data center provider (AWS). The planned migration date was 1/19/18, and service components had already cleared quality tests ahead of that move. Real-time data replication was also already in place, keeping the new environment in sync.

When the service provider disruption occurred, the Smartpost Engineering Team began triaging and investigating the root cause. They quickly found that the issue was due to the early termination of services still in use. Communication with the legacy service provider revealed that restoring service in their facility would take longer than activating the new infrastructure in AWS. The business decision was made to focus on the immediate activation of services on the new infrastructure within AWS.

The root cause was the termination of services by the legacy service provider ahead of the scheduled activation of services within the new service provider (AWS).

What was the impact?

-- Career website clients - existing jobs continued to display normally throughout the Smartpost incident, but any job updates (new jobs, changes, closed jobs) were delayed.

-- Job distribution/SmartPost clients - the Smartpost Web Interface was down and inaccessible. All job and media distribution functions were unavailable.

-- Programmatic distribution/M-Cloud clients - job updates were delayed until this issue was remediated.

Resolution

Once the root cause was traced to the legacy service provider and the provider's recovery time was estimated to be significant, services were activated and brought online within the new AWS environment.

Incident Timeline

01/08/18 2:22pm US EST - Alerting of service interruption began.

01/08/18 2:35pm US EST - Major Incident team assembled and investigating.

01/08/18 2:39pm US EST - Legacy vendor communications initiated to identify remediation options & ST team began investigating alternate recovery options.

01/08/18 2:45pm US EST - Client & stakeholder notifications initiated.

01/08/18 6:45pm US EST - Vendor still unable to provide a restoration timeline; ST leadership made the decision to execute the backup plan and activate services in AWS.

01/08/18 6:50pm US EST - Recovery actions initiated to start bringing new environment online.

01/08/18 6:50pm - 01/09/18 6:00pm US EST - Disaster Recovery plans executed to restore services within AWS.

01/09/18 4:30am US EST - Core Smartpost and job distribution functionality restored.

01/09/18 4:30am - 6:00pm US EST - ST Engineering team monitoring system health and executing batch jobs to get data feeds caught up.

01/09/18 6:00pm US EST - Backlogged batch jobs caught up and all processed data current.
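
The elapsed times implied by this timeline can be worked out directly from the timestamps above. The following is a minimal sketch, not part of the original report: the event names (alert, aws_decision, core_restored, backlog_cleared) and the elapsed helper are hypothetical labels introduced here for illustration, and all times are treated as being in the same time zone (US EST), as the timeline states.

```python
from datetime import datetime

# Timestamp layout used in the timeline, e.g. "01/08/18 2:22pm".
FMT = "%m/%d/%y %I:%M%p"

# Key timeline entries; the dictionary keys are hypothetical labels.
timeline = {
    "alert": "01/08/18 2:22pm",            # alerting of service interruption began
    "aws_decision": "01/08/18 6:45pm",     # decision to activate services in AWS
    "core_restored": "01/09/18 4:30am",    # core functionality restored
    "backlog_cleared": "01/09/18 6:00pm",  # backlogged batch jobs caught up
}

events = {name: datetime.strptime(ts, FMT) for name, ts in timeline.items()}


def elapsed(start: str, end: str) -> str:
    """Return the time between two timeline events as 'Hh MMm'."""
    minutes = int((events[end] - events[start]).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


print("Alert to AWS decision:    ", elapsed("alert", "aws_decision"))
print("Alert to core restoration:", elapsed("alert", "core_restored"))
print("Alert to backlog cleared: ", elapsed("alert", "backlog_cleared"))
```

Under these assumptions, core functionality was restored roughly 14 hours after the first alert, and data was fully current a little under 28 hours after it.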

What products / customers were impacted?

Smartpost M-Cloud Job Distribution to Career Websites and External Media.

Corrective Actions Undertaken to Prevent Recurrence

Symphony Talent determined that our legacy vendor was not a good fit for us and has spent 18 months moving all services away from them due to issues such as this one. The Smartpost migration was a very difficult and time-consuming project, so it was the last application to move. Our corrective action began over 18 months ago, when we committed to using only the top vendors to help us provide our services. Unfortunately, our legacy provider let us down again before we had a chance to proactively leave them.

Posted Jan 11, 2018 - 18:44 EST

Resolved
Job delivery services have been monitored and operating normally since midday Tuesday.

A postmortem report will be posted soon with background and findings on this event.
Posted Jan 11, 2018 - 11:15 EST
Monitoring
The job delivery backend is recovered and queued data is continuing to get caught up.

The Symphony Talent team is continuing to monitor the health of incoming job data files, outgoing file deliveries, and updates to jobs data on career websites to ensure data is all moving as expected.
Posted Jan 09, 2018 - 18:48 EST
Update
The Symphony Talent team is continuing to monitor and restore secondary systems. See below for current state and impacts across affected products:

Career Websites
• Job updates (new jobs, changes, closed jobs) to Career Sites have recovered but have not yet caught up for all sites. All sites are expected to catch up by the end of the day today.
 
M-Cloud
• Job and media distribution recovered. Job data is expected to catch up by the end of the day today.
 
SmartPost
• All web pages and user interfaces are recovered.
• Core backend services are recovered.
• The system is currently processing backlogged jobs.
• ST team is evaluating and monitoring connectivity with 3rd parties.
Posted Jan 09, 2018 - 11:06 EST
Update
The Symphony Talent team has restored key functionality and is continuing to bring up secondary systems. Jobs are catching up, but might still be delayed for some customers.
Posted Jan 09, 2018 - 05:02 EST
Update
The Symphony Talent team continues to work on resolving the issue. We will provide another update in approximately 2 hours unless the situation changes sooner.
Posted Jan 09, 2018 - 03:07 EST
Update
The Symphony Talent team is still working on the issue. We will provide another update in approximately 2 hours unless the situation changes sooner.
Posted Jan 09, 2018 - 01:14 EST
Update
The Symphony Talent team is continuing to work on resolving the issue with our data center provider. We will provide another update in approximately 2 hours unless the situation changes sooner.
Posted Jan 08, 2018 - 21:23 EST
Identified
The Symphony Talent team has identified the issue as being with our data center provider. We are working on a resolution and will communicate updates until the issue is resolved.
Posted Jan 08, 2018 - 19:03 EST
Update
The Symphony Talent team is continuing to troubleshoot the application issues with Job Distribution and will provide additional updates once available.
Posted Jan 08, 2018 - 16:44 EST
Investigating
We have identified an issue which is impacting the job distribution functions of the Smartpost platform.

Impacts:

-- Career website clients - posted jobs should continue to display normally, but any job updates (new jobs, changes, closed jobs) will not show up until this issue is resolved.


-- Job distribution/SmartPost clients - the Smartpost User Interface is currently down and inaccessible.

-- Programmatic distribution/M-Cloud clients - job updates will be delayed until this issue is remediated.


The Symphony Talent team is investigating the issue and will provide more details as soon as possible.
Posted Jan 08, 2018 - 15:01 EST
This incident affected: SFX - Career Websites (CWS), SFX - Job Distribution, and SFX - Programmatic Media.