Career Site Outage
Incident Report for Symphony Talent
Postmortem

What happened?

An automated proxy server configuration update failed to complete, leaving an incomplete IP list in the backend of the career website hosting platform. Because legitimate source IPs could not be validated against the incomplete list, a security process began incorrectly flagging legitimate admin users as spam users. Any content or sites created by the flagged users were then set to inactive, effectively taking those sites down for the duration of the incident. Some of the affected admin users were core senior staff who had set up customer sites, so customer sites were impacted.
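The failure mode described above can be illustrated with a minimal sketch. The function name `is_trusted_source` and the sample networks (reserved documentation ranges) are hypothetical and do not reflect the platform's actual code or IP data; the sketch only shows how a truncated list turns a valid source IP into an apparent mismatch.

```python
import ipaddress

def is_trusted_source(ip, allowlist):
    """Return True if ip falls inside any network in allowlist."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in allowlist)

# Hypothetical data: the complete list, and what a partial update leaves behind.
full_list = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
partial_list = full_list[:1]  # simulates an update that stopped part-way

admin_ip = "203.0.113.7"  # a legitimate admin's source address (assumed)

trusted_before = is_trusted_source(admin_ip, full_list)
trusted_after = is_trusted_source(admin_ip, partial_list)
```

With the full list the admin's address validates; with the partial list the same address fails validation, which is the condition under which a downstream security process would treat the user as suspect.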

What was the impact?

The security process inactivated content and sites associated with the flagged users, resulting in the unplanned inactivation of 15 customer career sites.

Resolution

Automated monitoring began alerting operations staff as soon as sites started being inactivated. Staff identified the root cause, reactivated the affected sites, corrected the IP configuration, and restored the admin users who had been marked as spam.

Incident Timeline

10/29/2018 - 12:45pm US EDT - Initial config update failed.

10/29/2018 - 12:59pm US EDT - Security processes began inactivating sites and monitoring alerts began triggering.

10/29/2018 - 1:25pm US EDT - Root cause identified and remediation began.

10/29/2018 - 1:37pm US EDT - Affected sites reactivated and caching tier invalidations began.

10/29/2018 - 2:19pm US EDT - All caching tiers verified as refreshed.
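For readers who want the elapsed times between these events, the short sketch below computes them from the timestamps listed above. The event labels are shorthand introduced here for illustration; all times are US EDT on the same day, converted to 24-hour form.

```python
from datetime import datetime

# Timeline entries from this report, in 24-hour time (US EDT, 10/29/2018).
times = {
    "config_update_failed": datetime(2018, 10, 29, 12, 45),
    "inactivation_started": datetime(2018, 10, 29, 12, 59),
    "root_cause_identified": datetime(2018, 10, 29, 13, 25),
    "sites_reactivated": datetime(2018, 10, 29, 13, 37),
    "caches_verified": datetime(2018, 10, 29, 14, 19),
}

def minutes_between(start, end):
    """Whole minutes elapsed between two named timeline events."""
    return int((times[end] - times[start]).total_seconds() // 60)

detect = minutes_between("config_update_failed", "inactivation_started")   # 14
restore = minutes_between("inactivation_started", "sites_reactivated")     # 38
total = minutes_between("config_update_failed", "caches_verified")         # 94
```

That is 14 minutes from the failed update to the first inactivations and alerts, 38 minutes from the first inactivations to site reactivation, and 94 minutes from the failed update to all caching tiers being verified as refreshed.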

What products / customers were impacted?

Career Websites (X-Cloud Candidate) - Approximately 15 customer career sites

Corrective Actions Undertaken to Prevent Recurrence

The root cause was a failure of the automated proxy server configuration update process. That process has been disabled until a thorough root cause analysis (RCA) can determine why the update ran only partially, and until a fix has been applied and appropriately tested.

Additional quality and verification checks are being considered for the proxy IP update process.
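One possible shape for such a verification step is to validate the new list in full and only then swap it into place atomically, so that a partial list is never visible to consumers. The sketch below is an assumption about how this could be done, not a description of the platform's tooling; `validate_ip_list`, `publish_ip_list`, the `min_entries` threshold, and the one-network-per-line file format are all hypothetical.

```python
import ipaddress
import os
import tempfile

def validate_ip_list(entries, min_entries):
    """Parse every entry as a network; raise ValueError if any is malformed
    or if the list is shorter than the expected minimum."""
    networks = [ipaddress.ip_network(e.strip()) for e in entries if e.strip()]
    if len(networks) < min_entries:
        raise ValueError(
            f"expected at least {min_entries} entries, got {len(networks)}"
        )
    return networks

def publish_ip_list(entries, target_path, min_entries):
    """Validate first, then replace the live file in a single atomic rename."""
    networks = validate_ip_list(entries, min_entries)  # fails before any write
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(str(n) for n in networks) + "\n")
        os.replace(tmp_path, target_path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave a stray temp file behind
        raise
```

Under this design, an update that produces too few entries or a malformed entry raises an error and leaves the previously published list untouched.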

Security tooling is being reviewed to determine whether site/content inactivation can be decoupled from user inactivation workflows.
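As a rough illustration of what such decoupling might look like, the sketch below suspends a flagged user but leaves that user's sites live and queues them for human review. The `Site` and `Platform` classes and the `flag_user` method are hypothetical names introduced for this example; the actual security tooling and its data model are not described in this report.

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    owner: str
    active: bool = True

@dataclass
class Platform:
    sites: list
    suspended_users: set = field(default_factory=set)
    review_queue: list = field(default_factory=list)

    def flag_user(self, user):
        """Suspend the user only; their sites stay active and are queued
        for manual review instead of being inactivated automatically."""
        self.suspended_users.add(user)
        for site in self.sites:
            if site.owner == user:
                self.review_queue.append(site.name)

# Hypothetical usage: flagging one admin does not take any site offline.
platform = Platform(sites=[Site("careers-a", "alice"), Site("careers-b", "bob")])
platform.flag_user("alice")
```

The trade-off is that genuinely malicious content would stay online until reviewed, so a real design would need to weigh that risk against the cost of false positives like the ones in this incident.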

Posted Oct 29, 2018 - 22:26 UTC

Resolved
15 career sites were impacted by a failed proxy configuration update process on 10/29/18. Full details are available in the postmortem report.
Posted Oct 29, 2018 - 17:37 UTC