An elasticsearch node exceeded all available resources and failed. This resulted in a cluster failure, which resulted in the unavailability and eventual loss of data indexes used to present Analytics data in the X-Cloud platform.
During the incident time frame, users saw either static data or no data in the Analytics tab within X-Cloud Recruiter.
The affected elasticsearch cluster was brought back to a healthy state, and then indexes were rebuilt.
01/14/2019 - 05:14pm US EDT - Initial failures detected.
01/14/2019 - 08:33pm US EDT - Investigation and recovery efforts determined that the cluster would require a node rebuild and extension. Rebuild initiated
01/14/2019 - 11:50pm US EDT - Elasticsearch node recovered and resources increased. Index rebuilding initiated.
01/15/2019 - 09:17am US EDT - Data index rebuilding completed for 98% of customers.
01/15/2019 - 01:05pm US EDT - Data index rebuilding completed for remaining customers.
X-Cloud Analytics function in X-Cloud Recruiter
Immediate Recovery - Elasticsearch cluster rebuilt.
Long Term Fixes
1) Elasticsearch cluster resources dramatically increased.
2) Additional proactive monitoring added to elasticsearch cluster to preemptively identify failure signatures experienced.