Last night at approximately 8:43 PM EST we received notification from our various monitoring systems that a large number of servers were down in our NY1 region. We immediately began troubleshooting and confirmed that network connectivity was unavailable. We then began accessing equipment and found that most of it was back online about a minute later. However, after reviewing the logs, we saw that the uptime on each device we accessed was now measured in minutes instead of days, indicating that it had been rebooted.
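As a point of reference, this is the kind of signal a reboot leaves behind: uptime that was previously measured in days suddenly resets. Below is a minimal sketch of that sort of check, assuming SSH access to each host; the hostnames and the one-hour threshold are hypothetical and are not our actual inventory or tooling.

```python
#!/usr/bin/env python3
"""Flag hosts whose uptime suggests a recent reboot.

Illustrative sketch only: hostnames and the one-hour threshold are
hypothetical, not our actual inventory or tooling.
"""
import subprocess

HOSTS = ["hv-ny1-01.example.com", "hv-ny1-02.example.com"]  # hypothetical inventory
REBOOT_THRESHOLD_SECONDS = 3600  # anything under an hour is suspicious

def uptime_seconds(host: str) -> float:
    # The first field of /proc/uptime is seconds since boot.
    out = subprocess.check_output(
        ["ssh", host, "cat", "/proc/uptime"], text=True, timeout=10
    )
    return float(out.split()[0])

for host in HOSTS:
    try:
        up = uptime_seconds(host)
    except Exception as exc:
        print(f"{host}: unreachable ({exc})")
        continue
    if up < REBOOT_THRESHOLD_SECONDS:
        print(f"{host}: uptime {up:.0f}s -- likely rebooted")
    else:
        print(f"{host}: uptime {up / 86400:.1f} days")
```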
We then immediately contacted the Equinix datacenter to get more information. We suspected a power failure and needed to confirm whether it was limited to our equipment or something larger. On the phone we received confirmation that there was a large power failure at the facility.
All of our staff were on high alert at this point, as we had to review each piece of hardware that had been rebooted, and we dispatched people to the datacenter as well. Lev, our Director of Datacenter Operations, is currently in Amsterdam opening up our latest facility, so instead Anthony and Moisey were on site within the hour.
An hour later we received official confirmation from Equinix via email that there was in fact a power failure incident at the facility.
INCIDENT SUMMARY: UPS 7 Failure
INCIDENT DESCRIPTION:
Equinix IBX Site Engineer reports that UPS 7 failed causing disruptions to customer equipment. UPS 7 is back online. Engineers are currently investigating the issue.
Next update will be when a significant change to the situation occurs.
Information coming directly from Equinix was limited; however, with our engineers onsite, we also had a chance to discuss the power issue with other customers of the datacenter and gather more information.
Informally, what we suspect is that UPS7 was responsible for conditioning the dirty power that comes in from the public grid into stable power, which is then distributed throughout the datacenter. There was in fact a hardware failure of UPS7, which should have triggered an automatic switchover to a redundant UPS (which they do have on site), but that switchover failed to occur. It is very likely that there is more than one UPS handling inbound power, as only about half of the datacenter experienced a failure.
When the redundancy failed and another UPS did not take over, power to the affected equipment was effectively cut. UPS7 then hard rebooted and came back online, which restored the flow of power to equipment; however, there was an interruption of several minutes in between.
While we were on site, we did see power engineers arrive at the facility about 3-4 hours later to investigate what caused the initial failure of UPS7 and why the redundant power switching systems did not operate as they were supposed to.
Losing power to hypervisors is the worst-case scenario, because an immediate interruption in power doesn't allow the disk drives to flush their caches, increasing the likelihood of filesystem corruption. We began troubleshooting every single hypervisor to ensure that it had booted up successfully, and we found several systems that needed manual intervention.
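To give a sense of that sweep, here is a minimal sketch of walking an inventory and flagging hosts that need hands-on attention. It assumes SSH access and a libvirt-based stack (`virsh`), which is purely an assumption for illustration; the hostnames are hypothetical as well.

```python
#!/usr/bin/env python3
"""Walk a list of hypervisors and flag any that need manual attention.

Illustrative sketch only: assumes SSH access and a libvirt-based stack
(`virsh`), which is an assumption, and the hostnames are hypothetical.
"""
import subprocess

HYPERVISORS = ["hv-ny1-01.example.com", "hv-ny1-02.example.com"]  # hypothetical

def run(host: str, cmd: str) -> subprocess.CompletedProcess:
    return subprocess.run(["ssh", host, cmd], capture_output=True, text=True, timeout=15)

needs_attention = []
for host in HYPERVISORS:
    # Is the box even reachable after the power event?
    if run(host, "true").returncode != 0:
        needs_attention.append((host, "unreachable"))
        continue
    # Did the virtualization layer come back cleanly?
    if run(host, "virsh list --all").returncode != 0:
        needs_attention.append((host, "virsh not responding"))

for host, reason in needs_attention:
    print(f"{host}: {reason} -- escalate for manual intervention")
```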
We did not need to recover or rebuild any RAIDs during the process. Instead, some systems failed to boot, reporting that they could not find a RAID configuration, but we suspected that was related to the way they lost power. We powered off those systems and removed the power cords to ensure that everything would reset correctly, then reseated the physical SSD drives and powered the systems back on.
Given that the network was also affected, we had to ensure that all of the top-of-rack switches converged back onto the network successfully. Here we observed three switches that needed manual intervention to get them back on both cores. One of the switches also had a 10GbE GBIC fail, which we replaced with a spare. After that was completed, the network layer was back in full operation.
Once we had all of the physical hypervisors back online, we proceeded to power on all of the virtual machines that resided on those systems. We wanted to approach this systematically so that we could give 100% focus to each step of the process. After the virtual machines were back online, we began notifying customers who had opened tickets that the majority of the work was now complete and asked them to let us know if they saw any remaining issues.
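For illustration, that host-by-host approach looks roughly like the sketch below, under the same assumptions as above (libvirt/`virsh` over SSH, hypothetical hostnames): finish one hypervisor completely before moving on to the next.

```python
#!/usr/bin/env python3
"""Bring guests back up one hypervisor at a time.

Illustrative sketch only: assumes SSH access and a libvirt-based stack
(`virsh`), which is an assumption; hostnames are hypothetical.
"""
import subprocess

HYPERVISORS = ["hv-ny1-01.example.com", "hv-ny1-02.example.com"]  # hypothetical

def ssh(host: str, cmd: str) -> str:
    return subprocess.check_output(["ssh", host, cmd], text=True, timeout=120)

for host in HYPERVISORS:
    # Enumerate every guest defined on this hypervisor.
    for guest in ssh(host, "virsh list --all --name").split():
        # Only start guests that are still shut off after the power loss.
        if ssh(host, f"virsh domstate {guest}").strip() == "shut off":
            print(f"{host}: starting {guest}")
            subprocess.run(["ssh", host, f"virsh start {guest}"], check=False)
    # Finish one hypervisor completely before moving to the next,
    # so each step gets full attention.
```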
With all of the hypervisors back online, networking issues resolved, and all virtual machines booted, we instructed customers who were still having issues to open a ticket so that we could troubleshoot with them. We did see a small percentage of virtual machines come up with dirty filesystems, which required an fsck to get them back online and working; we ask any customers who are not familiar with fsck to reach out to us so we can help with that process.
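For those comfortable doing the repair themselves, it is typically a matter of running fsck against the affected filesystem while it is unmounted (for example, from a rescue or recovery boot). The sketch below is illustrative only; the device path is hypothetical and will differ per system.

```python
#!/usr/bin/env python3
"""Check and repair an unmounted ext filesystem after an unclean shutdown.

Illustrative only: /dev/vda1 is a hypothetical device path, and the
filesystem must be unmounted (e.g. from a rescue boot) before running fsck.
"""
import subprocess

DEVICE = "/dev/vda1"  # hypothetical root device of an affected VM

# Show the recorded filesystem state (ext2/3/4); "not clean" indicates a dirty filesystem.
subprocess.run(["tune2fs", "-l", DEVICE], check=False)

# -f forces a full check; -y answers "yes" to repair prompts so the run is unattended.
subprocess.run(["fsck", "-f", "-y", DEVICE], check=False)
```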