We have now received the following explanation from the data centre of the circumstances surrounding Sunday's power failure. As expected, there was little the data centre could have done to prevent or foresee this occurrence, and we are now satisfied that the problem has been fully resolved.
We began investigation on Sunday, working with our suppliers to find the cause of the outage. The initial investigations appeared to have identified a phase detection relay which may have failed in the generator auto-changeover panel. This could have caused the panel to think there was a mains power failure when in fact there had not been.
The UPS systems worked as expected without fault; however, because the mains was forced off by the potentially failed relay and power from the generators was also locked out, the UPS batteries fully discharged after 15 minutes, resulting in a critical power failure.
Engineers went to collect a new relay, along with other spares that may be needed, from the manufacturer in North Yorkshire, and arrived back in the early hours of Monday morning.
We also took the step to manually disable the mains auto changeover system in order to avoid any potential further mains cuts. In the event of a real power outage on site we would operate the generator changeover procedure manually using onsite staff.
On Monday morning, the switch panel vendor was on site and conducted several tests of the board. A decision was taken to replace the Phase Failure Relay (PFR) with a new one.
We then completed six mains failure tests over the next 48 hours, in order to fully test the systems that have been replaced to ensure a full switch over to generator takes place in the event of a real power cut.
Today, we have collated the information so far to provide a fuller description of events. At 4.43am on Sunday morning the building lost mains power, which caused the automatic systems to start the generator, which ran as expected. The system is then designed to open the Air Circuit Breaker (ACB) on the mains feed and close the ACB on the generator, thus supplying the UPS with generator power. This worked as expected and the generator took the load. Approximately 2 minutes later the power cut ended and mains power was restored, shutting down the generator and operating the ACBs to switch back to mains, all of which worked as planned.
Shortly after this there was a further power cut, which restarted the above sequence: the generator started (successfully), the mains ACB opened (successfully) and the signal was sent to close the generator ACB. However, the generator ACB failed to close, meaning that the generator could not supply the UPS with power during the power cut. The UPS worked as expected and took the load. During this time the mains came back on. The ACBs have a physical and electrical interlocking system which prevents both ACBs from being closed at the same time, thus preventing the possibility of both mains and generator feeding the load, which would result in a severe failure. Because the signal had been sent to the generator ACB to close but it never did, the interlocking systems got into a state of deadlock in which both ACBs were stuck in the ‘open’ position, leaving the UPS with no feed. As a result the batteries drained down after 15 minutes and the critical load was lost.
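The sequence above can be illustrated with a small simulation. This is only a sketch of one way such a deadlock could arise, not a description of the panel's actual control logic, which the data centre has not published. It assumes (hypothetically) that the interlock treats an outstanding close command on one ACB as a reason to block the other, and that a faulty ACB never clears that command. The names Breaker, Changeover, close_pending and simulate are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Breaker:
    name: str
    closed: bool = False         # actual contact position
    close_pending: bool = False  # close command issued but not yet confirmed
    faulty: bool = False         # hypothetical: mechanism ignores close commands


class Changeover:
    """Toy model of a mains/generator auto-changeover with an interlock."""

    def __init__(self) -> None:
        self.mains = Breaker("mains", closed=True)
        self.generator = Breaker("generator")

    def _try_close(self, target: Breaker, other: Breaker) -> bool:
        # ASSUMED interlock rule: refuse to close while the other breaker
        # is closed or still has an unconfirmed close command.
        if other.closed or other.close_pending:
            return False
        target.close_pending = True
        if not target.faulty:
            target.closed = True
            target.close_pending = False  # confirmed closed, command cleared
        return target.closed

    def mains_lost(self) -> None:
        self.mains.closed = False
        self._try_close(self.generator, self.mains)

    def mains_restored(self) -> None:
        self.generator.closed = False
        self._try_close(self.mains, self.generator)

    def ups_has_feed(self) -> bool:
        return self.mains.closed or self.generator.closed


def simulate(second_close_fails: bool) -> list[bool]:
    """Run two power cuts; report whether the UPS has a feed after each event."""
    panel = Changeover()
    feed = []
    panel.mains_lost()
    feed.append(panel.ups_has_feed())
    panel.mains_restored()
    feed.append(panel.ups_has_feed())
    # Model the intermittent fault appearing on the second changeover.
    panel.generator.faulty = second_close_fails
    panel.mains_lost()
    feed.append(panel.ups_has_feed())
    panel.mains_restored()
    feed.append(panel.ups_has_feed())
    return feed


if __name__ == "__main__":
    print("healthy:", simulate(second_close_fails=False))
    print("faulty: ", simulate(second_close_fails=True))
```

In the faulty run the UPS loses its feed at the second power cut and does not regain it when the mains returns, because the unconfirmed close command on the generator ACB keeps the mains ACB locked out. In this toy model the only way out is manual intervention, which matches the manual restoration of mains described in this report.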
Work started on Sunday and continued on Monday to look at the electrical circuitry that controls the electrical side of the interlocks, as well as the mains phase failure relay, which detects a mains failure. This tested OK; however, it was decided to replace critical parts with new spares to rule out any issues. After this was completed, we conducted a mains failure test, which failed in the same way as on Sunday morning. We restored mains manually at this point.
Work then commenced to look at a possible failure of the manual interlock system, which could cause the same issue. Work continued to check and replace certain parts of this system before we re-ran a mains failure test. This test passed and the system worked as expected. We then decided to re-run the test to ensure the issue had been fixed. The next generator test was completed; however, it failed with the same result as the first failure. Mains was again restored manually.
As all electrical circuitry tested and operated OK, and all manual interlocks worked OK, the board vendor then started to look at a possible fault with the generator ACB. Firing pins in the ACB were tested and passed, which led the board manufacturer to suspect there was an intermittent issue with the generator ACB. This ACB is manufactured by APC/Schneider Electric/Merlin Gerin (now all the same company). As this ACB is under warranty, our board vendor did not want to strip the ACB and look for issues, preferring that Merlin Gerin engineers look at this component directly.
Merlin Gerin were contacted on Monday night and provided telephone support to the board vendor; however, this was unsuccessful. Merlin Gerin have now agreed that an emergency support engineer needs to look at the unit in situ, with the hope of swapping the failed component and re-testing. A Merlin Gerin emergency support engineer is currently on site with the required spares. Representatives from both the UPS manufacturer and the control board vendor are also on site.
After the fifth mains failure test we believe Merlin Gerin have now identified the issue with the ACB. The suspected faulty component was replaced, and a further three mains failure tests were then carried out to validate correct operation. These further tests have been successful and have not exhibited the problematic behaviour previously identified.
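The earlier test that passed once and then failed on the re-run shows why a single successful test is weak evidence against an intermittent fault. As a rough illustration only, the sketch below computes how likely a run of consecutive passes would be if the fault were still present. It assumes each test fails independently with the same probability, and the 0.5 figure is purely illustrative, not a measured failure rate for this ACB. The function name chance_of_false_pass is invented for the example.

```python
def chance_of_false_pass(p_fail_per_test: float, passes: int) -> float:
    """Probability that `passes` consecutive tests all succeed even though
    an intermittent fault is still present.

    ASSUMPTION: each test fails independently with probability
    p_fail_per_test; real faults may not behave this way.
    """
    if not 0.0 <= p_fail_per_test <= 1.0:
        raise ValueError("p_fail_per_test must be between 0 and 1")
    if passes < 0:
        raise ValueError("passes must be non-negative")
    return (1.0 - p_fail_per_test) ** passes


if __name__ == "__main__":
    # Illustrative failure rate only; not taken from the incident data.
    for n in (1, 3):
        print(n, "pass(es):", chance_of_false_pass(0.5, n))
```

Under these assumptions, each additional clean test reduces the chance that the fault is merely hiding, which is the reasoning behind running several validation tests rather than one.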