One of the telecoms engineers at my workplace spoke to a source inside BT and found out the true story behind the two outages. Last Friday’s outage was not a power cut. A card in a router locked up, and rather than replace it, BT decided to simply reboot it and hope it wouldn’t happen again.
This goes against standard practice in any large datacentre, where a multitude of people rely on a single device working properly 100% of the time: when something that critical fails, you replace it immediately. It also means they really don’t have any redundancy, since it appears there was no second card or router to take over when the first failed. That means a 2-6 hour outage while engineers are got out of bed, travel to site, diagnose the fault and fix it.