Emergency Network Maintenance

We need to perform some emergency network maintenance as soon as possible. This has been scheduled for tonight, which we appreciate is very short notice. We need to reboot a router to install new software, and this reboot will take up to 45 minutes. We will do everything we can to keep the maintenance time as short as possible.

Date: 24/09/2009
Window: 23:00 for 2 hours
Duration: < 45 minutes.

The maintenance is an emergency upgrade of Cisco software. We use a Cisco VSS-1440 as part of our network core, and it has been showing reduced performance today. Nothing in our network configuration or setup is causing this, and it has started to have a detrimental effect on some clients. We escalated the issue to the Cisco TAC team, who diagnosed a fault in the router software in the form of a memory leak. Cisco has supplied us with a new version of the software, which will fix the memory leak and the slow performance.

The nature of this problem is that it will escalate as time goes on, which is why we want to apply the fix as soon as core business hours finish today. Please accept our apologies for the short notice. We hope your clients appreciate that this problem was caused by Cisco software and was outside our control, and that we are working as best we can to resolve it quickly.

We apologise for any inconvenience this may cause. Please do not hesitate to contact us if you have any questions regarding this maintenance window.

UPDATE 01:56 - Servers are starting to come back online now. We have requested a full reason for outage from the data centre explaining why this took much longer than expected, and we will contact all customers once we have the facts.

UPDATE 05:38 -

The following update has been issued by our data centre:

This is a further update to our earlier message regarding the problems with the scheduled maintenance.

As mentioned in our previous message, there was a complication with the new firmware which required additional troubleshooting. During this time there was no connectivity to Spectrum House North Side (RSH Nth), one of the zones in one of our datacentres in Maidenhead. The network, including the London fibre ring and all external peering points, remained in full working order.

Full service to RSH Nth should have been restored by approximately 03:30. Small outages of less than 5 minutes may still be experienced by individual subnets as final configuration work is completed. These will be completed by 07:00. If anyone is still experiencing any problems please contact us immediately and we'll do our best to resolve them for you.

Date: 24/09/09
Time: 23:00
Duration: <4.5hrs

The main cause of the extended outage was a problem in getting the VSS cluster to accept the new firmware. We have tried to provide a brief but accurate summary of the events below. A full reason for outage (RFO) will then follow tomorrow.

Given that the new firmware was required to avoid the memory leak issue, the situation had to be resolved. A decision was made to focus on correcting a potentially debilitating problem with the network rather than revert to a flawed firmware version, and this evening's outage was extended as a result.

1. New firmware image is loaded and prepared for use on reboot.
2. First router is rebooted.
3. Router finishes booting into new firmware, but the configuration has been wiped and it is no longer part of the cluster.
4. Router is reinitialised with cluster settings and rebooted again, to apply these changes.
5. The router hangs during the boot process, shortly after decompressing the image.
6. On consultation with Cisco it is agreed to boot back into the old firmware to try to restore a solid boot. This works.
7. The old image is removed from the boot memory and the new image is again prepared for use on boot.
8. This time both the image and the minor temporary configuration hold.
9. The backup of the configuration is restored to the router and a reboot applied to test that it holds, which it does.
10. The boot process includes bringing up each line card one at a time. During this boot two of the line cards are not initialised, citing an error.
11. Following consultation with Cisco, one of the two line cards is brought back online. This restores connectivity to the remaining racks in RSH.
12. Due to the reconfiguration of the cluster on the first router, these changes have to be replicated on the rest of the cluster. This is a time-consuming process.
13. The previously scheduled maintenance that had been prevented by the memory leak needs to be completed. This process is ongoing and should be completed by 07:00. Once these final configuration changes are applied we will send a further update.

We would like to apologise to any of our clients who were affected by this work. Due to the unforeseen circumstances described in detail above, the work was severely delayed, affecting not just our clients but also other companies located in the same data centre.

We are working closely with the data centre to identify any further weak points and to avoid additional disruption.
