Archive for September, 2009

Network Issues Resolved

Saturday, September 26th, 2009

We have now received the all clear from the data centre:

Please be advised that we are now able to officially provide an all-clear notification to all clients regarding the network disturbances, outages, interruptions to service and emergency maintenance experienced by clients in RSH-North since Midday, Thursday 24th September.

To clarify, normal service has been resumed for the large majority of clients since before 3am; however, we are only now able to issue a full all clear.

The NAS service continues to experience problems and this will be dealt with separately.

A full report will be issued as soon as possible but in the meantime all clients are encouraged to contact us by telephone immediately if they are experiencing any problems at this time.

Network Outage

Friday, September 25th, 2009

Currently all servers are offline as the data centre are once again attempting to resolve the issues with the core server.

Further updates will be posted as we have them.

UPDATE: 21:37 – Approximately 40% of our servers are now coming back online…

UPDATE – 01:33 – All servers up and running. A full RFO will be issued shortly…

UPDATE – 02:11 – After removing the VSS cluster technology from the RSH North zone routers and restoring the routers to a similar configuration to the ones which served RHC stably for over two years, considerable progress was made in restoring connectivity for 90% of clients. However, service is not yet stable, and clients will see differing levels of connectivity, ranging from complete loss of service to severely degraded service.

Obviously we are continuing to work on this with Cisco.

Please be assured that further updates will be provided as and when we have them.

Network Issues Update

Friday, September 25th, 2009

The data centre are continuing to have severe network issues that are affecting all our servers.

We are still waiting for the all clear from the data centre, but in the meantime users may experience slow connections, timeouts or lag when browsing sites or using services.

Please bear with us during this period. Unfortunately the issue is out of our control, but we are working closely with the data centre to get it resolved as soon as possible.
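For anyone who wants to check from their own side whether a service is slow or unreachable, the following is a minimal sketch in Python. It is not a tool we supply: the function names, the one-second "slow" threshold and the host and port in the usage note are all assumptions chosen for illustration.

```python
import socket
import time
from typing import Optional


def connect_latency(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure.

    Hypothetical helper: a refused connection, a timeout or a DNS failure
    all surface as OSError and are reported as None.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def classify(latency: Optional[float], slow_after: float = 1.0) -> str:
    """Label a measurement; the 1.0 second threshold is an arbitrary assumption."""
    if latency is None:
        return "unreachable"
    return "slow" if latency > slow_after else "ok"
```

For example, `classify(connect_latency("www.example.com", 80))` gives a rough label for a web server, with `www.example.com` standing in for your own hostname. A single measurement says little during intermittent packet loss, so repeating the check every minute or so gives a fairer picture.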

Update 18:22
The following has now been received from the data centre:

Below is a summary update of the issues experienced by clients in RSH-North today and yesterday, 25th and 24th September 2009.

Fundamentally we have experienced serious issues that have affected all clients in RSH-North. This has been due to the Cisco equipment at the core of our Spectrum House services not responding as per specification and documentation.

As has already been explained, we have been in direct contact with Cisco, working on a Level 1 priority request to solve the issues that have affected our clients today. We take full responsibility for our vendor selection and do not wish to appear to be passing blame “conveniently”. We pride ourselves on the level of service that we provide and also on the quality of communication that we send to clients. We accept that neither has been anywhere close to our usual standard during this prolonged incident. However, we would like to take this opportunity to clarify a few points that we are aware have been questioned and discussed by our clients and competitors:

- Spectrum House routers were configured in a redundant VSS cluster
- CPU usage was recorded as being very high for normal usage
- Cisco have offered two possible solutions to the issue, including a firmware update provided to us yesterday. These solutions have failed to resolve the issues experienced
- Our priority has, at all times, been to provide stable service for as much time as possible, to as many clients as possible. That has been our aim throughout, although at times it has not been possible. We do not shirk the responsibility for this matter and recognise the impact that it has on our clients’ businesses and both our and their reputations.

Our network team are continuing to investigate why some of our clients are still experiencing outages, interrupted service and packet-loss and are doing so with Cisco.

As soon as we have further updates we will be posting them directly here.

Emergency Network Maintenance

Thursday, September 24th, 2009

We need to perform some emergency network maintenance as soon as possible. This has been scheduled for tonight, which we appreciate is very short notice. We need to reboot a router to install new software, and this reboot will take up to 45 minutes. We will do everything we can to speed up the process and reduce the maintenance time.

Date: 24/09/2009
Window: 23:00 for 2 hours
Duration: < 45 minutes.

The maintenance is to perform an emergency upgrade of Cisco software. We are using a Cisco VSS-1440 as part of our network core, and we have been experiencing some reduced performance with it today. Nothing in our network configuration or setup is causing this, and it started to have a detrimental effect on some clients today. We escalated this to the Cisco TAC team, who have diagnosed a fault with the software on the router in the form of a memory leak. Cisco has supplied us with a new version of the software for the router which will fix the memory leak and slow performance.

The nature of this problem is that it will escalate as time goes on, which is why we want to apply the fix as soon as core business hours finish today. Please accept our apologies for the short notice. We hope your clients appreciate that this problem, caused by Cisco software, was out of our control, and that we are working as best we can to resolve it quickly.

We apologise for any inconvenience this may cause. Please do not hesitate to contact us if you have any queries or questions regarding this maintenance window.

UPDATE 01:56 - Servers are starting to come back online now. We have asked the data centre for a full reason for outage explaining why this took much longer than expected, and we will be contacting all customers once we have the facts.

UPDATE 05:38 -

The following update has been issued by our data centre:

This is a further update to our earlier message regarding the problems with the scheduled maintenance.

As mentioned in our previous message, there was a complication with the new firmware which required additional troubleshooting. During this time there was no connectivity to Spectrum House North Side (RSH Nth), one of the zones in one of our datacentres in Maidenhead. The network, including the London fibre ring and all external peering points, remained in full working order.

Full service to RSH Nth should have been restored by approximately 03:30. Small outages of less than 5 minutes may still be experienced by individual subnets as final configuration work is completed. These will be completed by 07:00. If anyone is still experiencing any problems please contact us immediately and we'll do our best to resolve them for you.

Date: 24/09/09
Time: 23:00
Duration: <4.5hrs
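As a quick check of the figures quoted above, the duration can be worked out from the stated start time and the approximate restoration time. This is only an illustrative sketch: it treats 23:00 and 03:30 as exact, although the restoration time was given as approximate, and it compares the result with the 45 minutes originally planned for the reboot.

```python
from datetime import datetime, timedelta

# Times taken from the notices; 03:30 was described as approximate.
start = datetime(2009, 9, 24, 23, 0)
restored = datetime(2009, 9, 25, 3, 30)

outage = restored - start
planned_max = timedelta(minutes=45)  # planned upper bound for the reboot
overrun = outage - planned_max

print(outage)   # 4:30:00
print(overrun)  # 3:45:00
```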

The main cause of the extended outage was a problem in getting the VSS cluster to accept the new firmware. We have tried to provide a brief but accurate summary of the events below. A full reason for outage (RFO) will then follow tomorrow.

Given that the new firmware was required to avoid the memory leak issue, the situation had to be resolved. A decision was made to focus on correcting a potentially debilitating problem to the network and subsequently this evening's outage was extended, rather than revert to a flawed firmware version.

1. New firmware image is loaded and prepared for use on reboot.
2. First router is rebooted.
3. Router finishes booting into new firmware, but the configuration has been wiped and it is no longer part of the cluster.
4. Router is reinitialised with cluster settings and rebooted again, to apply these changes.
5. The router hangs during the boot process, shortly after decompressing the image.
6. On consultation with Cisco it is agreed to boot back into the old firmware to try to restore a solid boot. This works.
7. The old image is removed from the boot memory and the new image is again prepared for use on boot.
8. This time both the image and the minor temporary configuration hold.
9. The backup of the configuration is restored to the router and a reboot applied to test that it holds, which it does.
10. The boot process includes bringing up each line card one at a time. During this boot two of the line cards are not initialised, citing an error.
11. Following consultation with Cisco, one of the two line cards is brought back online. This restores connectivity to the remaining racks in RSH.
12. Due to the reconfiguration of the cluster on the first router, these changes have to be replicated on the rest of the cluster. This is a time-consuming process.
13. The previously scheduled maintenance that had been prevented by the memory leak needs to be completed. This process is ongoing and should be completed by 07:00. Once these final configuration changes are applied we will send a further update.

Obviously we would like to apologise to any of our clients who were affected by this work. Due to unforeseen circumstances described in detail above, work was severely delayed for not just our clients but other companies located in the same data centre.

We are working closely with the data centre to identify any further weak points and to avoid additional disruption.

SmarterMail Upgrade

Monday, September 21st, 2009

13:22 – We are currently upgrading the primary mailserver and mail is being diverted to our backup mail server. We expect this work to take no more than thirty minutes.

13:54 – The upgrade is now complete.

SmarterStats Upgrade

Monday, September 21st, 2009

We are currently in the process of upgrading SmarterStats across our Windows network.

Users may experience intermittent issues and/or be unable to view their stats whilst this work is in progress.

We expect the work to take no more than 60 minutes.

11:33 – This work has now been completed and all servers have been upgraded to SmarterStats Enterprise 4.x.

Linux Server One cPanel Issue

Friday, September 11th, 2009

We have had reports that users are currently unable to log in to their cPanel accounts.

We are currently investigating the problem and should have the issue resolved shortly.

Update 21:50 – This should now be resolved.