Archive for the ‘Linux Servers’ Category

DirectAdmin Server Maintenance

Wednesday, October 7th, 2009

This is advance notice of the maintenance work to be carried out on the DirectAdmin Linux server. If you use DirectAdmin to administer your site then you will be affected by this move.

Our monitoring system has picked up a minor issue with this server, so as a precaution all accounts will be moved to a new server.

The work has been scheduled for Monday 12th October from 8pm.

Further updates will be posted closer to the time.

Network Issues Resolved

Saturday, September 26th, 2009

We have now received the all clear from the data centre:

Please be advised that we are now able to officially provide an all-clear notification to all clients regarding the network disturbances, outages, interruptions to service and emergency maintenance experienced by clients in RSH-North since Midday, Thursday 24th September.

To clarify, normal service has been resumed for the large majority of clients since before 3am; however, we are only now able to issue a full all-clear.

The NAS service continues to experience problems and this will be dealt with separately.

A full report will be issued as soon as possible but in the meantime all clients are encouraged to contact us by telephone immediately if they are experiencing any problems at this time.

Network Outage

Friday, September 25th, 2009

Currently all servers are offline as the data centre are once again attempting to resolve the issues with the core server.

Further updates will be posted as we have them.

UPDATE – 21:37 – Approximately 40% of our servers are now coming back online…

UPDATE – 01:33 – All servers up and running. A full RFO will be issued shortly…

UPDATE – 02:11 – After removing the VSS cluster technology from the RSH North zone routers and restoring the routers to a similar configuration to the ones which served RHC stably for over two years, considerable progress was made to restore connectivity for 90% of clients. However, service is not stable and clients will see differing levels of connectivity ranging from complete loss of service to severely degraded.

Obviously we are continuing to work on this with Cisco.

Please be assured that further updates will be provided as and when we have them.

Network Issues Update

Friday, September 25th, 2009

The data centre are continuing to have severe network issues that will be affecting all our servers.

We are still waiting for the all clear from the data centre, but in the meantime users may experience slow connections, timeouts or lag when browsing sites or using services.

Please bear with us during this period. Unfortunately the issue is out of our control, but we are working closely with the data centre to get this resolved as soon as possible.

Update 18:22
The following has now been received from the data centre:

Below is a summary update of the issues experienced by clients in RSH-North today and yesterday, 25th and 24th September 2009.

Fundamentally we have experienced serious issues that have affected all clients in RSH-North. This has been due to the Cisco equipment at the core of our Spectrum House services not responding as per specification and documentation.

As has already been explained, we have been in direct contact with Cisco, working on a Level 1 priority request to solve the issues that have affected our clients today. We take full responsibility for our vendor selection and do not wish to appear to be passing blame “conveniently”. We pride ourselves on the level of service that we provide and also on the quality of communication that we send to clients. We accept that neither has been anywhere close to our usual standard during this prolonged incident. However, we would like to take this opportunity to clarify a few points that we are aware have been questioned and discussed by our clients and competitors:

- Spectrum House routers were configured in a redundant VSS cluster
- CPU usage was recorded as being very high for normal usage
- Cisco have offered two possible solutions to the issue, including a firmware update provided to us yesterday. These solutions have failed to resolve the issues experienced
- Our priority has, at all times, been to provide stable service for as much time as possible, to as many clients as possible. This has been the endeavour and sometimes this has not been possible. We do not shirk the responsibility for this matter and recognise the impact that it has on our clients’ businesses and both our and their reputations.

Our network team are continuing to investigate why some of our clients are still experiencing outages, interrupted service and packet-loss and are doing so with Cisco.

As soon as we have further updates we will be posting them directly here.

Emergency Network Maintenance

Thursday, September 24th, 2009

We need to perform some emergency network maintenance as soon as possible. This has been scheduled for tonight, which we appreciate is very short notice. We need to reboot a router to install new software, and this reboot will take up to 45 minutes. We will do everything we can to keep the maintenance time as short as possible.

Date: 24/09/2009
Window: 23:00 for 2 hours
Duration: < 45 minutes.
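
As a rough illustration of the figures above, the following Python sketch works out when the two-hour window closes and how much slack remains if the reboot takes the full 45 minutes. The variable names are our own and purely illustrative; only the date, the 23:00 start, the two-hour window and the 45-minute upper bound come from the notice.

```python
from datetime import datetime, timedelta

# Figures taken from the notice above; the names are illustrative only.
window_start = datetime(2009, 9, 24, 23, 0)   # 24/09/2009 at 23:00
window_length = timedelta(hours=2)            # "23:00 for 2 hours"
max_reboot_time = timedelta(minutes=45)       # upper bound from "< 45 minutes"

# The window closes two hours after it opens, i.e. early the next morning.
window_end = window_start + window_length

# Time left in the window if the reboot takes the full 45 minutes.
slack = window_length - max_reboot_time

print("Window closes:", window_end.strftime("%d/%m/%Y %H:%M"))
print("Slack if the reboot takes the full 45 minutes:", slack)
```

On these figures the window closes at 01:00 on 25/09/2009, leaving 75 minutes of slack beyond the expected reboot time.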

The maintenance is to perform an emergency upgrade of Cisco software. We are using a Cisco VSS-1440 as part of our network core, and we have been experiencing some reduced performance with it today. We have found no cause for this within our network configuration or setup, and it started to have a detrimental effect on some clients today. We escalated this to the Cisco TAC team, who have diagnosed a fault with the software on the router in the form of a memory leak. Cisco has supplied us with a new version of the software for the router which will fix the memory leak and slow performance.

The nature of this problem is that it will escalate as time goes on, which is why we want to apply the fix as soon as core business hours finish today. Please accept our apologies for the short notice. We hope your clients appreciate that this problem was outside our control and caused by Cisco software, and that we are working as best we can to resolve it quickly.

We apologise for any inconvenience this may cause. Please do not hesitate to contact us if you have any questions regarding this maintenance window.

UPDATE 01:56 - Servers are starting to come back online now. We have requested a full reason for outage from the data centre explaining why this took much longer than expected, and we will be contacting all customers once we have the facts.

UPDATE 05:38 -

The following update has been issued by our data centre:

This is a further update to our earlier message regarding the problems with the scheduled maintenance.

As mentioned in our previous message, there was a complication with the new firmware which required additional troubleshooting. During this time there was no connectivity to Spectrum House North Side (RSH Nth), one of the zones in one of our datacentres in Maidenhead. The network, including the London fibre ring and all external peering points, remained in full working order.

Full service to RSH Nth should have been restored by approximately 03:30. Small outages of less than 5 minutes may still be experienced by individual subnets as final configuration work is completed. These will be completed by 07:00. If anyone is still experiencing any problems please contact us immediately and we'll do our best to resolve them for you.

Date: 24/09/09
Time: 23:00
Duration: <4.5hrs

The main cause of the extended outage was a problem in getting the VSS cluster to accept the new firmware. We have tried to provide a brief but accurate summary of the events below. A full reason for outage (RFO) will then follow tomorrow.

Given that the new firmware was required to avoid the memory leak issue, the situation had to be resolved. A decision was made to focus on correcting a potentially debilitating problem to the network and subsequently this evening's outage was extended, rather than revert to a flawed firmware version.

1. New firmware image is loaded and prepared for use on reboot.
2. First router is rebooted.
3. Router finishes booting into new firmware, but the configuration has been wiped and it is no longer part of the cluster.
4. Router is reinitialised with cluster settings and rebooted again, to apply these changes.
5. The router hangs during the boot process, shortly after decompressing the image.
6. On consultation with Cisco it is agreed to boot back into the old firmware to try and restore a solid boot. This works.
7. The old image is removed from the boot memory and the new image is again prepared for use on boot.
8. This time both the image and the minor temporary configuration hold.
9. The backup of the configuration is restored to the router and a reboot applied to test that it holds, which it does.
10. The boot process includes bringing up each line card one at a time. During this boot two of the line cards are not initialised, citing an error.
11. Following consultation with Cisco, one of the two line cards is brought back online. This restores connectivity to the remaining racks in RSH.
12. Due to the reconfiguration of the cluster on the first router, these changes have to be replicated on the rest of the cluster. This is a time-consuming process.
13. The previously scheduled maintenance that had been prevented by the memory leak needs to be completed. This process is ongoing and should be completed by 07:00. Once these final configuration changes are applied we will send a further update.

We would like to apologise to any of our clients who were affected by this work. Due to the unforeseen circumstances described above, the work was severely delayed, affecting not just our clients but other companies located in the same data centre.

We are working closely with the data centre to identify any further weak points and to avoid additional disruption.

Linux Server One cPanel Issue

Friday, September 11th, 2009

We have had reports that users are currently unable to log in to their cPanel accounts.

We are currently investigating the problem and should have the issue resolved shortly.

Update 21:50 – This should now be resolved.

Linux Server One Unavailable

Saturday, August 22nd, 2009

We are aware of an issue with Linux server one (78.129.143.18) and we are looking into the issue as a priority.

Update 18:04 – This has now been resolved and we are investigating the outage.

Linux Server One

Saturday, July 25th, 2009

Linux server one rebooted a short while ago and has not automatically restarted. We have the data centre looking into the issue as a matter of urgency…

Update 10:32 – At this stage it looks like the network card has conflicted with the kernel upgrade. We are waiting for the data centre to upgrade the card drivers and reboot the server…

Update 10:57 – It does indeed look like the NIC is the problem. The data centre are replacing the card and re-configuring the drivers now…

Update 11:26 – The main server is now back online. We are now waiting for IPs to be reallocated to the new NIC, which should take between 10 and 15 minutes…

Update 12:10 – IP address re-allocations are still filtering through the new NIC and routers. This is approximately 50% complete…

Update 12:46 – 95% of the IPs are now back online with only three remaining to propagate through…

Update 14:24 – This issue is now fully resolved…
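
As a rough cross-check on the 12:46 update, the two figures it quotes (95% online, three addresses remaining) imply a total of about 60 IP addresses, assuming the 95% is exact rather than rounded. The actual number of addresses on the server was not stated, so this is only an estimate, and the variable names are illustrative.

```python
# Figures quoted in the 12:46 update; the total is inferred, not stated.
percent_online = 95
remaining_ips = 3

# If the remaining addresses make up the other 5%, scale up to 100%.
estimated_total = remaining_ips * 100 // (100 - percent_online)
estimated_online = estimated_total - remaining_ips

print("Estimated total IPs:", estimated_total)
print("Estimated IPs already online:", estimated_online)
```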

Linux Server One Ongoing Load Issues

Friday, July 24th, 2009

Despite our best efforts we are still having ongoing issues with the load on server one. We have deployed a new server and are currently preparing it for use.

We shall then start identifying high-use sites and moving them to that server.

Linux Server One Unavailable

Thursday, July 23rd, 2009

Unfortunately we are continuing to have issues with Linux server one, with intermittent outages.

Please bear with us whilst we try to resolve this ongoing problem.

Update 14:45 – We are currently rebuilding Apache which will take approximately 30 minutes to complete. We should be able to start serving requests once this is complete.

Update 15:17 – Apache has now been rebuilt and the server is up and running. We have taken the opportunity to upgrade several core elements of Apache, including upgrading to PHP v5.2.10, to attempt to reduce the server load and increase stability.

We would like to apologise for the excessive downtime on this server over the last few days. We take this very seriously, and please rest assured we are doing everything we can to keep your service as uninterrupted as possible.

Update 17:12 – We currently have technicians working on the server, upgrading all software to the latest versions, so service may be a little patchy for a while.

Once this work is complete the server should be up to full speed…