SOBS outage 13-7-2018 10:46am

Jul 13, 2018

This appears to be a networking error at Servers Australia.

They know about the problem and are working to resolve it.

NOTE: Some schools are still able to access SOBS. Network faults often behave this way: whether you are affected depends largely on who your internet provider is.

UPDATE: 11:23am The problem looks as though it may be resolved. We will continue to monitor it and provide details after the debriefing.

UPDATE: 1:18pm If you are still experiencing any problems accessing SOBS, please call or email us.

For SOBS users, we consider this outage to have lasted from 10:46am to 11:23am, a total of 37 minutes.
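For anyone checking this figure against their own logs, the duration can be worked out as in the minimal sketch below. The variable names are ours and purely illustrative; the two times are the ones given in this notice.

```python
from datetime import datetime

# Start and end times as stated in this notice (local time on the day of the outage).
outage_start = datetime(2018, 7, 13, 10, 46)
outage_end = datetime(2018, 7, 13, 11, 23)

# Subtracting two datetimes gives a timedelta; convert it to whole minutes.
outage_minutes = int((outage_end - outage_start).total_seconds() // 60)
print(outage_minutes)  # 37
```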

The following is the incident report from Jared Hirst, CEO of Servers Australia:

On Friday the 13th of July, some customers suffered a network interruption to their services across the Servers Australia network. An error occurred within the Servers Australia core infrastructure that propagated to almost all core and edge switches across our sites for services that were not set up and configured for multi-site redundancy. This error was not due to a hardware error, but rather a configuration error within a core OSPF routing service. For full technical details of what occurred, please see the Post Incident Report.

Due to recent major network upgrades, all Servers Australia sites can stand independent of each other and are connected to individual routers, transit providers, peering points and core switches giving them a great deal of redundancy in the event of a failure.

However, they do all share a global OSPF routing table that can ultimately bring down the entire network upon a misconfiguration. This setup is very common in networks that have started off as a single site and then grown into multiple sites over time. This is by no means any excuse for an interruption to a network service, but it is a common issue for large and small network operators. 

Servers Australia has had significant growth over the past five years, and the network has expanded from 1 site to 17 sites through several acquisitions. We have had some bumps along the way with integrating acquisitions and vendor bug issues, as do all large providers. These issues have all been rectified with large expenditure into our core network and hard work by our team.  

The work to consolidate these locations is still underway. This is not our customers' issue, but I want to be transparent that integrating all of our clients into the one network has not been an easy task.

Some clients feel that they had more stability with their old provider, and some feel that we have been providing a far better service. 

I feel that any outage that we have provided is a failure of our promise to our customers to deliver a great service, and I am personally committed to ensuring that there are no major network issues moving forward. 

Most, if not all, of the network has kept up with that growth and those integrations, with the exception of a few configurations that could not be changed, and unfortunately the OSPF areas have been one of them. To make the changes that were needed to prevent the network incident on Friday, we would have needed to schedule a network-wide outage.

I am ultimately responsible for this outage, as we were attempting to make the network changes with little to no impact to customers. This was happening, though it was taking time. In hindsight, this was the wrong call to make, and ultimately led to the conditions that resulted in Friday’s incident. 

We know what we need to do now to ensure that an outage of this scale cannot happen again. I have given my team the full go-ahead to schedule any and all work needed so that all data centres, sites, switches and routers are independent of each other and we have a completely segregated network.

I apologise for any issues that this caused, and I am available for anything that I can do to assist.

Yours Sincerely,

Jared Hirst