
Lessons From EMC VNX 5400 Failure at OVH

More robust procedures to be defined in data centers

One of our articles already covered the recent OVH service failure in Europe and France, which occurred at one of the company's data centers, P19 more precisely.

Even if the failure involved a Dell EMC storage array, here a VNX 5400 deployed in 2012, it seems that the procedures and configurations around that array, and the services relying on it, were limited and clearly insufficient for IT services and operations at this scale. The impact was huge, with 50,000 sites down, as this array stored the many databases used by all these sites.

OVH has published an update (available here in French) indicating the real root cause of the problem: a cooling-liquid leak that reached the EMC array, an array that should no longer have been in that room at that time because of changes to the original computing room. The monitoring tool was not in operation, and the array was scheduled to be replaced. In the end, this failure seems to be the result of a series of missteps combined with plain bad luck. The problem is that the procedures were not updated during this period, even though the exposure to failure was greater in that context.

OVH tried a few things to restore the service, but frankly these actions appeared old school: a backup/restore procedure, and physically moving a replacement array, holding day-old data sets, from a remote site to swap out the failed one.

What is really surprising is the lack of modern processes to protect data and keep the service up and running.

First, OVH operated with an RPO of 24 hours, as the backup procedure runs once a day and the data is sent to a remote system a few hundred kilometers away. Even if this model looks good on paper, a 24-hour RPO is not aligned with the goals and mission of current cloud service providers; users expect different protection models and certainly a better RPO. Going back to the report published by OVH, the restoration took many hours and, 24 hours later, 76 servers were up and running on the array's 96 SSDs.

Some confusion seems to exist here: even if the array itself was configured in active/active mode, only one array was used, and protection against a complete outage was never designed in.

Historically, IT architects used to deploy a volume manager spanning at least two storage entities, with mirroring between arrays in addition to path redundancy. In that design, if an array fails, or a path, a power supply, a volume…, servers continue to interact with the surviving mirror without any downtime. There is then plenty of time to swap the failed component and resynchronize the new storage entity from the live one. It is a simple design, normally the default one, but a very efficient one that has proven its robustness for decades.
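
To make the principle concrete, here is a minimal sketch of host-side mirroring as a volume manager implements it: every write goes to both legs, and reads are served from whichever leg is still healthy. The class and file paths are purely illustrative assumptions, not a description of OVH's or Dell EMC's actual setup.

```python
class MirroredVolume:
    """Toy illustration of volume-manager mirroring across two storage legs.

    Each leg is just a local file standing in for a LUN on a separate array.
    """

    def __init__(self, leg_a: str, leg_b: str):
        self.legs = [leg_a, leg_b]

    def write(self, data: bytes) -> None:
        # Write to every leg; a single failed leg does not stop the service.
        failures = 0
        for leg in self.legs:
            try:
                with open(leg, "ab") as f:
                    f.write(data)
            except OSError:
                failures += 1
        if failures == len(self.legs):
            raise RuntimeError("all mirror legs failed")

    def read(self) -> bytes:
        # Serve reads from the first healthy leg (the surviving mirror).
        for leg in self.legs:
            try:
                with open(leg, "rb") as f:
                    return f.read()
            except OSError:
                continue
        raise RuntimeError("no mirror leg available")


# Hypothetical usage: lose "array_a" and the data is still served by "array_b".
vol = MirroredVolume("/tmp/array_a.img", "/tmp/array_b.img")
vol.write(b"customer database page\n")
print(vol.read())
```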

Second, data protection technologies such as snapshots, replication and even CDP have existed for a long time, and for sensitive data these three approaches are mandatory. Of course, in the present case the design must target a very small RPO, and something around near-CDP would have been the right, and minimum, choice. The VNX array offers all the features needed to activate these data services, and architects can even consider relying on array-, network- or server-based data services.
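
As a rough illustration of why the protection interval drives the achievable RPO, the worst-case data loss of a periodic snapshot or replication cycle is bounded by its own period. The policies and figures below are assumptions chosen for the example, not OVH's actual settings.

```python
from datetime import timedelta

def worst_case_rpo(protection_interval: timedelta) -> timedelta:
    # The most data you can lose is everything written since the last
    # protection point, i.e. up to one full interval.
    return protection_interval

# Illustrative policies, not OVH's real configuration.
policies = {
    "daily backup shipped off-site": timedelta(hours=24),
    "hourly snapshots + async replication": timedelta(hours=1),
    "near-CDP (journaled writes)": timedelta(seconds=5),
}

for name, interval in policies.items():
    print(f"{name}: worst-case RPO ~ {worst_case_rpo(interval)}")
```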

Third, it appears that the architects confused data protection with application availability. The two work together: an application without its data is useless, and ‘good’ data without the application means the business is stopped as well. Again, a model that protects data in line with RPO goals and maintains the service in line with RTO goals is the basis of any design of data services and operations.

We can illustrate this approach with three classic examples (a rough sketch of the resulting RPO/RTO targets follows the list):

  • Imagine a source-code server: the protection model must capture data frequently to avoid losing any line of code. As no direct revenue is attached to it, a long restore time is acceptable. That means a (very) short RPO and a flexible, possibly long, RTO.
  • The second example is a static web site that changes rarely, even if every change should be captured when it occurs. But because the company's visibility rides on its Internet presence, the web server must be restarted as soon as a failure occurs. That means a long RPO (days, weeks or even months) and a very short RTO.
  • And finally, a mix of the two for dynamic web sites and commerce-oriented applications with revenue attached: you don't want to lose any transaction or disappoint users, and you want to keep the business running in all circumstances. That means both a short RPO and a short RTO. With 50,000 web sites, OVH should have implemented this model to satisfy users and protect its image.
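
A minimal way to express these three profiles as explicit targets is sketched below; the figures are assumptions chosen for the illustration, not published OVH objectives.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class ProtectionProfile:
    workload: str
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated downtime

# Illustrative targets matching the three examples above.
profiles = [
    ProtectionProfile("source-code server", rpo=timedelta(minutes=5), rto=timedelta(hours=8)),
    ProtectionProfile("static web site", rpo=timedelta(weeks=1), rto=timedelta(minutes=5)),
    ProtectionProfile("e-commerce site", rpo=timedelta(seconds=0), rto=timedelta(minutes=5)),
]

for p in profiles:
    print(f"{p.workload:>20}: RPO <= {p.rpo}, RTO <= {p.rto}")
```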

At OVH, the architects designed a weak data protection mechanism and completely forgot about application availability. It seems that no application clustering, automatic or manual, was set up. A big mistake.
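
A real cluster manager does far more than this, but the core idea, detect that the active node is gone and redirect the service to a standby, can be sketched in a few lines. The host names below are placeholders, and a production setup would also handle fencing, quorum and virtual-IP moves.

```python
import socket
from contextlib import closing

def is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    # Basic TCP health check, the building block of any failover logic.
    try:
        with closing(socket.create_connection((host, port), timeout=timeout)):
            return True
    except OSError:
        return False

def pick_endpoint(primary: tuple, standby: tuple) -> tuple:
    # Keep using the primary while it answers; otherwise fail over.
    return primary if is_up(*primary) else standby

# Hypothetical endpoints for an application database.
active = pick_endpoint(("db-primary.example", 5432), ("db-standby.example", 5432))
print("routing traffic to", active)
```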

So this case raises several questions, among them:

  • Why was only one array, a single point of failure, used in this configuration?
  • Why was data protection so weak, with an RPO of 24 hours?
  • Why were technologies such as snapshots, replication and CDP not used, or badly used?
  • Why was no volume manager configured?
  • Why were applications not protected with an automatic failover mechanism to make downtime transparent? And,
  • Did this configuration change its role during its lifetime, hosting more and more critical applications, without the data protection procedure being updated?

We hope that OVH will re-examine all procedures related to data protection and application availability; as the company recently raised €400 million, a serious new plan should be an obvious step. Of course, we always hear the bad news and never the good, since failures become invisible when everything is properly designed. One final point: OVH could publish RPO/RTO objectives and actuals for every datacenter, service, application instance… It would help users decide which services to pick and subscribe to the ones they need and require.

We invite OVH and especially Octave Klaba, CTO, to demonstrate lessons learned from this downtime.
