A recent paper from Microsoft Research, Azure and Johns Hopkins University discusses the trouble with grey failures. This is a type of cloud failure whose exact definition is still under some discussion; I treat it as follows:
A grey failure is a problem affecting one or more applications deployed on a platform, one that is detectable by those applications but has not been detected by the platform itself.
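Read as a predicate, the definition is just a disagreement between two observers. A minimal sketch in Python, with hypothetical health signals (none of these names come from the paper):

```python
from dataclasses import dataclass

# Hypothetical health signals; the names are illustrative.
@dataclass
class HealthView:
    app_sees_failure: bool       # the application's own probes report a problem
    platform_sees_failure: bool  # the platform's failure detector reports a problem

def is_grey_failure(view: HealthView) -> bool:
    # A grey failure is exactly the disagreement case: the application
    # observes a problem that the platform's detectors have not caught.
    return view.app_sees_failure and not view.platform_sees_failure
```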
This is not a new problem. From an application development point of view, developers can opt to consider the infrastructure volatile and simply jump ship – to a different data center, or even a different cloud provider – and hope the new host doesn’t share the problem. This is good practice: an application should never assume the platform is stable, because by doing so it inherits every problem the platform has. The issue, as it were, is for the infrastructure provider to discover the fault.
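A minimal sketch of that “jump ship” stance, assuming a caller-supplied `send` function and purely illustrative endpoints:

```python
import random

# Hypothetical endpoints; in practice these would be different data
# centers or even different cloud providers.
ENDPOINTS = [
    "https://eu-west.example.com",
    "https://us-east.example.com",
    "https://backup-provider.example.com",
]

def call_with_failover(request, send, max_attempts=3):
    """Treat the platform as volatile: on any failure, jump ship to
    another endpoint rather than retrying the same, possibly grey, host."""
    candidates = random.sample(ENDPOINTS, k=min(max_attempts, len(ENDPOINTS)))
    last_error = None
    for endpoint in candidates:
        try:
            return send(endpoint, request)
        except Exception as exc:  # real code would catch narrower errors
            last_error = exc      # hope the next host doesn't share the fault
    raise last_error
```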
The Microsoft paper brings up several data points showing the incidence of grey failures. The comparison they draw is between what the application (which, for them, is the OS inside the VM) observes – for example, high load – and when the infrastructure detects a problem. One example I found particularly interesting was the case of a malfunctioning network card.
At the VM level, by contrast, the authors note that when the VM is migrated, its network usage takes a slight dip. It never recovers to its “proper” utilisation, and the user eventually reboots the VM. After the reboot, the VM receives no traffic at all. The underlying cause isn’t described, but this is noted as a grey failure.
Here is where the challenge appears. Who is to say that the VM’s issue comes from the platform and not from a failure in the application – or that the application itself hasn’t chosen to take this node out of rotation for debugging? This is a common use case: blacklisting a VM to preserve its state so a human operator can inspect it.
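A sketch of that blacklisting pattern, with a hypothetical in-memory registry standing in for a real load-balancer API:

```python
# Hypothetical rotation registry; the point is that the node is removed
# from traffic but deliberately left running, so its state survives for
# a human operator to inspect.
in_rotation = {"vm-01", "vm-02", "vm-03"}
quarantined = set()

def blacklist_for_debugging(node: str) -> None:
    """Take a node out of rotation without terminating it."""
    in_rotation.discard(node)
    quarantined.add(node)
    # From the platform's side this looks much like a grey failure:
    # the VM is healthy by its own metrics, yet it serves no traffic.
```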
The answer (and, probably, the most widely used means of detecting grey failures) is anomaly detection: tracking the usual patterns of tenants and of the software running underneath, and reacting when they start behaving in unexpected ways. The key element in making it successful is efficiency. If the error detection needs more compute than the processes it is monitoring, it is not profitable, and we will continue to see random errors occur in our massively complex systems.
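As a sketch of what “efficient” can mean here, an exponentially weighted mean and variance give an O(1)-per-sample detector; the decay and threshold values below are illustrative, not from the paper:

```python
class CheapAnomalyDetector:
    """O(1) work and O(1) state per metric stream, so the detector stays
    far cheaper than the workload it watches."""

    def __init__(self, alpha: float = 0.05, threshold: float = 4.0):
        self.alpha = alpha          # decay rate for the moving statistics
        self.threshold = threshold  # how many deviations count as anomalous
        self.mean = 0.0             # warm-up behaviour is ignored in this sketch
        self.var = 1.0

    def observe(self, value: float) -> bool:
        """Return True if this sample deviates from the learned pattern."""
        deviation = value - self.mean
        anomalous = deviation * deviation > self.threshold ** 2 * self.var
        # Update the running statistics regardless of the verdict.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation * deviation)
        return anomalous
```

Feeding something like per-VM network counters through this costs a handful of floating-point operations per sample, which is the kind of budget that keeps the monitoring cheaper than the tenants it is watching.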