Server virtualisation has revolutionised the way we deploy systems in our data centres. It has sparked a gold rush along the way, creating outlandish wealth. It has also left destruction in its wake, and it may very well cause catastrophic failures in a lot of shops.
It is certainly going to raise blood pressure and cause hair loss among admins.
The good parts: Server virtualisation lets you bring a virtual machine online in minutes, gets you enormous utilisation gains on your hardware, and enables mobility unlike anything we’ve ever seen before.
The bad parts: Server virtualisation has created an on-demand expectation from the business that hasn’t been fully vetted yet. Mobility, specifically the ability of a virtual machine workload to migrate from host to host, creates massive performance problems, not only for that workload but for all the other workloads on the same physical kit. Worse yet, you can’t see it. You only find out when the phone rings and everyone is screaming at you.
The solution to date: Give up all the benefits and make sure your workloads don’t move. That eliminates any utilisation gains, leaves you actually managing more stuff (physical and virtual) instead of less, and costs you more money to boot.
The problem: Storage. Plain and simple. It’s a storage problem, my dear Watson. Specifically, it’s an utter lack of QoS (quality of service) in 99 percent of storage systems that simply were not built to be “adaptive” to changing requirements.
Storage is mainly a one-size-fits-all thing. It has X drives and Y controllers and can do Z I/O operations per second (IOPS). Period. That’s fine if X, Y and Z meet the requirements of whatever is thrown at it, but it’s hell if they don’t.
In the old world (say, 2011) we built a system (physical in production, virtual in the lab) that was “fixed” to an application, or a series of applications (i.e., workloads). Thus, we could test and deploy said workload and know exactly how it would perform – because it was fixed. Nothing changed.
In the new world order (December), we build workloads (Exchange, Oracle, etc.) inside a VM and plop them on a physical machine. We make sure the storage works. We treat it as if it were “fixed”. It works great! Then something happens. That workload moves elsewhere, or another workload suddenly appears next to it, and all of them start fighting for the storage’s I/O resources. The storage can only do so much, so it starts to arbitrate. The I/O performance of every workload starts to suffer, and your phone starts lighting up.
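To make that arbitration concrete, here is a minimal Python sketch of storage with no QoS. It assumes, purely for illustration, that an oversubscribed array throttles every workload by the same fraction; real arrays queue requests, so in practice the pain shows up as rising latency rather than a tidy split. The function name, workload names and IOPS figures are hypothetical.

```python
def proportional_share(capacity_iops, demands):
    """Hypothetical 'dumb' arbitration: when total demand exceeds the array's
    capacity, every workload is scaled back by the same factor, regardless
    of how important it is."""
    total = sum(demands.values())
    if total <= capacity_iops:
        return dict(demands)  # everyone gets what they asked for
    scale = capacity_iops / total
    return {name: round(demand * scale) for name, demand in demands.items()}

# A hypothetical array good for 10,000 IOPS (the "Z" of the storage).
before = proportional_share(10_000, {"exchange": 4_000, "oracle": 5_000})
after = proportional_share(10_000, {"exchange": 4_000, "oracle": 5_000, "newcomer": 6_000})
print(before)  # {'exchange': 4000, 'oracle': 5000}
print(after)   # {'exchange': 2667, 'oracle': 3333, 'newcomer': 4000}
```

Nothing about Exchange or Oracle changed when the newcomer arrived, yet both lost a third of their I/O, and the array has no idea which of them mattered more.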
Because storage is still largely “dumb”, the only way to solve the problem is to go back in time and build fixed systems – which is absurd, but true.
But there is a way out! Storage has to get smarter. Storage for transient workload operations cannot be architected like storage for fixed workload operations. You need a better mousetrap.
You need storage systems that can adapt, ideally in real time, to changing workload requirements. You need to guarantee performance, not hope for it. That’s how you get the phone to stop ringing.
If physical machine A, with seven workloads on it, dies and those seven workloads move elsewhere, someone is going to be upset unless your storage can A) handle the increased load and B) adapt its I/O profile to satisfy the more important apps first. Simple as that.
In a perfect world, all you would have to do is make sure all of your storage is flash. With no disks anywhere, you probably wouldn’t have a real problem (although, even then, I could show you how it would screw up without QoS). Unless you are the US government printing money at will, you probably can’t afford to do that. You will have disks and conventional storage systems, and therefore you will inevitably have these problems.
If you run server virtualisation with multiple, mixed workloads, you will have this problem.
Thus, from now on, start looking for smarter storage. You need storage that can A) handle a high I/O load (much higher than what you need for normal operations), B) distinguish between workloads by importance, and C) guarantee the outcome. You need QoS.
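Those three requirements can be sketched as a simple two-pass allocator: first honour each workload’s guaranteed minimum in priority order, then hand any spare capacity to the most important workloads first. This is a hypothetical illustration, not any vendor’s implementation; `Workload`, `qos_allocate`, `can_guarantee`, the priority convention and the figures are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    priority: int     # assumption: lower number = more important
    min_iops: int     # floor the array has promised this workload
    demand_iops: int  # what the workload is asking for right now

def qos_allocate(capacity_iops, workloads):
    """Hypothetical QoS: guarantee minimums first, then share out the rest by priority."""
    ordered = sorted(workloads, key=lambda w: w.priority)
    alloc = {}
    remaining = capacity_iops
    # Pass 1: honour guaranteed minimums, most important workload first.
    for w in ordered:
        grant = min(w.min_iops, w.demand_iops, remaining)
        alloc[w.name] = grant
        remaining -= grant
    # Pass 2: give leftover capacity to unmet demand, still in priority order.
    for w in ordered:
        extra = min(w.demand_iops - alloc[w.name], remaining)
        alloc[w.name] += extra
        remaining -= extra
    return alloc

def can_guarantee(capacity_iops, workloads):
    """Admission check: the promised minimums must fit within the array's capacity."""
    return sum(w.min_iops for w in workloads) <= capacity_iops

# Hypothetical post-failover mix on a 10,000 IOPS array.
workloads = [
    Workload("test-vm", priority=2, min_iops=0, demand_iops=4_000),
    Workload("oracle", priority=0, min_iops=4_000, demand_iops=6_000),
    Workload("exchange", priority=1, min_iops=3_000, demand_iops=5_000),
]
result = qos_allocate(10_000, workloads)
print(result)                             # {'oracle': 6000, 'exchange': 4000, 'test-vm': 0}
print(can_guarantee(10_000, workloads))   # True
```

Oracle gets everything it asked for, Exchange gets its floor plus whatever is left, and the test VM waits. A plain proportional split of the same 15,000 IOPS of demand would have cut Oracle to 4,000 IOPS. `can_guarantee` is the check behind requirement C: if the promised minimums add up to more than the array can deliver, no scheduler can keep the promise, which is why requirement A (headroom) matters.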
This is a fantastic new problem, by the way. We haven’t had I/O problems like this for 20 years or more, because all storage systems were “good enough” in the fixed-workload world. Mobility changes everything. By giving yourself awesome utilisation on your physical servers, you end up having to accept terrible utilisation on your storage (which, lest you forget, is way more expensive than your servers) so that performance still meets minimum requirements if a bunch of workloads move over to you. It’s a conundrum.
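The arithmetic behind that conundrum is simple. Here is a minimal sketch, assuming an array is sized so that it can absorb its worst-case influx of migrated workloads on top of its normal load; the function name and the IOPS figures are hypothetical.

```python
def steady_state_utilisation(normal_iops, worst_case_inbound_iops):
    """Share of provisioned IOPS used on a normal day when the array is sized
    for its normal load plus the worst-case influx of migrated VMs."""
    provisioned = normal_iops + worst_case_inbound_iops
    return normal_iops / provisioned

# Hypothetical array: 20,000 IOPS of everyday load, and up to 30,000 IOPS
# more could land on it if a neighbouring host fails.
u = steady_state_utilisation(20_000, 30_000)
print(f"{u:.0%}")  # 40%
```

In this example, 60 percent of an expensive array sits idle on a normal day, purely as insurance.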
If your storage were smart and self-optimising, using the right combination of intelligent caching, flash and disk, and could deliver QoS at a granular enough level, your storage and your virtualised servers would be perfectly aligned.
Hey, industry, go build that.