Hi everyone! For this blog entry I decided to go back in the lab and geek out. We had a client with some very specific constraints for their VMware environment, and after many discussions and some Googling I still did not have a firm answer to a very simple question: “What happens to a Windows VM when an NFS datastore goes away?” It is similar to another question: “What happens to a Windows Server when you pull out its hard drive while it is running?” I could not find a clear article with the answer, and the people I talked to were divided into three camps: “I don’t know”, “The VM will crash” and “The VM will keep running degraded”. So, which camp are you in?
First, some specifics (very important for this case, as you will see). This is about a VM (with VMware Tools installed) running Windows Server 2008 R2 Standard with SP1, hosted on VMware vSphere 5.1. The goal is to understand the failure at the VM level in order to trigger a recovery process.
The versions are very important for several reasons. Let’s start with vSphere. In older versions, when storage became unavailable the hostd process could hang while waiting for I/Os to complete. In the newer versions, when an All Paths Down (APD) event is detected, vSphere waits 140 seconds (the default) and then fast-fails the I/Os rather than keep trying; this is explained in detail here: http://cormachogan.com/2012/09/07/vsphere-5-1-storage-enhancements-part-4-all-paths-down-apd/. There is also some new detection around Permanent Device Loss (PDL), but it does not change our failure scenario. So overall, vSphere will report that the storage is gone and will keep humming along.
Now the Windows side. A lot of Windows behavior is driven by the “TimeoutValue” registry value (“HKLM\System\CurrentControlSet\Services\Disk\TimeoutValue”), which is clearly explained here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff563970(v=vs.85).aspx. This is how long, in seconds, Windows will wait for a disk to reply to an I/O request. The default is 10 and the recommendation for normal usage is 10-20, but your storage vendor may recommend other values. In this case the storage vendor is, for all intents and purposes, VMware, and their recommendation is 60 (http://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.storage.doc_50%2FGUID-EA1E1AAD-7130-457F-8894-70A63BD0623A.html), which VMware Tools sets automatically for you when you install it.
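If you want to check what a given VM is actually using, here is a minimal Python sketch that reads the value from the registry path quoted above. It relies on the standard-library `winreg` module, so it only does real work when run inside the Windows guest; the function names `read_disk_timeout` and `describe_timeout` are my own, and the comparison uses only the 60 second figure mentioned in this post.

```python
import sys

DISK_SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
VALUE_NAME = "TimeoutValue"


def read_disk_timeout():
    """Return TimeoutValue in seconds, or None if it cannot be read."""
    if sys.platform != "win32":
        return None  # winreg only exists on Windows
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DISK_SERVICE_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    except OSError:
        return None  # key or value missing, or access denied
    return int(value)


def describe_timeout(seconds):
    """Compare a timeout with the VMware figure quoted in this post."""
    if seconds is None:
        return "TimeoutValue not set or not readable"
    if seconds >= 60:
        return f"{seconds}s: at or above the 60s VMware recommendation"
    return f"{seconds}s: below the 60s VMware recommendation"


if __name__ == "__main__":
    print(describe_timeout(read_disk_timeout()))
```

Run it inside the guest after installing VMware Tools; if the value is missing the script says so rather than guessing a default.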
So now we know that after 60 seconds the OS will report our I/Os as failed, right? Not quite! Because of retries, among other things, this could go all the way to 8 minutes (very well explained here: http://blogs.msdn.com/b/san/archive/2011/09/01/the-windows-disk-timeout-value-understanding-why-this-should-be-set-to-a-small-value.aspx). So after 8 minutes our Windows Server will crash in flames, right? Nope. It is actually surprising to see how well Windows Server survives without its OS drive and its data drive.
My hypothesis, based on previous experience (we’ll keep that story for another day), was that the server would keep running degraded. To validate it, I recreated the scenario in my lab (it is currently running the vSphere 5.5 beta, but the behavior for Windows would be the same). I created a test VM (Windows Server 2008 R2 Standard, fully patched) with a C: drive for the OS and a D: drive for the data, installed HD_Speed (http://www.steelbytes.com/?mid=20) to drive some I/Os, opened an RDP session to keep an eye on it, and unplugged the IP storage. Very quickly, HD_Speed started reporting that my I/O speed was down to 0 and that I had some disk errors, but Task Manager was still reporting low CPU and memory usage and was still updating its graphs. I let the server run like this overnight and had a look in the morning. The server was still in the same state: it responded to pings, the RDP session was still open, and I was able to open Task Manager and look at the pretty graphs updating to tell me that the machine was not doing much.
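Going back to the 8 minute figure for a moment, the arithmetic is easy to sketch. The snippet below simply multiplies the timeout by a number of attempts. The attempt count of 8 is an assumption I picked because it reproduces the 8 minutes quoted above; it is not taken from the linked article, which describes the real retry logic.

```python
def worst_case_wait_seconds(timeout_value_seconds, attempts):
    """Upper bound if every attempt runs for the full timeout before failing."""
    return timeout_value_seconds * attempts


# ASSUMPTION: 8 attempts is a hypothetical figure picked to match the
# "8 minutes" quoted in the post; see the linked article for the real retry logic.
ASSUMED_ATTEMPTS = 8

for timeout_value in (10, 60):
    total = worst_case_wait_seconds(timeout_value, ASSUMED_ATTEMPTS)
    print(f"TimeoutValue={timeout_value}s -> up to {total}s ({total / 60:.1f} min)")
```

Under that assumption a TimeoutValue of 60 gives 480 seconds, which is the 8 minutes mentioned above, while a TimeoutValue of 10 would give 80 seconds.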
So, the conclusion: if your network storage array becomes unavailable, vSphere will handle the event gracefully (no hostd hang) and the guest OS in the VM will be on its own. In the case of Windows Server 2008 R2, it means that everything that is already in memory will stay there and keep “working” (the I/O subsystem will report that disk reads and writes are failing). Depending on your application mix it may eventually crash, but a basic server not doing much may never crash (in this case, an IP ping based monitoring solution would still not have detected the server failure after more than 12 hours).
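Since a ping is clearly not enough here, one option is to have the monitoring check the disk itself. Below is a rough Python sketch of such a probe: it writes a small file, forces it to disk, reads it back, and gives up after a timeout. The function name `disk_write_probe`, the 4 KB payload and the 30 second default are all my own choices for illustration and not something I tested in the lab above. Two caveats: the probe would need to live in a process that is already running when the storage disappears (per the above, only what is in memory keeps working), and the read-back may be served from cache, so the write and `fsync` are the meaningful part.

```python
import os
import threading
import uuid


def disk_write_probe(directory, timeout_seconds=30.0):
    """Return True if a small file can be written, synced and read back in time."""
    result = {"ok": False}

    def _probe():
        path = os.path.join(directory, f"probe-{uuid.uuid4().hex}.tmp")
        payload = os.urandom(4096)
        try:
            with open(path, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # push the write down to the disk
            with open(path, "rb") as handle:
                result["ok"] = handle.read() == payload
        except OSError:
            result["ok"] = False  # the I/O subsystem reported a failure
        finally:
            try:
                os.remove(path)
            except OSError:
                pass  # nothing to clean up, or the disk is gone

    # Run the I/O in a daemon thread so a hung disk cannot hang the monitor.
    worker = threading.Thread(target=_probe, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    return (not worker.is_alive()) and result["ok"]
```

A monitoring agent could call something like `disk_write_probe("D:\\")` every minute and raise an alert (or trigger the recovery process) when it returns False, whether the write failed outright or simply never came back.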