Understanding vSAN Rebuilds and Repairs
Recently a customer shared a story about a routine patch upgrade that resulted in several production VMs becoming inaccessible. The only way to bring them back online was to take the host out of maintenance mode. This confused him: DRS had safely migrated the VMs to other hosts, so he assumed that nothing should depend on a host once it is in maintenance mode. This prompted a long discussion about how vSAN handles data accessibility and resiliency, as well as recommendations for using the vSAN maintenance mode options for planned downtime activities such as firmware upgrades, storage device replacement and software patches.
Fortunately, my colleague Jeff Hunter recently wrote a post detailing the vSAN Maintenance Mode Options, with recommendations on how and when to use each. I thought it would also be helpful to explain how vSAN handles data accessibility and resiliency through component rebuilds and repairs. To better understand how these rebuilds and repairs work, let's establish a few foundational concepts.
vSAN Objects and Component Placement
VMware vSAN is an object-based distributed storage system in which the physical storage devices on each ESXi host in a cluster contribute capacity to the vSAN storage system. Virtual machines that live on vSAN storage are composed of a number of storage objects. VMDKs, VM home namespaces, VM swap areas, snapshot delta disks, and snapshot memory maps are all examples of storage objects in vSAN.
Each object consists of one or more components. The number of components that make up an object depends primarily on the size of the object and the storage policy assigned to it. The maximum size of a component is 255GB. If an object is larger than 255GB, it is split into multiple components. For detailed information on vSAN objects and component placement, go to storagehub.vmware.com.
The object in Figure 1 is a 700GB VMDK. A few observations:
- Because the maximum size of components is 255GB it will take 3 components (C1, C2, C3) to make one full copy of the object.
- The object has a VM storage policy of RAID-1 (mirror) and FTT=1. This policy requires two replicas of the object on separate hosts and a witness component acting as a tie-breaker.
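The arithmetic behind these observations can be sketched in a few lines of Python. This is only a simplified model, not how vSAN itself computes placement: the function names are hypothetical, it assumes a RAID-1 policy with FTT+1 full replicas, and it leaves out witness components and any other policy setting that could add components.

```python
import math

MAX_COMPONENT_GB = 255  # maximum component size stated above


def components_per_replica(object_size_gb: float) -> int:
    """Components needed for one full copy of an object (at least one)."""
    return max(1, math.ceil(object_size_gb / MAX_COMPONENT_GB))


def raid1_data_components(object_size_gb: float, ftt: int) -> dict:
    """Simplified RAID-1 model: FTT+1 replicas, witness components not counted."""
    replicas = ftt + 1
    per_replica = components_per_replica(object_size_gb)
    return {
        "replicas": replicas,
        "components_per_replica": per_replica,
        "data_components": replicas * per_replica,
    }


# The 700GB VMDK from Figure 1 with FTT=1.
layout = raid1_data_components(700, ftt=1)
print(layout)
```

For the 700GB VMDK this gives three components per replica and two replicas, so six data components, with the witness on top of that.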
vSAN Object Component States
A component has four possible states:
- Active: Accessible
- Absent: Inaccessible with no error codes sensed (host or network outage, or maintenance mode with no data evacuation)
- Degraded: Inaccessible with error codes sensed (e.g. device failure). In this case the rebuild begins immediately
- Active-Stale: Sequence numbers of components not up to date (e.g. multiple host failures with one host coming back online)
In the story above, the customer had several data objects on his host:
- 2 objects with FTT=0
- 108 objects with FTT=1 (mirror)
When he put the host into maintenance mode and chose No Data Evacuation, he did not heed the “What-if” information. Because the components on that host became absent, the FTT=0 VMs were unable to tolerate the “failure” and were inaccessible until the host returned. The FTT=1 VMs were still accessible but non-compliant with their storage policy because they could not tolerate an additional failure.
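The outcome for the customer's objects can be reasoned through with a small sketch. It is a deliberately simplified model under stated assumptions: every object is treated as having FTT+1 replicas with exactly one replica on the host entering maintenance mode, and witness votes and quorum are ignored. The function name is hypothetical and is not part of any vSAN tool.

```python
from collections import Counter


def status_when_host_absent(ftt: int) -> str:
    """Assumed model: FTT+1 replicas, one of them on the absent host."""
    surviving_replicas = (ftt + 1) - 1
    if surviving_replicas == 0:
        return "inaccessible"
    # A copy survives, but no further failure can be tolerated.
    return "accessible, non-compliant"


# Object counts from the customer's host: {FTT value: number of objects}.
objects_by_ftt = {0: 2, 1: 108}

summary = Counter()
for ftt, count in objects_by_ftt.items():
    summary[status_when_host_absent(ftt)] += count

print(dict(summary))
```

Under these assumptions the two FTT=0 objects end up inaccessible and the 108 FTT=1 objects stay accessible but non-compliant, which matches what the customer saw.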
vSAN Rebuild Process
When vSAN components are offline they are marked “absent” and colored orange in the vSAN user interface. vSAN waits 60 minutes by default before starting the repair operation. vSAN has this delay as many issues are transient. In other words, vSAN expects absent components to be back online in a reasonable amount of time and we want to avoid copying large quantities of data unless it is necessary. An example is a host being temporarily offline due to an unplanned reboot.
vSAN will begin the repair process for absent components after 60 minutes to restore redundancy. For example, for an object such as a virtual disk (VMDK file) protected by a RAID-1 mirroring storage policy, vSAN creates a second mirror copy from the healthy copy. This process can take a considerable amount of time depending on how much data must be copied. In versions of vSAN prior to 6.6, the rebuild process continues even if the absent copy comes back online.
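The timing rules described so far (degraded components rebuild immediately, absent components after the default 60 minute delay) can be summarized in a short sketch. This models the decision only and is not vSAN code: the enum, the constant and the function name are all hypothetical, and the Active-Stale state is left out because its repair behavior is not covered here.

```python
from enum import Enum


class ComponentState(Enum):
    ACTIVE = "active"
    ABSENT = "absent"
    DEGRADED = "degraded"
    ACTIVE_STALE = "active-stale"


DEFAULT_REPAIR_DELAY_MIN = 60  # default wait for absent components


def repair_should_start(state: ComponentState,
                        minutes_unavailable: float,
                        delay_min: float = DEFAULT_REPAIR_DELAY_MIN) -> bool:
    """Return True if, in this simplified model, a repair would begin."""
    if state is ComponentState.DEGRADED:
        # Error codes were sensed, so there is no reason to wait.
        return True
    if state is ComponentState.ABSENT:
        # The issue may be transient, so wait out the delay first.
        return minutes_unavailable >= delay_min
    # ACTIVE needs no repair; ACTIVE_STALE is not modeled in this sketch.
    return False


print(repair_should_start(ComponentState.ABSENT, 15))
print(repair_should_start(ComponentState.ABSENT, 60))
print(repair_should_start(ComponentState.DEGRADED, 0))
```

A host that reboots and returns within 15 minutes never triggers a copy in this model, while a failed device does so right away.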
Repair Objects Immediately
There are some scenarios in which a host will be absent for longer than 60 minutes. The affected VMs are still accessible but are non-compliant with their storage policies. More importantly, in the case of FTT=1, vSAN will not be able to tolerate additional failures until a rebuild occurs. If this is the case, you may choose to repair the objects immediately. This option will resynchronize the absent objects on available hosts in the vSAN cluster.
vSAN Rebuilds and Repairs in vSAN 6.6
The purpose of any type of component resync or rebuild is to restore or ensure the level of resiliency defined for a given VMDK, VM, or collection of VMs. A number of improvements are included in vSAN 6.6 that assist in the rebuild process and placement of components such as:
- Intelligent rebuilds using enhanced rebalancing
- Intelligent rebuilds using smart, efficient repairs
- Intelligent rebuilds using partial repairs
- Resumable resyncs
vSAN’s ability to intelligently manage performance, efficiency, and availability of data stored on a cluster powered by vSAN is referred to as intelligent rebuilds. vSAN 6.6 has a number of improvements designed to offer more intelligent rebuilds, so that the cluster returns to normal operations and compliance quickly and automatically. For detailed information on all aspects of VMware vSAN, be sure to visit storagehub.vmware.com.
Many thanks to my colleague Pete Flecha who authored this article. The original version can be found here.