See also: http://www.yellow-bricks.com/vmware-high-availability-deepdiv/
HA Deepdive
My posts on VMware High Availability (HA) have always been my best read articles. I aggregated all articles into a page which is easier to maintain when functionality changes and a lot easier to find via Google and my menu. This section is always under construction so come back every once in a while! At the bottom you can find a small change log.
Everybody probably knows the basics of VMware HA (or vSphere HA as we like to call it today) so I’m not going to explain how to set it up or that it uses a heartbeat for monitoring outages or isolation. If you want to know more about HA and DRS I recommend reading our HA and DRS technical deepdive book. Keep in mind that a lot of this info is derived from the availability guide, VMworld presentations and by diving into HA in my lab. The books can be found here:
- vSphere 5.1 Clustering Deepdive – Paper or eBook (kindle)
- vSphere 5.0 Clustering Deepdive – Paper or eBook (kindle)
- vSphere 4.1 Clustering Deepdive – Paper or eBook (kindle)
This article has been split up into vSphere 5.0 and prior as there are concepts which are specific to a version.
vSphere 5.0 specifics (FDM)
- The basics
- Master versus Slave
- Heartbeats
- Isolated versus Partitioned
- Restarting VMs
- Isolation Response and Detection
vSphere 4.1 specifics (AAM)
Version Independent
- Admission control
- Host failures
- Percentage of cluster resources reserved
- Specify a failover host
- My Admission Control Policy Recommendation
- Flattening shares
- Advanced settings
vSphere 5.0 specifics (FDM)
vSphere 5.0 uses an agent called “FDM” aka Fault Domain Manager. The following sections are all specific to vSphere 5.0 and higher. Specific version numbers are called out where and when applicable.
The Basics
With vSphere 5.0 comes a new HA architecture. HA has been rewritten from the ground up to shed some of the constraints that were enforced by AAM. HA as part of 5.0, also referred to as FDM (Fault Domain Manager), introduces less complexity and higher resiliency. From a UI perspective not a lot has changed, but a lot has changed under the covers: there is no longer a primary/secondary node concept, but a master/slave concept with an automated election process. Anyway, too much detail for now, I will come back to that later. Let's start with the basics first. As mentioned, the complete agent has been rewritten and the dependency on VPXA has been removed. HA talks directly to hostd, instead of using a translator to talk to VPXA as was the case with vSphere 4.1 and prior. This excellent diagram by Frank Denneman, also used in our book, demonstrates how things are connected as of vSphere 5.0.
The main point here though is that the FDM agent also communicates with vCenter and vCenter with the FDM agent. As of vSphere 5.0, HA leverages vCenter to retrieve information about the status of virtual machines and vCenter is used to display the protection status of virtual machines. On top of that, vCenter is responsible for the protection and unprotection of virtual machines. This not only applies to user initiated power-offs or power-ons of virtual machines, but also in the case where an ESXi host is disconnected from vCenter at which point vCenter will request the master HA agent to unprotect the affected virtual machines. What protection actually means will be discussed in a follow-up article. One thing I do want to point out is that if vCenter is unavailable, it will not be possible to make changes to the configuration of the cluster. vCenter is the source of truth for the set of virtual machines that are protected, the cluster configuration, the virtual machine-to-host compatibility information, and the host membership. So, while HA, by design, will respond to failures without vCenter, HA relies on vCenter to be available to configure or monitor the cluster.
Without diving straight into the deep end, I want to point out two minor changes but huge improvements when it comes to managing/troubleshooting HA:
- No dependency on DNS
- Syslog functionality
As of 5.0 HA is no longer dependent on DNS, as it works with IP addresses only. This is one of the major improvements that FDM brings. This also means that the character limit that HA imposed on the hostname has been lifted. (Pre-vSphere 5.0, hostnames were limited to 26 characters.) Another major change is the fact that the HA log files are part of the normal log functionality ESXi offers, which means that you can find the log file in /var/log and it is picked up by syslog!
Note that if you add ESX/ESXi 4.1 or prior hosts to a vSphere 5.0 vCenter Server the new vSphere HA Agent (FDM) will be installed in conjunction with the vCenter Agent.
Master versus Slave
As mentioned, vSphere High Availability has been completely overhauled… This means some of the historical constraints have been lifted, and that means you can, should, or might need to change your design or implementation.
What I will discuss in this section are the changes around the Primary/Secondary node concept that was part of HA prior to vSphere 5.0. This concept basically limited you in certain ways… For those new to VMware/vSphere, in the past there was a limit of 5 primary nodes. As a primary node was a requirement to restart virtual machines, you always wanted to have at least 1 primary node available. As you can imagine this added some constraints around your cluster design when it came to blade environments or geo-dispersed clusters.
vSphere 5.0 has completely lifted these constraints. Do you have a blade environment and want to run 32 hosts in a cluster? You can now, as the whole Primary/Secondary node concept has been deprecated. HA uses a new mechanism called the Master/Slave node concept. This concept is fairly straightforward. One of the nodes in your cluster becomes the Master and the rest become Slaves. I guess some of you will have the question “but what if this master node fails?”. Well, it is very simple: when the master node fails an election process is initiated and one of the slave nodes will be promoted to master and pick up where the master left off. On top of that, let's take the example of a geo-dispersed cluster: when the cluster is split into two sites due to a link failure, each “partition” will get its own master. This allows for workloads to be restarted even in a geographically dispersed cluster when the network has failed….
What is this master responsible for? Well basically all the tasks that the primary nodes used to have like:
- restarting failed virtual machines
- exchanging state with vCenter
- monitoring the state of slaves
As mentioned, when a master fails an election process is initiated. The HA master election takes roughly 15 seconds. The election process is simple but robust. The host that is participating in the election with the greatest number of connected datastores will be elected master. If two or more hosts have the same number of datastores connected, the one with the highest Managed Object Id will be chosen. This however is done lexically, meaning that 99 beats 100 as 9 is larger than 1. That is a huge improvement compared to what it was like in 4.1 and prior, isn’t it?
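To make that lexical comparison concrete, here is a tiny PowerShell illustration (purely illustrative, the HA agent obviously does not run PowerShell): when Managed Object Ids are compared as strings, “99” sorts after “100”.
'99' -gt '100'      # True  - compared lexically, "9" sorts after "1"
99 -gt 100          # False - compared numerically, 99 is of course smaller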
For those wondering which host won the election and became the master, go to the summary tab of your cluster and click “Cluster Status”.
The question remains how the master knows if a host is still alive or not. Well it uses heartbeats in order to figure that out!
Heartbeating
vSphere 5.0 uses two different heartbeat mechanisms. The first one is a network heartbeat mechanism. Each slave will send a heartbeat to its master and the master sends a heartbeat to each of the slaves. These heartbeats are sent by default every second. When a slave isn’t receiving any heartbeats from the master, it will try to determine whether it is Isolated or whether the master is isolated or has failed. We will discuss these “states” in more detail later.
Network heartbeats are a familiar concept; something that is new and has been introduced with vSphere 5.0 is Datastore Heartbeating.
Those familiar with HA prior to vSphere 5.0 probably know that virtual machine restarts were always initiated, even if only the management network of the host was isolated and the virtual machines were still running. As you can imagine, this added an unnecessary level of stress to the host. This has been mitigated by the introduction of the datastore heartbeating mechanism. Datastore heartbeating adds a new level of resiliency and allows HA to make a distinction between a failed host and an isolated / partitioned host.
Datastore heartbeating enables a master to more correctly determine the state of a host that is not reachable via the management network. The new datastore heartbeat mechanism is only used in case the master has lost network connectivity with the slaves, to validate whether the host has failed or is merely isolated/network partitioned. As shown in the screenshot above, two datastores are automatically selected by vCenter. You can rule out specific volumes if and when required or even make the selection yourself. I would however recommend letting vCenter decide.
As mentioned by default it will select two datastores. It is possible however to configure an advanced setting (das.heartbeatDsPerHost) to allow for more datastores for datastore heartbeating. I can imagine this is something that you would do when you have multiple storage devices and want to pick a datastore from each, but generally speaking I would not recommend configuring this option as the default should be sufficient for most scenarios.
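If you do want to increase the number of heartbeat datastores, here is a minimal PowerCLI sketch; it assumes an existing Connect-VIServer session and uses a placeholder cluster name, so adjust both to your environment.
$cluster = Get-Cluster -Name "Cluster01"                 # placeholder cluster name
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.heartbeatDsPerHost" -Value 4 -Confirm:$false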
How does this heartbeating mechanism work? HA leverages the existing VMFS filesystem locking mechanism. The locking mechanism uses a so-called “heartbeat region” which is updated as long as the lock on a file exists. In order to update a datastore heartbeat region, a host needs to have at least one open file on the volume. HA ensures there is at least one file open on this volume by creating a file specifically for datastore heartbeating. In other words, a per-host file is created on the designated heartbeating datastores, as shown in the screenshot below. HA will simply check whether the heartbeat region has been updated.
If you are curious which datastores have been selected for heartbeating, just go to the summary tab of your cluster and click “Cluster Status”; the third tab, “Heartbeat Datastores”, will reveal it.
Isolated vs Partitioned
As this is a change in behavior I do want to discuss the difference between an Isolation and a Partition. First of all, a host is considered to be either Isolated or Partitioned when it loses network access to a master but has not failed. Before we explain the difference between the states and the associated criteria below, here’s a short explanation of what an “isolation address” is:
“IP address the ESX host uses to check on isolation when no heartbeats are received, where [x] = 1‐10. (see screenshot below for an example) VMware HA will use the default gateway as an isolation address and the provided value as an additional checkpoint. I recommend adding an isolation address when a secondary service console is being used for redundancy purposes.”
As mentioned there are two different states:
- Isolated
- Is not receiving heartbeats from the master
- Is not receiving any election traffic
- Cannot ping the isolation address
- Partitioned
- Is not receiving heartbeats from the master
- Is receiving election traffic
- (at some point a new master will be elected, at which point the state will be reported to vCenter)
In the case of an Isolation, a host is separated from the master and the virtual machines running on it might be restarted, depending on the selected isolation response and the availability of a master. It could occur that multiple hosts are fully isolated at the same time. When multiple hosts are isolated but can still communicate amongst each other over the management networks, it is called a network partition. When a network partition exists, a master election process will be issued so that a host failure or network isolation within this partition will result in appropriate action on the impacted virtual machine(s).
With vSphere 5.1 we re-introduce the advanced option called “das.failuredetectiontime”… well, not exactly, but a similar concept with a different name. This new advanced setting named “das.config.fdm.isolationPolicyDelaySec” will allow you to extend the time it takes before the isolation response is triggered! By default the isolation response is triggered after ~30 seconds with vSphere 5.x. If you have a requirement to increase this then this new advanced setting can be used. Generally speaking I would not recommend using it though, as it only increases “downtime” when a problem has occurred.
Restarting VMs
Looking from the outside, the way HA in 5.0 behaves might not seem any different, but it is. I will call out some changes with regards to how VM restarts are handled but would like to refer you to our book for the in-depth details. These are the things I want to point out:
- Restart priority changes
- Restart retry changes
- Isolation response and detection changes
Restart priority changes
First thing I want to point out is a change in the way the VMs are prioritized for restarts. I have listed the full order in which virtual machines will be restarted below:
- Agent virtual machines
- FT secondary virtual machines
- Virtual Machines configured with a high restart priority
- Virtual Machines configured with a medium restart priority
- Virtual Machines configured with a low restart priority
So what are these Agent VMs? Well, these are VMs that provide a service, like virus scanning, or edge services such as those vShield provides. FT secondary virtual machines make sense I guess, and so does the rest of the list. Keep in mind though that if the restart of one of them fails, HA will continue restarting the remaining virtual machines.
Restart retry changes
I explained how the restart retries worked for 4.1 in the past: basically the total number of restart tries would be 6 by default, that is 1 initial restart and 5 retries as defined with “das.maxvmrestartcount”. With 5.0 this behavior has changed and the maximum number of restart attempts is 5 in total. Although it might seem like a minor change, it is important to realize. The timeline has also slightly changed and this is what it looks like with 5.0:
- T0 – Initial Restart
- T2m – Restart retry 1
- T6m – Restart retry 2
- T14m – Restart retry 3
- T30m – Restart retry 4
The “m” stands for minutes, and it should be noted that the next retry will happen the listed number of minutes after the master has detected that the previous restart attempt has failed. So in the case of T0 and T2m it could actually be that the retry happens after 2 minutes and 10 seconds.
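The listed times are consistent with a wait that starts at 2 minutes and doubles after every failed retry. A small PowerShell sketch of that pattern (my own illustration of the timeline above, not an official formula):
$t = 0; $wait = 2
"T0  - initial restart"
foreach ($retry in 1..4) {
    $t += $wait
    "T{0}m - restart retry {1}" -f $t, $retry
    $wait *= 2          # the wait between attempts doubles every time
}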
All hosts down?
But what happens when all hosts in a cluster fail simultaneously? Let’s describe this scenario and include how HA responds:
- Power Outage, all hosts down
- Power for hosts returns
- As soon as the first host has booted an election process will be kicked off and a master will be elected
- Master reads protected list which contains all VMs which are protected by HA
- Master initiates restarts for those VMs which are listed as protected but not running
Now the one thing I want to point out is that with vSphere 5.0 we will also track whether the VM was cleanly powered off, as in initiated by the admin, or powered off due to a failure/isolation. In case they are cleanly powered off they will not be restarted, but in this scenario of course they are not cleanly powered off and as such the VMs will be powered on. The great thing about vSphere 5.0 is that you no longer need to know which hosts were your primary nodes so you can power these on first to ensure quick recovery… No, you can power on any host and HA will sort it out for you!
Isolation response and detection changes
Another major change was made to the Isolation Response and Isolation Detection mechanism. Again, from the outside it looks like not much has changed, but actually a lot has, and I will try to keep it simple and explain what has changed and why this is important to realize. First thing is the deprecation of “das.failuredetectiontime”. I know many of you used this advanced setting to tweak when the host would trigger the isolation response; that is no longer possible and, to be honest, no longer needed. If you’ve closely read my other articles you hopefully picked up on the datastore heartbeating part already, which is one reason for not needing this anymore. The other reason is that before the isolation response is triggered the host will actually validate if virtual machines can be restarted and if it isn’t an all-out network outage. Most of us have been there at some point: a network admin decides to upgrade the switches and all hosts trigger the isolation response at the same time… well, that won’t happen anymore! One thing that has changed because of that is the time it takes before a restart will be initiated.
The isolation detection mechanism has changed substantially since previous versions of vSphere. The following timeline is the timeline for a vSphere 5.0 host:
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated and “triggers” isolation response
For a vSphere 5.1 host this timeline slightly differs due to the insertion of a minimum 30s delay after the host declares itself isolated before it applies the configured isolation response. This delay can be increased using the advanced option das.config.fdm.isolationPolicyDelaySec.
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated
- T60s – Slave “triggers” isolation response
Or as Frank would say euuuh show:
After the completion of this sequence, the (new) master will learn the host was isolated and will restart virtual machines based on the information provided by the slave.
If a master is isolated in vSphere 5.1 the timeline will look as follows:
- T0 – Isolation of the host (master)
- T0 – Master pings “isolation addresses”
- T5s – Master declares itself isolated
- T35s – Master “triggers” isolation response
As shown there is a clear difference, and of course the reason for it is the fact that when the master isolates there is no need to trigger an election process, which is needed in the case of a slave to detect whether it is isolated or partitioned. Once again, before the isolation response is triggered the host will validate if a host will be capable of restarting the virtual machines… no need to incur downtime when it is unnecessary.
The vSphere 5.0 Clustering Deepdive book contains more details than this so if you are interested pick it up.
vSphere 4.1 specifics (AAM)
vSphere 4.1 and prior uses an agent called “AAM” aka Legato Automated Availability Management. The following sections are all specific to vSphere 4.1 or prior. Specific version numbers are called out where and when applicable.
Node Types
A VMware HA Cluster consists of nodes, primary and secondary nodes. Primary nodes hold cluster settings and all “node states” which are synchronized between primaries. Node states hold for instance resource usage information. In case that vCenter is not available the primary nodes will have a rough estimate of the resource occupation and can take this into account when a fail-over needs to occur. Secondary nodes send their state info to the primary nodes.
Nodes send a heartbeat to each other, which is the mechanism to detect possible outages. Primary nodes send heartbeats to primary nodes and secondary nodes. Secondary nodes send their heartbeats to primary nodes only. Nodes send out these heartbeats every second by default. However this is a changeable value: das.failuredetectioninterval. (Advanced Settings on your HA-Cluster)
The first 5 hosts that join the VMware HA cluster are automatically selected as primary nodes. All the others are automatically selected as secondary nodes. When you do a reconfigure for HA the primary nodes and secondary nodes are selected again, this is at random. The vCenter client does not show which host is a primary and which is not. This however can be revealed from the Service Console:
cat /var/log/vmware/aam/aam_config_util_listnodes.log
Another method of showing the primary nodes (including the Master Primary aka failover coordinator) is shown in the following screenshot where /opt/vmware/aam/bin/Cli is used:
Of course this is also possible with the latest version of PowerCLI:
Get-Cluster clustername | Get-HAPrimaryVMHost
As of vSphere 4.1 a third option has been added. It must be noted that this option will only show results when there is an error. If an error occurs you can easily check what the issue is by going to your cluster and clicking the “Cluster Operational Issues” line on the Summary tab. If there are no issues the screen will be completely gray. I forced an issue though so you can see what is shown.
Now that you’ve seen that it is possible to list all nodes with the CLI, you probably wonder what else is possible… Let's start with a warning: this is not supported. Also keep in mind that the supported limit of primaries is 5, I repeat 5. This is a soft limit, so you can manually add a 6th, but this is not supported. Now here’s the magic…
Using Cli you can also promote nodes from secondary to primary and vice versa. This is shown in the following screenshots:
To promote a node:
To demote a node:
I can’t say this enough, it is unsupported but it does work. With vSphere 4.1 a new advanced setting has been introduced. This setting is not even experimental, it is also unsupported. I don’t recommend anyone using this in a production environment, if you do want to play around with it use your test environment. Here it is:
das.preferredPrimaries = hostname1 hostname2 hostname3
or
das.preferredPrimaries = 192.168.1.1,192.168.1.2,192.168.1.3
The list of hosts that are preferred as primary can either be space or comma separated. You don’t need to specify 5 hosts, you can specify any number of hosts. If you specify 5 and all 5 are available they will be the primary nodes in your cluster. If you specify more than 5, the first 5 of your list will become primary.
When primaries are selected by using “promoteNode” or by powering up hosts in the right order you will need to verify occasionally if your hosts are still primary or not as HA has a re-election mechanism.
Election time!
Now when does a re-election occur? It is a common misconception that a re-election occurs when a primary node fails. This is not the case. The promotion of a secondary host only occurs when a primary host is either put in “Maintenance Mode”, disconnected from the cluster, removed from the cluster or when you do a reconfigure for HA.
If all primary hosts fail simultaneously no HA initiated restart of the VMs will take place. HA needs at least one primary host to restart VMs. This is why you can only take four host failures into account when configuring the “host failures” HA admission control policy. (Remember 5 primaries…)
However, when you select the “Percentage” admission control policy you can set it to 50% even when you have 32 hosts in a cluster. That means that the amount of failover capacity being reserved equals 16 hosts.
Although this is fully supported, there is a caveat of course. The number of primary nodes is still limited to five. Even if you have the ability to reserve over 5 hosts as spare capacity, that does not guarantee a restart. If, for whatever reason, half of your 32-host cluster fails and those 5 primaries happen to be part of the failed hosts, your VMs will not restart. (One of the primary nodes coordinates the fail-over!) Although the “percentage” option enables you to save additional spare capacity, there’s always the chance all primaries fail.
Basic design principle: In blade environments, divide hosts over all blade chassis and never exceed four hosts per chassis to avoid having all primary nodes in a single chassis.
Node Role
You will need at least one primary because the “fail-over coordinator” role will be assigned to this primary; this role is also described as “active primary” or “master primary”. I will use “fail-over coordinator” for now, but if you use the HA Cli it is listed as the “Master Primary”. The fail-over coordinator coordinates the restart of VMs on the remaining primary and secondary hosts. The coordinator takes restart priorities into account. Keep in mind, when two hosts fail at the same time it will handle the restarts sequentially. In other words, restart the VMs of the first failed host (taking restart priorities into account) and then restart the VMs of the host that failed second (again taking restart priorities into account). If the fail-over coordinator fails, one of the other primaries will take over.
Isolation Response
Talking about HA initiated fail-overs; one of the settings everyone has looked into is the “isolation response”. The isolation response refers to the action that HA takes when the heartbeat network is isolated. Today there are three isolation responses, “power off”, “leave powered on” and “shut down”.
Up to ESX 3.5 U2 / vCenter 2.5U2 the default isolation response when creating a new cluster was “Power off”. As of ESX 3.5 U3 / vCenter 2.5 U3 the default isolation response is “leave powered on”. For vSphere ESX / vCenter 4.0 this has been changed to “Shut down”. Keep this in mind when installing a new environment, you might want to change the default depending on customer requirements.
Power off – When a network isolation occurs all VMs are powered off. It is a hard stop.
Shut down – When a network isolation occurs all VMs running on that host are shut down via VMware Tools. If this is not successful within 5 minutes a “power off” will be executed.
Leave powered on – When a network isolation occurs on the host the state of the VMs remains unchanged.
The question remains, which setting should I use? It depends. I personally prefer “Shut down” because I do not want to use a possibly degraded host and it will shut down your VMs in a clean manner. Many people prefer to use “Leave powered on” because it reduces the chances of a false positive. A false positive in this case is an isolated heartbeat network but a non-isolated VM network and a non-isolated iSCSI / NFS network.
I guess most of you would like to know how HA knows if the host is isolated or completely unavailable when you have selected “leave powered on”.
HA actually does not know the difference. HA will try to restart the affected VMs in both cases. When the host has failed a restart will take place, but if a host is merely isolated the non-isolated hosts will not be able to restart the affected VMs. This is because of the VMDK file lock; no other host will be able to boot a VM when the files are locked. When a host fails, this lock expires and a restart can occur.
The amount of retries is configurable as of vCenter 2.5 U4 with the advanced option “das.maxvmrestartcount”. The default value is 5. Pre vCenter 2.5 U4 HA would keep retrying forever which could lead to serious problems as described in the KB article.
The isolation response is a setting that needs to be taken into account when you create your design. For instance when using an iSCSI array or NFS choosing “leave powered on” as your default isolation response might lead to a split-brain situation depending on the version of ESX used. The reason for this being that the disk lock times out if the iSCSI network is also unavailable. In this case the VM is being restarted on a different host while it is not being powered off on the original host. In a normal situation this should not lead to problems as the VM is restarted and the host on which it runs owns the lock on the VMDK, but for some weird reason when disaster strikes you will not end up in a normal situation but you might end up in an exceptional situation. Do you want to take that risk?
As of vSphere 4 Update 2 a new mechanism has been introduced which will recover VMs that are in a split brain situation. First let me explain what a split brain scenario is; let's start by describing the situation which is most commonly encountered:
- 4 Hosts
- iSCSI / NFS based storage
- Isolation response: leave powered on
When one of the hosts is completely isolated, including the Storage Network, the following will happen:
- Host ESX001 is completely isolated, including the storage network (remember, iSCSI/NFS based storage!), but the VMs will not be powered off because the isolation response is set to “leave powered on”.
- After 15 seconds the remaining, non-isolated, hosts will try to restart the VMs.
- Because of the fact that the iSCSI/NFS network is also isolated the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs.
- When ESX001 returns from isolation it will still have the VMX Processes running in memory and this is when you will see a “ping-pong” effect within vCenter, in other words VMs flipping back and forth between ESX001 and any of the other hosts.
As of version 4.0 Update 2 ESX(i) detects that the lock on the VMDK has been lost and issues a question which is automatically answered. The VM will be powered off to recover from the split-brain scenario and to avoid the ping-pong effect. The following screenshot shows the event that HA will generate for this auto-answer mechanism which is viewable within vCenter.
Basic design principle 1: Isolation response should be chosen based on the version of ESX used. For pre-vSphere 4 Update 2 environments with iSCSI/NFS storage I recommend setting the isolation response to “Power off” to avoid a possible split brain scenario. I also recommend having a secondary service console running on the same vSwitch as the iSCSI network to detect an iSCSI outage and avoid false positives.
Basic design principle 2: Base your isolation response on your SLA. If your SLA dictates that hosts with degraded hardware should not be used, make sure to select shutdown or power off.
Isolation response gotchas
I thought this issue was something that was common knowledge but a recent blog article by Mike Laverick proved me wrong. I think I can safely assume that if Mike doesn’t know this it’s not common knowledge.
The default value for isolation/failure detection is 15 seconds. In other words the failed or isolated host will be declared dead by the other hosts in the HA cluster on the fifteenth second and a restart will be initiated by one of the primary hosts.
For now let’s assume the isolation response is “power off”. The “power off” (isolation response) will be initiated by the isolated host 1 second before the das.failuredetectiontime. A “power off” will be initiated on the fourteenth second and a restart will be initiated on the fifteenth second.
Does this mean that you can end up with your VMs being down and HA not restarting them?
Yes, when the heartbeat returns between the 14th and 15th second the “power off” could already have been initiated. The restart however will not be initiated because the heartbeat indicates that the host is not isolated anymore.
How can you avoid this?
Pick “Leave VM powered on” as an isolation response. Increasing the das.failuredetectiontime will also decrease the chances of running into issues like these.
Basic design principle: Increase “das.failuredetectiontime” to 30 seconds (30000) to decrease the likelihood of a false positive.
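For those who prefer scripting it, here is a hedged PowerCLI sketch of that design principle; it assumes an existing Connect-VIServer session, a PowerCLI version that ships New-AdvancedSetting, and a placeholder cluster name.
$cluster = Get-Cluster -Name "Cluster01"                 # placeholder cluster name
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.failuredetectiontime" -Value 30000 -Confirm:$false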
Admission control
This has always been a hot topic, HA and slot sizes/admission control. Let's start with the basics before we dive deep.
What’s HA admission control about? Why is it there? The “Availability Guide” states the following:
vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected.
Admission Control ensures available capacity for HA initiated fail-overs by reserving capacity. Keep in mind that admission control calculates the capacity required for a fail-over based on available resources. In other words, if a host is placed into maintenance mode, or disconnected for that matter, it is taken out of the equation. The same goes for DPM: if admission control is set to strict, DPM will in no way violate availability constraints. It should be noted that Admission Control is a function of vCenter! When HA initiates a restart this happens on a host level and as such HA will not be restricted by its own Admission Control.
To calculate available resources and needed resources for a fail-over HA uses different concepts based on the chosen admission control policy.
Currently there are three admission control policies:
- Host failures cluster tolerates
- Percentage of cluster resources reserved as failover spare capacity
- Specify a failover host
As stated each of these uses a different mechanism for reserving resources for a failover. Host failures uses a mechanism called “slots”. Slots dictate how many VMs can be started up before vCenter starts yelling “Out Of Resources”!! Normally each slot represents one VM.
Percentage based admission control uses a more flexible mechanism. It accumulates all reservations and subtracts them from the total amount of available resources, while making sure the specified spare capacity is always available.
A failover host doesn’t use any of those mechanisms; this host is dedicated to failover and will not be used during normal operations.
All three policies and concepts will be explained in-depth below.
Host failures
Now what happens if you set the number of allowed host failures to 1?
The host with the most slots will be taken out of the equation. (Slots are explained in more detail below) If you have 8 hosts with 90 slots in total but 7 hosts each have 10 slots and one host 20 this single host will not be taken into account. Worst case scenario! In other words the 7 hosts should be able to provide enough resources for the cluster when a failure of the “20 slot” host occurs.
And of course if you set it to 2 the next host that will be taken out of the equation is the host with the second most slots and so on.
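A rough PowerShell sketch of that worst-case logic; the slot counts below are the hypothetical ones from the example above, not output of any VMware tool.
$slotsPerHost = 10,10,10,10,10,10,10,20            # eight hosts, 90 slots in total
$hostFailures = 1
($slotsPerHost | Sort-Object -Descending | Select-Object -Skip $hostFailures | Measure-Object -Sum).Sum   # 70 slots remain for admission control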
One thing worth mentioning: as Chad stated, with vCenter 2.5 the number of vCPUs for any given VM was also taken into account. This led to a very conservative and restrictive admission control. This behavior has been modified with vCenter 2.5 U2; the number of vCPUs is no longer taken into account.
Basic design principle: Think about “maintenance mode”. If a single host needs maintenance it will be taken out of the equation and this means you might not be able to boot up new VMs when admission control is set to strict.
What is a Slot?
A slot is a logical representation of the memory and CPU resources that satisfy the requirements for any powered-on virtual machine in the cluster.
In other words a slot size is the worst case CPU and Memory reservation scenario in a cluster. This directly leads to the first “gotcha”:
HA uses the highest CPU reservation of any given VM and the highest memory reservation of any given VM. With vSphere 4.1, if no reservation higher than 256MHz is set, HA will use a default of 256MHz for CPU and a default of 0MB+memory overhead for memory. With vSphere 5.0 the default for CPU has been brought down to 32MHz.
If VM1 has 2GHz and 1024MB reserved and VM2 has 1GHz and 2048MB reserved, the slot size for memory will be 2048MB+memory overhead and the slot size for CPU will be 2GHz.
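A simple PowerShell sketch of that logic, using the two hypothetical VMs above (memory overhead left out for brevity): the slot size is simply the largest CPU reservation and the largest memory reservation across the powered-on VMs.
$vms = @(
    @{ Name = 'VM1'; CpuMhz = 2000; MemMB = 1024 },
    @{ Name = 'VM2'; CpuMhz = 1000; MemMB = 2048 }
)
$slotCpuMhz = ($vms | ForEach-Object { $_.CpuMhz } | Measure-Object -Maximum).Maximum   # 2000 MHz
$slotMemMB  = ($vms | ForEach-Object { $_.MemMB }  | Measure-Object -Maximum).Maximum   # 2048 MB (+ overhead)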
Basic design principle: Be really careful with reservations, if there’s no need to have them on a per VM basis don’t configure them.
By the way, did you know that with vSphere 5.1 and the Web Client you can specify fixed slot sizes in the UI? Nice right? Keep that in mind when you see some of the advanced settings in the next section, depending on the version you are running you could potentially just configure it in the UI.
How does HA calculate how many slots are available per host?
Of course we need to know what the slot size for memory and CPU is first. Then we divide the total available CPU resources of a host by the CPU slot size and the total available memory resources of a host by the memory slot size. This leaves us with a number of slots for both memory and CPU. The most restrictive number is the amount of slots for this host. If you have 25 CPU slots but only 5 memory slots, the amount of available slots for this host will be 5.
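Continuing the sketch above with a hypothetical host of 50GHz CPU and 10GB of memory (again purely illustrative numbers, memory overhead ignored):
$slotCpuMhz = 2000; $slotMemMB = 2048                     # slot sizes from the previous sketch
$hostCpuMhz = 50000; $hostMemMB = 10240                   # hypothetical host capacity
$cpuSlots  = [math]::Floor($hostCpuMhz / $slotCpuMhz)     # 25 CPU slots
$memSlots  = [math]::Floor($hostMemMB / $slotMemMB)       # 5 memory slots
$hostSlots = [math]::Min($cpuSlots, $memSlots)            # 5 - the most restrictive number wins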
As you can see this can lead to very conservative consolidation ratios. With vSphere this is something that's configurable. If you have just one VM with a really high reservation you can set the following advanced settings to lower the slot size being used during these calculations: das.slotCpuInMHz or das.slotMemInMB. The advanced settings das.slotCpuInMHz and das.slotMemInMB allow you to specify an upper boundary for your slot size. When one of your VMs has an 8GB reservation this setting can be used to define for instance an upper boundary of 1GB to avoid resource wastage and an overly conservative slot size. Note that these settings only define an upper boundary: when for instance das.slotMemInMB is configured to 2048MB and the largest reservation is only 500MB, the slot size for memory will be 500MB+memory overhead. If a lower boundary needs to be specified, the advanced setting “das.vmMemoryMinMB” or “das.vmCpuMinMHz” can be used. To avoid not being able to power on VMs with high reservations, these VMs will take up multiple slots. Keep in mind that pre-vSphere 4.1, when you were low on resources, this could mean that you were not able to power on this high reservation VM as resources would be fragmented throughout the cluster instead of located on a single host.
As of vSphere 4.1 HA is closely integrated with DRS. When a failover occurs HA will first check if there are resources available on that host for the failover. If resources are not available HA will ask DRS to accommodate these where possible. HA, as of 4.1, will be able to request a defragmentation of resources to accommodate this VM's resource requirements. How cool is that?! One thing to note though is that HA will request it, but a guarantee can still not be given, so you should be cautious when it comes to resource fragmentation.
The following is an example of where resource fragmentation could lead to issues:
If you need to use a high reservation for either CPU or memory, these options (das.slotCpuInMHz or das.slotMemInMB) could definitely be useful; there is however something that you need to know. Check this diagram and see if you spot the problem; the das.slotMemInMB has been set to 1024MB.
Notice that the memory slot size has been set to 1024MB. VM24 has a 4GB reservation set. Because of this VM24 spans 4 slots. As you might have noticed, none of the hosts has 4 slots left. Although in total there are enough slots available, they are fragmented and HA might not be able to actually boot VM24. Keep in mind that admission control does not take fragmentation of slots into account when slot sizes are manually defined with advanced settings. It does count 4 slots for VM24, but it will not verify the amount of available slots per host. As explained, as of vSphere 4.1 it will request defragmentation, but as stated… it cannot be guaranteed.
Basic design principle: Avoid using advanced settings to decrease slot size as it might lead to more down time.
Another issue that needs to be discussed is “Unbalanced clusters”. Unbalanced would for instance be a cluster with 5 hosts of which one contains substantially more memory than the others. What would happen to the total amount of slots in a cluster of the following specs:
Five hosts, each host has 16GB of memory except for one host (esx05) which has recently been added and has 32GB of memory.
One of the VMs in this cluster has 4 vCPUs and 4GB of memory; because there are no reservations set, the memory overhead of 325MB is used to calculate the memory slot size. (It's more restrictive than the CPU slot size.)
This results in 50 slots for esx01, esx02, esx03 and esx04. However, esx05 will have 100 slots available. Although this sounds great, admission control rules out the host with the most slots as it takes the worst case scenario into account. In other words, end result: a 200 slot cluster. With 5 hosts of 16GB, (5 x 50) – (1 x 50), the result would have been exactly the same. (Please keep in mind that this is just an example; this also goes for a CPU unbalanced cluster when CPU is most restrictive!)
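The arithmetic behind those numbers, as a quick PowerShell sketch (host memory in MB, a 325MB memory slot size and one tolerated host failure; all values are the hypothetical ones from the example):
$slotMemMB  = 325
$hostMemMB  = 16384,16384,16384,16384,32768                                       # esx01-esx04 and the 32GB esx05
$slotsPerHost = $hostMemMB | ForEach-Object { [math]::Floor($_ / $slotMemMB) }    # 50,50,50,50,100
($slotsPerHost | Sort-Object -Descending | Select-Object -Skip 1 | Measure-Object -Sum).Sum   # 200 - the largest host is ruled out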
Basic design principle: Balance your clusters when using admission control and be conservative with reservations as it leads to decreased consolidation ratios.
Percentage of cluster resources reserved
Can I avoid large HA slot sizes due to reservations without resorting to advanced settings? Yes, you can. The simplest way, without using advanced settings, is selecting “Percentage of cluster resources reserved” as your admission control policy.
With vSphere, VMware introduced a percentage option next to the amount of host failures. The percentage avoids the slot size issue as it does not use slots for admission control. So what does it use?
When you select a specific percentage that percentage of the total amount of resources will stay unused for HA purposes. First of all VMware HA will add up all available resources to see how much it has available. Then VMware HA will calculate how much resources are currently consumed by adding up all reservations of both memory and CPU for powered on virtual machines. For those virtual machines that do not have a reservation larger than 256Mhz a default of 256Mhz will be used for CPU and a default of 0MB+memory overhead will be used for Memory. (Amount of overhead per config type can be found on page 28 of the resource management guide.)
In other words:
((total amount of available resources – total reserved VM resources)/total amount of available resources)
Where total reserved VM resources include the default reservation of 256Mhz and the memory overhead of the VM.
Let’s use a diagram to make it a bit more clear:
Total cluster resources are 24Ghz(CPU) and 96GB(MEM). This would lead to the following calculations:
((24GHz-(2GHz+1GHz+256MHz+4GHz))/24GHz) = 69% available
((96GB-(1,1GB+114MB+626MB+3,2GB))/96GB) = 85% available
As you can see the amount of memory differs from the diagram. Even if a reservation has been set, the amount of memory overhead is added to the reservation. For both metrics HA admission control will constantly check if the policy has been violated or not. When either of the two thresholds is reached, memory or CPU, admission control will disallow powering on any additional virtual machines.
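The CPU side of that calculation, as a small PowerShell sketch (values in MHz, taken from the example above; memory works the same way but also adds the per-VM overhead):
$totalCpuMhz    = 24000
$reservedCpuMhz = 2000 + 1000 + 256 + 4000
[math]::Floor((($totalCpuMhz - $reservedCpuMhz) / $totalCpuMhz) * 100)    # 69 percent of CPU capacity still available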
Please keep in mind that if you have an unbalanced cluster (hosts with different CPU or memory resources), your percentage should be equal to or, preferably, larger than the percentage of resources provided by the largest host. This way you ensure that all virtual machines residing on this host can be restarted in case of a host failure. Another thing to keep in mind is that, as there are no slots, resources might be fragmented throughout the cluster. As explained earlier, HA will request DRS to defragment resources to cater for that specific VM, but it is not a guarantee. I recommend making sure you have at least one host with enough available capacity to boot the largest VM (reservation CPU/MEM). Also make sure you select the highest restart priority for this VM (of course depending on the SLA) to ensure it will be able to boot.
I created a diagram which makes it more obvious I think. So you have 5 hosts, each with roughly 76% memory usage. A host fails and all its VMs will need to fail over. One of those VMs has a 4GB memory reservation; as you can imagine, failing over this particular VM will be difficult due to the fact that none of the hosts has enough memory available to guarantee it. Although HA will request DRS to free up resources, it is not guaranteed DRS can actually do this.
Basic design principle: Do the math, verify that a single host has enough resources to boot your largest VM. Also take restart priority into account for this/these VM(s).
Specify a failover host
With the Specify a Failover Host admission control policy, when a host fails, HA will attempt to restart all virtual machines on the designated failover host. The designated failover host is essentially a “hot standby”. In other words, DRS will not migrate VMs to this host when resources are scarce or the cluster is imbalanced. Please note that when selecting this admission control policy it is by no means a guarantee that when a failure occurs all VMs that need to be restarted actually are restarted on this host! If for whatever reason a restart fails or not enough resources are available, the VM will be restarted on a different host!
My Admission Control Policy Recommendation
It depends. Yes I know, that is the obvious answer, but it actually does. There are three options and each has its own advantages and disadvantages. Here you go:
Amount of host failures
- Pros:
- Fully automated, when a host is added to a cluster HA calculates how many slots are available.
- Ensures fail-over by calculating slot sizes.
- Cons:
- Can be very conservative and inflexible when reservations are used as the largest reservation dictates slot sizes.
- Unbalanced clusters leads to waste of resources.
Percentage reserved
- Pros:
- Flexible as it considers actual reservation per VM.
- The cluster dynamically adjusts its failover capacity when resources are added.
- As of vSphere 5.0 you can specify a separate percentage for CPU and Memory instead of a single value.
- Cons:
- Manual calculations are needed when adding hosts to a cluster if the amount of tolerated host failures needs to remain unchanged.
- Unbalanced clusters can be a problem when chosen percentage is too low.
Designated failover host
- Pros:
- What you see is what you get.
- High resource utilization as dedicated fail-over host is unused.
- As of vSphere 5.0 you can specify multiple hosts as dedicated failover hosts.
- Cons:
- What you see is what you get.
- Dedicated fail-over hosts not utilized during normal operations.
Basic design principle: Do the math, and take customer requirements into account. If you need flexibility, a “Percentage” is the way to go.
Flattening Shares
Prior to vSphere 4.1, an HA failed-over virtual machine could be granted more resource shares than it should, causing resource starvation until DRS balanced the load. As of vSphere 4.1 HA calculates normalized shares for a virtual machine when it is powered on after an isolation event!
Pre-vSphere 4.1, an issue could arise when custom shares had been set on a virtual machine. When HA fails over a virtual machine it will power on the virtual machine in the Root Resource Pool. However, the virtual machine's shares were scaled for its appropriate place in the resource pool hierarchy, not for the Root Resource Pool. This could cause the virtual machine to receive either too many or too few resources relative to its entitlement.
A scenario where and when this can occur would be the following:
VM1 has 1000 shares and Resource Pool A has 2000 shares. However, Resource Pool A has 2 VMs and both will have 50% of those “2000” shares.
When the host would fail both VM2 and VM3 will end up on the same level as VM1. However as a custom shares value of 10000 was specified on both VM2 and VM3 they will completely blow away VM1 in times of contention. This is depicted in the following diagram:
This situation would persist until the next invocation of DRS re-parents the virtual machine to its original Resource Pool. To address this issue, as of vSphere 4.1 DRS will flatten the virtual machine's shares and limits before fail-over. This flattening process ensures that the VM will get the resources it would have received if it had been failed over to the correct Resource Pool. This scenario is depicted in the following diagram. Note that both VM2 and VM3 are placed under the Root Resource Pool with a shares value of 1000.
Of course when DRS is invoked, both VM2 and VM3 will be re-parented under Resource Pool A and will again receive the number of shares they had originally assigned. I hope this makes it a bit more clear what this “flattened shares” mechanism actually does.
Advanced Settings
There is a KB article that explains the various types of advanced settings, but let me summarize it and simplify it a bit to make it easier to digest.
There are various sorts of advanced settings, but for HA three types in particular:
- das.* –> Cluster level advanced setting.
- fdm.* –> FDM host level advanced setting (FDM = Fault Domain Manager = vSphere HA)
- vpxd.* –> vCenter level advanced setting.
How do you configure these?
- Cluster Level
- In the vSphere Client: Right click your cluster object, click “edit settings”, click “vSphere HA” and hit the “Advanced Options” button.
- In the Web Client: Click “Hosts and Clusters”, click your cluster object, click the “Manage” tab, click “Settings” and “vSphere HA”, hit the “Edit” button
- FDM Host Level
- Open up an SSH session to your host and edit “/etc/opt/vmware/fdm/fdm.cfg”
- vCenter Level
- In the vSphere Client: Click “Administration” and “vCenter Server Settings”, click “Advanced Settings”
- In the Web Client: Click “vCenter”, click “vCenter Servers”, select the appropriate vCenter Server and click the “Manage” tab, click “Settings” and “Advanced Settings”
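For the cluster-level (das.*) settings there is also a PowerCLI route. A hedged sketch, assuming an existing Connect-VIServer session, a placeholder cluster name and isolation address, and assuming Get-AdvancedSetting accepts a wildcard here:
$cluster = Get-Cluster -Name "Cluster01"                           # placeholder cluster name
Get-AdvancedSetting -Entity $cluster -Name "das.*"                  # list what is currently configured (wildcard assumed)
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress0" -Value "192.168.1.1" -Confirm:$false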
In this section we will primarily focus on the ones most commonly used, a full detailed list can be found in KB 2033250. Please note that each bullet details the version which supports this advanced setting.
- das.maskCleanShutdownEnabled – 5.0 only
Whether the clean shutdown flag will default to false for an inaccessible and powered-off VM. Enabling this option will trigger VM failover if the VM's home datastore isn't accessible when it dies or is intentionally powered off.
- das.ignoreInsufficientHbDatastore – 5.0 only
Suppress the host config issue that the number of heartbeat datastores is less than das.heartbeatDsPerHost. Default value is “false”. Can be configured as “true” or “false”.
- das.heartbeatDsPerHost – 5.0 only
The number of required heartbeat datastores per host. The default value is 2; the value should be between 2 and 5.
- das.failuredetectiontime – 4.1 and prior
Number of milliseconds, timeout time, for isolation response action (with a default of 15000 milliseconds). Pre-vSphere 4.0 it was a general best practice to increase the value to 60000 when an active/standby Service Console setup was used. This is no longer needed. For a host with two Service Consoles or a secondary isolation address a failure detection time of 15000 is recommended.
- das.isolationaddress[x] – 5.0 and prior
IP address the ESX host uses to check on isolation when no heartbeats are received, where [x] = 0 ‐ 9. (see screenshot below for an example) VMware HA will use the default gateway as an isolation address and the provided value as an additional checkpoint. I recommend adding an isolation address when a secondary service console is being used for redundancy purposes.
- das.usedefaultisolationaddress – 5.0 and prior
Value can be “true” or “false” and needs to be set to false in case the default gateway, which is the default isolation address, should not or cannot be used for this purpose. In other words, if the default gateway is a non-pingable address, set the “das.isolationaddress0” to a pingable address and disable the usage of the default gateway by setting this to “false”.
- das.isolationShutdownTimeout – 5.0 and prior
Time in seconds to wait for a VM to become powered off after initiating a guest shutdown, before forcing a power off.
- das.allowNetwork[x] – 5.0 and prior
Enables the use of port group names to control the networks used for VMware HA, where [x] = 0 – ?. You can set the value to be “Service Console 2” or “Management Network” to use (only) the networks associated with those port group names in the networking configuration.
- das.bypassNetCompatCheck – 4.1 and prior
Disable the “compatible network” check for HA that was introduced with ESX 3.5 Update 2. Disabling this check will enable HA to be configured in a cluster which contains hosts in different subnets, so-called incompatible networks. Default value is “false”; setting it to “true” disables the check.
- das.ignoreRedundantNetWarning – 5.0 and prior
Remove the error icon/message from your vCenter when you don’t have a redundant Service Console connection. Default value is “false”, setting it to “true” will disable the warning. HA must be reconfigured after setting the option.
- das.vmMemoryMinMB – 5.0 and prior
The minimum default slot size used for calculating failover capacity. Higher values will reserve more space for failovers. Do not confuse with “das.slotMemInMB”.
- das.slotMemInMB – 5.0 and prior
Sets the slot size for memory to the specified value. This advanced setting can be used when a virtual machine with a large memory reservation skews the slot size, as this will typically result in an artificially conservative number of available slots.
- das.vmCpuMinMHz – 5.0 and prior
The minimum default slot size used for calculating failover capacity. Higher values will reserve more space for failovers. Do not confuse with “das.slotCpuInMHz”.
- das.slotCpuInMHz – 5.0 and prior
Sets the slot size for CPU to the specified value. This advanced setting can be used when a virtual machine with a large CPU reservation skews the slot size, as this will typically result in an artificially conservative number of available slots.
- das.sensorPollingFreq – 4.1 and prior
Set the time interval for HA status updates. As of vSphere 4.1, the default value of this setting is 10. It can be configured between 1 and 30, but it is not recommended to decrease this value as it might lead to less scalability due to the overhead of the status updates.
- das.perHostConcurrentFailoversLimit – 5.0 and prior
By default, HA will issue up to 32 concurrent VM power-ons per host. This setting controls the maximum number of concurrent restarts on a single host. Setting a larger value will allow more VMs to be restarted concurrently but will also increase the average latency to recover as it adds more stress on the hosts and storage.
- das.config.log.maxFileNum – 5.0 only
Desired number of log rotations.
- das.config.log.maxFileSize – 5.0 only
Maximum file size in bytes of the log file.
- das.config.log.directory – 5.0 only
Full directory path used to store log files.
- das.maxFtVmsPerHost – 5.0 and prior
The maximum number of primary and secondary FT virtual machines that can be placed on a single host. The default value is 4.
- das.includeFTcomplianceChecks – 5.0 and prior
Controls whether vSphere Fault Tolerance compliance checks should be run as part of the cluster compliance checks. Set this option to false to avoid cluster compliance failures when Fault Tolerance is not being used in a cluster.
- das.iostatsinterval (VM Monitoring) – 5.0 and prior
The I/O stats interval determines if any disk or network activity has occurred for the virtual machine. The default value is 120 seconds.
- das.failureInterval (VM Monitoring) – 5.0 and prior
The polling interval for failures. Default value is 30 seconds.
- das.minUptime (VM Monitoring) – 5.0 and prior
The minimum uptime in seconds before VM Monitoring starts polling. The default value is 120 seconds.
- das.maxFailures (VM Monitoring) – 5.0 and prior
Maximum number of virtual machine failures within the specified “das.maxFailureWindow”. If this number is reached, VM Monitoring doesn’t restart the virtual machine automatically. Default value is 3.
- das.maxFailureWindow (VM Monitoring) – 5.0 and prior
Minimum number of seconds between failures. Default value is 3600 seconds. If a virtual machine fails more than “das.maxFailures” times within 3600 seconds, VM Monitoring doesn’t restart the machine.
- das.vmFailoverEnabled (VM Monitoring) – 5.0 and prior
If set to “true”, VM Monitoring is enabled. When it is set to “false”, VM Monitoring is disabled.
- das.config.fdm.deadIcmpPingInterval – 5.0 only
Default value is 10. ICMP pings are used to determine whether a slave host is network accessible when the FDM on that host is not connected to the master. This parameter controls the interval (expressed in seconds) between pings.
- das.config.fdm.icmpPingTimeout – 5.0 only
Default value is 5. Defines the time to wait in seconds for an ICMP ping reply before assuming the host being pinged is not network accessible.
- das.config.fdm.hostTimeout – 5.0 only
Default is 10. Controls how long a master FDM waits in seconds for a slave FDM to respond to a heartbeat before declaring the slave host not connected and initiating the workflow to determine whether the host is dead, isolated, or partitioned.
- das.config.fdm.stateLogInterval – 5.0 only
Default is 600. Frequency in seconds to log cluster state.
- das.config.fdm.ft.cleanupTimeout – 5.0 only
Default is 900. When a vSphere Fault Tolerance VM is powered on by vCenter Server, vCenter Server informs the HA master agent that it is doing so. This option controls how many seconds the HA master agent waits for the power on of the secondary VM to succeed. If the power on takes longer than this time (most likely because vCenter Server has lost contact with the host or has failed), the master agent will attempt to power on the secondary VM.
- das.config.fdm.storageVmotionCleanupTimeout – 5.0 only
Default is 900. When a Storage vMotion is done in an HA enabled cluster using pre-5.0 hosts and the home datastore of the VM is being moved, HA may interpret the completion of the Storage vMotion as a failure, and may attempt to restart the source VM. To avoid this issue, the HA master agent waits the specified number of seconds for a Storage vMotion to complete. When the Storage vMotion completes or the timer expires, the master will assess whether a failure occurred.
- das.config.fdm.policy.unknownStateMonitorPeriod – 5.0 only
Defines the number of seconds the HA master agent waits after it detects that a VM has failed before it attempts to restart the VM.
- das.config.fdm.event.maxMasterEvents – 5.0 only
Default is 1000. Defines the maximum number of events cached by the master.
- das.config.fdm.event.maxSlaveEvents – 5.0 only
Default is 600. Defines the maximum number of events cached by a slave.
Basic design principle: Avoid using advanced settings as much as possible as it leads to increased complexity.