DBAs, have you ever woken up one morning and found your SQL 2012, 2014 or 2016 Availability Group had mysteriously failed over during the wee hours?
Of course there can be several legitimate reasons why it failed over (Windows Cluster or quorum issue, O/S resources, SQL Server/node rebooted, SQL services restarted, lost network connectivity, storage went offline, etc.), but then there are those weird and seemingly unexplainable events that leave you scratching your head after you’ve looked in all the obvious places for a smoking gun.
This happened at a client site a while back, during a new virtualised data centre infrastructure build (VMware 6.0). I’d finished the SQL build and test, but the infrastructure team were still completing their configuration tasks. To cut a long story short, I eventually traced the failovers to a (newly implemented) VMware snapshot task on each SQL VM, which paused the entire VM for a few seconds while it wrote the snapshot to the datastore. This was long enough for the availability group to think it had lost connectivity between the cluster and the primary replica (health-check timeout), or between the primary and secondary replicas (session connectivity timeout), which caused the failover event.
My issue was exacerbated because the VM snapshots were scheduled to run at the same time on all SQL virtual machines, causing a small outage on each node simultaneously, with ripple effects felt at the Windows Cluster and, of course, the underlying availability group. The primary effect was that the cluster ping-ponged until the number of permitted failovers had been exhausted (this was set to the default Production setting of ‘n-1 failover every 6 hours’). That left either the cluster and availability group in an unresolved state, or the primary replica working with no secondary replicas to fail over to (because they couldn’t sync). The cluster may have lost quorum too.
The availability group session timeout (ping connectivity) default setting for SQL 2012, 2014 and 2016 is 10 seconds. I increased this to 30 seconds and the false failover issue caused by the VM snapshot went away. I left the ‘failure-condition level’ and ‘health-check timeout threshold’ at the default install values.
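The session timeout is set per replica with an ALTER AVAILABILITY GROUP ... MODIFY REPLICA statement, run on the primary. As a rough sketch, the small Python helper below just builds that T-SQL so it can be reviewed or scripted for each replica. The availability group name `MyAG` and server name `SQLNODE1` are made-up placeholders, not the names from this environment.

```python
def session_timeout_sql(ag_name: str, replica: str, seconds: int) -> str:
    """Build the T-SQL that changes the session timeout for one replica.

    ag_name and replica are whatever your environment uses; the values in
    the example call below are hypothetical placeholders.
    """
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("seconds must be a positive integer")
    # Escape the identifier and the string literal so odd names stay valid.
    ag = ag_name.replace("]", "]]")
    rep = replica.replace("'", "''")
    return (
        f"ALTER AVAILABILITY GROUP [{ag}] "
        f"MODIFY REPLICA ON N'{rep}' "
        f"WITH (SESSION_TIMEOUT = {seconds});"
    )


# Example: raise the timeout from the default 10 seconds to 30 seconds.
print(session_timeout_sql("MyAG", "SQLNODE1", 30))
```

Because the setting belongs to the replica rather than the group, the statement needs to be generated and run once for each replica you want to change.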
Due to time and budget constraints I never investigated exactly which failure condition the snapshot triggered; however, I suspect it was a hard error (see the ‘Possible Failures’ link below) caused by the stun time.
Note: Beginning in ESXi 5.0, the snapshot stun times are logged. Each virtual machine’s log file (vmware.log) will contain messages similar to:
2013-03-23T17:40:02.544Z| vcpu-0| Checkpoint_Unstun: vm stopped for 403475568 us
In this example, the virtual machine was stunned for 403475568 microseconds (1 second = 1 million microseconds) or 403.475568 seconds.
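If you want to check your own VMs, a minimal sketch along these lines could scan vmware.log for stun messages and flag any that lasted at least as long as the session timeout. It assumes the message format shown above; the function names and the second sample line are made up for illustration.

```python
import re

# Matches the stun message quoted above; the duration is in microseconds.
UNSTUN_RE = re.compile(r"Checkpoint_Unstun: vm stopped for (\d+) us")


def stun_seconds(log_lines):
    """Yield (line, seconds) for every stun message found in the log lines."""
    for line in log_lines:
        match = UNSTUN_RE.search(line)
        if match:
            yield line.rstrip("\n"), int(match.group(1)) / 1_000_000


def stuns_exceeding(log_lines, timeout_seconds=10.0):
    """Return the stuns that lasted at least timeout_seconds.

    The default of 10 seconds mirrors the default AG session timeout.
    """
    return [(line, secs) for line, secs in stun_seconds(log_lines)
            if secs >= timeout_seconds]


sample = [
    # The log line from the note above.
    "2013-03-23T17:40:02.544Z| vcpu-0| Checkpoint_Unstun: vm stopped for 403475568 us",
    # A made-up short stun, for contrast only.
    "2013-03-23T18:40:02.544Z| vcpu-0| Checkpoint_Unstun: vm stopped for 1250000 us",
]

for line, secs in stuns_exceeding(sample):
    print(f"{secs:.3f} s stun: {line}")
```

To run it against a real log, pass an open file handle instead of `sample`; the path to vmware.log depends on your datastore layout.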
But hey, I was back on track, the client was happy, and the issue hasn’t reoccurred.
Now, having this occur in Production could significantly impact uptime SLAs and RTOs, so it is very important to understand the impact of snapshots on the availability and performance of virtual machines.
Moral of the story? Ensure you’re in the loop with your infrastructure team (networking, storage, hypervisor) so you are aware of any changes they make to the environment that may impact your SQL platform.
Possible Failures During Sessions Between Availability Replicas
Change the Session-Timeout Period for an Availability Replica
Flexible Failover Policy for Automatic Failover of an Availability Group
Configure the Flexible Failover Policy to Control Conditions for Automatic Failover
Best practices for virtual machine snapshots in the VMware environment
A snapshot removal can stop a virtual machine for long time
Shedding light on VSS & VDI backups in SQL Server
#1: As an aside, I always configure my availability group (WSFC nodes) with a minimum of 2 NICs: 1 for data (app traffic to the AG listener), and 1 for AG replication traffic (non-routable subnet). This minimises network bandwidth contention that could trigger an unwanted failover condition or degrade client application responsiveness (acknowledging that reduced client responsiveness may be due to the lag associated with uncommitted synchronous transactions across the AG replication subnet).
#2: If you are taking SQL native backups, disable the SQL Server VSS Writer service so that the VMware snapshots do not include backups of the SQL Server databases.