A Practical Guide to Failover Groups

Note: Practical guide here doesn’t mean a step-by-step approach to setup the service/feature but to mention the important points which are critical to understand for daily practical job responsibilities or the observations made while doing hands-on on the tasks.

In the last post we learnt about Active Geo-Replication which can be used for disaster recovery but limited to only one database at a time and moreover, the failover in active geo-replication was manual. If you need to configure the same thing for multiple databases as a group and also looking for Automatic Failover, this is where you will start considering Failover Groups. However, there are few other differences too where these two features vary that we will look at towards end of this post after exploring the failover groups. If you want, you can quickly jump to similarities between Active Geo-Replication and Failover Groups or Differences between Active Geo-Replication and Failover Groups.

Configuring Failover Groups for Azure SQL Databases is quite simple through portal. You just need to ensure that the server hosting secondary replica is in different region. Once you have configured this through the wizard over portal, it looks like as –

Green region indicates the primary and blue is the secondary replica. You can clearly see that they are in different region. In fact, it has an associated limitation that there can be ONLY one secondary replica.

Other than you can have multiple databases configured for failover to secondary replica, one more biggest advantage is that the failover to secondary server is automatic and you don’t need to make any changes to connection strings i.e. the endpoint, applications or user connects to remain same.

So, the first thing from a user standpoint that you need to capture immediately after setting up the failover group is the endpoints that users needs to connect to. It looks like as –

Read/Write listener endpoint is basically the primary server’s connection which is like <failover-group-name>.database.windows.net i.e. NOT the primary server’s endpoint but the one driven by failover group and this is what ensures that no changes are needed to connection strings after failover. Read-only listener endpoint is corresponding to the secondary replica which can be used for read operations and again this is not the endpoint of secondary server but based on failover group i.e. <failover-group-name>.secondary.database.windows.net

While setting up failover group there are couple of attributes which are worth to be mentioned explicitly here – Read/Write Failover Policy and Read/Write Grace Period (Hours) both of which can be modified after the setup

Read/Write Failover Policy still seems self-explanatory that you can choose either for an automatic option or a manual intervention to failover in the event of a disaster. You can select as per your needs, not a big deal.

Read/Write Grace Period (Hours) is what needs to be understood well. It says that how long the service must wait before finally cutting over to secondary replicas. Values can range from 1 hour to 24 hours. And this is the time window that service will use to synchronize the changes which have occurred on primary just before the disaster but have not been written all to secondary. If you configure less time say, an hour, all changes may not be cascaded to secondary and there might be some data loss. So, if you can’t afford data loss then configure the highest value so that either there is no data loss or you tried your best by configuring the best possible option. But at the same time, you need to give consideration to RTO (Recovery Time Objective) as well and that’s why this is little tricky and something that you need to understand well.

Another thing we need to understand is the difference between Failover and Forced Failover.

Both options are for manual failover which could be used either for a planned failover or a test exercise. Failover is the usual cut over to secondary replica which first synchronizes the changes to secondary server before actual failover. Whereas Forced Failover is immediate failover to secondary without synchronizing anything to secondary. So, depending on your situation, you can choose the appropriate option.

Similarities between Active Geo-Replication and Failover Groups –

  1. Both are database level features so only database level objects are replicated to secondary replicas.
  2. In fact, Failover Groups is built on top of Active Geo-Replication
  3. Like in Active Geo-Replication, secondary replicas are read-only here too. If you try to update database level settings, you will get the errors
  4. Query Store is READ ONLY here too
  5. Both can be used to offload read operations to secondary replicas
  6. Both replicates to secondary in asynchronous way
  7. Both supports nested configuration i.e. secondary replicas for secondary replicas to bypass the default limitations of four geo-replicas and one failover group though with trade-off
  8. Same query that you used to monitor the replication lag through system catalog view for geo-replication can be used for failover groups
SELECT * FROM sys.dm_geo_replication_link_status

Differences between Active Geo-Replication and Failover Groups –

  1. The very first difference is that Geo-Replication is per database configuration; Failover groups allows multiple databases to be configured at same time
  2. Geo-Replication can be configured on a server within same region but Failover Groups allows for the wider scope and hence asks for the secondary server to be in a different region
  3. There can be up to four secondary replicas in case of Geo-Replication but Failover Groups allows only one replica
  4. Geo-Replication doesn’t support Managed Instances; Failover Groups can be configured for Managed Instances
  5. Failover is manual in case of Geo-Replication; Failover Groups can cut over automatically
  6. After the failover, connection strings needs to be changed in case of Geo-Replication but no such requirements for failover groups

Concluding Remarks

This can be thought of as that on on-premise side there are three options for HA/DR – Log Shipping (per database configuration), Availability Groups (for a group of databases) and Failover Clusters. Considering that Azure SQL Database is a DBaaS i.e. Database as a Service so the options which are scoped at database level are supported i.e. Active Geo-Replication (per database configuration) and Failover Groups (group of databases) in Azure terminologies.