I strongly believe the biggest nightmare for most MS Exchange engineers and administrators is spiders… I mean, who likes spiders? (I strongly dislike, dare I say hate, walking into a spider web. It sucks.) Next up, though, would be a full DR incident where your primary site goes down and you are left with just DR. If you have properly planned and implemented a DAG (or DAGs, depending on your configuration) along with Datacenter Activation Coordination (DAC) mode, it doesn’t have to be that rough.
I did this in my lab, which is set up as described below.
It is very simple, and here is the breakdown:
- One forest, exchangelaboratory.com
- Two AD Sites, EWR (DR) and LGA (Primary)
- Five multi-role Exchange 2010 SP3 UR6 servers: four in the LGA (Primary) site and one in the EWR (DR) site
- Single DAG spanning both sites
- The namespace in the LGA site (primary, so both ExternalURL and InternalURL) is mail.exchangelaboratory.com; the namespace in the EWR site (DR, not internet facing, so InternalURL only) is webmail.exchangelaboratory.com
- Only the Primary (LGA) site is internet facing
You are in bed and your phone goes off. Boss man tells you that the LGA site is completely down due to a power failure and you need to bring Exchange online in DR. Active Directory / DNS is already online in DR (with the FSMO roles seized / moved) and Exchange is ready to be activated within that DR site. Get to it!
Stage One: Site Switchover of the DAG
First things first: confirm the state of your databases in the DR site. In most situations the copies should show Disconnected and Healthy, because with the four LGA nodes and the FSW unreachable, the one surviving node (of five) no longer has quorum. The easiest way to check is to run Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus
Within EMS (Exchange Management Shell), point your session at a local domain controller by running Set-ADServerSettings -PreferredServer <local domain controller FQDN>
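The next thing you will want to do is stop the database availability group in configuration-only mode for the site that has failed (not the surviving DR site), using Stop-DatabaseAvailabilityGroup with the -ConfigurationOnly switch. Because the failed servers cannot be reached, the command only updates Active Directory, which does two things:
- Marks the DAG members in the failed site as stopped (they move to the StoppedMailboxServers list)
- Sets those servers up to be evicted from the cluster when the DAG is restored in the surviving site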
Confirm that the stop command ran successfully.
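A minimal sketch of the stop and its confirmation, assuming the DAG is literally named DAG (the name used with Start-DatabaseAvailabilityGroup later in this article) and the failed AD site is LGA. Adjust both to your environment.

```powershell
# Assumption: the DAG is named "DAG" and the failed AD site is "LGA".
# -ConfigurationOnly records the stop in Active Directory without
# contacting the (unreachable) LGA servers.
Stop-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite LGA -ConfigurationOnly

# The LGA servers should now be listed under StoppedMailboxServers,
# and the EWR server should remain under StartedMailboxServers.
Get-DatabaseAvailabilityGroup -Identity DAG |
    Format-List Name, StoppedMailboxServers, StartedMailboxServers
```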
On the nodes in the surviving (DR) site, we then stop the failover cluster service. On Windows Server 2008 R2 you can run Stop-Service ClusSvc and then confirm with Get-Service ClusSvc (this works on 2012 as well), but if you are running Windows Server 2008 you will have to use the legacy net stop clussvc command.
Time to restore the database availability group!
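The restore step might look like the following sketch, again assuming the DAG is named DAG and that EWR is the surviving AD site. Restore-DatabaseAvailabilityGroup also accepts -AlternateWitnessServer and -AlternateWitnessDirectory; they are left out here because the lab's witness layout is not described.

```powershell
# Assumption: DAG name "DAG", surviving AD site "EWR".
# Run this from EMS on a server in the surviving site.
Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite EWR
```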
Restore-DatabaseAvailabilityGroup, run against the surviving site, shrinks the DAG membership to what is currently available (in this case just the surviving site) and forces quorum. Once the command finishes, we can confirm by importing the failover clustering PowerShell module into our EMS session and checking that the cluster group is online.
We should see the Cluster Group as online and, in our situation, the OwnerNode (where the Primary Active Manager will sit) as EWR-EXCH01, since that is the only surviving member of this cluster and it is in the EWR site.
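The cluster check can be sketched like this, using the FailoverClusters module that ships with Windows Server 2008 R2 and later. The expected values in the comments are for this lab only.

```powershell
# Load the failover clustering cmdlets into the current EMS session.
Import-Module FailoverClusters

# "Cluster Group" should show State Online with OwnerNode EWR-EXCH01 in this lab.
Get-ClusterGroup | Format-Table Name, OwnerNode, State -AutoSize

# Only the surviving member(s) should still be listed as cluster nodes.
Get-ClusterNode | Format-Table Name, State -AutoSize
```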
If we rerun Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus, we will see that the databases are now mounted. So far so good.
Stage Two: SMTP traffic
This is going to change for each scenario out there but here are the primary examples:
- Change the source servers for the Send connector to include the Hub Transport servers in the surviving DR site
- Change the Receive connector on the surviving Hub Transport server to accept email from your source (usually either a NAT from the internet or a smarthost device)
- Depending on how you have the NAT and / or smarthost set up, you may need to change DNS records (MX and TXT records)
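As a rough sketch of the first two items: the connector name "Internet Outbound" below is hypothetical (list yours with Get-SendConnector), and the right Receive connector settings depend entirely on what sits in front of your DR site, so the second command only displays the current values.

```powershell
# Hypothetical connector name; substitute your own.
# -SourceTransportServers replaces the existing list, so include every
# server that should be allowed to send through this connector.
Set-SendConnector -Identity "Internet Outbound" -SourceTransportServers EWR-EXCH01

# Review what the surviving server's Receive connectors currently accept
# before changing RemoteIPRanges or PermissionGroups.
Get-ReceiveConnector -Server EWR-EXCH01 |
    Format-List Identity, Bindings, RemoteIPRanges, PermissionGroups
```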
Stage Three: Client Access Configuration
As stated earlier in the article, the EWR site was not internet facing. That means the endpoint for clients was the CAS in the LGA site, and any request that needed to reach the EWR site was proxied there. On each of the CAS services in EWR the InternalURL is https://webmail.exchangelaboratory.com/ but the ExternalURL is set to $NULL. We should change this.
You have two options for doing this: one combines the GUI and PowerShell, and the other is strictly PowerShell. I will leave some commands below to get this done within PowerShell.
- Get-ActiveSyncVirtualDirectory -Server EWR-EXCH01 | Set-ActiveSyncVirtualDirectory -ExternalUrl 'https://webmail.exchangelaboratory.com/Microsoft-Server-ActiveSync'
- Get-EcpVirtualDirectory -Server EWR-EXCH01 | Set-EcpVirtualDirectory -ExternalUrl 'https://webmail.exchangelaboratory.com/ecp' -InternalUrl 'https://webmail.exchangelaboratory.com/ecp'
- Get-OabVirtualDirectory -Server EWR-EXCH01 | Set-OabVirtualDirectory -ExternalUrl 'https://webmail.exchangelaboratory.com/OAB'
- Get-OwaVirtualDirectory -Server EWR-EXCH01 | Set-OwaVirtualDirectory -ExternalUrl 'https://webmail.exchangelaboratory.com/owa'
- Get-WebServicesVirtualDirectory -Server EWR-EXCH01 | Set-WebServicesVirtualDirectory -ExternalUrl 'https://webmail.exchangelaboratory.com/EWS/Exchange.asmx'
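To double-check the result, something like the following sketch lists the internal and external URLs on the DR server (EWR-EXCH01 in this lab) so they can be compared side by side.

```powershell
# Server name from this lab; change it for your environment.
$server = 'EWR-EXCH01'

# Each virtual directory should now show the webmail namespace as ExternalUrl.
Get-ActiveSyncVirtualDirectory  -Server $server | Format-List Identity, InternalUrl, ExternalUrl
Get-EcpVirtualDirectory         -Server $server | Format-List Identity, InternalUrl, ExternalUrl
Get-OabVirtualDirectory         -Server $server | Format-List Identity, InternalUrl, ExternalUrl
Get-OwaVirtualDirectory         -Server $server | Format-List Identity, InternalUrl, ExternalUrl
Get-WebServicesVirtualDirectory -Server $server | Format-List Identity, InternalUrl, ExternalUrl
```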
You can also do Autodiscover, but if you are doing split DNS like I am in this example AND you are in an active / passive configuration, in theory the AutoDiscoverServiceInternalUri (the SCP object) should be pointed to the same value as the LGA site. To check this, run the following:
Get-ClientAccessServer EWR-EXCH01 | Select AutoDiscoverServiceInternalUri
If this is not correct, you can fix it by doing the following:
Set-ClientAccessServer EWR-EXCH01 -AutoDiscoverServiceInternalUri https://<value>.exchangelaboratory.com/autodiscover/autodiscover.xml
Now that we have the ExternalURL set properly and we know we have a valid InternalURL, we should reconfigure DNS:
- Ensure that your boundary device (the device sitting between the internet and the LAN) is configured to pass TCP 443 to your internal endpoint. If you have a single CAS like I do, that is the endpoint; if you have a hardware load balancer, point the boundary device at the load balancer
- Change your internal and external DNS A records (if required) to point to the EWR endpoint for client access. Internally this is usually either a CAS or a hardware load balancer; externally it will be your boundary device (either a firewall or a reverse proxy)
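A quick sanity check of the DNS and TCP 443 changes could look like the sketch below. It uses plain .NET classes so it works in older PowerShell versions, and it only tells you what the machine you run it on resolves and can reach, so run it both inside and outside the LAN.

```powershell
# Surviving site's namespace from this lab.
$name = 'webmail.exchangelaboratory.com'

# Show the address(es) this machine currently resolves for the namespace.
[System.Net.Dns]::GetHostAddresses($name) | ForEach-Object { $_.ToString() }

# Try a TCP connection to port 443; Connect() throws if it cannot connect.
$client = New-Object System.Net.Sockets.TcpClient
try {
    $client.Connect($name, 443)
    "TCP 443 reachable: $($client.Connected)"
}
finally {
    $client.Close()
}
```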
The best way to test afterwards is with your smartphone and an Outlook client (plus Outlook for Mac, if you have it, for EWS), along with https://testexchangeconnectivity.com/.
I have yet to test this, but if you want you can create a CNAME record from mail.exchangelaboratory.com (my failed site’s namespace) to webmail.exchangelaboratory.com (the surviving site’s namespace). This *may* work, but clients typically have to reconfigure their ActiveSync devices. In theory it should work, but that is not a guarantee.
Restoring the Failed Site
After a few days the power comes back to our LGA site, and somehow all the servers come back online without a problem. Let’s now prepare Exchange to be activated back in the normal primary site.
First we need to check (on all Exchange servers) that the Exchange services are started and that the Failover Cluster service is actually in the disabled / stopped state. If the Failover Cluster service is not in the disabled / stopped state, then the Active Directory request to evict the server did not go through properly, which I will cover at the bottom of this article.
To check the Exchange services run Test-ServiceHealth in Exchange Management Shell, and for the cluster service run Get-Service ClusSvc (Server 2008 R2+; on Windows Server 2008 open the services.msc MMC)
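Get-Service shows whether the service is running but, in older PowerShell versions, not its startup type. A small sketch that shows both on the local server, using the standard Win32_Service WMI class:

```powershell
# Exchange services: ServicesNotRunning should be empty for each role.
Test-ServiceHealth | Format-List Role, RequiredServicesRunning, ServicesNotRunning

# Cluster service: expect State "Stopped" and StartMode "Disabled"
# on a server that was cleanly evicted.
Get-WmiObject Win32_Service -Filter "Name='ClusSvc'" |
    Format-List Name, State, StartMode
```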
If this is good on all of the servers in what was your failed site, we can now bring them back into the database availability group. This command will re-add the servers as members of the Windows failover cluster and then readjust quorum. It will not activate databases.
Start-DatabaseAvailabilityGroup DAG -ActiveDirectorySite LGA
You can confirm this ran properly a few different ways…
- Get-DatabaseAvailabilityGroup | FL Name, StartedMailboxServers to ensure the servers did start
- Import the FailoverClusters module into Exchange Management Shell and run the Get-ClusterNode and Get-ClusterGroup commands. Cluster Group should show as online and Get-ClusterNode should show all Exchange servers as up
- Check your mailbox database copies (if your CI state is showing as failed, give it a few minutes; the CI troubleshooter will usually fix this)
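The three checks above can be run together, roughly as follows. This is a sketch that again assumes the DAG is named DAG.

```powershell
# 1. The LGA servers should be back under StartedMailboxServers.
Get-DatabaseAvailabilityGroup -Identity DAG |
    Format-List Name, StartedMailboxServers, StoppedMailboxServers

# 2. Cluster view: all nodes Up, Cluster Group Online.
Import-Module FailoverClusters
Get-ClusterNode  | Format-Table Name, State -AutoSize
Get-ClusterGroup | Format-Table Name, OwnerNode, State -AutoSize

# 3. Database copies, including the content index (CI) state.
Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState -AutoSize
```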
I ran into one issue in the lab that I have never seen before (I have done this twice in a real DR scenario), and that is when you run Start-DatabaseAvailabilityGroup against the surviving site. It comes back with the error shown below.
The error context is "EvictClusterNode('SERVERNAME.exchangelaboratory.com') failed with 0x46".
Yet if I run Get-Service ClusSvc, I see the service is running.
As per Tim McMichael’s blog, the resolution is to rerun the command. When I ran it a second time it worked without an issue.
Troubleshooting Failback (post-DR, reactivating the site that was down)
This is one I have seen in real life but not in my lab testing today. If you notice that the cluster service is not disabled / stopped once the server comes back online (give it a few minutes for AD replication and such), then you need to force a cleanup of the server.
On the affected server, run the following from the command prompt:
Cluster node /forcecleanup
Once this is done check the service. It should be disabled / stopped.
TL;DR: Failover and failback are honestly just a few commands, and should take no more than 10-20 minutes.
Any questions? Post them in the comments!
- Adam F