Friday, February 8, 2013

How to Properly Recover an Exchange Server

You Recovered an Exchange Server… So Share It, Maybe

There comes a time in an Exchange Server’s life that it just gives up… This might be a hardware/driver issue or someone pulling one too many drives out of the RAID Array. We’ve recently had to recover two Exchange 2010 servers from Active Directory due to unforeseen issues. One was a physical machine that was having network-related issues, and when the NIC firmware was updated, it went into a BSOD cycle. The second was a 2010 DAG node on a Hyper-V server that suffered an unknown failure of both RAID Arrays (OS and Data), and everything had to be rebuilt.

Microsoft has the recovery process documented on TechNet… but there are some big caveats that they don’t really call out… and if you make a blunder… they don’t tell you how to fix it.

A link to their documentation - http://technet.microsoft.com/en-us/library/dd876880.aspx

If the server is a member of a DAG… the steps are different - http://technet.microsoft.com/en-us/library/dd638206.aspx

Let’s break down the basic steps and walk through them. Keep in mind; you don’t have to use new hardware to perform these steps. A lot of our deployments use VMs.

Prereqs – Use common sense on these… Most of our Exchange 2010 deployments are Windows Server 2008 R2 and “Should” have Service Pack 1 on it. The Service Pack level won’t affect the recovery aspect, but make sure to get the level fixed before putting back into production.

Recovering the server – There are two incredibly important steps that HAVE to be followed correctly; Step 1: Reset the Computer Account & Step 6: Using the proper installation media. If these are not done properly, it will cause more issues down the line. The hamsters… won’t be happy and might rise up in a revolution.

Step 1 - Reset the computer account in AD. This does not mean go into AD and delete the computer object… Follow the instructions that MS provides, otherwise the computer object won’t have the correct Exchange SACLs and the Exchange installation will fail repeatedly.

If you happen to make the mistake of deleting the computer object, don’t fret, all is not lost. There are two ways to get the system back to a state for you to continue.

·         OMG Method 1 – Run setup /prepareAD

This is the most intensive/intrusive action to get back to normal. Usually the customer would need to run this to make sure it’s run with the correct AD Permissions.

·         OMG Method 2 – Add the Exchange Security Groups to the Computer Object

This method is quicker and much less impactful to the entire organization. PrepareAD will fix these permissions for you, but the manual fix is faster and easier. There are three AD groups that the computer object needs to be a member of:

1.       Exchange Trusted Subsystem

2.       Exchange Servers

3.       Exchange Install Domain Servers


If there is any doubt where these AD groups are located, check another Exchange server in the environment and make sure the recovered server has the same group memberships.
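If you would rather script Method 2 than click through ADUC, here is a minimal sketch. It assumes the ActiveDirectory PowerShell module is installed, that the three groups can be found by name in the current domain, and that the recovered server is called EX01 (a made-up name, swap in your own). The -WhatIf switch keeps it a dry run until you take it off.

```powershell
Import-Module ActiveDirectory

# Hypothetical name of the recovered server; replace with yours.
$recovered = Get-ADComputer -Identity 'EX01'

# Group names exactly as listed above.
$groupNames = 'Exchange Trusted Subsystem',
              'Exchange Servers',
              'Exchange Install Domain Servers'

foreach ($name in $groupNames) {
    # Look the group up by its Name attribute instead of guessing its sAMAccountName.
    $group = Get-ADGroup -Filter "Name -eq '$name'"
    if (-not $group) {
        Write-Warning "Group '$name' was not found in this domain."
        continue
    }

    # Dry run: remove -WhatIf to actually add the computer object to the group.
    Add-ADGroupMember -Identity $group -Members $recovered -WhatIf
}
```

To compare against a healthy server, Get-ADPrincipalGroupMembership on that server's computer object lists its groups. A computer only picks up new group memberships when it gets a fresh logon token, so plan on a reboot of the recovered server after changing them.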

Step 2 – Install the OS and bring the server online – If the server has to be completely rebuilt, make sure that you are using the correct OS; it needs to match what was previously installed. You also need to give the server the same name as the previous server. The NICs and IPs need to be configured the same way they were on the bjorked server.

Step 3 – Join the server to the domain with the same name as the previous server. This goes hand-in-hand with Step 2.

Step 4 – Install the Prereqs for Exchange 2010 (Exchange 2010 Prerequisites). My suggestion is to use the PowerShell command they provide based on the installed roles.

Step 5 – DAG Members Only – Remove the server from the DAG. You’ll have to remove any database copies hosted on the server before you can remove it from the DAG.

·         Grab a list of all the database copies on the bad server

Get-MailboxDatabaseCopyStatus -Server <Bad server name>

·         Remove each database copy -

Remove-MailboxDatabaseCopy <database>\<server>

 

Example: Get-MailboxDatabaseCopyStatus -Server <bad server> | Remove-MailboxDatabaseCopy

 

·         Remove the server from the DAG. If the server is offline (99% of the time this will be the case), add the ConfigurationOnly parameter as shown here; leave it off if the server is still online

Remove-DatabaseAvailabilityGroupServer -Identity <DAG Name> -MailboxServer <bad server> -ConfigurationOnly
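Strung together, Step 5 might look like the sketch below, run from the Exchange Management Shell on a healthy server. EX01, DAG01 and the CSV file name are placeholders I made up. Everything that changes something carries -WhatIf so you can see what would happen first.

```powershell
# Hypothetical names; replace with your own.
$badServer = 'EX01'
$dagName   = 'DAG01'

# Save the list of copies first so you know what to add back in Step 7.
$copies = Get-MailboxDatabaseCopyStatus -Server $badServer
$copies | Select-Object Name, Status |
    Export-Csv -Path .\copies-before-removal.csv -NoTypeInformation

# Dry run: remove -WhatIf once the output looks right.
$copies | Remove-MailboxDatabaseCopy -Confirm:$false -WhatIf

# -ConfigurationOnly is for a server that is offline; drop it if the server is reachable.
Remove-DatabaseAvailabilityGroupServer -Identity $dagName -MailboxServer $badServer -ConfigurationOnly -WhatIf
```

Keeping the CSV around saves you from guessing which databases had a copy on the dead server when it is time to reseed.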

Step 6 – Using the proper installation media. Do NOT use a Service Pack update; it does not include all of the components needed for installation.

Open a Command Prompt with elevated rights.

Change to the Exchange installation directory

Run the following command: setup /m:recoverserver

Step 7 – DAG Members Only – Add the server back into the DAG and reseed the database copies. This will take a LONG time to complete… hope you have a copy of the Lord of the Rings.
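A rough sketch of Step 7, again from the Exchange Management Shell. The server, DAG and database names are invented for illustration; use the list of copies you noted before removing them in Step 5.

```powershell
# Hypothetical names; replace with your own.
$server    = 'EX01'
$dagName   = 'DAG01'
$databases = 'DB01', 'DB02'

# Put the recovered server back into the DAG.
Add-DatabaseAvailabilityGroupServer -Identity $dagName -MailboxServer $server

foreach ($db in $databases) {
    # Adding a copy starts seeding right away unless -SeedingPostponed is specified.
    Add-MailboxDatabaseCopy -Identity $db -MailboxServer $server
}

# Check on the copies while you wait.
Get-MailboxDatabaseCopyStatus -Server $server |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```

If you would rather seed outside business hours, add the copies with -SeedingPostponed and kick off the seed later with Update-MailboxDatabaseCopy.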

Step 8 – Finish any special configuration steps (IIS, RPC Ports, etc) before putting back into production.

 

Now for some examples for some smelly stuff hitting a rotating object and how to fix them… and no, the answer is not PSS.

When you run setup, the progress and errors are logged in C:\ExchangeSetupLogs. If you get any errors during setup, navigate to that directory, open exchangesetup.log and look through the file to grab the full error message.

A lazy man’s method: run this command from a PowerShell session - Select-String -Pattern "[Error]" -AllMatches -SimpleMatch -Path c:\exchangesetuplogs\exchangesetup.log | %{Write-Output $_.Line}
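Setup tends to log the same error several times at different nesting levels, so a slightly less lazy version strips the timestamp and level prefix and drops the repeats. This is only a sketch: Get-SetupLogError is a name I made up, and the prefix pattern is based on the log lines shown in the examples below.

```powershell
function Get-SetupLogError {
    param(
        [string]$Path = 'C:\ExchangeSetupLogs\ExchangeSetup.log'
    )

    # -SimpleMatch treats "[ERROR]" as literal text instead of a regex character class.
    Select-String -Path $Path -Pattern '[ERROR]' -SimpleMatch |
        # Strip the leading "[timestamp] [level] " so identical errors compare equal.
        ForEach-Object { $_.Line -replace '^\[[^\]]+\]\s*\[\d+\]\s*', '' } |
        # Keep only the first occurrence of each distinct message.
        Select-Object -Unique
}

# Usage: Get-SetupLogError
# Or point it at a copy of the log: Get-SetupLogError -Path .\ExchangeSetup.log
```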

Example 1 – You decided to delete the AD Computer object instead of Resetting (Step 1)

[08/04/2012 21:05:40.0690] [2] [ERROR] Unexpected Error

[08/04/2012 21:05:40.0690] [2] [ERROR] Service 'MSExchangeADTopology' failed to reach status 'Running' on this server.

[08/04/2012 21:05:40.0706] [1] [ERROR] The following error was generated when "$error.Clear();

[08/04/2012 21:05:40.0706] [1] [ERROR] Service 'MSExchangeADTopology' failed to reach status 'Running' on this server.

 

The problem - The AD Topology service fails to start because it can’t find any available Domain Controllers; the recreated computer object doesn’t have the proper SACLs.

The fix - Run setup again with /PrepareAD or add the proper permissions (see Step 1 instructions).

 

Example 2 – Setup fails because of missing registry keys.

[08/06/2012 14:53:57.0269] [2] [ERROR] Unexpected Error

[08/06/2012 14:53:57.0285] [2] [ERROR] The registry key "SOFTWARE\Microsoft\ExchangeServer\v14\Transport" does not exist under "HKEY_LOCAL_MACHINE".

[08/06/2012 14:53:57.0332] [1] [ERROR] The following error was generated when "$error.Clear();

[08/06/2012 14:53:57.0332] [1] [ERROR] The registry key "SOFTWARE\Microsoft\ExchangeServer\v14\Transport" does not exist under "HKEY_LOCAL_MACHINE".

 Or

[08/13/2012 15:58:21.0090] [2] [ERROR] Unexpected Error

[08/13/2012 15:58:21.0121] [2] [ERROR] The registry key "SOFTWARE\Microsoft\ExchangeServer\v14\Pickup" does not exist under "HKEY_LOCAL_MACHINE".

[08/13/2012 15:58:21.0293] [1] [ERROR] The following error was generated when "$error.Clear();

[08/13/2012 15:58:21.0293] [1] [ERROR] The registry key "SOFTWARE\Microsoft\ExchangeServer\v14\Pickup" does not exist under "HKEY_LOCAL_MACHINE".

 

The problem - This happens when the recovery made it through some of the Configuring Hub Transport Role section but for some reason or another failed.

The fix – Recreate the registry keys manually or grab them from another Exchange server. There is a big catch to this though… NETWORK SERVICE needs to have Full Control on the Transport, Pickup & Relay keys. If this gets skipped, the next time a Poison message comes through, the Transport service might not be able to recover and restart… Instant Severity 2 case.
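Setting the permission can be scripted too. The sketch below assumes the keys already exist (recreated by hand or imported from a healthy server) and uses the key names mentioned above; it skips anything that is missing rather than creating empty keys. NETWORK SERVICE is referenced by its well-known SID, S-1-5-20, so the name doesn't have to resolve.

```powershell
# NETWORK SERVICE by its well-known SID.
$networkService = New-Object System.Security.Principal.SecurityIdentifier('S-1-5-20')

# Full Control on the key, inherited by its subkeys.
$rule = New-Object System.Security.AccessControl.RegistryAccessRule(
    $networkService,
    [System.Security.AccessControl.RegistryRights]::FullControl,
    [System.Security.AccessControl.InheritanceFlags]::ContainerInherit,
    [System.Security.AccessControl.PropagationFlags]::None,
    [System.Security.AccessControl.AccessControlType]::Allow)

$base = 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\v14'

# Key names as given in the text above.
foreach ($name in 'Transport', 'Pickup', 'Relay') {
    $path = Join-Path $base $name
    if (-not (Test-Path $path)) {
        Write-Warning "$path does not exist; recreate or import it first."
        continue
    }

    $acl = Get-Acl -Path $path
    $acl.SetAccessRule($rule)
    Set-Acl -Path $path -AclObject $acl
}
```

Run it from an elevated PowerShell session, and compare the resulting permissions with a working Exchange server before calling it done.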

 

Example 3 – You decided to use a Service Pack extract to recover the server

[08/13/2012 18:45:18.0447] [2] [ERROR] Unexpected Error

[08/13/2012 18:45:18.0447] [2] [ERROR] Service 'MSExchangeSearch' failed to start due to error:'Cannot start service MSExchangeSearch on computer '.'.'.

[08/13/2012 18:45:18.0447] [2] [ERROR] Cannot start service MSExchangeSearch on computer '.'.

[08/13/2012 18:45:18.0447] [2] [ERROR] The dependency service or group failed to start

[08/13/2012 18:45:18.0462] [1] [ERROR] The following error was generated when "$error.Clear();

[08/13/2012 18:45:18.0462] [1] [ERROR] Service 'MSExchangeSearch' failed to start due to error:'Cannot start service MSExchangeSearch on computer '.'.'.

[08/13/2012 18:45:18.0462] [1] [ERROR] Cannot start service MSExchangeSearch on computer '.'.

[08/13/2012 18:45:18.0462] [1] [ERROR] The dependency service or group failed to start

 

The problem - Do you remember what I said about using the correct installation media? If you are getting this message then it might be time to check your vision. The Service Pack extracts do NOT contain all the necessary pieces to install Exchange. One important component for a Mailbox role is Search. If you aren’t recovering a mailbox server then you might be able to get away with just the Service Pack.

The fix – Obtain valid installation media through MSDN or other means. Grab the msfte.msi file from \Setup\ServerRoles\Mailbox and run this on the server. This will install the missing components and allow setup to continue.

 

Example 4 – You forgot to set the proper binding order for the multiple NICs and you can’t rejoin the server to the DAG.

Exchange will create a subfolder in ExchangeSetupLogs called DAGTasks. The information in the file may or may not be truly helpful to your issue.

 

[2012-08-13T23:20:37] The preceding log entry comes from a different process running on computer 'server1.domain.com'. END

[2012-08-13T23:20:37] The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-13_23-20-25.685_add-databaseavailabiltygroupserver.log".

[2012-08-13T23:20:37] WriteError! Exception = Microsoft.Exchange.Cluster.Replay.DagTaskOperationFailedException: A server-side database availability group administrative operation failed. Error: The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"AddClusterNode() (MaxPercentage=12) failed with 0x35. Error: The network path was not found"' failed. ---> Microsoft.Exchange.Cluster.Replay.AmClusterApiException: An Active Manager operation failed. Error An error occurred while attempting a cluster operation. Error: Cluster API '"AddClusterNode() (MaxPercentage=12) failed with 0x35. Error: The network path was not found"' failed.. ---> System.ComponentModel.Win32Exception: The network path was not found

 

There is nothing on the inter-tubes that will help you diagnose this one. I happened to check how the NICs were configured on a hunch from previous routing issues.

·         Open Control Panel

·         Open Network and Internet

·         Click View network status and tasks

·         Click Change adapter settings

·         Press ALT

·         Click Advanced

·         Click Advanced Settings

·         Verify the Adapters and Binding

·         Make sure that the Data network is first, then Replication, then any iSCSI connections

·         Click OK

·         Add the server to the DAG.
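If you want to check the current order without clicking through the GUI, the read-only sketch below lists the TCP/IP binding order. It assumes the order is stored in the Bind value under the Tcpip\Linkage service key and maps each adapter GUID to its connection name through WMI; it changes nothing, so the reordering itself still happens in Advanced Settings.

```powershell
# Binding order as stored for TCP/IP: one "\Device\{GUID}" entry per adapter, in order.
$bind = (Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Linkage').Bind

# Adapters that expose a GUID, used to translate entries into connection names.
$adapters = Get-WmiObject Win32_NetworkAdapter | Where-Object { $_.GUID }

$position = 0
foreach ($entry in $bind) {
    $guid = $entry -replace '^\\Device\\', ''
    $nic  = $adapters | Where-Object { $_.GUID -eq $guid }
    if ($nic) {
        $position++
        # Print position and connection name, e.g. the Data network should come first.
        '{0}. {1}' -f $position, $nic.NetConnectionID
    }
}
```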

 

Example 5 – Setup previously failed and now it won’t run again… as Will Smith once said, “Parents just don’t understand”

[08/13/2012 16:58:07.0362] [1] [REQUIRED] A Setup failure previously occurred while installing the HubTransport role. Either run Setup again for just this role, or remove the role using Control Panel.

 

The problem – When setup fails, it leaves a watermark in the registry so that it knows where to pick up again. The biggest issue with this is that it also prevents you from successfully recovering the server.

The fix – Remove the watermark and rerun setup

·         Open Regedit

·         Navigate to HKLM\Software\Microsoft\ExchangeServer\V14

·         Look in the HubTransportRole, ClientAccessRole, MailboxRole keys and delete the Watermark entry

·         Restart setup
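The same cleanup as a PowerShell sketch, for those who don't enjoy Regedit. It uses the key and value names from the steps above, prints any watermark it finds, and runs with -WhatIf so nothing is deleted until you remove that switch.

```powershell
$base = 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\v14'

foreach ($role in 'HubTransportRole', 'ClientAccessRole', 'MailboxRole') {
    $path = Join-Path $base $role
    if (-not (Test-Path $path)) { continue }   # role was never installed on this server

    $props = Get-ItemProperty -Path $path
    if ($props.Watermark) {
        # Show what setup left behind before touching it.
        '{0}: Watermark = {1}' -f $role, $props.Watermark

        # Dry run: remove -WhatIf to actually delete the value.
        Remove-ItemProperty -Path $path -Name Watermark -WhatIf
    }
}
```

Exporting the v14 key before deleting anything is cheap insurance.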

 

---------------------------

This isn’t an all-inclusive list of the hurdles that might get in your way, but it’s a start. Always check the setup logs to see if you can determine why setup isn’t working and what you can do to fix it… psstttt…. PSS looks at the same logs, the only difference is they have an internal KB system gleaned from thousands of failed installs.

 

Now for your viewing enjoyment… http://youtu.be/-qTIGg3I5y8
