Channel: TechNet Blogs

Fixing When Your Domain Traveled Back In Time, the Great System Time Rollback to the Year 2000


Hey y’all, Mark back again with some more detail around what to do when the November 19th system time rollback caused Active Directory replication and other time-sensitive operations to fail in your environment. This post contains guidance from a small army of Microsoft PFEs, support professionals and developers. If you have any questions about the recommendations in this post, feel free to give CTS a call and they can guide you through the recovery. Recovering from a time rollback is a complex situation, so read each step carefully and don’t skip ahead or you’ll make the problem worse. Also, this post is going to be a long one and will probably break the record for additional links, so you’ll want to get comfortable.

Here is what this post is going to cover.

How Did This Happen?

What Are The Symptoms?

Mitigation

1.) Correct Time

2.) Check For Replication Errors

3.) Additional Mitigation

Ongoing Tasks

How Did This Happen?

On November 19th, 2012, time servers at USNO.NAVY.MIL incorrectly provided time samples listing CY 2000 as the current year between the hours of 21:07 UTC and 21:59 UTC (16:07-16:59 EST). Get more info here.

Forests most impacted by this time rollback shared two traits:

1. The forest root PDC or master time servers in the forest lacked the time jump protection discussed in KB 884776 (probably because they were running the W2K3 OS)

2. The forest contained Windows Server 2003 DCs (more on this below)

Windows added support for time jump protection starting with Windows Server 2003 (and XP member workstations) in the form of two registry values: MaxPosPhaseCorrection and MaxNegPhaseCorrection (we’ll refer to both of these values going forward as max*phasecorrection). By default, the max*phasecorrection settings are not populated on Windows Server 2003 DCs. As a result, such DCs adjust the system time after receiving forward- or back-dated time samples. Windows Server 2008 and later DCs set the max*phasecorrection settings to 48 hours and ignore time samples that vary by more than 48 hours from the local system time.

Time jump protection is not defined on Windows member workstations or servers until an administrator enables it, for the following reasons. Microsoft Commercial Support has observed massive time jumps (from days to multiple decades in the past and future) in customer forests for the last 10 years. Multiple root causes exist, but until now they have never included a highly accurate time server giving out inaccurate time. While the max*phasecorrection settings offer a degree of protection when the time service is running, they offer no protection when inaccurate time is adopted during a reboot or while the time service is not running. Furthermore, the use of max*phasecorrection can prevent client and server computers from adjusting back to accurate time. While smaller max*phasecorrection values make Windows time clients less susceptible to adopting bad time, they also make it hard for such clients to self-correct if good time varies by more than max*phasecorrection seconds in the past or future. For example, setting max*phasecorrection to, say, 1 hour would prevent a time client from self-correcting after a time zone or AM | PM misconfiguration. Given the ratio of domain controllers to member servers and workstations, Microsoft elected not to configure time jump protection on such computers. More information on time jump protection can be found in KB 884776.
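To make the trade-off concrete, here is a minimal Python sketch of the accept-or-ignore decision that max*phasecorrection implies. It is a simplified model and not the actual W32Time implementation; the function name and the comparison logic are assumptions for illustration, and the 48 hour default simply mirrors the value mentioned above for Windows Server 2008 and later DCs.

```python
from datetime import datetime, timezone

# 48 hours in seconds, mirroring the Windows Server 2008+ DC default cited above.
DEFAULT_MAX_PHASE_CORRECTION_SECONDS = 48 * 60 * 60


def would_accept_sample(local_time, sample_time,
                        max_pos_seconds=DEFAULT_MAX_PHASE_CORRECTION_SECONDS,
                        max_neg_seconds=DEFAULT_MAX_PHASE_CORRECTION_SECONDS):
    """Simplified model: accept a time sample only if it falls inside the
    window allowed by MaxPosPhaseCorrection / MaxNegPhaseCorrection."""
    offset = (sample_time - local_time).total_seconds()
    if offset >= 0:
        return offset <= max_pos_seconds   # sample is ahead of local time
    return -offset <= max_neg_seconds      # sample is behind local time


local = datetime(2012, 11, 19, 21, 7, tzinfo=timezone.utc)
rolled_back = datetime(2000, 11, 19, 21, 7, tzinfo=timezone.utc)

# A sample dated 12 years in the past is far outside a 48 hour window.
print(would_accept_sample(local, rolled_back))  # prints False

# With a 1 hour limit, a client that is 12 hours off (AM | PM mix-up)
# cannot self-correct from a good sample either.
wrong_clock = datetime(2012, 11, 19, 9, 0, tzinfo=timezone.utc)
good_sample = datetime(2012, 11, 19, 21, 0, tzinfo=timezone.utc)
print(would_accept_sample(wrong_clock, good_sample, 3600, 3600))  # prints False
```

The second call shows why a very small limit is a double-edged sword: the same check that blocks bad samples also blocks good ones when the local clock is the one that is wrong.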


Additional information about how time works in an AD forest can be found in this document. The CliffsNotes version follows: the root PDC gets time from a reliable time source, which could be a highly accurate GPS clock, reference time servers on the internet, or one or more references from both of those groups. Time flows hierarchically from the root to child and then grandchild domains. Some other things to take note of when configuring a reliable time source on the root PDC or manual time servers: the best practice is to source from stratum 2 or stratum 3 reference time sources. If you are configuring multiple time sources, all time sources should have the SAME stratum level AND the same stratum level as the previously configured external time source.

What Are The Symptoms?

If system time moved back to year 2000 on November 19th between the hours of 21:07 UTC and 21:59 UTC (16:07-16:59 EST), it’s a pretty safe bet you were affected by the time rollback at USNO.NAVY.MIL.

It’s also possible that your domain moved from current time back to November 19th 2000 then back to current time.

Assuming your event logs have not wrapped, one clue that your DCs experienced a time rollback is to look for calendar year 2000 events bracketed by year 2012 events.

Event Source:        NTDS Replication
Event Category:     Replication
Event ID:  2042
Date:                    11/21/2012
User:                     NT AUTHORITY\ANONYMOUS LOGON
Computer:             ContosoDC

Description:

It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.

The reason that replication is not allowed to continue is that the two machine's views of deleted objects may now be different. The source machine may still have copies of objects that have been deleted (and garbage collected) on this machine. If they were allowed to replicate, the source machine might return objects which have already been deleted.

Time of last successful replication:

2000-11-19 14:09:12
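The "year 2000 events bracketed by year 2012 events" clue can be checked mechanically once you have exported event timestamps in log order. The following Python sketch is one way to do that; the function name and the one year threshold are assumptions for illustration, and it works on any list of timestamps rather than reading the Windows event log directly.

```python
from datetime import datetime, timedelta


def find_rollback_windows(timestamps, min_jump=timedelta(days=365)):
    """Given event timestamps in the order they were logged, return
    (first_index, last_index) pairs for runs of events that are dated
    at least `min_jump` earlier than the event logged just before them."""
    windows = []
    start = None
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if start is None and delta <= -min_jump:
            start = i                          # clock jumped backwards here
        elif start is not None and delta >= min_jump:
            windows.append((start, i - 1))     # clock jumped forward again
            start = None
    if start is not None:
        # The log ends while the clock is still rolled back.
        windows.append((start, len(timestamps) - 1))
    return windows


events = [
    datetime(2012, 11, 19, 21, 0),
    datetime(2000, 11, 19, 21, 10),
    datetime(2000, 11, 19, 21, 30),
    datetime(2012, 11, 19, 22, 5),
]
print(find_rollback_windows(events))  # prints [(1, 2)]
```

Here the two events at index 1 and 2 carry year 2000 dates and sit between year 2012 events, which is the pattern to look for.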

Other side effects of a time rollback.

Active Directory replication fails with Event 2042 reporting “It has been too long since this machine last replicated” and replication status 8614: “The Active Directory cannot replicate with this server because the time since the last replication with this server has exceeded the tombstone lifetime.”

Operations and applications that require Kerberos authentication, including Active Directory replication and even access to a file server, fail with one of two errors:

Error 5: access is denied

Or

AD Replication Status -2146893022: target principal name is incorrect

Other date-dependent operations and applications may also fail including those based on lease intervals, caching or object lifetime (think DHCP, DNS, object lifecycles, date-driven password changes on computer accounts, trust relationships).

Mitigation

If you’ve made it this far you are probably affected by the time jump. I can’t stress enough that following the recovery steps in ORDER is key. Taking shortcuts can actually make things worse, so stay on the path. If you are unsure about any of the recovery steps, contact CTS and we can help you through this. Don’t be a hero. Here is a bird’s eye view of what we are about to take on.

 
1.) Correct Time

a. Don’t immediately reboot

b. Configure each forest root PDC with reliable time sources

c. Monitor time on DCs and critical application servers

d. Add time jump protection to servers with good time

e. Re-monitor time on DCs and critical application servers

f. Correct Servers with inaccurate time

2.) Check for Replication Errors

a. Fix DCs with replication Event 2042/ Replication Error 8614

i. Confirm that strict replication is enabled

ii. Check for lingering objects and remove if present

iii. Set “Allow Replication With Divergent and Corrupt Partner” to 1

iv. Trigger replication or wait for scheduled replication to occur

v. Troubleshooting Error -2146893022: target principal name is incorrect or 5: access is denied

b. Confirm FRS Is Working

3.) Additional Mitigation

1a.) Don’t immediately reboot.

I know this goes against all your instincts when something is wrong, but don’t just reboot. AD Replication, Kerberos and possibly secure channels on trusts and computer accounts could be impacted by the time jump. A reboot may trigger the reading of invalid time on OS shutdown or on subsequent OS startup, especially on virtualized guest computers. Don’t try to fix replication and authentication until the system time is corrected. We’ll get to the other issues later.

1b.) Configure each forest root PDC with reliable time sources

Windows computers use NTP to source time. NTP assigns stratum levels to define how close a given computer is to the reference time source. Stratum 2 servers source time from government and military stratum 1 computers, which source time from stratum 0 atomic clocks and GPS satellites.

Domain-joined Windows clients and servers use the NT5DS hierarchy by default. For example, a stratum 3 forest root PDC or manually configured Windows master time servers source time from an external stratum 2 time server. Time then flows hierarchically down to domain controllers and clients in subordinate domains. Once again, here are some guidelines when configuring external time servers.

1.) Verify that new and existing external time servers have a stratum level no lower than 2 and no higher than 3

2.) When adding a new external time server, make sure it has the same or a lower stratum level than the previously configured external time source. For example, if the old time server was stratum 3 and the new one is stratum 4, clients will not accept this time change until the time service is restarted. To say this another way, the time service on every member server would need to be restarted or the servers will start to drift.

3.) A stratum level of 0 represents an uninitialized time server and is invalid. Do not use time sources that advertise stratum 0.

To identify stratum level for a reference time server, run the following command

w32tm /stripchart /packetinfo /computer:<DNS name or host name of time source>

In the output there should be a value for stratum. In this case you’ll want to pick a nearby stratum 2 server. A list of stratum 2 servers can be found here
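If you want to check the stratum of a candidate time source from a machine without w32tm, the stratum is carried in the second byte of every NTP response packet. The sketch below is a minimal Python illustration under that assumption; the function names are made up for this example, the host you pass in is up to you, and w32tm /stripchart remains the approach this post relies on.

```python
import socket

NTP_PORT = 123
NTP_PACKET_SIZE = 48


def parse_stratum(packet: bytes) -> int:
    """Return the stratum field (second byte) of an NTP packet."""
    if len(packet) < NTP_PACKET_SIZE:
        raise ValueError("NTP packets are at least 48 bytes long")
    return packet[1]


def stratum_is_acceptable(stratum: int) -> bool:
    """Apply the guideline above: no lower than 2 and no higher than 3.
    Stratum 0 (uninitialized) is rejected as well."""
    return 2 <= stratum <= 3


def query_stratum(host: str, timeout: float = 5.0) -> int:
    """Send a basic NTP client request and read the stratum from the reply."""
    # First byte 0x1b = leap indicator 0, version 3, mode 3 (client).
    request = b"\x1b" + bytes(NTP_PACKET_SIZE - 1)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, NTP_PORT))
        response, _ = sock.recvfrom(512)
    return parse_stratum(response)


# Offline demonstration with a hand-built reply (mode 4 = server, stratum 2).
fake_reply = bytes([0x1C, 2]) + bytes(NTP_PACKET_SIZE - 2)
print(parse_stratum(fake_reply), stratum_is_acceptable(parse_stratum(fake_reply)))
```

Calling query_stratum with the DNS name of your time source performs a live query over UDP port 123, so it needs that port to be reachable from where you run it.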

To configure the root PDC with a reliable time source, run the following command:

w32tm /config /update /manualpeerlist:<DNS name of time source> /reliable:YES /syncfromflags:MANUAL

This will do the following:

- Sets w32time to manually sync from the NTP server you provide

- Sets the “Reliable Time Source” flag for this machine in NETLOGON.

- Prevents w32time from discovering any machines in the domain as a time source.

We’ll then want to resync the time by running this command:

w32tm /resync /rediscover /nowait

- Updates the configuration and forces it to be immediately applied.

A little side note about the forest root PDC. The machine configured as the reliable time source for the forest probably should NOT be a virtual machine for two reasons:

1. The built-in time synchronization between the guest OS and host needs to be turned off so that the configured time source is actually used. Unless this is done, the machine will still get its time from the host regardless of the time source configured.

2. Virtual machines maintain “stable” time by constantly getting time updates from the host. Even with (1) being done, the virtual machine will likely be “less stable” (meaning that more time drift will be seen by clients syncing from it).

Let’s verify our previous commands worked as expected. To do that, we are going to run the following command.

w32tm /query /configuration /verbose

We will check 3 things in the output.

-The AnnounceFlags value should be >= 8

-The Type (under ‘NtpClient’ in [TimeProviders]) should be NTP

-The NtpServer (under ‘NtpClient’ in [TimeProviders]) should be the time provider you provided.

We can further confirm this by asking NLTEST to locate a DC with the GTIMESERV flag; DCLOCATOR should find the DC in question. If it doesn’t, the NETLOGON changes might not have propagated, so try targeting the DC directly with /SERVER:domainController

nltest /dsgetdc:<domain name> /GTIMESERV [/FORCE]

[Screenshot: NLTEST output]

As we can see here we have the TIMESERV flag so we are good to go. For more information around this topic, read these 2 links.

http://blogs.msdn.com/b/w32time/archive/2008/04/02/configuring-a-standalone-time-server.aspx
http://blogs.msdn.com/b/w32time/archive/2008/05/29/to-be-reliable-or-not-to-be-reliable.aspx

1c.) Monitor time on DCs and critical application servers

Now that the root PDC is getting the correct time, we need to figure out which other members have bad time. There are a few ways to do this. To grab the time from all your DCs, fellow PFE Tom Moser wrote a script to help. You’ll want to start at the root domain, focusing on the DCs, then the virtualized hosts, then application servers, in priority order.

The script is located at the bottom of the post.

Syntax: running .\Get-TimeInfo.ps1 writes the CSV output to the working directory as DCTimes.csv

1d.) Add time jump protection to servers with good time

We will now want to configure the servers that already have good time with the MaxPosPhaseCorrection and MaxNegPhaseCorrection registry settings. These settings prevent Windows computers from adopting time when time servers send forward- or back-dated time samples. Once again, we’ll want to follow KB 884776.
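If you have a long list of servers to protect, it can help to generate the registry commands rather than type them. The Python sketch below only builds reg add command strings for the two values under the W32Time Config key; it does not touch the registry itself. The function name is hypothetical, and the 172800 second (48 hour) value is an assumption that mirrors the Windows Server 2008 DC default, so check KB 884776 for the values recommended for your server roles before using it.

```python
# Registry key that holds the Windows Time service configuration values.
W32TIME_CONFIG_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\W32Time\Config"

PHASE_CORRECTION_VALUES = ("MaxPosPhaseCorrection", "MaxNegPhaseCorrection")


def build_reg_commands(max_seconds=172800):
    """Return one 'reg add' command per phase correction value.

    max_seconds is an assumed limit (48 hours by default); it must fit in
    a REG_DWORD.
    """
    if not 0 < max_seconds <= 0xFFFFFFFF:
        raise ValueError("max_seconds must fit in a REG_DWORD and be positive")
    return [
        f'reg add "{W32TIME_CONFIG_KEY}" /v {name} /t REG_DWORD /d {max_seconds} /f'
        for name in PHASE_CORRECTION_VALUES
    ]


for command in build_reg_commands():
    print(command)
```

Review the printed commands before running them from an elevated command prompt on each server that already has good time.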

1e.) Re-monitor time on DCs and critical application servers

Using the same strategy as in step 1c, you’ll want to re-monitor the time in your environment to find out which DCs and critical application servers still have incorrect time. Also note that the time on some servers may have gone from bad to good now that the root PDC is giving out proper time.

1f.) Correct Servers with inaccurate time

On servers that still have the bad time you’ll want to do the following.

-Stop the time service (net stop W32time or Services Pane)

-Reset the time of the server by using the net time command to point it at a good time server

-net time \\goodtimeserver /set

-This will then ask you to confirm that you want to set the time of the local computer to match the time of the \\goodtimeserver you provided. Hit Y

-Verify the system time is now good.

-Once again set the MaxPosPhaseCorrection and MaxNegPhaseCorrection registry settings

-Start the time service (net start W32time or Services Pane)

2) Check for Replication Errors

Maybe today is your lucky day and you don’t have any replication errors. We have two ways to quickly check. The first option is ADReplStatus, a recently released replication status reporting tool available for download from Microsoft.com. Keep this in your tool box for future use.

1.) Download ADReplStatus then install and run the tool. After the tool completes the replication status phase, click the Errors Only button in the toolbar. Click on column headers or drag column headers to the filter bar to provide the view that helps you focus on domain controllers, partitions and replication errors of interest.

Here is what the replication status might look like in your environment if you've encountered this issue:

[Screenshot: ADReplStatus replication status report]

If you prefer REPADMIN, here are the steps:

1. From a DC, run the following command to generate a forest-wide replication status report:

repadmin /showrepl * /csv >showrepl.csv

2. Open the file in Excel and filter on the last replication status result column (column K) to identify DCs with the replication failures (replication status 8614 is commonly associated with this issue).
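If you would rather not open Excel, the same filter can be applied with a few lines of Python. This is a minimal sketch that treats column K (the eleventh column) as the last replication status, exactly as described in step 2; the function name and the sample rows are made up for illustration, and the header names in the sample are only there to make the data readable.

```python
import csv
import io

# Column K in Excel is the eleventh column, so index 10 when counting from 0.
LAST_STATUS_INDEX = 10


def failing_rows(csv_text, status_index=LAST_STATUS_INDEX):
    """Return the rows whose last replication status is present and not 0."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    failures = []
    for row in rows[1:]:                      # skip the header row
        if len(row) <= status_index:
            continue                          # malformed or short row
        status = row[status_index].strip()
        if status and status != "0":
            failures.append(row)
    return failures


# Illustrative sample data, not real output.
SAMPLE = (
    "showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,"
    "Source DSA Site,Source DSA,Transport Type,Number of Failures,"
    "Last Failure Time,Last Success Time,Last Failure Status\n"
    'showrepl_INFO,Site1,DC1,"DC=contoso,DC=com",Site1,DC2,RPC,0,0,'
    "2012-11-21 10:00:00,0\n"
    'showrepl_INFO,Site1,DC2,"DC=contoso,DC=com",Site1,DC1,RPC,5,'
    "2012-11-21 10:05:00,2000-11-19 14:09:12,8614\n"
)

for row in failing_rows(SAMPLE):
    print(row[2], "<-", row[5], "status", row[LAST_STATUS_INDEX])
```

To run it against your own report, read showrepl.csv into a string and pass that to failing_rows instead of SAMPLE.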

Not so lucky, huh? That’s OK, read on and we’ll get you going.

2a) Fix DCs with replication Event 2042/ Replication Error 8614

We are now going to tackle each DC individually. Follow all the steps provided until it is replicating properly then move on to the next DC in the list starting right back here at the top.

To prevent the spread of lingering objects, the operating system halts replication if a destination DC hasn’t inbound replicated over a given connection in tombstone lifetime (TSL) # of days (default 60 or 180 days). There are 2 scenarios that can trigger this behavior: (1.) a destination DC really did fail inbound replication for TSL # of days, or (2.) the replication engine has the “appearance” of having failed for TSL # of days due to a time jump.
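The second scenario is easy to see with a little date arithmetic. The Python sketch below is a simplified model of the "it has been too long" check, not the actual replication engine logic; the function name is hypothetical, and the dates come from the sample Event 2042 shown earlier in this post.

```python
from datetime import datetime, timedelta


def exceeds_tombstone_lifetime(last_success, now, tombstone_lifetime_days=60):
    """Simplified model: replication is quarantined when the time since the
    last successful replication is longer than the tombstone lifetime."""
    return (now - last_success) > timedelta(days=tombstone_lifetime_days)


# Last success was stamped while the clock said year 2000; the event was
# logged after the clock returned to 2012.
stamped_during_rollback = datetime(2000, 11, 19, 14, 9, 12)
event_logged = datetime(2012, 11, 21)

for tsl_days in (60, 180):
    print(tsl_days, exceeds_tombstone_lifetime(stamped_during_rollback,
                                               event_logged, tsl_days))
# prints:
# 60 True
# 180 True
```

The DC may have replicated only two days earlier in real time, but because the success was recorded with a year 2000 timestamp, the gap looks like twelve years and trips the quarantine for either default tombstone lifetime.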

2ai) Confirm that strict replication is enabled

Remember how I warned earlier in this post about not skipping steps? This is what I’m talking about. You can make things much worse later if you don’t do this step. Strict mode replication prevents lingering objects from being replicated or reanimated on destination DCs that have already used garbage collection to permanently purge intentionally deleted objects. You will want to enable this on all DCs in the forest if it’s not enabled already.

Note: This command has to be run from an elevated command prompt with Enterprise Admin credentials

repadmin /regkey * +strict > strict.txt

The command enables strict replication on all DCs in the forest by modifying the following registry path, setting and value: 

Registry path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters
Registry setting (DWORD Value -not case sensitive): Strict Replication Consistency
Value: 1

2aii) Check for lingering objects and remove if present

There has been a lot written on lingering objects so we won’t get into too much here. Use the free tool Repldiag created by fellow PFE Ken Brumfield and check out this post by PFE Glenn LeCheminant http://blogs.technet.com/b/glennl/archive/2007/07/26/clean-that-active-directory-forest-of-lingering-objects.aspx to get that all cleaned up.

Additional ReplDiag Resources

Cleaning lingering objects across the forest with ReplDiag.exe [Part 2 of 4]

Why does ReplDiag.exe error out with the message that the topology isn’t stable? [Part 3 of 4]

Can I clean one partition at a time with ReplDiag, and other tips [Part 4 of 4]

Note: DCs that ran in strict replication consistency prior to the time jump likely have few to no lingering objects to remove. Those that ran in loose replication consistency prior to the jump likely contained lingering objects prior to the November 19th time jump. The enabling of strict replication is generally a requirement to stop the spread of lingering objects. Failure to enable strict replication during lingering object cleanup typically means such DCs will inbound replicate the just removed objects from another DC. At the same time, the enabling of strict replication may block needed Active Directory changes located in the replication queue (AD Replication Status 8606/8333 and Event ID 1988). Evaluate whether loose replication needs to be configured so that replication can occur to run the business with the notion of scheduling a more exhaustive cleanup when time permits.

2aiii) Set “Allow Replication With Divergent and Corrupt Partner” to 1

The next step is to disable the time-based replication quarantine via repadmin or by using regedt32:

You must run the following command from a repadmin.exe version included in the RSAT tools (Windows Server 2008 or later) or from a server that has the AD DS role installed. You’ll also want to run this from an admin-elevated command prompt.

Set the value on a single DC (destination DC in replication report) at first and then expand scope of command as needed.

DO NOT SET THIS KEY UNTIL YOU CONFIRM that strict replication was enabled on destination DCs logging replication error status 8614/Directory Services Event 2042. The only time you should relax the “it has been too long” replication quarantine is if destination DCs are configured with strict replication or you have first tested for and removed lingering objects if present. If you don’t want to enable strict replication consistency, check for and remove lingering objects before relaxing the “allow replication with divergent or corrupt replication partner” setting

Repadmin /regkey DestinationDCName +allowDivergent

If you don’t have that available here is how you can do it via registry.

Registry path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters
Registry setting (DWORD Value -not case sensitive): Allow Replication With Divergent and Corrupt Partner
Value data: 1

A value of 1 is used to allow replication to occur even though replication hasn't completed in tombstone lifetime number of days over a given replication connection. It is important to put this protection back in place after the environment has been recovered by setting the value back to 0 when we are done (0 = disallow, 1 = allow). We have a reminder to do this down the road.
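Because the order of operations matters so much here, it can be worth encoding the precondition in whatever runbook or script you use. The Python sketch below is one hypothetical way to do that: it refuses to hand out the repadmin command that relaxes the quarantine unless strict replication is enabled or lingering objects have been removed, and it always returns the matching command to restore the protection. The function name and dictionary keys are made up for this example; only the two repadmin commands come from this post.

```python
def quarantine_override_commands(dc_name, strict_enabled, lingering_objects_removed):
    """Return the repadmin commands to relax and later restore the
    'it has been too long' quarantine on one destination DC.

    Raises ValueError when neither precondition from this post is met.
    """
    if not (strict_enabled or lingering_objects_removed):
        raise ValueError(
            f"Do not relax the quarantine on {dc_name}: enable strict "
            "replication or remove lingering objects first."
        )
    return {
        "relax": f"repadmin /regkey {dc_name} +allowDivergent",
        "restore": f"repadmin /regkey {dc_name} -allowDivergent",
    }


commands = quarantine_override_commands("DestinationDCName",
                                        strict_enabled=True,
                                        lingering_objects_removed=False)
print(commands["relax"])
print(commands["restore"])
```

Keeping the restore command next to the relax command is a small reminder that the value needs to go back to 0 once the environment has recovered.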

For more info

Troubleshooting AD Replication error 8614: "The Active Directory cannot replicate with this server because the time since the last replication with this server has exceeded the tombstone lifetime"

http://support.microsoft.com/kb/2020053

AD Replication Error 8614 (event ID 2042)

http://technet.microsoft.com/en-us/library/cc757610(v=ws.10).aspx

2aiv) Trigger replication or wait for scheduled replication to occur

You can now force replication to occur or wait for it to follow its normal schedule. If everything is working as expected without any errors, you are done. If that’s the case, make sure you remove the setting to allow replication with corrupt and divergent partners. Once again, this should be run from RSAT tools (Windows Server 2008 or later)

Repadmin /regkey DestinationDCName -allowDivergent

2av) Troubleshooting Error -2146893022: target principal name is incorrect or 5: access is denied

Common errors seen around this are Error -2146893022: target principal name is incorrect or 5: access is denied.

The easiest way to resolve this is to disable the Kerberos Key Distribution Center (KDC) service and simply reboot the DC. Don’t worry, it’s OK now; the time is fixed, remember. Recheck replication. If it’s working as expected now, make sure you remove the setting to allow replication with corrupt and divergent partners. Once again, this should be run from RSAT tools (Windows Server 2008 or later)

Repadmin /regkey DestinationDCName -allowDivergent

If it’s still not working follow these detailed steps below.

Error -2146893022: target principal name is incorrect

This can have multiple root causes but we commonly encounter this replication status in this scenario because the DC has invalid Kerberos tickets.

Each DC impacted by this issue (source DC in AD Replication report) will need new tickets issued by a KDC other than itself.

1. Stop the Kerberos Key Distribution Center service. Do not stop this service on ALL DCs in a given domain; you won’t be if you are following the directions and troubleshooting 1 DC at a time.

*The remaining KDCs must be reachable across the network.

2. Purge local system tickets

3. Run the following command from an elevated command prompt:

Klist -li 0x3e7 purge

4. Test replication

5. If replication fails with the same error then a reboot may be necessary as we may have failed to flush tickets in the right context.

An interesting thing that may also be happening if this doesn’t work is that we are getting a ticket from a DC that is still broken. If that is the case stop the KDC service on all bad DCs leaving at least one per domain online.

For more info feel free to check out

Troubleshooting AD Replication error -2146893022: The target principal name is incorrect.

http://support.microsoft.com/kb/2090913

Once again, if this resolves your problem, make sure to remove the allowDivergent setting so the replication quarantine is back in place. This should be run from RSAT tools (Windows Server 2008 or later)

Repadmin /regkey DestinationDCName -allowDivergent

If you encounter replication status 5 "Access is denied" between domain controllers in different domains:

Temporarily add the Replicator Allow SPN Fallback registry value. To do this, follow these steps.
Note Perform steps 1 through 6 on this same domain controller.

    1. Click Start, click Run, type regedit, and then click OK.
    2. Locate and then click the following registry subkey:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

    3. On the Edit menu, point to New, and then click DWORD Value.
    4. Type Replicator Allow SPN Fallback, and then press ENTER.
    5. Double-click Replicator Allow SPN Fallback in the right-pane, type 1 in the Value data box, and then click OK.
    6. Restart the domain controller.

After this has been solved, don’t forget to delete this value or change it back to 0, then restart the domain controller to reverse the setting once recovery operations are complete.

Find out more info at

Troubleshooting AD Replication error 5: Access is denied

http://support.microsoft.com/kb/2002013

Orphaned child domain controller information may not be replicated to other Windows 2000 Server-based domain controllers

http://support.microsoft.com/kb/887430

Once again, if this resolves your problem, make sure to remove the allowDivergent setting so the replication quarantine is back in place. This should be run from RSAT tools (Windows Server 2008 or later)

Repadmin /regkey DestinationDCName -allowDivergent

 
2b) Confirm FRS Is Working

The File Replication Service will be negatively impacted by time jumps as well. It is quite possible that changes to FRS replicated content are not replicating after the return to the correct time settings. This can be especially crucial for SYSVOL content. There is a backgrounder on this in the Microsoft Knowledge Base:

289668 Advancing time on production computers and the effect on Active Directory and FRS

http://support.microsoft.com/kb/289668/EN-US

The impact on FRS depends on the duration the environment was using the incorrect time and what changes have been happening during that time. If you are encountering problems with FRS, we recommend you contact Microsoft Support Services to investigate the problems and determine a resolution.

In the worst case you need to restart FRS for SYSVOL in the domain. If you are in a large scale environment you will want to contact CTS for support with this. The steps are in this KB article:

289668 Advancing time on production computers and the effect on Active Directory and FRS

http://support.microsoft.com/kb/289668/EN-US

Note You do not need to follow the steps to rebuild the file system objects such as directories and junctions.

3) Additional Mitigation

Now that we have everything under control in Active Directory, we’ll want to go ahead and set this time jump protection on the rest of the servers in the environment. You may configure max*phasecorrection directly in the registry by following KB 884776.

Ongoing Tasks

Continue to run REPADMIN or ADREPLSTATUS to detect these and other AD Replication failures. Resolve replication errors prioritized by failure duration and criticality within the replication topology.

I hope this extremely long blog post has helped you recover from this issue. Don’t forget, if you have any questions, contact CTS to help get this resolved.

-Mark Morowczynski, Justin Turner, A. Conner

