Hi there,
It's been a long while since I last blogged here, and it's time to come back and continue sharing our field experiences with the IT community, hoping to shed light on similar problems.
I was tasked with a customer problem where end users were reporting issues like "cannot access the file server, getting authentication prompts". The IT admins were also observing various problems on the same server: GPOs weren't being applied properly, the Netlogon service was complaining about DC access issues, and so on. At times, they were even able to reproduce the issue manually by running a "telnet DC-IP 389" command from the affected server.
There could be many reasons behind such symptoms, so I decided to collect a number of logs while the issue was reproduced:
a) TCPIP ETL trace:
You can collect it with the commands below on a Windows client/server (from an elevated command prompt):
netsh trace start capture=yes scenario=InternetClient
<<repro>>
netsh trace stop
b) Network trace:
This can be collected in different ways, for example with the above command, Wireshark, Network Monitor or Message Analyzer.
c) Handle outputs
This could be collected as follows:
Note: The Handle tool (Handle v4.1) can be downloaded from the following link: https://technet.microsoft.com/en-us/sysinternals/handle.aspx
handle.exe -a -u >> %computername%_handledetails.txt
handle.exe -s >> %computername%_handlesummary.txt
ANALYSIS:
========
The logs were collected while reproducing the issue with the telnet command on the server. After the logs were shared with us, I checked various things to understand why the outbound connection might be failing. (By the way, the file server's inability to authenticate incoming users was a side effect of this issue, since the file server couldn't verify client credentials over the Netlogon secure channel.)
1) I first checked the network traces, but there were no outgoing connection attempts (no TCP SYNs sent to the target server), which means the issue was local to the server itself.
2) Then I checked the TCPIP ETL trace and observed the root cause:
Note: You can open up the ETL file that is generated as a result of running netsh command in Network Monitor or Message Analyzer
[0]03E0.5214::01/04/18-15:07:37.5237622 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..
[0]58F0.4558::01/04/18-15:07:51.8242042 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..
[0]04D8.072C::01/04/18-15:07:52.0110322 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..
...
Actually that clearly explained why the outbound connections were failing: PORT EXHAUSTION.
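If you export the ETL trace to text, a small script can tell you how many of these bind failures occurred and which processes hit them. The following is only a minimal sketch: the line layout is assumed from the three sample events above, and reading the first hexadecimal field as the process ID is my assumption about that format. The function name is hypothetical.

```python
import re
from collections import Counter

# Layout assumed from the sample events above:
# [cpu]PID.TID::MM/DD/YY-HH:MM:SS.fffffff [provider] message
# PID and TID are assumed to be hexadecimal.
BIND_FAILURE = re.compile(
    r"\[\d+\](?P<pid>[0-9A-Fa-f]+)\.(?P<tid>[0-9A-Fa-f]+)::"
    r"\d\d/\d\d/\d\d-\d\d:\d\d:\d\d\.\d+\s+"
    r"\[Microsoft-Windows-TCPIP/Diagnostic\].*bind failed"
)


def count_bind_failures(lines):
    """Return a Counter of bind-failure events keyed by decimal process ID."""
    per_pid = Counter()
    for line in lines:
        match = BIND_FAILURE.search(line)
        if match:
            # Convert the hexadecimal PID so it can be compared with
            # the decimal PIDs shown by other tools.
            per_pid[int(match.group("pid"), 16)] += 1
    return per_pid
```

Passing an open text file to `count_bind_failures` and calling `.most_common()` on the result lists the affected processes first. Keep in mind that the processes failing to bind are the victims of the exhaustion, not necessarily the process leaking the ports.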
3) The main reason behind the port exhaustion was a socket leak caused by an outdated 3rd party AV software (from the handle.exe output):
Note: The process name was deliberately changed
ABC.exe pid: 1148 NT AUTHORITY\SYSTEM
144: File (---) \Device\Afd
148: File (---) \Device\Afd
220: File (---) \Device\Afd
224: File (---) \Device\Afd
22C: File (---) \Device\Afd
230: File (---) \Device\Afd
29C: File (---) \Device\Afd
2B4: File (---) \Device\Afd
2B8: File (---) \Device\Afd
2BC: File (---) \Device\Afd
2C0: File (---) \Device\Afd
308: File (---) \Device\Afd
320: File (---) \Device\Afd
32C: File (---) \Device\Afd
338: File (---) \Device\Afd
340: File (---) \Device\Afd
344: File (---) \Device\Afd
350: File (---) \Device\Afd
420: File (---) \Device\Afd
440: File (---) \Device\Afd
444: File (---) \Device\Afd
47C: File (---) \Device\Afd
480: File (---) \Device\Afd
488: File (---) \Device\Afd
48C: File (---) \Device\Afd
498: File (---) \Device\Afd
4E0: File (---) \Device\Afd
500: File (---) \Device\Afd
578: File (---) \Device\Afd
5A0: File (---) \Device\Afd
5A4: File (---) \Device\Afd
5A8: File (---) \Device\Afd
5AC: File (---) \Device\Afd
5C8: File (---) \Device\Afd
5F0: File (---) \Device\Afd
630: File (---) \Device\Afd
658: File (---) \Device\Afd
65C: File (---) \Device\Afd
66C: File (---) \Device\Afd
694: File (---) \Device\Afd
69C: File (---) \Device\Afd
6C0: File (---) \Device\Afd
6C4: File (---) \Device\Afd
6D4: File (---) \Device\Afd
6EC: File (---) \Device\Afd
700: File (---) \Device\Afd
708: File (---) \Device\Afd
720: File (---) \Device\Afd
728: File (---) \Device\Afd
72C: File (---) \Device\Afd
730: File (---) \Device\Afd
734: File (---) \Device\Afd
738: File (---) \Device\Afd
740: File (---) \Device\Afd
744: File (---) \Device\Afd
748: File (---) \Device\Afd
760: File (---) \Device\Afd
764: File (---) \Device\Afd
768: File (---) \Device\Afd
770: File (---) \Device\Afd
774: File (---) \Device\Afd
780: File (---) \Device\Afd
788: File (---) \Device\Afd
790: File (---) \Device\Afd
794: File (---) \Device\Afd
79C: File (---) \Device\Afd
7A0: File (---) \Device\Afd
7A4: File (---) \Device\Afd
7A8: File (---) \Device\Afd
7AC: File (---) \Device\Afd
7B4: File (---) \Device\Afd
7BC: File (---) \Device\Afd
7D4: File (---) \Device\Afd
7D8: File (---) \Device\Afd
7DC: File (---) \Device\Afd
7E0: File (---) \Device\Afd
7E8: File (---) \Device\Afd
7F8: File (---) \Device\Afd
810: File (---) \Device\Afd
81C: File (---) \Device\Afd
…
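Scrolling through thousands of handle lines by hand is tedious, so it may help to count the Afd (socket) handles per process. Here is a minimal sketch that assumes the handle.exe text layout shown in the excerpt above (a "name pid: N ..." header followed by "HANDLE: File ..." lines). The function name is hypothetical, and the backslashes in the device path are treated as optional in case they were lost when the output was copied around.

```python
import re
from collections import Counter

# Process header, e.g. "ABC.exe pid: 1148 NT AUTHORITY\SYSTEM".
PROCESS = re.compile(r"^(?P<name>.+?)\s+pid:\s+(?P<pid>\d+)")
# Socket handle line, e.g. "144: File (---) \Device\Afd".
AFD_HANDLE = re.compile(r"^\s*[0-9A-Fa-f]+:\s+File\s.*\\?Device\\?Afd\b")


def count_afd_handles(lines):
    """Return a Counter of Afd handles keyed by (process name, pid)."""
    counts = Counter()
    current = None
    for line in lines:
        header = PROCESS.match(line)
        if header:
            # Every handle line that follows belongs to this process.
            current = (header.group("name"), int(header.group("pid")))
        elif current is not None and AFD_HANDLE.match(line):
            counts[current] += 1
    return counts
```

Running `count_afd_handles` over the file produced by "handle.exe -a -u" and looking at `.most_common(5)` should put a leaking process at the top of the list. What counts as "too many" depends on the workload, so compare against a healthy server with a similar role.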
RESOLUTION:
===========
So we advised the customer to update the 3rd party AV software. Apart from that, you can take the following actions to avoid possible port leak issues:
a) Please make sure that the Windows OS runs with the latest rollups/security updates
b) Please make sure that all 3rd party software is up to date (including firewall, AV, backup or any kind of software that might have to frequently establish outbound connections)
c) Finally, you may consider extending the dynamic port range for busy servers which are expected to establish many outbound connections very frequently. The following is close to the largest range that you can set, but you may extend the range in phases instead of maxing it out at the very beginning (from an elevated command prompt):
netsh int ipv4 set dynamicport tcp start=1025 num=64500
netsh int ipv4 set dynamicport udp start=1025 num=64500
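Before changing the range, it is worth checking the arithmetic: the last port of the range is start + num - 1. The sketch below validates a start/num pair against the limits as I understand them from Microsoft's documentation (start port at least 1025, at least 255 ports, last port no higher than 65535). Treat those limits as assumptions and verify them for your OS version; the function and constant names are hypothetical.

```python
MIN_START = 1025   # assumed lowest start port that netsh accepts
MIN_COUNT = 255    # assumed smallest range size that netsh accepts
MAX_PORT = 65535   # highest valid TCP/UDP port number


def dynamic_port_range(start, num):
    """Return (first_port, last_port) for a start/num pair or raise ValueError."""
    last = start + num - 1
    if start < MIN_START:
        raise ValueError(f"start must be at least {MIN_START}")
    if num < MIN_COUNT:
        raise ValueError(f"num must be at least {MIN_COUNT}")
    if last > MAX_PORT:
        raise ValueError(f"range would end at {last}, above {MAX_PORT}")
    return start, last


if __name__ == "__main__":
    print(dynamic_port_range(1025, 64500))    # (1025, 65524)
    print(dynamic_port_range(49152, 16384))   # (49152, 65535)
```

The second call uses the default dynamic range on current Windows versions (start=49152, num=16384). Growing the range in phases then simply means picking a lower start value each time, while keeping clear of any fixed ports your own services listen on.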
You can also decrease the TcpTimedWaitDelay registry value on the servers (you may lower it to 30 seconds):
https://technet.microsoft.com/en-us/library/cc757512(v=ws.10).aspx TcpTimedWaitDelay
The TcpTimedWaitDelay value determines the length of time that a connection stays in the TIME_WAIT state when being closed. While a connection is in the TIME_WAIT state, the socket pair cannot be reused. This is also known as the 2MSL state because the value should be twice the maximum segment lifetime on the network. To adjust the TcpTimedWaitDelay settings, you have to modify/create the registry settings as listed below:
Key: | HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters |
Value: | TcpTimedWaitDelay |
Data Type: | REG_DWORD |
Range: | 30-300 (decimal) |
Default value: | 0x78 (120 decimal) |
Recommended value: | 30 |
Value exists by default? | No, needs to be added. |
Note: This change requires a server reboot
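If you need to roll this setting out to several servers, you could generate the corresponding "reg add" command from a script instead of editing the registry by hand. The following is a minimal sketch that only builds the command and checks the value against the 30-300 range from the table above; the function name is hypothetical, and nothing is written to the registry until you actually run the returned command from an elevated prompt.

```python
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def timed_wait_delay_command(seconds=30):
    """Build the 'reg add' argument list that sets TcpTimedWaitDelay."""
    if not 30 <= seconds <= 300:
        raise ValueError("TcpTimedWaitDelay must be between 30 and 300 seconds")
    return [
        "reg", "add", KEY,
        "/v", "TcpTimedWaitDelay",   # value name
        "/t", "REG_DWORD",           # data type
        "/d", str(seconds),          # data, in decimal seconds
        "/f",                        # overwrite without prompting
    ]


if __name__ == "__main__":
    # Print the command for review rather than executing it.
    print(" ".join(timed_wait_delay_command(30)))
```

To apply it, pass the list to `subprocess.run(..., check=True)` on the target server and then schedule the reboot.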
Please note that the same techniques can be applied to virtually any Windows version from Windows 7/Windows Server 2008 R2 onwards.
Hope this helps
Thanks,
Murat