Field Notes: The case of accidentally misconfigured Failover Cluster CSV cache


Introduction

In this post, I take you through the process of troubleshooting high pool usage with free tools from the Windows Sysinternals suite and the Windows Performance Toolkit (WPT).  I also show you how to resolve the issue by properly configuring the Cluster Shared Volume (CSV) cache using PowerShell cmdlets available in the Failover Cluster module.

Here’s the issue

Imagine that you receive the error message below when you attempt to move a virtual machine (VM) from a stand-alone Hyper-V host to a Failover Cluster in order to make it highly available:

[Screenshot: error message when attempting to move the virtual machine]

This would be unexpected, especially when you know the destination host is running only a few VMs configured with memory totaling about 20 GB out of 64 GB of installed RAM.  Looking at the Use Counts tab in RAMMap on the destination host, you see that:

  • Running VMs consume roughly 20 GB of RAM (Driver Locked)
  • The non-paged pool is sitting at around 36 GB (Nonpaged Pool)

[Screenshot: RAMMap Use Counts tab on the destination host]

Non-paged pool usage at over 50% of total RAM warrants investigation; we would rather have VMs consume this memory and maximize density.  So, what is the non-paged pool?  It consists of kernel virtual memory that always resides in RAM and can never be paged out to disk.  Mark Russinovich discusses pools in detail here, where he also covers tracking pool leaks using Poolmon, Strings, and the Debugging Tools for Windows.
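Before capturing a trace, Poolmon can also give you a quick live view of pool consumption by tag.  A minimal sketch, assuming poolmon.exe (which ships with the WDK) is in the current folder:

    # Launch Poolmon for a live, per-tag view of pool allocations.
    # Inside Poolmon, press 'b' to sort by bytes and 'p' to cycle between
    # paged, non-paged, and both pool types.
    .\poolmon.exe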

Troubleshooting tools and techniques

Let us see how WPT can provide some clues about what is causing this condition.  To collect a trace, copy both xperf.exe and perfctrl.dll from a WPT installation folder to a temporary working folder on the affected machine.  You don't want to install WPT on a production server; the toolset is usually installed on workstations, where the analysis is performed.  To start and stop the trace, run the following commands:

  • xperf -on PROC_THREAD+LOADER+POOL -stackwalk PoolAlloc+PoolFree+PoolAllocSession+PoolFreeSession -BufferSize 1024 -MinBuffers 256 -MaxBuffers 256 -MaxFile 256 -FileMode Circular 
  • xperf -stop -d pool.etl

For this scenario, running the trace for about 30 seconds should be adequate.
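If you prefer to script the capture end to end, here is a minimal PowerShell sketch, assuming xperf.exe and perfctrl.dll sit in the working folder:

    # Start the pool trace with stack walking on pool allocation/free events.
    .\xperf.exe -on PROC_THREAD+LOADER+POOL -stackwalk PoolAlloc+PoolFree+PoolAllocSession+PoolFreeSession -BufferSize 1024 -MinBuffers 256 -MaxBuffers 256 -MaxFile 256 -FileMode Circular

    # Let the trace run while the high pool usage condition is active.
    Start-Sleep -Seconds 30

    # Stop the trace and merge it into pool.etl for analysis.
    .\xperf.exe -stop -d pool.etl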

Once pool.etl (or whatever name you chose) has been generated, copy it to a machine with WPT installed and open it in the Windows Performance Analyzer (WPA).  Load symbols and add the "Outstanding Size by Paged, Tag" graph to the analysis view.  This immediately gives you a clue about the tag used for the allocations.
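For symbol resolution, WPA follows the standard symbol path convention.  A hypothetical setup, assuming a local cache at C:\Symbols and wpa.exe on the PATH:

    # Point the symbol path at a local cache backed by the Microsoft public
    # symbol server, then open the trace (Trace > Load Symbols inside WPA).
    $env:_NT_SYMBOL_PATH = "srv*C:\Symbols*https://msdl.microsoft.com/download/symbols"
    wpa.exe .\pool.etl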

[Screenshot: WPA "Outstanding Size by Paged, Tag" graph]

To get an idea of which driver or kernel-mode component uses a particular tag for pool allocations, have a look in the pooltag.txt file.  Pooltag.txt is installed with the Debugging Tools for Windows and with the Windows DDK:
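A quick way to search the file, assuming the default install path of a recent kit (the exact path varies by kit version and architecture):

    # Look up the pool tag in pooltag.txt.
    # In this case the search comes back empty - RDrc is not a documented tag.
    Select-String "RDrc" "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\triage\pooltag.txt"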

[Screenshot: pooltag.txt contents]

If you don't have any of these tools installed, Yong Rhee has a list published here.

Unfortunately, the tag we are after in this case (RDrc) is not listed in Pooltag.txt.  Drivers, however, are typically found in C:\Windows\System32\drivers, and searching that directory for RDrc with Sysinternals' Strings (see the sketch below) turns up the culprit:
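Strings itself does not report which file a match came from, so a rough PowerShell take on the search (assuming strings.exe is in the current folder) loops over the drivers and prints the matching file names:

    # Report each driver binary that embeds the RDrc pool tag.
    Get-ChildItem C:\Windows\System32\drivers\*.sys | ForEach-Object {
        if (.\strings.exe $_.FullName | Select-String -Quiet "RDrc") { $_.Name }
    }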

[Screenshot: Strings search results pointing at csvvbus.sys]

Sysinternals' Sigcheck can be used to get more information on the driver in question.  The description in the output shows that csvvbus.sys is a Cluster Volume Bus Driver.  The cool thing about Sigcheck is that it also shows other valuable information, such as the company and publisher.
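A minimal Sigcheck invocation, assuming the tool sits in the current folder:

    # Show version, description, company, publisher, and signing details.
    .\sigcheck.exe C:\Windows\System32\drivers\csvvbus.sys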

[Screenshot: Sigcheck output for csvvbus.sys]

With this kind of information, taking another look at the trace makes more sense.  The key to effective data analysis is to sort columns appropriately; the folks at the NT Debugging Blog explain this concept here.  This is how I had mine set up, and we can see that RDrc comes up under AIFO (allocated inside, freed outside):

[Screenshot: WPA trace sorted to show RDrc under AIFO]

The stack shows that csvvbus.sys makes the allocations through a call to ntoskrnl.exe!ExAllocatePoolWithTag.

[Screenshot: call stack showing csvvbus.sys calling ExAllocatePoolWithTag]

How do we fix this?

Having dealt with a lot of Hyper-V clusters in the field, I immediately thought of the CSV cache after seeing this!  There are also clues in the call stack in the trace.  For context, the CSV cache provides block-level caching of read-only unbuffered I/O operations by allocating system memory (RAM) as a write-through cache.  This document recommends enabling and properly configuring the CSV cache for all clustered Hyper-V and Scale-Out File Server deployments.

Run the following PowerShell cmdlet to check BlockCacheSize.  In this case, it is configured with a maximum of 1 TB instead of the 1 GB or 2 GB I come across in most deployments, which accounts for the high non-paged pool usage we observed:
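    # The value is expressed in MB, so an entry of 1048576 amounts to 1 TB.
    (Get-Cluster).BlockCacheSize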

[Screenshot: BlockCacheSize output showing the 1 TB misconfiguration]

What do we do to fix this?  Elden Christensen has some good guidance on how to enable the CSV cache.  To set BlockCacheSize to 1 GB, run (Get-Cluster).BlockCacheSize = 1024.  The value is in MB, which explains why you sometimes see values amounting to 1 TB or even 1 PB when administrators are unsure whether the unit is bytes, kilobytes, or something else.  After the cache limit is corrected, non-paged pool usage immediately drops and physical RAM becomes available for other uses, such as accommodating more VMs.  The shared-nothing live migration succeeded in my case.
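For reference, here is the fix as a one-liner, with the unit spelled out:

    # BlockCacheSize is specified in MB: 1024 = 1 GB (1048576 would be 1 TB).
    (Get-Cluster).BlockCacheSize = 1024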

[Screenshot: non-paged pool usage dropping after the cache limit was corrected]

As a side note, you can also run Get-ClusterSharedVolume "<CSV Name>" | Get-ClusterParameter to confirm whether the EnableBlockCache private property is set to true (1) per CSV.  I bring this up because I've seen folks try Get-ClusterSharedVolume | FL * and not be excited by the results!
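Here is a sketch of that per-CSV check, using a hypothetical volume name:

    # "Cluster Disk 1" is a placeholder - substitute the name of your CSV.
    # EnableBlockCache = 1 means the block cache is enabled for that volume.
    Get-ClusterSharedVolume "Cluster Disk 1" | Get-ClusterParameter EnableBlockCache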

[Screenshot: Get-ClusterParameter output showing EnableBlockCache]

Summary

In this post, I demonstrated how free tools available in the Sysinternals suite and the Windows Performance Toolkit can help you quickly troubleshoot issues in Windows and Windows Server that may otherwise be hard to catch.  In this scenario, I also covered the CSV cache and how this feature can have a negative impact on your system if it is not properly configured.  Till next time…

