APM Agent Throttling settings and other APM Overrides in SC2012 Operations Manager

So, the APM Wizard is great. We put a lot of effort into simplifying the most common configuration scenario, as described in earlier posts such as this one http://blogs.technet.com/b/momteam/archive/2011/07/30/operations-manager-and-application-monitoring.aspx ; anyhow, there are additional settings that are buried within overrides, in a few places. This post aims to bring some of these to your attention.

One such place is the Discovery called "Discovery of APM Agent properties" – which you can find, as shown in the screenshot below, by looking for the “.NET Application Monitoring Agent” class. This discovery contains a few interesting overrides. Some of them are related to “global” monitoring settings, while one of them is particularly important in relation to event throttling.

Global per-agent APM Monitoring settings

Apart from the throttling one, the overrides in this discovery are about enabling “global” settings for monitoring applications. While this is not the recommended approach, it is possible to bypass the APM Template entirely, and use these overrides instead – with some caveats:

"Enable monitoring all web applications" - Enables monitoring for ALL w3wp.exe processes on the IIS machine, with default settings, regardless of app-pool name and discovery. This essentially turns on APM monitoring for any and every application on the webserver – present and future ones. The drawback of this approach is that, since we have not run the wizard/template and those objects have not been discovered as applications by OM, we can't match the data to alerts or states. Therefore the data will only be visible through AppDiagnostics and Advisor – there will not be any OM “view” or “object” (see the earlier post where I described the APM Object model here – http://blogs.technet.com/b/momteam/archive/2012/01/14/apm-object-model.aspx ).
"Enable all critical exceptions" - This does the same thing as the equivalent option in the APM template. Again, this is a "global" setting for the agent, mostly useful when "Enable monitoring all web applications" is used. For specific apps, if you do run the template for those apps, the setting specified in the template would be more specific and “win”.
"Performance threshold" – Same as the previous one: the same thing can be set from the template. This is a "global" setting for the agent, mostly useful when "Enable monitoring all web applications" is used since, for specific apps, the one specified in the template would be more specific.
"Sensitivity threshold" - Same as the previous one: the same thing can be set from the template. This is a "global" setting for the agent, mostly useful when "Enable monitoring all web applications" is used since, for specific apps, the one specified in the template would be more specific.

APM Event Throttling

Now, let’s get to the topic of throttling specifically. Another useful override in the above discovery rule is the following one:

"Enable Event Throttling" - enables or disables event throttling on the agent – this is the “global” ON/OFF switch for this feature. Keep in mind that throttling is enabled for a reason – and that is to keep CPU utilization on the agent LOW, in case a lot of events are being generated. Therefore it is NOT recommended to change this setting, unless you are in a test/demo environment only, where performance is not a concern. Read on, and we’ll talk more about throttling.

As mentioned earlier, the main goal of Event Throttling is to keep CPU Utilization Low. Therefore it is NOT recommended to switch it off completely, as it is a useful “guard” for system resources. Anyhow, even while leaving it on, there are options to tweak the behavior of throttling – to make it more or less aggressive.

All these options can be configured thru overriding the "Apply APM Agent configuration rule" – again, targeted to “.NET Application Monitoring Agent”, see screenshots below:

While running, the APM agent continuously compares the count of monitored events with limits, and will provide those events to OpsMgr workflow for delivery to the management group only if the limits are not exceeded. In case your application goes “crazy” and starts throwing a lot of exceptions all of a sudden, this allows the APM agent to not add too much load, while still reporting the first exceptions of any “group” to you (see below about event groups).

Please note that the performance counters “% Exception Events / sec”, “% Performance Events / sec” and “Average Request Time” are NOT affected by this: these are the counters used in the monitors in OM, and we have described them earlier here http://blogs.technet.com/b/momteam/archive/2011/08/23/application-monitoring-working-with-alerts.aspx . They’ll keep reporting the real count of events having been “seen” on the agent and will keep track of all requests – but, due to throttling, not all events with full code-level details will be reported.

As I mentioned earlier on here http://blogs.technet.com/b/momteam/archive/2012/06/18/event-to-alert-ratio-reviewing-problems-and-understanding-trends-for-apm-data-in-opsmgr-2012.aspx there are 15 (FIFTEEN!) overrides that govern this. These are basically divided in 5 categories, and for each category there is a limit per MINUTE, per HOUR, and per DAY. All requests are counted by the agent and evaluated against these thresholds. Every time a period (minute, hour, day) passes, the counters to match against these limits are reset.
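
To make the minute/hour/day mechanics a bit more concrete, here is a minimal Python sketch of a single throttling category. It only illustrates the behavior described above and is not the agent's actual implementation: the class name, the limit values and the choice of clock-aligned windows are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Period lengths in seconds. Clock-aligned windows are an assumption of this sketch.
PERIODS = {"minute": 60, "hour": 3600, "day": 86400}


@dataclass
class WindowedLimit:
    """One throttling category with a limit per minute, per hour and per day."""
    limits: dict  # hypothetical values, e.g. {"minute": 2, "hour": 3, "day": 100}
    counts: dict = field(default_factory=dict)
    windows: dict = field(default_factory=dict)

    def record(self, now: float) -> bool:
        """Count one event at time `now` (in seconds); True means 'within all limits'."""
        within_limits = True
        for period, length in PERIODS.items():
            window = int(now // length)
            if self.windows.get(period) != window:
                # The period has passed: reset the counter for it.
                self.windows[period] = window
                self.counts[period] = 0
            self.counts[period] += 1
            if self.counts[period] > self.limits[period]:
                within_limits = False
        return within_limits


limit = WindowedLimit({"minute": 2, "hour": 3, "day": 100})
print([limit.record(t) for t in (0, 1, 2, 60, 3600)])
# prints [True, True, False, False, True]
```

In this run the third event exceeds the per-minute limit, the fourth falls into a new minute but exceeds the hourly limit, and the fifth lands in a fresh hour and is allowed again.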

If you want to customize these limits, you need to understand what these limits look like and what counters are being evaluated to make the throttling decision. Below I attempt such an explanation. Anyhow, be aware that we do NOT recommend that you increase ANY of the following limits, as it might have a negative impact on your application’s performance.

Exception Event Groups – This counter tracks the number of exceptions for each Event Group. All events in an Event Group have encountered the same Problem, and therefore have identical call stacks. This is a logical threshold that is not directly related to resource (CPU) usage, but it prevents you from receiving an unreasonable number of the same event repeatedly and flooding the Operations Manager system and database(s). When the threshold is reached, APM will stop reporting events from that group for the configured time period, but it will continue incrementing the internal event counter for that group. An event group is known as a “Problem” group in AppDiagnostics. When events are displayed in AppDiagnostics and grouped by 'Problem', the 'Count' column will display the actual count of all events that occurred in the group, even though only the first events will be displayed within the group. Also note that, if you have turned on Alerting rules (see more info on http://blogs.technet.com/b/momteam/archive/2012/01/23/custom-apm-rules-for-granular-alerting.aspx ), Alerts in OM have “suppression” (incrementing the ‘repeat count’) based on this concept of a “problem” or “problem group”. Only 'interesting' exceptions count toward this limit: an exception is 'interesting' if the system is configured to collect it. If the system is configured to only collect 'critical' exceptions, then only 'critical' exceptions are considered 'interesting'. If the system has been configured to collect 'All' exceptions and 'Custom Handlers' have been added, then all exceptions and the custom handlers will be considered 'interesting'.
Exceptions per Domain (application domain) - For the more infrastructure-oriented folks, with less developer background, I will clarify that here we are talking of Application Domains – that have nothing to do with DNS domains or the domain your web application serves, of course. Application Domains are a .NET concept and feature that allows running multiple, distinct applications within the same Windows process, while maintaining logical isolation (see here for an overview http://msdn.microsoft.com/en-us/library/2bh4z9hs(v=VS.71).aspx ). In IIS the application domain boundary is typically the Virtual Directory or the application that is shown in the configuration wizard (“configure application”) in IIS. This counter therefore tracks the number of exceptions raised in a particular web application (application domain). This threshold protects your system against excessive CPU usage and against excessive network traffic from sending too much data. In situations where SQL Server is down, your disk is full, there is not enough memory, etc., an unusual number of exceptions may be thrown from an application, each from a different place in the code and each with a different exception. This threshold will stop collecting and reporting exceptions from that application for the configured time period. The result is that if you have multiple application domains running under a single process, and one of the application domains is throwing excessive exceptions, then only that domain will be throttled while the rest of the domains/apps will continue to be monitored normally. An exception is 'interesting' if the system is configured to collect it. If the system is configured to only collect 'critical' exceptions, then only 'critical' exceptions are considered 'interesting'. If the system has been configured to collect 'All' exceptions and 'Custom Handlers' have been added, then all exceptions and the custom handlers will be considered 'interesting'.
Performance Event Groups – This counter keeps track of the number of events raised in a particular event group. An event group is known as a “Problem” group in AppDiagnostics. Also note that, if you have turned on Alerting rules (see more info on http://blogs.technet.com/b/momteam/archive/2012/01/23/custom-apm-rules-for-granular-alerting.aspx ), Alerts in OM have “suppression” (incrementing the ‘repeat count’) based on this concept of a “problem”.
Performance Events per Domain - This counter keeps the number of events raised in a particular application domain. For the more infrastructure-oriented folks, with less developer background, I will clarify that here we are talking of Application Domains – that have nothing to do with DNS domains or the domain your web application serves, of course. Application Domains are a .NET concept and feature that allows running multiple, distinct applications within the same Windows process, while maintaining logical isolation (see here for an overview http://msdn.microsoft.com/en-us/library/2bh4z9hs(v=VS.71).aspx ). In IIS the application domain boundary is typically the Virtual Directory or the application that is shown in the configuration wizard (“configure application”) in IIS.
Total Number of Exceptions – This tracks the full number of caught exceptions. While 'Exception Event Groups' and 'Exceptions per Domain' apply to 'interesting' events, this threshold applies to all exceptions thrown, regardless of whether the system is configured to report them or not. Should something occur that causes your applications to throw an excessive number of exceptions, monitored or not, this threshold will cause the agent to start ignoring them altogether, and it represents the last line of defense to protect against excessive CPU usage. We especially recommend you do not increase this limit.
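
If it helps to see how the three exception-related limits layer on top of each other, below is a small Python sketch for a single time period. It is an illustration under assumptions, not how the agent is actually coded: the class, the limit values, the sample domain and group names, and the order in which the checks are evaluated are all made up for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExceptionThrottle:
    """Single-period sketch; limit values and check order are assumptions."""
    group_limit: int
    domain_limit: int
    total_limit: int
    group_counts: Counter = field(default_factory=Counter)
    domain_counts: Counter = field(default_factory=Counter)
    total: int = 0

    def should_report(self, domain: str, group: str, interesting: bool) -> bool:
        # Total Number of Exceptions: counts everything, interesting or not.
        self.total += 1
        if self.total > self.total_limit:
            return False
        if not interesting:
            return False  # not configured for collection, so never reported
        # Exceptions per Domain: throttles only the offending application domain.
        self.domain_counts[domain] += 1
        if self.domain_counts[domain] > self.domain_limit:
            return False
        # Exception Event Groups: keep counting even after reporting stops.
        self.group_counts[(domain, group)] += 1
        return self.group_counts[(domain, group)] <= self.group_limit


throttle = ExceptionThrottle(group_limit=2, domain_limit=4, total_limit=10)
key = ("shop", "NullReference@Cart")
reported = [throttle.should_report(*key, interesting=True) for _ in range(3)]
print(reported, throttle.group_counts[key])
# prints [True, True, False] 3
```

Note how the third event of the same group is not reported, yet the group counter still reads 3, which mirrors the 'Count' column behavior in AppDiagnostics described above.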

Heavy / Light optimization

No discussion of APM throttling would be complete without touching on those “light” events. Anyhow, this has already been described very well by a former AVIcode employee, Alex – so I’ll refer you to his explanation rather than make up a new one – see here http://gotchahunter.net/2011/10/scom-apm-avicode-what-are-those-light-events/ . Additional related info is also on VIAcode’s blog here http://www.viacode.com/blog/2011/12/01/light-requests-optimization-logic-how-does-it-work/

One important note is that this cannot currently be switched off in OM12 in supported ways. It is also interesting to note that “light” events don’t get counted against “problems” counts in many Advisor reports.

Data Retention

Another place is a rule that executes against a special Data Transfer singleton object in OM. This rule is responsible for continuously transferring APM data from the OpsDB to the DW database and for performing APM data grooming. The retention settings that drive the grooming mechanism are documented here: http://technet.microsoft.com/en-us/library/jj159297

Sensitive Data

Another place is the paragraph in the documentation that explains how to “work with sensitive data” (here http://technet.microsoft.com/en-us/library/hh543995.aspx ), which also uses an override to define expressions for clearing sensitive data out of function parameters (a commonly requested feature).
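
The linked documentation has the actual syntax for that override. Purely to illustrate the idea of expression-based masking, here is a Python sketch using regular expressions; the rule list, the function name and the sample strings are hypothetical and do not reflect the override's real format.

```python
import re

# Hypothetical masking rules: (pattern, replacement). Not the override's real syntax.
MASKING_RULES = [
    # Hide the value of a "password=" pair, keeping the key as written.
    (re.compile(r"(password=)[^;&\s]+", re.IGNORECASE), r"\1***"),
    # Hide 16-digit card numbers written with consistent space, dash or no separator.
    (re.compile(r"\b\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}\b"), "****-****-****-****"),
]


def mask_parameter(value: str) -> str:
    """Apply every masking rule, in order, to a captured parameter value."""
    for pattern, replacement in MASKING_RULES:
        value = pattern.sub(replacement, value)
    return value


print(mask_parameter("Server=db1;User=app;Password=s3cret;Timeout=30"))
# prints Server=db1;User=app;Password=***;Timeout=30
```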

Discovery

Other overrides that are useful to discover your applications are documented here as part of the documentation for the “Operations Manager APM Web IIS 7” MP - http://technet.microsoft.com/en-us/library/hh916929.aspx ; keep in mind that for IIS8 in SP1 there is an equivalent MP “Operations Manager APM Web IIS 8” and equivalent discoveries/rules. As described in the documentation, those rules work by searching for specific file extensions, and discovering “ASP.NET applications” or “ASP.NET webservices”, depending on what they find (or what you tell them to find, via those overrides).
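
As a rough mental model of extension-based discovery, here is a short Python sketch. It is not the MP's real logic: the function name and the extension-to-type mapping in the example are assumptions chosen for illustration, so check the linked documentation for the extensions the rules actually use.

```python
from pathlib import Path

# Hypothetical mapping; the real extensions are set through the documented overrides.
EXAMPLE_EXTENSIONS = {
    ".aspx": "ASP.NET application",
    ".asmx": "ASP.NET webservice",
}


def discover_types(root: str, extension_map: dict) -> set:
    """Return the object types whose file extensions appear anywhere under `root`."""
    found = set()
    for path in Path(root).rglob("*"):
        label = extension_map.get(path.suffix.lower())
        if label and path.is_file():
            found.add(label)
    return found
```

Calling discover_types on a web application's physical folder with EXAMPLE_EXTENSIONS returns whichever of the two labels have matching files; passing a different mapping plays the role of the override.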

Also, there is one more rule for IIS-Hosted WCF endpoints which is not documented, and which works pretty much the same way: it looks for *.svc files and uses them to find WCF web services (objects of type “IIS Hosted WCF Web Service Endpoint”). This is a separate discovery in another MP (that is why it isn’t currently documented), but essentially it is the same idea. In this case there is an actual DISCOVERY (not a rule) with an override-able property to look for different file extensions than “.svc”. The screenshot is from an SP1 build; in RTM the display name was slightly different, but it can be found in the same MP, same target. This will find WCF services in both IIS7 and IIS8. Here it is:

If you need more information about the various endpoints – look at the query and report here: http://blogs.technet.com/b/momteam/archive/2012/08/22/apm-configured-endpoint-report.aspx

Disclaimer

Before we sign off, I just want to leave you with a disclaimer: a lot of the overrides above (especially those related to throttling) can lead you into an “unsupported” or “not recommended” state. Increasing throttling limits can have a performance impact on your application: we put them in place to avoid overhead, and we think they are useful as they are. Also, it is not guaranteed that they will work as described in future releases, and they might even prevent upgrading in some cases. Also, as usual, this posting is provided "AS IS" with no warranties, and confers no rights. Use of included utilities is subject to the terms specified at http://www.microsoft.com/info/cpyright.htm.

With this in mind, I hope this journey into APM overrides was useful to understand how things work! Happy customization, and ‘till next time!