Using alerts to drive a load test

Spread the love

In the previous post, I promised to show a practical use case for performance alerts. Today I will make good on this promise and show a complete code sample on how to drive a load test using alerts.

Imagine a client/server workload in a simple load test; the client sends messages periodically, causing the server to do some processing. The goal of the test is to make the server “busy” based on some metric for some extended duration and ensure it behaves appropriately. At first cut, you might just send messages in a loop with a certain fixed delay and hope the target rate and processing load remains constant. However, this would not account for real-world dynamism such as changing CPU and memory load on the server, network flow control and latency, and so on. If the server application is properly instrumented with performance counters, you can instead use performance alerts to ensure that your load tests are indeed keeping your server busy.

In my AlertSample project on GitHub, I implemented a complete sample showing how to achieve this. The sample incorporates the previous code from PlaSample to create alerts but adds new code for a very basic named pipe client/server app (using a WCF self-hosted service).

The metric I am tracking is the built-in “IO Data Operations/sec” counter. (Note that for a named pipe receiver, each received message will incur two data operations since a named pipe is a duplex channel underneath.) The idea is that sender will keep adjusting its send interval based on an upper and lower threshold provided by the I/O data operations. When the lower threshold alert fires, the sender should increase its send rate (by decreasing the send interval). Similarly, when the upper threshold alert fires, the sender should decrease its send rate to give the server a break.

To programmatically watch for performance alerts, I used EventLogWatcher. This required that I provide an appropriate query which would match the alert threshold events. With some help from Event Viewer, I was able to determine the appropriate event log name (“Microsoft-Windows-Diagnosis-PLA/Operational”) and the query to use. To build the query, you can manually navigate to the event log in question and filter the events and then select the “XML” tab in the “Filter Current Log” dialog. In this case, the filter should be for event 2031 fired after a certain time interval. The resultant XML query would look something like this:

<QueryList>
  <Query Id="0" Path="Microsoft-Windows-Diagnosis-PLA/Operational">
    <Select Path="Microsoft-Windows-Diagnosis-PLA/Operational">*[System[(EventID=2031) and TimeCreated[@SystemTime&gt;='2014-01-01T08:00:00.000Z']]]</Select>
  </Query>
</QueryList>

Note that you only need to provide the text inside the <Select> node with the XML entities translated to raw text (e.g. “&gt;” should be turned back to “>”).

The app configures alert thresholds with interval of three seconds and bounds of 10 and 15 I/O data operations per second. The alert will thus fire if the receive rate drops below 5.0 or rises above 7.5 messages per second for three seconds in a row. Thanks to the simple TraceSource-based logger (with a ConsoleTraceListener), you can run the app (elevated) and see how it periodically corrects the send interval in response to alerts:

[ . . . ]
AlertSample Information: 0 : Threshold reached; is lower bound? False
    ProcessId=417336
    DateTime=2014-01-02T00:38:57.4786042Z
AlertSample Information: 0 : Send interval updated to 250.0 ms.
    ProcessId=417336
    DateTime=2014-01-02T00:38:57.4786042Z
AlertSample Information: 0 : Sender sent 136 messages to 'child' (~7 messages/sec).
    ProcessId=417336
    DateTime=2014-01-02T00:38:57.8286047Z
AlertSample Information: 0 : Receiver 'child' received 136 messages (~7 messages/sec).
    ProcessId=401668
    DateTime=2014-01-02T00:38:57.8386047Z
AlertSample Information: 0 : Sender sent 140 messages to 'child' (~4 messages/sec).
    ProcessId=417336
    DateTime=2014-01-02T00:38:58.8436062Z
AlertSample Information: 0 : Receiver 'child' received 140 messages (~4 messages/sec).
    ProcessId=401668
    DateTime=2014-01-02T00:38:58.8536062Z
[ . . . ]
AlertSample Information: 0 : Threshold reached; is lower bound? True
    ProcessId=417336
    DateTime=2014-01-02T00:39:03.4836127Z
AlertSample Information: 0 : Send interval updated to 125.0 ms.
    ProcessId=417336
    DateTime=2014-01-02T00:39:03.4836127Z
AlertSample Information: 0 : Sender sent 160 messages to 'child' (~5 messages/sec).
    ProcessId=417336
    DateTime=2014-01-02T00:39:03.9136133Z
AlertSample Information: 0 : Receiver 'child' received 160 messages (~5 messages/sec).
    ProcessId=401668
    DateTime=2014-01-02T00:39:03.9236133Z
AlertSample Information: 0 : Sender sent 168 messages to 'child' (~8 messages/sec).
    ProcessId=417336
    DateTime=2014-01-02T00:39:04.9236147Z
AlertSample Information: 0 : Receiver 'child' received 168 messages (~8 messages/sec).
    ProcessId=401668
    DateTime=2014-01-02T00:39:04.9436148Z
AlertSample Information: 0 : Threshold reached; is lower bound? True
    ProcessId=417336
    DateTime=2014-01-02T00:39:05.4836155Z
AlertSample Information: 0 : Send interval updated to 63.0 ms.
    ProcessId=417336
    DateTime=2014-01-02T00:39:05.4836155Z
[ . . . ]

For simplicity, the code sample runs on a single machine and uses a child process for the server application. However, in the real world, you would probably run clients and servers on physically separate locations. But the general approach shown in the sample can still be used, given that event log subscriptions and data collector sets can be targeted to remote machines.

Leave a Reply

Your email address will not be published. Required fields are marked *