Posted to user@jmeter.apache.org by sebb <se...@gmail.com> on 2013/10/18 00:27:35 UTC

Coordinated Omission (CO) - possible strategies

It looks to be quite difficult to avoid the issue of Coordinated
Omission without a major redesign of JMeter.

However, it may be a lot easier to detect when the condition has occurred.
This would potentially allow the test settings to be changed to reduce
or eliminate the occurrences - e.g. by increasing the number of
threads or spreading the load across more JMeter instances.

The Constant Throughput Controller calculates the desired wait time,
and if this is less than zero - i.e. a sample should already have been
generated - it could trigger the creation of a failed Assertion
showing the time difference.

Would this be sufficient to detect all CO occurrences?
If not, what other metric needs to be checked?

Even if it is not the only possible cause, would it be useful as a
starting point?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by sebb <se...@gmail.com>.
[Repost; corrected]

It looks to be quite difficult to avoid the issue of Coordinated
Omission without a major redesign of JMeter.

However, it may be a lot easier to detect when the condition has occurred.
This would potentially allow the test settings to be changed to reduce
or eliminate the occurrences - e.g. by increasing the number of
threads or spreading the load across more JMeter instances.

The Constant Throughput Timer calculates the desired wait time,
and if this is less than zero - i.e. a sample should already have been
generated - it could trigger the creation of a failed Assertion
showing the time difference.
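
A rough sketch of the idea, in illustrative Java (the class and method
names below are invented for this example; JMeter's actual Timer API
differs):

    // Sketch: detect CO while computing a constant-throughput delay.
    public class ThroughputDelaySketch {
        private final long intervalMillis; // desired gap between samples
        private long nextScheduled = -1;   // next planned sample start

        public ThroughputDelaySketch(double samplesPerMinute) {
            this.intervalMillis = (long) (60000 / samplesPerMinute);
        }

        /** Wait time before the next sample; flags CO when negative. */
        public long delay() {
            long now = System.currentTimeMillis();
            if (nextScheduled < 0) {
                nextScheduled = now; // first call
            }
            long wait = nextScheduled - now;
            nextScheduled += intervalMillis;
            if (wait < 0) {
                // A sample should already have been generated; report
                // how late we are, e.g. as a failed Assertion result.
                reportCoordinatedOmission(-wait);
                return 0;
            }
            return wait;
        }

        private void reportCoordinatedOmission(long lateByMillis) {
            System.err.println("CO: sample is " + lateByMillis + " ms late");
        }
    }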

Would this be sufficient to detect all CO occurrences?
If not, what other metric needs to be checked?

Even if it is not the only possible cause, would it be useful as a
starting point?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by sebb <se...@gmail.com>.
On 17 October 2013 23:29, Philippe Mouawad <ph...@gmail.com> wrote:
> Hello,
> You mean Constant Throughput Timer, no?

Yes, sorry.

> Regards
> Philippe
>
>
> On Fri, Oct 18, 2013 at 12:27 AM, sebb <se...@gmail.com> wrote:
>
>> It looks to be quite difficult to avoid the issue of Coordinated
>> Omission without a major redesign of JMeter.
>>
>> However, it may be a lot easier to detect when the condition has occurred.
>> This would potentially allow the test settings to be changed to reduce
>> or eliminate the occurrences - e.g. by increasing the number of
>> threads or spreading the load across more JMeter instances.
>>
>> The Constant Throughput Controller calculates the desired wait time,
>> and if this is less than zero - i.e. a sample should already have been
>> generated - it could trigger the creation of a failed Assertion
>> showing the time difference.
>>
>> Would this be sufficient to detect all CO occurrences?
>> If not, what other metric needs to be checked?
>>
>> Even if it is not the only possible cause, would it be useful as a
>> starting point?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
>> For additional commands, e-mail: user-help@jmeter.apache.org
>>
>>
>
>
> --
> Cordialement.
> Philippe Mouawad.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by Philippe Mouawad <ph...@gmail.com>.
Hello,
You mean Constant Throughput Timer, no?

Regards
Philippe


On Fri, Oct 18, 2013 at 12:27 AM, sebb <se...@gmail.com> wrote:

> It looks to be quite difficult to avoid the issue of Coordinated
> Omission without a major redesign of JMeter.
>
> However, it may be a lot easier to detect when the condition has occurred.
> This would potentially allow the test settings to be changed to reduce
> or eliminate the occurrences - e.g. by increasing the number of
> threads or spreading the load across more JMeter instances.
>
> The Constant Throughput Controller calculates the desired wait time,
> and if this is less than zero - i.e. a sample should already have been
> generated - it could trigger the creation of a failed Assertion
> showing the time difference.
>
> Would this be sufficient to detect all CO occurrences?
> If not, what other metric needs to be checked?
>
> Even if it is not the only possible cause, would it be useful as a
> starting point?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
> For additional commands, e-mail: user-help@jmeter.apache.org
>
>


-- 
Cordialement.
Philippe Mouawad.

Re: Coordinated Omission (CO) - possible strategies

Posted by sebb <se...@gmail.com>.
On 18 October 2013 20:01, Kirk Pepperdine <ki...@gmail.com> wrote:
>>>
>>
>> I think you have missed the point of my posting.
>
> No, I just got excited that the problem is finally being looked at again 8*)
>
>>
>> The idea was to detect when CO has happened, and use that information
>> to change the test setup.
>
> I just wouldn't fail a test based on it occurring. A result is a result and should be treated as such.. good or bad...

A 404 or 500 is a result as well, but it results in a failed sample.

> Instead I'd offer a correction. In this case the correction would be to detect when the sampler should have been fired and add that time to the response time.

That would not be correct either.

There's no way of knowing what the actual elapsed time would have been
had the sample been started at the correct time.
The correct offset is probably somewhere between zero and the delay,
but even that is conjecture, as it's possible that the sample would
have failed had it been started earlier.

> I might flag a test saying that you didn't maintain a rate of throughput because of CO.

That was the point of creating the Assertion.
Maybe there are better ways to flag the problem up, but it would be a start.

> My problem with the threading model is that response times are upper bounded by a value that is a function of the number of threads. This can result in JMeter having unwanted effects on response times.

Yes, but please let's try and deal with one problem at a time.

This thread is about how to detect (and report) CO.

I suggest this is dealt with first before getting into discussions of
possible timing adjustments and avoidance techniques.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by Kirk Pepperdine <ki...@gmail.com>.
>> 
> 
> I think you have missed the point of my posting.
 
No, I just got excited that the problem is finally being looked at again 8*)

> 
> The idea was to detect when CO has happened, and use that information
> to change the test setup.

I just wouldn't fail a test based on it occurring. A result is a result and should be treated as such.. good or bad... Instead I'd offer a correction. In this case the correction would be to detect when the sampler should have been fired and add that time to the response time. I might flag a test saying that you didn't maintain a rate of throughput because of CO.

My problem with the threading model is that response times are upper bounded by a value that is a function of the number of threads. This can result in JMeter having unwanted effects on response times.

> In some cases it may not be possible to avoid the CO, but in other cases,
> it should be possible to reduce the transaction rate in each thread
> such that long sample times don't cause the next sample to be delayed.

I think the way around CO due to threads not being available is to simply not reuse them. Since I've been using JMeter that way I find that my tests are working much better.
HTTP is sync, which means I'm not sure how to fix the problem that Gil is talking about without a very serious workaround.

> And at least the user will have the required information.
> 
> So, I'll ask again: is my proposal for *detecting* CO reasonable?
> If not, what changes are needed?

I think detection is reasonable.. it's how the problem is dealt with afterwards that needs discussion.
> 
> Changing JMeter to behave differently is a matter for a separate thread.

Since the problems are coupled together I'd take this as a tactical hack to start to deal with the problem... which is a good start!

Thanks,
Kirk

> 
>> Regards,
>> Kirk
>> 
>> 
>> On 2013-10-18, at 12:27 AM, sebb <se...@gmail.com> wrote:
>> 
>>> It looks to be quite difficult to avoid the issue of Coordinated
>>> Omission without a major redesign of JMeter.
>>> 
>>> However, it may be a lot easier to detect when the condition has occurred.
>>> This would potentially allow the test settings to be changed to reduce
>>> or eliminate the occurrences - e.g. by increasing the number of
>>> threads or spreading the load across more JMeter instances.
>>> 
>>> The Constant Throughput Controller calculates the desired wait time,
>>> and if this is less than zero - i.e. a sample should already have been
>>> generated - it could trigger the creation of a failed Assertion
>>> showing the time difference.
>>> 
>>> Would this be sufficient to detect all CO occurrences?
>>> If not, what other metric needs to be checked?
>>> 
>>> Even if it is not the only possible cause, would it be useful as a
>>> starting point?
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
>>> For additional commands, e-mail: user-help@jmeter.apache.org
>>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
> For additional commands, e-mail: user-help@jmeter.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by sebb <se...@gmail.com>.
On 18 October 2013 07:47, Kirk Pepperdine <ki...@gmail.com> wrote:
> Hi Sebb,
>
> In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to come to some way of a non-disruptive change in JMeter to make this happen.
>
> The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool should be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result will be a far more scalable product that will require far fewer threads to produce far higher loads on the server.
>
> As for your idea on using the throughput controller: IMHO triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results, in other words they are omitted from the data set, you've only made the problem worse as you are filtering out bad data points from the result sets, making the results better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before event triggering.
>
> Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results; it offers techniques to calculate and correct for CO. Also Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.
>
> On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed as it could be used to help understand the source of CO.. if not just complement JMeter's view of latency.
>

I think you have missed the point of my posting.

The idea was to detect when CO has happened, and use that information
to change the test setup.
In some cases it may not be possible to avoid the CO, but in other cases,
it should be possible to reduce the transaction rate in each thread
such that long sample times don't cause the next sample to be delayed.
And at least the user will have the required information.

So, I'll ask again: is my proposal for *detecting* CO reasonable?
If not, what changes are needed?

Changing JMeter to behave differently is a matter for a separate thread.

> Regards,
> Kirk
>
>
> On 2013-10-18, at 12:27 AM, sebb <se...@gmail.com> wrote:
>
>> It looks to be quite difficult to avoid the issue of Coordinated
>> Omission without a major redesign of JMeter.
>>
>> However, it may be a lot easier to detect when the condition has occurred.
>> This would potentially allow the test settings to be changed to reduce
>> or eliminate the occurrences - e.g. by increasing the number of
>> threads or spreading the load across more JMeter instances.
>>
>> The Constant Throughput Controller calculates the desired wait time,
>> and if this is less than zero - i.e. a sample should already have been
>> generated - it could trigger the creation of a failed Assertion
>> showing the time difference.
>>
>> Would this be sufficient to detect all CO occurrences?
>> If not, what other metric needs to be checked?
>>
>> Even if it is not the only possible cause, would it be useful as a
>> starting point?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
>> For additional commands, e-mail: user-help@jmeter.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by Philippe Mouawad <p....@ubik-ingenierie.com>.
Hello Sebb,
Shouldn't we create a Bugzilla entry for this and start working on it?

Regards
Philippe



-----
Philippe M.
@philmdot
http://ubikloadpack.com

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by Gil Tene <gi...@azulsystems.com>.
[FYI - this is resent out of order due to a bounce. Some replies to this have already been posted].

I guess we look at human response back pressure in different ways. It's a question of whether or not you consider the humans to be part of the system you are testing, and what you think your stats are supposed to represent.

Some people will take the "forgiving" approach, which considers the client behavior as part of the overall system behavior. In such an approach, if a human responded to slow behavior by not asking any more questions for a while, that's simply what the overall system did, and the stats reported should reflect only the actual attempts that actual humans would have made, including their slowing down their requests in response to slow reaction times.

Mine is the very non-forgiving view. A view that says that response time stats are supposed to tell us what a human would experience if they randomly, without warning, tried to interact with the system. The way I see it, when a human has responded to back pressure, your reported response time stats should remain just as bad (or worse) as when no back pressure response had occurred. According to my view, the meaning of response time stats to the eventual consumer of those stats (a product manager designing a site around certain behavior expectations, a customer that signed up for a certain service level agreement, etc.) does not include any "gifts" of not counting bad behavior when humans responding to back pressure avoided observing them in some way.

A web site being completely down for 5 minutes an hour would generate a lot of human back pressure response. It may even slow down request rates so much during the outage that 99%+ of the overall actual requests by end users during an hour that included such a 5 minute outage would still be very good. Reporting on those (actual requests by humans) would be very different from reporting on what would have happened without human back pressure. But it's easy to examine which of the two reporting methods would be accepted by a reader of such reports.

If you insisted on using the "forgiving" approach and describing this terrible (5 minutes out every hour) web site behavior to a product manager or business-line owner as "our 95%'ile response times are reported to be 1/2 a second or better", they would rightly be more than a bit upset when they found out what reality actually looks like, and that a proper description of reality was "8% of attempts during each and every hour of the day result in people leaving our web site in anger. The response time of a full 5% of first-contact attempts was 180 seconds or more." You can argue all you want that both of the descriptions are actually correct, at the same time, as one regards human backoff and one doesn't. They would argue back that the 95%'ile report should clearly reflect the outage, as clearly much more than 5% of each hour was spent in an entirely non-responsive state. They would also argue that considering the human customers that by now have walked across the virtual street and bought stuff from your competitors to be part of the system under test is a wrong way to look at things, or at the very least a wrong way to report, size, and monitor them when it comes to the business at hand...

My advice to the casual or professional measurer here is this: unless you want to clearly postfix each of your response time stats with "of the good responses" (e.g. "95% of the good responses were 1/2 a second or better"), you probably don't want to include human response back pressure in the way you characterize response time in reports. And if you do put the postfix in, I'll bet you'll then be asked for what the non-postfix'ed stats are... The only reason you aren't being asked to measure all (not only the good, human-forgiven ones) already is that people think you are reporting the non-postfixed number already...


-- Gil.

On Oct 18, 2013, at 2:23 PM, Kirk Pepperdine <ki...@gmail.com> wrote:


On 2013-10-18, at 11:12 PM, Gil Tene <gi...@azulsystems.com> wrote:

This is not a problem that matters only in HFT or financial trading. Coordinated Omission is just as prevalent in making all your stats completely wrong in HTTP apps as it is in high frequency apps. It is unfortunately universal.

I'm not disagreeing with you. I am trying to sort out if it matters. One of the questions I have is what is the human response to back pressure in an HTTP based system? Next question, does my load injector (in this case JMeter) behave in the same way?

-- Kirk


On Oct 18, 2013, at 11:55 AM, Kirk Pepperdine <ki...@gmail.com> wrote:


On 2013-10-18, at 7:43 PM, Gil Tene <gi...@azulsystems.com> wrote:

I'm not saying the threading model doesn't have its own issues, or that those issues could not in themselves cause coordinated omission. I'm saying there is already a dominant, demonstrable, and classic case of CO in JMeter that doesn't have anything to do with the threading model, and will not go away no matter what is done to the threading model. As long as JMeter test plans are expressed as describing instructions for serial, synchronous, do-these-one-after-the-other scripts for what the tester should do for a given client, coordinated omission will easily occur in executing those instructions. I believe that this will not go away without changing how all JMeter test plans are expressed, and that is probably a non-starter. As a result, I think that building in logic that will correct for coordinated omission when it inevitably occurs, as opposed to trying to avoid its occurrence, is the only way to go for JMeter.

I can't disagree with you in that CO is present in a single threaded test. However, the nature of this type of load testing is that you play out a scenario because the results of the previous request are needed for the current request. Under those conditions you can't do much but wait until the back pressure clears or your initial request is retired. I think the best you can do under these circumstances is, just as Sebb has suggested, to flag the problem and move on. I wouldn't fail nor omit the result, but I'm not sure how you can correct, because the back pressure in this case will result in lower loads which will allow requests to retire at a rate higher than one should normally expect.

The only "correct" way to deal with detected coordinated omission is to either correct for it, or to throw away all latency or response time data acquired with it. "Flagging" it or "moving on" and keeping the other data for analysis is the same as saying "This data is meaningless, selective, and selectively represents only the best results the system demonstrated, while specifically dropping the vast majority of actual indications of bad behavior encountered during the test".

To be clear, I'm saying that all response time data, including average, 90%'ile, or any other, that JMeter collects in the presence of coordinated omissions is completely wrong. They are wrong because the data they are all based on is wrong. It's easily demonstrable to be off by several orders of magnitude in real world situations, and in real world web applications.


That said, when users meet this type of system they will most likely abandon.. which is in itself a failure. JMeter doesn't have direct facilities to support this type of behaviour.

Failure, abandon, and backoff conditions are interesting commentary that does not replace the need to include them in percentiles and averages. When someone says "my web application exhibits a 99%'ile response time of 700msec" the recipient of this information doesn't hear "99% of the good results will be 700msec or less". They hear "99% of ALL requests attempted with this system will respond within 700 msec or less." That includes all requests that may have resulted in users walking away in anger, or seeing long response times while the system was stalled for some reason.


Coordinated Omission is a basic problem that can happen due to many, many different reasons and causes. It is made up of two simple things: One is the Omission of some results or samples from the final data set. Second is the Coordination of such omissions with other behavior, such that it is not random. Random omission is usually not a problem. That's just sampling, and random sampling works. Coordinated Omission is a problem because it is effectively [highly] biased sampling. When Coordinated Omission occurs, the resulting data set is biased towards certain behaviors (like good response times), leading ALL statistics on the resulting data set to be highly suspect (read: "usually completely wrong and off by orders of magnitude") in describing response time or latency behavior of the observed system.

In JMeter, Coordinated Omission occurs whenever a thread doesn't execute its test plan as planned, and does so in reaction to behavior it encounters. This is most often caused by the simple and inherent synchronous nature of test plans as they are stated in JMeter: when a specific request takes longer to respond than it would have taken the thread to send the next request in the plan, the very fact that the thread did not send the next request out on time as planned is a coordinated omission: It is the effective removal of a response time result that would have been in the data set had the coordination not happened. It is "omission" since a measurement that should have occurred didn't happen and was not recorded. It is "coordinated" because the omission is not random, and is correlated-with/influenced-by the occurrence of another longer than normal response time occurrence.

The work done with the OutlierCorrector in JMeter focused on detecting CO in streams of measured results reported to listeners, and inserting "fake" results into the stream to represent the missing, omitted results that should have been there. OutlierCorrector also has a log file corrector that can fix JMeter logs offline, and after the fact by applying the same logic.

Right, but this is for a fixed transactional rate which is typically seen in machine-to-machine HFTS. In Web apps, perhaps the most common use case for JMeter, client back-off due to back pressure is a common behaviour and it's one that doesn't harm the testing process in the sense that if the server can't retire transactions fast enough.. JMeter will expose it. If you want to prove 5 9's, then I agree, you've got a problem.

Actually the corrector adjusts to the current transactional rate with a configurable moving window average.

Fixed transactional rates are no more common in HFTS than they are in web applications, and client backoff is just as common there. But this has nothing to do with HFTS. In all systems with synchronous clients, whether they take 20usec or 2 seconds for a typical response, the characterization and description of response time behavior should have nothing to do with backing off. And in both types of systems, once backoff happens, coordinated omission has kicked in and your data is contaminated.

As a trivial hypothetical, imagine a web system that regularly stalls for 3 consecutive seconds out of every 40 seconds of operation under a regular load of 200 requests per second, but responds promptly in 20 msec or less the rest of the time. This scenario can easily be found in the real world for GC pauses under high load for untuned systems. The 95%'ile response time of such a system clearly can't be described as lower than 2 seconds without outright lying. But here is what JMeter will report for such a system:

- For 2 threads running 100 requests per second each: 95%'ile will show 20 msec.
- For 20 threads running 10 requests per second each: 95%'ile will show 20 msec.
- For 200 threads running 1 request per second each: 95%'ile will show 20 msec.
- For 400 threads running 1 request every 2 seconds: 95%'ile will show 20 msec.
- For 2000 threads running 1 request every 10 seconds: 95%'ile will show ~1 second.
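
(To make the first line concrete as a back-of-envelope check: 2 threads
at 100 requests/second plan 8,000 requests per 40-second window. During
the 3-second stall each thread records just one blocked sample, so the
window yields ~7,400 good samples and 2 bad ones - roughly 0.03% bad,
hence a 20 msec 95%'ile. Measured against the plan, though, ~600 of the
8,000 intended requests (7.5%) would have seen stall-affected times,
which is why an honest 95%'ile lands around a second, as the 2000-thread
line shows, rather than at 20 msec.)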

Clearly there is a configuration for JMeter that would expose this behavior: it's the one where the gap between requests any one client would send is higher than the largest pause the system ever exhibits. It is also the only one in the list that does not exhibit coordinated omission. But what if the real world for this system simply involved 200 interactive clients, that actually hit it with a request once every 1 second or so (think simple web based games)? JMeter's test plans for that actual real world scenario would show a 95%'ile response time result that is 100x off from reality.

And yes, in this world the real users playing this game would probably "back off". Or they may do the reverse (click repeatedly in frustration, getting a response after 3 seconds). Neither behavior would improve the 95%ile behavior or excuse a 100x off report.


It's not that I disagree with you or I don't understand what you're saying, it's just that I'm having difficulty mapping it back to the world that people on this list have to deal with. w.r.t, I've a feeling that our views are somewhat tainted by the worlds we live in. In the HTTP world, CO exists and I accept it as natural behaviour. In your world CO exists but it cannot be accepted.

Just because I also dabble in High frequency trading systems and 10 usec responses doesn't mean I forgot about large and small scale web applications with real people at the end, and with human response times.

My world covers many more HTTP people than it does HFT people. Many people's first reaction to realizing how badly Coordinated Omission affects the accuracy of reported response times is "this applies to someone else's domain, but in mine things are still ok because of X". Unfortunately, the problem is almost universal in synchronous testers and in synchronous internal monitoring systems, and 95%+ ( ;-) ) of Web testing environments I have encountered have dramatically under-reported 95%, 99%, and all other %'iles. Unless those systems actually don't care about 5% failure rates, their business decisions are currently being based on bad data.

The problem is that mechanical sympathy is mostly about your world. I think there is a commonality between the two worlds but I think to find it we need more discussion. I'm not sure that this list is good for this purpose so I'm going to flip back to mechanical sympathy instead of hijacking this mailing list.

-- Kirk


-- Gil.


On Oct 18, 2013, at 9:54 AM, Kirk Pepperdine <ki...@gmail.com> wrote:

Hi Gil,

I would have to disagree as in this case I believe there is CO due to the threading model, CO on a per-thread basis as well as plain old omission. I believe these conditions are in addition to the conditions you're pointing to.

You may test at a fixed rate for HFT but in most worlds, random is necessary. Unfortunately that makes the problem more difficult to deal with.

Regards,
Kirk

On 2013-10-18, at 5:32 PM, Gil Tene <gi...@azulsystems.com> wrote:

I don't think the thread model is the core of the Coordinated Omission problem. Unless we consider sending no more than one request per 20 minutes from any given thread to be a threading model fix. It's more of a configuration choice the way I see it, but a pretty impossible one. The thread model may need work for other reasons, but CO is not one of them.

In JMeter, as with all other synchronous testers, Coordinated Omission is a per-thread issue. It's easy to demonstrate CO with JMeter with a single client thread testing an application that has only a single client connection in the real world, or with 15 client threads testing an application that has exactly 15 real-world clients communicating at high rates (common with muxed environments, messaging, ESBs, trading systems, etc.). No amount of threading or concurrency will help capture better test results for these very real systems. Any occurrence of CO will make the JMeter results seriously bogus.

When any one thread misses a planned request sending time, CO has already occurred, and there is no way to avoid it at that point. You can certainly detect that CO has happened. The question is what to do about it in JMeter once you detect it. The major options are:

1. Ignore it and keep working with the data as if it actually meant anything. This amounts to http://tinyurl.com/o46doqf .

2. You can try to change the tester behavior to avoid CO going forward. E.g. you can try to adjust the number of threads up AND at the same time reduce the frequency at which each thread sends requests, which will amount to drastically changing the test plan in reaction to system behavior. In my opinion, changing behavior dynamically will have very limited effectiveness for two reasons: The first is that the problem had already occurred, so all the data up to and including the observed CO is already bogus and has to be thrown away unless it can be corrected somehow. Only after you auto-adjust enough times to not see CO for a long time will your results during that time be valid. The second is that changing the test scenario is valid (and possible) for very few real world systems.

3. You can try to correct for CO when you observe it. There are various ways this can be done, and most of them will amount to re-creating missing test sample results by projecting from past results. This can help correct the results data set so that it would better approximate what a tester that was not synchronous, and would have kept issuing requests per the actual test plan, would have experienced in the test.

4. Something else we hadn't yet thought about.

Some correction and detection example work can be found at: https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9 , which uses code at https://github.com/OutlierCorrector/OutlierCorrector . Michael Chmiel worked at Azul Systems over the summer on this problem, and the OutlierCorrector package and the small patch to JMeter (under the docs-2.9 branch) are some of the results of that work. This fix approach appears to work well as long as no explicitly random behavior is stated in the test scenarios (the outlier detector detects a test pattern and repeats it in repairing the data. Expressly random scenarios will not exhibit a detectable pattern.).

-- Gil.

On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <ki...@gmail.com> wrote:

Hi Sebb,

In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to come to some way of a non-disruptive change in JMeter to make this happen.

The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool should be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result will be a far more scalable product that will require far fewer threads to produce far higher loads on the server.
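
A minimal sketch of that model (class names are hypothetical, not
JMeter's real ones):

    import java.util.concurrent.*;

    // A time-sorted event heap drained by a small worker pool, so a
    // slow response never delays later events.
    public class EventHeapSketch {
        static final class FireEvent implements Delayed {
            final long fireAtMillis;
            final Runnable sampler;
            FireEvent(long fireAtMillis, Runnable sampler) {
                this.fireAtMillis = fireAtMillis;
                this.sampler = sampler;
            }
            public long getDelay(TimeUnit unit) {
                return unit.convert(fireAtMillis - System.currentTimeMillis(),
                        TimeUnit.MILLISECONDS);
            }
            public int compareTo(Delayed other) {
                return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                        other.getDelay(TimeUnit.MILLISECONDS));
            }
        }

        private final DelayQueue<FireEvent> heap = new DelayQueue<>();
        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        public void schedule(long fireAtMillis, Runnable sampler) {
            heap.put(new FireEvent(fireAtMillis, sampler));
        }

        // take() blocks until the head event is due; lateness at this
        // point is a direct CO signal.
        public void runDispatcher() throws InterruptedException {
            while (true) {
                FireEvent e = heap.take();
                long late = System.currentTimeMillis() - e.fireAtMillis;
                if (late > 0) System.err.println("fired " + late + " ms late");
                pool.submit(e.sampler);
            }
        }
    }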

As for your idea on using the throughput controller: IMHO triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results, in other words they are omitted from the data set, you've only made the problem worse as you are filtering out bad data points from the result sets, making the results better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before event triggering.

Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results; it offers techniques to calculate and correct for CO. Also Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.
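
For reference, HdrHistogram's CO-compensating recording looks roughly
like this (the interval and latency values below are assumed examples):

    import org.HdrHistogram.Histogram;

    public class HdrExample {
        public static void main(String[] args) {
            // Track values up to 1 hour in nanoseconds, 3 significant digits.
            Histogram histogram = new Histogram(3600000000000L, 3);
            long expectedIntervalNanos = 10000000L; // plan: a sample per 10 ms

            // A 2-second stall: recordValueWithExpectedInterval also
            // back-fills the ~199 samples the stalled tester failed to issue.
            histogram.recordValueWithExpectedInterval(2000000000L,
                    expectedIntervalNanos);
            histogram.recordValue(500000L); // an ordinary fast sample

            System.out.println("99%'ile (ns): "
                    + histogram.getValueAtPercentile(99.0));
        }
    }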

On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed as it could be used to help understand the source of CO.. if not just complement JMeter's view of latency.

Regards,
Kirk


On 2013-10-18, at 12:27 AM, sebb <se...@gmail.com> wrote:

It looks to be quite difficult to avoid the issue of Coordinated
Omission without a major redesign of JMeter.

However, it may be a lot easier to detect when the condition has occurred.
This would potentially allow the test settings to be changed to reduce
or eliminate the occurrences - e.g. by increasing the number of
threads or spreading the load across more JMeter instances.

The Constant Throughput Controller calculates the desired wait time,
and if this is less than zero - i.e. a sample should already have been
generated - it could trigger the creation of a failed Assertion
showing the time difference.

Would this be sufficient to detect all CO occurrences?
If not, what other metric needs to be checked?

Even if it is not the only possible cause, would it be useful as a
starting point?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by Kirk Pepperdine <ki...@gmail.com>.
On 2013-10-19, at 9:56 AM, Gil Tene <gi...@azulsystems.com> wrote:

> To focus on the "how to deal with Coordinated Omission" part:
> 
> There are two main ways to deal with CO in your actual executed behavior:
> 
> 1. Change the behavior to avoid CO to begin with. 
> 
> 2. Detect it and correct it.

I'll add detect and report. I believe there is value beyond "you can't believe the data": it's telling you that there is a condition that you need to eliminate from your test.
> 
> There is a "detect it and report it" one too, but I don't think it is of any real use, as detection without correction will just tell you your data can't be believed at all, but won't tell you anything about what can be. Since CO can move percentile magnitudes and position by literal multiple orders of magnitude (I have multiple measured real world production behaviors that show this), "hoping it is not too bad" when you know it is there amounts to burying your head in the sand.
> 
> So Kirk, is the random behavior you need one of random timing, or random operation sequencing (or both)?

I need operations to occur at a random interval. That said, the interval is "random" to the server and *not* to JMeter. JMeter can pre-calculate when certain events should occur and then detect when it misses that target. The easiest way to do this is to build an event (sampler??) queue that understands when things such as the next HTTP sampler should be fired.

Regards,
Kirk
 
> 
> Sent from my iPad
> 
> On Oct 18, 2013, at 10:48 PM, "Kirk Pepperdine" <ki...@gmail.com> wrote:
> 
>> 
>> On 2013-10-19, at 1:33 AM, Gil Tene <gi...@azulsystems.com> wrote:
>> 
>>> I guess we look at human response back pressure in different ways. It's a question of whether or not you consider the humans to be part of the system you are testing, and what you think your stats are supposed to represent.
>> 
>> You've seen my presentations and so you know that I do believe that human and non-human actors are definitively part of the system. They provide the dynamics for the system being tested. A change in how that layer in my model works can and does makes a huge difference in how the other layers work to support the overall system.
>>> 
>>> Some people will take the "forgiving" approach, which considers the client behavior as part of the overall system behavior. In such an approach, if a human responded to slow behavior by not asking any more questions for a while, that's simply what the overall system did, and the stats reported should reflect only the actual attempts that actual humans would have made, including their slowing down their requests in response to slow reaction times.
>> 
>> Sort of. I want to know that a user was inhibited from making forward progress because the previous step in their workflow blew stated tolerances. In some cases I'd like to have that user abandon. I'm not sure I'd call this forgiving though I am looking to see what the overall system can do to answer the question; is it good enough and if not, why not.
>> 
>> I'm not going to suggest your view is incorrect. I think it's quite valid. I don't believe the two views are orthogonal; there are elements of both in each. The question here in more practical terms is: what needs to be done to reduce the level of CO that currently occurs in JMeter, and how should we react to it? Throwing out entire datasets from runs seems like an academic answer to a more practical question: will our application stand up when under load? From my point of view, the aim is for JMeter to better answer that question.
>> 
>>> 
>>> A web site being completely down for 5 minutes an hour would generate a lot of human back pressure response. It may even slow down request rates so much during the outage that 99%+ of the overall actual requests by end users during an hour that included such a 5 minute outage would still be very good. Reporting on those (actual requests by humans) would be very different from reporting on what would have happened without human back pressure. But it's easy to examine which of the two reporting methods would be accepted by a reader of such reports.
>> 
>> But then that 5 minute outage is going to show up somewhere and if you bury it in how you report.... that would seem to be a problem. This whole argument suggests that what you want is a better regime for the treatment of the data. If that is what you're saying, we're in complete agreement. The 5 minute pause should not be filtered out of the data!
>> 
>> IMHO, the first thing to do is eliminate or reduce the known sources of CO from JMeter. I'm not sure that tackling the CTT is the best way to go. In fact I'd prefer a combination of approaches that includes things like how jHiccup works with a GC STW detector. As you've mentioned before, even with a fix to the threading model in JMeter, CO will still occur.
>> 
>> Regards,
>> Kirk
>> 


Re: Coordinated Omission (CO) - possible strategies

Posted by Gil Tene <gi...@azulsystems.com>.
To focus on the "how to deal with Coordinated Omission" part:

There are two main ways to deal with CO in your actual executed behavior:

1. Change the behavior to avoid CO to begin with.

2. Detect it and correct it.

There is a "detect it and report it" one too, but I don't think it is of any real use, as detection without correction will just tell you your data can't be believed at all, but won't tell you anything about what can be. Since CO can move percentile magnitudes and position by literal multiple orders of magnitude (I have multiple measured real world production behaviors that show this), "hoping it is not too bad" when you know it is there amounts to burying your head in the sand.

Avoiding CO [option 1] is obviously preferable where possible. E.g. In load generators this can be achieved if everything the load generator does is made asynchronous, or by making sure that any synchronous part will never attempt to send messages closer together in time than the largest possible stall the system under test may ever experience (with some extra padding, this means "no closer than 10 minutes apart").

But avoiding CO in your actual measured results is unfortunately impractical for many systems. E.g. In systems where actual individual clients interact with the system using in-order transports (like TCP) with actual inter-request time gaps that are shorter than stalls that occur in the system, CO will absolutely occur, both in the real world and in any tester that emulates it.

Correcting CO [option 2] is what you have to do if CO exists in the data measured by actual-executed-stuff. Correction inevitably amounts to "filling in the gaps" by projecting (without certainty or actual knowledge) a modeled behavior onto those gaps and adding data points to the data set that did not actually get measured, but "would have" had CO not stopped the measurements from being taken at the right points. There are various ways to correct CO in such data sets, and how well they do depends on how much we know about the behavior of the system around the gaps and how much we know about the gaps themselves (e.g. knowing that an actual complete stall occurred is very useful).
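
One simple projection scheme, sketched under the assumption of a fixed
planned interval and a complete stall (real correctors such as the
OutlierCorrector model the surrounding behavior more carefully):

    import java.util.ArrayList;
    import java.util.List;

    // Fill the gap behind one late sample: each request that should
    // have been sent during the stall would have waited correspondingly
    // less before the system came back.
    public class GapFiller {
        public static List<Long> correct(long measuredLatencyMillis,
                                         long plannedIntervalMillis) {
            List<Long> samples = new ArrayList<>();
            samples.add(measuredLatencyMillis);
            for (long v = measuredLatencyMillis - plannedIntervalMillis;
                 v >= plannedIntervalMillis;
                 v -= plannedIntervalMillis) {
                samples.add(v);
            }
            return samples;
        }

        public static void main(String[] args) {
            // A 3000 ms stall against a 1000 ms plan yields [3000, 2000, 1000].
            System.out.println(correct(3000, 1000));
        }
    }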

I think JMeter falls squarely into the synchronous tester camp, and that's not going to change. Given that many (most?) systems it measures use TCP as a transport and naturally exhibit system stalls that are longer than inter-request times in actual use behaviors, I see eliminating CO from JMeter's actual measured results as hopeless. Coordinated Omission in JMeter is just part of life, and we have to deal with it. I therefore focus on the "how to correct" part of the equation.

Having played with correction techniques, I can say that random operation sequences (not random timing) are the hardest thing to deal with. Not necessarily impossible, but really hard. Random timing, on the other hand, is easily dealt with for correction purposes, as projecting known, non-random sequences of operations into the CO gaps can be done just as well based on averaged timing data.

So Kirk, is the random behavior you need one of random timing, or random operation sequencing (or both)?

Sent from my iPad

On Oct 18, 2013, at 10:48 PM, "Kirk Pepperdine" <ki...@gmail.com> wrote:


On 2013-10-19, at 1:33 AM, Gil Tene <gi...@azulsystems.com> wrote:

I guess we look at human response back pressure in different ways. It's a question of whether or not you consider the humans to be part of the system you are testing, and what you think your stats are supposed to represent.

You've seen my presentations and so you know that I do believe that human and non-human actors are definitively part of the system. They provide the dynamics for the system being tested. A change in how that layer in my model works can and does makes a huge difference in how the other layers work to support the overall system.

Some people will take the "forgiving" approach, which considers the client behavior as part of the overall system behavior. In such an approach, if a human responded to slow behavior by not asking any more questions for a while, that's simply what the overall system did, and the stats reported should reflect only the actual attempts that actual humans would have made, including their slowing down their requests in response to slow reaction times.

Sort of. I want to know that a user was inhibited from making forward progress because the previous step in their workflow blew stated tolerances. In some cases I'd like to have that user abandon. I'm not sure I'd call this forgiving though I am looking to see what the overall system can do to answer the question; is it good enough and if not, why not.

I'm not going to suggest your view is incorrect. I think it's quite valid. I don't believe the two views are orthogonal; there are elements of both in each. The question here in more practical terms is: what needs to be done to reduce the level of CO that currently occurs in JMeter, and how should we react to it? Throwing out entire datasets from runs seems like an academic answer to a more practical question: will our application stand up when under load? From my point of view, the aim is for JMeter to better answer that question.


A web site being completely down for 5 minutes an hour would generate a lot of human back pressure response. It may even slow down request rates so much during the outage that 99%+ of the overall actual requests by end users during an hour that included such a 5 minute outage would still be very good. Reporting on those (actual requests by humans) would be very different from reporting on what would have happened without human back pressure. But it's easy to examine which of the two reporting methods would be accepted by a reader of such reports.

But then that 5 minute outage is going to show up somewhere and if you bury it in how you report.... that would seem to be a problem. This whole argument suggests that what you want is a better regime for the treatment of the data. If that is what you're saying, we're in complete agreement. The 5 minute pause should not be filtered out of the data!

IMHO, the first thing to do is eliminate or reduce the known sources of CO from JMeter. I'm not sure that tackling the CTT is the best way to go. In fact I'd prefer a combination of approaches that includes things like how jHiccup works with a GC STW detector. As you've mentioned before, even with a fix to the threading model in JMeter, CO will still occur.

Regards,
Kirk


Re: Coordinated Omission (CO) - possible strategies

Posted by Kirk Pepperdine <ki...@gmail.com>.
On 2013-10-19, at 1:33 AM, Gil Tene <gi...@azulsystems.com> wrote:

> I guess we look at human response back pressure in different ways. It's a question of whether or not you consider the humans to be part of the system you are testing, and what you think your stats are supposed to represent.

You've seen my presentations and so you know that I do believe that human and non-human actors are definitively part of the system. They provide the dynamics for the system being tested. A change in how that layer in my model works can and does makes a huge difference in how the other layers work to support the overall system.
> 
> Some people will take the "forgiving" approach, which considers the client behavior as part of the overall system behavior. In such an approach, if a human responded to slow behavior by not asking any more questions for a while, that's simply what the overall system did, and the stats reported should reflect only the actual attempts that actual humans would have made, including their slowing down their requests in response to slow reaction times.

Sort of. I want to know that a user was inhibited from making forward progress because the previous step in their workflow blew stated tolerances. In some cases I'd like to have that user abandon. I'm not sure I'd call this forgiving though I am looking to see what the overall system can do to answer the question; is it good enough and if not, why not.

I'm not going to suggest your view is incorrect. I think it's quite valid. I don't believe the two views are orthogonal; there are elements of both in each. The question here in more practical terms is: what needs to be done to reduce the level of CO that currently occurs in JMeter, and how should we react to it? Throwing out entire datasets from runs seems like an academic answer to a more practical question: will our application stand up when under load? From my point of view, the aim is for JMeter to better answer that question.

> 
> A web site being completely down for 5 minutes an hour would generate a lot of human back pressure response. It may even slow down request rates so much during the outage that 99%+ of the overall actual requests by end users during an hour that included such a 5 minute outage would still be very good. Reporting on those (actual requests by humans) would be very different from reporting on what would have happened without human back pressure. But it's easy to examine which of the two reporting methods would be accepted by a reader of such reports.

But then that 5 minute outage is going to show up somewhere and if you bury it in how you report.... that would seem to be a problem. This whole argument suggests that what you want is a better regime for the treatment of the data. If that is what you're saying, we're in complete agreement. The 5 minute pause should not be filtered out of the data!

IMHO, the first thing to do is eliminate or reduce the known sources of CO from JMeter. I'm not sure that tackling the CTT is the best way to go. In fact I'd prefer a combination of approaches that includes things like how jHiccup works with a GC STW detector. As you've mentioned before, even with a fix to the threading model in JMeter, CO will still occur.

Regards,
Kirk


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org


Re: Coordinated Omission (CO) - possible strategies

Posted by Kirk Pepperdine <ki...@gmail.com>.
On 2013-10-18, at 11:12 PM, Gil Tene <gi...@azulsystems.com> wrote:

> This is not a problem that matters only in HFT or financial trading. Coordinated Omission is just as prevalent in making all your stats completely wrong in HTTP apps as it is in high frequency apps. It is unfortunately universal.

I'm not disagreeing with you. I am trying to sort out if it matters. One of the questions I have is what is the human response to back pressure in an HTTP based system? Next question, does my load injector (in this case JMeter) behave in the same way?

-- Kirk

> 
> On Oct 18, 2013, at 11:55 AM, Kirk Pepperdine <ki...@gmail.com> wrote:
> 
>> 
>> On 2013-10-18, at 7:43 PM, Gil Tene <gi...@azulsystems.com> wrote:
>> 
>>> I'm not saying the threading model doesn't have its own issues, or that those issues could not in themselves cause coordinated omission. I'm saying there is already a dominant, demonstrable, and classic case of CO in JMeter that doesn't have anything to do with the threading model, and will not go away no matter what is done to the threading model. As long as JMeter test plans are expressed as describing instructions for serial, synchronous, do-these-one-after-the-other scripts for what the tester should do for a given client, coordinated omission will easily occur in executing those instructions. I believe that this will not go away without changing how all JMeter test plans are expressed, and that is probably a non-starter. As a result, I think that building in logic that will correct for coordinated omission when it inevitably occurs, as opposed to trying to avoid its occurrence, is the only way to go for JMeter.
>> 
>> I can't disagree with you in that CO is present in a single threaded test. However, the nature of this type of load testing is that you play out a scenario because the results of the previous request are needed for the current request. Under those conditions you can't do much but wait until the back pressure clears or your initial request is retired. I think the best you can do under these circumstances is, just as Sebb has suggested, to flag the problem and move on. I wouldn't fail nor omit the result, but I'm not sure how you can correct, because the back pressure in this case will result in lower loads which will allow requests to retire at a rate higher than one should normally expect.
> 
> The only "correct" way to deal with detected coordinated omission is to either correct for it, or to throw away all latency or response time data acquired with it. "Flagging" it or "moving on" and keeping the other data for analysis is the same as saying "This data is meaningless, selective, and selectively represents only the best results the system demonstrated, while specifically dropping the vast majority of actual indications of bad behavior encountered during the test". 
> 
> To be clear, I'm saying that all response time data, including average, 90%'ile, or any other, that JMeter collects in the presence of coordinated omissions is completely wrong. They are wrong because the data they are all based on is wrong. It's easily demonstrable to be off by several orders of magnitude in real world situations, and in real world web applications.
> 
>> 
>> That said, when users meet this type of system they will most likely abandon.. which is in itself a failure. JMeter doesn't have direct facilities to support this type of behaviour.
> 
> Failure, abandon, and backoff conditions are interesting commentary that does not replace the need to include them in percentiles and averages. When someone says "my web application exhibits a 99%'ile response time of 700msec" the recipient of this information doesn't hear "99% of the good results will be 700msec or less". They hear "99% of ALL requests attempted with this system will respond within 700 msec or less." That includes all requests that may have resulted in users walking away in anger, or seeing long response times while the system was stalled for some reason.
> 
>>> 
>>> Coordinated Omission is a basic problem that can happen for many, many different reasons and causes. It is made up of two simple things: the first is the Omission of some results or samples from the final data set. The second is the Coordination of such omissions with other behavior, such that it is not random. Random omission is usually not a problem. That's just sampling, and random sampling works. Coordinated Omission is a problem because it is effectively [highly] biased sampling. When Coordinated Omission occurs, the resulting data set is biased towards certain behaviors (like good response times), leading ALL statistics on the resulting data set to be highly suspect (read: "usually completely wrong and off by orders of magnitude") in describing the response time or latency behavior of the observed system.
>>> 
>>> In JMeter, Coordinated Omission occurs whenever a thread doesn't execute its test plan as planned, and does so in reaction to behavior it encounters. This is most often caused by the simple and inherent synchronous nature of test plans as they are stated in JMeter: when a specific request takes longer to respond than it would have taken the thread to send the next request in the plan, the very fact that the thread did not send the next request out on time as planned is a coordinated omission. It is the effective removal of a response time result that would have been in the data set had the coordination not happened. It is "omission" since a measurement that should have occurred didn't happen and was not recorded. It is "coordinated" because the omission is not random, and is correlated with and influenced by the occurrence of another longer-than-normal response time.
>>> 
>>> The work done with the OutlierCorrector in JMeter focused on detecting CO in streams of measured results reported to listeners, and on inserting "fake" results into the stream to represent the missing, omitted results that should have been there. OutlierCorrector also has a log file corrector that can fix JMeter logs offline, after the fact, by applying the same logic.
>> 
>> Right, but this is for a fixed transactional rate, which is typically seen in machine-to-machine HFTS. In Web apps, perhaps the most common use case for JMeter, client back-off due to back pressure is a common behaviour, and it's one that doesn't harm the testing process, in the sense that if the server can't retire transactions fast enough, JMeter will expose it. If you want to prove 5 9's, then I agree, you've got a problem.
> 
> Actually, the corrector adjusts to the current transactional rate with a configurable moving window average.
> 
> Fixed transactional rates are no more common in HFTS than they are in web applications, and client backoff is just as common there. But this has nothing to do with HFTS. In all systems with synchronous clients, whether they take 20 usec or 2 seconds for a typical response, the characterization and description of response time behavior should have nothing to do with backing off. And in both types of systems, once backoff happens, coordinated omission has kicked in and your data is contaminated.
> 
> As a trivial hypothetical, imagine a web system that regularly stalls for 3 consecutive seconds out of every 40 seconds of operation under a regular load of 200 requests per second, but responds promptly in 20 msec or less the rest of the time. This scenario can easily be found in the real world, e.g. GC pauses under high load in untuned systems. The 95%'ile response time of such a system clearly can't be described as lower than 2 seconds without outright lying. But here is what JMeter will report for such a system:
> 
> - For 2 threads running 100 requests per second each: 95%'ile will show 20 msec.
> - For 20 threads running 10 requests per second each: 95%'ile will show 20 msec.
> - For 200 threads running 1 request per second each: 95%'ile will show 20 msec.
> - For 400 threads running 1 request every 2 seconds: 95%'ile will show 20 msec.
> - For 2000 threads running 1 request every 10 seconds: 95%'ile will show ~1 second.
> 
> Clearly there is a configuration of JMeter that would expose this behavior: it's the one where the gap between requests any one client would send is larger than the largest pause the system ever exhibits. It is also the only one in the list that does not exhibit coordinated omission. But what if the real world for this system simply involved 200 interactive clients that actually hit it with a request once every 1 second or so (think simple web-based games)? JMeter's test plan for that actual real world scenario would show a 95%'ile response time result that is 100x off from reality.
> 
> And yes, in this world the real users playing this game would probably "back off". Or they may do the reverse (click repeatedly in frustration, getting a response after 3 seconds). Neither behavior would improve the 95%'ile behavior or excuse a 100x-off report.
> 
>> 
>> It's not that I disagree with you or don't understand what you're saying; it's just that I'm having difficulty mapping it back to the world that people on this list have to deal with. With respect to that, I've a feeling that our views are somewhat tainted by the worlds we live in. In the HTTP world, CO exists and I accept it as natural behaviour. In your world CO exists but it cannot be accepted.
> 
> Just because I also dabble in high frequency trading systems and 10 usec responses doesn't mean I forgot about large and small scale web applications with real people at the end, and with human response times.
> 
> My world covers many more HTTP people than it does HFT people. Many people's first reaction to realizing how badly Coordinated Omission affects the accuracy of reported response times is "this applies to someone else's domain, but in mine things are still ok because of X". Unfortunately, the problem is almost universal in synchronous testers and in synchronous internal monitoring systems, and 95%+ ( ;-) ) of the Web testing environments I have encountered have dramatically under-reported the 95%, 99%, and all other %'iles. Unless those systems actually don't care about 5% failure rates, their business decisions are currently being based on bad data.
> 
>> The problem is that mechanical sympathy is mostly about your world. I think there is a commonality between the two worlds, but to find it we need more discussion. I'm not sure that this list is good for that purpose, so I'm going to flip back to mechanical sympathy instead of hijacking this mailing list.
> 
>> 
>> -- Kirk
>> 
>>> 
>>> -- Gil.
>>> 
>>> 
>>> On Oct 18, 2013, at 9:54 AM, Kirk Pepperdine <ki...@gmail.com> wrote:
>>> 
>>>> Hi Gil,
>>>> 
>>>> I would have to disagree, as in this case I believe there is CO due to the threading model, CO on a per-thread basis, as well as plain old omission. I believe these conditions are in addition to the conditions you're pointing to.
>>>> 
>>>> You may test at a fixed rate for HFT, but in most worlds random is necessary. Unfortunately that makes the problem more difficult to deal with.
>>>> 
>>>> Regards,
>>>> Kirk
>>>> 
>>>> On 2013-10-18, at 5:32 PM, Gil Tene <gi...@azulsystems.com> wrote:
>>>> 
>>>>> I don't think the thread model is the core of the Coordinated Omission problem, unless we consider sending no more than one request per 20 minutes from any given thread to be a threading model fix. That's more of a configuration choice the way I see it, but a pretty impossible one. The thread model may need work for other reasons, but CO is not one of them.
>>>>> 
>>>>> In JMeter, as with all other synchronous testers, Coordinated Omission is a per-thread issue. It's easy to demonstrate CO with JMeter with a single client thread testing an application that has only a single client connection in the real world, or with 15 client threads testing an application that has exactly 15 real-world clients communicating at high rates (common with muxed environments, messaging, ESBs, trading systems, etc.). No amount of threading or concurrency will help capture better test results for these very real systems. Any occurrence of CO will make the JMeter results seriously bogus.
>>>>> 
>>>>> When any one thread misses a planned request sending time, CO has already occurred, and there is no way to avoid it at that point. You can certainly detect that CO has happened. The question is what to do about it in JMeter once you detect it. The major options are:
>>>>> 
>>>>> 1. Ignore it and keep working with the data as if it actually meant anything. This amounts to http://tinyurl.com/o46doqf .
>>>>> 
>>>>> 2. You can try to change the tester behavior to avoid CO going forward. E.g. you can try to adjust the number of threads up AND, at the same time, the frequency at which each thread sends requests, which will amount to drastically changing the test plan in reaction to system behavior. In my opinion, changing behavior dynamically will have very limited effectiveness, for two reasons. The first is that the problem has already occurred, so all the data up to and including the observed CO is already bogus and has to be thrown away unless it can be corrected somehow. Only after you have auto-adjusted enough times to see no CO for a long time may your results during that time be valid. The second is that changing the test scenario is valid (and possible) for very few real world systems.
>>>>> 
>>>>> 3. You can try to correct for CO when you observe it. There are various ways this can be done, and most of them will amount to re-creating missing test sample results by projecting from past results. This can help correct the results data set so that it better approximates what a tester that was not synchronous, and that would have kept issuing requests per the actual test plan, would have experienced in the test.
>>>>> 
>>>>> 4. Something else we haven't yet thought about.
>>>>> 
>>>>> Some correction and detection example work can be found at https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9 , which uses code at https://github.com/OutlierCorrector/OutlierCorrector . Michael Chmiel worked at Azul Systems over the summer on this problem, and the OutlierCorrector package and the small patch to JMeter (under the docs-2.9 branch) are some of the results of that work. This fix approach appears to work well as long as no explicitly random behavior is stated in the test scenarios (the outlier detector detects a test pattern and repeats it in repairing the data; expressly random scenarios will not exhibit a detectable pattern).
>>>>> 
>>>>> -- Gil.
>>>>> 
>>>>> On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <ki...@gmail.com>
>>>>>  wrote:
>>>>> 
>>>>>> Hi Sebb,
>>>>>> 
>>>>>> In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control the rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to agree on a non-disruptive change to JMeter that would make this happen.
>>>>>> 
>>>>>> The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool would be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result would be a far more scalable product that requires far fewer threads to produce far higher loads on the server.
>>>>>> 
>>>>>> As for your idea on using the throughput controller: IMHO triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results, in other words if they are omitted from the data set, you've only made the problem worse, as you are filtering out bad data points from the result sets, making the results look better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before the event actually triggered.
>>>>>> 
>>>>>> Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results; it offers techniques to calculate and correct for CO. Also, Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.
>>>>>> 
>>>>>> On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed, as it could be used to help understand the source of CO, if not just to complement JMeter's view of latency.
>>>>>> 
>>>>>> Regards,
>>>>>> Kirk
>>>>>> 
>>>>>> 
>>>>>> On 2013-10-18, at 12:27 AM, sebb <se...@gmail.com> wrote:
>>>>>> 
>>>>>>> It looks to be quite difficult to avoid the issue of Coordinated
>>>>>>> Omission without a major redesign of JMeter.
>>>>>>> 
>>>>>>> However, it may be a lot easier to detect when the condition has occurred.
>>>>>>> This would potentially allow the test settings to be changed to reduce
>>>>>>> or eliminate the occurrences - e.g. by increasing the number of
>>>>>>> threads or spreading the load across more JMeter instances.
>>>>>>> 
>>>>>>> The Constant Throughput Timer calculates the desired wait time,
>>>>>>> and if this is less than zero - i.e. a sample should already have been
>>>>>>> generated - it could trigger the creation of a failed Assertion
>>>>>>> showing the time difference.
>>>>>>> 
>>>>>>> Would this be sufficient to detect all CO occurrences?
>>>>>>> If not, what other metric needs to be checked?
>>>>>>> 
>>>>>>> Even if it is not the only possible cause, would it be useful as a
>>>>>>> starting point?
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
>>>>>>> For additional commands, e-mail: user-help@jmeter.apache.org
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Coordinated Omission (CO) - possible strategies

Posted by Gil Tene <gi...@azulsystems.com>.
[FYI - this is resent out of order due to a bounce. Some replies to this have already been posted].

This is not a problem that matters only in HFT or financial trading. Coordinated Omission is just as effective at making all your stats completely wrong in HTTP apps as it is in high frequency apps. It is unfortunately universal.

On Oct 18, 2013, at 11:55 AM, Kirk Pepperdine <ki...@gmail.com>
 wrote:


On 2013-10-18, at 7:43 PM, Gil Tene <gi...@azulsystems.com> wrote:

I'm not saying the threading model doesn't have its own issues, or that those issues could not in themselves cause coordinated omission. I'm saying there is already a dominant, demonstrable, and classic case of CO in JMeter that has nothing to do with the threading model, and that will not go away no matter what is done to the threading model. As long as JMeter test plans are expressed as serial, synchronous, do-these-one-after-the-other scripts describing what the tester should do for a given client, coordinated omission will easily occur in executing those instructions. I believe that this will not go away without changing how all JMeter test plans are expressed, and that is probably a non-starter. As a result, I think that building in logic that will correct for coordinated omission when it inevitably occurs, as opposed to trying to avoid its occurrence, is the only way to go for JMeter.

I can't disagree with you that CO is present in a single-threaded test. However, the nature of this type of load testing is that you play out a scenario because the results of the previous request are needed for the current request. Under those conditions you can't do much but wait until the back pressure clears or your initial request is retired. I think the best you can do under these circumstances is, just as Sebb has suggested, to flag the problem and move on. I wouldn't fail or omit the result, but I'm not sure how you can correct it, because the back pressure in this case will result in lower loads, which will allow requests to retire at a rate higher than one should normally expect.

The only "correct" way to deal with detected coordinated omission is to either correct for it, or to throw away all latency or response time data acquired with it. "Flagging" it or "moving on" and keeping the other data for analysis is the same as saying "This data is meaningless: it selectively represents only the best results the system demonstrated, while specifically dropping the vast majority of actual indications of bad behavior encountered during the test".

To be clear, I'm saying that all response time data, including the average, the 90%'ile, or any other statistic, that JMeter collects in the presence of coordinated omission is completely wrong. It is wrong because the data it is based on is wrong. It's easily demonstrable to be off by several orders of magnitude in real world situations, and in real world web applications.


That said, when users meet this type of system they will most likely abandon it, which is in itself a failure. JMeter doesn't have direct facilities to support this type of behaviour.

Failure, abandonment, and backoff conditions are interesting commentary, but that does not replace the need to include them in percentiles and averages. When someone says "my web application exhibits a 99%'ile response time of 700 msec", the recipient of this information doesn't hear "99% of the good results will be 700 msec or less". They hear "99% of ALL requests attempted with this system will respond within 700 msec or less". That includes all requests that may have resulted in users walking away in anger, or seeing long response times while the system was stalled for some reason.


Coordinated Omission is a basic problem that can happen for many, many different reasons and causes. It is made up of two simple things: the first is the Omission of some results or samples from the final data set. The second is the Coordination of such omissions with other behavior, such that it is not random. Random omission is usually not a problem. That's just sampling, and random sampling works. Coordinated Omission is a problem because it is effectively [highly] biased sampling. When Coordinated Omission occurs, the resulting data set is biased towards certain behaviors (like good response times), leading ALL statistics on the resulting data set to be highly suspect (read: "usually completely wrong and off by orders of magnitude") in describing the response time or latency behavior of the observed system.

In JMeter, Coordinated Omission occurs whenever a thread doesn't execute its test plan as planned, and does so in reaction to behavior it encounters. This is most often caused by the simple and inherent synchronous nature of test plans as they are stated in JMeter: when a specific request takes longer to respond than it would have taken the thread to send the next request in the plan, the very fact that the thread did not send the next request out on time as planned is a coordinated omission. It is the effective removal of a response time result that would have been in the data set had the coordination not happened. It is "omission" since a measurement that should have occurred didn't happen and was not recorded. It is "coordinated" because the omission is not random, and is correlated with and influenced by the occurrence of another longer-than-normal response time.
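
To make the mechanics concrete, here is a minimal, self-contained sketch (made-up names, not JMeter's actual code) of a synchronous driver thread detecting exactly this condition, namely a planned send time that has already passed:

public class CoDetectSketch {
    static final long INTERVAL_MS = 100; // planned gap between requests

    public static void main(String[] args) throws InterruptedException {
        long nextPlannedSend = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            long lateBy = System.currentTimeMillis() - nextPlannedSend;
            if (lateBy > 0) {
                // The previous response returned after this slot's planned
                // send time: coordinated omission has already occurred.
                System.err.println("CO detected: " + lateBy + " ms late");
            } else {
                Thread.sleep(-lateBy); // on schedule: wait for the slot
            }
            long start = System.currentTimeMillis();
            doSample(); // synchronous request/response, may block arbitrarily
            record(System.currentTimeMillis() - start);
            nextPlannedSend += INTERVAL_MS;
        }
    }

    static void doSample() { /* stand-in for an HTTP sampler */ }
    static void record(long latencyMs) { /* stand-in for a results listener */ }
}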

The work done with the OutlierCorrector in JMeter focused on detecting CO in streams of measured results reported to listeners, and on inserting "fake" results into the stream to represent the missing, omitted results that should have been there. OutlierCorrector also has a log file corrector that can fix JMeter logs offline, after the fact, by applying the same logic.
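
The core back-fill idea can be sketched in a few lines (an illustration of the general approach, not the OutlierCorrector's actual code): when a result arrives later than the expected interval, synthesize results for the sends that should have happened in the meantime, each seeing the remainder of the stall.

import java.util.ArrayList;
import java.util.List;

public class BackfillSketch {
    // Returns the measured result plus synthetic results for omitted sends.
    static List<Long> correct(long measuredMs, long expectedIntervalMs) {
        List<Long> results = new ArrayList<>();
        results.add(measuredMs);
        // Each missed send slot would have seen the remainder of the stall.
        for (long missed = measuredMs - expectedIntervalMs;
                missed > 0; missed -= expectedIntervalMs) {
            results.add(missed);
        }
        return results;
    }

    public static void main(String[] args) {
        // A 1000 ms response with a 100 ms expected interval yields the real
        // sample plus nine synthetic ones: 900, 800, ..., 100 ms.
        System.out.println(correct(1000, 100));
    }
}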

Right, but this is for a fixed transactional rate, which is typically seen in machine-to-machine HFTS. In Web apps, perhaps the most common use case for JMeter, client back-off due to back pressure is a common behaviour, and it's one that doesn't harm the testing process, in the sense that if the server can't retire transactions fast enough, JMeter will expose it. If you want to prove 5 9's, then I agree, you've got a problem.

Actually, the corrector adjusts to the current transactional rate with a configurable moving window average.

Fixed transactional rates are no more common in HFTS than they are in web applications, and client backoff is just as common there. But this has nothing to do with HFTS. In all systems with synchronous clients, whether they take 20 usec or 2 seconds for a typical response, the characterization and description of response time behavior should have nothing to do with backing off. And in both types of systems, once backoff happens, coordinated omission has kicked in and your data is contaminated.

As a trivial hypothetical, imagine a web system that regularly stalls for 3 consecutive seconds out of every 40 seconds of operation under a regular load of 200 requests per second, but responds promptly in 20 msec or less the rest of the time. This scenario can easily be found in the real world, e.g. GC pauses under high load in untuned systems. The 95%'ile response time of such a system clearly can't be described as lower than 2 seconds without outright lying. But here is what JMeter will report for such a system:

- For 2 threads running 100 requests per second each: 95%'ile will show 20 msec.
- For 20 threads running 10 requests per second each: 95%'ile will show 20 msec.
- For 200 threads running 1 request per second each: 95%'ile will show 20 msec.
- For 400 threads running 1 request every 2 seconds: 95%'ile will show 20 msec.
- For 2000 threads running 1 request every 10 seconds: 95%'ile will show ~1 second.
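
Working the first configuration through shows why: in each 40-second cycle the two synchronous threads record roughly 2 x 100 x 37 = 7,400 fast samples, but during the 3-second stall each thread is blocked on a single in-flight request, so only 2 slow samples get recorded instead of the ~600 that the real 200 requests/second load would have experienced. 2 out of ~7,400 samples is about 0.03%, far below the 5% needed to move the 95%'ile.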

Clearly there is a configuration of JMeter that would expose this behavior: it's the one where the gap between requests any one client would send is larger than the largest pause the system ever exhibits. It is also the only one in the list that does not exhibit coordinated omission. But what if the real world for this system simply involved 200 interactive clients that actually hit it with a request once every 1 second or so (think simple web-based games)? JMeter's test plan for that actual real world scenario would show a 95%'ile response time result that is 100x off from reality.

And yes, in this world the real users playing this game would probably "back off". Or they may do the reverse (click repeatedly in frustration, getting a response after 3 seconds). Neither behavior would improve the 95%'ile behavior or excuse a 100x-off report.


It's not that I disagree with you or don't understand what you're saying; it's just that I'm having difficulty mapping it back to the world that people on this list have to deal with. With respect to that, I've a feeling that our views are somewhat tainted by the worlds we live in. In the HTTP world, CO exists and I accept it as natural behaviour. In your world CO exists but it cannot be accepted.

Just because I also dabble in high frequency trading systems and 10 usec responses doesn't mean I forgot about large and small scale web applications with real people at the end, and with human response times.

My world covers many more HTTP people than it does HFT people. Many people's first reaction to realizing how badly Coordinated Omission affects the accuracy of reported response times is "this applies to someone else's domain, but in mine things are still ok because of X". Unfortunately, the problem is almost universal in synchronous testers and in synchronous internal monitoring systems, and 95%+ ( ;-) ) of the Web testing environments I have encountered have dramatically under-reported the 95%, 99%, and all other %'iles. Unless those systems actually don't care about 5% failure rates, their business decisions are currently being based on bad data.

The problem is that mechanical sympathy is mostly about your world. I think there is a commonality between the two worlds, but to find it we need more discussion. I'm not sure that this list is good for that purpose, so I'm going to flip back to mechanical sympathy instead of hijacking this mailing list.

-- Kirk


-- Gil.


On Oct 18, 2013, at 9:54 AM, Kirk Pepperdine <ki...@gmail.com> wrote:

Hi Gil,

I would have to disagree, as in this case I believe there is CO due to the threading model, CO on a per-thread basis, as well as plain old omission. I believe these conditions are in addition to the conditions you're pointing to.

You may test at a fixed rate for HFT, but in most worlds random is necessary. Unfortunately that makes the problem more difficult to deal with.

Regards,
Kirk

On 2013-10-18, at 5:32 PM, Gil Tene <gi...@azulsystems.com> wrote:

I don't think the thread model is the core of the Coordinated Omission problem, unless we consider sending no more than one request per 20 minutes from any given thread to be a threading model fix. That's more of a configuration choice the way I see it, but a pretty impossible one. The thread model may need work for other reasons, but CO is not one of them.

In JMeter, as with all other synchronous testers, Coordinated Omission is a per-thread issue. It's easy to demonstrate CO with JMeter with a single client thread testing an application that has only a single client connection in the real world, or with 15 client threads testing an application that has exactly 15 real-world clients communicating at high rates (common with muxed environments, messaging, ESBs, trading systems, etc.). No amount of threading or concurrency will help capture better test results for these very real systems. Any occurrence of CO will make the JMeter results seriously bogus.

When any one thread misses a planned request sending time, CO has already occurred, and there is no way to avoid it at that point. You can certainly detect that CO has happened. The question is what to do about it in JMeter once you detect it. The major options are:

1. Ignore it and keep working with the data as if it actually meant anything. This amounts to http://tinyurl.com/o46doqf .

2. You can try to change the tester behavior to avoid CO going forward. E.g. you can try to adjust the number of threads up AND, at the same time, the frequency at which each thread sends requests, which will amount to drastically changing the test plan in reaction to system behavior. In my opinion, changing behavior dynamically will have very limited effectiveness, for two reasons. The first is that the problem has already occurred, so all the data up to and including the observed CO is already bogus and has to be thrown away unless it can be corrected somehow. Only after you have auto-adjusted enough times to see no CO for a long time may your results during that time be valid. The second is that changing the test scenario is valid (and possible) for very few real world systems.

3. You can try to correct for CO when you observe it. There are various ways this can be done, and most of them will amount to re-creating missing test sample results by projecting from past results. This can help correct the results data set so that it better approximates what a tester that was not synchronous, and that would have kept issuing requests per the actual test plan, would have experienced in the test.

4. Something else we haven't yet thought about.

Some correction and detection example work can be found at https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9 , which uses code at https://github.com/OutlierCorrector/OutlierCorrector . Michael Chmiel worked at Azul Systems over the summer on this problem, and the OutlierCorrector package and the small patch to JMeter (under the docs-2.9 branch) are some of the results of that work. This fix approach appears to work well as long as no explicitly random behavior is stated in the test scenarios (the outlier detector detects a test pattern and repeats it in repairing the data; expressly random scenarios will not exhibit a detectable pattern).

-- Gil.

On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <ki...@gmail.com>
 wrote:

Hi Sebb,

In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control the rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to agree on a non-disruptive change to JMeter that would make this happen.

The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool would be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result would be a far more scalable product that requires far fewer threads to produce far higher loads on the server.

As for your idea on using the throughput controller: IMHO triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results, in other words if they are omitted from the data set, you've only made the problem worse, as you are filtering out bad data points from the result sets, making the results look better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before the event actually triggered.

Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results; it offers techniques to calculate and correct for CO. Also, Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.

On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed, as it could be used to help understand the source of CO, if not just to complement JMeter's view of latency.

Regards,
Kirk


On 2013-10-18, at 12:27 AM, sebb <se...@gmail.com> wrote:

It looks to be quite difficult to avoid the issue of Coordinated
Omission without a major redesign of JMeter.

However, it may be a lot easier to detect when the condition has occurred.
This would potentially allow the test settings to be changed to reduce
or eliminate the occurrences - e.g. by increasing the number of
threads or spreading the load across more JMeter instances.

The Constant Throughput Timer calculates the desired wait time,
and if this is less than zero - i.e. a sample should already have been
generated - it could trigger the creation of a failed Assertion
showing the time difference.
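
In outline, the check might look like this (a sketch with made-up names, not the Timer's actual code):

public class CoAssertSketch {
    // Desired wait until the next sample, per the target throughput.
    static long calculateDelay(long nextScheduledMs) {
        long delay = nextScheduledMs - System.currentTimeMillis();
        if (delay < 0) {
            // A sample should already have been generated: flag a failed
            // Assertion carrying the time difference.
            flagCoordinatedOmission(-delay);
            return 0;
        }
        return delay;
    }

    static void flagCoordinatedOmission(long lateByMs) {
        System.err.println("CO: sample is " + lateByMs + " ms overdue");
    }
}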

Would this be sufficient to detect all CO occurrences?
If not, what other metric needs to be checked?

Even if it is not the only possible cause, would it be useful as a
starting point?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
For additional commands, e-mail: user-help@jmeter.apache.org









Re: Coordinated Omission (CO) - possible strategies

Posted by Kirk Pepperdine <ki...@gmail.com>.
On 2013-10-18, at 7:43 PM, Gil Tene <gi...@azulsystems.com> wrote:

> I'm not saying the threading model doesn't have its own issues, or that those issues could not in themselves cause coordinated omission. I'm saying there is already a dominant, demonstrable, and classic case of CO in JMeter that has nothing to do with the threading model, and that will not go away no matter what is done to the threading model. As long as JMeter test plans are expressed as serial, synchronous, do-these-one-after-the-other scripts describing what the tester should do for a given client, coordinated omission will easily occur in executing those instructions. I believe that this will not go away without changing how all JMeter test plans are expressed, and that is probably a non-starter. As a result, I think that building in logic that will correct for coordinated omission when it inevitably occurs, as opposed to trying to avoid its occurrence, is the only way to go for JMeter.

I can't disagree with you that CO is present in a single-threaded test. However, the nature of this type of load testing is that you play out a scenario because the results of the previous request are needed for the current request. Under those conditions you can't do much but wait until the back pressure clears or your initial request is retired. I think the best you can do under these circumstances is, just as Sebb has suggested, to flag the problem and move on. I wouldn't fail or omit the result, but I'm not sure how you can correct it, because the back pressure in this case will result in lower loads, which will allow requests to retire at a rate higher than one should normally expect.

That said, when users meet this type of system they will most likely abandon it, which is in itself a failure. JMeter doesn't have direct facilities to support this type of behaviour.
> 
> Coordinated Omission is a basic problem that can happen for many, many different reasons and causes. It is made up of two simple things: the first is the Omission of some results or samples from the final data set. The second is the Coordination of such omissions with other behavior, such that it is not random. Random omission is usually not a problem. That's just sampling, and random sampling works. Coordinated Omission is a problem because it is effectively [highly] biased sampling. When Coordinated Omission occurs, the resulting data set is biased towards certain behaviors (like good response times), leading ALL statistics on the resulting data set to be highly suspect (read: "usually completely wrong and off by orders of magnitude") in describing the response time or latency behavior of the observed system.
> 
> In JMeter, Coordinated Omission occurs whenever a thread doesn't execute its test plan as planned, and does so in reaction to behavior it encounters. This is most often caused by the simple and inherent synchronous nature of test plans as they are stated in JMeter: when a specific request takes longer to respond than it would have taken the thread to send the next request in the plan, the very fact that the thread did not send the next request out on time as planned is a coordinated omission. It is the effective removal of a response time result that would have been in the data set had the coordination not happened. It is "omission" since a measurement that should have occurred didn't happen and was not recorded. It is "coordinated" because the omission is not random, and is correlated with and influenced by the occurrence of another longer-than-normal response time.
> 
> The work done with the OutlierCorrector in JMeter focused on detecting CO in streams of measured results reported to listeners, and on inserting "fake" results into the stream to represent the missing, omitted results that should have been there. OutlierCorrector also has a log file corrector that can fix JMeter logs offline, after the fact, by applying the same logic.

Right, but this is for a fixed transactional rate, which is typically seen in machine-to-machine HFTS. In Web apps, perhaps the most common use case for JMeter, client back-off due to back pressure is a common behaviour, and it's one that doesn't harm the testing process, in the sense that if the server can't retire transactions fast enough, JMeter will expose it. If you want to prove 5 9's, then I agree, you've got a problem.

It's not that I disagree with you or don't understand what you're saying; it's just that I'm having difficulty mapping it back to the world that people on this list have to deal with. With respect to that, I've a feeling that our views are somewhat tainted by the worlds we live in. In the HTTP world, CO exists and I accept it as natural behaviour. In your world CO exists but it cannot be accepted. The problem is that mechanical sympathy is mostly about your world. I think there is a commonality between the two worlds, but to find it we need more discussion. I'm not sure that this list is good for that purpose, so I'm going to flip back to mechanical sympathy instead of hijacking this mailing list.

-- Kirk

> 
> -- Gil.
> 
> 
> On Oct 18, 2013, at 9:54 AM, Kirk Pepperdine <ki...@gmail.com> wrote:
> 
>> Hi Gil,
>> 
>> I would have to disagree, as in this case I believe there is CO due to the threading model, CO on a per-thread basis, as well as plain old omission. I believe these conditions are in addition to the conditions you're pointing to.
>> 
>> You may test at a fixed rate for HFT, but in most worlds random is necessary. Unfortunately that makes the problem more difficult to deal with.
>> 
>> Regards,
>> Kirk
>> 
>> On 2013-10-18, at 5:32 PM, Gil Tene <gi...@azulsystems.com> wrote:
>> 
>>> I don't think the thread model is the core of the Coordinated Omission problem, unless we consider sending no more than one request per 20 minutes from any given thread to be a threading model fix. That's more of a configuration choice the way I see it, but a pretty impossible one. The thread model may need work for other reasons, but CO is not one of them.
>>> 
>>> In JMeter, as with all other synchronous testers, Coordinated Omission is a per-thread issue. It's easy to demonstrate CO with JMeter with a single client thread testing an application that has only a single client connection in the real world, or with 15 client threads testing an application that has exactly 15 real-world clients communicating at high rates (common with muxed environments, messaging, ESBs, trading systems, etc.). No amount of threading or concurrency will help capture better test results for these very real systems. Any occurrence of CO will make the JMeter results seriously bogus.
>>> 
>>> When any one thread misses a planned request sending time, CO has already occurred, and there is no way to avoid it at that point. You can certainly detect that CO has happened. The question is what to do about it in JMeter once you detect it. The major options are:
>>> 
>>> 1. Ignore it and keep working with the data as if it actually meant anything. This amounts to http://tinyurl.com/o46doqf .
>>> 
>>> 2. You can try to change the tester behavior to avoid CO going forward. E.g. you can try to adjust the number of threads up AND, at the same time, the frequency at which each thread sends requests, which will amount to drastically changing the test plan in reaction to system behavior. In my opinion, changing behavior dynamically will have very limited effectiveness, for two reasons. The first is that the problem has already occurred, so all the data up to and including the observed CO is already bogus and has to be thrown away unless it can be corrected somehow. Only after you have auto-adjusted enough times to see no CO for a long time may your results during that time be valid. The second is that changing the test scenario is valid (and possible) for very few real world systems.
>>> 
>>> 3. You can try to correct for CO when you observe it. There are various ways this can be done, and most of them will amount to re-creating missing test sample results by projecting from past results. This can help correct the results data set so that it better approximates what a tester that was not synchronous, and that would have kept issuing requests per the actual test plan, would have experienced in the test.
>>> 
>>> 4. Something else we haven't yet thought about.
>>> 
>>> Some correction and detection example work can be found at https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9 , which uses code at https://github.com/OutlierCorrector/OutlierCorrector . Michael Chmiel worked at Azul Systems over the summer on this problem, and the OutlierCorrector package and the small patch to JMeter (under the docs-2.9 branch) are some of the results of that work. This fix approach appears to work well as long as no explicitly random behavior is stated in the test scenarios (the outlier detector detects a test pattern and repeats it in repairing the data; expressly random scenarios will not exhibit a detectable pattern).
>>> 
>>> -- Gil.
>>> 
>>> On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <ki...@gmail.com>
>>>  wrote:
>>> 
>>>> Hi Sebb,
>>>> 
>>>> In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control the rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to agree on a non-disruptive change to JMeter that would make this happen.
>>>> 
>>>> The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool would be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result would be a far more scalable product that requires far fewer threads to produce far higher loads on the server.
>>>> 
>>>> As for your idea on using the throughput controller: IMHO triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results, in other words if they are omitted from the data set, you've only made the problem worse, as you are filtering out bad data points from the result sets, making the results look better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before the event actually triggered.
>>>> 
>>>> Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results; it offers techniques to calculate and correct for CO. Also, Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.
>>>> 
>>>> On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed, as it could be used to help understand the source of CO, if not just to complement JMeter's view of latency.
>>>> 
>>>> Regards,
>>>> Kirk
>>>> 
>>>> 
>>>> On 2013-10-18, at 12:27 AM, sebb <se...@gmail.com> wrote:
>>>> 
>>>>> It looks to be quite difficult to avoid the issue of Coordinated
>>>>> Omission without a major redesign of JMeter.
>>>>> 
>>>>> However, it may be a lot easier to detect when the condition has occurred.
>>>>> This would potentially allow the test settings to be changed to reduce
>>>>> or eliminate the occurrences - e.g. by increasing the number of
>>>>> threads or spreading the load across more JMeter instances.
>>>>> 
>>>>> The Constant Throughput Timer calculates the desired wait time,
>>>>> and if this is less than zero - i.e. a sample should already have been
>>>>> generated - it could trigger the creation of a failed Assertion
>>>>> showing the time difference.
>>>>> 
>>>>> Would this be sufficient to detect all CO occurrences?
>>>>> If not, what other metric needs to be checked?
>>>>> 
>>>>> Even if it is not the only possible cause, would it be useful as a
>>>>> starting point?
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
>>>>> For additional commands, e-mail: user-help@jmeter.apache.org
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Coordinated Omission (CO) - possible strategies

Posted by Kirk Pepperdine <ki...@gmail.com>.
Hi Gil,

I would have to disagree, as in this case I believe there is CO due to the threading model, CO on a per-thread basis, as well as plain old omission. I believe these conditions are in addition to the conditions you're pointing to.

You may test at a fixed rate for HFT, but in most worlds random is necessary. Unfortunately that makes the problem more difficult to deal with.

Regards,
Kirk

On 2013-10-18, at 5:32 PM, Gil Tene <gi...@azulsystems.com> wrote:

> I don't think the thread model is the core of the Coordinated Omission problem, unless we consider sending no more than one request per 20 minutes from any given thread to be a threading model fix. That's more of a configuration choice the way I see it, but a pretty impossible one. The thread model may need work for other reasons, but CO is not one of them.
> 
> In JMeter, as with all other synchronous testers, Coordinated Omission is a per-thread issue. It's easy to demonstrate CO with JMeter with a single client thread testing an application that has only a single client connection in the real world, or with 15 client threads testing an application that has exactly 15 real-world clients communicating at high rates (common with muxed environments, messaging, ESBs, trading systems, etc.). No amount of threading or concurrency will help capture better test results for these very real systems. Any occurrence of CO will make the JMeter results seriously bogus.
> 
> When any one thread misses a planned request sending time, CO has already occurred, and there is no way to avoid it at that point. You can certainly detect that CO has happened. The question is what to do about it in JMeter once you detect it. The major options are:
> 
> 1. Ignore it and keep working with the data as if it actually meant anything. This amounts to http://tinyurl.com/o46doqf .
> 
> 2. You can try to change the tester behavior to avoid CO going forward. E.g. you can try to adjust the number of threads up AND, at the same time, the frequency at which each thread sends requests, which will amount to drastically changing the test plan in reaction to system behavior. In my opinion, changing behavior dynamically will have very limited effectiveness, for two reasons. The first is that the problem has already occurred, so all the data up to and including the observed CO is already bogus and has to be thrown away unless it can be corrected somehow. Only after you have auto-adjusted enough times to see no CO for a long time may your results during that time be valid. The second is that changing the test scenario is valid (and possible) for very few real world systems.
> 
> 3. You can try to correct for CO when you observe it. There are various ways this can be done, and most of them will amount to re-creating missing test sample results by projecting from past results. This can help correct the results data set so that it better approximates what a tester that was not synchronous, and that would have kept issuing requests per the actual test plan, would have experienced in the test.
> 
> 4. Something else we haven't yet thought about.
> 
> Some correction and detection example work can be found at https://github.com/OutlierCorrector/jmeter/commit/34c34cae673fd0871a423035a9f262d049f3d9e9 , which uses code at https://github.com/OutlierCorrector/OutlierCorrector . Michael Chmiel worked at Azul Systems over the summer on this problem, and the OutlierCorrector package and the small patch to JMeter (under the docs-2.9 branch) are some of the results of that work. This fix approach appears to work well as long as no explicitly random behavior is stated in the test scenarios (the outlier detector detects a test pattern and repeats it in repairing the data; expressly random scenarios will not exhibit a detectable pattern).
> 
> -- Gil.
> 
> On Oct 17, 2013, at 11:47 PM, Kirk Pepperdine <ki...@gmail.com>
>  wrote:
> 
>> Hi Sebb,
>> 
>> In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control the rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to agree on a non-disruptive change to JMeter that would make this happen.
>> 
>> The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool would be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result would be a far more scalable product that requires far fewer threads to produce far higher loads on the server.
>> 
>> As for your idea on using the throughput controller: IMHO triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results, in other words if they are omitted from the data set, you've only made the problem worse, as you are filtering out bad data points from the result sets, making the results look better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before the event actually triggered.
>> 
>> Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results; it offers techniques to calculate and correct for CO. Also, Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.
>> 
>> On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed, as it could be used to help understand the source of CO, if not just to complement JMeter's view of latency.
>> 
>> Regards,
>> Kirk
>> 
>> 
>> On 2013-10-18, at 12:27 AM, sebb <se...@gmail.com> wrote:
>> 
>>> It looks to be quite difficult to avoid the issue of Coordinated
>>> Omission without a major redesign of JMeter.
>>> 
>>> However, it may be a lot easier to detect when the condition has occurred.
>>> This would potentially allow the test settings to be changed to reduce
>>> or eliminate the occurrences - e.g. by increasing the number of
>>> threads or spreading the load across more JMeter instances.
>>> 
>>> The Constant Throughput Timer calculates the desired wait time,
>>> and if this is less than zero - i.e. a sample should already have been
>>> generated - it could trigger the creation of a failed Assertion
>>> showing the time difference.
>>> 
>>> Would this be sufficient to detect all CO occurrences?
>>> If not, what other metric needs to be checked?
>>> 
>>> Even if it is not the only possible cause, would it be useful as a
>>> starting point?
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
>>> For additional commands, e-mail: user-help@jmeter.apache.org
>>> 
>> 
> 


Re: Coordinated Omission (CO) - possible strategies

Posted by Kirk Pepperdine <ki...@gmail.com>.
Hi Sebb,

In my testing, the option of creating threads on demand instead of all at once has made a huge difference in my being able to control the rate of arrivals on the server. It has convinced me that simply using the throughput controller isn't enough and that the threading model in JMeter *must* change. It is the threading model that is the biggest source of CO in JMeter. Unfortunately we weren't able to agree on a non-disruptive change to JMeter that would make this happen.

The model I was proposing would have JMeter generate an event heap sorted by the time when a sampler should be fired. A thread pool would be used to eat off of the heap and fire the events as scheduled. This would allow JMeter to break the inappropriate relationship of a thread being a user. The solution is not perfect in that you will still have to fight with thread schedulers and hypervisors to get things to happen on cue. However, I believe the end result would be a far more scalable product that requires far fewer threads to produce far higher loads on the server.
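
The JDK already has the building blocks for such a model. A minimal sketch (plain java.util.concurrent classes, not an actual JMeter patch) of a time-sorted heap drained by a small worker pool:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class EventHeapSketch {
    public static void main(String[] args) throws InterruptedException {
        // ScheduledThreadPoolExecutor keeps an internal heap ordered by fire
        // time; 8 workers stand in for thousands of per-user threads.
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(8);
        int virtualUsers = 1000;
        long intervalMs = 1000; // each virtual user fires once per second
        for (int u = 0; u < virtualUsers; u++) {
            long offset = (u * intervalMs) / virtualUsers; // spread arrivals
            pool.scheduleAtFixedRate(
                    () -> { /* fire one sampler for this virtual user */ },
                    offset, intervalMs, TimeUnit.MILLISECONDS);
        }
        pool.awaitTermination(60, TimeUnit.SECONDS); // run for one minute
        pool.shutdownNow();
    }
}

Note that scheduleAtFixedRate will not fire a given task again while its previous execution is still running, so missed slots still have to be detected or corrected; this model reduces thread count, it does not eliminate CO by itself.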

As for your idea on using the throughput controller: IMHO triggering an assert only worsens the CO problem. In fact, if the response times from the timeouts are not added into the results, in other words if they are omitted from the data set, you've only made the problem worse, as you are filtering out bad data points from the result sets, making the results look better than they should be. Peter Lawrey's (included here for the purpose of this discussion) technique for correcting CO is to simply recognize when the event should have been triggered and then start the timer for that event at that time. So the latency reported will include the time before the event actually triggered.
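
The essence of that correction, sketched with made-up names: measure latency from when the request should have been sent, not from when the blocked thread finally sent it.

public class IntendedTimeSketch {
    public static void main(String[] args) throws InterruptedException {
        long intervalMs = 100;
        long plannedSend = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            long now = System.currentTimeMillis();
            if (now < plannedSend) {
                Thread.sleep(plannedSend - now); // wait for the scheduled slot
            }
            long actualSend = System.currentTimeMillis();
            doSample(); // synchronous request/response
            long end = System.currentTimeMillis();
            long serviceTime = end - actualSend;   // what a naive tester reports
            long responseTime = end - plannedSend; // includes the time spent queued
            record(responseTime);
            plannedSend += intervalMs;
        }
    }
    static void doSample() { /* stand-in for a sampler */ }
    static void record(long ms) { /* stand-in for a results listener */ }
}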

Gil Tene's done some work with JMeter. I'll leave it up to him to post what he's done. The interesting bit that he's created is HdrHistogram (https://github.com/giltene/HdrHistogram). It is not only a better way to report results; it offers techniques to calculate and correct for CO. Also, Gil might be able to point you to a more recent version of his talk on CO. It might be nice to have a new sampler that incorporates this work.
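
For example, HdrHistogram's recordValueWithExpectedInterval does that back-fill automatically (the method is part of the library; the numbers below are only illustrative):

import org.HdrHistogram.Histogram;

public class HdrExample {
    public static void main(String[] args) {
        // Track values up to 1 hour in nanoseconds, 3 significant digits.
        Histogram h = new Histogram(3600L * 1000 * 1000 * 1000, 3);
        long expectedIntervalNs = 10_000_000L; // one sample every 10 ms
        long measuredNs = 3_000_000_000L;      // a 3-second stall was measured
        // Records the measured value plus synthetic values for the sends the
        // stall suppressed (3s - 10ms, 3s - 20ms, ...).
        h.recordValueWithExpectedInterval(measuredNs, expectedIntervalNs);
        System.out.println("99%'ile (ns): " + h.getValueAtPercentile(99.0));
    }
}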

On a side note, I've got a Servlet filter that is a JMX component that measures a bunch of stats from the server's POV. It's something that could be contributed, as it could be used to help understand the source of CO, if not just to complement JMeter's view of latency.

Regards,
Kirk


On 2013-10-18, at 12:27 AM, sebb <se...@gmail.com> wrote:

> It looks to be quite difficult to avoid the issue of Coordinated
> Omission without a major redesign of JMeter.
> 
> However, it may be a lot easier to detect when the condition has occurred.
> This would potentially allow the test settings to be changed to reduce
> or eliminate the occurrences - e.g. by increasing the number of
> threads or spreading the load across more JMeter instances.
> 
> The Constant Throughput Timer calculates the desired wait time,
> and if this is less than zero - i.e. a sample should already have been
> generated - it could trigger the creation of a failed Assertion
> showing the time difference.
> 
> Would this be sufficient to detect all CO occurrences?
> If not, what other metric needs to be checked?
> 
> Even if it is not the only possible cause, would it be useful as a
> starting point?
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@jmeter.apache.org
> For additional commands, e-mail: user-help@jmeter.apache.org
>