Posted to user@jmeter.apache.org by sebb <se...@gmail.com> on 2013/10/19 02:26:59 UTC

Coordinated Omission - detection and reporting ONLY

[Trying again - please do not hijack this thread.]

The Constant Throughput Timer (CTT) calculates the desired wait time,
and if this is less than zero - i.e. a sample should already have been
generated - it could trigger the creation of a failed Assertion (or similar)
showing the time difference.
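
To illustrate the idea, a minimal sketch (assumed names; not JMeter's
actual CTT code) of the check being proposed:

    // Illustrative sketch only -- assumed names, not JMeter's actual CTT code.
    class CttCoDetector {
        // Wait needed before firing the next sample to hold the target rate;
        // a negative value means the sample is already late (a CO candidate).
        long calculateDelay(long nextScheduledMillis, long nowMillis) {
            long wait = nextScheduledMillis - nowMillis;
            if (wait < 0) {
                reportLateness(-wait); // hypothetical reporting hook
                return 0;              // fire immediately
            }
            return wait;
        }

        void reportLateness(long lateByMillis) {
            // Here a failed Assertion (or similar) could be created,
            // showing the time difference as proposed above.
            System.err.println("CO candidate: sample late by " + lateByMillis + " ms");
        }
    }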

Would this be sufficient to detect all CO occurrences?
If not, what other metric needs to be checked?

Even if it is not the only possible cause, would it be useful as a
starting point?

I am assuming that the CTT is the primary means of controlling the
sample request rate.
If there are other elements that are commonly used to control the
rate, please note them here.

N.B: this thread is only for discussion of how to detect CO and how to
report it.



Re: Coordinated Omission - detection and reporting ONLY

Posted by Kirk Pepperdine <ki...@gmail.com>.
Hi Sebb,

>> 
>> I think I suggested a separate detector. Could it not be a listener instead?
> 
> Yes, a listener would have access to all the sample results.
> 
> The reason I suggested the CTT is that the element "knows" the
> expected transaction rate, so can detect late samples easily and
> cheaply.
> 
> Other code will either have to be provided with the expected rate or
> have to analyse the data to deduce the information.
> Analysis which may be expensive. [However if the detection is only
> done in GUI mode that would not be an issue.]

Sure, but if one doesn't use the CTT.... And there are other sources of jitter that can cause difficulties with measurements that may not be picked up by the CTT. The CTT focuses on trigger rates; I think we need to focus on retirement rates.

Regards,
Kirk




Re: Coordinated Omission - detection and reporting ONLY

Posted by sebb <se...@gmail.com>.
On 22 October 2013 19:16, Kirk Pepperdine <ki...@gmail.com> wrote:
>
> On 2013-10-22, at 5:22 PM, Milamber <mi...@apache.org> wrote:
>
>>
>> On 21/10/2013 01:55, sebb wrote:
>>> On 19 October 2013 16:47, Milamber <mi...@apache.org> wrote:
>>>> On 19/10/2013 01:26, sebb wrote:
>>>>
>>>>> [Trying again - please do not hijack this thread.]
>>>>>
>>>>> The Constant Throughput Timer (CTT) calculates the desired wait time,
>>>>> and if this is less than zero - i.e. a sample should already have been
>>>>> generated - it could trigger the creation of a failed Assertion (or
>>>>> similar)
>>>>> showing the time difference.
>>>>>
>>>>> Would this be sufficient to detect all CO occurrences?
>>>>
>>>> An option in the CTT element that marks as failed any sampler(s) in its
>>>> scope with a less-than-zero wait time seems a good way to inform users.
>>>> The failure message could indicate the delay, to help users fix the
>>>> scenario (e.g. add more VUs and reduce the CTT frequency).
>>> Or CTT would save the delay somewhere, and a new Assertion could be
>>> created to report it.
>>>
>>> That might work better if there are more places where delays could be detected.
>>
>> Yes, good idea.
>
> I think I suggested a separate detector. Could it not be a listener instead?

Yes, a listener would have access to all the sample results.

The reason I suggested the CTT is that the element "knows" the
expected transaction rate, so can detect late samples easily and
cheaply.

Other code will either have to be provided with the expected rate or
have to analyse the data to deduce the information.
Analysis which may be expensive. [However if the detection is only
done in GUI mode that would not be an issue.]

> Regards,
> Kirk



Re: Coordinated Omission - detection and reporting ONLY

Posted by Kirk Pepperdine <ki...@gmail.com>.
On 2013-10-22, at 5:22 PM, Milamber <mi...@apache.org> wrote:

> 
> On 21/10/2013 01:55, sebb wrote:
>> On 19 October 2013 16:47, Milamber <mi...@apache.org> wrote:
>>> On 19/10/2013 01:26, sebb wrote:
>>> 
>>>> [Trying again - please do not hijack this thread.]
>>>> 
>>>> The Constant Throughput Timer (CTT) calculates the desired wait time,
>>>> and if this is less than zero - i.e. a sample should already have been
>>>> generated - it could trigger the creation of a failed Assertion (or
>>>> similar)
>>>> showing the time difference.
>>>> 
>>>> Would this be sufficient to detect all CO occurrences?
>>> 
>>> An option in the CTT element that marks as failed any sampler(s) in its
>>> scope with a less-than-zero wait time seems a good way to inform users.
>>> The failure message could indicate the delay, to help users fix the
>>> scenario (e.g. add more VUs and reduce the CTT frequency).
>> Or CTT would save the delay somewhere, and a new Assertion could be
>> created to report it.
>> 
>> That might work better if there are more places where delays could be detected.
> 
> Yes, good idea.

I think I suggested a separate detector. Could it not be a listener instead?

Regards,
Kirk




Re: Coordinated Omission - detection and reporting ONLY

Posted by Milamber <mi...@apache.org>.
On 21/10/2013 01:55, sebb wrote:
> On 19 October 2013 16:47, Milamber <mi...@apache.org> wrote:
>> On 19/10/2013 01:26, sebb wrote:
>>
>>> [Trying again - please do not hijack this thread.]
>>>
>>> The Constant Throughput Timer (CTT) calculates the desired wait time,
>>> and if this is less than zero - i.e. a sample should already have been
>>> generated - it could trigger the creation of a failed Assertion (or
>>> similar)
>>> showing the time difference.
>>>
>>> Would this be sufficient to detect all CO occurrences?
>>
>> An option in the CTT element that marks as failed any sampler(s) in its
>> scope with a less-than-zero wait time seems a good way to inform users.
>> The failure message could indicate the delay, to help users fix the
>> scenario (e.g. add more VUs and reduce the CTT frequency).
> Or CTT would save the delay somewhere, and a new Assertion could be
> created to report it.
>
> That might work better if there are more places where delays could be detected.

Yes, good idea.


>
>> Milamber
>>
>>
>>> If not, what other metric needs to be checked?
>>>
>>> Even if it is not the only possible cause, would it be useful as a
>>> starting point?
>>>
>>> I am assuming that the CTT is the primary means of controlling the
>>> sample request rate.
>>> If there are other elements that are commonly used to control the
>>> rate, please note them here.
>>
>>
>>
>>> N.B: this thread is only for discussion of how to detect CO and how to
>>> report it.
>>>




Re: Coordinated Omission - detection and reporting ONLY

Posted by sebb <se...@gmail.com>.
On 19 October 2013 16:47, Milamber <mi...@apache.org> wrote:
>
> On 19/10/2013 01:26, sebb wrote:
>
>> [Trying again - please do not hijack this thread.]
>>
>> The Constant Throughput Timer (CTT) calculates the desired wait time,
>> and if this is less than zero - i.e. a sample should already have been
>> generated - it could trigger the creation of a failed Assertion (or
>> similar)
>> showing the time difference.
>>
>> Would this be sufficient to detect all CO occurrences?
>
>
> An option in the CTT element that marks as failed any sampler(s) in its
> scope with a less-than-zero wait time seems a good way to inform users.
> The failure message could indicate the delay, to help users fix the
> scenario (e.g. add more VUs and reduce the CTT frequency).

Or CTT would save the delay somewhere, and a new Assertion could be
created to report it.

That might work better if there are more places where delays could be detected.

> Milamber
>
>
>> If not, what other metric needs to be checked?
>>
>> Even if it is not the only possible cause, would it be useful as a
>> starting point?
>>
>> I am assuming that the CTT is the primary means of controlling the
>> sample request rate.
>> If there are other elements that are commonly used to control the
>> rate, please note them here.
>
>
>
>
>>
>> N.B: this thread is only for discussion of how to detect CO and how to
>> report it.
>>



Re: Coordinated Omission - detection and reporting ONLY

Posted by Milamber <mi...@apache.org>.
On 19/10/2013 01:26, sebb wrote:
> [Trying again - please do not hijack this thread.]
>
> The Constant Throughput Timer (CTT) calculates the desired wait time,
> and if this is less than zero - i.e. a sample should already have been
> generated - it could trigger the creation of a failed Assertion (or similar)
> showing the time difference.
>
> Would this be sufficient to detect all CO occurrences?

An option in the CTT element that marks as failed any sampler(s) in its
scope with a less-than-zero wait time seems a good way to inform users.
The failure message could indicate the delay, to help users fix the
scenario (e.g. add more VUs and reduce the CTT frequency).

Milamber

> If not, what other metric needs to be checked?
>
> Even if it is not the only possible cause, would it be useful as a
> starting point?
>
> I am assuming that the CTT is the primary means of controlling the
> sample request rate.
> If there are other elements that are commonly used to control the
> rate, please note them here.



>
> N.B: this thread is only for discussion of how to detect CO and how to
> report it.
>




Re: Coordinated Omission - detection and reporting ONLY

Posted by sebb <se...@gmail.com>.
On 19 October 2013 17:37, Gil Tene <gi...@azulsystems.com> wrote:
>
> On Oct 19, 2013, at 4:45 AM, sebb <se...@gmail.com>
>  wrote:
>
>> On 19 October 2013 02:17, Gil Tene <gi...@azulsystems.com> wrote:
>>>
>>>> [Trying again - please do not hijack this thread.]
>>>>
>>>> The Constant Throughput Timer (CTT) calculates the desired wait time,
>>>> and if this is less than zero - i.e. a sample should already have been
>>>> generated - it could trigger the creation of a failed Assertion (or similar)
>>>> showing the time difference.
>>
>> N.B.     ^^^^^^^^^^^
>
> I missed your point here. I thought you were looking to add detection without changing the code.

Some code clearly has to be written ...

> So you are suggesting changing the code for CTT to do this?

Yes; this would be trivial; CTT already has to calculate the delay.

> And if so, I assume you would recommend people add or use CTT for CO detection.

Potentially.

Also another Assertion element could be created that checks whether
throughput is in range.
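
As a sketch of what such an element might check (hypothetical names; no
such JMeter element exists yet):

    // Hypothetical "throughput in range" check: fail when the observed
    // rate drifts outside a configured band.
    class ThroughputRangeCheck {
        boolean inRange(long sampleCount, long elapsedMillis,
                        double minPerSec, double maxPerSec) {
            double observed = sampleCount * 1000.0 / elapsedMillis; // samples/sec
            return observed >= minPerSec && observed <= maxPerSec;
        }
    }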

>>
>>>>
>>>> Would this be sufficient to detect all CO occurrences?
>>>
>>> Two issues:
>>>
>>> 1. It would detect that CO probably happened, but not how much of it happened. Missing 1 msec or 1 minute will look the same.
>>
>> Huh? The time difference would be reported - is that not enough?
>
> As discussed below, if you do report the time, it is useful for pointing out how bad things are. You'll probably need to somehow accumulate the reported time to make sense of it though. The interesting information to present is what % of total time was spent in CO. The pessimistic interpretation (which is the one to take unless you do more detailed analysis with the data collected) is that for an accumulated CO time of X over a test run of length Y, 100 * X/Y represents the percent of unreported operations that should be assumed to have displayed the worst case observed number in the run.
>
> I'd be careful not to fall into the trap of reporting something like "X" time in CO experienced over "N" CO occurrences and leading people to average things out. While it would be tempting to divide X by N, a single huge CO event could dominate behavior next to a thousand tiny ones, leading to very wrong interpretation of effects. Instead, you could report "Total CO time X1", with Largest single CO event "X2". Or better yet, collect and report the entire histogram of CO event lengths.

It would be easy enough to add a delay field to the sample results.
This could then be analysed.

>>
>>> 2. It would only detect CO in test plans that include an actual CTT (Constant Throughput Timer). It won't work for other timers, or when no timers are used.
>>
>> Indeed, but in such cases is there any indication of the expected
>> transaction rate?
>
> Detection doesn't require a constant or known transaction rate. It just requires that you know the next transaction was not started when it was supposed to be.

Of course.

> In my experience, the "concurrent user count" injection rate approach is much more common than the "constant transaction rate" one. The concurrent user count approaches usually have the individual user threads use constant or semi-random think time emulation instead of a CTT timer. This does mean that their throughput varies with response time length, but you can still detect CO in such a test. One way to do so is to calculate an estimated interval rate based on observing the actual behavior of the test for a while (equivalent to establishing an estimated throughput through observation), and flagging strong outliers after some confidence level has been established (e.g. flag things that lie more than 3-4 std. deviations away from the mean interval).
>
> That's the sort of thing the OutlierCorrector's detection code does. You can use it to generically detect CO regardless of the timer used.
>
> I can see a benefit in using CTT to flag CO detection. Its benefit lies in the fact that with CTT the user explicitly states an expected transaction rate.

Exactly; it's simple to do.

> I can separately see a benefit to adding a new sampler or timer type that simply detects CO using a technique like we use in OutlierCorrector. Its benefit comes from applying to a wider set of scenarios.

Yes.

> Both can be useful as warning signs. And if users are able to react by fixing their test to avoid triggering the warnings at all, that would be good.

Indeed.

> However, I am separately pessimistic about users being able to adjust tests to get around CO. While this is possible in some cases, my experience shows that CO is inherent to actual system use case behavior, and that in the majority of cases it does not come from misconfigured testers but from real world behavior. I.e. that real world, actual users interact with the system with intervals that are shorter than the time the system will stall for occasionally.
>

This is starting to stray from the subject of this thread, which is
about detection and reporting.

>>
>>>> If not, what other metric needs to be checked?
>>>
>>> There are various things you can look for.
>>>
>>> Our OutlierCorrector work includes a pretty elaborate CO detector. It sits either inline on the listener notification stream, or parses the log file. The detector identifies sampler patterns, establishes expected interval between patterns, and detects when the actual interval falls far above the expected interval. This is a code change to JMeter, but a pretty localized one.
>>
>> AIUI CO happens when the sample takes much longer than usual, thus
>> (potentially) delaying any subsequent samples.
>> Overlong samples can already be detected using a Delay Assertion.
[Sorry, that should have been Duration Assertion]

>
> It's not exactly "longer than usual". It's "long enough to cause you to miss sending the next request out on time".

> You can place some margin on this (and anything other than a CTT probably has to), but the margin depends on the rest of the test scenario and on the actual system behavior. It is "hard" for users to figure out how to correctly set a Duration response assertion on samplers. Hard enough that it won't be done in practice IMO.

Again, this is getting off-topic for this thread.

>>
>>>> Even if it is not the only possible cause, would it be useful as a
>>>> starting point?
>>>
>>> Yes. As a warning flag saying "throw away these test results".
>>
>> The results are not necessarily useless; at the very least they will
>> show the load at which the system is starting to slow down.
>
> Correct. I thought you were talking about yes/no. With an actual missed-time indicator (and an accumulator across the run) there is some useful info here.
>
>>
>>>> I am assuming that the CTT is the primary means of controlling the
>>>> sample request rate.
>>>
>>> Unfortunately many test scenarios I've seen use other means. Many people use other timers or other means for think time emulation.
>>
>> The CTT is not intended for think time emulation. It is intended to
>> ensure a specific transaction rate.
>
> I'm not saying CTT is used for think time emulation. I'm saying many people *think* in terms of think time and not in terms of throughput. It's the more natural way of coming at the problem when they are describing what a user does with the system. They don't think in terms of "the user is sending me X things per second". They think "the user presses this, then spends 3 seconds pondering, then presses that, ...". For test plans written by such people, CTT won't be used, and some other sort of delay timers will be.
>

The issue then is that there is no indication of what the expected
transaction rate is.
So there is no way of knowing at the start whether the sample elapsed
times are such that CO has occurred.
Though I take your point that analysing the response times over a
sufficient time period can potentially be used to show when samples
are longer than usual.

However, I would expect testers to have some idea of maximum
acceptable response time for each type of request, so they should be
able to apply the relevant Duration Assertions. Likewise, they should
have some idea of expected throughput, so should be able to add a CTT
or a Transaction Rate Assertion (if such is created).

There is little point performance testing a system unless you have an
idea what the system is designed to handle.
[Stress testing is a different matter]

>>
>> Think time obviously affects the transaction rate, but is not the sole
>> determinant.
>>
>>>> If there are other elements that are commonly used to control the
>>>> rate, please note them here.
>>>>
>>>> N.B: this thread is only for discussion of how to detect CO and how to
>>>> report it.
>>>
>>> Reporting the existence of CO is an interesting starting point. But the only right way to deal with such a report showing the existence of CO (with no magnitude or other metrics) is to say "I guess the data I got is complete crap, so all the stats and graphs I'm seeing mean nothing".
>>
>> I disagree that the output is useless.
>>
>> The delays are reported, so one can see how badly the test was
>> affected. If the delays are all small, then a slight adjustment of the
>> thread count and transaction rate should eliminate them.
>
> Agreed.
>
>> Only if the delays are large are the results less meaningful, though
>> they can still show the base load at which the server starts to slow
>> down.
>
> That would be a mistake. It assumes that CO has something to do with load, and with servers "slowing down". CO is more often a result of accumulated work amounts (a pay the piper effect), or of cosmic noise.
>
> In the real world, stalls are NOT a result of the server slowing down. They are a result of the server completely stalling in order to "take care of something". E.g. a quantum lost to another thread, or the flushing of a journal, or a garbage collection. These have no more than a loose, non-dominant relationship with load. E.g. In most software systems, max response times seen at very low loads will usually be dramatically higher than "typical" response times at a much higher load (where a server would be "slower"). The rate at which stalls happen may be affected by load (sometimes increasing in frequency when load grows, and sometimes decreasing in frequency).
>
>>
>>> If you can report "how much" CO you saw,
>>
>> As I wrote originally, the CTT would report the time difference (i.e. delay).
>>
>>> it may help a bit in determining how bad the data is, and how the stats should be treated by the reader. E.g. if you know that CO totaling some amount of time X in a test of length Y had occurred, then you know that any percentile above (100 * (1 - X/Y)) is completely bogus, and should be assumed to be equal to the experienced max value. You can also take the approach that the rest of the percentiles should be shifted down by at least (100 * X / Y). E.g. if you had CO that covered only 0.01% of the total test time, that would be relatively good news.
>>
>> Exactly.
>>
>>> But if you had CO covering 5% of the test time, your measured 99%'ile is actually the 94%'ile. Averages are unfortunately anyone's guess when CO is in play and not actually corrected for.
>>>
>>> Once you detect both the existence and the magnitude of CO, correcting for it is actually pretty easy. The detection of "how much" is the semi-hard part.
>>
>> Detecting the delay using the CTT is trivial; it already has to
>> calculate it to decide how long to wait. A negative wait is obviously
>> a missed start time.
>> Or am I missing something here?
>
> Yup. You are right. As noted, if CTT reported the delayed time this would be detected.
>
>>



Re: Coordinated Omission - detection and reporting ONLY

Posted by Gil Tene <gi...@azulsystems.com>.
On Oct 19, 2013, at 4:45 AM, sebb <se...@gmail.com>
 wrote:

> On 19 October 2013 02:17, Gil Tene <gi...@azulsystems.com> wrote:
>> 
>>> [Trying again - please do not hijack this thread.]
>>> 
>>> The Constant Throughput Timer (CTT) calculates the desired wait time,
>>> and if this is less than zero - i.e. a sample should already have been
>>> generated - it could trigger the creation of a failed Assertion (or similar)
>>> showing the time difference.
> 
> N.B.     ^^^^^^^^^^^

I missed your point here. I thought you were looking to add detection without changing the code. So you are suggesting changing the code for CTT to do this? And if so, I assume you would recommend people add or use CTT for CO detection.

> 
>>> 
>>> Would this be sufficient to detect all CO occurrences?
>> 
>> Two issues:
>> 
>> 1. It would detect that CO probably happened, but not how much of it happened. Missing 1 msec or 1 minute will look the same.
> 
> Huh? The time difference would be reported - is that not enough?

As discussed below, if you do report the time, it is useful for pointing out how bad things are. You'll probably need to somehow accumulate the reported time to make sense of it though. The interesting information to present is what % of total time was spent in CO. The pessimistic interpretation (which is the one to take unless you do more detailed analysis with the data collected) is that for an accumulated CO time of X over a test run of length Y, 100 * X/Y represents the percent of unreported operations that should be assumed to have displayed the worst case observed number in the run.

I'd be careful not to fall into the trap of reporting something like "X" time in CO experienced over "N" CO occurrences and leading people to average things out. While it would be tempting to divide X by N, a single huge CO event could dominate behavior next to a thousand tiny ones, leading to very wrong interpretation of effects. Instead, you could report "Total CO time X1", with Largest single CO event "X2". Or better yet, collect and report the entire histogram of CO event lengths.
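
As a sketch of what such reporting could track (assumed names and
bucketing; not an existing JMeter class), keeping total CO time, the
largest single event, and a histogram rather than an average:

    // Tracks total CO time, the largest single event, and a coarse
    // power-of-two histogram of event lengths, so no averaging is needed.
    class CoStats {
        long totalCoMillis;
        long largestCoMillis;
        final long[] histogram = new long[64]; // bucket i: [2^i, 2^(i+1)) ms

        void record(long coMillis) {
            totalCoMillis += coMillis;
            largestCoMillis = Math.max(largestCoMillis, coMillis);
            int bucket = 63 - Long.numberOfLeadingZeros(Math.max(coMillis, 1));
            histogram[bucket]++;
        }
    }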

> 
>> 2. It would only detect CO in test plans that include an actual CTT (Constant Throughput Timer). It won't work for other timers, or when no timers are used.
> 
> Indeed, but in such cases is there any indication of the expected
> transaction rate?

Detection doesn't require a constant or known transaction rate. It just requires that you know the next transaction was not started when it was supposed to be. In my experience, the "concurrent user count" injection rate approach is much more common than the "constant transaction rate" one. The concurrent user count approaches usually have the individual user threads use constant or semi-random think time emulation instead of a CTT timer. This does mean that their throughput varies with response time length, but you can still detect CO in such a test. One way to do so is to calculate an estimated interval rate based on observing the actual behavior of the test for a while (equivalent to establishing an estimated throughput through observation), and flagging strong outliers after some confidence level has been established (e.g. flag things that lie more than 3-4 std. deviations away from the mean interval).

That's the sort of thing the OutlierCorrector's detection code does. You can use it to generically detect CO regardless of the timer used.
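
For illustration, a rough sketch of that style of detection (assumed
names and thresholds; this is not OutlierCorrector's actual code):

    // Learns the typical inter-sample interval from the stream itself
    // (Welford's running mean/variance), then flags intervals that fall
    // far above the mean once enough observations have accumulated.
    class IntervalOutlierDetector {
        private long count;
        private double mean, m2;
        private static final long WARMUP = 100;  // assumed confidence window
        private static final double K = 4.0;     // flag beyond K std. devs

        boolean isCoCandidate(double intervalMillis) {
            count++;
            double delta = intervalMillis - mean;
            mean += delta / count;
            m2 += delta * (intervalMillis - mean);
            if (count < WARMUP) {
                return false; // confidence not yet established
            }
            double stdDev = Math.sqrt(m2 / (count - 1));
            return intervalMillis > mean + K * stdDev;
        }
    }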

I can see a benefit in using CTT to flag CO detection. Its benefit lies in the fact that with CTT the user explicitly states an expected transaction rate.

I can separately see a benefit to adding a new sampler or timer type that simply detects CO using a technique like we use in OutlierCorrector. Its benefit comes from applying to a wider set of scenarios.

Both can be useful as warning signs. And if users are able to react by fixing their test to avoid triggering the warnings at all, that would be good.

However, I am separately pessimistic about users being able to adjust tests to get around CO. While this is possible in some cases, my experience shows that CO is inherent to actual system use case behavior, and that in the majority of cases it does not come from misconfigured testers but from real world behavior. I.e. that real world, actual users interact with the system with intervals that are shorter than the time the system will stall for occasionally.

> 
>>> If not, what other metric needs to be checked?
>> 
>> There are various things you can look for.
>> 
>> Our OutlierCorrector work includes a pretty elaborate CO detector. It sits either inline on the listener notification stream, or parses the log file. The detector identifies sampler patterns, establishes expected interval between patterns, and detects when the actual interval falls far above the expected interval. This is a code change to JMeter, but a pretty localized one.
> 
> AIUI CO happens when the sample takes much longer than usual, thus
> (potentially) delaying any subsequent samples.
> Overlong samples can already be detected using a Delay Assertion.

It's not exactly "longer than usual". It's "long enough to cause you to miss sending the next request out on time". You can place some margin on this (and anything other than a CTT probably has to), but the margin depends on the rest of the test scenario and on the actual system behavior. It is "hard" for users to figure out how to correctly set a Duration response assertion on samplers. Hard enough that it won't be done in practice IMO.

> 
>>> Even if it is not the only possible cause, would it be useful as a
>>> starting point?
>> 
>> Yes. As a warning flag saying "throw away these test results".
> 
> The results are not necessarily useless; at the very least they will
> show the load at which the system is starting to slow down.

Correct. I thought you were talking about yes/no. With an actual missed-time indicator (and an accumulator across the run) there is some useful info here.

> 
>>> I am assuming that the CTT is the primary means of controlling the
>>> sample request rate.
>> 
>> Unfortunately many test scenarios I've seen use other means. Many people use other timers or other means for think time emulation.
> 
> The CTT is not intended for think time emulation. It is intended to
> ensure a specific transaction rate.

I'm not saying CTT is used for think time emulation. I'm saying many people *think* in terms of think time and not in terms of throughput. It's the more natural way of coming at the problem when they are describing what a user does with the system. They don't think in terms of "the user is sending me X things per second". They think "the user presses this, then spends 3 seconds pondering, then presses that, ...". For test plans written by such people, CTT won't be used, and some other sort of delay timers will be.

> 
> Think time obviously affects the transaction rate, but is not the sole
> determinant.
> 
>>> If there are other elements that are commonly used to control the
>>> rate, please note them here.
>>> 
>>> N.B: this thread is only for discussion of how to detect CO and how to
>>> report it.
>> 
>> Reporting the existence of CO is an interesting starting point. But the only right way to deal with such a report showing the existence of CO (with no magnitude or other metrics) is to say "I guess the data I got is complete crap, so all the stats and graphs I'm seeing mean nothing".
> 
> I disagree that the output is useless.
> 
> The delays are reported, so one can see how badly the test was
> affected. If the delays are all small, then a slight adjustment of the
> thread count and transaction rate should eliminate them.

Agreed.

> Only if the delays are large are the results less meaningful, though
> they can still show the base load at which the server starts to slow
> down.

That would be a mistake. It assumes that CO has something to do with load, and with servers "slowing down". CO is more often a result of accumulated work amounts (a pay the piper effect), or of cosmic noise.

In the real world, stalls are NOT a result of the server slowing down. They are a result of the server completely stalling in order to "take care of something". E.g. a quantum lost to another thread, or the flushing of a journal, or a garbage collection. These have no more than a loose, non-dominant relationship with load. E.g. In most software systems, max response times seen at very low loads will usually be dramatically higher than "typical" response times at a much higher load (where a server would be "slower"). The rate at which stalls happen may be affected by load (sometimes increasing in frequency when load grows, and sometimes decreasing in frequency).

> 
>> If you can report "how much" CO you saw,
> 
> As I wrote originally, the CTT would report the time difference (i.e. delay).
> 
>> it may help a bit in determining how bad the data is, and how the stats should be treated by the reader. E.g. if you know that CO totaling some amount of time X in a test of length Y had occurred, then you know that any percentile above (100 * (1 - X/Y)) is completely bogus, and should be assumed to be equal to the experienced max value. You can also take the approach that the rest of the percentiles should be shifted down by at least (100 * X / Y). E.g. if you had CO that covered only 0.01% of the total test time, that would be relatively good news.
> 
> Exactly.
> 
>> But if you had CO covering 5% of the test time, your measured 99%'ile is actually the 94%'ile. Averages are unfortunately anyone's guess when CO is in play and not actually corrected for.
>> 
>> Once you detect both the existence and the magnitude of CO, correcting for it is actually pretty easy. The detection of "how much" is the semi-hard part.
> 
> Detecting the delay using the CTT is trivial; it already has to
> calculate it to decide how long to wait. A negative wait is obviously
> a missed start time.
> Or am I missing something here?

Yup. You are right. As noted, if CTT reported the delayed time this would be detected.





Re: Coordinated Omission - detection and reporting ONLY

Posted by sebb <se...@gmail.com>.
On 19 October 2013 02:17, Gil Tene <gi...@azulsystems.com> wrote:
>
>> [Trying again - please do not hijack this thread.]
>>
>> The Constant Throughput Timer (CTT) calculates the desired wait time,
>> and if this is less than zero - i.e. a sample should already have been
>> generated - it could trigger the creation of a failed Assertion (or similar)
>> showing the time difference.

N.B.     ^^^^^^^^^^^

>>
>> Would this be sufficient to detect all CO occurrences?
>
> Two issues:
>
> 1. It would detect that CO probably happened, but not how much of it happened. Missing 1 msec or 1 minute will look the same.

Huh? The time difference would be reported - is that not enough?

> 2. It would only detect CO in test plans that include an actual CTT (Constant Throughput Timer). It won't work for other timers, or when no timers are used.

Indeed, but in such cases is there any indication of the expected
transaction rate?

>> If not, what other metric needs to be checked?
>
> There are various things you can look for.
>
> Our OutlierCorrector work includes a pretty elaborate CO detector. It sits either inline on the listener notification stream, or parses the log file. The detector identifies sampler patterns, establishes expected interval between patterns, and detects when the actual interval falls far above the expected interval. This is a code change to JMeter, but a pretty localized one.

AIUI CO happens when the sample takes much longer than usual, thus
(potentially) delaying any subsequent samples.
Overlong samples can already be detected using a Delay Assertion.

>> Even if it is not the only possible cause, would it be useful as a
>> starting point?
>
> Yes. As a warning flag saying "throw away these test results".

The results are not necessarily useless; at the very least they will
show the load at which the system is starting to slow down.

>> I am assuming that the CTT is the primary means of controlling the
>> sample request rate.
>
> Unfortunately many test scenarios I've seen use other means. Many people use other timers or other means for think time emulation.

The CTT is not intended for think time emulation. It is intended to
ensure a specific transaction rate.

Think time obviously affects the transaction rate, but is not the sole
determinant.

>> If there are other elements that are commonly used to control the
>> rate, please note them here.
>>
>> N.B: this thread is only for discussion of how to detect CO and how to
>> report it.
>
> Reporting the existence of CO is an interesting starting point. But the only right way to deal with such a report showing the existence of CO (with no magnitude or other metrics) is to say "I guess the data I got is complete crap, so all the stats and graphs I'm seeing mean nothing".

I disagree that the output is useless.

The delays are reported, so one can see how badly the test was
affected. If the delays are all small, then a slight adjustment of the
thread count and transaction rate should eliminate them.

Only if the delays are large are the results less meaningful, though
they can still show the base load at which the server starts to slow
down.

> If you can report "how much" CO you saw,

As I wrote originally, the CTT would report the time difference (i.e. delay).

> it may help a bit in determining how bad the data is, and how the stats should be treated by the reader. E.g. if you know that CO totaling some amount of time X in a test of length Y had occurred, then you know that any percentile above (100 * (1 - X/Y)) is completely bogus, and should be assumed to be equal to the experienced max value. You can also take the approach that the rest of the percentiles should be shifted down by at least (100 * X / Y). E.g. if you had CO that covered only 0.01% of the total test time, that would be relatively good news.

Exactly.

> But if you had CO covering 5% of the test time, your measured 99%'ile is actually the 94%'ile. Averages are unfortunately anyone's guess when CO is in play and not actually corrected for.
>
> Once you detect both the existence and the magnitude of CO, correcting for it is actually pretty easy. The detection of "how much" is the semi-hard part.

Detecting the delay using the CTT is trivial; it already has to
calculate it to decide how long to wait. A negative wait is obviously
a missed start time.
Or am I missing something here?



Re: Coordinated Omission - detection and reporting ONLY

Posted by Kirk Pepperdine <ki...@gmail.com>.
Hi,

I mostly agree with Gil's comments here. The problem is that I don't use the CTT and (unfortunately, Gil ;-)) I need randomization. Explaining why would have me hijack the thread, so I won't. IMHO, you need an external monitor that detects when this happens, or (as much as I have a distaste for massaging data after the fact) you "correct" the data after the run.

Regards,
Kirk


On 2013-10-19, at 3:17 AM, Gil Tene <gi...@azulsystems.com> wrote:

> 
>> [Trying again - please do not hijack this thread.]
>> 
>> The Constant Throughput Timer (CTT) calculates the desired wait time,
>> and if this is less than zero - i.e. a sample should already have been
>> generated - it could trigger the creation of a failed Assertion (or similar)
>> showing the time difference.
>> 
>> Would this be sufficient to detect all CO occurrences?
> 
> Two issues:
> 
> 1. It would detect that CO probably happened, but not how much of it happened. Missing 1 msec or 1 minute will look the same.
> 
> 2. It would only detect CO in test plans that include an actual CTT (Constant Throughput Timer). It won't work for other timers, or when no timers are used.
> 
>> If not, what other metric needs to be checked?
> 
> There are various things you can look for.
> 
> Our OutlierCorrector work includes a pretty elaborate CO detector. It sits either inline on the listener notification stream, or parses the log file. The detector identifies sampler patterns, establishes expected interval between patterns, and detects when the actual interval falls far above the expected interval. This is a code change to JMeter, but a pretty localized one.
> 
>> Even if it is not the only possible cause, would it be useful as a
>> starting point?
> 
> Yes. As a warning flag saying "throw away these test results".
> 
>> I am assuming that the CTT is the primary means of controlling the
>> sample request rate.
> 
> Unfortunately many test scenarios I've seen use other means. Many people use other timers or other means for think time emulation.
> 
>> If there are other elements that are commonly used to control the
>> rate, please note them here.
>> 
>> N.B: this thread is only for discussion of how to detect CO and how to
>> report it.
> 
> Reporting the existence of CO is an interesting starting point. But the only right way to deal with such a report showing the existence of CO (with no magnitude or other metrics) is to say "I guess the data I got is complete crap, so all the stats and graphs I'm seeing mean nothing".
> 
> If you can report "how much" CO you saw, it may help a bit in determining how bad the data is, and how the stats should be treated by the reader. E.g. if you know that CO totaling some amount of time X in a test of length Y had occurred, then you know that any percentile above (100 * (1 - X/Y)) is completely bogus, and should be assumed to be equal to the experienced max value. You can also take the approach that the rest of the percentiles should be shifted down by at least (100 * X / Y). E.g. if you had CO that covered only 0.01% of the total test time, that would be relatively good news. But if you had CO covering 5% of the test time, your measured 99%'ile is actually the 94%'ile. Averages are unfortunately anyone's guess when CO is in play and not actually corrected for.
> 
> Once you detect both the existence and the magnitude of CO, correcting for it is actually pretty easy. The detection of "how much" is the semi-hard part.
> 




Re: Coordinated Omission - detection and reporting ONLY

Posted by Gil Tene <gi...@azulsystems.com>.
> [Trying again - please do not hijack this thread.]
> 
> The Constant Throughput Timer (CTT) calculates the desired wait time,
> and if this is less than zero - i.e. a sample should already have been
> generated - it could trigger the creation of a failed Assertion (or similar)
> showing the time difference.
> 
> Would this be sufficient to detect all CO occurrences?

Two issues:

1. It would detect that CO probably happened, but not how much of it happened. Missing 1 msec or 1 minute will look the same.

2. It would only detect CO in test plans that include an actual CTT (Constant Throughput Timer). It won't work for other timers, or when no timers are used.

> If not, what other metric needs to be checked?

There are various things you can look for.

Our OutlierCorrector work includes a pretty elaborate CO detector. It sits either inline on the listener notification stream, or parses the log file. The detector identifies sampler patterns, establishes expected interval between patterns, and detects when the actual interval falls far above the expected interval. This is a code change to JMeter, but a pretty localized one.

> Even if it is not the only possible cause, would it be useful as a
> starting point?

Yes. As a warning flag saying "throw away these test results".

> I am assuming that the CTT is the primary means of controlling the
> sample request rate.

Unfortunately many test scenarios I've seen use other means. Many people use other timers or other means for think time emulation.

> If there are other elements that are commonly used to control the
> rate, please note them here.
> 
> N.B: this thread is only for discussion of how to detect CO and how to
> report it.

Reporting the existence of CO is an interesting starting point. But the only right way to deal with such a report showing the existence of CO (with no magnitude or other metrics) is to say "I guess the data I got is complete crap, so all the stats and graphs I'm seeing mean nothing".

If you can report "how much" CO you saw, it may help a bit in determining how bad the data is, and how the stats should be treated by the reader. E.g. if you know that CO totaling some amount of time X in a test of length Y had occurred, then you know that any percentile above (100 * (1 - X/Y)) is completely bogus, and should be assumed to be equal to the experienced max value. You can also take the approach that the rest of the percentiles should be shifted down by at least (100 * X / Y). E.g. if you had CO that covered only 0.01% of the total test time, that would be relatively good news. But if you had CO covering 5% of the test time, your measured 99%'ile is actually the 94%'ile. Averages are unfortunately anyone's guess when CO is in play and not actually corrected for.
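
To make the arithmetic concrete (illustrative numbers, not from any real run): in a 1000-second test with an accumulated CO time of 50 seconds, X/Y = 5%, so every percentile above 100 * (1 - 50/1000) = 95 is bogus, and the remaining percentiles shift down by 100 * 50/1000 = 5; that is exactly why the measured 99%'ile above reads as roughly the 94%'ile.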

Once you detect both the existence and the magnitude of CO, correcting for it is actually pretty easy. The detection of "how much" is the semi-hard part.

