You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jmeter.apache.org by abalanonline <gi...@git.apache.org> on 2017/05/08 01:24:09 UTC

[GitHub] jmeter pull request #296: Bug 61078 - Percentile calculation error

GitHub user abalanonline opened a pull request:

    https://github.com/apache/jmeter/pull/296

    Bug 61078 - Percentile calculation error

    https://bz.apache.org/bugzilla/show_bug.cgi?id=61078


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/abalanonline/jmeter 61078

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jmeter/pull/296.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #296
    
----
commit 446b91f5f04fd5997136435f86c91231dbe59a2d
Author: Aleksei Balan <ab...@gmail.com>
Date:   2017-05-08T01:18:32Z

    Bug 61078 - Percentile calculation error

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by abalanonline <gi...@git.apache.org>.
Github user abalanonline commented on the issue:

    https://github.com/apache/jmeter/pull/296
  
    @Wyatts Thank you.
    R-8 is good if the interpolated approach is required. But for JMeter the R-1 is preferable because data calculated that way have a concrete understandable meaning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jmeter pull request #296: Bug 61078 - Percentile calculation error

Posted by FSchumacher <gi...@git.apache.org>.
Github user FSchumacher commented on a diff in the pull request:

    https://github.com/apache/jmeter/pull/296#discussion_r115304397
  
    --- Diff: src/jorphan/org/apache/jorphan/math/StatCalculator.java ---
    @@ -162,7 +162,7 @@ public T getPercentPoint(double percent) {
             }
     
             // use Math.round () instead of simple (long) to provide correct value rounding
    -        long target = Math.round (count * percent);
    +        long target = Math.round(Math.ceil(count * percent));
    --- End diff --
    
    No need to call `Math.round()` after `Math.ceil()`. A simple cast to `long` would be enough. But it will probably not hurt, either.
    
    But you should adapt the comment above the line to the new algorithm.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by abalanonline <gi...@git.apache.org>.
Github user abalanonline commented on the issue:

    https://github.com/apache/jmeter/pull/296
  
    Hello @pmouawad ,
    Different estimation types used by Apache Math explained [here](https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample).
    The use case for nearest rank method is simple. If the 90% percentile is 1200 ms than that means that 90% of tests take no more than 1200 ms. Same for 95% and 99%. Very informative value for JMeter. It is implemented in R_1 and R_2.
    The purpose of interpolated method is to eliminate the discontinuity and to get data useful for calculation in different linear functions. Difference is in the min and max values. If we treat them only as starting and ending points in the diagram then C=1 variant (R_7) is used. Otherwise if the min and max are the full value experiment results and deserve their own space on a diagram then C=0 variant (LEGACY, R_6) is considered.
    Intermediate variant C=1/2 (R_5, R_8, R_9) is fine too because it is balanced and does not distort the "angle" of a diagram.
    The bug affects most of the libraries because of a cognitive inertia of developers. The ceiling function is very rarely used and it is psychologically easier to use round instead. Also round function give correct median for odd number of experiments. Seeing that they calm down and believe that everything is calculating correctly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by pmouawad <gi...@git.apache.org>.
Github user pmouawad commented on the issue:

    https://github.com/apache/jmeter/pull/296
  
    Hi Team,
    What shall we do ?
    As per Felix note on dev mailing list, it is more an algorithm variation than a bug.
    
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by abalanonline <gi...@git.apache.org>.
Github user abalanonline commented on the issue:

    https://github.com/apache/jmeter/pull/296
  
    @FSchumacher thank you for the remark, it is fixed now
    
    @pmouawad DescriptiveStatistics uses legacy percentile calculation method by default, which is Linear Interpolation Third Variant C=0
    To fix the test please add these lines after the loop
    ```
    statistics.setPercentileImpl((new Percentile()).withEstimationType(Percentile.EstimationType.R_1));
    for (int i=1; i<=100; i++) assertEquals(statistics.getPercentile(i), calc.getPercentPoint(i / 100D), 0.001);
    ```
    I wonder why do we still use **org.apache.jorphan** having such a powerful math library as Apache Commons Math.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Re: [GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Vladimir Sitnikov <si...@gmail.com>.
Felix>I wonder if we should change our implementation at all.

So do I.
I wish JMeter would just throw an error when user tries to calculate 90%
percentile out of 5 values =)

Felix>Note I share your thoughts on using a dedicated library but
Felix> commons-math may be overkill in terms of performance compared to
Felix> HdrHistogram

I agree HdrHistogram might be the only way to compute high percentiles with
sane amount of memory.

Felix>R and numpy will interpolate the median and the percentiles/quantiles.

Technically speaking, R has 9 types of quantile calculation:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

There's a comment:

R.quantile.doc>Further details are provided in Hyndman and Fan (1996) who
recommended type 8. The default method is type 7, as used by S and by R <
2.0.0.

As far as I understand that, "type 8" is somewhat better, however R
defaults to type 7 for backward compatibility reasons.

Here's what R version 3.4.0 (2017-04-21) produces:

quantile(c(15, 20, 35, 40, 50), c(0.05, 0.3, 0.4, 0.5, 1.0))
  5%  30%  40%  50% 100%
  16   23   29   35   50

quantile(c(15, 20, 35, 40, 50), c(0.05, 0.3, 0.4, 0.5, 1.0), type=8)
      5%      30%      40%      50%     100%
15.00000 19.66667 27.00000 35.00000 50.00000


Vladimir

Re: [GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Vladimir Sitnikov <si...@gmail.com>.
Philippe>"If the 90% percentile is 1200 ms than that means that 90% of
tests take no more than 1200 ms"

Well, I get your point. It makes sense to keep the old approach unless
there's some data that confirms some other approach is better.

Vladimir>What I mean is R8 kind of
Vladimir> computation should better approximate

Philippe>Will you submit a PR for that ? Before or after release of 3.3 ?

Well, I can definitely submit one, however I'm not that fond of making a
change for a sake of change.


Vladimir

Re: [GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Felix Schumacher <fe...@internetallee.de>.
Am 03.06.2017 um 10:55 schrieb Philippe Mouawad:
> On Wed, May 31, 2017 at 2:54 PM, Vladimir Sitnikov <
> sitnikov.vladimir@gmail.com> wrote:
>
>> Philippe>- switch everywhere to R1 (also in commons-math)
>>
>> Can you please clarify why do you prefer R1?
>>
> Because from what the reporter wrote, it looked good to me:
> "If the 90% percentile is 1200 ms than that means that 90% of tests take no
> more than 1200 ms"
>
> And there is a another pragmatic point, it seems JOrphan implementation is
> R1 once we fix the issue.
>
>
>
>> I'm inclined to R8 (as it is recommended by R for sample quantile
>> calculation).
>>
>> 1) I think interpolation would reduce run-to-run variance.
>> 2) Interpolation-like estimation is easier to implement. For instance, if
>> HdrHistogram estimator is added, then its result would be closer to R8
>> rather than to R1.
>>
> ok
>
>
>> I don't think "the result of 90% is one of the sample response times" is
>> important. The important stuff is how system under test behaves, and it is
>> not something tied to a single execution. What I mean is R8 kind of
>> computation should better approximate the true percentile value than R1
>> would, and it is the true value that is important to compare and report as
>> a test result.
>>
> Will you submit a PR for that ? Before or after release of 3.3 ?
Is there any need to rush this before 3.3?

I still believe, that there is no exact definition for the median, but 
there are may specialised definitions like the R1 .. R8 mentioned by 
Vladimir. So I think, that whatever we chose now, will be challenged by 
someone in the future.

If we change the algorithm, we should document it well and try to make 
it so flexible, that it can be configured to act as R1 .. R8 (if that is 
possible without too much work).

If we don't change the algorithm, we should document the current state.

Felix

> Thanks
>
>
>> Vladimir
>>
>
>


Re: [GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Philippe Mouawad <ph...@gmail.com>.
On Wed, May 31, 2017 at 2:54 PM, Vladimir Sitnikov <
sitnikov.vladimir@gmail.com> wrote:

> Philippe>- switch everywhere to R1 (also in commons-math)
>
> Can you please clarify why do you prefer R1?
>

Because from what the reporter wrote, it looked good to me:
"If the 90% percentile is 1200 ms than that means that 90% of tests take no
more than 1200 ms"

And there is a another pragmatic point, it seems JOrphan implementation is
R1 once we fix the issue.



> I'm inclined to R8 (as it is recommended by R for sample quantile
> calculation).
>
> 1) I think interpolation would reduce run-to-run variance.
> 2) Interpolation-like estimation is easier to implement. For instance, if
> HdrHistogram estimator is added, then its result would be closer to R8
> rather than to R1.
>

ok


>
> I don't think "the result of 90% is one of the sample response times" is
> important. The important stuff is how system under test behaves, and it is
> not something tied to a single execution. What I mean is R8 kind of
> computation should better approximate the true percentile value than R1
> would, and it is the true value that is important to compare and report as
> a test result.
>

Will you submit a PR for that ? Before or after release of 3.3 ?
Thanks


>
> Vladimir
>



-- 
Cordialement.
Philippe Mouawad.

Re: [GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Vladimir Sitnikov <si...@gmail.com>.
Philippe>- switch everywhere to R1 (also in commons-math)

Can you please clarify why do you prefer R1?

I'm inclined to R8 (as it is recommended by R for sample quantile
calculation).

1) I think interpolation would reduce run-to-run variance.
2) Interpolation-like estimation is easier to implement. For instance, if
HdrHistogram estimator is added, then its result would be closer to R8
rather than to R1.

I don't think "the result of 90% is one of the sample response times" is
important. The important stuff is how system under test behaves, and it is
not something tied to a single execution. What I mean is R8 kind of
computation should better approximate the true percentile value than R1
would, and it is the true value that is important to compare and report as
a test result.

Vladimir

Re: [GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Antonio Gomes Rodrigues <ra...@gmail.com>.
Hi,

I don't have time to read the posted links yet

But I am OK to have the same way to calculate percentiles and documented it

Antonio

2017-05-28 11:51 GMT+02:00 Philippe Mouawad <ph...@gmail.com>:

> Hello,
> After reading further on this topic and also reading the different
> comments, my position would be:
> - switch everywhere to R1 (also in commons-math)
> - use the PR from contributor for the median and jorphan computations
> - document the change and algo somewhere
>
> From my understanding, tests having large results should not be affected by
> change.
>
> This would at least make computations uniform until we decide what library
> to use.
>
> I need your go before going further.
>
> If we decide for statusquo then please comment on respective bugs to
> explain to reported and contributor why we won't change anything.
>
> Regards
>
> On Tuesday, May 9, 2017, Felix Schumacher <felix.schumacher@
> internetallee.de>
> wrote:
>
> > Am 09.05.2017 09:11, schrieb pmouawad:
> >
> >> Github user pmouawad commented on the issue:
> >>
> >>     https://github.com/apache/jmeter/pull/296
> >>
> >>     Hello @abalanonline ,
> >>     Thanks for your replies and explanations !
> >>
> >>     I am not a math expert as you seem to be, so I have few questions
> >> you may be able to help on:
> >>
> >>     1. Thanks to your comment, I see default method is LEGACY, and the
> >> one you have created is R_1. Do you have some insights on the
> >> different method and their limits / use cases ?
> >>
> >>     2. Why does the "bug" you report affect all libraries I checked
> >> (HdrHistogram, https://github.com/tdunning/t-digest/ and JOrphan ) ?
> >> Can't it be due to a different method estimation algorithm ?
> >>
> >>     Note I share your thoughts on using a dedicated library but
> >> commons-math may be overkill in terms of performance compared to
> >> HdrHistogram or t-digest.
> >>
> >
> > I have tried to do a bit of research on percentiles, quantiles and
> median.
> >
> > It looks to me, that those "points" are more like ranges, and there is no
> > exact value.
> >
> > R and numpy will interpolate the median and the percentiles/quantiles.
> The
> > statistics module
> > of python 3 has three different median implementations called median,
> > median_high and median_low,
> > that interpolate, give the highest possible median and the lowest.
> >
> > Wikipedia (the german one), gives a definition of an "Empirisches
> > Quantile" (empiric quantile),
> > where it settles on the lower border of the quantiles (and therefore the
> > median).
> >
> > I wonder if we should change our implementation at all.
> >
> > Felix
> >
> >
> >>     Thanks
> >>
> >>
> >> ---
> >> If your project is set up for it, you can reply to this email and have
> >> your
> >> reply appear on GitHub as well. If your project does not have this
> feature
> >> enabled and wishes so, or if the feature is enabled but not working,
> >> please
> >> contact infrastructure at infrastructure@apache.org or file a JIRA
> ticket
> >> with INFRA.
> >> ---
> >>
> >
>
> --
> Cordialement.
> Philippe Mouawad.
>

Re: [GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Philippe Mouawad <ph...@gmail.com>.
Hello,
After reading further on this topic and also reading the different
comments, my position would be:
- switch everywhere to R1 (also in commons-math)
- use the PR from contributor for the median and jorphan computations
- document the change and algo somewhere

From my understanding, tests having large results should not be affected by
change.

This would at least make computations uniform until we decide what library
to use.

I need your go before going further.

If we decide for statusquo then please comment on respective bugs to
explain to reported and contributor why we won't change anything.

Regards

On Tuesday, May 9, 2017, Felix Schumacher <fe...@internetallee.de>
wrote:

> Am 09.05.2017 09:11, schrieb pmouawad:
>
>> Github user pmouawad commented on the issue:
>>
>>     https://github.com/apache/jmeter/pull/296
>>
>>     Hello @abalanonline ,
>>     Thanks for your replies and explanations !
>>
>>     I am not a math expert as you seem to be, so I have few questions
>> you may be able to help on:
>>
>>     1. Thanks to your comment, I see default method is LEGACY, and the
>> one you have created is R_1. Do you have some insights on the
>> different method and their limits / use cases ?
>>
>>     2. Why does the "bug" you report affect all libraries I checked
>> (HdrHistogram, https://github.com/tdunning/t-digest/ and JOrphan ) ?
>> Can't it be due to a different method estimation algorithm ?
>>
>>     Note I share your thoughts on using a dedicated library but
>> commons-math may be overkill in terms of performance compared to
>> HdrHistogram or t-digest.
>>
>
> I have tried to do a bit of research on percentiles, quantiles and median.
>
> It looks to me, that those "points" are more like ranges, and there is no
> exact value.
>
> R and numpy will interpolate the median and the percentiles/quantiles. The
> statistics module
> of python 3 has three different median implementations called median,
> median_high and median_low,
> that interpolate, give the highest possible median and the lowest.
>
> Wikipedia (the german one), gives a definition of an "Empirisches
> Quantile" (empiric quantile),
> where it settles on the lower border of the quantiles (and therefore the
> median).
>
> I wonder if we should change our implementation at all.
>
> Felix
>
>
>>     Thanks
>>
>>
>> ---
>> If your project is set up for it, you can reply to this email and have
>> your
>> reply appear on GitHub as well. If your project does not have this feature
>> enabled and wishes so, or if the feature is enabled but not working,
>> please
>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>> with INFRA.
>> ---
>>
>

-- 
Cordialement.
Philippe Mouawad.

Re: [GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Felix Schumacher <fe...@internetallee.de>.
Am 09.05.2017 09:11, schrieb pmouawad:
> Github user pmouawad commented on the issue:
> 
>     https://github.com/apache/jmeter/pull/296
> 
>     Hello @abalanonline ,
>     Thanks for your replies and explanations !
> 
>     I am not a math expert as you seem to be, so I have few questions
> you may be able to help on:
> 
>     1. Thanks to your comment, I see default method is LEGACY, and the
> one you have created is R_1. Do you have some insights on the
> different method and their limits / use cases ?
> 
>     2. Why does the "bug" you report affect all libraries I checked
> (HdrHistogram, https://github.com/tdunning/t-digest/ and JOrphan ) ?
> Can't it be due to a different method estimation algorithm ?
> 
>     Note I share your thoughts on using a dedicated library but
> commons-math may be overkill in terms of performance compared to
> HdrHistogram or t-digest.

I have tried to do a bit of research on percentiles, quantiles and 
median.

It looks to me, that those "points" are more like ranges, and there is 
no exact value.

R and numpy will interpolate the median and the percentiles/quantiles. 
The statistics module
of python 3 has three different median implementations called median, 
median_high and median_low,
that interpolate, give the highest possible median and the lowest.

Wikipedia (the german one), gives a definition of an "Empirisches 
Quantile" (empiric quantile),
where it settles on the lower border of the quantiles (and therefore the 
median).

I wonder if we should change our implementation at all.

Felix

> 
>     Thanks
> 
> 
> ---
> If your project is set up for it, you can reply to this email and have 
> your
> reply appear on GitHub as well. If your project does not have this 
> feature
> enabled and wishes so, or if the feature is enabled but not working, 
> please
> contact infrastructure at infrastructure@apache.org or file a JIRA 
> ticket
> with INFRA.
> ---

[GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by pmouawad <gi...@git.apache.org>.
Github user pmouawad commented on the issue:

    https://github.com/apache/jmeter/pull/296
  
    Hello @abalanonline ,
    Thanks for your replies and explanations !
    
    I am not a math expert as you seem to be, so I have few questions you may be able to help on:
    
    1. Thanks to your comment, I see default method is LEGACY, and the one you have created is R_1. Do you have some insights on the different method and their limits / use cases ?
    
    2. Why does the "bug" you report affect all libraries I checked (HdrHistogram, https://github.com/tdunning/t-digest/ and JOrphan ) ? Can't it be due to a different method estimation algorithm ?
    
    Note I share your thoughts on using a dedicated library but commons-math may be overkill in terms of performance compared to HdrHistogram or t-digest.
    
    Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by pmouawad <gi...@git.apache.org>.
Github user pmouawad commented on the issue:

    https://github.com/apache/jmeter/pull/296
  
    Thanks for your patch.
    
    This test fails for me:
    `    @Test
        public void testPercentagePointBug() throws Exception {
            long values[] = new long[] {
                10L,9L,5L,6L,1L,3L,8L,2L,7L,4L
            };
            DescriptiveStatistics statistics = new DescriptiveStatistics();
            for (long l : values) {
                calc.addValue(l);
                statistics.addValue(l);
            }
            assertEquals(statistics.getPercentile(90), 
                    calc.getPercentPoint(0.9), 0.5);
        }`
    
    with:
    
    `java.lang.AssertionError: expected:<9.9> but was:<9.0>`
    
    Does it sounds ok for you ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] jmeter issue #296: Bug 61078 - Percentile calculation error

Posted by Wyatts <gi...@git.apache.org>.
Github user Wyatts commented on the issue:

    https://github.com/apache/jmeter/pull/296
  
    FWIW, [this article](https://analyse-it.com/blog/2013/2/quantiles-percentiles-why-so-many-ways-to-calculate-them) has two citations in favour of defaulting to what Wikipedia calls R-8 (though it notes R defaults to R-7 for compatibility with S).  That's a decent argument for either of them, IMO.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---