You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by venkatesha m <ts...@yahoo.com> on 2014/03/22 22:11:54 UTC

[math] Proposal for New way of Computing an approximate Percentile without storing input data

Hi,

I would like to propose for adding new way of computing the percentile without needing to store most of input data.  Since this is my first time on contributing to apache; please help me / correct me if i miss any procedure here.

Here are the details.

Description: 
The Percentile calculation in a  traditional way require all the data points to be stored and sorted before accesiing the pth Percentile value of the data set. However the storage of points can become prohibitive when we need to make use of the existing Percentile Implementation at big data scale(For eg: when computing the daily or weekly percentile value of a certain performance metric where the data points accumulated over day and week may run to GB and TB).  While platforms such as hadoop exist to solve the data scale issue; the need for a statistical computation of quantiles without storing data is an absolute essential.
While looking in commons-math classes though Percentile class is available it is implemented with storage of input as requirement. So was wondering if we could add a class to calculate Percentile without needing to store data.  The algorithm that i have chosen to implement and propose is based on P Square algorithm ( http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ) which requires a minimal and finite set of memory stores to compute percentiles for continuous stream of data.

Ref: 
http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing representation of the workflow of the algorithm

Advantages: 
a) As is claimed in the orignal workd the accuracy improves over moderate to large data sets which is the need.
b) A minimal and constant sized data store used to compute a large data set 
c) Useful in Hadoop Map reduce applications

Implementation:
I have implemented this algorithm based on StorelessUnivariateStatistic after checking out from 3.2 branch.  I have also opened a JIRA ticket on the same (https://issues.apache.org/jira/browse/MATH-1112 ) for requesting a new feature to be added. 

Please let me know when and how i could send my code for review.

thanks
murthy

Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data

Posted by Ted Dunning <te...@gmail.com>.

On Sun, Mar 23, 2014 at 2:09 AM, Thomas Neidhart
<th...@gmail.com>wrote:

>
> There is already an issue for this:
>
> https://issues.apache.org/jira/browse/MATH-418
>
> It links also other implementations and algorithms, maybe you could add
> a link to your's as well?
>

Done.  Thanks for the pointer.

Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data

Posted by Thomas Neidhart <th...@gmail.com>.

On 03/22/2014 11:31 PM, Ted Dunning wrote:
> Murthy,
> 
> I recently developed an alternative algorithm which provides superior
> accuracy for extreme quantiles.  You can read more at
> 
> https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true
> 
> The library involved is available via maven and is apache licensed.  Apache
> Commons Math has a "no dependency" policy which might mean that sucking in
> the code would be a better option than simply linking to this.
> 
> The standard of the art before t-digest is generally considered to be
> either Greenwald and Khanna's algorithm GK01 or the Q-digest.  References
> are in the paper above.
> 
> In case it isn't obvious, source code is available on github at
> 
> https://github.com/tdunning/t-digest
> 
> The p^2 algorithm that you suggest is actually quite old and far from the
> state of the art.

There is already an issue for this:

https://issues.apache.org/jira/browse/MATH-418

It links also other implementations and algorithms, maybe you could add
a link to your's as well?

Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data

Posted by Ted Dunning <te...@gmail.com>.

Murthy,

I recently developed an alternative algorithm which provides superior
accuracy for extreme quantiles.  You can read more at

https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true

The library involved is available via maven and is apache licensed.  Apache
Commons Math has a "no dependency" policy which might mean that sucking in
the code would be a better option than simply linking to this.

The standard of the art before t-digest is generally considered to be
either Greenwald and Khanna's algorithm GK01 or the Q-digest.  References
are in the paper above.

In case it isn't obvious, source code is available on github at

https://github.com/tdunning/t-digest

The p^2 algorithm that you suggest is actually quite old and far from the
state of the art.




On Sat, Mar 22, 2014 at 2:40 PM, Phil Steitz <ph...@gmail.com> wrote:

> On 3/22/14, 2:11 PM, venkatesha m wrote:
> > Hi,
> >
> > I would like to propose for adding new way of computing the percentile
> without needing to store most of input data.  Since this is my first time
> on contributing to apache; please help me / correct me if i miss any
> procedure here.
> >
> > Here are the details.
> >
> > Description:
> > The Percentile calculation in a  traditional way require all the data
> points to be stored and sorted before accesiing the pth Percentile value of
> the data set. However the storage of points can become prohibitive when we
> need to make use of the existing Percentile Implementation at big data
> scale(For eg: when computing the daily or weekly percentile value of a
> certain performance metric where the data points accumulated over day and
> week may run to GB and TB).  While platforms such as hadoop exist to solve
> the data scale issue; the need for a statistical computation of quantiles
> without storing data is an absolute essential.
> > While looking in commons-math classes though Percentile class is
> available it is implemented with storage of input as requirement. So was
> wondering if we could add a class to calculate Percentile without needing
> to store data.  The algorithm that i have chosen to implement and propose
> is based on P Square algorithm (
> http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ) which requires a
> minimal and finite set of memory stores to compute percentiles for
> continuous stream of data.
> >
> > Ref:
> > http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing
> representation of the workflow of the algorithm
> >
> > Advantages:
> > a) As is claimed in the orignal workd the accuracy improves over
> moderate to large data sets which is the need.
> > b) A minimal and constant sized data store used to compute a large data
> set
> > c) Useful in Hadoop Map reduce applications
> >
> > Implementation:
> > I have implemented this algorithm based on StorelessUnivariateStatistic
> after checking out from 3.2 branch.  I have also opened a JIRA ticket on
> the same (https://issues.apache.org/jira/browse/MATH-1112 ) for
> requesting a new feature to be added.
> >
> > Please let me know when and how i could send my code for review.
>
> Thanks, Murthy and welcome!
>
> I am personally fine with the P-Square algorithm and would welcome a
> patch including implementation.  Unless others disagree with this
> approach (give it a day or two), I would go ahead and attach a patch
> with the implementation to the JIRA you opened.
>
> Thanks in advance for your contributions!
>
> Phil
> >
> > thanks
> > murthy
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data

Posted by Phil Steitz <ph...@gmail.com>.

On 3/22/14, 2:11 PM, venkatesha m wrote:
> Hi,
>
> I would like to propose for adding new way of computing the percentile without needing to store most of input data.  Since this is my first time on contributing to apache; please help me / correct me if i miss any procedure here.
>
> Here are the details.
>
> Description: 
> The Percentile calculation in a  traditional way require all the data points to be stored and sorted before accesiing the pth Percentile value of the data set. However the storage of points can become prohibitive when we need to make use of the existing Percentile Implementation at big data scale(For eg: when computing the daily or weekly percentile value of a certain performance metric where the data points accumulated over day and week may run to GB and TB).  While platforms such as hadoop exist to solve the data scale issue; the need for a statistical computation of quantiles without storing data is an absolute essential.
> While looking in commons-math classes though Percentile class is available it is implemented with storage of input as requirement. So was wondering if we could add a class to calculate Percentile without needing to store data.  The algorithm that i have chosen to implement and propose is based on P Square algorithm ( http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ) which requires a minimal and finite set of memory stores to compute percentiles for continuous stream of data.
>
> Ref: 
> http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing representation of the workflow of the algorithm
>
> Advantages: 
> a) As is claimed in the orignal workd the accuracy improves over moderate to large data sets which is the need.
> b) A minimal and constant sized data store used to compute a large data set 
> c) Useful in Hadoop Map reduce applications
>
> Implementation:
> I have implemented this algorithm based on StorelessUnivariateStatistic after checking out from 3.2 branch.  I have also opened a JIRA ticket on the same (https://issues.apache.org/jira/browse/MATH-1112 ) for requesting a new feature to be added. 
>
> Please let me know when and how i could send my code for review.

Thanks, Murthy and welcome!

I am personally fine with the P-Square algorithm and would welcome a
patch including implementation.  Unless others disagree with this
approach (give it a day or two), I would go ahead and attach a patch
with the implementation to the JIRA you opened.

Thanks in advance for your contributions!

Phil
>
> thanks
> murthy


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org