You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by Phil Steitz <ph...@gmail.com> on 2009/04/19 17:34:24 UTC

[math] MATH-224 - need a better idea

We should be able to find a clean way to do what this enhancement 
request is asking for.  I am feeling stupid because even when I consider 
breaking compatibility / refactoring to use generics, I can't find a 
simple way to do it.  Here is a description of the current API and some 
failed ideas that I have considered so far.   As usual, I would like to 
minimize pain for current users in addressing this, but at this point I 
am starting to think that wholesale refactoring is necessary and I would 
appreciate ideas on the best way to do this.

SummaryStatistics provides "storeless" computation of summary statistics 
- min, max, mean, variance, etc.  Here "storeless" means that the class 
does not hold the stream of data in memory.  It was designed to support 
pluggable implementations of the statistics that it computes.  It does 
this in a way that looks smelly in the new world of type-safe Java 
(well, maybe it always smelled ;)  The injectable implementation classes 
in SummaryStatistics are typed as "StorelessUnivariateStatistic" which 
is an interface that includes things like getResult() and 
increment(double).  There is nothing preventing, for example, a variance 
implementation from being "plugged in" to implement the mean.

The request in MATH-224 is to support aggregation in the following 
sense:  SummaryStatistics instance 1 gets a stream of values and 
instance 2 gets another stream of values and we want to create a new 
instance or replace instance 1 with an instance that behaves as though 
it got all the data from both streams.  The simplest way to do this 
would be to add an "aggregate" method to the 
StorelessUnivariateStatistic interface and then just implement 
aggregation in SummaryStatistics by delegation to the implementation 
instances.  This is essentially what the patch attached to MATH-224 
does.  The problem with this approach is that supporting aggregation is 
a fairly strong requirement in general, stronger than just requiring 
that the statistic be computable without storing the data.  Stronger 
still is the requirement that an implementation of a statistic be 
"aggregatable" with a possibly different implementation (since then it 
would have access only to the value of the other statistic).

So the challenge is can we find a clean way to achieve the four objectives:

0) Maintain pluggability of statistics implementations
1) Support aggregation
2) Improve type safety
3) Minimize trauma for current users

Dropping 0) makes things much simpler, but I would like to avoid that 
unless there is really no way to accomplish 1) and 2) without taking 
that step.  Strictly speaking, 1) may be impossible as I know of no way 
to support this for the higher moments.  I would be OK with aggregation 
forcing these to NaN (documented, of course).

My first thought was to define a parameterized Aggregatable interface 
that requires the same types.  Then two SummaryStatistics instances are 
aggregatable iff their implementation statistics match types.  I am OK 
with these restrictions, but am having trouble actually making it work.

Suggestions / patches welcome!

Phil



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] MATH-224 - need a better idea

Posted by Luc Maisonobe <Lu...@free.fr>.

Phil Steitz a écrit :
> We should be able to find a clean way to do what this enhancement
> request is asking for.  I am feeling stupid because even when I consider
> breaking compatibility / refactoring to use generics, I can't find a
> simple way to do it.  Here is a description of the current API and some
> failed ideas that I have considered so far.   As usual, I would like to
> minimize pain for current users in addressing this, but at this point I
> am starting to think that wholesale refactoring is necessary and I would
> appreciate ideas on the best way to do this.
> 
> SummaryStatistics provides "storeless" computation of summary statistics
> - min, max, mean, variance, etc.  Here "storeless" means that the class
> does not hold the stream of data in memory.  It was designed to support
> pluggable implementations of the statistics that it computes.  It does
> this in a way that looks smelly in the new world of type-safe Java
> (well, maybe it always smelled ;)  The injectable implementation classes
> in SummaryStatistics are typed as "StorelessUnivariateStatistic" which
> is an interface that includes things like getResult() and
> increment(double).  There is nothing preventing, for example, a variance
> implementation from being "plugged in" to implement the mean.
> 
> The request in MATH-224 is to support aggregation in the following
> sense:  SummaryStatistics instance 1 gets a stream of values and
> instance 2 gets another stream of values and we want to create a new
> instance or replace instance 1 with an instance that behaves as though
> it got all the data from both streams.  The simplest way to do this
> would be to add an "aggregate" method to the
> StorelessUnivariateStatistic interface and then just implement
> aggregation in SummaryStatistics by delegation to the implementation
> instances.  This is essentially what the patch attached to MATH-224
> does.  The problem with this approach is that supporting aggregation is
> a fairly strong requirement in general, stronger than just requiring
> that the statistic be computable without storing the data.  Stronger
> still is the requirement that an implementation of a statistic be
> "aggregatable" with a possibly different implementation (since then it
> would have access only to the value of the other statistic).
> 
> So the challenge is can we find a clean way to achieve the four objectives:
> 
> 0) Maintain pluggability of statistics implementations
> 1) Support aggregation
> 2) Improve type safety
> 3) Minimize trauma for current users
> 
> Dropping 0) makes things much simpler, but I would like to avoid that
> unless there is really no way to accomplish 1) and 2) without taking
> that step.  Strictly speaking, 1) may be impossible as I know of no way
> to support this for the higher moments.  I would be OK with aggregation
> forcing these to NaN (documented, of course).

I think 1) has a high priority. It is the whole subject of the issue. I
see several use cases for it, including for example parallel computation
 and later merge. I also think providing it for higher moments requires
not only the final result but also intermediate values (typically
sum(x^0), sum(x^1), sum(x^2) ...). So this implies these value are
available in addition to final results.

Perhaps one way to allow it would be to have different interfaces for
the various statistics. Mean would provide sum(x^0) and sum(x^1) for
example in addition to the final result which is the ratio. The cost for
this is a more complex API, but I think it is worth trying.

Luc

> 
> My first thought was to define a parameterized Aggregatable interface
> that requires the same types.  Then two SummaryStatistics instances are
> aggregatable iff their implementation statistics match types.  I am OK
> with these restrictions, but am having trouble actually making it work.
> 
> Suggestions / patches welcome!
> 
> Phil
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] MATH-224 - need a better idea

Posted by Phil Steitz <ph...@gmail.com>.

John Bollinger wrote:
> The same approach could certainly be applied for DescriptiveStatistics, but the variable window complicates things: if a finite window is selected for the aggregate statistics then they will be sensitive to the order in which values are added to the contributing per-partition statistics.  That problem exists no matter when the aggregation is performed, however, and I guess the order we would get is reasonably likely to be the desired one.  Also, the removeMostRecentValue() and replaceMostRecentValue() methods are a bit tricky if they need to cascade to the aggregate statistics because the most recent value for one contributor may not be the most recent value for the aggregate.  Anyway, I'll prepare an AggregateDescriptiveStatistics along the same line as my AggregateSummaryStatistics, and then at least we'll have something concrete to discuss.  Shall I post it as an additional patch for MATH-224?
>   
> DescriptiveStatistics does provide an opportunity for aggregating after the fact that SummaryStatistics doesn't, because each contributing statistic remembers (some of) the values provided to it.  On the other hand, users already can manually aggregate DescriptiveStatistics objects.  What they cannot easily do after the fact is duplicate the overall order in which values were added to the set of DescriptiveStatistics, and that is exactly what AggregateDescriptiveStatistics will provide.  I think I'm rambling now, so I'll stop and write some code.
>   
Always a good idea ^

I was thinking initially of post-hoc aggregation, using the backing 
data, but it is worth investigating the approach above.  Thanks!

Phil
>
> Regards,
>
> John
>
>
>
>
> ________________________________
> From: Phil Steitz <ph...@gmail.com>
> To: Commons Developers List <de...@commons.apache.org>
> Sent: Monday, April 20, 2009 7:01:20 AM
> Subject: Re: [math] MATH-224 - need a better idea
>
> Ted Dunning wrote:
>   
>> That is a fine answer for some things, but the parallel cases fail.
>>
>> My feeling is that there are a few cases where there are nice aggregatable
>> summary statistics like moments and there are many cases where this just
>> doesn't work well (such as rank statistics). 
>>     
> Yes, this is why not all statistics are "storeless."  We have another "summary" class that maintains its data in storage and supports "rolling" behavior in DescriptiveStatistics.  The discussion here is focussed on the "storeless" case, which is limited to those stats that are computable in this way.  The cases of interest are stats that can be computed in one pass through the data but which can't be "aggregated" post hoc.  John's approach provides a simple solution to this problem.
>
> For completeness, we should probably similarly implement aggregation in the sense defined in MATH-224 for DescriptiveStatistics as well. 
> Phil
>   
>>  For the latter, case I usually
>> make do with a surrogate such as a random sub-sample or a recency weighted
>> random sub-sample combined with a few aggregatable stats such as total
>> samples, max, min, sum and second moment.  That gives me most of what I want
>> and if the sub-sample is reasonably large, I can sometimes estimate a few
>> parameters such as total uniques.  The sub-sampled data streams can be
>> combined trivially so I now have a aggregatable approximation of
>> non-aggregatable statistics.  For descriptive quantiles this is generally
>> just fine.
>>
>> On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <th...@yahoo.com> wrote:
>>
>>  
>>     
>>> The key would be to generate the aggregate statistics at the same time as
>>> the per-partition ones, instead of aggregating them after the fact.
>>>    
>>>       
>>
>>
>>  
>>     
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>
>       
>   



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] MATH-224 - need a better idea

Posted by John Bollinger <th...@yahoo.com>.

The same approach could certainly be applied for DescriptiveStatistics, but the variable window complicates things: if a finite window is selected for the aggregate statistics then they will be sensitive to the order in which values are added to the contributing per-partition statistics.  That problem exists no matter when the aggregation is performed, however, and I guess the order we would get is reasonably likely to be the desired one.  Also, the removeMostRecentValue() and replaceMostRecentValue() methods are a bit tricky if they need to cascade to the aggregate statistics because the most recent value for one contributor may not be the most recent value for the aggregate.  Anyway, I'll prepare an AggregateDescriptiveStatistics along the same line as my AggregateSummaryStatistics, and then at least we'll have something concrete to discuss.  Shall I post it as an additional patch for MATH-224?

DescriptiveStatistics does provide an opportunity for aggregating after the fact that SummaryStatistics doesn't, because each contributing statistic remembers (some of) the values provided to it.  On the other hand, users already can manually aggregate DescriptiveStatistics objects.  What they cannot easily do after the fact is duplicate the overall order in which values were added to the set of DescriptiveStatistics, and that is exactly what AggregateDescriptiveStatistics will provide.  I think I'm rambling now, so I'll stop and write some code.

Regards,

John

________________________________
From: Phil Steitz <ph...@gmail.com>
To: Commons Developers List <de...@commons.apache.org>
Sent: Monday, April 20, 2009 7:01:20 AM
Subject: Re: [math] MATH-224 - need a better idea

Ted Dunning wrote:
> That is a fine answer for some things, but the parallel cases fail.
> 
> My feeling is that there are a few cases where there are nice aggregatable
> summary statistics like moments and there are many cases where this just
> doesn't work well (such as rank statistics). 
Yes, this is why not all statistics are "storeless."  We have another "summary" class that maintains its data in storage and supports "rolling" behavior in DescriptiveStatistics.  The discussion here is focussed on the "storeless" case, which is limited to those stats that are computable in this way.  The cases of interest are stats that can be computed in one pass through the data but which can't be "aggregated" post hoc.  John's approach provides a simple solution to this problem.

For completeness, we should probably similarly implement aggregation in the sense defined in MATH-224 for DescriptiveStatistics as well. 
Phil
>  For the latter, case I usually
> make do with a surrogate such as a random sub-sample or a recency weighted
> random sub-sample combined with a few aggregatable stats such as total
> samples, max, min, sum and second moment.  That gives me most of what I want
> and if the sub-sample is reasonably large, I can sometimes estimate a few
> parameters such as total uniques.  The sub-sampled data streams can be
> combined trivially so I now have a aggregatable approximation of
> non-aggregatable statistics.  For descriptive quantiles this is generally
> just fine.
> 
> On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <th...@yahoo.com> wrote:
> 
>  
>> The key would be to generate the aggregate statistics at the same time as
>> the per-partition ones, instead of aggregating them after the fact.
>>    
> 
> 
> 
> 
>  

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] MATH-224 - need a better idea

Posted by Phil Steitz <ph...@gmail.com>.

Ted Dunning wrote:
> That is a fine answer for some things, but the parallel cases fail.
>
> My feeling is that there are a few cases where there are nice aggregatable
> summary statistics like moments and there are many cases where this just
> doesn't work well (such as rank statistics). 
Yes, this is why not all statistics are "storeless."  We have another 
"summary" class that maintains its data in storage and supports 
"rolling" behavior in DescriptiveStatistics.  The discussion here is 
focussed on the "storeless" case, which is limited to those stats that 
are computable in this way.  The cases of interest are stats that can be 
computed in one pass through the data but which can't be "aggregated" 
post hoc.  John's approach provides a simple solution to this problem.

For completeness, we should probably similarly implement aggregation in 
the sense defined in MATH-224 for DescriptiveStatistics as well. 

Phil
>  For the latter, case I usually
> make do with a surrogate such as a random sub-sample or a recency weighted
> random sub-sample combined with a few aggregatable stats such as total
> samples, max, min, sum and second moment.  That gives me most of what I want
> and if the sub-sample is reasonably large, I can sometimes estimate a few
> parameters such as total uniques.  The sub-sampled data streams can be
> combined trivially so I now have a aggregatable approximation of
> non-aggregatable statistics.  For descriptive quantiles this is generally
> just fine.
>
> On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <th...@yahoo.com> wrote:
>
>   
>> The key would be to generate the aggregate statistics at the same time as
>> the per-partition ones, instead of aggregating them after the fact.
>>     
>
>
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] MATH-224 - need a better idea

Posted by Ted Dunning <te...@gmail.com>.

That is a fine answer for some things, but the parallel cases fail.

My feeling is that there are a few cases where there are nice aggregatable
summary statistics like moments and there are many cases where this just
doesn't work well (such as rank statistics).  For the latter, case I usually
make do with a surrogate such as a random sub-sample or a recency weighted
random sub-sample combined with a few aggregatable stats such as total
samples, max, min, sum and second moment.  That gives me most of what I want
and if the sub-sample is reasonably large, I can sometimes estimate a few
parameters such as total uniques.  The sub-sampled data streams can be
combined trivially so I now have a aggregatable approximation of
non-aggregatable statistics.  For descriptive quantiles this is generally
just fine.

On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <th...@yahoo.com> wrote:

> The key would be to generate the aggregate statistics at the same time as
> the per-partition ones, instead of aggregating them after the fact.

-- 
Ted Dunning, CTO
DeepDyve

Re: [math] MATH-224 - need a better idea

Posted by Phil Steitz <ph...@gmail.com>.

John Bollinger wrote:
> I'm looking at commons-math for the first time, but I don't think the feature can be implemented as requested in a manner that is suitably generic.  On the other hand, I think the same objective could be achieved a different way without changing the base API at all.  The key would be to generate the aggregate statistics at the same time as the per-partition ones, instead of aggregating them after the fact.  That does require knowing beforehand that you're going to want the aggregate stats, but I think that's a fair tradeoff.  This could be done without making client programs update two sets of statistics with each datum, by wrapping the each StorelessUnivariateStatistic with an implementation that forwards the data to two StorelessUnivariateStatistics -- the wrapped one and one for the aggregate.  Almost all the work of setting that up can be automated.
>
> I'll see whether I can whip up a proof of concept for you to check out.
>   
I like this approach.  As you point out, it avoids entirely the issues 
raised above and is actually quite flexible in terms of when streams 
start and end, etc.  The only downsides are a) cost of all the 
"forwarded" increment calls (not likely to be a real practical issue in 
most cases) and b) ease of use.  I mention b) only because I had to 
think for 5 seconds before anticipating how the test case was going to 
be coded.  I would appreciate feedback from others on this - especially 
those requesting the feature.

Thanks!

Phil
> John
>
>
>
>
> ________________________________
> From: Phil Steitz <ph...@gmail.com>
> To: Commons Developers List <de...@commons.apache.org>
> Sent: Sunday, April 19, 2009 11:34:24 AM
> Subject: [math] MATH-224 - need a better idea
>
> We should be able to find a clean way to do what this enhancement request is asking for.  I am feeling stupid because even when I consider breaking compatibility / refactoring to use generics, I can't find a simple way to do it.  Here is a description of the current API and some failed ideas that I have considered so far.   As usual, I would like to minimize pain for current users in addressing this, but at this point I am starting to think that wholesale refactoring is necessary and I would appreciate ideas on the best way to do this.
>
> SummaryStatistics provides "storeless" computation of summary statistics - min, max, mean, variance, etc.  Here "storeless" means that the class does not hold the stream of data in memory.  It was designed to support pluggable implementations of the statistics that it computes.  It does this in a way that looks smelly in the new world of type-safe Java (well, maybe it always smelled ;)  The injectable implementation classes in SummaryStatistics are typed as "StorelessUnivariateStatistic" which is an interface that includes things like getResult() and increment(double).  There is nothing preventing, for example, a variance implementation from being "plugged in" to implement the mean.
>
> The request in MATH-224 is to support aggregation in the following sense:  SummaryStatistics instance 1 gets a stream of values and instance 2 gets another stream of values and we want to create a new instance or replace instance 1 with an instance that behaves as though it got all the data from both streams.  The simplest way to do this would be to add an "aggregate" method to the StorelessUnivariateStatistic interface and then just implement aggregation in SummaryStatistics by delegation to the implementation instances.  This is essentially what the patch attached to MATH-224 does.  The problem with this approach is that supporting aggregation is a fairly strong requirement in general, stronger than just requiring that the statistic be computable without storing the data.  Stronger still is the requirement that an implementation of a statistic be "aggregatable" with a possibly different implementation (since then it would have access only to the value
>  of the other statistic).
>
> So the challenge is can we find a clean way to achieve the four objectives:
>
> 0) Maintain pluggability of statistics implementations
> 1) Support aggregation
> 2) Improve type safety
> 3) Minimize trauma for current users
>
> Dropping 0) makes things much simpler, but I would like to avoid that unless there is really no way to accomplish 1) and 2) without taking that step.  Strictly speaking, 1) may be impossible as I know of no way to support this for the higher moments.  I would be OK with aggregation forcing these to NaN (documented, of course).
>
> My first thought was to define a parameterized Aggregatable interface that requires the same types.  Then two SummaryStatistics instances are aggregatable iff their implementation statistics match types.  I am OK with these restrictions, but am having trouble actually making it work.
>
> Suggestions / patches welcome!
>
> Phil
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>
>       
>   



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [math] MATH-224 - need a better idea

Posted by John Bollinger <th...@yahoo.com>.

I'm looking at commons-math for the first time, but I don't think the feature can be implemented as requested in a manner that is suitably generic.  On the other hand, I think the same objective could be achieved a different way without changing the base API at all.  The key would be to generate the aggregate statistics at the same time as the per-partition ones, instead of aggregating them after the fact.  That does require knowing beforehand that you're going to want the aggregate stats, but I think that's a fair tradeoff.  This could be done without making client programs update two sets of statistics with each datum, by wrapping the each StorelessUnivariateStatistic with an implementation that forwards the data to two StorelessUnivariateStatistics -- the wrapped one and one for the aggregate.  Almost all the work of setting that up can be automated.

I'll see whether I can whip up a proof of concept for you to check out.

John




________________________________
From: Phil Steitz <ph...@gmail.com>
To: Commons Developers List <de...@commons.apache.org>
Sent: Sunday, April 19, 2009 11:34:24 AM
Subject: [math] MATH-224 - need a better idea

We should be able to find a clean way to do what this enhancement request is asking for.  I am feeling stupid because even when I consider breaking compatibility / refactoring to use generics, I can't find a simple way to do it.  Here is a description of the current API and some failed ideas that I have considered so far.   As usual, I would like to minimize pain for current users in addressing this, but at this point I am starting to think that wholesale refactoring is necessary and I would appreciate ideas on the best way to do this.

SummaryStatistics provides "storeless" computation of summary statistics - min, max, mean, variance, etc.  Here "storeless" means that the class does not hold the stream of data in memory.  It was designed to support pluggable implementations of the statistics that it computes.  It does this in a way that looks smelly in the new world of type-safe Java (well, maybe it always smelled ;)  The injectable implementation classes in SummaryStatistics are typed as "StorelessUnivariateStatistic" which is an interface that includes things like getResult() and increment(double).  There is nothing preventing, for example, a variance implementation from being "plugged in" to implement the mean.

The request in MATH-224 is to support aggregation in the following sense:  SummaryStatistics instance 1 gets a stream of values and instance 2 gets another stream of values and we want to create a new instance or replace instance 1 with an instance that behaves as though it got all the data from both streams.  The simplest way to do this would be to add an "aggregate" method to the StorelessUnivariateStatistic interface and then just implement aggregation in SummaryStatistics by delegation to the implementation instances.  This is essentially what the patch attached to MATH-224 does.  The problem with this approach is that supporting aggregation is a fairly strong requirement in general, stronger than just requiring that the statistic be computable without storing the data.  Stronger still is the requirement that an implementation of a statistic be "aggregatable" with a possibly different implementation (since then it would have access only to the value
 of the other statistic).

So the challenge is can we find a clean way to achieve the four objectives:

0) Maintain pluggability of statistics implementations
1) Support aggregation
2) Improve type safety
3) Minimize trauma for current users

Dropping 0) makes things much simpler, but I would like to avoid that unless there is really no way to accomplish 1) and 2) without taking that step.  Strictly speaking, 1) may be impossible as I know of no way to support this for the higher moments.  I would be OK with aggregation forcing these to NaN (documented, of course).

My first thought was to define a parameterized Aggregatable interface that requires the same types.  Then two SummaryStatistics instances are aggregatable iff their implementation statistics match types.  I am OK with these restrictions, but am having trouble actually making it work.

Suggestions / patches welcome!

Phil



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org