Posted to dev@commons.apache.org by Piotr Kochański <pi...@uw.edu.pl> on 2004/02/02 20:54:11 UTC

[math] Re: "Straw man" release plan

Hello

Phil Steitz wrote:

> Thinking about how this will eventually work, it has occurred to me that 
> EmpiricalDistribution could be used to digest / represent bootstrap 
> distributions.  Since we want the interface for EmpiricalDistribution to 
> be complete for 1.0, we need to make sure that bootstrap data can be 
> loaded into EmpiricalDistribution conveniently (if this makes sense), so I 
> have been thinking about adding load() methods to EmpiricalDistribution 
> that take double[] arrays and streams as values, as well as an addValue() 
> method.  Does this make sense?  I would also appreciate any comments / 
> patches on how to improve the EmpiricalDistribution interface or 
> EmpiricalDistributionImpl.  If refactoring or even holding this from the 
> release are in order, I want to make sure that we do it.

As I understand it, load(double[][]) would compute the empirical distribution
function for every bootstrapped sample (provided from some other source).
Then, instead of having 

SummaryStatistics sampleStats

we should provide 

SummaryStatistics[] sampleStats

where this array would contain SummaryStatistics calculated
for every sample.  SummaryStatistics getSampleStats() would
be changed as well.

Similarly other methods/objects in EmpiricalDistribution  
would have to be modified (e.g. binStats would have to be 
an array of ArrayLists, etc.).

Do I get your intentions right?

The zeroth row of every matrix could be reserved for original
sample and the rest for bootstrapped results (if they can be
calculated, i.e. samples are given). This can be achieved but
some effort has to be made to make it simple to use for those
who do not care about bootstrap and want to get results
based only on the original sample. 

The other thing is that such an extension would be useful
only as long as we work with bootstrap algorithms that use
statistics which are members of SummaryStatistics.

Often this is not the case (classic examples are the median and the
trimmed mean, which are not among SummaryStatistics). Sometimes it is
also necessary (or more comfortable) to operate on the raw bootstrap
samples, not on the EDF calculated from those samples. In these two
cases bootstrap embedded into EmpiricalDistribution would not
be that useful.
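
A minimal sketch of the case described above, where the statistic of
interest (here the median) has to be evaluated on the raw bootstrap
samples. It is an illustration only and uses no Commons Math classes; the
class and method names (BootstrapMedian, resample, medianStandardError)
are made up for this example.

```java
import java.util.Arrays;
import java.util.Random;

public class BootstrapMedian {

    /** Median of a copy of the data (the input array is not modified). */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    }

    /** Draws one bootstrap resample: n values chosen with replacement. */
    static double[] resample(double[] data, Random rng) {
        double[] out = new double[data.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = data[rng.nextInt(data.length)];
        }
        return out;
    }

    /** Bootstrap estimate of the standard error of the median (needs replicates >= 2). */
    static double medianStandardError(double[] data, int replicates, Random rng) {
        double[] medians = new double[replicates];
        double sum = 0.0;
        for (int b = 0; b < replicates; b++) {
            medians[b] = median(resample(data, rng));
            sum += medians[b];
        }
        double mean = sum / replicates;
        double ss = 0.0;
        for (double m : medians) {
            ss += (m - mean) * (m - mean);
        }
        return Math.sqrt(ss / (replicates - 1));
    }

    public static void main(String[] args) {
        double[] data = {2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.0};
        System.out.println("median       = " + median(data));
        System.out.println("bootstrap SE = "
                + medianStandardError(data, 1000, new Random(42)));
    }
}
```

Each replicate needs the full resampled array, which is why a binned
summary of the data is not enough for this kind of statistic.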

Two comments concerning EmpiricalDistribution 
1. Probably it would be nice to have load(double[]) method
2. Instead of
   ArrayList getBinStats();
there could be 
   List getBinStats();

although I can't imagine a practical situation where a List other than
ArrayList would be better.

Piotr

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


[math] Re: "Straw man" release plan

Posted by Piotr Kochański <pi...@uw.edu.pl>.
Hello

> Ah, now I understand what you have been trying to communicate and I agree
> that adding all of this functionality to EmpiricalDistribution is not a
> good idea.  I was only considering the simple use case modelling the
> sampling distribution of a single, known statistic.  The more general case
> in which the bootstrap samples are leveraged for inferences about multiple
> statistics will require more complex machinery.  I suggest that we take
> this up again post 1.0.  For now, I don't think it makes sense to
> significantly modify EmpiricalDistribution (though given the confusion, it
> might be better to change the name :-)

No, no, the name is OK. The purpose of EmpiricalDistribution is quite
obvious, and it really does describe the EDF. Now I have a clear view of
what EmpiricalDistribution is for and, even more important, what it is
*not* for.

I will patch EmpiricalDistribution with the load(double[]) method soon
(such initialization would be quite useful regardless of bootstrap, etc.)

Piotr



Re: [math] Re: "Straw man" release plan

Posted by Phil Steitz <st...@yahoo.com>.
--- Piotr Kochański <pi...@uw.edu.pl> wrote:
<snip/>

> > My thought was that we could do some things (e.g. estimate confidence 
> > intervals) without storing the bootstrap samples or even the full set of
> > bootstrap statistics.
> 
> This is not a problem at all. When we initialize EmpiricalDistribution
> using load(...) method, we can calculate what we want - we have
> data set at that moment. 
> 
> The problem I see is that we have to a priori specify for which statistics
> (bootstrap) confidence interval or standard error would be calculated. 
> 
> We should not make that decision for the user, so some configuration of 
> EmpiricalDistribution object would be necessary, e.g.
> 
> load(double[][], UnivariateStatistics[]) 
> 
> then all the interesting calculation would be done for provided 
> UnivariateStatistics. The default choice could be just SummaryStatistics:
> load(double[][]){
>    statisticsToBeBootstrapped[] = All SummaryStatistics
> }
> 
> If bootstrap samples are not provided, e.g. user uses other
> load function, we can provide confidence intervals based on the
> normal distribution assumption (for those statistics, for which
> it can be calculated).
> 
> In fact we could leave the choice which summary statistics should
> be calculated to the user at all (e.g. for performance reason - someone
> would never be interested in calculating some statistics, but it is done
> anyway, which slows down initialization of the object).
> 
> load(String, UnivariateStatistics[]) etc.
> 
> Then present getSampleStats() method should return
> an object which enables access to calculated statistics and/or
> the confidence intervals for them.
> 

Ah, now I understand what you have been trying to communicate and I agree
that adding all of this functionality to EmpiricalDistribution is not a
good idea.  I was only considering the simple use case modelling the
sampling distribution of a single, known statistic.  The more general case
in which the bootstrap samples are leveraged for inferences about multiple
statistics will require more complex machinery.  I suggest that we take
this up again post 1.0.  For now, I don't think it makes sense to
significantly modify EmpiricalDistribution (though given the confusion, it
might be better to change the name :-)

Phil



[math] Re: "Straw man" release plan

Posted by Piotr Kochański <pi...@uw.edu.pl>.
Phil Steitz wrote:

> Piotr Kochański wrote:
> > Mark R. Diggory wrote:
> 
> > 
> > Exactly, but the point is that we have to preserve original/bootstrap
> > values and EmpiricalDistribution is not storing them - internally it keeps
> > data in the array of bins. 
> 
> My thought was that we could do some things (e.g. estimate confidence 
> intervals) without storing the bootstrap samples or even the full set of 
> bootstrap statistics.

This is not a problem at all. When we initialize EmpiricalDistribution
using a load(...) method, we can calculate whatever we want, since we
have the data set at that moment. 

The problem I see is that we have to specify a priori the statistics for
which the (bootstrap) confidence interval or standard error would be
calculated. 

We should not make that decision for the user, so some configuration of 
EmpiricalDistribution object would be necessary, e.g.

load(double[][], UnivariateStatistics[]) 

then all the interesting calculation would be done for provided 
UnivariateStatistics. The default choice could be just SummaryStatistics:
load(double[][]){
   statisticsToBeBootstrapped[] = All SummaryStatistics
}

If bootstrap samples are not provided, e.g. user uses other
load function, we can provide confidence intervals based on the
normal distribution assumption (for those statistics, for which
it can be calculated).

In fact we could leave the choice of which summary statistics are
calculated entirely to the user (e.g. for performance reasons: someone
may never be interested in some of the statistics, yet they are computed
anyway, which slows down initialization of the object).

load(String, UnivariateStatistics[]) etc.

Then the present getSampleStats() method should return
an object that gives access to the calculated statistics and/or
the confidence intervals for them.
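
A rough sketch of what a load(double[][], UnivariateStatistics[]) style
entry point could look like. Everything here is hypothetical: Statistic
stands in for UnivariateStatistics, Estimate is an invented result type,
and row 0 is assumed to hold the original sample with the bootstrap
samples in the remaining rows. It assumes Java 8 or later for the method
references and is not taken from the commons-math code base.

```java
public class BootstrapLoadSketch {

    /** Hypothetical stand-in for the UnivariateStatistics type named above. */
    interface Statistic {
        double evaluate(double[] values);
    }

    /** Value on the original sample plus its bootstrap standard error. */
    static final class Estimate {
        final double value;
        final double standardError;

        Estimate(double value, double standardError) {
            this.value = value;
            this.standardError = standardError;
        }
    }

    /**
     * Row 0 of samples is the original sample; rows 1..B are bootstrap samples.
     * Only per-statistic results are returned; the raw rows are not retained.
     */
    static Estimate[] load(double[][] samples, Statistic[] statistics) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("at least the original sample is required");
        }
        int replicates = samples.length - 1;
        Estimate[] result = new Estimate[statistics.length];
        for (int s = 0; s < statistics.length; s++) {
            double[] values = new double[replicates];
            double sum = 0.0;
            for (int b = 0; b < replicates; b++) {
                values[b] = statistics[s].evaluate(samples[b + 1]);
                sum += values[b];
            }
            // The standard error is undefined with fewer than two replicates.
            double se = Double.NaN;
            if (replicates >= 2) {
                double mean = sum / replicates;
                double ss = 0.0;
                for (double v : values) {
                    ss += (v - mean) * (v - mean);
                }
                se = Math.sqrt(ss / (replicates - 1));
            }
            result[s] = new Estimate(statistics[s].evaluate(samples[0]), se);
        }
        return result;
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    static double max(double[] values) {
        double m = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] samples = {
            {1, 2, 3},   // original sample
            {1, 1, 2},   // bootstrap samples
            {2, 3, 3},
            {1, 3, 3}
        };
        Statistic[] stats = {BootstrapLoadSketch::mean, BootstrapLoadSketch::max};
        for (Estimate e : load(samples, stats)) {
            System.out.println(e.value + " +/- " + e.standardError);
        }
    }
}
```

The caller decides which statistics are evaluated, so nothing is computed
that was not asked for.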

Piotr



Re: [math] Re: "Straw man" release plan

Posted by Phil Steitz <ph...@steitz.com>.
Piotr Kochański wrote:
> Mark R. Diggory wrote:

> 
> Exactly, but the point is that we have to preserve original/bootstrap
> values and EmpiricalDistribution is not storing them - internally it keeps
> data in the array of bins. 

My thought was that we could do some things (e.g. estimate confidence 
intervals) without storing the bootstrap samples or even the full set of 
bootstrap statistics.

> As I understand this was the aim - we don't have to
> keep the whole data set in order to get important information about the
> empirical distribution.  If a data set is huge this is a true gain.

Yes.  This is why EmpiricalDistribution exists.
> 
> If, on the other hand, I want to keep the whole data set then I can easily
> use other tools to calculate any statistics I want so I don't need to use 
> EmpiricalDistribution.

Yes. Even for the bootstrap percentiles, if the number of bootstrap 
samples is small enough to store the stats in memory, we could get the 
percentiles directly by applying Percentile to the stored values.
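
As an illustration of that direct approach, here is a small sketch that
takes an array of stored bootstrap statistics and reads a percentile
confidence interval off the sorted values. The percentile routine is a
plain linear-interpolation stand-in written for this example, not the
algorithm used by the Percentile class, and the class and method names
are invented.

```java
import java.util.Arrays;

public class PercentileInterval {

    /**
     * p-th percentile (0 <= p <= 100) of an already sorted array,
     * using linear interpolation between neighbouring order statistics.
     */
    static double percentile(double[] sortedValues, double p) {
        int n = sortedValues.length;
        double pos = p / 100.0 * (n - 1);            // fractional index
        int lower = (int) Math.floor(pos);
        int upper = Math.min(n - 1, (int) Math.ceil(pos));
        double frac = pos - lower;
        return sortedValues[lower] + frac * (sortedValues[upper] - sortedValues[lower]);
    }

    /** Returns {lower, upper} of the bootstrap percentile interval. */
    static double[] interval(double[] bootstrapStats, double confidence) {
        double[] sorted = bootstrapStats.clone();
        Arrays.sort(sorted);
        double alpha = (1.0 - confidence) * 100.0;   // total tail mass in percent
        return new double[] {
            percentile(sorted, alpha / 2.0),
            percentile(sorted, 100.0 - alpha / 2.0)
        };
    }

    public static void main(String[] args) {
        // Stand-in for statistics computed on 101 bootstrap samples.
        double[] stats = new double[101];
        for (int i = 0; i < stats.length; i++) {
            stats[i] = i;
        }
        double[] ci = interval(stats, 0.90);
        System.out.println("90% interval: [" + ci[0] + ", " + ci[1] + "]");
    }
}
```

This only works while all the bootstrap statistics fit in memory, which
is the condition stated above.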
> 
> Documentation for EmpiricalDistribution gives two example applications
> of this interface - preparing data for drawing a histogram and provide
> methods to draw random numbers from such a distribution. I am
> wondering if making EmpiricalDistribution responsible for other tasks
> like handling bootstrap samples or even doing bootstrap would not
> make it too complicated to use.

That is why I asked the question.  What is going on is that to meet the 
needs of the second use case above, something like a variable kernel 
density estimator was developed.  This has many uses beyond generating 
random data.  Among these might be supporting inference based on large 
numbers of bootstrap samples.  Given that the implementation now requires 
two passes through the data, there is probably not much value to this 
approach using the current implementation.  What I wanted to verify is 
that the interface is adequate to support this kind of inference (and the 
other kinds of things that it might be used for).  I never intended to 
imply that EmpiricalDistribution would manipulate or generate bootstrap 
samples itself.

Phil



[math] Re: "Straw man" release plan

Posted by Piotr Kochański <pi...@uw.edu.pl>.
Mark R. Diggory wrote:

<cut/>
> 
> 
> I think maybe this should be returning the more generic 
> StatisticalSummary interface. If you are returning precalculated 
> results, you do not exactly want to expose the underlying implementation 
> to modification by the user.
> 
> StatisticalSummary[] sampleStats ...

That's a good idea
 

> If you're going to be preserving the original/bootstrap values in a 
> double[][], then the Standard "DescriptiveStatisticsImpl" could be used.
> 
> public interface FullStatisticalSummary {
> 	public abstract double getMean();
> 	public abstract double getVariance();
> 	public abstract double getStandardDeviation();
> 	public abstract double getMax();
> 	public abstract double getMin();
> 	public abstract long getN();
> 	public abstract double getSum();
> 	public abstract double getPercentile(double p);
> 	...
> }
> 
> or more simply,
> 
> public interface FullStatisticalSummary extends StatisticalSummary{
> 	public abstract double getPercentile(double p);
> 	...
> }

Exactly, but the point is that we have to preserve original/bootstrap
values and EmpiricalDistribution is not storing them - internally it keeps
data in the array of bins. As I understand it, this was the aim - we don't
have to keep the whole data set in order to get important information about
the empirical distribution.  If a data set is huge this is a true gain.
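
To make the "array of bins" idea concrete, here is a small sketch of a
binned summary built from a double[]. It is not the
EmpiricalDistributionImpl code; the class name, the equal-width binning
and the per-bin count and sum are assumptions chosen for illustration.
Note that it makes two passes over the data: one to find the range and
one to fill the bins.

```java
public class BinnedSummary {

    private final double min;
    private final double binWidth;
    private final long[] counts;   // number of values per bin
    private final double[] sums;   // sum of values per bin

    BinnedSummary(double[] data, int binCount) {
        if (data.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        // Pass 1: find the range of the data.
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double x : data) {
            lo = Math.min(lo, x);
            hi = Math.max(hi, x);
        }
        this.min = lo;
        this.binWidth = (hi - lo) / binCount;
        this.counts = new long[binCount];
        this.sums = new double[binCount];

        // Pass 2: accumulate per-bin statistics; raw values are not kept.
        for (double x : data) {
            int bin = binWidth > 0 ? (int) ((x - min) / binWidth) : 0;
            if (bin >= binCount) {
                bin = binCount - 1;   // the maximum belongs to the last bin
            }
            counts[bin]++;
            sums[bin] += x;
        }
    }

    long getCount(int bin) {
        return counts[bin];
    }

    /** Mean of the values in a bin, or NaN if the bin is empty. */
    double getBinMean(int bin) {
        return counts[bin] == 0 ? Double.NaN : sums[bin] / counts[bin];
    }

    long getN() {
        long n = 0;
        for (long c : counts) {
            n += c;
        }
        return n;
    }

    public static void main(String[] args) {
        double[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        BinnedSummary summary = new BinnedSummary(data, 5);
        for (int i = 0; i < 5; i++) {
            System.out.println("bin " + i + ": n=" + summary.getCount(i)
                    + " mean=" + summary.getBinMean(i));
        }
    }
}
```

Memory use depends on the number of bins rather than on the number of
values, but the individual values cannot be recovered afterwards.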

If, on the other hand, I want to keep the whole data set then I can easily
use other tools to calculate any statistics I want so I don't need to use 
EmpiricalDistribution.

Documentation for EmpiricalDistribution gives two example applications
of this interface - preparing data for drawing a histogram and providing
methods to draw random numbers from such a distribution. I am
wondering if making EmpiricalDistribution responsible for other tasks
like handling bootstrap samples or even doing bootstrap would not
make it too complicated to use.

Piotr




Re: [math] Re: "Straw man" release plan

Posted by Phil Steitz <ph...@steitz.com>.
Mark R. Diggory wrote:
<snip/>

>>
>> where this array would contain SummaryStatistics calculated
>> for every sample.  SummaryStatistics getSampleStats() would
>> be changed as well.
> 
> 
> 
> I think maybe this should be returning the more generic 
> StatisticalSummary interface. If you are returning precalculated 
> results, you do not exactly want to expose the underlying implementation 
> to modification by the user.

I agree.  That is why StatisticalSummary was created.

Phil










Re: [math] Re: "Straw man" release plan

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.

Piotr Kochański wrote:
> Hello
> 
> Phil Steitz wrote:
> 
> 
>>Thinking about how this will eventually work, it has occurred to me that 
>>EmpiricalDistribution could be used to digest / represent bootstrap 
>>distributions.  Since we want the interface for EmpiricalDistribution to 
>>be complete for 1.0, we need to make sure that bootstrap data can be 
>>loaded into EmpiricalDistribution conveniently (if this makes sense), so I 
>>have been thinking about adding load() methods to EmpiricalDistribution 
>>that take double[] arrays and streams as values, as well as an addValue() 
>>method.  Does this make sense?  I would also appreciate any comments / 
>>patches on how to improve the EmpiricalDistribution interface or 
>>EmpiricalDistributionImpl.  If refactoring or even holding this from the 
>>release are in order, I want to make sure that we do it.
> 
> 
> As I understand it, load(double[][]) would compute the empirical distribution
> function for every bootstrapped sample (provided from some other source).
> Then, instead of having 
> 
> SummaryStatistics sampleStats
> 
> we should provide 
> 
> SummaryStatistics[] sampleStats
> 
> where this array would contain SummaryStatistics calculated
> for every sample.  SummaryStatistics getSampleStats() would
> be changed as well.


I think maybe this should be returning the more generic 
StatisticalSummary interface. If you are returning precalculated 
results, you do not exactly want to expose the underlying implementation 
to modification by the user.

StatisticalSummary[] sampleStats ...

> 
> Similarly other methods/objects in EmpiricalDistribution  
> would have to be modified (e.g. binStats would have to be 
> an array of ArrayLists, etc.).
> 
> Do I get your intentions right?
> 
> The zeroth row of every matrix could be reserved for original
> sample and the rest for bootstrapped results (if they can be
> calculated, i.e. samples are given). This can be achieved but
> some effort has to be made to make it simple to use for those
> who do not care about bootstrap and want to get results
> based only on the original sample. 
> 
> The other thing is that such an extension would be useful
> only as long as we work with bootstrap algorithms that use
> statistics which are members of SummaryStatistics.
> 
> Often this is not the case (classic examples are the median and the
> trimmed mean, which are not among SummaryStatistics). Sometimes it is
> also necessary (or more comfortable) to operate on the raw bootstrap
> samples, not on the EDF calculated from those samples. In these two
> cases bootstrap embedded into EmpiricalDistribution would not
> be that useful.
> 

If you're going to be preserving the original/bootstrap values in a 
double[][], then the standard "DescriptiveStatisticsImpl" could be used.

public interface FullStatisticalSummary {
	public abstract double getMean();
	public abstract double getVariance();
	public abstract double getStandardDeviation();
	public abstract double getMax();
	public abstract double getMin();
	public abstract long getN();
	public abstract double getSum();
	public abstract double getPercentile(double p);
	...
}

or more simply,

public interface FullStatisticalSummary extends StatisticalSummary{
	public abstract double getPercentile(double p);
	...
}

which would then be implemented by DescriptiveStatistics.

If we return an interface that exposes the statistical analysis of those 
values, then an expanded interface that includes other available 
statistics could easily be added to the API.



> Two comments concerning EmpiricalDistribution 
> 1. Probably it would be nice to have load(double[]) method
> 2. Instead of
>    ArrayList getBinStats();
> there could be 
>    List getBinStats();
> 
> although I can't imagine a practical situation where a List other than
> ArrayList would be better.
> 
> Piotr
> 

-- 
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu
