Posted to dev@commons.apache.org by Piotr Kochański <pi...@uw.edu.pl> on 2004/02/10 10:25:18 UTC

[math][patch] Bug 26772 patches resubmitted

I hope this time everything will be ok.

By the way, is it possible to send patches as attachments through
Bugzilla?

Piotr

[math] Re: EmpiricalDistribution improvements

Posted by Piotr Kochański <pi...@uw.edu.pl>.
Phil Steitz wrote:

> The use case that I had in mind was repeated simulation runs using the 
> same source dataset -- for this it would be handy to be able to digest a 
> large dataset once and then reload just the digest (EDF) for subsequent 
> runs.
> 
> There is more data than that -- remember the bin stats, etc.  If we want 
> to do it in a platform-independent way, that will be interesting; 
> otherwise we could just serialize the whole mess using Java 
> serialization (hence the comment that maybe just implementing 
> Serializable is enough).

I was thinking about writing something intermediate to a file, between
a raw data file and the full-blown EDF information. The remaining, more
complicated things would then be recalculated on load. However, this is
not that interesting an approach for the application you have
mentioned.

The biggest problem I see is the format of the EDF file. We can invent
some format (not a big deal), but then we need to provide validation
and parsing of such a file in order to load the EDF in a safe and robust
way. This is no longer simple.
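
To make the validation cost concrete, here is a minimal sketch of a loader
for a home-grown text format. The format (one "value = probability" line per
point, with cumulative probabilities) and the class name EdfFileParser are
assumptions made for illustration; nothing like this exists in [math].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical loader for an assumed "value = probability" text format.
class EdfFileParser {

    // Parses the lines and rejects anything that is not a valid cumulative EDF.
    static double[][] parse(List<String> lines) {
        List<double[]> pairs = new ArrayList<>();
        double lastValue = Double.NEGATIVE_INFINITY;
        double lastProb = 0.0;
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            if (line.trim().isEmpty()) {
                continue; // blank lines are tolerated
            }
            String[] parts = line.split("=");
            if (parts.length != 2) {
                throw new IllegalArgumentException(
                    "line " + lineNo + ": expected 'value = probability'");
            }
            double value;
            double prob;
            try {
                value = Double.parseDouble(parts[0].trim());
                prob = Double.parseDouble(parts[1].trim());
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("line " + lineNo + ": not a number", e);
            }
            // Values must be finite and strictly increasing (this also rejects NaN).
            if (Double.isInfinite(value) || !(value > lastValue)) {
                throw new IllegalArgumentException("line " + lineNo + ": values must increase");
            }
            // Cumulative probabilities must be non-decreasing and at most 1.
            if (!(prob >= lastProb && prob <= 1.0)) {
                throw new IllegalArgumentException("line " + lineNo + ": bad probability");
            }
            pairs.add(new double[] {value, prob});
            lastValue = value;
            lastProb = prob;
        }
        if (pairs.isEmpty() || Math.abs(lastProb - 1.0) > 1e-9) {
            throw new IllegalArgumentException("EDF must be non-empty and end at probability 1");
        }
        return pairs.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] edf = parse(Arrays.asList("1.0 = 0.25", "2.0 = 0.75", "3.0 = 1.0"));
        System.out.println(edf.length + " points loaded");
    }
}
```

Even this small format needs several separate checks, and it does not yet
carry any per-bin statistics.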

The other solution is to use XML. Parsing and validation would then be
done by the XML parser, and we would only need to provide a proper schema.
This is nice, but the code starts to depend on an XML parser. I am not sure
whether this is a good idea for a library like math.

So maybe relying on serialization is enough, as you have suggested...
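
A rough sketch of what the serialization route could look like, using only
standard Java serialization. EdfDigest and its fields are hypothetical
stand-ins for the digested data; this is not the actual EmpiricalDistribution
class or its API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical digest of a large dataset: bin edges plus per-bin counts.
class EdfDigest implements Serializable {
    private static final long serialVersionUID = 1L;

    final double[] binUpperBounds; // upper edge of each bin
    final long[] binCounts;        // number of observations in each bin

    EdfDigest(double[] binUpperBounds, long[] binCounts) {
        this.binUpperBounds = binUpperBounds.clone();
        this.binCounts = binCounts.clone();
    }

    // Serializes the digest to bytes; a file stream would work the same way.
    static byte[] save(EdfDigest digest) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(digest);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restores a digest without touching the original dataset.
    static EdfDigest load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (EdfDigest) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        EdfDigest original = new EdfDigest(new double[] {1.0, 2.0}, new long[] {3L, 5L});
        EdfDigest restored = load(save(original));
        System.out.println("bins restored: " + restored.binCounts.length);
    }
}
```

The price is that the stored form is readable only from Java and is tied to
the class layout, so it is not platform-independent.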

> > As long as we test means or variances we can use a t-test or some variance
> > equality test (Levene's test). However, we need to choose a significance level
> > anyway, so there is still an arbitrary number (like the "tolerance" we have
> > now);
> > on the other hand, this number has a clear interpretation.
> 
> Yes, that is the problem.  I don't see how exactly we can correctly set 
> df for the t-test, for example, since the sampling distribution of the 
> "mean of EDF-generated values" is sort of an ugly beast that depends on 
> the number and dispersion of the original values as well as the 
> number of bins and the number of generated values.

Ugh, true. That's rather complicated. If we have bootstrap available
in the future, we can use it as a test tool in such situations.
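
A minimal sketch of how a bootstrap check might look, assuming a plain
percentile bootstrap of the mean over the original values. The class and
method names are invented for this example, and the resample count, alpha
and seed are arbitrary choices.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper: percentile bootstrap interval for the sample mean.
class BootstrapMeanCheck {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Resamples the data with replacement and returns {lower, upper} bounds
    // taken from the sorted resampled means.
    static double[] percentileInterval(double[] data, int resamples, double alpha, long seed) {
        if (data.length == 0 || resamples < 1 || !(alpha > 0.0 && alpha < 1.0)) {
            throw new IllegalArgumentException("need data, resamples >= 1 and 0 < alpha < 1");
        }
        Random rng = new Random(seed);
        double[] means = new double[resamples];
        double[] draw = new double[data.length];
        for (int b = 0; b < resamples; b++) {
            for (int i = 0; i < draw.length; i++) {
                draw[i] = data[rng.nextInt(data.length)];
            }
            means[b] = mean(draw);
        }
        Arrays.sort(means);
        int lo = (int) Math.floor((alpha / 2.0) * (resamples - 1));
        int hi = (int) Math.ceil((1.0 - alpha / 2.0) * (resamples - 1));
        return new double[] {means[lo], means[hi]};
    }

    public static void main(String[] args) {
        double[] original = {1.0, 2.0, 3.0, 4.0, 10.0};
        double[] interval = percentileInterval(original, 1000, 0.05, 42L);
        System.out.println("interval: " + interval[0] + " .. " + interval[1]);
    }
}
```

A test could then require the mean of the EDF-generated values to fall inside
this interval. The alpha is still a chosen number, but it has a clear
interpretation.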

Piotr

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] EmpiricalDistribution improvements

Posted by Phil Steitz <ph...@steitz.com>.
Piotr Kochański wrote:
> Phil Steitz wrote:
> 
> 
>>1. Either remove or implement the "not implemented yet" distribution 
>>persistence methods.  I am ambivalent on these, maybe just supporting 
>>serialization is enough.
> 
> 
> The question is whether it happens very often that we obtain data in the
> form of the EDF. This might be the case if data are pre-processed
> using a different application (or experimental equipment)...

The use case that I had in mind was repeated simulation runs using the 
same source dataset -- for this it would be handy to be able to digest a 
large dataset once and then reload just the digest (EDF) for subsequent 
runs.

> 
> I'm thinking about the best form in which EmpiricalDistribution can be
> saved,
> maybe saving pairs 
> observed_value_i = probability_i
> would do the job?

There is more data than that -- remember the bin stats, etc.  If we want 
to do it in a platform-independent way, that will be interesting; 
otherwise we could just serialize the whole mess using Java 
serialization (hence the comment that maybe just implementing 
Serializable is enough).

> 
> 
>>3. Develop some sort of rationale for the test tolerances.  This is an 
>>interesting mathstat problem.  I would ideally like to use statistical 
>>tests (like elsewhere in the random package), but it is not obvious what 
>>the right test or test parameters should be.
> 
> 
> As long as we test means or variances we can use a t-test or some variance
> equality test (Levene's test). However, we need to choose a significance level
> anyway, so there is still an arbitrary number (like the "tolerance" we have
> now);
> on the other hand, this number has a clear interpretation.

Yes, that is the problem.  I don't see how exactly we can correctly set 
df for the t-test, for example, since the sampling distribution of the 
"mean of EDF-generated values" is sort of an ugly beast that depends on 
the number and dispersion of the original values as well as the 
number of bins and the number of generated values.

Phil

> 
> Piotr






[math] EmpiricalDistribution improvements

Posted by Piotr Kochański <pi...@uw.edu.pl>.
Phil Steitz wrote:

> 1. Either remove or implement the "not implemented yet" distribution 
> persistence methods.  I am ambivalent on these, maybe just supporting 
> serialization is enough.

The question is whether it happens very often that we obtain data in the
form of the EDF. This might be the case if data are pre-processed
using a different application (or experimental equipment)...

I'm thinking about the best form in which EmpiricalDistribution can be
saved,
maybe saving pairs 
observed_value_i = probability_i
would do the job?
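
As a small illustration of what such pairs could contain, here is a sketch
that turns a raw sample into sorted (value, probability) pairs. It assumes
that probability_i means the cumulative proportion of observations less than
or equal to observed_value_i; the class name EdfPairs is made up for this
example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that builds "observed value -> cumulative probability" pairs.
class EdfPairs {

    // Returns distinct sample values in ascending order, each mapped to the
    // fraction of observations that are less than or equal to it.
    static TreeMap<Double, Double> cumulative(double[] sample) {
        if (sample.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        TreeMap<Double, Integer> counts = new TreeMap<>();
        for (double x : sample) {
            counts.merge(x, 1, Integer::sum);
        }
        TreeMap<Double, Double> edf = new TreeMap<>();
        int seen = 0;
        for (Map.Entry<Double, Integer> e : counts.entrySet()) {
            seen += e.getValue();
            edf.put(e.getKey(), (double) seen / sample.length);
        }
        return edf;
    }

    public static void main(String[] args) {
        // Print the pairs in the "observed_value_i = probability_i" form.
        for (Map.Entry<Double, Double> e : cumulative(new double[] {2.0, 1.0, 2.0, 3.0}).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```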

> 3. Develop some sort of rationale for the test tolerances.  This is an 
> interesting mathstat problem.  I would ideally like to use statistical 
> tests (like elsewhere in the random package), but it is not obvious what 
> the right test or test parameters should be.

As long as we test means or variances we can use a t-test or some variance
equality test (Levene's test). However, we need to choose a significance level
anyway, so there is still an arbitrary number (like the "tolerance" we have
now);
on the other hand, this number has a clear interpretation.
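
For reference, a sketch of the two-sample t statistic with unequal variances
(Welch's form) and its approximate degrees of freedom. It assumes independent
observations in both samples, which may not hold for values generated from an
EDF. The class name WelchT is invented for this example, and the critical
value to compare against still has to come from the chosen significance level.

```java
// Hypothetical helper: Welch's t statistic and Welch-Satterthwaite degrees of freedom.
class WelchT {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) {
            ss += (v - m) * (v - m);
        }
        return ss / (x.length - 1);
    }

    // Returns {t, df}. Assumes at least one sample has non-zero variance.
    static double[] statisticAndDf(double[] a, double[] b) {
        if (a.length < 2 || b.length < 2) {
            throw new IllegalArgumentException("each sample needs at least 2 values");
        }
        double va = variance(a) / a.length; // squared standard error of mean(a)
        double vb = variance(b) / b.length; // squared standard error of mean(b)
        double t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
        double df = (va + vb) * (va + vb)
            / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
        return new double[] {t, df};
    }

    public static void main(String[] args) {
        double[] r = statisticAndDf(new double[] {1, 2, 3, 4, 5}, new double[] {2, 3, 4, 5, 6});
        System.out.println("t = " + r[0] + ", df = " + r[1]);
    }
}
```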

Piotr




Re: [math][patch] Bug 26772 patches resubmitted

Posted by Phil Steitz <ph...@steitz.com>.
Piotr,

Thanks, I applied the patches.  Here are some additional things that we 
should clean up in EmpiricalDistribution, if you (or anyone else :-) have 
more time and are interested in these things.

1. Either remove or implement the "not implemented yet" distribution 
persistence methods.  I am ambivalent on these, maybe just supporting 
serialization is enough.

2. Refactor the tests to test the double- and file-based loads more neatly.

3. Develop some sort of rationale for the test tolerances.  This is an 
interesting mathstat problem.  I would ideally like to use statistical 
tests (like elsewhere in the random package), but it is not obvious what 
the right test or test parameters should be.

Phil





Re: [math][patch] Bug 26772 patches resubmitted

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Best to attach them as separate files. - Mark

Piotr Kochański wrote:
> I hope this time everything will be ok.
> 
> By the way, is it possible to send patches as attachments through
> Bugzilla?
> 
> Piotr
> 

-- 
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu
