Posted to dev@commons.apache.org by Phil Steitz <ph...@steitz.com> on 2003/05/23 15:29:38 UTC

[math] log representation of sums was: Re: [math] Priorities, help needed

Brent Worden wrote:
> The exp(log()) technique is almost always more numerically accurate because
> log() significantly reduces the magnitude of large numbers, reducing
> the chance of significant-digit loss when adding.

Can you provide an (ideally web) reference confirming this (ideally 
showing full interval analysis and taking into account the way the JVM 
actually does the multiplication)?  I recall the same thing, but I have 
not been able to locate a reference confirming that the error introduced 
by the log approximation (and exponentiation) is on average 
significantly less.

> Also, since the log()
> values are smaller there is less chance of numerical overflow with large
> data sets.
> 
This is obvious, but I am not so sure how practically important it is 
given the size of Double.MAX_VALUE.
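To make the overflow point concrete, here is a small standalone sketch 
(illustrative only -- not the Univariate code) contrasting a product-based 
geometric mean with the exp(mean-of-logs) form:

```java
// Illustrative comparison of two geometric mean formulations.
public class GeometricMeanDemo {

    // Direct form: multiply the values, then take the n-th root.
    static double viaProduct(double[] v) {
        double product = 1.0;
        for (double x : v) {
            product *= x;                      // can overflow for large data sets
        }
        return Math.pow(product, 1.0 / v.length);
    }

    // Log form: average the logs, then exponentiate.
    static double viaLogs(double[] v) {
        double sumLogs = 0.0;
        for (double x : v) {
            sumLogs += Math.log(x);            // magnitudes stay small
        }
        return Math.exp(sumLogs / v.length);
    }

    public static void main(String[] args) {
        double[] small = {2.0, 8.0};           // geometric mean is 4.0
        System.out.println(viaProduct(small));
        System.out.println(viaLogs(small));

        // 400 values of 1e300: the running product overflows to Infinity
        // almost immediately, while the running sum of logs stays near
        // 400 * 690.8.
        double[] big = new double[400];
        java.util.Arrays.fill(big, 1e300);
        System.out.println(viaProduct(big));   // Infinity
        System.out.println(viaLogs(big));      // ~1e300
    }
}
```

With values near Double.MAX_VALUE the product form saturates after a couple 
of multiplications; for typically sized data the point is, as noted above, 
mostly theoretical.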

> Another approach I've used to reduce numerical overflow and numerical error
> is by employing induction (e.g. recursive) formulas.  For instance, the mean
> of n numbers can easily be computed from the mean of the first n-1 numbers
> and the n-th number.  Knuth's "The Art of Computer Programming" describes this
> formula and attests to its numerical accuracy.  I think one could derive an
> induction formula for geometric mean that would exhibit similar numerical
> accuracy.
> 
Here again, a web reference and implementation would be nice.
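For what it's worth, the induction formula for the mean is 
m_k = m_(k-1) + (x_k - m_(k-1)) / k (Knuth, TAOCP Vol. 2).  A minimal 
sketch, with illustrative names (not a proposed API):

```java
// Incremental (induction) mean after Knuth, TAOCP Vol. 2:
//   m_k = m_{k-1} + (x_k - m_{k-1}) / k
// No large running sum is kept, which reduces both overflow risk
// and loss of significant digits when adding.
public class IncrementalMean {
    private double mean = 0.0;
    private long n = 0;

    public void addValue(double x) {
        n++;
        mean += (x - mean) / n;
    }

    public double getMean() {
        return (n == 0) ? Double.NaN : mean;
    }

    public static void main(String[] args) {
        IncrementalMean m = new IncrementalMean();
        for (double x : new double[] {1.0, 2.0, 3.0, 4.0}) {
            m.addValue(x);
        }
        System.out.println(m.getMean()); // 2.5
    }
}
```

Applying the same update to the logs of the values would give an induction 
formula for the geometric mean.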

One more thing.  Before deciding to change implementation, it would be 
nice to run some benchmarks (or get some definitive references) to see 
what the performance difference will be. I suspect that the sum of logs 
approach may actually be slower, but I have no idea by how much.

Phil

> Brent Worden
> http://www.brent.worden.org
> 


> "Mark R. Diggory" <md...@latte.harvard.edu> wrote in message
> news:3ECC037E.6000304@latte.harvard.edu...
> 
>>Phil Steitz wrote:
>>
>>
>>>Yes.  The computation is easy.  The question is whether it is a)
>>>more efficient and/or b) more accurate.  That is what we
>>>need to find out.
>>>
>>>
>>
>>Details I've seen thus far describe it as more efficient to use the log
>>approach. I think your points about accuracy and rounding are strong as
>>well.
>>
>>It really depends on the java.lang.Math implementations of log() and exp().
>>Both methods delegate to native C implementations deep in the JVM.
>>
>>
>>Just some links through Google
>>http://mathforum.org/library/drmath/view/52804.html
>>http://www.buzzardsbay.org/geomean.htm
>>http://www.imsa.edu/edu/math/journal/volume3/articles/AlgebraAverages.pdf
>>
>>-Mark
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
> 
> 






Re: [math] log representation of sums was: Re: [math] Priorities, help needed

Posted by Phil Steitz <st...@yahoo.com>.
--- Phil Steitz <ph...@steitz.com> wrote:
> One more thing.  Before deciding to change implementation, it would be
> nice to run some benchmarks (or get some definitive references) to see
> what the performance difference will be. I suspect that the sum of logs
> approach may actually be slower, but I have no idea by how much.

Another thing that we need to keep in mind is that using
running sums of logs to represent products will require
special handling for zeros in the data.  This could get quite
messy in the "rolling" implementations.
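One way the special handling could look, purely as a sketch with 
illustrative names: track a zero count separately rather than adding 
log(0) = -Infinity into the running sum.

```java
// Sketch of zero handling for a geometric mean kept as a running
// sum of logs.  Math.log(0) is -Infinity and would poison the sum,
// so zeros are counted separately.
public class LogSumGeometricMean {
    private double sumLogs = 0.0;
    private long n = 0;
    private long zeroCount = 0;

    public void addValue(double x) {
        n++;
        if (x == 0.0) {
            zeroCount++;               // excluded from the log sum
        } else {
            sumLogs += Math.log(x);    // NaN for x < 0, as with the product form
        }
    }

    public double getGeometricMean() {
        if (n == 0) {
            return Double.NaN;
        }
        if (zeroCount > 0) {
            return 0.0;                // any zero makes the product zero
        }
        return Math.exp(sumLogs / n);
    }

    public static void main(String[] args) {
        LogSumGeometricMean g = new LogSumGeometricMean();
        g.addValue(2.0);
        g.addValue(8.0);
        System.out.println(g.getGeometricMean()); // ~4.0
        g.addValue(0.0);
        System.out.println(g.getGeometricMean()); // 0.0
    }
}
```

A rolling version would also need to decrement the zero count as zeros 
leave the window, which is exactly where the bookkeeping gets messy.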

Phil



Re: [math] log representation of sums was: Re: [math] Priorities, help needed

Posted by Phil Steitz <ph...@steitz.com>.
Mark R. Diggory wrote:
> Phil Steitz wrote:
> 
>> Brent Worden wrote:
>>
>>> Agreed.  I would like to add that I think we're a little overly 
>>> concerned
>>> about the actual implementation of the algorithm.  In these early 
>>> stages of
>>> the project, I think it's wiser to spend time discussing the evolving 
>>> design
>>> and API.  In the end, that is how people will judge the value of this
>>> project.  People will care far less about how rock-solid the
>>> geometric mean algorithm is than about how many features it provides
>>> and how easy it is to use.
>>
>>
>>
>> I could not agree more.  I have been using (and sharing) the original, 
>> no-storage, no-rolling version of Univariate for a couple of years now 
>> and have found it to be simple, lightweight and easy to use.  That is 
>> why I contributed it.  The only thing that I think we really need to 
>> worry about as we get the initial release together is that we 
>> carefully document the interfaces and the contracts -- otherwise the 
>> stuff will not be usable -- and maintain implementation quality.  We 
>> should try to avoid stupid things and really bad numerical algorithms, 
>> but I agree that our focus should be on getting basic, easy to use, 
>> frequently demanded functionality into the package.  Regarding 
>> Univariate in particular, my feeling is that the most important things 
>> to get in there are percentiles and confidence intervals.  These are 
>> what people actually use (beyond the arithmetic mean and variance).
>>
>> Have you looked at the task list here:
>> http://jakarta.apache.org/commons/sandbox/math/tasks.html?
>>
>> Do you have a) comments on these / alternative suggestions  b) code to 
>> contribute or c) time to spend helping with implementation?
> 
> 
> I'm concerned it's starting to get difficult to see such clear interfaces 
> with all the code piled up in one package. Refactoring is relatively 
> easy at this stage. I want to suggest we begin to isolate different 
> functionalities in separate packages for clarity's sake.

We certainly need to deal with this "soon". I posted a similar 
decomposition a while back.  Robert suggested that we wait until we had 
assembled more material.  I think it might be best to wait just a bit 
longer, since I think we all agree that we need to discuss scope and 
where we end up there will impact what the natural package structure is.

> 
> One possibility is:
> 
> *org.apache.commons.math.random*
> 
> EmpiricalDistribution
> EmpiricalDistributionImpl
> RandomData
> RandomDataImpl
> ValueServer
> 
> *org.apache.commons.math.la*
> 
> RealMatrix
> RealMatrixImpl
> 
> *org.apache.commons.math.util*
> 
> ContractableDoubleArray
> ExpandableDoubleArray
> FixedDoubleArray
> DoubleArray
> 
> *org.apache.commons.math.stat*
> 
> TestStatistic
> TestStatisticImpl
> Freq
> Univariate
> UnivariateImpl
> ListUnivariateImpl
> AbstractStoreUnivariate
> StoreUnivariate
> 
> 
> The idea is similar in nature to the SAX or DOM APIs. Maybe we can 
> establish a set of interfaces/factories for these implementations. Maybe 
> there are questions about having the "Impl" vs. having a factory approach 
> to object instantiation. I'm not sure that there would be enough 
> "Implementations" to support an API/spec with Factory-based instantiation.

In each of the cases where interfaces have been abstracted, I think that 
there likely will be multiple implementations and in fact one of the 
advantages of commons-math should be extensibility.  I don't much like 
the "Impl" names. They should probably "soon" be changed to be 
meaningful.  For example (see more below) "RandomDataImpl" should 
probably be called something like "JDKRandomData" and "UnivariateImpl" 
should be called something like "StreamUnivariate" or 
"RollingUnivariate".  "RealMatrixImpl" should be something like 
"DoubleRealMatrix" (allowing "BigDecimalRealMatrix").  Here again, I 
would hold off just a bit longer before jumping into this.

> 
> I do have some concerns about the Random library and Random Number 
> Generation/Distributions.
> 
> 1.) The JDK provides for "pluggability" behind its random number 
> generator, so you can plug different implementations in behind it; 
> ideally this should be taken advantage of to provide different 
> methods of random number generation. This is probably one 
> limitation of the CERN random generation libraries.
> 
I thought about this and it is addressed (sort of) in two ways in the 
current setup.  First, abstracting the RandomData interface enables 
virtually any kind of implementation to be "plugged in". Second, the 
setSecureAlgorithm method of RandomDataImpl allows the underlying 
algorithm and provider for the "secure" methods to be reset.  The basic 
interface and the JDK-based implementation were designed to be simple 
and easy to use, while supporting some simple, generally useful 
extensions of what comes out of the box from java.util.Random: 
reseeding, generation of exponential and Poisson deviates, and 
generation of uniform and normal values within specified ranges.
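For reference, exponential deviates can be produced from any uniform 
source with the standard inverse-transform method; this sketch is 
illustrative only and not necessarily how RandomDataImpl does it 
internally:

```java
import java.util.Random;

// Exponential deviates via the inverse CDF (inverse transform) method:
// if U ~ Uniform(0,1), then -mean * ln(1 - U) ~ Exponential(mean).
// Textbook technique; the actual RandomDataImpl algorithm may differ.
public class ExponentialDeviates {
    private final Random random;

    public ExponentialDeviates(long seed) {
        this.random = new Random(seed); // reseedable; java.util.Random is pluggable
    }

    public double nextExponential(double mean) {
        double u = random.nextDouble();     // in [0, 1)
        return -mean * Math.log(1.0 - u);   // 1 - u is in (0, 1], so the log is finite
    }

    public static void main(String[] args) {
        ExponentialDeviates gen = new ExponentialDeviates(42L);
        double sum = 0.0;
        int n = 100000;
        for (int i = 0; i < n; i++) {
            sum += gen.nextExponential(5.0);
        }
        // The sample mean should be close to 5.0 for a large sample.
        System.out.println("sample mean = " + sum / n);
    }
}
```

(Poisson deviates can then be generated by counting how many unit-mean 
exponential inter-arrival times fit into an interval of length lambda.)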

The core problem here -- and below -- is what scope we are aiming at. 
In the proposal, I suggested the following scope:

The Math project shall create and maintain a library of lightweight, 
self-contained mathematics and statistics components addressing the most 
common practical problems not immediately available in the Java 
programming language or commons-lang. The guiding principles for 
commons-math will be:

1. Real-world application use cases determine priority
2. Emphasis on small, easily integrated components rather than large
    libraries with complex dependencies
3. All algorithms are fully documented and follow generally accepted
    best practices
4. In situations where multiple standard algorithms exist, use the
    Strategy pattern to support multiple implementations
5. No external dependencies beyond Commons components and the JDK

This means that we need to keep asking ourselves the question "are we 
meeting a simple application need with a lightweight component that is 
easy to use?"  Personally, I would say that the current RandomData 
interface and the JDK-based implementation satisfy this.  Of course, as 
always, I may be wrong.


> 2.) The Distribution library at CERN has a somewhat successful layout, 
> but I have some problems with it in terms of not being very "Bean-like": 
> parameters often lack getters/setters that are easy to access via a 
> bean-like interface.
> 
> http://hoschek.home.cern.ch/hoschek/colt/V1.0.3/doc/cern/jet/random/package-summary.html 
> 
> 
> 
> Finally, I feel a little weird about replicating a lot of the 
> functionality of the CERN library given that it is still in production. 
> It would be stupid to overlook the effort Wolfgang Hoschek has put into 
> building a solid LGPL'ed open source mathematics library. I fear in some 
> ways we will only end up "replicating" his and others' efforts here. I 
> wonder if Hoschek would have any interest in "standardization" of his 
> packages. Apache could work in his favor if he were interested in 
> allowing his code base to be further maintained and developed here. 
> Inviting community participation would open the code up to further 
> development, enhancement, and refactoring to improve the library's 
> infrastructure and save the replication of development. Maybe we should 
> consider contacting him at CERN to get his opinion on such an idea.
> 
> -Mark
> 

This hits a core issue that we need to think carefully about.  The same 
type of thing could be said regarding several other general-purpose math 
or stat libraries.  My personal opinion is that commons-math should 
*not* aim to become a "universal math library" with anything like the 
scope of Colt, JADE, VisualNumerics or any of the excellent libraries 
out there.  Our aim should be to provide a nicely designed and 
documented collection of simple utilities that save developers time and 
licensing pain -- similar to the other commons components.  If we end up 
"duplicating" functionality that exists elsewhere, I do not personally 
see this as a terrible outcome.  I see the ability to discuss and 
implement simple Java interfaces as a real advantage that we will get by 
some limited "re-invention".

In an early draft of the proposal, I had a guiding principle that said 
that each submission should be accompanied by (and evaluated according 
to) real-world application use cases.  I think that it would be a good 
idea to at least informally adhere to this.  So, for example, instead of 
just adding a large library of statistical routines, we would need to 
explain how each of the things to be added are widely used and how the 
design supports ease of use and integration.

Two of the things that I have submitted require some justification, 
which I will add here and if we do not agree, I will be OK with dropping 
them.

EmpiricalDistribution, EmpiricalDistributionImpl

This is useful in simulation or stub-based testing and in generating 
data for histograms.  Specifically, when things like service latencies 
or inter-arrival times are known to follow funny distributions and 
simulations need to generate values "like" those observed in production, 
they are sort of SOL unless they have something like this.  I know of no 
other open source component that provides the ability to generate data 
from an empirical distribution.  Admittedly, this stuff is an order of 
magnitude less in demand than RandomData or Univariate, but it does have 
real practical use, which may grow as testing, simulation and QOS become 
more important to developers.

ValueServer

This is a wrapper that combines EmpiricalDistribution, RandomData and 
the ability to "replay" data from a file directly so that simulation or 
stub-based testing applications can generate values in any of the 
supported modes.  Like EmpiricalDistribution, the main use is for 
stub-based testing and simulation.  What I mean by "stub-based testing" 
is load or functional testing with some or all back end service 
providers replaced by stubs that return canned responses.  The 
ValueServer can be used by the stubs to make them simulate 
production-like latency variation.



These are good questions.  We need to keep asking them.

Phil

> 
> 






Re: [math] log representation of sums was: Re: [math] Priorities, help needed

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Phil Steitz wrote:
> Brent Worden wrote:
>> Agreed.  I would like to add that I think we're a little overly concerned
>> about the actual implementation of the algorithm.  In these early 
>> stages of
>> the project, I think it's wiser to spend time discussing the evolving 
>> design
>> and API.  In the end, that is how people will judge the value of this
>> project.  People will care far less about how rock-solid the geometric
>> mean algorithm is than about how many features it provides and how easy
>> it is to use.
> 
> 
> I could not agree more.  I have been using (and sharing) the original, 
> no-storage, no-rolling version of Univariate for a couple of years now 
> and have found it to be simple, lightweight and easy to use.  That is 
> why I contributed it.  The only thing that I think we really need to 
> worry about as we get the initial release together is that we carefully 
> document the interfaces and the contracts -- otherwise the stuff will 
> not be usable -- and maintain implementation quality.  We should try to 
> avoid stupid things and really bad numerical algorithms, but I agree 
> that our focus should be on getting basic, easy to use, frequently 
> demanded functionality into the package.  Regarding Univariate in 
> particular, my feeling is that the most important things to get in there 
> are percentiles and confidence intervals.  These are what people 
> actually use (beyond the arithmetic mean and variance).
> 
> Have you looked at the task list here:
> http://jakarta.apache.org/commons/sandbox/math/tasks.html?
> 
> Do you have a) comments on these / alternative suggestions  b) code to 
> contribute or c) time to spend helping with implementation?

I'm concerned it's starting to get difficult to see such clear interfaces 
with all the code piled up in one package. Refactoring is relatively 
easy at this stage. I want to suggest we begin to isolate different 
functionalities in separate packages for clarity's sake.

One possibility is:

*org.apache.commons.math.random*

EmpiricalDistribution
EmpiricalDistributionImpl
RandomData
RandomDataImpl
ValueServer

*org.apache.commons.math.la*

RealMatrix
RealMatrixImpl

*org.apache.commons.math.util*

ContractableDoubleArray
ExpandableDoubleArray
FixedDoubleArray
DoubleArray

*org.apache.commons.math.stat*

TestStatistic
TestStatisticImpl
Freq
Univariate
UnivariateImpl
ListUnivariateImpl
AbstractStoreUnivariate
StoreUnivariate


The idea is similar in nature to the SAX or DOM APIs. Maybe we can 
establish a set of interfaces/factories for these implementations. Maybe 
there are questions about having the "Impl" vs. having a factory approach 
to object instantiation. I'm not sure that there would be enough 
"Implementations" to support an API/spec with Factory-based instantiation.

I do have some concerns about the Random library and Random Number 
Generation/Distributions.

1.) The JDK provides for "pluggability" behind its random number 
generator, so you can plug different implementations in behind it; 
ideally this should be taken advantage of to provide different 
methods of random number generation. This is probably one 
limitation of the CERN random generation libraries.

2.) The Distribution library at CERN has a somewhat successful layout, 
but I have some problems with it in terms of not being very "Bean-like": 
parameters often lack getters/setters that are easy to access via a 
bean-like interface.

http://hoschek.home.cern.ch/hoschek/colt/V1.0.3/doc/cern/jet/random/package-summary.html


Finally, I feel a little weird about replicating a lot of the 
functionality of the CERN library given that it is still in production. 
It would be stupid to overlook the effort Wolfgang Hoschek has put into 
building a solid LGPL'ed open source mathematics library. I fear in some 
ways we will only end up "replicating" his and others' efforts here. I 
wonder if Hoschek would have any interest in "standardization" of his 
packages. Apache could work in his favor if he were interested in 
allowing his code base to be further maintained and developed here. 
Inviting community participation would open the code up to further 
development, enhancement, and refactoring to improve the library's 
infrastructure and save the replication of development. Maybe we should 
consider contacting him at CERN to get his opinion on such an idea.

-Mark




Re: [math] log representation of sums was: Re: [math] Priorities, help needed

Posted by Phil Steitz <ph...@steitz.com>.
Brent Worden wrote:
>>For starters, I was going to implement the t-test statistic and submit it
>>for addition.
>>
>>Brent Worden
> 
> 
> I've finished adding the one-sample t-test statistic to TestStatistic.
> Incorporate it if you'd like.  It leverages UnivariateImpl to avoid
> duplicating the summary statistic computations.
> 
> I've attached the updated source files for review.  Is there a different
> way, other than the mailing list, that I should be submitting these changes?
> Please let me know.
> 
I usually submit patches (cvs diffs if they are patches to existing 
stuff, plain text files if new classes) as attachments to bug reports in 
Bugzilla.  That makes tracking a little easier and does not clutter 
peoples' inboxes with patches.

http://issues.apache.org/bugzilla/enter_bug.cgi?product=Commons

Select sandbox and put [math] in front of the description.

I did not seem to get the attachment to this message.  You may want to 
go ahead and submit it to Bugzilla.

> 
> Along with the t-test statistic, the one task item, "t-test statistic
> needs to be added and we should probably add the capability of actually
> performing t- and chi-square tests at fixed significance levels
> (.1, .05, .01, .001)", calls for the computation of p-values.  If no one
> else has started implementing t and chi-square distributions, I wouldn't
> mind doing it.

Please do. Thanks!!

> 
> Brent Worden
> http://www.brent.worden.org
> 
> 
> 






RE: [math] log representation of sums was: Re: [math] Priorities, help needed

Posted by Brent Worden <br...@worden.org>.
> For starters, I was going to implement the t-test statistic and submit it
> for addition.
>
> Brent Worden

I've finished adding the one-sample t-test statistic to TestStatistic.
Incorporate it if you'd like.  It leverages UnivariateImpl to avoid
duplicating the summary statistic computations.

I've attached the updated source files for review.  Is there a different
way, other than the mailing list, that I should be submitting these changes?
Please let me know.


Along with the t-test statistic, the one task item, "t-test statistic
needs to be added and we should probably add the capability of actually
performing t- and chi-square tests at fixed significance levels
(.1, .05, .01, .001)", calls for the computation of p-values.  If no one
else has started implementing t and chi-square distributions, I wouldn't
mind doing it.
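For context, the one-sample statistic in question is 
t = (xbar - mu0) / (s / sqrt(n)), where s is the sample standard 
deviation.  A self-contained sketch -- not the actual TestStatistic 
patch, which was attached separately:

```java
// One-sample t statistic: t = (xbar - mu0) / (s / sqrt(n)),
// where s is the sample standard deviation (n - 1 denominator).
// Illustrative only -- not the TestStatistic patch from the list.
public class OneSampleT {
    static double t(double[] x, double mu0) {
        int n = x.length;
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;
        double ss = 0.0;
        for (double v : x) ss += (v - mean) * (v - mean);
        double s = Math.sqrt(ss / (n - 1));
        return (mean - mu0) / (s / Math.sqrt(n));
    }

    public static void main(String[] args) {
        double[] sample = {5.0, 6.0, 7.0, 8.0, 9.0};
        // mean = 7, s = sqrt(2.5), so t = 2 / (sqrt(2.5) / sqrt(5)) = 2 * sqrt(2)
        System.out.println("t = " + t(sample, 5.0));
    }
}
```

Turning t into a p-value is where the t distribution implementation 
mentioned above comes in.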

Brent Worden
http://www.brent.worden.org


RE: [math] log representation of sums was: Re: [math] Priorities, help needed

Posted by Brent Worden <br...@worden.org>.
>
> Have you looked at the task list here:
> http://jakarta.apache.org/commons/sandbox/math/tasks.html?
>
> Do you have a) comments on these / alternative suggestions  b) code to
> contribute or c) time to spend helping with implementation?

I'm shocked you haven't heard of the world-renowned Numerics C++ Library! :-)
http://www.brent.worden.org/software/numericscpp.html

I had planned on converting a bunch of things from that beautiful body of
work into this project.

For starters, I was going to implement the t-test statistic and submit it
for addition.

Brent Worden
http://www.brent.worden.org




Re: [math] log representation of sums was: Re: [math] Priorities, help needed

Posted by Phil Steitz <ph...@steitz.com>.
Brent Worden wrote:
>>One more thing.  Before deciding to change implementation, it would be
>>nice to run some benchmarks (or get some definitive references) to see
>>what the performance difference will be. I suspect that the sum of logs
>>approach may actually be slower, but I have no idea by how much.
>>
>>Phil
> 
> 
> Agreed.  I would like to add that I think we're a little overly concerned
> about the actual implementation of the algorithm.  In these early stages of
> the project, I think it's wiser to spend time discussing the evolving design
> and API.  In the end, that is how people will judge the value of this
> project.  People will care far less about how rock-solid the geometric mean
> algorithm is than about how many features it provides and how easy it is
> to use.

I could not agree more.  I have been using (and sharing) the original, 
no-storage, no-rolling version of Univariate for a couple of years now 
and have found it to be simple, lightweight and easy to use.  That is 
why I contributed it.  The only thing that I think we really need to 
worry about as we get the initial release together is that we carefully 
document the interfaces and the contracts -- otherwise the stuff will 
not be usable -- and maintain implementation quality.  We should try to 
avoid stupid things and really bad numerical algorithms, but I agree 
that our focus should be on getting basic, easy to use, frequently 
demanded functionality into the package.  Regarding Univariate in 
particular, my feeling is that the most important things to get in there 
are percentiles and confidence intervals.  These are what people 
actually use (beyond the arithmetic mean and variance).

Have you looked at the task list here:
http://jakarta.apache.org/commons/sandbox/math/tasks.html?

Do you have a) comments on these / alternative suggestions  b) code to 
contribute or c) time to spend helping with implementation?

I am completing testing of a simple "one pass" bivariate regression 
implementation -- another lightweight thingy that I have found very 
useful as it has followed me around (through 5 languages) over the 
years. I was planning to circle back to the RealMatrix implementation 
next, but if you want to take a stab at that or anything else, please do.
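For the curious, the usual "one pass" formulation accumulates the sums of 
x, y, x*y, and x*x and solves the normal equations at the end.  A sketch 
with illustrative names (not the actual submission):

```java
// One-pass simple (bivariate) linear regression: accumulate sums of
// x, y, x*y, and x*x, then solve for slope and intercept.
// A sketch of the general technique, not the implementation under test.
public class OnePassRegression {
    private double sumX, sumY, sumXY, sumXX;
    private long n;

    public void addData(double x, double y) {
        sumX += x;
        sumY += y;
        sumXY += x * y;
        sumXX += x * x;
        n++;
    }

    public double getSlope() {
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    public double getIntercept() {
        return (sumY - getSlope() * sumX) / n;
    }

    public static void main(String[] args) {
        OnePassRegression r = new OnePassRegression();
        // Points on the line y = 2x + 1.
        r.addData(1.0, 3.0);
        r.addData(2.0, 5.0);
        r.addData(3.0, 7.0);
        System.out.println("slope = " + r.getSlope());         // 2.0
        System.out.println("intercept = " + r.getIntercept()); // 1.0
    }
}
```

(The textbook caveat applies: the raw-sums form can lose precision for 
data with a large mean, the same accuracy issue discussed earlier in this 
thread.)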

Obviously, any additional feedback that you have on what is already out 
there would be appreciated.


Phil

> 
> These discussions will eventually need to take place, but I don't think now
> is the time.  The geometric mean works now for almost all valid data sets and
> works well enough in terms of accuracy.  The only reason now to change it is
> if the Univariate implementation design changes, requiring rework of all the
> statistics.
> 
> Brent Worden
> http://www.brent.worden.org
> 
> 
> 






RE: [math] log representation of sums was: Re: [math] Priorities, help needed

Posted by Brent Worden <br...@worden.org>.
>
> One more thing.  Before deciding to change implementation, it would be
> nice to run some benchmarks (or get some definitive references) to see
> what the performance difference will be. I suspect that the sum of logs
> approach may actually be slower, but I have no idea by how much.
>
> Phil

Agreed.  I would like to add that I think we're a little overly concerned
about the actual implementation of the algorithm.  In these early stages of
the project, I think it's wiser to spend time discussing the evolving design
and API.  In the end, that is how people will judge the value of this
project.  People will care far less about how rock-solid the geometric mean
algorithm is than about how many features it provides and how easy it is
to use.

These discussions will eventually need to take place, but I don't think now
is the time.  The geometric mean works now for almost all valid data sets and
works well enough in terms of accuracy.  The only reason now to change it is
if the Univariate implementation design changes, requiring rework of all the
statistics.

Brent Worden
http://www.brent.worden.org

