Posted to dev@commons.apache.org by Phil Steitz <st...@yahoo.com> on 2003/07/06 02:56:02 UTC

[math] Recent commits to stat, util packages

I have a couple of problems with the recent commits to stat and util.

First, the testAddElementRolling test case in FixedDoubleArrayTest will not
compile, since it is trying to access what is now a private field in
FixedDoubleArray (internalArray). The changes to FixedDoubleArray should be
rolled back or the tests should be modified so that they compile and succeed.

Second, I do not see the value in all of the additional classes and overhead
introduced into stat. The goal of Univariate was to provide basic univariate
statistics via a simple interface and lightweight, numerically sound
implementation, consistent with the vision of commons-math and Jakarta Commons
in general. I fear that we may be straying off into statistical computation
framework-building, which I don't think belongs in commons-math (really Jakarta
Commons). More importantly, I don't think we need to add this complexity to
deliver the functionality that we are providing. The only problem that I see
with the structure prior to the recent commits is the confusion between the
collection and univariate addValue methods.  I would favor eliminating the
List and BeanList univariates altogether and replacing their functionality with
methods added to StatUtils that take Lists or Collections and property names as
input and compute statistics from them. Similarly, the Univariate interface
could be modified to include addValues(double[]), addValues(List) (assuming the
contents are Numbers), and addValues(Collection, propertyName).
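For illustration, the proposed StatUtils additions might look something like the sketch below. The signatures and the reflective getter lookup are my own illustration of the proposal, not a committed API.

```java
import java.lang.reflect.Method;
import java.util.Collection;
import java.util.List;

// Sketch of the proposed StatUtils additions. Method names and signatures
// are illustrative only, not a committed API.
class StatUtilsSketch {

    // mean of a raw double[] -- mirrors the existing StatUtils style
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // mean of a List whose contents are assumed to be Numbers
    static double mean(List<? extends Number> values) {
        double[] d = new double[values.size()];
        for (int i = 0; i < d.length; i++) {
            d[i] = values.get(i).doubleValue();
        }
        return mean(d);
    }

    // mean of a bean Collection, extracting a numeric property by name
    static double mean(Collection<?> beans, String propertyName) throws Exception {
        double[] d = new double[beans.size()];
        String getter = "get" + Character.toUpperCase(propertyName.charAt(0))
                + propertyName.substring(1);
        int i = 0;
        for (Object bean : beans) {
            Method m = bean.getClass().getMethod(getter);
            d[i++] = ((Number) m.invoke(bean)).doubleValue();
        }
        return mean(d);
    }
}
```

The point of the sketch is that the List and bean cases reduce to the double[] case, so no separate Univariate classes are needed for them.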

The checkin comment says that the new univariate framework is independent of
the existing implementations; but StatUtils has been modified to include
numerous static data members and to delegate computation to these.  This adds
significant overhead and I do not see the value in it.  The cost of the
additional stack operations/object creations is significant.  I ran tests
comparing the previous version that does direct computations using the double[]
arrays to the modified version and found an average of more than 6x slowdown
using the new implementation. I did not profile memory utilization, but that is
also a concern. Repeated tests computing the mean of 1000 doubles 100000
times using the old and new implementations averaged 1.5 and 10.2 seconds,
respectively. I do not see the need for all of this additional overhead. 
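A harness along these lines might look like the sketch below. The statistic interface is a stand-in I invented to model the delegating design, and timings will vary widely by JVM and hardware (a modern JIT may erase much of the difference measured on a 2003 VM), so the 6x figure should not be read out of this code.

```java
// A minimal timing harness in the spirit of the comparison above:
// computing the mean of 1000 doubles 100000 times, once with a direct
// double[] loop and once through a stand-in "statistic object" modeling
// the delegating design. All names and numbers here are illustrative.
class MeanBenchmark {

    interface MeanStatistic {          // stand-in for a functor-style API
        double evaluate(double[] values);
    }

    static double directMean(double[] values) {
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] data = new double[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        MeanStatistic delegate = MeanBenchmark::directMean;

        long t0 = System.nanoTime();
        double m1 = 0.0;
        for (int rep = 0; rep < 100000; rep++) {
            m1 = directMean(data);          // direct computation
        }
        long direct = System.nanoTime() - t0;

        t0 = System.nanoTime();
        double m2 = 0.0;
        for (int rep = 0; rep < 100000; rep++) {
            m2 = delegate.evaluate(data);   // through the statistic object
        }
        long delegated = System.nanoTime() - t0;

        System.out.println("direct:    " + direct / 1e9 + " s, mean=" + m1);
        System.out.println("delegated: " + delegated / 1e9 + " s, mean=" + m2);
    }
}
```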

I suggest that we postpone introduction of a statistical computation framework
until after the initial release, if needed.  In any case, I would like to keep
StatUtils and the core UnivariateImpl small, fast and lightweight, so I would
like to request that the changes to these classes be rolled back.

If others feel that this additional infrastructure is essential, then I just
need to be educated.  It is quite possible that I am thinking too narrowly in
terms of current scope and missing some looming structural problems.  If that
is the case, I am open to being educated. I just need to see a) exactly why we
need to add more complexity at this time and b) why it is necessary to break
univariate statistics into four packages and 17 classes when all we are
computing is basic statistics.  

Phil



__________________________________
Do you Yahoo!?
SBC Yahoo! DSL - Now only $29.95 per month!
http://sbc.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [math] Recent commits to stat, util packages

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Phil Steitz wrote:
> Brent Worden wrote
> 
>>
>> Another reason for this is performance.  The storageless Univariate
>> implementation is a compute all or nothing design.  Because of this, 
>> every
>> time a new statistic is added to the interface, the extra computation 
>> needed
>> to compute the new statistic slows down the computation of the existing
>> statistics.  
> 
> 
> That is why I wanted to keep the stats in the lightweight, one-pass, 
> storageless Univariate limited to the basics -- mean, variance, min, 
> max, sum and the things that can be derived from those (std. dev, conf 
> intervals, etc.)  I personally have many uses for such a lightweight, 
> no-storage implementation. Most of these uses are in simulation and 
> testing, so performance and overhead is actually a big concern.  Realize 
> that if you want to compute basic stats in one pass through a set of 
> data, you have to do the computation somewhere and if you want several 
> statistics, you are either going to have to pass the data multiple times 
> or compute the quantities needed for the statistics together as you pass 
> the data, which can usually be done more efficiently than computing them 
> individually. Farming the statistical computations out to lots of little 
> classes makes no sense to me. I understand, however, that I am in the 
> minority here, so I will stop complaining about this. It is time to move 
> on.
> 
> Phil
> 

To be open to critique myself (and to be fair to Phil's 
argument), I think the strongest point that can be made against what I 
have done with the individual stats (and only in terms of the 
storageless case) is that in some cases the same calculation is 
performed more than once (primarily in the case of moments).

Kurtosis --> requires 1st, 2nd, 3rd and 4th moments

Skew --> requires 1st, 2nd, and 3rd  moments

Variance --> requires 1st and 2nd moments

Mean --> requires 1st moment

Currently, with the class separation, this calculation occurs 
independently in each class. So Phil's concern here is a valid one in 
terms of some extra calculation occurring in the storageless case, but 
it is only in the storageless case that this occurs.

Still, with a little invention I think this could be easily worked 
around, especially with the functor model: either with separate 
extensions of the increment method that accept precalculated 
moments to use in the calculation, or with constructors that wire in the 
moment being used by the UnivariateStatistic, thus reducing the 
replication.

*Constructor approach to reusing moments.*
Mean mean = new Mean();
SecondMoment m2 = new SecondMoment(mean);
ThirdMoment m3 = new ThirdMoment(mean, m2);
FourthMoment m4 = new FourthMoment(mean, m2, m3);
Variance var = new Variance(m2);
Skew skew = new Skew(var, m3);
Kurt kurt = new Kurt(var, m4);

*Incremental approach to reusing moments.*
Mean mean = new Mean();
SecondMoment m2 = new SecondMoment();
ThirdMoment m3 = new ThirdMoment();
FourthMoment m4 = new FourthMoment();
Variance var = new Variance();
Skew skew = new Skew();
Kurt kurt = new Kurt();

mean.increment(d);
m2.increment(d, mean);
m3.increment(d, mean, m2);
m4.increment(d, mean, m2, m3);

var.increment(d, m2);
skew.increment(d, m3);
kurt.increment(d, m4);

But these extra methods would not end up in the StorelessUnivariate 
interface; you would only have them available when working directly with 
the classes.

-Mark




Re: [math] Recent commits to stat, util packages

Posted by Phil Steitz <ph...@steitz.com>.
Brent Worden wrote
> 
> Another reason for this is performance.  The storageless Univariate
> implementation is a compute all or nothing design.  Because of this, every
> time a new statistic is added to the interface, the extra computation needed
> to compute the new statistic slows down the computation of the existing
> statistics.  

That is why I wanted to keep the stats in the lightweight, one-pass, 
storageless Univariate limited to the basics -- mean, variance, min, 
max, sum and the things that can be derived from those (std. dev, conf 
intervals, etc.)  I personally have many uses for such a lightweight, 
no-storage implementation. Most of these uses are in simulation and 
testing, so performance and overhead is actually a big concern.  Realize 
that if you want to compute basic stats in one pass through a set of 
data, you have to do the computation somewhere and if you want several 
statistics, you are either going to have to pass the data multiple times 
or compute the quantities needed for the statistics together as you pass 
the data, which can usually be done more efficiently than computing them 
individually. Farming the statistical computations out to lots of little 
classes makes no sense to me. I understand, however, that I am in the 
minority here, so I will stop complaining about this. It is time to move on.
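The kind of combined one-pass computation described above can be sketched with Welford's recurrence, which maintains mean and variance together as each value arrives. This is an illustration of the technique, not the actual commons-math code:

```java
// One-pass, storageless mean and variance via Welford's recurrence --
// an illustration of computing several statistics together in a single
// pass through the data, without storing the values.
class OnePassStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    public void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // note: uses the updated mean
    }

    public double getMean()     { return mean; }
    public double getVariance() { return n > 1 ? m2 / (n - 1) : Double.NaN; }
    public long   getN()        { return n; }
}
```

Min, max and sum can be carried along in the same addValue call at negligible extra cost, which is the efficiency argument for computing the basic statistics together rather than farming each one out separately.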

Phil






Re: [math] Recent commits to stat, util packages

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Tim O'Brien wrote:

>On Mon, 2003-07-07 at 00:39, Mark R. Diggory wrote:
>
>>>Also, can we now discuss dropping author tags?
>>>
>>What is your concern? Did I leave some out? Or are you suggesting we 
>>drop them altogether?
>>
>
>It is unrelated to your commits, don't worry.  Read this:
>http://nagoya.apache.org/eyebrowse/ReadMsg?listName=community@apache.org&msgNo=1956
>
>In general, author tags seem inappropriate for commons-math at the code
>level.  Records of committer and contributor involved are in CVS and in
>the project members xdoc.  If one really wants to find out how active a
>committer is they can turn to statCVS.
>
>I was a skeptic, but some of the posts on that other (public archived)
>list, changed my mind.  
>
>Tim
I read the whole thread and I have to say it's true, as long as the 
committing of patches is done in accordance with the cvs template and 
the person who provided the code patch is identified in the comment 
appropriately. Otherwise, there's really only the contributors 
section to establish who has contributed. I guess I'm ok with their 
removal. I understand the tendency to get "territorial" about a section 
of code; I am so guilty of that. Ah, such struggles are just the walk 
down the path to enlightenment; detachment from material possessions 
will lead you to bliss. Besides, do I really want to get bothered by an 
email concerning this code some 15 years down the road when I'm old and 
senile? Oh wait, I forgot about compassion... ok, I promise I'll still 
answer emails in 15 years. At least by then we'll be able to send email 
via telepathic implants...

Can you tell I've been at work too long today?
Mark

-- 
Mark Diggory
Software Developer
Harvard MIT Data Center
http://www.hmdc.harvard.edu





Re: [math] Recent commits to stat, util packages

Posted by Tim O'Brien <to...@discursive.com>.
On Mon, 2003-07-07 at 00:39, Mark R. Diggory wrote:
> > Also, can we now discuss dropping author tags?
> > 
> 
> What is your concern? Did I leave some out? Or are you suggesting we 
> drop them altogether?
> 

It is unrelated to your commits, don't worry.  Read this:
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=community@apache.org&msgNo=1956

In general, author tags seem inappropriate for commons-math at the code
level.  Records of committer and contributor involved are in CVS and in
the project members xdoc.  If one really wants to find out how active a
committer is they can turn to statCVS.

I was a skeptic, but some of the posts on that other (publicly archived)
list changed my mind.  

Tim



-- 
-----------------------------------------------------	
Tim O'Brien - tobrien@discursive.com - (847) 863-7045




Re: [math] Recent commits to stat, util packages

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.

Tim O'Brien wrote:
> <snip/>  (I do a lot of snipping, don't I?  Save the ASF bandwidth
> money, and snip liberally.)
> 
> The statistic objects strike me as being very flexible, and I do agree with 
> Brent's assessment that the storageless Univariate implementation was an 
> "all or nothing" affair.
> 
> Mark, how do you propose to integrate these "functor-esque" objects with the
> existing Univariate implementations?    
> 

One approach is shown in my last attachments. These delegate to the 
Statistic methods as internal objects in the AbstractUnivariate and 
AbstractStoreUnivariate.


> Also, can we now discuss dropping author tags?
> 

What is your concern? Did I leave some out? Or are you suggesting we 
drop them altogether?




RE: [math] author tags was Re: recent commits to stat, util packages

Posted by Phil Steitz <st...@yahoo.com>.
> Also, can we now discuss dropping author tags?

+1 for me at least -- i.e., I am OK dropping all of my @author tags.  

Phil






RE: [math] Recent commits to stat, util packages

Posted by Tim O'Brien <to...@discursive.com>.
<snip/>  (I do a lot of snipping, don't I?  Save the ASF bandwidth
money, and snip liberally.)

The statistic objects strike me as being very flexible, and I do agree with 
Brent's assessment that the storageless Univariate implementation was an 
"all or nothing" affair.

Mark, how do you propose to integrate these "functor-esque" objects with the
existing Univariate implementations?    

Also, can we now discuss dropping author tags?

Tim



  





RE: [math] Recent commits to stat, util packages

Posted by Brent Worden <br...@worden.org>.
> Then we're back to dependency issues where now there's "another"
> interface that is restrictive and difficult to expand upon easily; it
> will be hard to add things to the library because everyone will be
> arguing about what should/shouldn't be in the interface, uughh. :-(
>
> I am becoming more and more against having these "generic" Univariate
> interfaces where a particular statistic is embodied in a "method". Every
> time someone comes up with a "new method" there will be debate about
> whether it should be in the interface or not, instead of it just being an
> additional class that can be added to the package. This is the benefit
> of a framework over monolithic interfaces.
>

Preach on brother Mark.

I too would argue that the Univariate interfaces need to change and take on
more of the look and feel provided by Mark and his statistic objects.

One reason for this is flexibility.  By forcing Univariate to determine the
statistics it can compute, we limit users as to when they can use
Univariate's features of windowed and storageless computation.  That is to
say, a user only gets the benefit of those features if the statistics they
need computed are defined by the interface.  Granted, we've taken some
effort to include the most commonly used statistics in the interface, but
others remain that a user might need but can't leverage Univariate to
compute.  By using statistic objects, like those supplied by Mark, and some
way to supply those objects with data via the Univariate objects,
we can eliminate that limitation of the current design.
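One sketch of what supplying statistic objects to a storing Univariate could look like follows. All interface, class and method names here are hypothetical, invented to illustrate the idea rather than describe the actual code under discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a storing Univariate that can apply an arbitrary
// statistic object to its data, so adding a new statistic never requires
// an interface change. All names are illustrative.
interface UnivariateStatistic {
    double evaluate(double[] values);
}

class StoreUnivariateSketch {
    private final List<Double> values = new ArrayList<>();

    void addValue(double v) {
        values.add(v);
    }

    // any statistic object can be supplied; this interface never grows
    double apply(UnivariateStatistic stat) {
        double[] d = new double[values.size()];
        for (int i = 0; i < d.length; i++) {
            d[i] = values.get(i);
        }
        return stat.evaluate(d);
    }
}
```

Under this shape, a user with a statistic not covered by the interface writes one new UnivariateStatistic implementation and passes it to apply, rather than waiting for the interface to grow.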

Another reason for this is performance.  The storageless Univariate
implementation is a compute-all-or-nothing design.  Because of this, every
time a new statistic is added to the interface, the extra computation needed
to compute the new statistic slows down the computation of the existing
statistics.  With the current design, if a new statistic is added, all
current users of Univariate are adversely affected, in terms of performance,
by the change, even though they don't rely on the new statistic.  For
instance, add Mark's auto-correlation coefficients and all of a sudden all
existing test statistics using Univariate have gotten slower.  That is a
clear sign of bad design.

Still another reason is maintenance.  With every new statistic added to
Univariate, besides the method added to support the statistic, numerous
other methods need to change to support the computation, most notably
addValue.  Each addition of a statistic makes that method more complex,
which in turn makes it harder to debug, harder to change, and harder to
understand.  Also, in changing existing, working code, one runs the risk
of introducing errors where none previously existed.  It would royally bite
to add auto-correlation coefficients and then have variance no
longer work.  In comparison, it's impossible to break existing code by
adding new code that isn't dependent on anything else, as is the case with
Mark's statistic objects.

Brent Worden
http://www.brent.worden.org




Re: [math] Recent commits to stat, util packages

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Ok, my next email will be much shorter than this (I promise I'll start 
<snip/>ing).

Phil Steitz wrote:
> --- "Mark R. Diggory" <md...@latte.harvard.edu> wrote:
> 
>>>Well, I for one would prefer to have the simple computational methods in
>>>one place.  I would support making the class require instantiation,
>>>however, i.e. making the methods non-static.
>>
>>Yes, but again it is a question of having big flat monolithic classes vs 
>>having extensible implementations that can easily be expanded on. I'm 
>>not particularly thrilled at the idea of being totally locked into such 
>>an interface like Univariate or StatUtils. It is just totally inflexible 
>>and there is always too much restriction and argument about what we want 
>>to put in it vs. not put in it.
> 
> 
> I think that it is a good idea to have these discussions and I don't understand
> what you mean by "inflexible".  

Inflexibility in that a "well designed" interface doesn't really need to 
grow or change over time; with an interface that grows as the 
project grows, there are always going to be a lot of growing pains (between 
developers, and in terms of organization of the code).

> In retrospect, we probably should have named
> StoreUnivariate "ExtendedUnivariate", since it really represents a statistical
> object supporting more statistics.  Univariate can always be extended --

This is the problem: every time we implement another stat, are we going 
to extend Univariate and create a whole new set of implementations just 
to support that stat?

> statistics can be added to the base interface as well as to the abstract and
> concrete classes that implement the base interface.  Some of these statistics
> can be based on computational methods in StatUtils.  If we eliminate the static
> methods in StatUtils, then we can make the computational strategies pluggable.
> 

This was my whole intention with the separate statistics. These 
eliminate the need to delegate to the static StatUtils and provide both 
pluggability and individual implementations, establishing an 
organized library with room for growth. If you want to use an individual 
implementation you can; if you want to use a facade, you can. The 
facades just delegate to the individual stats in the same fashion we 
currently have UnivariateImpl delegating to StatUtils.

> One more sort of philosophical point that makes me want to keep Univariates as
> objects with statistics as properties:  to me a Univariate is in fact a java
> bean.  Its state is the data that it is characterizing and its properties are
> the statistics describing these data.  

And this will still be the case, I'm just modularizing the Statistical 
Implementations so there's more room for alternate implementation and 
alternate usage.

> Univariates that support only a limited
> set of statistics don't have to hold all of the individual data values
> comprising their state internally.  

Extended Univariates require more overhead.

> It is natural, therefore, to define the extended statistics in an extended
> interface. 
> 
>>Yes, simple, but not very organized, and not as extensible as a 
>>framework like "solvers" is. You can implement any new "solver" we could 
>>desire right now without much complaint, but try to implement a new 
>>statistic and blam, all this argument starts up as to whether it's 
>>appropriate or not in the Univariate interface.
> 
> 
> You are confusing strategies with implementations. The rootfinding framework
> exists to support multiple strategies to do rootfinding, not to support
> arbitrary numerical methods. A better analogy would be to the distribution
> framework which supports creation of different probability distributions.  You
> could argue that a "statistic" is as natural an abstraction as a probability
> distribution.  I disagree with that.  There is lots of structure in a
> probability distribution, very little in a statistic from an abstract
> standpoint.

Ok, I do like your analogy better. And I agree that a statistic does not 
have as much "structure" as a probability distribution. But I disagree 
that this is grounds for rejecting my approach.
> 
>>There's not room for 
>>growth here! If I decide to go down the road and try to implement things 
>>like auto-correlation coefficients (which would be a logical addition 
>>someday) then I end up having to get permission just to "add" the 
>>implementation, whereas if there's a logical framework, there's more room 
>>for growth without stepping on each other's toes so much. This is very 
>>logical to me.
> 
> 
> I disagree. Extending a class or adding a method to an interface is no harder
> than adding a new class (actually easier). It seems ridiculous to me
> to add a new class for each univariate statistic that we want to
> support.

So far any time the Univariate interface is modified, it usually results 
in a disagreement from someone that the change was not appropriate, 
usually this is based on opinion and not the functional capabilities of 
the particular method. This is not easily extendable because any new 
experimental development is in a constant battle with the conservation 
of the interface.

Unfortunately we come from different schools; I have to admit that I 
will always find adding a class to be much easier than redefining an 
interface, on any day of the week.

> If the stats
> are going to be meaningfully integrated, they will have to be used/defined by
> the core univariate classes any way, unless your idea is to eliminate these and
> force users to think about statistics one at a time instead of as part of a
> univariate statistical summary. This may be the crux of our disagreement.  I
> see the statistics as natural properties of a set of data, not meaningful
> objects in their own right. 
> 

This is not my intent at all. I continue my defense of this statement 
below; just keep in mind, a statistic can be both a functional object in 
its own right and part of a "bean-like" interface. Simply look at 
Univariate delegating methods to the static methods in StatUtils: here 
statistical methods are both bean properties and "objects" of a sort. I 
just more clearly defined the "object" characteristics of the 
statistics. If you look back at the version of StatUtils I rolled back 
from, you can clearly see this dualistic state of methods as "objects" 
and that it does work well.

> I would like to propose the following compromise solution that allows the kind
> of flexibility that you want without breaking things apart as much.
> 
> 1. Rename StoreUnivariate to ExtendedUnivariate and change all other "Store"
> names to "Extended".   
> 

The naming is a trivial aspect of what is going on here.

> 2. Make the methods in StatUtils non-static. Continue to use these for basic
> computational methods shared by Univariate and ExtendedUnivariate
> implementations and for direct use by applications and elsewhere in
> commons-math. These methods do not have to be used by all Univariate
> implementation strategies.  
> 

This is exactly what I have accomplished in the UnivariateStatistic 
package I have developed. These classes can easily be delegated to from 
within UnivariateImpl, StoreUnivariateImpl, StatUtils or any other 
interface of your choosing.

> 3. Add addValues methods to Univariate that accept double[], List and
> Collection with property name and eliminate ListUnivariate and
> BeanListUnivariate.

This is the difficult point. The current examples act as "wrappers" 
around a specific Collection/data structure. The type of that structure 
is independent of the statistics; I don't see how adding methods that 
support a particular Object type will benefit us if the 
underlying data structure is not already capable of supporting Objects vs. 
double[]. This is where providing different implementations of 
Univariates that polymorphically support different internal data 
structures becomes critical.

I have examples of all the Univariates implemented to support both the 
UnivariateStatistic approach and the various internal 
Collection types, and to support the "Transformation" of the objects 
stored in these collections to double primitive values in such a way 
that the statistical implementations do not need modification to support 
so many different input types.
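Such a transformation hook might look something like the sketch below. The interface and class names are invented for illustration; the actual attachments under discussion may be shaped differently.

```java
import java.util.Collection;

// Hypothetical sketch of the "Transformation" idea: a pluggable converter
// from stored Objects to double primitives, so the statistic code only
// ever sees double[]. All names are illustrative.
interface NumberTransformerSketch {
    double transform(Object o);
}

class CollectionAdapter {
    // flatten any Collection to double[] using the supplied transformer
    static double[] toDoubleArray(Collection<?> data, NumberTransformerSketch t) {
        double[] d = new double[data.size()];
        int i = 0;
        for (Object o : data) {
            d[i++] = t.transform(o);
        }
        return d;
    }
}
```

With this in place, a Univariate wrapping a List of beans and one wrapping a List of Numbers differ only in the transformer they plug in, not in any statistical code.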

> 
> 4. Rename UnivariateImpl to SimpleUnivariate and add a UnivariateFactory with
> factory methods to create Simple, Extended  and whatever other sorts of
> Univariates we may define.
> 

I feel the same about factories as I do about renaming: I don't feel it's 
part of this topic.

> To add new statistics or computational strategies in this environment, we can
> a) add to the Univariate interface if we think that they are really basic -- I
> think that t-based confidence interval half-width for the mean is a basic stat
> that is now missing, for example b) add to the ExtendedUnivariate interface 

Here we are again. I really do believe that in an ideal design one 
should be able to add a particular statistical approach to the project 
without having to "modify" interfaces, and thus incur the cost of 
argument between conservative and experimental viewpoints. Besides, 
altering interfaces leads to problems down the road, with different 
versions having different methods in the interface. Imagine if you picked 
up JDK 1.5 and the Collection interface had been altered to remove a 
method. You would be very frustrated at having to rewrite all your 
current code. No, I don't think altering interfaces is a viable means of 
extensibility. That will create headaches for our users.

> c)
> extend an existing Univariate implementation to add the new statistic or d)
> create a new Univariate including the new statistic or computational strategy.

This is all somewhat messy; it doesn't lend itself well to organized 
extensibility. I'm trying to provide a solid framework for extending the 
statistical capabilities of the project without this constant interface 
expansion: (1) because it is not scalable, (2) because it limits 
development to the LCD (least common denominator) of what the group can 
actually agree upon, and (3) because without a framework and a few rules 
for implementation of a statistic, the resulting codebase will grow in a 
disorganized fashion.

Lastly, I do not see how having an instantiable version of StatUtils as a 
monolithic class of methods, with the Univariate and StoreUnivariate 
facades delegating to it, is of any benefit over having each Statistic 
implemented separately and having the methods in the Univariates delegate 
to the individual stat.

To show you the benefit of this approach I've attached my new 
AbstractStoreUnivariate and AbstractUnivariate implementations, which do 
delegate to the framework. I've also added my new UnivariateImpl, 
StoreUnivariateImpl and other Univariate impls to show how easy it is 
to extend off the Univariates to create implementations that support 
different data structures at the core.

If you look through the classes you can see the benefits of using the 
Facades as polymorphic implementations on top of various data 
structures. Separating the Statistical implementations further releases 
these algorithms from being restricted to a specific implementation.

-Mark

Re: 1. Interfaces should be stable 2. How to do one-pass computations? (was: [math] Recent commits to stat, util packages)

Posted by Phil Steitz <st...@yahoo.com>.
--- Anton Tagunov <at...@mail.cnt.ru> wrote:
> Hello, Phil and All the [math] Developers!
> 
> 1.
> 
> PS> Univariate can always be extended -- statistics can be added to the base
> PS> interface...
> 
> Oh, no.. I feel terribly sorry to break in, but this is
> probably going to cause us users some troubles..

That is a good point.  I would expect all interfaces to stabilize prior to
release.  I should have qualified the statement.

> In fact, extending an interface in a released (and also unreleased
> but already adopted by users) project is an extremely painful action.
> It breaks all user implementations of that interface.

Agreed.  Good point.
> 
> AFAIK jakarta-commons strives hard to avoid doing that even if
> extending an interface would give significant benefits.
> 
 > 
> Please, can you use a design that keeps interfaces stable?

I see no reason that this cannot be done.  

Phil




1. Interfaces should be stable 2. How to do one-pass computations? (was: [math] Recent commits to stat, util packages)

Posted by Anton Tagunov <at...@mail.cnt.ru>.
Hello, Phil and All the [math] Developers!

1.

PS> Univariate can always be extended -- statistics can be added to the base
PS> interface...

Oh, no.. I feel terribly sorry to break in, but this is
probably going to cause us users some troubles..

Imagine

* some class appears in [math] that accepts an
  object implementing Univariate as an argument of some
  method

* I, as a user, create my own implementation of the Univariate
  interface

* I pass my object to that method

All is ok, so far.

* as the next version of [math] comes out the Univariate
  interface is extended

Bump! My project no longer compiles, as my own implementation
of Univariate does not implement the new method.
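The scenario in miniature (the interface body here is invented for the example; the comment marks the hypothetical version-2 addition that would break the user's class at compile time):

```java
// Illustrative only: a pared-down interface and a user implementation.
interface Univariate {                 // [math] version 1
    double getMean();
    // double getKurtosis();           // added in "version 2" -> user code breaks
}

class MyUnivariate implements Univariate {   // the user's own implementation
    private double sum = 0.0;
    private long n = 0;

    public void addValue(double v) {
        sum += v;
        n++;
    }

    public double getMean() {
        return n == 0 ? Double.NaN : sum / n;
    }
    // no getKurtosis(): recompiling against "version 2" would fail right here
}
```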

2.

In fact, extending an interface in a released (and also unreleased
but already adopted by users) project is an extremely painful action.
It breaks all user implementations of that interface.

AFAIK jakarta-commons strives hard to avoid doing that even if
extending an interface would give significant benefits.

AFAIK this is the path to creating incompatible versions of
jars (w/o backward compatibility) and this is part of
the "jar hell" problem widely known here which makes
developers "cut and paste" code rather than create a
dependency on a sister project (augh!)

Please, can you use a design that keeps interfaces stable?

3.

However,

PS> I wanted to keep the stats in ... one-pass, storageless Univariate
PS> limited to the basics -- mean, variance, min, max, sum and the things
PS> that can be derived from those (std. dev, conf intervals, etc.)

This sounds meaningful as well.
How can the one-pass computation be implemented in the
modular framework?
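For reference, a storageless one-pass statistic only needs O(1) state behind an increment-style method, so incremental computation and a modular statistic object are not in conflict. A rough sketch using a Welford-style update (class and method names are assumptions, not the committed design):

```java
// Storageless one-pass mean and variance (Welford's update).
// All state is O(1); no data window is retained.
public class OnePassMoments {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the mean

    public void increment(double d) {
        n++;
        double delta = d - mean;
        mean += delta / n;
        m2 += delta * (d - mean);
    }

    public double getMean() { return n == 0 ? Double.NaN : mean; }

    public double getVariance() { // sample variance
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }

    public static void main(String[] args) {
        OnePassMoments m = new OnePassMoments();
        for (double d : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            m.increment(d);
        }
        System.out.println(m.getMean());     // mean of the sample: 5.0
        System.out.println(m.getVariance()); // 32/7, about 4.57
    }
}
```

A modular framework can expose exactly this shape: each statistic object carries its own tiny running state and is fed one value at a time.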

WBR, Anton




RE: [math] Recent commits to stat, util packages

Posted by Brent Worden <br...@worden.org>.
>
> One more sort of philosophical point that makes me want to keep
> Univariates as
> objects with statistics as properties:  to me a Univariate is in
> fact a java
> bean.  Its state is the data that it is characterizing and its
> properties are
> the statistics describing these data.

And why can't these statistics be objects?  Objects that are smart in that
they know how to modify themselves.  Currently, univariate has all that
knowledge which will get more complex with every new statistic.  The object
approach places the responsibility of data update and computation where it
belongs, internal to the statistic.

> You are confusing strategies with implementations. The
> rootfinding framework
> exists to support multiple strategies to do rootfinding, not to support
> arbitrary numerical methods. A better analogy would be to the distribution
> framework which supports creation of different probability
> distributions.  You
> could argue that a "statistic" is as natural an abstraction as a
> probability
> distribution.  I disagree with that.  There is lots of structure in a
> probability distribution, very little in a statistic from an abstract
> standpoint.

But the simple abstractions are always the most useful.  They are more
easily adapted, reused, and understood.

> I disagree. Extending a class or adding a method to an interface
> is no harder
> than adding a new class (actually easier).  It seems ridiculous
> to me to add a
> new class for each univariate statistic that we want to support.

Funny.  You just suggested that a way to support additional statistics is by
creating a new class via extension.  Yet you claim adding a new class for a
statistic is ridiculous.  Are you saying your idea is ridiculous?

> If the stats
> are going to be meaningfully integrated, they will have to be
> used/defined by
> the core univariate classes anyway, unless your idea is to
> eliminate these and
> force users to think about statistics one at a time instead of as part of a
> univariate statistical summary.

You can easily create a univariate class that is open-ended to the
statistics that it computes and treat them as a logical set.  One would
create a univariate and any set of statistic objects.  Then you would add
data to the univariate, which would in turn pass the data to each of the
statistic objects.
The statistic objects then take the data and update themselves.  Now we
have a univariate that can compute any statistic, either one provided by
commons-math or one created by a user, on an as-needed basis rather than the
all-or-nothing approach.
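The open-ended univariate described above might be sketched roughly like this (every name here is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// A statistic that knows how to update itself from new data.
interface Statistic {
    void increment(double d);
    double getValue();
}

class Sum implements Statistic {
    private double sum = 0.0;
    public void increment(double d) { sum += d; }
    public double getValue() { return sum; }
}

class Max implements Statistic {
    private double max = Double.NEGATIVE_INFINITY;
    public void increment(double d) { max = Math.max(max, d); }
    public double getValue() { return max; }
}

// The univariate only fans data out to whatever statistics were registered.
public class OpenUnivariate {
    private final List<Statistic> stats = new ArrayList<>();

    public void addStatistic(Statistic s) { stats.add(s); }

    public void addValue(double d) {
        for (Statistic s : stats) {
            s.increment(d);
        }
    }

    public static void main(String[] args) {
        OpenUnivariate u = new OpenUnivariate();
        Sum sum = new Sum();
        Max max = new Max();
        u.addStatistic(sum);
        u.addStatistic(max);
        for (double d : new double[] {1, 5, 3}) {
            u.addValue(d);
        }
        System.out.println(sum.getValue()); // prints 9.0
        System.out.println(max.getValue()); // prints 5.0
    }
}
```

Adding a user-defined statistic then means writing one small class implementing the statistic interface and registering it; the univariate itself never changes.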

> This may be the crux of our
> disagreement.  I
> see the statistics as natural properties of a set of data, not meaningful
> objects in their own right.

And what limits properties to being only dumb data values?  By that logic,
objects such as Calendars, Colors, and InputStreams could not be used as
properties.  Currently, univariate has the responsibility of computing a
mean.  Taking that responsibility away from univariate and giving it to a
statistic object makes that object tremendously meaningful.

>
> We are always going to have to discuss what goes in to
> commons-math and what
> does not go in, regardless of how packages are organized.  For
> example, I would
> be opposed (as I suspect J, Al and Brent would be too) to adding
> a Newton's
> method solver now, since it would provide no value beyond what we
> already have.
> This has nothing to do with how the package is organized.
>
> I would like to propose the following compromise solution that
> allows the kind
> of flexibility that you want without breaking things apart as much.
>
> 1. Rename StoreUnivariate to ExtendedUnivariate and change all
> other "Store"
> names to "Extended".

Changing the name doesn't make the design any better.  Do you think if
Microsoft had named Windows, Portals, it would be a better OS?

>
> 2. Make the methods in StatUtils non-static. Continue to use
> these for basic
> computational methods shared by Univariate and ExtendedUnivariate
> implementations and for direct use by applications and elsewhere in
> commons-math. These methods do not have to be used by all Univariate
> implementation strategies.

>
> 3. Add addValues methods to Univariate that accept double[], List and
> Collection with property name and eliminate ListUnivariate and
> BeanListUnivariate.

With that you just tripled the complexity of univariate.  And as a result,
tripled the complexity of adding a statistic, tripled the likelihood of
introducing errors with each change, tripled this, tripled that.  Isn't it
yet obvious that this is flawed?

>
> 4. Rename UnivariateImpl to SimpleUnivariate and add a
> UnivariateFactory with
> factory methods to create Simple, Extended  and whatever other sorts of
> Univariates we may define.
>
> To add new statistics or computational strategies in this
> environment, we can
> a) add to the Univariate interface if we think that they are
> really basic -- I
> think that t-based confidence interval half-width for the mean is
> a basic stat
> that is now missing

Yes.  And if you were a user, with the current implementation, there is
nothing you could do about it but pray it'll be added in the next release of
commons-math.  However, with the object approach, you'd create a simple
statistic object that can be used with univariate and all your troubles go
away.

> for example b) add to the ExtendedUnivariate
> interface c)
> extend an existing Univariate implementation to add the new
> statistic or d)
> create a new Univariate including the new statistic or
> computational strategy.

Again, you yourself labeled c and d as ridiculous when you labeled Mark's
idea of adding a class for each statistic as such.

The current univariates have encapsulated way too much responsibility
instead of delegating it to other objects.  This makes the code very
unstable as it will need to change frequently.  As I see it, the univariate
types are responsible for two things: maintaining a window of data and
computing summary statistics.

I would suggest separating each of these responsibilities into separate
objects.  I would make a window policy object that knows if/when data values
should be removed when others are added and if individual data values are
accessible.  I would make a statistics strategy object that knows what
statistics to compute and how to compute them based on the window policy.
The univariate would act as a mediator between the two objects.  I like
Mark's approach, but I think I would take it a little further in terms of
abstraction by making univariate independent of the statistics it's
calculating.
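That separation of responsibilities might look like the following sketch (window policy, statistic, and mediator are all invented names for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Decides whether adding a value evicts an old one.
interface WindowPolicy {
    // Returns the evicted value, or NaN if nothing leaves the window.
    double valueAdded(double d);
}

// Infinite window: nothing is ever evicted, no storage needed.
class InfiniteWindow implements WindowPolicy {
    public double valueAdded(double d) { return Double.NaN; }
}

// Fixed-size rolling window backed by a deque.
class RollingWindow implements WindowPolicy {
    private final int capacity;
    private final Deque<Double> window = new ArrayDeque<>();

    RollingWindow(int capacity) { this.capacity = capacity; }

    public double valueAdded(double d) {
        window.addLast(d);
        return window.size() > capacity ? window.removeFirst() : Double.NaN;
    }
}

// Mediator: keeps a running sum correct under either policy.
public class WindowedSum {
    private final WindowPolicy policy;
    private double sum = 0.0;

    WindowedSum(WindowPolicy policy) { this.policy = policy; }

    public void addValue(double d) {
        sum += d;
        double evicted = policy.valueAdded(d);
        if (!Double.isNaN(evicted)) {
            sum -= evicted;
        }
    }

    public double getSum() { return sum; }

    public static void main(String[] args) {
        WindowedSum s = new WindowedSum(new RollingWindow(2));
        s.addValue(1);
        s.addValue(2);
        s.addValue(10); // 1 is evicted; the sum now covers {2, 10}
        System.out.println(s.getSum()); // prints 12.0
    }
}
```

The same statistic code runs unchanged against either window policy; only the policy object knows whether data is stored or discarded.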

Brent Worden
http://www.brent.worden.org




Re: [math] Recent commits to stat, util packages

Posted by Phil Steitz <st...@yahoo.com>.
--- "Mark R. Diggory" <md...@latte.harvard.edu> wrote:
 
> > 
> > Well, I for one would prefer to have the simple computational methods in
> one
> > place.  I would support making the class require instantiation, however,
> i.e.
> > making the methods non-static.
> > 
> 
> Yes, but again it is a question of having big flat monolithic classes vs 
> having extensible implementations that can easily be expanded on. I'm 
> not particularly thrilled at the idea of being totally locked into such 
> an interface like Univariate or StatUtils. It is just totally inflexible, 
> and there is always too much restriction and argument about what we want 
> to put in it vs. not put in it.

I think that it is a good idea to have these discussions and I don't understand
what you mean by "inflexible".  In retrospect, we probably should have named
StoreUnivariate "ExtendedUnivariate", since it really represents a statistical
object supporting more statistics.  Univariate can always be extended --
statistics can be added to the base interface as well as to the abstract and
concrete classes that implement the base interface.  Some of these statistics
can be based on computational methods in StatUtils.  If we eliminate the static
methods in StatUtils, then we can make the computational strategies pluggable.

One more sort of philosophical point that makes me want to keep Univariates as
objects with statistics as properties:  to me a Univariate is in fact a java
bean.  Its state is the data that it is characterizing and its properties are
the statistics describing these data.  Univariates that support only a limited
set of statistics don't have to hold all of the individual data values
comprising their state internally.  Extended Univariates require more overhead.
 It is natural, therefore, to define the extended statistics in an extended
interface. 
> 
> > 
> Yes, simple, but not very organized, and not as extensible as a 
> framework like "solvers" is. You can implement any new "solver" we could 
> desire right now without much complaint, but try to implement a new 
> statistic and blam, all this argument starts up as to whether it's 
> appropriate or not in the Univariate interface.

You are confusing strategies with implementations. The rootfinding framework
exists to support multiple strategies to do rootfinding, not to support
arbitrary numerical methods. A better analogy would be to the distribution
framework which supports creation of different probability distributions.  You
could argue that a "statistic" is as natural an abstraction as a probability
distribution.  I disagree with that.  There is lots of structure in a
probability distribution, very little in a statistic from an abstract
standpoint.

 There's not room for 
> growth here! If I decide to go down the road and try to implement things 
> like auto-correlation coefficients (which would be a logical addition 
> someday) then I end up having to get permission just to "add" the 
> implementation, whereas if there's a logical framework, there's more room 
> for growth without stepping on each other's toes so much. This is very 
> logical to me.

I disagree. Extending a class or adding a method to an interface is no harder
than adding a new class (actually easier).  It seems ridiculous to me to add a
new class for each univariate statistic that we want to support. If the stats
are going to be meaningfully integrated, they will have to be used/defined by
the core univariate classes anyway, unless your idea is to eliminate these and
force users to think about statistics one at a time instead of as part of a
univariate statistical summary. This may be the crux of our disagreement.  I
see the statistics as natural properties of a set of data, not meaningful
objects in their own right. 

We are always going to have to discuss what goes in to commons-math and what
does not go in, regardless of how packages are organized.  For example, I would
be opposed (as I suspect J, Al and Brent would be too) to adding a Newton's
method solver now, since it would provide no value beyond what we already have.
This has nothing to do with how the package is organized. 

I would like to propose the following compromise solution that allows the kind
of flexibility that you want without breaking things apart as much.

1. Rename StoreUnivariate to ExtendedUnivariate and change all other "Store"
names to "Extended".   

2. Make the methods in StatUtils non-static. Continue to use these for basic
computational methods shared by Univariate and ExtendedUnivariate
implementations and for direct use by applications and elsewhere in
commons-math. These methods do not have to be used by all Univariate
implementation strategies.  

3. Add addValues methods to Univariate that accept double[], List and
Collection with property name and eliminate ListUnivariate and
BeanListUnivariate.

4. Rename UnivariateImpl to SimpleUnivariate and add a UnivariateFactory with
factory methods to create Simple, Extended  and whatever other sorts of
Univariates we may define.

To add new statistics or computational strategies in this environment, we can
a) add to the Univariate interface if we think that they are really basic -- I
think that t-based confidence interval half-width for the mean is a basic stat
that is now missing, for example b) add to the ExtendedUnivariate interface c)
extend an existing Univariate implementation to add the new statistic or d)
create a new Univariate including the new statistic or computational strategy.
 
Phil
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
> 








Re: [math] Recent commits to stat, util packages

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.

Phil Steitz wrote:
> Sorry, last reply got sent before I was done with it.  Pls disregard and try
> this....
>  
> 
>>>This adds
>>>significant overhead and I do not see the value in it.  The cost of the
>>>additional stack operations/object creations is significant.  I ran tests
>>>comparing the previous version that does direct computations using the
>>
>>double[]
>>
>>>arrays to the modified version and found an average of more than 6x
>>
>>slowdown
>>
>>>using the new implementation. I did not profile memory utilization, but
>>
>>that is
>>
>>>also a concern. Repeated tests computing the mean of a 1000 doubles 100000
>>>times using the old and new implementations averaged 1.5 and 10.2 seconds,
>>>resp. I do not see the need for all of this additional overhead. 
>>>
>>
>>If you review the code, you'll find there is no added "object creation", 
>>the static Variable objects calculate on double[] just as the 
>>Univariates did, I would have to see more substantial analysis to 
>>believe your claim. All that's going on here is that the static StatUtil 
>>methods are delegating to individual static instances of 
>>UnivariateStatistics. These are instantiated on JVM startup like all 
>>static objects, calling a method in such an object should not require 
>>any more overhead than having the method coded directly into the static 
>>method.
>>
>>If there are performance considerations, let's discuss these.
> 
> 
> Here is what I added to StatUtils.test
> 
>  double[] x = new double[1000];
>  for (int i = 0; i < 1000; i++) {
>  	x[i] = (5 - i) * (i - 200);
>  }
>  long startTick = 0;
>  double res = 0;
>   for (int j = 0; j < 10; j++) {    
>     startTick = System.currentTimeMillis();
>     for (int i = 0; i < 100000; i++) {
>       res = OStatUtils.mean(x);
>     }
>     System.out.println("old: " + (System.currentTimeMillis() - startTick));
>     startTick = System.currentTimeMillis();
>     for (int i = 0; i < 100000; i++) {
>       res = StatUtils.mean(x);
>     }
>     System.out.println("new: " + (System.currentTimeMillis() - startTick));
> 
> The result was a mean of 10203 for the "new" and 1531.1 for the "old", with
> standard deviations 81.1 and 13.4 resp.  The overhead is the stack operations
> and temp object creations.
>  
> 

Ok, yes, you've got me on this one. I ran these tests and you're correct: 
the increment method approach (while great for storageless 
implementations) does incur added cost in calculation (specifically 
in the extra divisions that are occurring). It would have been better for me 
to keep the implementations provided in the static StatUtil 
lib for the double[] based methods.

*the direct evaluation approach here*

public double evaluate(double[] d, int start, int length) {
    double accum = 0.0;
    for (int i = start; i < start + length; i++) {
        accum += d[i];
    }
    // divide by the number of values examined, not the full array length
    return accum / (double) length;
}

*takes fewer cycles to calculate than the incremental approach below*

int n = 0;
double m1 = Double.NaN;

public double evaluate(double[] d, int start, int length) {
    // note: n and m1 persist across calls, so each evaluation
    // continues from the previous state
    for (int i = start; i < start + length; i++) {
        increment(d[i]);
    }
    return getValue();
}

public double increment(double d) {
    if (n < 1) {
        m1 = 0.0;
    }
    n++;
    // running mean update: each call costs an extra division
    m1 += (d - m1) / ((double) n);
    return m1;
}

public double getValue() {
    return m1;
}

I will add the direct approaches into the implementations I have written 
to regain this efficiency. I'll also roll back StatUtils so that it is 
not dependent on these limitations in the meantime.


>>I doubt (as the numerous discussions over the past week have pointed 
>>out) that what we really want to have in StatUtils is one monolithic 
>>Static class with all the implemented methods present in it. If I have 
>>misinterpreted this opinion in the group, then I'm sure there will be 
>>responses to this.
> 
> 
> Well, I for one would prefer to have the simple computational methods in one
> place.  I would support making the class require instantiation, however, i.e.
> making the methods non-static.
> 

Yes, but again it is a question of having big flat monolithic classes vs 
having extensible implementations that can easily be expanded on. I'm 
not particularly thrilled at the idea of being totally locked into such 
an interface like Univariate or StatUtils. It is just totally inflexible, 
and there is always too much restriction and argument about what we want 
to put in it vs. not put in it.

> 
> 
>>There was a great deal of discussion about the benefit of not having the 
>>methods implemented directly in static StatUtils because they could not 
>>be "overridden" or worked with in an Instantiable form. This approach 
>>frees the implementations up to be overridden and frees up room for 
>>alternate implementations.
> 
> 
> As I said above, the simplest way to deal with this is to make the methods
> non-static.
> 
Yes, simple, but not very organized, and not as extensible as a 
framework like "solvers" is. You can implement any new "solver" we could 
desire right now without much complaint, but try to implement a new 
statistic and blam, all this argument starts up as to whether it's 
appropriate or not in the Univariate interface. There's not room for 
growth here! If I decide to go down the road and try to implement things 
like auto-correlation coefficients (which would be a logical addition 
someday) then I end up having to get permission just to "add" the 
implementation, whereas if there's a logical framework, there's more room 
for growth without stepping on each other's toes so much. This is very 
logical to me.

> 
>>You may have your opinions of how you would like to see the packages 
>>organized and implemented. Others in the group do have alternate 
>>opinions to yours. I for one see a strong value in individually 
>>implemented Statistics. I also have a strong vision that the framework I 
>>have been working on provides substantial benefits.
>>
>>(1a.) It Allows both the storageless and storage based implementations 
>>to function behind the same interface. No matter if you're calling
>>
>>increment(double d)
>>
>>or
>>
>>evaluate(double[]...)
>>
>>you're working with the same algorithm.
> 
> 
> That is true in the old implementation as well, with the core computational
> methods in StatUtils.

No, in the original implementation "incremental" approaches are 
different implementations than "evaluation" double[] approaches, as 
we've seen in the case above. The trade-off is accuracy vs. efficiency. 
In the old implementation's case the incrementals are in 
UnivariateImpl while the evaluation strategies are in StatUtils (and 
currently duplicated in StoreUnivariateImpl).

> 
>>(1b.) If you wish to have alternate implementations for evaluate and 
>>increment, it is easily possible to overload these methods in future 
>>versions of the implementations.
> 
> 
> Just make the methods non-static and that will be possible.  I am not sure,
> given the relative triviality of these methods, if this is really a big deal,
> however.

Then we're back to dependency issues where now there's "another" 
interface that is restrictive and difficult to expand upon easily; it 
will be hard to add things to the library because everyone will be 
arguing about what should/shouldn't be in the interface, uughh. :-(

I am becoming more and more against having these "generic" Univariate 
interfaces where a particular statistic is embodied in a "method". Every 
time someone comes up with a "new method" there will be debate about 
whether it should be in the interface or not, instead of it just being an 
additional class that can be added to the package. This is the benefit 
of a framework over monolithic interfaces.

> 
>>
>>Phil, its clear we have very different "schools of thought" on the 
>>subject of how the library should be designed. As a developer on the 
>>project I have a right to promote my design model and interests. The 
>>architecture is something I have a strong interest in working with.
> 
> 
> You certainly have the right to your opinions.  Others also have the right to
> disagree with them.
> 
>>Apache projects are "group" projects. If a project such as [math] cannot 
>>find community and room for multiple directions of development, if it 
>>cannot make room for alternate ideas and visions, if both revolutionary 
>>and evolutionary processes cannot coexist, I doubt the project will have 
>>much of a future at all.
> 
> 
> I agree with this as well; but from what I have observed, open source projects
> do best when they do not try to go off in divergent directions at the same
> time. If we cannot agree on a consistent architecture direction, then I don't
> think we will succeed. If we can and we stay focused, then we will.  

I don't think trying to come up with the best design for the library 
equates very well to "being unfocused".

 > As I said
> above, if others agree with the approach that you want to take, then that is
> the direction that the project will go.  I am interested in the opinions of
> Tim, Robert and the rest of the team.
> 
> Phil
> 

I am interested as well in what they have to say.

-Mark




Re: [math] Recent commits to stat, util packages

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Thank you, Phil, for the code. I've finished adding the StatUtil 
strategies to the evaluation methods of the new classes I've been 
working on. The timings are once again comparable for both packages, as 
they should be, since the code is the same for these approaches.

Mean (old=StatUtils, new=Mean class)
old: 3505    new: 2534
old: 3395    new: 2513
old: 3385    new: 2514
old: 3385    new: 2513
old: 3405    new: 2504
old: 3395    new: 2503
old: 3405    new: 2504
old: 3405    new: 2513
old: 3385    new: 2524
old: 3385    new: 2523
old: mean=3405.0 std=36.20926830400073
new: mean=2514.5 std=10.013879257199855

Variance (old=StatUtils, new=Variance class)
old: 38265    new: 40168
old: 38235    new: 40038
old: 38235    new: 40037
old: 38255    new: 40098
old: 38305    new: 40098
old: 38275    new: 40147
old: 38285    new: 40068
old: 38185    new: 39977
old: 38205    new: 39978
old: 38185    new: 39977
old: mean=38243.0 std=41.57990967870072
new: mean=40058.6 std=69.6661881961239

I've also added tests that exercise the same set of values on 
both the evaluation and incremental methods to show that both approaches 
return equal results within a tolerance of 10E-12 for the provided dataset.
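Such an agreement test might look roughly like this sketch (the data and tolerance here are illustrative; the actual test class is not shown in the thread):

```java
// Checks that the one-shot (evaluation) and incremental forms of the
// mean agree to within a small tolerance on the same data.
public class MeanAgreementCheck {

    // Direct evaluation: sum everything, divide once.
    static double evaluate(double[] d) {
        double accum = 0.0;
        for (double v : d) {
            accum += v;
        }
        return accum / d.length;
    }

    // Incremental (storageless) form: running mean update per value.
    static double incremental(double[] d) {
        double m1 = 0.0;
        int n = 0;
        for (double v : d) {
            n++;
            m1 += (v - m1) / n;
        }
        return m1;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        double diff = Math.abs(evaluate(data) - incremental(data));
        System.out.println(diff < 1.0e-12 ? "agree" : "disagree"); // prints "agree"
    }
}
```

The two forms accumulate rounding error differently, which is why the comparison uses a tolerance rather than exact equality.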

-Mark

Phil Steitz wrote:
> Sorry, last reply got sent before I was done with it.  Pls disregard and try
> this....
>  
> 
>>>This adds
>>>significant overhead and I do not see the value in it.  The cost of the
>>>additional stack operations/object creations is significant.  I ran tests
>>>comparing the previous version that does direct computations using the
>>
>>double[]
>>
>>>arrays to the modified version and found an average of more than 6x
>>
>>slowdown
>>
>>>using the new implementation. I did not profile memory utilization, but
>>
>>that is
>>
>>>also a concern. Repeated tests computing the mean of a 1000 doubles 100000
>>>times using the old and new implementations averaged 1.5 and 10.2 seconds,
>>>resp. I do not see the need for all of this additional overhead. 
>>>
>>
>>If you review the code, you'll find there is no added "object creation", 
>>the static Variable objects calculate on double[] just as the 
>>Univariates did, I would have to see more substantial analysis to 
>>believe your claim. All that's going on here is that the static StatUtil 
>>methods are delegating to individual static instances of 
>>UnivariateStatistics. These are instantiated on JVM startup like all 
>>static objects, calling a method in such an object should not require 
>>any more overhead than having the method coded directly into the static 
>>method.
>>
>>If there are performance considerations, let's discuss these.
> 
> 
> Here is what I added to StatUtils.test
> 
>  double[] x = new double[1000];
>  for (int i = 0; i < 1000; i++) {
>  	x[i] = (5 - i) * (i - 200);
>  }
>  long startTick = 0;
>  double res = 0;
>   for (int j = 0; j < 10; j++) {    
>     startTick = System.currentTimeMillis();
>     for (int i = 0; i < 100000; i++) {
>       res = OStatUtils.mean(x);
>     }
>     System.out.println("old: " + (System.currentTimeMillis() - startTick));
>     startTick = System.currentTimeMillis();
>     for (int i = 0; i < 100000; i++) {
>       res = StatUtils.mean(x);
>     }
>     System.out.println("new: " + (System.currentTimeMillis() - startTick));
> 
> The result was a mean of 10203 for the "new" and 1531.1 for the "old", with
> standard deviations 81.1 and 13.4 resp.  The overhead is the stack operations
> and temp object creations.
>  
> 
>>I doubt (as the numerous discussions over the past week have pointed 
>>out) that what we really want to have in StatUtils is one monolithic 
>>Static class with all the implemented methods present in it. If I have 
>>misinterpreted this opinion in the group, then I'm sure there will be 
>>responses to this.
> 
> 
> Well, I for one would prefer to have the simple computational methods in one
> place.  I would support making the class require instantiation, however, i.e.
> making the methods non-static.
> 
> 
> 
>>There was a great deal of discussion about the benefit of not having the 
>>methods implemented directly in static StatUtils because they could not 
>>be "overridden" or worked with in an Instantiable form. This approach 
>>frees the implementations up to be overridden and frees up room for 
>>alternate implementations.
> 
> 
> As I said above, the simplest way to deal with this is to make the methods
> non-static.
> 
> 
>>You may have your opinions of how you would like to see the packages 
>>organized and implemented. Others in the group do have alternate 
>>opinions to yours. I for one see a strong value in individually 
>>implemented Statistics. I also have a strong vision that the framework I 
>>have been working on provides substantial benefits.
>>
>>(1a.) It Allows both the storageless and storage based implementations 
>>to function behind the same interface. No matter if you're calling
>>
>>increment(double d)
>>
>>or
>>
>>evaluate(double[]...)
>>
>>you're working with the same algorithm.
> 
> 
> That is true in the old implementation as well, with the core computational
> methods in StatUtils.
> 
>>(1b.) If you wish to have alternate implementations for evaluate and 
>>increment, it is easily possible to overload these methods in future 
>>versions of the implementations.
> 
> 
> Just make the methods non-static and that will be possible.  I am not sure,
> given the relative triviality of these methods, if this is really a big deal,
> however.
> 
>  
> 
>>
>>Phil, its clear we have very different "schools of thought" on the 
>>subject of how the library should be designed. As a developer on the 
>>project I have a right to promote my design model and interests. The 
>>architecture is something I have a strong interest in working with.
> 
> 
> You certainly have the right to your opinions.  Others also have the right to
> disagree with them.
> 
>>Apache projects are "group" projects. If a project such as [math] cannot 
>>find community and room for multiple directions of development, if it 
>>cannot make room for alternate ideas and visions, if both revolutionary 
>>and evolutionary processes cannot coexist, I doubt the project will have 
>>much of a future at all.
> 
> 
> I agree with this as well; but from what I have observed, open source projects
> do best when they do not try to go off in divergent directions at the same
> time. If we cannot agree on a consistent architecture direction, then I don't
> think we will succeed. If we can and we stay focused, then we will.  As I said
> above, if others agree with the approach that you want to take, then that is
> the direction that the project will go.  I am interested in the opinions of
> Tim, Robert and the rest of the team.
> 
> Phil
> 
>>
>>-Mark
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>>
> 
> 
> 
> __________________________________
> Do you Yahoo!?
> SBC Yahoo! DSL - Now only $29.95 per month!
> http://sbc.yahoo.com
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
> 




Re: [math] Recent commits to stat, util packages

Posted by Phil Steitz <st...@yahoo.com>.
Sorry, last reply got sent before I was done with it.  Pls disregard and try
this....
 
> > This adds
> > significant overhead and I do not see the value in it.  The cost of the
> > additional stack operations/object creations is significant.  I ran tests
> > comparing the previous version that does direct computations using the
> double[]
> > arrays to the modified version and found an average of more than 6x
> slowdown
> > using the new implementation. I did not profile memory utilization, but
> that is
> > also a concern. Repeated tests computing the mean of a 1000 doubles 100000
> > times using the old and new implementations averaged 1.5 and 10.2 seconds,
> > resp. I do not see the need for all of this additional overhead. 
> > 
> 
> If you review the code, you'll find there is no added "object creation";
> the static Variable objects calculate on double[] just as the
> Univariates did. I would have to see more substantial analysis to
> believe your claim. All that's going on here is that the static StatUtils
> methods are delegating to individual static instances of
> UnivariateStatistics. These are instantiated on JVM startup like all
> static objects, and calling a method on such an object should not require
> any more overhead than having the method coded directly into the static
> method.
> 
> If there are performance considerations, let's discuss them.

Here is what I added to StatUtilsTest:

 double[] x = new double[1000];
 for (int i = 0; i < 1000; i++) {
     x[i] = (5 - i) * (i - 200);
 }
 long startTick = 0;
 double res = 0;
 for (int j = 0; j < 10; j++) {
     startTick = System.currentTimeMillis();
     for (int i = 0; i < 100000; i++) {
         res = OStatUtils.mean(x);
     }
     System.out.println("old: " + (System.currentTimeMillis() - startTick));
     startTick = System.currentTimeMillis();
     for (int i = 0; i < 100000; i++) {
         res = StatUtils.mean(x);
     }
     System.out.println("new: " + (System.currentTimeMillis() - startTick));
 }
The result was a mean of 10203 ms for the "new" and 1531.1 ms for the "old",
with standard deviations of 81.1 and 13.4 respectively.  The overhead comes
from the extra stack operations and temporary object creations.
 
> 
> I doubt (as the numerous discussions over the past week have pointed 
> out) that what we really want to have in StatUtils is one monolithic 
> Static class with all the implemented methods present in it. If I have 
> misinterpreted this opinion in the group, then I'm sure there will be 
> responses to this.

Well, I for one would prefer to have the simple computational methods in one
place.  I would support making the class require instantiation, however, i.e.
making the methods non-static.


> There was a great deal of discussion about the benefit of not having the 
> methods implemented directly in static StatUtils because they could not 
> be "overridden" or worked with in an Instantiable form. This approach 
> frees the implementations up to be overridden and frees up room for 
> alternate implementations.

As I said above, the simplest way to deal with this is to make the methods
non-static.

> 
> You may have your opinions of how you would like to see the packages 
> organized and implemented. Others in the group do have alternate 
> opinions to yours. I for one see a strong value in individually 
> implemented Statistics. I also have a strong vision that the framework I 
> have been working on provides substantial benefits.
> 
> (1a.) It allows both the storageless and storage-based implementations
> to function behind the same interface. No matter if you're calling
>
> increment(double d)
>
> or
>
> evaluate(double[]...)
>
> you're working with the same algorithm.

That is true in the old implementation as well, with the core computational
methods in StatUtils.
> 
> (1b.) If you wish to have alternate implementations for evaluate and
> increment, it is easily possible to overload these methods in future
> versions of the implementations.

Just make the methods non-static and that will be possible.  I am not sure,
given the relative triviality of these methods, if this is really a big deal,
however.

 
> 
> 
> Phil, it's clear we have very different "schools of thought" on the
> subject of how the library should be designed. As a developer on the
> project I have a right to promote my design model and interests. The
> architecture is something I have a strong interest in working with.

You certainly have the right to your opinions.  Others also have the right to
disagree with them.
> 
> Apache projects are "group" projects. If a project such as [math] cannot
> find community and room for multiple directions of development, if it
> cannot make room for alternate ideas and visions, and if both revolutionary
> and evolutionary processes cannot coexist, I doubt the project will have
> much of a future at all.

I agree with this as well; but from what I have observed, open source projects
do best when they do not try to go off in divergent directions at the same
time. If we cannot agree on a consistent architecture direction, then I don't
think we will succeed. If we can and we stay focussed, then we will.  As I said
above, if others agree with the approach that you want to take, then that is
the direction that the project will go.  I am interested in the opinions of
Tim, Robert and the rest of the team.

Phil
> 
> 
> -Mark
> 
> 




Re: [math] Recent commits to stat, util packages

Posted by Phil Steitz <st...@yahoo.com>.
--- "Mark R. Diggory" <md...@latte.harvard.edu> wrote:
 
> 
> 
> > This adds
> > significant overhead and I do not see the value in it.  The cost of the
> > additional stack operations/object creations is significant.  I ran tests
> > comparing the previous version that does direct computations using the
> double[]
> > arrays to the modified version and found an average of more than 6x
> slowdown
> > using the new implementation. I did not profile memory utilization, but
> that is
> > also a concern. Repeated tests computing the mean of a 1000 doubles 100000
> > times using the old and new implementations averaged 1.5 and 10.2 seconds,
> > resp. I do not see the need for all of this additional overhead. 
> > 
> 
> If you review the code, you'll find there is no added "object creation";
> the static Variable objects calculate on double[] just as the
> Univariates did. I would have to see more substantial analysis to
> believe your claim. All that's going on here is that the static StatUtils
> methods are delegating to individual static instances of
> UnivariateStatistics. These are instantiated on JVM startup like all
> static objects, and calling a method on such an object should not require
> any more overhead than having the method coded directly into the static
> method.

Here is what I added to one of the methods in StatUtilsTest, after copying and
renaming the old version to OStatUtils:

for (int j = 0; j < 10; j++) {

    startTick = System.currentTimeMillis();
    for (int i = 0; i < 100000; i++) {
        res = OStatUtils.mean(x);
    }
    System.out.println("old: " + (System.currentTimeMillis() - startTick));

    startTick = System.currentTimeMillis();
    for (int i = 0; i < 100000; i++) {
        res = StatUtils.mean(x);
    }
    System.out.println("new: " + (System.currentTimeMillis() - startTick));
}
  
> 
> If there are performance considerations, let's discuss them.
> 
> I doubt (as the numerous discussions over the past week have pointed 
> out) that what we really want to have in StatUtils is one monolithic 
> Static class with all the implemented methods present in it. If I have 
> misinterpreted this opinion in the group, then I'm sure there will be 
> responses to this.
> 
> > I suggest that we postpone introduction of a statistical computation
> framework
> > until after the initial release, if needed.  In any case, I would like to
> keep
> > StatUtils and the core UnivariateImpl small, fast and lightweight, so I
> would
> > like to request that the changes to these classes be rolled back.
> > 
> I would really like to see an architecture that's more than just one flat
> static class with a bunch of double[] methods in it. This is not very
> useful to me.
> 
> > If others feel that this additional infrastructure is essential, then I
> just
> > need to be educated.  It is quite possible that I am thinking too narrowly
> in
> > terms of current scope and I may be missing some looming structural
> problems. 
> > If this is the case, I am open to being educated. I just need to see a)
> exactly
> > why we need to add more complexity at this time and b) why breaking
> univariate
> > statistics into four packages and 17 classes when all we are computing is
> basic
> > statistics is necessary.  
> > 
> 
> The packages are categorical, the classes are implementations of each 
> statistic. The framework provides an intuitive and organized means for 
> others to easily implement and add statistics to the packages without 
> being restricted to a fascist and monolithic Univariate interface or 
> static StatUtils interface.
> 
> If anything, the continued conflict between our two schools of thought
> shows the necessity of such an approach. Your school of thought can
> retain the monolithic interfaces for "Univariate" and "StatUtils", while
> the framework can provide others with the ability to extend and expand
> the library without "heavy handed" restrictions that cripple the
> extensibility of the project.
> 
> There was a great deal of discussion about the benefit of not having the 
> methods implemented directly in static StatUtils because they could not 
> be "overridden" or worked with in an Instantiable form. This approach 
> frees the implementations up to be overridden and frees up room for 
> alternate implementations.
> 
> You may have your opinions of how you would like to see the packages 
> organized and implemented. Others in the group do have alternate 
> opinions to yours. I for one see a strong value in individually 
> implemented Statistics. I also have a strong vision that the framework I 
> have been working on provides substantial benefits.
> 
> (1a.) It allows both the storageless and storage-based implementations
> to function behind the same interface. No matter if you're calling
>
> increment(double d)
>
> or
>
> evaluate(double[]...)
>
> you're working with the same algorithm.
>
> (1b.) If you wish to have alternate implementations for evaluate and
> increment, it is easily possible to overload these methods in future
> versions of the implementations.
>
> (2.) With individual implementations, alternate approaches can be coded
> and included for the benefit of those who have an interest in such
> implementations. Thus there could be multiple versions of Variance,
> based on the strategy of interest and the numerical accuracy required.
> 
> (3.) Having the same implementations of statistics usable across all
> Univariate implementations assures standard behavior and the same
> expected results no matter if you're using incremental or
> evaluation-based approaches.
>
> (4.) The framework provides a formal structure for the future growth of
> the library. Knowing what a UnivariateStatistic is, and seeing the
> various implementations, it's obvious which route one will take to
> implement future statistics of interest.
> 
> 
> Phil, it's clear we have very different "schools of thought" on the
> subject of how the library should be designed. As a developer on the
> project I have a right to promote my design model and interests. The
> architecture is something I have a strong interest in working with.
>
> Apache projects are "group" projects. If a project such as [math] cannot
> find community and room for multiple directions of development, if it
> cannot make room for alternate ideas and visions, and if both revolutionary
> and evolutionary processes cannot coexist, I doubt the project will have
> much of a future at all.
> 
> 
> -Mark
> 
> 





Re: [math] Recent commits to stat, util packages

Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.

Phil Steitz wrote:
> First, the testAddElementRolling test case in FixedDoubleArrayTest will not
> compile, since it is trying to access what is now a private field in
> FixedDoubleArray (internalArray). The changes to FixedDoubleArray should be
> rolled back or the tests should be modified so that they compile and succeed.
> 
Thanks for pointing this out, this was a minor problem that was easily 
fixed by using the appropriate getValues method.
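For illustration, the fix amounts to replacing direct field access with the accessor. The stand-in class below is a much-reduced, hypothetical version of FixedDoubleArray (the real class has rolling behavior and more); it only shows the private-field-to-getValues change:

```java
// Reduced stand-in for FixedDoubleArray: the internal storage is private,
// so tests must go through getValues() instead of reading internalArray.
public class FixedDoubleArray {
    private double[] internalArray;  // now private -- tests cannot touch it
    private int size = 0;

    public FixedDoubleArray(int capacity) {
        internalArray = new double[capacity];
    }

    public void addElement(double v) {
        internalArray[size++] = v;
    }

    // Accessor the test should use instead of the private field.
    public double[] getValues() {
        double[] out = new double[size];
        System.arraycopy(internalArray, 0, out, 0, size);
        return out;
    }

    public static void main(String[] args) {
        FixedDoubleArray da = new FixedDoubleArray(4);
        da.addElement(1.0);
        da.addElement(2.0);
        // was: da.internalArray[0] -- no longer compiles against a private field
        System.out.println("first element: " + da.getValues()[0]);
    }
}
```

Returning a defensive copy from getValues also keeps the test from mutating internal state, which direct field access allowed.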

> Second, I do not see the value in all of the additional classes and overhead
> introduced into stat. The goal of Univariate was to provide basic univariate
> statistics via a simple interface and lightweight, numerically sound
> implementation, consistent with the vision of commons-math and Jakarta Commons
> in general. 

I am unclear how the work I have done is any less "lightweight" than
anything else in the Commons project (or the math project, for that
matter). Organizing your classes into packages hardly constitutes
"heaviness".

> I fear that we may be straying off into statistical computation
> framework-building, which I don't think belongs in commons-math (really Jakarta
> Commons).

On the contrary, the "design" of the math package is what's in question
here; this is about alternate opinions coming together to establish what
an optimal design can be. Nobody in this group should have to go off and
write their own library because their vision is not what another
individual in the project likes; this is not a "one man" project.

> More importantly, I don't think we need to add this complexity to
> deliver the functionality that we are providing. The only problem that I see
> with the structure prior to the recent commits is the confusion between
> collections and univariates addValue methods.  I would favor eliminating the
> List and BeanList univariates altogether and replacing their functionality with
> methods added to StatUtils that take Lists or Collections and property names as
> input and compute statistics from them. Similarly, the Univariate interface
> could be modified to include addValues(double[]), addValues(List) (assumes
> contents are Numbers), addValues(Collection, propertyName).
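(For concreteness, the bulk-add methods proposed in the quote above might look roughly like the following hypothetical sketch. SimpleUnivariate is an illustration only, tracking just a running mean; the addValues(Collection, propertyName) variant is omitted because it needs bean reflection:)

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the proposed bulk-add methods, layered on a
// single addValue(double). Only a running mean is tracked here.
public class SimpleUnivariate {
    private double sum = 0.0;
    private long n = 0;

    public void addValue(double v) {
        sum += v;
        n++;
    }

    public void addValues(double[] values) {
        for (int i = 0; i < values.length; i++) {
            addValue(values[i]);
        }
    }

    // List contents are assumed to be Numbers, per the proposal.
    public void addValues(List values) {
        for (Iterator it = values.iterator(); it.hasNext();) {
            addValue(((Number) it.next()).doubleValue());
        }
    }

    public double getMean() {
        return (n == 0) ? Double.NaN : sum / n;
    }
}
```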

IMHO, this just creates an even more "monolithic" StatUtils class. Every
time we decide to work with another Collection or object type, are we
going to have to implement another set of duplicate delegation methods
in StatUtils? This doesn't seem very beneficial to me. I've already gone
forward and implemented a new "MixedListUnivariate" implementation that
works with heterogeneous objects (which can be mapped to double
primitives with "NumberTransformer" objects); my next step was to
commit these to the project.

Unfortunately, I can tell the supporting work I've completed in this
area will also be controversial with you, as it involves some
restructuring of the Univariate class hierarchy with the addition of an
AbstractUnivariate class, which I'm sure will receive objections as
well. This saddens me, because the work I am doing makes total sense to
me conceptually. In an object-oriented language, the tools we work with
are objects. It's conceptually OO and not procedural; Java is not
Fortran.

> The checkin comment says that the new univariate framework is independent of
> the existing implementations; but StatUtils has been modified to include
> numerous static data members and to delegate computation to these.  

Yes, I committed the modifications to StatUtils some time after my
commit of the framework. These were different commits with different
purposes. If the StatUtils commit is premature it can be rolled back,
but I would rather hear from other parties concerning the architecture
and the changes before taking any backward steps. I ran the JUnit
tests to verify that the changes did not create any inaccuracies. My
implementations of the statistics are based entirely on the methods in
both Univariate and the previous StatUtils classes.

> This adds
> significant overhead and I do not see the value in it.  The cost of the
> additional stack operations/object creations is significant.  I ran tests
> comparing the previous version that does direct computations using the double[]
> arrays to the modified version and found an average of more than 6x slowdown
> using the new implementation. I did not profile memory utilization, but that is
> also a concern. Repeated tests computing the mean of a 1000 doubles 100000
> times using the old and new implementations averaged 1.5 and 10.2 seconds,
> resp. I do not see the need for all of this additional overhead. 
> 

If you review the code, you'll find there is no added "object creation";
the static Variable objects calculate on double[] just as the
Univariates did. I would have to see more substantial analysis to
believe your claim. All that's going on here is that the static StatUtils
methods are delegating to individual static instances of
UnivariateStatistics. These are instantiated on JVM startup like all
static objects, and calling a method on such an object should not require
any more overhead than having the method coded directly into the static
method.

If there are performance considerations, let's discuss them.

I doubt (as the numerous discussions over the past week have pointed 
out) that what we really want to have in StatUtils is one monolithic 
Static class with all the implemented methods present in it. If I have 
misinterpreted this opinion in the group, then I'm sure there will be 
responses to this.

> I suggest that we postpone introduction of a statistical computation framework
> until after the initial release, if needed.  In any case, I would like to keep
> StatUtils and the core UnivariateImpl small, fast and lightweight, so I would
> like to request that the changes to these classes be rolled back.
> 
I would really like to see an architecture that's more than just one
flat static class with a bunch of double[] methods in it. This is not
very useful to me.

> If others feel that this additional infrastructure is essential, then I just
> need to be educated.  It is quite possible that I am thinking too narrowly in
> terms of current scope and I may be missing some looming structural problems. 
> If this is the case, I am open to being educated. I just need to see a) exactly
> why we need to add more complexity at this time and b) why breaking univariate
> statistics into four packages and 17 classes when all we are computing is basic
> statistics is necessary.  
> 

The packages are categorical, the classes are implementations of each 
statistic. The framework provides an intuitive and organized means for 
others to easily implement and add statistics to the packages without 
being restricted to a fascist and monolithic Univariate interface or 
static StatUtils interface.

If anything, the continued conflict between our two schools of thought
shows the necessity of such an approach. Your school of thought can
retain the monolithic interfaces for "Univariate" and "StatUtils", while
the framework can provide others with the ability to extend and expand
the library without "heavy handed" restrictions that cripple the
extensibility of the project.

There was a great deal of discussion about the benefit of not having the 
methods implemented directly in static StatUtils because they could not 
be "overridden" or worked with in an Instantiable form. This approach 
frees the implementations up to be overridden and frees up room for 
alternate implementations.

You may have your opinions of how you would like to see the packages 
organized and implemented. Others in the group do have alternate 
opinions to yours. I for one see a strong value in individually 
implemented Statistics. I also have a strong vision that the framework I 
have been working on provides substantial benefits.

(1a.) It allows both the storageless and storage-based implementations
to function behind the same interface. No matter if you're calling

increment(double d)

or

evaluate(double[]...)

you're working with the same algorithm.

(1b.) If you wish to have alternate implementations for evaluate and
increment, it is easily possible to overload these methods in future
versions of the implementations.

(2.) With individual implementations, alternate approaches can be coded
and included for the benefit of those who have an interest in such
implementations. Thus there could be multiple versions of Variance,
based on the strategy of interest and the numerical accuracy required.

(3.) Having the same implementations of statistics usable across all
Univariate implementations assures standard behavior and the same
expected results no matter if you're using incremental or
evaluation-based approaches.

(4.) The framework provides a formal structure for the future growth of
the library. Knowing what a UnivariateStatistic is, and seeing the
various implementations, it's obvious which route one will take to
implement future statistics of interest.
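(Points (1a.) and (3.) can be made concrete with a small sketch. The Mean class below is hypothetical, not the committed code; it only shows how incremental and array-based use can sit behind one implementation so both paths give the same answers:)

```java
// Hypothetical dual-mode statistic: the same class serves storageless
// incremental use (increment/getResult) and storage-based use (evaluate),
// so both paths share one algorithm and one set of expected results.
public class Mean {
    private double sum = 0.0;
    private long n = 0;

    // Storageless, incremental path:
    public void increment(double d) {
        sum += d;
        n++;
    }

    public double getResult() {
        return (n == 0) ? Double.NaN : sum / n;
    }

    // Storage-based, whole-array path (same underlying computation):
    public double evaluate(double[] values) {
        double s = 0.0;
        for (int i = 0; i < values.length; i++) {
            s += values[i];
        }
        return s / values.length;
    }
}
```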


Phil, it's clear we have very different "schools of thought" on the
subject of how the library should be designed. As a developer on the
project I have a right to promote my design model and interests. The
architecture is something I have a strong interest in working with.

Apache projects are "group" projects. If a project such as [math] cannot
find community and room for multiple directions of development, if it
cannot make room for alternate ideas and visions, and if both
revolutionary and evolutionary processes cannot coexist, I doubt the
project will have much of a future at all.


-Mark

