Posted to dev@commons.apache.org by Phil Steitz <ph...@steitz.com> on 2004/08/30 02:57:50 UTC

Re: [MATH] Summary proposed changes

Kim van der Linde wrote:
> Well, I had a discussion with several colleagues (type science users, we
> went snorkeling) on several of these issues. The score of the day was
> the idea that the simple linear LS regression was considered a
> multivariate statistic.

So it should stay where it is.
> 
> Phil Steitz wrote:
> 
>> Nothing has been "put aside."  We make decisions by consensus.  You 
>> have provided input and we are considering it.  To make sure I have it 
>> all right, you have proposed four changes:


>> 2) Change the name of "BivariateRegression" to "UnivariateRegression" 
>> (or something else)
> 
> 
> Put it in univariate, name it LSRegression. (or better,
> SimpleRegression, and build in the option for RMA and MA regressions).

The placement in .univariate contradicts what you say both above and
below. Even with just one independent variable, regression is a
multivariate technique.
> 
>> 3) Change Variance to be configurable to generate the population 
>> statistic.
> 
> 
> Yup, or even better, configurable bias reduction (n = N - a, default a = 1,
> but settable by constructor and specific methods to maintain the option of
> getting both statistics from the same dataset without doing things
> twice). The current situation actually introduces fundamental errors.

Huh?  The formula provides unbiased estimates -- "fundamental error" would 
be to use the biased estimator for sample statistics. As I stated in an 
earlier post, the statistics in the univariate package are all designed to 
produce unbiased estimates for (unknown) population parameters based on 
sample data. The "population variance" that you want to add is either a 
biased (therefore inappropriate) estimator for the population variance 
based on a sample, or an exact expression of the population variance of 
the discrete distribution whose mass points are the data (i.e., assuming 
that the data values *are* the population and not a sample from it -- 
which is why it is called "population variance").  In either case it is a 
different statistic and to keep our design consistent, we should not use 
the same univariate to compute different statistics.
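
A minimal sketch of the two statistics being contrasted here, written as plain Java rather than against the Commons Math classes. The class and method names (VarianceSketch, sampleVariance, populationVariance) are hypothetical and exist only for illustration; the only difference between the two methods is the divisor.

```java
// Illustrative sketch only; not the Commons Math Variance implementation.
final class VarianceSketch {

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sum of squared deviations of the values from the given mean. */
    static double sumSquaredDeviations(double[] values, double mean) {
        double sum = 0.0;
        for (double v : values) {
            double d = v - mean;
            sum += d * d;
        }
        return sum;
    }

    /** Unbiased estimate of the population variance from a sample: divide by n - 1. */
    static double sampleVariance(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        return sumSquaredDeviations(values, mean(values)) / (values.length - 1);
    }

    /** Variance of the data treated as the entire population: divide by n. */
    static double populationVariance(double[] values) {
        if (values.length < 1) {
            throw new IllegalArgumentException("need at least one value");
        }
        return sumSquaredDeviations(values, mean(values)) / values.length;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("sample variance     = " + sampleVariance(data));
        System.out.println("population variance = " + populationVariance(data));
    }
}
```

For this data the sum of squared deviations is 32, so the sample statistic is 32/7 and the population statistic is 32/8. Keeping them as two separate methods mirrors the "one statistic per class" design argued for above.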

>  From the JavaDoc for Variance and SD class:
> 
> - double evaluate(double[] values, double mean, int begin, int length)
>     Returns the variance of the entries in the specified portion of the 
> input array, using the precomputed mean value.
> 
> And in Variance only:
> - double evaluate(double[] values, double mean)
>     Returns the variance of the entries in the input array, using the 
> precomputed mean value.
> 
> If you compute the variance based on an already existing mean obtained
> from a source other than the sample you establish the variance on, the
> population variance should be used, as there is no loss of "degrees of
> freedom" from first establishing the mean of the sample. If the mean is
> based on the same sample, then it is correct.

These methods, like Variance itself, assume that the mean and variance are 
being computed based on sample data.  This is why it says "precomputed" 
rather than "known population parameter". The methods are provided to save 
computation when the sample mean has already been computed.
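
A hedged sketch of what such a method might look like. The signature follows the JavaDoc quoted above; the body, the class name PrecomputedMeanSketch, and the helper mean(...) are assumptions made for illustration and are not taken from the Commons Math source.

```java
// Illustrative sketch only; the method bodies are assumptions, not the Commons Math source.
final class PrecomputedMeanSketch {

    /** Mean of values[begin] .. values[begin + length - 1]. */
    static double mean(double[] values, int begin, int length) {
        double sum = 0.0;
        for (int i = begin; i < begin + length; i++) {
            sum += values[i];
        }
        return sum / length;
    }

    /**
     * Sample variance of the same slice, reusing a mean the caller has
     * already computed from that slice. Divides by length - 1, so the
     * result is only meaningful when the mean comes from these values.
     */
    static double evaluate(double[] values, double mean, int begin, int length) {
        if (length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double sumSq = 0.0;
        for (int i = begin; i < begin + length; i++) {
            double d = values[i] - mean;
            sumSq += d * d;
        }
        return sumSq / (length - 1);
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, 4, 5, 6};
        double m = mean(values, 1, 4);                  // mean of {2, 3, 4, 5}
        System.out.println(evaluate(values, m, 1, 4));  // variance of the same slice
    }
}
```

The saving is one pass over the data: a caller who already holds the sample mean does not have to recompute it.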
> 
>> 4) Combine the univariate and multivariate packages, since it is 
>> confusing to separate statistics that focus on one variable and 
>> sometimes the word "univariate" is used in the context of multivariate 
>> techniques (e.g. "Univariate Anova").
> 
> 
> No, keep them separate, but just locate things where they belong and not 
> reinvent that simple LS regressions should be within the multivariate 
> package.

Contradicts above -- assuming you mean that regression belongs in 
.multivariate, which it does.
> 
> I have a question for you. Where would you locate a Covariance class....?

I am not sure that we would define a covariance class; but if we did, it 
would certainly belong in .multivariate, since covariance is a property of 
the joint distribution of two variables rather than just one.  The basic 
idea is very simple: univariate is for statistics that characterize the 
distribution of just one random variable, multivariate is for analyses 
that involve joint distributions of multiple random variables.
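
To make the point concrete, here is a minimal sketch of a sample covariance. It is not an existing Commons Math class; the name CovarianceSketch and its method are hypothetical. Note that the computation cannot even be stated without two paired data sets, which is what makes it a property of a joint distribution.

```java
// Illustrative sketch only; CovarianceSketch is a hypothetical name.
final class CovarianceSketch {

    /** Unbiased sample covariance of paired observations (x[i], y[i]). */
    static double sampleCovariance(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        int n = x.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two pairs");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Sum of cross-products of deviations, divided by n - 1.
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (n - 1);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8};
        System.out.println(sampleCovariance(x, y));
    }
}
```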

Phil
> 
> Kim
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org

Re: [MATH] Summary proposed changes

Posted by Kim van der Linde <ki...@kimvdlinde.com>.

Hi,

Brent Worden wrote:
>>1) Change the RealMatrix getEntry, getRow, getColumn methods to use
>>0-based indexing.
> 
> 
> Looking at the implementation, I believe the current indexing is
> satisfactory and I can't think of where using it with native arrays would be
> overly burdensome or confusing.

Well, so you think I requested this for purely philosophical reasons. I
am running into problems with it, that's why. But maybe I should just do
it differently, and make a derived class from it and distribute that
with the classes I am making.....

> APIs are supposed to be language agnostic

I think APIs should be logical, and designed such that they minimise
errors.

> let's call
> it what it is, SimpleLeastSquaresRegression.  If that is too long, then
> SimpleRegression

Fine with me.

>>3) Change Variance to be configurable to generate the population
> statistic.
> 
> Since population variance and sample variance are different statistics, they
> should be different classes as that is the design we have chosen.

I disagree, but in that case I will follow the same way on these classes 
as mentioned for the Matrix classes.

>>4) Combine the univariate and multivariate packages, since it is confusing
>>to separate statistics that focus on one variable and sometimes the word
>>"univariate" is used in the context of multivariate techniques (e.g.
>>"Univariate Anova").

> Both these statements indicate regression is a technique that involves more
> than one variable.  Therefore, regression in general is a multivariate
> technique.  The case where there is only one predictor is immaterial as
> there are two variable quantities.  Would one call a model with one
> predictor variable and two response variables a univariate technique?  I
> wouldn't and I doubt if anyone else would.  The path we have chosen, placing
> procedures dealing with one variable in the univariate package and all other
> procedures, those dealing with more than one variable, in the multivariate
> package, is satisfactory and makes for a good discriminant.

See my response; this is not what I proposed. Anyway, the common
interpretation (even among my colleagues who do nothing other than complex
multivariate analyses) is that regressions with one independent and one
dependent variable are univariate regressions, although they can see the
logic, as there are two variables.

But in that sense, the TTest should be within the multivariate package 
too. Both simple regression and t-tests are in the end simplified 
versions of the GLM using only one dependent and one independent variable.

Anyway, I thank you all for the help, and I will just make derived
classes where I need an implementation different from the one provided by
this package.

Cheers,

Kim
-- 
http://www.kimvdlinde.com


RE: [MATH] Summary proposed changes

Posted by Brent Worden <br...@worden.org>.

> 1) Change the RealMatrix getEntry, getRow, getColumn methods to use
> 0-based indexing.

Looking at the implementation, I believe the current indexing is
satisfactory and I can't think of where using it with native arrays would be
overly burdensome or confusing.

As for letting the language dictate the indexing, I think this is a bad
practice for developing an API.  APIs are supposed to be language agnostic
and should exhibit the same behavior no matter the implementing language.
If we allow the language to dictate the behavior of an API method, it's
possible the behavior will be different for other languages.  I feel these
situations should be avoided so the API is portable to a wide array of
languages, which I feel is a long-term goal of some of our developers.
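
A small sketch of the friction under discussion, assuming (as the proposal implies) that the matrix accessors count rows and columns from 1 while the backing Java arrays count from 0. OneBasedMatrixSketch is a hypothetical class written for this illustration; it is not the RealMatrix implementation.

```java
// Illustrative sketch only; a hypothetical 1-based view over a 0-based Java array.
final class OneBasedMatrixSketch {

    private final double[][] data;

    OneBasedMatrixSketch(double[][] data) {
        this.data = data;
    }

    /** Returns the entry at the given 1-based row and column. */
    double getEntry(int row, int column) {
        if (row < 1 || row > data.length) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        if (column < 1 || column > data[row - 1].length) {
            throw new IndexOutOfBoundsException("column " + column);
        }
        // The offset lives in one place, inside the accessor.
        return data[row - 1][column - 1];
    }

    public static void main(String[] args) {
        double[][] raw = {{1, 2}, {3, 4}};
        OneBasedMatrixSketch m = new OneBasedMatrixSketch(raw);

        // Same element, two conventions: callers mixing both must convert.
        System.out.println(raw[0][1]);
        System.out.println(m.getEntry(1, 2));
    }
}
```

Whichever convention the API picks, code that moves between the accessor and a native array has to apply the offset somewhere; the disagreement in this thread is about whether that conversion belongs in the library or in the caller.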

> 2) Change the name of "BivariateRegression" to "UnivariateRegression" (or
> something else)

If we're bothering to change its name to make it less confusing, let's call
it what it is, SimpleLeastSquaresRegression.  If that is too long, then
SimpleRegression, as least squares is the method implied when one mentions
regression.
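
For readers unfamiliar with the term, a minimal sketch of simple (one predictor, one response) least squares regression follows. SimpleRegressionSketch and its methods are hypothetical names for this illustration and do not reproduce the class under discussion; the formulas are the standard closed-form estimates, slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x).

```java
// Illustrative sketch only; not the Commons Math regression class.
final class SimpleRegressionSketch {

    private static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    /** Least squares slope: sum of cross-deviations over sum of squared x-deviations. */
    static double slope(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs");
        }
        double meanX = mean(x);
        double meanY = mean(y);
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values are all identical");
        }
        return sxy / sxx;
    }

    /** Least squares intercept: the fitted line passes through (mean(x), mean(y)). */
    static double intercept(double[] x, double[] y) {
        return mean(y) - slope(x, y) * mean(x);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};   // exactly y = 2x + 1
        System.out.println("slope     = " + slope(x, y));
        System.out.println("intercept = " + intercept(x, y));
    }
}
```

Both methods take two arrays, which is the sense in which even the one-predictor case involves two variable quantities.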

> 3) Change Variance to be configurable to generate the population
> statistic.

Since population variance and sample variance are different statistics, they
should be different classes as that is the design we have chosen.

As for the static methods on the variance and standard deviation classes,
the javadoc should be changed to better explain the source of the mean
argument.  The comments should indicate the mean is pre-computed using the
same values that are going to be used to compute the variation estimate.
Passing in any other mean will make the variation computation
unreliable.

> 4) Combine the univariate and multivariate packages, since it is confusing
> to separate statistics that focus on one variable and sometimes the word
> "univariate" is used in the context of multivariate techniques (e.g.
> "Univariate Anova").

"Regression is used to study relationships between measurable variables."
[Weisberg, 1985]

"Regression analysis is a statistical tool that utilizes the relations
between two or more quantitative variables..." [Neter, et al., 1985]

Both these statements indicate regression is a technique that involves more
than one variable.  Therefore, regression in general is a multivariate
technique.  The case where there is only one predictor is immaterial as
there are two variable quantities.  Would one call a model with one
predictor variable and two response variables a univariate technique?  I
wouldn't and I doubt if anyone else would.  The path we have chosen, placing
procedures dealing with one variable in the univariate package and all other
procedures, those dealing with more than one variable, in the multivariate
package, is satisfactory and makes for a good discriminant.

Brent Worden



Re: [MATH] Summary proposed changes

Posted by Kim van der Linde <ki...@kimvdlinde.com>.


Phil Steitz wrote:

> Kim van der Linde wrote:
> 
>> Well, I had a discussion with several colleagues (type science users,
>> we went snorkeling) on several of these issues. The score of the day
>> was the idea that the simple linear LS regression was considered a
>> multivariate statistic.
> 
> 
> So it should stay where it is.

Huh, excuse me, they all disagreed completely with you; they all say it
is univariate!

Kim
-- 
http://www.kimvdlinde.com

