Posted to dev@commons.apache.org by Greg Sterijevski <gs...@gmail.com> on 2011/07/14 06:34:55 UTC

[Math] R Squared Consistency

All,

I am working on some additions to the regression package and have run into a
bit of difficulty.

The statistical R Squared is equal to 1.0 -
SumOfSquaredError/SumOfSquaresTotal. Say that I run my regression two
different ways. In the first, I tell the regression technique to include
a constant, so SumOfSquaresTotal = Sum( ( Y - Mean(Y) )^2 ). In the
second, I tell the regression technique not to include a constant, but
I do include one in the data I supply (one RHS variable is always set to
one). The fitted models are identical, but the R Squared values may not
agree, since in the second run the implementation assumes Mean(Y) = 0.0.
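To make the discrepancy concrete, here is a minimal, self-contained Java sketch. The observations and fitted values are made-up numbers (not from any real regression); the point is only that the same residuals yield different R Squared values depending on whether the total sum of squares is centered about Mean(Y) or about zero.

```java
public class RSquaredDemo {
    public static void main(String[] args) {
        // Made-up observations and fitted values (illustration only).
        double[] y    = {2.0, 3.0, 5.0, 7.0, 11.0};
        double[] yHat = {2.2, 3.1, 4.8, 7.3, 10.6};

        double mean = 0.0;
        for (double v : y) mean += v;
        mean /= y.length;

        double sse = 0.0, sstCentered = 0.0, sstUncentered = 0.0;
        for (int i = 0; i < y.length; i++) {
            double e = y[i] - yHat[i];
            sse += e * e;                                  // SumOfSquaredError
            sstCentered += (y[i] - mean) * (y[i] - mean);  // about Mean(Y)
            sstUncentered += y[i] * y[i];                  // assumes Mean(Y) = 0
        }

        // Same fit, two different totals, two different R Squared values.
        System.out.printf("R^2 (centered SST)   = %.4f%n", 1.0 - sse / sstCentered);
        System.out.printf("R^2 (uncentered SST) = %.4f%n", 1.0 - sse / sstUncentered);
    }
}
```

With these numbers the centered and uncentered denominators differ by a factor of about four, so the two R Squared figures cannot agree even though the fit is the same.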

The question to the list is: what is the proper course of action? Ignore
the obvious inconsistency? Force a mean? (That is not a good solution.)
Or empirically test the data as it comes in: if an independent variable
exhibits zero variance, it must be the constant, so I set the flag for
it and get the correct result?
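A rough sketch of that detection idea (a hypothetical helper, not existing Commons Math code): scan the regressor matrix for a zero-variance column and report its index, so the caller could treat it as the constant.

```java
public class ConstantColumnCheck {
    // Hypothetical helper (not in Commons Math): return the index of the
    // first zero-variance column of x, or -1 if every column varies.
    static int findConstantColumn(double[][] x) {
        for (int j = 0; j < x[0].length; j++) {
            boolean constant = true;
            for (int i = 1; i < x.length; i++) {
                if (x[i][j] != x[0][j]) { constant = false; break; }
            }
            if (constant) return j;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Second column is the unitary "constant" the caller supplied as data.
        double[][] x = {{1.5, 1.0}, {2.5, 1.0}, {4.0, 1.0}};
        System.out.println(findConstantColumn(x));  // prints 1
    }
}
```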

Thoughts?

 -Greg

Re: [Math] R Squared Consistency

Posted by Phil Steitz <ph...@gmail.com>.
On 7/13/11 9:34 PM, Greg Sterijevski wrote:
> All,
>
> I am working on some additions to the regression package and have run into a
> bit of difficulty.
>
> The statistical R Squared is equal to 1.0 -
> SumOfSquaredError/SumOfSquaresTotal. Say that I run my regression two
> different ways. In the first, I tell the regression technique to include
> a constant, so SumOfSquaresTotal = Sum( ( Y - Mean(Y) )^2 ). In the
> second, I tell the regression technique not to include a constant, but
> I do include one in the data I supply (one RHS variable is always set to
> one). The fitted models are identical, but the R Squared values may not
> agree, since in the second run the implementation assumes Mean(Y) = 0.0.

The models are not identical, because in the second case, the
unitary column is a (zero variance) regressor.  I would say whoever
supplied the data did not understand the API and the reported
R-square is meaningless.  This is why we indicate in the javadoc
that the data should *not* include unitary columns, but the
hasIntercept property should be used instead to indicate that the
model should include an intercept term.
>
> The question to the list is: what is the proper course of action? Ignore
> the obvious inconsistency? Force a mean? (That is not a good solution.)
> Or empirically test the data as it comes in: if an independent variable
> exhibits zero variance, it must be the constant, so I set the flag for
> it and get the correct result?

It is never a good idea to try to "fix" API abuses / data anomalies
/ bad specifications, other than to throw exceptions on easily
discernible precondition violations.  In this case, for example,
zero-variance regressors may actually occur in data, and a decision
on our part to change the model specification may not be at all what
the user wants. Our contract with users is that we clearly document
how the API works (preconditions, algorithms, etc.) and compute what
we say we compute.  I would say don't do anything special in this case.
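For illustration, a fail-fast precondition check of the kind described above might look like this. This is a hypothetical sketch, not actual Commons Math code: it rejects data containing a unitary column when the caller has already asked for an intercept, which is an easily discernible misuse, and otherwise leaves the data alone.

```java
public class PreconditionCheck {
    // Hypothetical validation (not actual Commons Math code): when the
    // caller requested an intercept, reject data that also contains a
    // unitary column rather than silently changing the model.
    static void checkNoUnitaryColumn(double[][] x, boolean hasIntercept) {
        if (!hasIntercept) return;
        for (int j = 0; j < x[0].length; j++) {
            boolean allOnes = true;
            for (double[] row : x) {
                if (row[j] != 1.0) { allOnes = false; break; }
            }
            if (allOnes) {
                throw new IllegalArgumentException(
                    "regressor column " + j + " is unitary; "
                    + "use the hasIntercept flag instead");
            }
        }
    }

    public static void main(String[] args) {
        checkNoUnitaryColumn(new double[][]{{1.5, 2.0}, {2.5, 3.0}}, true);  // ok
        try {
            checkNoUnitaryColumn(new double[][]{{1.5, 1.0}, {2.5, 1.0}}, true);
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```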

Phil
>
> Thoughts?
>
>  -Greg
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org