Posted to dev@commons.apache.org by Phil Steitz <ph...@steitz.com> on 2008/06/09 04:17:00 UTC

[math] Improving numerics in OLSMultipleLinearRegression

While clear and elegant from a matrix algebra standpoint, the "naive" 
implementation in OLSMultipleLinearRegression has poor numerical 
qualities.  It is well known that solving the normal equations directly 
does not give good numerics.  I just added some tests to actually verify 
parameter values, using the classic "Longley" dataset, for which NIST 
provides certified statistics.  This is a "hard" design matrix.  R was 
able to get to within 1E-8 of the certified parameter values.  
OLSMultipleLinearRegression can only get to within 1E-1.
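(The root cause is easy to state: forming X'X squares the condition 
number of the problem, since cond(X'X) = cond(X)^2.  So the 
normal-equations solve

    b = (X'X)^-1 X' y

loses roughly twice as many digits of accuracy as a method that works 
on X directly, such as QR.)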

We have talked in the past about providing an implementation based on QR 
decomposition.   Anyone up for  using the QR decomposition that we now 
have to do this?  I really think we need to do it (or something else to 
improve numerics) before releasing this class.  I will get to it 
eventually, but am a little pegged at the moment.  I will review and 
apply patches if someone is willing to do the implementation.  I can 
also explain here or offline how the R tests and NIST datasets work, as 
these are useful in validating code.

Another thing that we should think about before releasing any of this 
stuff is the completeness of the API.  Many standard regression 
statistics are missing.  If we are going to stick with the Interface / 
Implementation setup, we need to get the right stuff into the 
interface.  It is also awkward to have to insert a column of 1's in the 
design matrix to get an intercept term computed.  This is convenient for 
the implementation, but awkward for users.  A more natural setup (IMHO) 
would be to expose a "noIntercept" or "hasIntercept" property on the model.
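For concreteness, the padding every caller currently has to do looks 
roughly like this (a hypothetical helper, just to illustrate the 
busywork a hasIntercept property would remove):

    // Pad a raw design matrix with a leading column of 1's so that the
    // first estimated parameter becomes the intercept.
    static double[][] addInterceptColumn(double[][] x) {
        double[][] augmented = new double[x.length][x[0].length + 1];
        for (int i = 0; i < x.length; i++) {
            augmented[i][0] = 1.0;  // intercept column
            System.arraycopy(x[i], 0, augmented[i], 1, x[i].length);
        }
        return augmented;
    }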





Re: [math] Improving numerics in OLSMultipleLinearRegression

Posted by Mauro Talevi <ma...@aquilonia.org>.
Phil Steitz wrote:

> No, just X.  See the references here:
> http://apache.markmail.org/message/3aybm5emimg5da42
> I think R uses QR as described above.  Comments or suggestions for other 
> default implementations are most welcome.  We should aim to provide a 
> default implementation that is reasonably fast and provides good 
> numerics across a broad range of design matrices.

Ok - noted.  I'll take a look at the numerics issue during the week.

> We do need to decide what the API is, so even if it takes a while to 
> implement things, or the initial implementations are naive, we should 
> decide what statistics we are going to provide and how we are going to 
> provide them.  Same for the specification of models (i.e., "input data")

Yes - agreed, but I meant to say that before we start adding these methods
to the interfaces, we should decide on the whole list of statistics and 
input data - and that can be done on a wiki page, where people can 
add/comment.

>> Perhaps it would help if we had overloaded newData methods that accept 
>> different input strategies, but ultimately they will produce an n x m 
>> double array.  That way we can provide users with choice.

> I was thinking the same thing.  

Ok

> The bit that is troubling me is the 
> omega matrix required by GLS cluttering the OLS interface.  Other types 
> of models (e.g. weighted) will require other data.  Could be we need 
> separate interfaces for the different types of regression, but maybe it 
> is better to dispense with the abstract interface altogether.  The 
> reason we have interface / implementation separation is to allow 
> alternative implementations to be plugged in.  Given the 2.0 approach to 
> support IOC, what may make more sense is to just encapsulate the core 
> model estimators (things like R's lm, gls),  make them pluggable via 
> setters or constructors and get rid of the abstract interface.  Any 
> thoughts on this?
> 

I see your point.  What made me fall on the side of a unified interface 
was that OLS can be seen as a special case of GLS.  But yes, the 
covariance matrix muddles the OLS case.  I still think an interface 
defining the common statistics available from the different types of 
regression might be useful.  We would just not add the data input to the 
interface, which would instead be implementation-specific.
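For what it's worth, such a read-only interface might look roughly like 
this (method names are illustrative, not a final API):

    // Sketch of a common read-only regression interface: only the
    // statistics every regression type can provide, no data input.
    public interface MultipleLinearRegression {
        double[] estimateRegressionParameters();           // the betas
        double[][] estimateRegressionParametersVariance(); // covariance of the betas
        double[] estimateResiduals();                      // y - X*beta
        double estimateRegressandVariance();               // variance of y
    }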

I'm all for pluggable/IOC approaches, but I fail to see how this would 
get rid of the interface.

Cheers










Re: [math] Improving numerics in OLSMultipleLinearRegression

Posted by Mauro Talevi <ma...@aquilonia.org>.
Phil Steitz wrote:
>> Perhaps it would help if we had overloaded newData methods that accept 
>> different input strategies, but ultimately they will produce an n x m 
>> double array.  That way we can provide users with choice.
> I was thinking the same thing.  The bit that is troubling me is the 
> omega matrix required by GLS cluttering the OLS interface.  Other types 
> of models (e.g. weighted) will require other data.  Could be we need 
> separate interfaces for the different types of regression, but maybe it 
> is better to dispense with the abstract interface altogether.  The 
> reason we have interface / implementation separation is to allow 
> alternative implementations to be plugged in. 

Phil - I created a new issue for this refactor:

https://issues.apache.org/jira/browse/MATH-211

For the moment I kept the MultipleLinearRegression interface as a common 
read-only interface, pushing down the data input to the implementing 
classes.   IMO there is a benefit in maintaining an interface that 
defines what you obtain from a regression, regardless of input and 
implementation.  It also helps with mocking strategies.

The attached patch also incorporates the loadModelData() method that 
you had used in the OLS tests - i.e. it has now been pulled up into the 
abstract regression class (renamed to newSampleData() for consistency, 
but we can swap "sample" for "model" - it's just semantics).  The tests 
have been refactored to use the new input method.

Cheers





Re: [math] Improving numerics in OLSMultipleLinearRegression

Posted by Phil Steitz <ph...@steitz.com>.
Mauro Talevi wrote:
> Phil,
>
> Phil Steitz wrote:
>> I think R uses QR as described above.  Comments or suggestions for 
>> other default implementations are most welcome.  We should aim to 
>> provide a default implementation that is reasonably fast and provides 
>> good numerics across a broad range of design matrices.
>
> Got around to testing QR decomposition on OLS.
>
> The short answer is that it does not seem to make much difference. 
> Rather, it looks like the dataset and the number of observations are 
> far more significant for good numerics, as is also shown by your 
> recent addition of the Swiss Fertility dataset (with nobs 3 times as 
> large), for which results match (either with or without QR) up to 
> 10^-12 tolerance.
See comments below on implementation.  Good numerics means good accuracy 
over a broad range of input data.  The Longley dataset is "hard" 
numerically and the Swiss Fertility dataset is an easy one. 
>
> Here's the resulting numerics for comparison:
>
> Longley dataset (nobs=16):
>
> PL=[-3482258.6569276676, 15.06187677821299, -0.03581918037047249, 
> -2.0202298136474104, -1.0332268695603801, -0.051104103746114404, 
> 1829.1514737363977]
> QR=[-3482258.7119702557, 15.061873615257795, -0.03581918168712586, 
> -2.020229840231328, -1.0332268778552742, -0.05110409751647271, 
> 1829.1515061042903]
> LG=[-3482258.63459582, 15.0618722713733, -0.035819179292591, 
> -2.02022980381683, -1.03322686717359, -0.0511041056535807, 
> 1829.15146461355]
>
> Swiss Fertility dataset (nobs=47):
>
> PL=[91.05542390271336, -0.22064551045713723, -0.26058239824327045, 
> -0.9616123845602972, 0.12441843147162471]
> QR=[91.05542390271366, -0.22064551045714642, -0.26058239824326457, 
> -0.9616123845602974, 0.12441843147162669]
> SF=[91.05542390271397, -0.22064551045715, -0.26058239824328, 
> -0.9616123845603, 0.12441843147162]
>
> (Legend: PL = plain OLS, QR = QR-decomposed OLS, LG = Longley R 
> results, SF = Swiss Fertility R results).
>
> Interestingly, it's only on the intercept (i.e. the first regression 
> parameter) that we get the very poor numerics.  While not a numerical 
> argument, one could say that the statistically more significant 
> parameters are the slopes.
>
> Anyway, attached is a patch with the QR-based implementation and a 
> modified test to print out comparison results.
Sorry it took so long for me to review this.  To really take advantage 
of the QR decomposition, the upper-triangular system R b = Q' y (using 
your notation from the javadoc) should be solved by back-substitution, 
rather than by inverting R'R.  That will require a little more work to 
implement, but should improve accuracy.  I just opened MATH-217 to track 
this.
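The back-substitution itself is short.  A minimal sketch (a hypothetical 
standalone helper, not the actual MATH-217 patch):

    // Solve R b = q for b, where R is an n x n upper-triangular matrix
    // (from the QR decomposition of X) and q = Q'y.
    static double[] backSubstitute(double[][] r, double[] q) {
        int n = q.length;
        double[] b = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = q[i];
            for (int j = i + 1; j < n; j++) {
                sum -= r[i][j] * b[j];  // subtract already-solved terms
            }
            b[i] = sum / r[i][i];       // divide by the diagonal pivot
        }
        return b;
    }

Because R is triangular, no matrix inversion (and no extra rounding from 
forming R'R) is needed.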

Phil
 
 




Re: [math] Improving numerics in OLSMultipleLinearRegression

Posted by Mauro Talevi <ma...@aquilonia.org>.
Phil,

Phil Steitz wrote:
> I think R uses QR as described above.  Comments or suggestions for other 
> default implementations are most welcome.  We should aim to provide a 
> default implementation that is reasonably fast and provides good 
> numerics across a broad range of design matrices.

Got around to testing QR decomposition on OLS.

The short answer is that it does not seem to make much difference. 
Rather, it looks like the dataset and the number of observations are far 
more significant for good numerics, as is also shown by your recent 
addition of the Swiss Fertility dataset (with nobs 3 times as large), for 
which results match (either with or without QR) up to 10^-12 tolerance.

Here's the resulting numerics for comparison:

Longley dataset (nobs=16):

PL=[-3482258.6569276676, 15.06187677821299, -0.03581918037047249, 
-2.0202298136474104, -1.0332268695603801, -0.051104103746114404, 
1829.1514737363977]
QR=[-3482258.7119702557, 15.061873615257795, -0.03581918168712586, 
-2.020229840231328, -1.0332268778552742, -0.05110409751647271, 
1829.1515061042903]
LG=[-3482258.63459582, 15.0618722713733, -0.035819179292591, 
-2.02022980381683, -1.03322686717359, -0.0511041056535807, 1829.15146461355]

Swiss Fertility dataset (nobs=47):

PL=[91.05542390271336, -0.22064551045713723, -0.26058239824327045, 
-0.9616123845602972, 0.12441843147162471]
QR=[91.05542390271366, -0.22064551045714642, -0.26058239824326457, 
-0.9616123845602974, 0.12441843147162669]
SF=[91.05542390271397, -0.22064551045715, -0.26058239824328, 
-0.9616123845603, 0.12441843147162]

(Legend: PL = plain OLS, QR = QR-decomposed OLS, LG = Longley R results, 
SF = Swiss Fertility R results).

Interestingly, it's only on the intercept (i.e. the first regression 
parameter) that we get the very poor numerics.  While not a numerical 
argument, one could say that the statistically more significant 
parameters are the slopes.

Anyway, attached is a patch with the QR-based implementation and a 
modified test to print out comparison results.

Cheers



Re: [math] Improving numerics in OLSMultipleLinearRegression

Posted by Phil Steitz <ph...@steitz.com>.
Mauro Talevi wrote:
> Phil Steitz wrote:
>
>> Yes, and I would distinguish performance optimization from numerical 
>> accuracy.  From my perspective, we can release a ".0" with room for 
>> performance improvement, but at least decent numerics are required.
>
> I agree that decent numerics are required.   I'm still rather 
> surprised that the diagonal covariance case would yield such bad 
> numerics wrt the GLS case - which has been tested against independent 
> Fortran code to a level of 10^-6.
I have only tested the OLS implementation.  To perform similar tests 
against R for the GLS impl, we need to look at the R "gls" function.  
See the link below for some comments on why we need to be careful with 
validation tests.
>
>>>> We have talked in the past about providing an implementation based 
>>>> on QR decomposition.   Anyone up for  using the QR decomposition 
>>>> that we now have to do this?  I really think we need to do it (or 
>>>> something else to improve numerics) before releasing this class.  I 
>>>> will get to it eventually, but am a little pegged at the moment. 
>
> Are you proposing doing a QR decomposition of both the X and Y 
> matrices and working out the formulas using the decomposed ones?
>
No, just X.  See the references here:
http://apache.markmail.org/message/3aybm5emimg5da42
I think R uses QR as described above.  Comments or suggestions for other 
default implementations are most welcome.  We should aim to provide a 
default implementation that is reasonably fast and provides good 
numerics across a broad range of design matrices. 
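(The reason decomposing X alone is enough: substituting X = QR into the 
normal equations X'X b = X' y gives R'Q'QR b = R'Q' y, and since 
Q'Q = I this collapses to R b = Q' y, which is triangular and can be 
solved directly.)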
>> Here are some initial ideas on what should be included in the 
>> multiple regression API.  Other suggestions welcome!
>>
>> 1.  Coefficients should be accompanied by standard errors, 
>> t-statistics, two-sided t probabilities (can get these using the t 
>> distribution from the distributions package) and ideally confidence 
>> intervals.
>> 2.  F, R-square, adjusted R-square, F prob (again can use 
>> distributions package to estimate)
>> 3.  ANOVA table (Regression sum of squares, residual sum of squares)
>> 4.  Residuals
>>
>> R, SAS, SPSS and Excel all represent (or in the case of R, can 
>> construct) these basic statistics in some way in their output.  We 
>> should model them in classes representing properties of the computed 
>> model.
>
> Perhaps we should put these on the wiki or, even better, in JIRA.
> IMO, it's best to deal with the numerics and the new data input 
> strategies before adding new functionality into the framework.
We do need to decide what the API is, so even if it takes a while to 
implement things, or the initial implementations are naive, we should 
decide what statistics we are going to provide and how we are going to 
provide them.  Same for the specification of models (i.e., "input data")
>
>>> And finally, how do you see the no/hasIntercept model working?
>> As a configurable property - noIntercept means the model is estimated 
>> without an intercept.   The point I was making was more about how the 
>> data is supplied via the API.  It is awkward to have to fill in a 
>> column of 1's to get the linear algebra to work to estimate a model 
>> with an intercept (which should be the default).
>
> ok - good point.
>
>> I would recommend that we have setData or "newData" provide an n x m 
>> matrix, where n is the number of observations and m-1 is the number 
>> of independent variables.  Then either a) have the constructor take 
>> another argument specifying which column holds the dependent variable, 
>> b) assume it is the first column, c) support column labels and some 
>> form of model specification such as what R provides (a lot of work), 
>> or d) split off the y vector, so setting data requires separate x and 
>> y vectors.  Probably a) is easiest for users, who will most often be 
>> starting with a rectangular array of data with the dependent variable 
>> in one of the columns.
>
> Perhaps it would help if we had overloaded newData methods that accept 
> different input strategies, but ultimately they will produce an n x m 
> double array.  That way we can provide users with choice.
I was thinking the same thing.  The bit that is troubling me is the 
omega matrix required by GLS cluttering the OLS interface.  Other types 
of models (e.g. weighted) will require other data.  Could be we need 
separate interfaces for the different types of regression, but maybe it 
is better to dispense with the abstract interface altogether.  The 
reason we have interface / implementation separation is to allow 
alternative implementations to be plugged in.  Given the 2.0 approach to 
support IOC, what may make more sense is to just encapsulate the core 
model estimators (things like R's lm, gls),  make them pluggable via 
setters or constructors and get rid of the abstract interface.  Any 
thoughts on this?

Phil
 



Re: [math] Improving numerics in OLSMultipleLinearRegression

Posted by Mauro Talevi <ma...@aquilonia.org>.
Phil Steitz wrote:

> Yes, and I would distinguish performance optimization from numerical 
> accuracy.  From my perspective, we can release a ".0" with room for 
> performance improvement, but at least decent numerics are required.

I agree that decent numerics are required.   I'm still rather surprised 
that the diagonal covariance case would yield such bad numerics wrt the 
GLS case - which has been tested against independent Fortran code to a 
level of 10^-6.

>>> We have talked in the past about providing an implementation based on 
>>> QR decomposition.   Anyone up for  using the QR decomposition that we 
>>> now have to do this?  I really think we need to do it (or something 
>>> else to improve numerics) before releasing this class.  I will get to 
>>> it eventually, but am a little pegged at the moment. 

Are you proposing doing a QR decomposition of both the X and Y matrices 
and working out the formulas using the decomposed ones?

> Here are some initial ideas on what should be included in the multiple 
> regression API.  Other suggestions welcome!
> 
> 1.  Coefficients should be accompanied by standard errors, t-statistics, 
> two-sided t probabilities (can get these using the t distribution from 
> the distributions package) and ideally confidence intervals.
> 2.  F, R-square, adjusted R-square, F prob (again can use distributions 
> package to estimate)
> 3.  ANOVA table (Regression sum of squares, residual sum of squares)
> 4.  Residuals
> 
> R, SAS, SPSS and Excel all represent (or in the case of R, can 
> construct) these basic statistics in some way in their output.  We 
> should model them in classes representing properties of the computed model.

Perhaps we should put these on the wiki or, even better, in JIRA.
IMO, it's best to deal with the numerics and the new data input 
strategies before adding new functionality into the framework.

>> And finally, how do you see the no/hasIntercept model working?
> As a configurable property - noIntercept means the model is estimated 
> without an intercept.   The point I was making was more about how the 
> data is supplied via the API.  It is awkward to have to fill in a 
> column of 1's to get the linear algebra to work to estimate a model 
> with an intercept (which should be the default).

ok - good point.

> I would recommend that we have setData or "newData" provide an n x m 
> matrix, where n is the number of observations and m-1 is the number of 
> independent variables.  Then either a) have the constructor take 
> another argument specifying which column holds the dependent variable, 
> b) assume it is the first column, c) support column labels and some 
> form of model specification such as what R provides (a lot of work), 
> or d) split off the y vector, so setting data requires separate x and 
> y vectors.  Probably a) is easiest for users, who will most often be 
> starting with a rectangular array of data with the dependent variable 
> in one of the columns.

Perhaps it would help if we had overloaded newData methods that accept 
different input strategies, but ultimately they will produce an n x m 
double array.  That way we can provide users with choice.

I'll get to it in the next week or so - ATM I'm a bit loaded myself.

Cheers




Re: [math] Improving numerics in OLSMultipleLinearRegression

Posted by Phil Steitz <ph...@steitz.com>.
Mauro Talevi wrote:
> Hi Phil,
>
> thanks for reviewing the multiple linear regression implementations 
> and setting up the R/NIST data tests.   I finally got around to 
> installing R and can now run them too.
>
> Phil Steitz wrote:
>> While clear and elegant from a matrix algebra standpoint, the 
>> "naive" implementation in OLSMultipleLinearRegression has poor 
>> numerical qualities.  It is well known that solving the normal 
>> equations directly does not give good numerics.  I just added some 
>> tests to actually verify parameter values, using the classic "Longley" 
>> dataset, for which NIST provides certified statistics.  This is a 
>> "hard" design matrix.  R was able to get to within 1E-8 of the 
>> certified parameter values.  OLSMultipleLinearRegression can only get 
>> to within 1E-1.
>
> The OLS implementation was added as a simple by-product of the GLS 
> case - which is the main one I have needed for hypothesis testing 
> - as it came "for free" with an identity covariance matrix.
> True - the emphasis was on clarity and formulaic simplicity, and also 
> on following the old Donald Knuth maxim that "premature optimization 
> is the root of all evil".  But it seems like there is a need for 
> refinement of the implementation - the devil raised his head :-)
>
Yes, and I would distinguish performance optimization from numerical 
accuracy.  From my perspective, we can release a ".0" with room for 
performance improvement, but at least decent numerics are required.
>> We have talked in the past about providing an implementation based on 
>> QR decomposition.   Anyone up for  using the QR decomposition that we 
>> now have to do this?  I really think we need to do it (or something 
>> else to improve numerics) before releasing this class.  I will get to 
>> it eventually, but am a little pegged at the moment.  I will review 
>> and apply patches if someone is willing to do the implementation.  I 
>> can also explain here or offline how the R tests and NIST datasets 
>> work, as these are useful in validating code.
>
> I'd be happy to improve the impl.  I'm getting my head around R and 
> NIST, but perhaps a chat offline would not hurt!
I may be hard to catch synchronously, as my day-and-night job is a 
little demanding, but I would be happy to answer questions (with maybe a 
little latency ;)
>
>> Another thing that we should think about before releasing any of this 
>> stuff is the completeness of the API.  Many standard regression 
>> statistics are missing.  If we are going to stick with the Interface 
>> / Implementation setup, we need to get the right stuff into the 
>> interface.  It is also awkward to have to insert a column of 1's in 
>> the design matrix to get an intercept term computed.  This is 
>> convenient for the implementation, but awkward for users.  A more 
>> natural setup (IMHO) would be to expose a "noIntercept" or 
>> "hasIntercept" property on the model.
>
> No problem with adding other statistics - let's just decide on what 
> the standard regression API is.
Here are some initial ideas on what should be included in the multiple 
regression API.  Other suggestions welcome!

1.  Coefficients should be accompanied by standard errors, t-statistics, 
two-sided t probabilities (can get these using the t distribution from 
the distributions package) and ideally confidence intervals.
2.  F, R-square, adjusted R-square, F prob (again can use distributions 
package to estimate)
3.  ANOVA table (Regression sum of squares, residual sum of squares)
4.  Residuals

R, SAS, SPSS and Excel all represent (or in the case of R, can 
construct) these basic statistics in some way in their output.  We 
should model them in classes representing properties of the computed 
model. 
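As a rough sketch of how such a properties-of-the-computed-model class 
might hang together (field names are illustrative, not an actual 
commons-math class):

    // Hypothetical value object holding the statistics listed above.
    public class RegressionResults {
        private final double[] parameters;      // estimated coefficients
        private final double[] standardErrors;  // one per coefficient
        private final double[] tStatistics;     // parameter / standard error
        private final double rSquare;
        private final double adjustedRSquare;
        private final double[] residuals;

        public RegressionResults(double[] parameters, double[] standardErrors,
                double[] tStatistics, double rSquare, double adjustedRSquare,
                double[] residuals) {
            this.parameters = parameters;
            this.standardErrors = standardErrors;
            this.tStatistics = tStatistics;
            this.rSquare = rSquare;
            this.adjustedRSquare = adjustedRSquare;
            this.residuals = residuals;
        }
        // plus getters and, eventually, the ANOVA table and F statistics
    }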
>
> And finally, how do you see the no/hasIntercept model working?
As a configurable property - noIntercept means the model is estimated 
without an intercept.   The point I was making was more about how the 
data is supplied via the API.  It is awkward to have to fill in a column 
of 1's to get the linear algebra to work to estimate a model with an 
intercept (which should be the default).

I would recommend that we have setData or "newData" provide an n x m 
matrix, where n is the number of observations and m-1 is the number of 
independent variables.  Then either a) have the constructor take another 
argument specifying which column holds the dependent variable, b) assume 
it is the first column, c) support column labels and some form of model 
specification such as what R provides (a lot of work), or d) split off 
the y vector, so setting data requires separate x and y vectors.  
Probably a) is easiest for users, who will most often be starting with a 
rectangular array of data with the dependent variable in one of the 
columns.
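For concreteness, options a) and d) might look like this (a hypothetical 
interface, illustrative names only):

    // Sketch of overloaded data-input methods, not a committed API.
    public interface SampleDataInput {
        // option a): one rectangular array plus the index of the column
        // holding the dependent variable
        void newSampleData(double[][] data, int yColumn);

        // option d): the dependent variable split off into its own vector
        void newSampleData(double[] y, double[][] x);
    }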

Phil
 



Re: [math] Improving numerics in OLSMultipleLinearRegression

Posted by Mauro Talevi <ma...@aquilonia.org>.
Hi Phil,

thanks for reviewing the multiple linear regression implementations and 
setting up the R/NIST data tests.   I finally got around to installing R 
and can now run them too.

Phil Steitz wrote:
> While clear and elegant from a matrix algebra standpoint, the "naive" 
> implementation in OLSMultipleLinearRegression has poor numerical 
> qualities.  It is well known that solving the normal equations directly 
> does not give good numerics.  I just added some tests to actually verify 
> parameter values, using the classic "Longley" dataset, for which NIST 
> provides certified statistics.  This is a "hard" design matrix.  R was 
> able to get to within 1E-8 of the certified parameter values.  
> OLSMultipleLinearRegression can only get to within 1E-1.

The OLS implementation was added as a simple by-product of the GLS 
case - which is the main one I have needed for hypothesis testing - as 
it came "for free" with an identity covariance matrix.
True - the emphasis was on clarity and formulaic simplicity, and also on 
following the old Donald Knuth maxim that "premature optimization is the 
root of all evil".  But it seems like there is a need for refinement of 
the implementation - the devil raised his head :-)

> We have talked in the past about providing an implementation based on QR 
> decomposition.   Anyone up for  using the QR decomposition that we now 
> have to do this?  I really think we need to do it (or something else to 
> improve numerics) before releasing this class.  I will get to it 
> eventually, but am a little pegged at the moment.  I will review and 
> apply patches if someone is willing to do the implementation.  I can 
> also explain here or offline how the R tests and NIST datasets work, as 
> these are useful in validating code.

I'd be happy to improve the impl.  I'm getting my head around R and 
NIST, but perhaps a chat offline would not hurt!

> Another thing that we should think about before releasing any of this 
> stuff is the completeness of the API.  Many standard regression 
> statistics are missing.  If we are going to stick with the Interface / 
> Implementation setup, we need to get the right stuff into the 
> interface.  It is also awkward to have to insert a column of 1's in 
> the design matrix to get an intercept term computed.  This is 
> convenient for the implementation, but awkward for users.  A more 
> natural setup (IMHO) would be to expose a "noIntercept" or 
> "hasIntercept" property on the model.

No problem with adding other statistics - let's just decide on what the 
standard regression API is.

And finally, how do you see the no/hasIntercept model working?

Cheers



