Posted to dev@commons.apache.org by Phil Steitz <ph...@gmail.com> on 2011/06/24 21:06:20 UTC

[math] Re: Additions to support Large Linear Regression problems

On 6/24/11 11:44 AM, Greg Sterijevski wrote:
> Hello All,
>
> I have been a user of the math commons jar for a little over a year and am
> very impressed with it. I was wondering whether anyone is actively working
> on implementing functionality to do regressions on very very large data
> sets. The current implementation of the OLS routine is an in-core QR
> decomposition with substitution. While the solutions are typically accurate,
> the in-core nature limits the usefulness of these objects.
>
> Looking through the code, most of the implementation of an InputStream based
> regression routine would respect the contract implicit in the interface
> MultipleLinearRegression. However, large regression problems are important
> enough that there should be a way to:
>
> 1. Wrap a potentially large data source, perhaps as an InputStream of some
> sort.
> 2. Have a separate contract with methods like clear() ( to clear whatever
> intermediate calculations are stored), and regress() which generates
> immutable results that are not affected by further updates of the data.
>
> I would appreciate any thoughts or comments, as well as suggestions about
> functionality already in math commons which might address some points I
> raised.
>
> Thank you,
>
> -Greg
>
Hi Greg,

Thanks for the feedback and suggestion.  You are correct that the
multiple regression classes use a QR decomposition of the design
matrix, so they are not really suitable for very large datasets.  I
agree that this would make a good enhancement, and I would be willing
to work with you on design and implementation.  The SimpleRegression
class, which handles only bivariate regression, has what amounts to a
streaming interface now, so for bivariate models, arbitrary-sized
datasets can be accommodated with the current code.  Multiple
regression will require some more work.
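
To illustrate why the bivariate case can be streamed, here is a minimal
sketch of a fit that keeps only running means and centered sums, so memory
use does not grow with the number of observations.  The class name
StreamingBivariateFit and its methods are hypothetical and written for
this example; this is not the SimpleRegression source, only the same
general idea under the assumption of a model with an intercept.

```java
/**
 * Hypothetical illustration of a streaming bivariate least-squares fit.
 * Only five numbers are stored, regardless of how many points are added.
 */
final class StreamingBivariateFit {
    private long n;        // number of observations seen so far
    private double xBar;   // running mean of x
    private double yBar;   // running mean of y
    private double sumXX;  // running sum of squared deviations of x
    private double sumXY;  // running sum of cross deviations of x and y

    /** Folds one (x, y) observation into the running statistics. */
    void addData(double x, double y) {
        if (n == 0) {
            xBar = x;
            yBar = y;
        } else {
            double dx = x - xBar;
            double dy = y - yBar;
            double f = (double) n / (n + 1.0);
            sumXX += dx * dx * f;
            sumXY += dx * dy * f;
            xBar += dx / (n + 1.0);
            yBar += dy / (n + 1.0);
        }
        n++;
    }

    /** Discards all accumulated statistics. */
    void clear() {
        n = 0;
        xBar = yBar = sumXX = sumXY = 0.0;
    }

    long getN() {
        return n;
    }

    /** Returns the slope estimate, or NaN if it is not yet determined. */
    double getSlope() {
        if (n < 2 || sumXX == 0.0) {
            return Double.NaN;
        }
        return sumXY / sumXX;
    }

    /** Returns the intercept estimate, or NaN if the slope is undetermined. */
    double getIntercept() {
        return yBar - getSlope() * xBar;
    }
}
```

A caller would read observations from any source, call addData(x, y) for
each one, and query getSlope() and getIntercept() at any point.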

If you are interested in working on this, please open a JIRA and
start with a patch proposing the API enhancements above in a new
class.  I am not sure if it makes sense to have the new class extend
AbstractMultipleLinearRegression, since that class really is
fixed-model oriented and methods like estimateResiduals() would have to
be dropped or replaced by methods returning streams.  I would say
start with a new class and do not feel constrained to conform to the
matrix-oriented API in the current (multiple) regression classes. 
The API of SimpleRegression may actually be a better model to start
with. 
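
As a starting point for discussion, the contract described in the
original message (add data incrementally, clear(), and a regress() whose
results are not affected by later updates) might be sketched as follows.
Everything here is an assumption made for illustration: the names
StreamingRegression and NormalEquationsRegression are hypothetical, and
the solver accumulates the normal equations (X'X and X'y) and solves them
by Gaussian elimination.  That is a swapped-in technique chosen for
brevity; it is less numerically stable than the QR approach in the
existing classes, and a real implementation would likely use an updating
decomposition instead.

```java
import java.util.Arrays;

/** Hypothetical contract; not part of the existing Commons Math API. */
interface StreamingRegression {
    void addObservation(double[] x, double y);
    void clear();
    /** Returns a fresh coefficient array, intercept first. */
    double[] regress();
}

/** Sketch that stores only X'X and X'y, so memory is O(p^2), not O(n p). */
final class NormalEquationsRegression implements StreamingRegression {
    private final int p;           // coefficients, including the intercept
    private final double[][] xtx;  // accumulated X'X
    private final double[] xty;    // accumulated X'y
    private long n;                // observations seen so far

    NormalEquationsRegression(int numRegressors) {
        p = numRegressors + 1;
        xtx = new double[p][p];
        xty = new double[p];
    }

    @Override
    public void addObservation(double[] x, double y) {
        if (x.length != p - 1) {
            throw new IllegalArgumentException("expected " + (p - 1) + " regressors");
        }
        double[] row = new double[p];
        row[0] = 1.0; // intercept column
        System.arraycopy(x, 0, row, 1, x.length);
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                xtx[i][j] += row[i] * row[j];
            }
            xty[i] += row[i] * y;
        }
        n++;
    }

    @Override
    public void clear() {
        for (double[] r : xtx) {
            Arrays.fill(r, 0.0);
        }
        Arrays.fill(xty, 0.0);
        n = 0;
    }

    @Override
    public double[] regress() {
        if (n < p) {
            throw new IllegalStateException("not enough observations");
        }
        // Work on a copy so the accumulators stay valid for further updates.
        double[][] a = new double[p][p + 1];
        for (int i = 0; i < p; i++) {
            System.arraycopy(xtx[i], 0, a[i], 0, p);
            a[i][p] = xty[i];
        }
        // Gaussian elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(a[pivot][col]) < 1e-12) { // crude absolute threshold
                throw new IllegalStateException("singular design");
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        // Back substitution.
        double[] beta = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            double s = a[i][p];
            for (int j = i + 1; j < p; j++) {
                s -= a[i][j] * beta[j];
            }
            beta[i] = s / a[i][i];
        }
        return beta;
    }
}
```

Because regress() returns a newly allocated array each time, a result
obtained earlier is unaffected by later calls to addObservation() or
clear(), which is one simple way to meet the immutability requirement.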

As we prepare for 3.0, we have the opportunity to improve / repair
the 2.x API, so if you have comments or suggestions for improvement
of the existing classes, those would also be most welcome.

Thanks!

Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org