Posted to dev@mahout.apache.org by "David G. Boney" <da...@gmail.com> on 2010/08/24 21:28:42 UTC

Contributing regression routines to Mahout

Hi,

I would like to contribute to Mahout. Who would be the point person on the following topics: linear algebra routines, regression (real inputs, real outputs), subset selection for regression (lasso or ridge regression), and spectral graph methods? 

I am in the process of implementing a least squares linear regression algorithm, LSQR. In my quick review of Mahout (and I make no claim of having dug into the code at this point), there appears to be extensive work on classifiers (discrete outputs) but not on regression (real outputs). I have an interest in building up a library of regression techniques (real inputs, real outputs). I am also interested in the implementation of the numerical linear algebra routines, as these algorithms are at the crux of most regression problems.


-------------
Sincerely,
David G. Boney
david.g.boney@gmail.com
http://www.sonicartifacts.com





Re: Contributing regression routines to Mahout

Posted by Ted Dunning <te...@gmail.com>.
See for instance this reference: http://arxiv.org/abs/0710.1435


Re: Contributing regression routines to Mahout

Posted by "David G. Boney" <da...@gmail.com>.
I am going to stick with LSQR. Part of this is an exercise to learn how to use Hadoop and Mahout. I also consulted with Dianne O'Leary at the University of Maryland, a leading numerical linear algebra researcher (and a former professor of mine). She highly recommended the algorithm. I want to start with something well known and reliable.

I am very interested in making additional contributions to the numerical linear algebra system. I will gladly review the method you suggested for future projects. 

Right now I am reading up on iterative methods in a book by Henk A. van der Vorst. 

Henk A. van der Vorst, (2009), Iterative Krylov Methods for Large Linear Systems.
http://www.amazon.com/Iterative-Cambridge-Monographs-Computational-Mathematics/dp/0521183707/ref=sr_1_3?ie=UTF8&s=books&qid=1283044114&sr=8-3
-------------
Sincerely,
David G. Boney
david.g.boney@gmail.com
http://www.sonicartifacts.com






Re: Contributing regression routines to Mahout

Posted by Ted Dunning <te...@gmail.com>.
LSQR is essentially a Krylov subspace technique since it is based on the
Lanczos algorithm.  As such, it is highly iterative and requires many passes
through the data.

It might be more efficient if you could use a random projection technique.
These algorithms can generally do the work of many Krylov iterations in a
single pass over the data.  I haven't looked into this in detail, but it
seems to me that a single pass could get you the least-squares fit for the
first k < 100 singular values, and that subsequent passes could be used on
the error to get the effect of 100 singular values per iteration.

The crux of the idea is that if you right multiply your large matrix A by a
tall, skinny random matrix R, you should be able to generate an orthonormal
matrix Q (together with a small triangular factor Y) such that Q Y = A R.
The exciting part is that Q Q' A will be a close approximation of A in terms
of Frobenius norm.  Minimizing || Q Q' A x - b ||_F should be possible
fairly quickly without actually constructing Q Q' A because (I think) you
should be able to minimize || Q' A x - Q' b ||_F instead.  Moreover, since
Q' A has limited rank, I think you can use (Q' A)' (Q' A) to find the answer
you want without the horrible numerical problems that approach usually
entails.
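
In numpy-style pseudocode the construction looks roughly like this (an
in-memory sketch of the idea only, nothing to do with the eventual Mahout
code; the dimensions are made up, and it is only useful when A is well
approximated by its top k singular directions):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 10000, 500, 100               # m >> n; keep roughly k directions
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    R = rng.standard_normal((n, k))         # tall, skinny random matrix
    Y = A @ R                               # one pass over A
    Q, _ = np.linalg.qr(Y)                  # orthonormal Q spanning range(A R)

    B = Q.T @ A                             # k x n reduced matrix (touches A again)
    c = Q.T @ b                             # k-vector
    x, *_ = np.linalg.lstsq(B, c, rcond=None)   # minimize || Q'A x - Q'b ||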

Less directly, you can form the partial SVD and use that to solve for x.

Solving the L_2 regularized system would proceed essentially the same way
with a fancier value for A.  I don't know how this would
extend to solving the L_1 regularized system.
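
Concretely, the fancier A for the L_2 (ridge) case can just be A with
sqrt(lambda) I stacked underneath it and zeros appended to b, so the same
least-squares machinery applies unchanged (continuing the numpy sketch
above, and still only an illustration):

    lam = 0.1
    A_ridge = np.vstack([A, np.sqrt(lam) * np.eye(n)])   # (m + n) x n
    b_ridge = np.concatenate([b, np.zeros(n)])
    # || A_ridge x - b_ridge ||^2  ==  || A x - b ||^2  +  lam * || x ||^2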

Since just reading A is the dominant cost and since you get something
proportional to k singular vectors in a single determination of
Q versus something proportional to 1 with each Krylov iteration, this could
get you two or three orders of magnitude improvement
in speed.

As usual, I would refer you to Halko, Martinsson, and Tropp's great survey
article at http://arxiv.org/abs/0909.4061v1

They refer to some approaches for least squares solutions, but I haven't
looked at the references that they cite.



Re: Contributing regression routines to Mahout

Posted by "David G. Boney" <da...@gmail.com>.
I assume that the proper etiquette is to send all correspondence to the list, and that side conversations are discouraged.

I apologize in advance if I appear ignorant of all that Mahout has to offer. I am currently reading through the book Mahout in Action and looking at the code.

The regression algorithm (real inputs, real outputs) that I am working to implement is LSQR. It is an iterative algorithm based on the conjugate gradient method. It has the nice property that it only uses the products Ax and A'y, where A is a matrix, A' is its transpose, and x & y are vectors. I would assume that with big data, one wants to avoid calculating the inverse of A, since this is O(n^3). Also, with sparse matrices, one wants to avoid calculating AA' because the result can be dense. Below is a link to the algorithm.

http://www.stanford.edu/group/SOL/software/lsqr.html
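
(As a small aside on that property: SciPy's implementation of the same LSQR algorithm only ever asks an operator for those two products, which is presumably the kind of interface a Hadoop-backed version would need to provide as well. A toy illustration, unrelated to the eventual Mahout code:)

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, lsqr

    rng = np.random.default_rng(0)
    m, n = 100, 5
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    # LSQR never sees A itself, only these two matrix-vector products.
    op = LinearOperator((m, n), matvec=lambda v: A @ v,    # v -> A v
                        rmatvec=lambda u: A.T @ u,         # u -> A' u
                        dtype=float)
    x = lsqr(op, b)[0]                                     # least-squares solution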

One issue I am working on is calculating Ax and A'y with only one pass through A but perhaps multiple passes through x and y. In general, for regression problems A has m rows and n columns where m >> n. The vector x has dimension n and the vector y has dimension m. It might be the case that x will fit in memory, but I think the design should not assume this. I also want to look at the cases where A is dense and where A is sparse.
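
The key observation for the one-pass version is that each row a_i of A contributes a_i . x to (Ax)_i and y_i * a_i to A'y, so both products can be accumulated while streaming over the rows of A once. A rough sketch of the access pattern (plain numpy, assuming for the moment that x and the running A'y accumulator fit in memory, which as noted above the real design should not rely on):

    import numpy as np

    def ax_and_aty(rows, x, y):
        # One pass over the rows of A, accumulating Ax and A'y together.
        # rows: iterable of (i, a_i) pairs; x has length n, y has length m.
        ax = np.zeros(len(y))
        aty = np.zeros(len(x))
        for i, a_i in rows:
            ax[i] = a_i @ x          # row i's entry of Ax
            aty += y[i] * a_i        # row i's contribution to A'y
        return ax, aty

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 4))
    x, y = rng.standard_normal(4), rng.standard_normal(6)
    ax, aty = ax_and_aty(enumerate(A), x, y)   # matches A @ x and A.T @ y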

LSQR can be used for maximum likelihood estimation of parameters of models that are linear in their coefficients. I will first work on linear regression then investigate the issue of using other basis sets. There may already be work in Mahout to draw on for this issue.
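
(For example, fitting y ~ c0 + c1*t + c2*t^2 + c3*sin(t) is still a linear least-squares problem in the coefficients once the basis functions are evaluated into the columns of A; a toy numpy illustration with made-up data:)

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 4.0, 200)
    y = 1.0 + 0.5 * t - 0.2 * t**2 + np.sin(t) + 0.05 * rng.standard_normal(t.size)

    # Columns of A are the basis functions evaluated at the sample points.
    A = np.column_stack([np.ones_like(t), t, t**2, np.sin(t)])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # roughly (1.0, 0.5, -0.2, 1.0)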

I think this will keep me busy for a while.

Future issues would include:
1. subset selection using the lasso or ridge regression
2. simple Bayesian regression, for which there may already be work to draw on in Mahout
3. special basis sets

My reference for least squares is:
Bjorck, Ake, (1996) Numerical Methods for Least Squares Problems 
http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=numerical+least+squares&x=0&y=0&ih=12_7_1_1_0_1_0_0_1_1.96_104&fsc=-1

My reference for numerical linear algebra is:
Saad, Yousef, (2003), Iterative Methods for Sparse Linear Systems
http://www-users.cs.umn.edu/~saad/books.html

-------------
Sincerely,
David G. Boney
david.g.boney@gmail.com
http://www.sonicartifacts.com





Re: Contributing regression routines to Mahout

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Aug 24, 2010 at 1:28 PM, David G. Boney <da...@gmail.com> wrote:

>
> I would like to contribute to Mahout. Who would be the point person on the
> following topics: linear algebra routines, regression (real inputs, real
> outputs), subset selection for regression (lasso or ridge regression), and
> spectral graph methods?
>

Several of us can help on linear algebra.

We have no linear regression to speak of at this point.

I have done a fair bit of work on gradient descent for regularization.

We have the beginnings of a spectral clustering model (not the same as
general spectral graph methods) and we have an OK, but not widely used,
large-scale eigenvector decomposer.

> I am in the process of implementing a least squares linear regression
> algorithm, LSQR. In my quick review of Mahout (and I make no claim of having
> dug into the code at this point), there appears to be extensive work on
> classifiers (discrete outputs) but not on regression (real outputs).
> I have an interest in building up a library of regression techniques (real
> inputs, real outputs).


As far as Mahout is concerned, scalability is the key question.  For many
regression problems, especially those with sparse inputs, gradient descent
is very effective and the current SGD for logistic regression could probably
be leveraged.  My guess is that for non-gradient descent methods, the SVD
decompositions would be a better starting point.

I believe that Vowpal Wabbit can be used for linear regression, which
probably implies that Mahout's SGD solver could be as well with small
changes to the gradient computation.
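
(The small change is essentially a different per-example gradient: for
squared loss the update direction is (w . x_i - y_i) x_i, where the logistic
case uses (sigmoid(w . x_i) - y_i) x_i. A toy numpy loop to show the shape
of it, not Mahout's actual SGD code:)

    import numpy as np

    def sgd_linear_regression(X, y, lr=0.01, epochs=50, l2=0.0):
        # Plain SGD for least squares; only the residual line differs from the
        # logistic-regression version.
        rng = np.random.default_rng(0)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):
                err = X[i] @ w - y[i]              # logistic: sigmoid(X[i] @ w) - y[i]
                w -= lr * (err * X[i] + l2 * w)    # gradient step, optional L2 term
        return w

    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 5))
    w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ w_true + 0.01 * rng.standard_normal(1000)
    w = sgd_linear_regression(X, y)                # close to w_true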


> I am also interested in the implementation of the numerical linear algebra
> routines, as these algorithms are at the crux of most regression problems.
>

We would love more help with this, especially in distributed cases.  Raw
numerical speed is rarely the bottleneck for Mahout code because large-scale
systems are typically I/O and network bound.  That said, much of our matrix
code is still as we inherited it, and while the quality is surprisingly high
for code without unit tests, I know from direct experience that there are
problems.  I am testing and correcting things as I need them, but you would
likely have a broader reach and thus might have substantially more impact
than I have had on the testing side.