Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2013/07/21 22:10:49 UTC

[jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

    [ https://issues.apache.org/jira/browse/MAHOUT-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714796#comment-13714796 ] 

Ted Dunning edited comment on MAHOUT-1273 at 7/21/13 8:09 PM:
--------------------------------------------------------------

The algorithm document describes a standard method for solving the normal equations in a single pass.  This works fine with a small number of variables, but is completely infeasible if we have a very large number of variables, as happens when the input is very sparse.

At least as important, the document ignores the problem of solving the regularized version of the problem.  A penalty function, p_lambda(x), is mentioned, but the algorithm quoted ignores this penalty and solves the unpenalized form.
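For concreteness, the single-pass accumulation being discussed can be sketched as follows. This is a hypothetical NumPy illustration (the function name and chunk format are invented here, not taken from the attached document): accumulate X'X and X'y in one pass over row chunks, then solve the L2-penalized normal equations in memory. It shows the easy L2 case only; the L1 penalty is exactly what this step does not handle.

```python
import numpy as np

def streaming_ridge(chunks, lam):
    """Accumulate X'X and X'y over one pass of row chunks, then solve
    the L2-penalized (ridge) normal equations (X'X + lam*I) w = X'y."""
    XtX, Xty = None, None
    for X, y in chunks:                      # one sequential pass over the data
        if XtX is None:
            d = X.shape[1]
            XtX, Xty = np.zeros((d, d)), np.zeros(d)
        XtX += X.T @ X                       # d x d sufficient statistic
        Xty += X.T @ y                       # length-d sufficient statistic
    return np.linalg.solve(XtX + lam * np.eye(len(Xty)), Xty)

# Splitting the data into chunks must give the same answer as one pass.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.arange(5.0)
y = X @ w_true                               # noiseless, so w should recover w_true
w = streaming_ridge([(X[:50], y[:50]), (X[50:], y[50:])], lam=1e-8)
```

Note that the accumulated statistics are d x d, which is why this breaks down for very large d.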


                
> Single Pass Algorithm for Penalized Linear Regression on MapReduce
> ------------------------------------------------------------------
>
>                 Key: MAHOUT-1273
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1273
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Kun Yang
>         Attachments: PenalizedLinear.pdf
>
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Penalized linear regression methods such as the Lasso and Elastic Net are widely used in machine learning, but there are no efficient, scalable implementations of them on MapReduce.
> The published distributed algorithms for solving this problem are either iterative (which is not a good fit for MapReduce; see Stephen Boyd's paper) or approximate (a problem if we need exact solutions; see parallelized stochastic gradient descent). Another disadvantage of these algorithms is that they cannot do cross validation in the training phase, so they require a user-specified penalty parameter in advance. 
> My ideas can train the model with cross validation in a single pass. They are based on some simple observations.
> I have implemented a primitive version of this algorithm at Alpine Data Labs. Advanced features such as an in-mapper combiner are employed to reduce network traffic in the shuffle phase.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

Posted by Kun Yang <ku...@stanford.edu>.
see the draft please!


Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

Posted by Kun Yang <ku...@stanford.edu>.
Hi Ted,

Thank you very much for your clarification; it is exactly the essence of my ideas. I will solve L1 and L2 and their combination (elastic net).

My experience is that there are some applications in bioinformatics (on the order of 1000 genes) or images (32 * 32 pixels) where the inputs are dense. So my plan is to implement the dense case first and the sparse case as the next step.

Best,
-Kun


Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

Posted by Ted Dunning <te...@gmail.com>.
A single pass with L2 regularization is relatively easy with a small number
of variables since the penalty is a quadratic form.

For L1 or elastic net, this is much more difficult.  It isn't clear to me
whether Michael is still proposing a single pass algorithm for that case.

It isn't hard to reduce the fitting part of the problem to a relatively
small dense problem if you have a small number of variables.  From there,
iterating on a single machine is pretty easy and as Michael says, the 1000
variable case can be solved quickly.  Presumably the 5000x5000 case can be
solved quickly enough to be still interesting.

The basic idea there is to convert a disk resident problem into a memory
resident problem.  Whether the memory resident part is solved in parallel
or not is a separate question.
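The reduction described above can be sketched roughly as follows; the mapper/reducer split is simulated in plain NumPy, and `map_stats`/`reduce_stats` are hypothetical names for illustration, not Mahout APIs. Each mapper turns its disk-resident split into small dense statistics; after the reduce, everything is memory resident and any iterative solver can run locally.

```python
import numpy as np

def map_stats(chunk):
    # Runs on each mapper over its disk-resident split: emit the small
    # dense sufficient statistics instead of the raw rows.
    X, y = chunk
    return X.T @ X, X.T @ y

def reduce_stats(parts):
    # Single reducer: sum the partial d x d and length-d statistics.
    G = sum(p[0] for p in parts)
    b = sum(p[1] for p in parts)
    return G, b

# After the reduce, the problem is memory resident: iterate freely on (G, b).
# Here a plain ridge solve stands in for whatever in-memory solver is used.
rng = np.random.default_rng(1)
X, w_true = rng.normal(size=(200, 4)), np.array([1.0, -2.0, 0.0, 3.0])
y = X @ w_true
splits = [(X[i::4], y[i::4]) for i in range(4)]     # 4 simulated mapper splits
G, b = reduce_stats([map_stats(s) for s in splits])
w = np.linalg.solve(G + 1e-8 * np.eye(4), b)
```

Whether the final solve is parallelized is, as noted, a separate question; the point is that no further pass over the disk-resident data is needed.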


To the extent non-sparse problems are important, this is a good step.  My
experience is that big-data optimizers almost always have sparse inputs.





Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

Posted by DB Tsai <db...@dbtsai.com>.
Coordinate descent is essentially an iterative algorithm; how can you do it
in a single pass over the data with L2 regularization?
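One possible answer, consistent with the rest of the thread: the single pass over the data builds the sufficient statistics G = X'X and b = X'y, and coordinate descent then iterates on those small dense objects without rereading the data. Below is a rough NumPy sketch of a Gram-based lasso coordinate descent; it is illustrative only, not the proposed Mahout implementation, and the function names are invented here.

```python
import numpy as np

def soft(z, t):
    # Soft-thresholding operator used by lasso coordinate descent.
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(G, b, lam, iters=200):
    """Coordinate descent for (1/2)||y - Xw||^2 + lam*||w||_1,
    using only G = X'X and b = X'y: the sweeps never touch the raw data."""
    w = np.zeros_like(b)
    for _ in range(iters):
        for j in range(len(w)):
            r_j = b[j] - G[j] @ w + G[j, j] * w[j]   # b_j minus cross terms
            w[j] = soft(r_j, lam) / G[j, j]
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
w_true = np.array([4.0, 0.0, -3.0, 0.0, 2.0, 0.0])
y = X @ w_true
G, b = X.T @ X, X.T @ y          # the single pass over the data happens here
w = lasso_cd(G, b, lam=0.1)      # all iterations run on the small (G, b)
```

So the algorithm is iterative, but the iterations are over a memory-resident d x d problem rather than over the data.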

Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

Posted by Michael Kun Yang <ku...@stanford.edu>.
I will update the document to detail the algorithm.



Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

Posted by Ted Dunning <te...@gmail.com>.
Should the document be updated to describe what you intend to do?

Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

Posted by Kun Yang <ku...@stanford.edu>.
PS: the MPI version (see Stephen Boyd's webpage) cannot do cross validation, which is quite essential. I think there is a trade-off: if you want to do cross validation, a large number of predictors will be an issue; if you want to handle a large number of predictors, cross validation will be an issue.

Friedman's paper on coordinate descent: http://www.jstatsoft.org/v33/i01/paper
Stephen Boyd's paper on the iterative distributed version: http://www.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf; their implementation is on MPI.

Best,
-Kun  
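The single-pass cross validation idea in this thread can be illustrated as follows: one pass accumulates per-fold statistics X'X, X'y, and y'y; each fold then trains on the totals minus its own statistics and scores its held-out error from them, with no further data pass. This is hypothetical NumPy for the L2 case, not the Alpine Data Labs implementation, and the function names are invented here.

```python
import numpy as np

def per_fold_stats(X, y, k):
    """One pass over the rows: accumulate per-fold X'X, X'y, and y'y.
    These small statistics are all that cross validation needs afterwards."""
    d = X.shape[1]
    G = np.zeros((k, d, d)); b = np.zeros((k, d)); s = np.zeros(k)
    for i in range(len(y)):
        f = i % k                                    # route row i to a fold
        G[f] += np.outer(X[i], X[i]); b[f] += y[i] * X[i]; s[f] += y[i] ** 2
    return G, b, s

def cv_ridge(G, b, s, lams):
    """Pick lambda by k-fold CV without touching the data again:
    train on (totals - fold) statistics, score the fold from its own stats."""
    k, d = b.shape
    Gt, bt = G.sum(axis=0), b.sum(axis=0)
    best = None
    for lam in lams:
        err = 0.0
        for f in range(k):
            w = np.linalg.solve(Gt - G[f] + lam * np.eye(d), bt - b[f])
            err += s[f] - 2 * w @ b[f] + w @ G[f] @ w   # held-out RSS of fold f
        if best is None or err < best[0]:
            best = (err, lam)
    return best[1]

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=400)
lam_best = cv_ridge(*per_fold_stats(X, y, k=5), lams=[1e-3, 1.0, 1000.0])
```

The held-out residual sum of squares expands as y'y - 2 w'X'y + w'X'X w, which is why the per-fold statistics suffice to score each fold exactly.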


Re: [jira] [Comment Edited] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

Posted by Kun Yang <ku...@stanford.edu>.
Hi,

Thanks for the feedback.

The algorithm is not solving the normal equations as in ordinary linear regression. I did not detail the algorithm for solving the penalized optimization in the paper. To solve the penalized version, I will use coordinate descent, which is well documented in other papers (see Friedman's paper: for 1000 variables, it takes ~1 min to do cross validation in the R package) and is very efficient.

As I discussed in the conclusion section, solving the problem with a large number of predictors is still a challenge even in the single machine or MPI version, but the proposed algorithm can handle on the order of 5000 variables, which covers lots of applications.

My plan is to implement a working version first, then add refinements such as sparse vectors.

Any feedback?
Best
-Kun
