Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2009/12/23 21:02:30 UTC

[jira] Created: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Need sequential logistic regression implementation using SGD techniques
-----------------------------------------------------------------------

                 Key: MAHOUT-228
                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
             Project: Mahout
          Issue Type: New Feature
          Components: Classification
            Reporter: Ted Dunning


Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).

I often need to have a logistic regression in Java as well, so that is a reasonable place to start.
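
As a concrete illustration of the idea, here is a minimal sequential SGD sketch for binary logistic regression (plain Java with hypothetical names, not this patch's API; regularization is left out here and is discussed at length later in the thread):

{noformat}
import java.util.Random;

/** Minimal sequential SGD for binary logistic regression (sketch only). */
public class SgdLogisticSketch {
  private final double[] beta;       // coefficients
  private final double learningRate;

  public SgdLogisticSketch(int numFeatures, double learningRate) {
    this.beta = new double[numFeatures];
    this.learningRate = learningRate;
  }

  /** Logistic (sigmoid) function. */
  private static double logistic(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  /** Probability that the label of x is 1. */
  public double classify(double[] x) {
    double sum = 0;
    for (int i = 0; i < x.length; i++) {
      sum += beta[i] * x[i];
    }
    return logistic(sum);
  }

  /** One sequential update; the log-likelihood gradient is (y - p) * x. */
  public void train(double[] x, int y) {
    double p = classify(x);
    for (int i = 0; i < x.length; i++) {
      beta[i] += learningRate * (y - p) * x[i];
    }
  }

  public static void main(String[] args) {
    // toy data: label is 1 iff the first feature is positive
    Random rand = new Random(42);
    SgdLogisticSketch model = new SgdLogisticSketch(2, 0.1);
    for (int i = 0; i < 10000; i++) {
      double[] x = {rand.nextGaussian(), rand.nextGaussian()};
      int y = x[0] > 0 ? 1 : 0;
      model.train(x, y);
    }
    System.out.println("p(y=1 | x=[2,0]) = " + model.classify(new double[]{2, 0}));
  }
}
{noformat}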



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by Ted Dunning <te...@gmail.com>.
I am going to be in and out of connectivity for several days.  Probably
won't get to this.

On Sat, Feb 6, 2010 at 3:12 AM, Robin Anil (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830506#action_12830506]
>
> Robin Anil commented on MAHOUT-228:
> -----------------------------------
>
> Hi Ted, is there a new patch with the separated randomizer?
>
> I see a lot of code checked in on Olivier's git branch. Can you update the
> same as a patch here?
>
>
>
> > Need sequential logistic regression implementation using SGD techniques
> > -----------------------------------------------------------------------
> >
> >                 Key: MAHOUT-228
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Classification
> >            Reporter: Ted Dunning
> >             Fix For: 0.3
> >
> >         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv,
> sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
> >
> >
> > Stochastic gradient descent (SGD) is often fast enough for highly
> scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> > I often need to have a logistic regression in Java as well, so that is a
> reasonable place to start.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by Ted Dunning <te...@gmail.com>.
It's there.

On Wed, Dec 23, 2009 at 1:38 PM, Ted Dunning <te...@gmail.com> wrote:

>
> I always forget that marking patch available doesn't actually make the
> patch available.
>
> Patch will be there very shortly.
>
>

Re: [jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by Ted Dunning <te...@gmail.com>.
I always forget that marking patch available doesn't actually make the patch
available.

Patch will be there very shortly.

On Wed, Dec 23, 2009 at 1:20 PM, Jake Mannix <ja...@gmail.com> wrote:

> Wait, I thought there was a patch; is there no code on this yet?  The JIRA
> ticket says "patch available", but there are no files attached?
>
>  -jake
>
> On Wed, Dec 23, 2009 at 1:15 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Hey Ted,
> >
> >   I'll try out the patch, but I doubt it duplicates any of the stuff
> > I've got coming in. I've been meaning to put together an SGD impl, and
> > while ideologically it overlaps with some of my decomposition stuff
> > (the current in-memory SVD in Taste is actually of the SGD variety, so
> > there may be some overlap there), any scalable impl of that would be
> > awesome.
> >
> >   But this patch is for SGD for logistic regression, right?  How
> > customizable is it for solving different plugged-in optimization
> > functions?  I guess I could just try it out and see, eh?
> >
> >   -jake
> >
> >
> > On Wed, Dec 23, 2009 at 12:52 PM, Ted Dunning <ted.dunning@gmail.com
> >wrote:
> >
> >> Jake,
> >>
> >> I would appreciate your comments on this, especially in light of any
> >> duplication.
> >>
> >> David,
> >>
> >> If you have any time, your comments are always very welcome as well.
> >>
> >> On Wed, Dec 23, 2009 at 12:50 PM, Ted Dunning (JIRA) <jira@apache.org
> >> >wrote:
> >>
> >> >
> >> >     [
> >> >
> >>
> https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >> ]
> >> >
> >> > Ted Dunning updated MAHOUT-228:
> >> > -------------------------------
> >> >
> >> >    Fix Version/s: 0.3
> >> >           Status: Patch Available  (was: Open)
> >> >
> >> > Here is an early implementation.  The learning has been implemented,
> but
> >> > not tested.  Most other aspects are reasonably well tested.
> >> >
> >> > > Need sequential logistic regression implementation using SGD
> >> techniques
> >> > >
> >> -----------------------------------------------------------------------
> >> > >
> >> > >                 Key: MAHOUT-228
> >> > >                 URL:
> https://issues.apache.org/jira/browse/MAHOUT-228
> >> > >             Project: Mahout
> >> > >          Issue Type: New Feature
> >> > >          Components: Classification
> >> > >            Reporter: Ted Dunning
> >> > >             Fix For: 0.3
> >> > >
> >> > >
> >> > > Stochastic gradient descent (SGD) is often fast enough for highly
> >> > scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> >> > > I often need to have a logistic regression in Java as well, so that
> is
> >> a
> >> > reasonable place to start.
> >> >
> >> > --
> >> > This message is automatically generated by JIRA.
> >> > -
> >> > You can reply to this email to add a comment to the issue online.
> >> >
> >> >
> >>
> >>
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
> >>
> >
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by Jake Mannix <ja...@gmail.com>.
Wait, I thought there was a patch; is there no code on this yet?  The JIRA
ticket says "patch available", but there are no files attached?

  -jake

On Wed, Dec 23, 2009 at 1:15 PM, Jake Mannix <ja...@gmail.com> wrote:

> Hey Ted,
>
>   I'll try out the patch, but I doubt it duplicates any of the stuff I've
> got coming in. I've been meaning to put together an SGD impl, and while
> ideologically it overlaps with some of my decomposition stuff (the current
> in-memory SVD in Taste is actually of the SGD variety, so there may be
> some overlap there), any scalable impl of that would be awesome.
>
>   But this patch is for SGD for logistic regression, right?  How
> customizable is it for solving different plugged-in optimization
> functions?  I guess I could just try it out and see, eh?
>
>   -jake
>
>
> On Wed, Dec 23, 2009 at 12:52 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> Jake,
>>
>> I would appreciate your comments on this, especially in light of any
>> duplication.
>>
>> David,
>>
>> If you have any time, your comments are always very welcome as well.
>>
>> On Wed, Dec 23, 2009 at 12:50 PM, Ted Dunning (JIRA) <jira@apache.org
>> >wrote:
>>
>> >
>> >     [
>> >
>> https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>> >
>> > Ted Dunning updated MAHOUT-228:
>> > -------------------------------
>> >
>> >    Fix Version/s: 0.3
>> >           Status: Patch Available  (was: Open)
>> >
>> > Here is an early implementation.  The learning has been implemented, but
>> > not tested.  Most other aspects are reasonably well tested.
>> >
>> > > Need sequential logistic regression implementation using SGD
>> techniques
>> > >
>> -----------------------------------------------------------------------
>> > >
>> > >                 Key: MAHOUT-228
>> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>> > >             Project: Mahout
>> > >          Issue Type: New Feature
>> > >          Components: Classification
>> > >            Reporter: Ted Dunning
>> > >             Fix For: 0.3
>> > >
>> > >
>> > > Stochastic gradient descent (SGD) is often fast enough for highly
>> > scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
>> > > I often need to have a logistic regression in Java as well, so that is
>> a
>> > reasonable place to start.
>> >
>> > --
>> > This message is automatically generated by JIRA.
>> > -
>> > You can reply to this email to add a comment to the issue online.
>> >
>> >
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>
>

Re: [jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by Jake Mannix <ja...@gmail.com>.
Hey Ted,

  I'll try out the patch, but I doubt it duplicates any of the stuff I've
got coming in. I've been meaning to put together an SGD impl, and while
ideologically it overlaps with some of my decomposition stuff (the current
in-memory SVD in Taste is actually of the SGD variety, so there may be some
overlap there), any scalable impl of that would be awesome.

  But this patch is for SGD for logistic regression, right?  How
customizable is it for solving different plugged-in optimization functions?
I guess I could just try it out and see, eh?

  -jake


On Wed, Dec 23, 2009 at 12:52 PM, Ted Dunning <te...@gmail.com> wrote:

> Jake,
>
> I would appreciate your comments on this, especially in light of any
> duplication.
>
> David,
>
> If you have any time, your comments are always very welcome as well.
>
> On Wed, Dec 23, 2009 at 12:50 PM, Ted Dunning (JIRA) <jira@apache.org
> >wrote:
>
> >
> >     [
> >
> https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
> >
> > Ted Dunning updated MAHOUT-228:
> > -------------------------------
> >
> >    Fix Version/s: 0.3
> >           Status: Patch Available  (was: Open)
> >
> > Here is an early implementation.  The learning has been implemented, but
> > not tested.  Most other aspects are reasonably well tested.
> >
> > > Need sequential logistic regression implementation using SGD techniques
> > > -----------------------------------------------------------------------
> > >
> > >                 Key: MAHOUT-228
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
> > >             Project: Mahout
> > >          Issue Type: New Feature
> > >          Components: Classification
> > >            Reporter: Ted Dunning
> > >             Fix For: 0.3
> > >
> > >
> > > Stochastic gradient descent (SGD) is often fast enough for highly
> > scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> > > I often need to have a logistic regression in Java as well, so that is
> a
> > reasonable place to start.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: [jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by Ted Dunning <te...@gmail.com>.
Jake,

I would appreciate your comments on this, especially in light of any
duplication.

David,

If you have any time, your comments are always very welcome as well.

On Wed, Dec 23, 2009 at 12:50 PM, Ted Dunning (JIRA) <ji...@apache.org>wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Ted Dunning updated MAHOUT-228:
> -------------------------------
>
>    Fix Version/s: 0.3
>           Status: Patch Available  (was: Open)
>
> Here is an early implementation.  The learning has been implemented, but
> not tested.  Most other aspects are reasonably well tested.
>
> > Need sequential logistic regression implementation using SGD techniques
> > -----------------------------------------------------------------------
> >
> >                 Key: MAHOUT-228
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Classification
> >            Reporter: Ted Dunning
> >             Fix For: 0.3
> >
> >
> > Stochastic gradient descent (SGD) is often fast enough for highly
> scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> > I often need to have a logistic regression in Java as well, so that is a
> reasonable place to start.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
Ted Dunning, CTO
DeepDyve

[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802032#action_12802032 ] 

Olivier Grisel commented on MAHOUT-228:
---------------------------------------

For the record: I am working on adding more tests and debugging in the following branch (kept in sync with the trunk) hosted on github:

  http://github.com/ogrisel/mahout/commits/MAHOUT-228

Fixed so far:
 - convergence issues (an inconsistency in the index of the 'missing' beta row)
 - make sure that L1 is sparsity inducing by applying eager post-update regularization

Still TODO (independently of Ted's TODOs) - might be split into specific JIRA issues:
 - test that a highly redundant dataset can lead to very sparse models with an L1 prior
 - a Hadoop driver to do parallel extraction of vector features from documents using the Randomizer classes
 - a Hadoop driver to do parallel cross validation and confusion matrix evaluation (along with confidence intervals)
 - a Hadoop driver to perform hyperparameter grid search (lambda, prior function, learning rate, ...)
 - a sample Hadoop driver to categorize wikipedia articles by country
 - profile it a bit


> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848583#action_12848583 ] 

Jake Mannix commented on MAHOUT-228:
------------------------------------

Excellent.  The only thing I did to make it compile was update SparseVector to RandomAccessSparseVector, and replace the old Functions.exp with the merged Colt/Mahout Functions.exp.

So it should basically be the way you left it.  Not sure why the TermRandomizerTest doesn't pass.

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, MAHOUT-228.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848581#action_12848581 ] 

Ted Dunning commented on MAHOUT-228:
------------------------------------

{quote}
Or if you, Ted, don't have time to finish it yourself, could you at least check this patch out, and document a little about what the rest of us need to do to get this up and running (and verified as working)?
{quote}
That only sounds fair given what you have done so far.

Let me dig in tomorrow.



> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, MAHOUT-228.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800248#action_12800248 ] 

Ted Dunning commented on MAHOUT-228:
------------------------------------


We need a few things:

- a few functions should be separated out for more general utility

- the random vectorizer should be generalized a bit

- we need some real-world testing.  20 newsgroups would be a good test, as would RCV1.  Cloning the new svm package's tests would probably be the best short-term answer.

I, unfortunately, won't have time for a week or two to follow up.

As such, perhaps the best step is to commit this now.  It won't break anything.
 

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Attachment:     (was: MAHOUT-228-1.patch)

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795144#action_12795144 ] 

Robin Anil commented on MAHOUT-228:
-----------------------------------

I say let the hash functions be in math.

The text Randomizers can go in util.vectors.

vectors.lucene, vectors.arff etc. are there currently. Or should we move all of these to core along with the Randomizers and DictionaryBased?

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Attachment: r.csv
                logP.csv
                sgd.csv


I have been doing some testing on the training algorithm and there seems to be a glitch in it.  The problem is that the prior gradient is strong enough that, for anything but a very small lambda, the regularization zeros out all of the coefficients on every iteration.  Not good.

I will attach some sample data that I have been using for these experiments.  The reference for these experiments was an optimization I did in R, where I explicitly optimized a simple example and got very plausible results.

For the R example, I used the following definition of the function to optimize:

{noformat}
f <- function(beta) {
    # predicted probability for each row of x
    p = w(rowSums(x %*% matrix(beta, ncol=1)));
    # negative log likelihood; the (p==0) and (p==1) terms guard against log(0)
    r1 = -sum(y*log(p+(p==0))+(1-y)*log(1-p+(p==1)));
    # L1 prior (regularization) penalty
    r2 = lambda*sum(abs(beta));
    (r1+r2)
}

# logistic (sigmoid) function
w <- function(x) {
    return(1/(1+exp(-x)))
}
{noformat}
Here beta is the coefficient vector, lambda sets the amount of regularization, x holds the input vectors (one observation per row), y holds the known categories for the rows of x, f is the combined negative log likelihood (r1) and negative log prior (r2), and w is the logistic function.  I used the unsimplified form of the overall logistic likelihood; normally a simpler form, -sum(y - p), is used, but I wanted to keep things straightforward.

The attached file sgd.csv contains the value of x.  The value of y is simply 30 0's followed by 30 1's.  

Optimization was done using this:
{noformat}
# minimize f over beta by conjugate gradients at three regularization strengths
lambda <- 0.1
beta.01 <- optim(beta, f, method="CG", control=list(maxit=10000))
lambda <- 1
beta.1 <- optim(beta, f, method="CG", control=list(maxit=10000))
lambda <- 10
beta.10 <- optim(beta, f, method="CG", control=list(maxit=10000))
{noformat}
The values obtained for beta are contained in the file r.csv and the log-MAP likelihoods are in logP.csv.

I will shortly add a patch that has my initial test in it.  This patch will contain these test data files.  I will be working on this problem off and on over the next few days, but any hints that anybody has are welcome.  My expectation is that there is a silly oversight in my Java code.




> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-1.patch, MAHOUT-228-2.patch, r.csv, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Steve Umfleet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795064#action_12795064 ] 

Steve Umfleet commented on MAHOUT-228:
--------------------------------------

Hi Ted.  Watching your progress on SGD was instructive.  Thanks for the "template" of how to submit and proceed with an issue.

At what point in the process are decisions about packages resolved?  For example, MurmurHash at first glance, and based on its own documentation, seems like it might be broadly useful outside of org.apache.mahout.classifier.
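
A side note on why that hash matters here: the randomizer idea is to hash terms straight into vector indices so that no dictionary has to be built or shipped.  A rough sketch of the pattern, with a hypothetical hashedVector helper and Java's built-in hashCode standing in for the MurmurHash class from the patch:

{noformat}
import java.util.Arrays;

/** Sketch of hashed ("randomized") feature vectorization: terms are mapped
 *  to indices by hashing, so no dictionary needs to be built or stored. */
public class HashedVectorizerSketch {
  /** Hypothetical helper: accumulate term counts into a fixed-size vector. */
  public static double[] hashedVector(String[] terms, int numFeatures) {
    double[] v = new double[numFeatures];
    for (String term : terms) {
      // the real patch uses MurmurHash; String.hashCode stands in here
      int index = Math.floorMod(term.hashCode(), numFeatures);
      v[index] += 1.0;
    }
    return v;
  }

  public static void main(String[] args) {
    double[] v = hashedVector("the cat sat on the mat".split(" "), 16);
    System.out.println(Arrays.toString(v));
  }
}
{noformat}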

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Attachment: MAHOUT-228-2.patch

Updated to avoid Google's Guava libraries.

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: MAHOUT-228-1.patch, MAHOUT-228-2.patch
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-228:
-------------------------------

    Attachment: MAHOUT-228.patch

*bump*

I think this is now the third time I've brought this patch up to date.  It compiles, but internal tests don't pass.  Not sure why, as I haven't dug into them too deeply.

Ted, or anyone else with a desire to get Vowpal-Wabbit-style awesomeness in Mahout, want to take this patch for a spin and see what is up with it?

Or if you, Ted, don't have time to finish it yourself, could you at least check this patch out, and document a little about what the rest of us need to do to get this up and running (and verified as working)?

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, MAHOUT-228.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-228:
-------------------------------

    Fix Version/s:     (was: 0.3)
                   0.4

Pushing out to 0.4 based on Olivier's comments on mahout-dev

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794661#action_12794661 ] 

Ted Dunning commented on MAHOUT-228:
------------------------------------


The original code was very nearly correct, as it turns out.  The problem is that lambda in the batch learning is used to weight the prior against all of the training examples.  In the on-line algorithm the prior gradient is applied for each update.

In the example I used, this caused an effective increase in the value of lambda by a factor of 60 (the number of training examples).

After adjusting the value of lambda, I get values from the on-line algorithm very similar to those obtained by the batch algorithm (after lots of iterations).
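
In other words (a sketch with hypothetical variable names, not code from the patch): if the batch objective charges lambda once against the sum over all n examples, then an online update that applies the prior gradient on every example effectively multiplies lambda by n, so the per-update prior weight has to be divided by the number of training examples:

{noformat}
public class LambdaScalingSketch {
  public static void main(String[] args) {
    // Batch objective: -logLik(all n examples) + lambda * sum(|beta_i|).
    // Online SGD applies the prior gradient once per example, i.e. n times
    // per sweep, so the equivalent per-update prior weight is lambda / n.
    int n = 60;                    // number of training examples, as above
    double batchLambda = 0.1;
    double onlineLambda = batchLambda / n;

    // one per-example L1 prior step on a single coefficient (sub-gradient)
    double beta = 0.5;
    double learningRate = 0.01;
    beta -= learningRate * onlineLambda * Math.signum(beta);
    System.out.println("onlineLambda = " + onlineLambda + ", beta = " + beta);
  }
}
{noformat}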

I will post a new patch shortly for review.



> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802106#action_12802106 ] 

Ted Dunning commented on MAHOUT-228:
------------------------------------

{quote}
make sure that L1 is sparsity inducing my apply eager post update regularization
{quote}

Are you sure that this is correct?  The lazy regularization update should be applied before any coefficient is used for prediction or for update.  Is eager regularization after the update necessary? 

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Attachment: sgd-derivation.tex
                sgd-derivation.pdf

Here are the derivations of the formulae used.
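
For readers without the PDF handy, the standard derivation arrives at updates of this form (a sketch consistent with the L1-penalized logistic objective given in the comments above; the attachment's notation may differ):

{noformat}
\frac{\partial}{\partial \beta_j}\Bigl[-y\log p-(1-y)\log(1-p)+\lambda|\beta_j|\Bigr]
  = (p-y)\,x_j+\lambda\,\mathrm{sign}(\beta_j),
\qquad p=\frac{1}{1+e^{-\beta^{\top}x}}

\beta_j \leftarrow \beta_j-\eta\Bigl[(p-y)\,x_j+\lambda\,\mathrm{sign}(\beta_j)\Bigr]
{noformat}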

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-1.patch, MAHOUT-228-2.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by Steve Umfleet <s_...@yahoo.com>.
Hadoop put their MurmurHash in utils, so that might be a consideration.  But for Mahout it fits better, imo, in org.apache.mahout.common with other code that has similar philosophy and purpose.  I make the assumption that others will want to add some alternative hash tools, therefore I'd create a "hash" package in mahout.common.

The randomizers I'd put in org.apache.mahout.math due to their interaction with Vector, either at that very depth or in org.apache.mahout.math.randomizer, as .math is beginning to get dense given its number of modules.

I imagine using the priors outside of sgd, so they could be moved to org.apache.mahout.math as well, where they may merit their own sub-package.


--- On Tue, 12/29/09, Ted Dunning (JIRA) <ji...@apache.org> wrote:

> From: Ted Dunning (JIRA) <ji...@apache.org>
> Subject: [jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques
> To: mahout-dev@lucene.apache.org
> Date: Tuesday, December 29, 2009, 12:29 PM
> 
>     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795138#action_12795138
> ] 
> 
> Ted Dunning commented on MAHOUT-228:
> ------------------------------------
> 
> 
> This is the time.  The MurmurHash and Randomizer
> classes both seem ripe for promotion to other packages.
> 
> What I will do is file some additional JIRA's that include
> just those classes (one JIRA for Murmur, one for
> Randomizer/Vectorizer).  Those patches will probably
> make it in before this one does because they are
> simpler.  At that point, I will rework the patch on
> this JIRA to not include those classes.
> 
> Where would you recommend these others go?
> 
> 
> > Need sequential logistic regression implementation using SGD techniques
> > -----------------------------------------------------------------------
> >
> >                 Key: MAHOUT-228
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
> >             Project: Mahout
> >          Issue Type: New Feature
> >          Components: Classification
> >            Reporter: Ted Dunning
> >             Fix For: 0.3
> >
> >         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
> >
> >
> > Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> > I often need to have a logistic regression in Java as well, so that is a reasonable place to start.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 


      

[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795138#action_12795138 ] 

Ted Dunning commented on MAHOUT-228:
------------------------------------


This is the time.  The MurmurHash and Randomizer classes both seem ripe for promotion to other packages.

What I will do is file some additional JIRA's that include just those classes (one JIRA for Murmur, one for Randomizer/Vectorizer).  Those patches will probably make it in before this one does because they are simpler.  At that point, I will rework the patch on this JIRA to not include those classes.

Where would you recommend these others go?


> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800087#action_12800087 ] 

Jake Mannix commented on MAHOUT-228:
------------------------------------

I think I just drove myself nearly insane: I was creating a patch for MAHOUT-206, but I had already merged in Ted's patch here, and then when trying to test-apply the patch over to a fresh trunk checkout, it couldn't find these classes, so I went hunting throughout all of SVN history, trying to find them, but they had "vanished".  They were there just fine in my local git repo, but somehow there was no log of them anywhere, even when I started digging through older revisions on svn.apache.org... gone!

Heh.  Good side-effect: I have a patch which updates this patch.  Of course, it's not useful until this is committed.  What more is needed on this, Ted?

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794230#action_12794230 ] 

Ted Dunning commented on MAHOUT-228:
------------------------------------


This implementation is purely logistic regression.  Changing to other supervised learning algorithms shouldn't be difficult, and I have made the regularization pluggable, but I would as soon get this working as-is before adding too much generality.  In particular, I have leaned heavily on the presumption that I can do sparse updates and lazy regularization.  I don't know how well that applies to other problems.
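
To make the pluggable part concrete, here is a minimal sketch of what such a plug point could look like (hypothetical names, not the interface from this patch):

{noformat}
/** Hypothetical pluggable prior (names illustrative, not from the patch):
 *  each prior reports the gradient of its log density for one coefficient,
 *  so the SGD learner stays ignorant of which regularizer is in use. */
interface PriorFunction {
  double gradient(double beta);
}

/** L1 (Laplace) prior: constant pull toward zero, which induces sparsity. */
class L1Prior implements PriorFunction {
  private final double lambda;
  L1Prior(double lambda) { this.lambda = lambda; }
  public double gradient(double beta) { return -lambda * Math.signum(beta); }
}

/** L2 (Gaussian) prior: pull proportional to the coefficient. */
class L2Prior implements PriorFunction {
  private final double lambda;
  L2Prior(double lambda) { this.lambda = lambda; }
  public double gradient(double beta) { return -lambda * beta; }
}

public class PriorDemo {
  public static void main(String[] args) {
    PriorFunction prior = new L1Prior(0.1);
    double beta = 0.5;
    double learningRate = 0.01;
    // a regularization-only step; the likelihood gradient would be added too
    beta += learningRate * prior.gradient(beta);
    System.out.println("beta after one prior step: " + beta);
  }
}
{noformat}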

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: MAHOUT-228-1.patch
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802165#action_12802165 ] 

Olivier Grisel commented on MAHOUT-228:
---------------------------------------

bq. Are you sure that this is correct? The lazy regularization update should be applied before any coefficient is used for prediction or for update. Is eager regularization after the update necessary?

I made it eager only for the coefficients that have just been updated by the current train step; regularization of the remaining coefficients is still delayed until the next "classify()" call affecting those coefficients.

If we do not do this (or find a somehow equivalent workaround), the coefficients are only regularized upon the classify call and hence are marked as regularized for the current step value, while at the same time the training update makes the coefficients of the current step non-null, hence inducing a completely dense parameter set.

While this is not a big deal as long as beta is using a DenseMatrix representation, it prevents us from actually measuring the real impact of the lambda value via the sparsity of the parameters. Maybe on problems leading to very sparse models, using a SparseRowMatrix of some kind will be decisive performance-wise, and in that case the sparsity-inducing ability of L1 should be ensured.

Maybe lazy regularization could also be implemented in a simpler / more readable way by doing full regularization of beta every "regularizationSkip" training steps (IIRC, this is the case in Leon Bottou's SvmSgd2, but this adds yet another hyperparameter to fiddle with).

There might also be a way to mostly keep the lazy reg as it is and rethink the updateSteps update to avoid breaking the sparsity of L1. Maybe this is just a matter of moving the step++; call after the classify(instance); call. I don't remember if I tried that...
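
A sketch of the lazy scheme under discussion (hypothetical code, not the patch; step, updateSteps and beta follow the description above): each coefficient catches up on deferred L1 shrinkage just before it is read, and step is incremented only after the example's update so freshly-touched coefficients are still shrunk by later examples:

{noformat}
/** Sketch of lazily applied per-coefficient L1 regularization. */
public class LazyL1Sketch {
  private final double[] beta;
  private final int[] updateSteps;  // step at which each beta was last regularized
  private final double lambda;
  private final double learningRate;
  private int step = 0;

  public LazyL1Sketch(int numFeatures, double lambda, double learningRate) {
    this.beta = new double[numFeatures];
    this.updateSteps = new int[numFeatures];
    this.lambda = lambda;
    this.learningRate = learningRate;
  }

  /** Catch up on the regularization deferred since beta[i] was last touched. */
  private void regularize(int i) {
    int missed = step - updateSteps[i];
    if (missed > 0) {
      double shrink = missed * learningRate * lambda;
      // soft-threshold toward zero; this is what makes L1 sparsity-inducing
      beta[i] = Math.signum(beta[i]) * Math.max(0.0, Math.abs(beta[i]) - shrink);
      updateSteps[i] = step;
    }
  }

  public double classify(int[] nonZero, double[] values) {
    double sum = 0;
    for (int k = 0; k < nonZero.length; k++) {
      int i = nonZero[k];
      regularize(i);                // lazy: only regularize what we read
      sum += beta[i] * values[k];
    }
    return 1.0 / (1.0 + Math.exp(-sum));
  }

  public void train(int[] nonZero, double[] values, int y) {
    double p = classify(nonZero, values);
    for (int k = 0; k < nonZero.length; k++) {
      beta[nonZero[k]] += learningRate * (y - p) * values[k];
    }
    step++;  // incremented after the update, per the ordering fix below
  }

  public static void main(String[] args) {
    LazyL1Sketch model = new LazyL1Sketch(10, 0.01, 0.1);
    // two training examples over disjoint sparse features
    model.train(new int[]{0, 3}, new double[]{1.0, 1.0}, 1);
    model.train(new int[]{1, 4}, new double[]{1.0, 1.0}, 0);
    System.out.println("p = " + model.classify(new int[]{0, 3}, new double[]{1.0, 1.0}));
  }
}
{noformat}

With that ordering, a coefficient made non-zero by example t is soft-thresholded again the next time a later example reads it, so L1 can actually drive coefficients to exact zero.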

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795141#action_12795141 ] 

Jake Mannix commented on MAHOUT-228:
------------------------------------

bq. Where would you recommend these others go?

Somewhere in the math module, package name, I don't know.

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802187#action_12802187 ] 

Olivier Grisel commented on MAHOUT-228:
---------------------------------------

Indeed, just moving the step++ call after the update makes the sparsification work as expected while keeping the code natural (no forceOne flag hack).

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830506#action_12830506 ] 

Robin Anil commented on MAHOUT-228:
-----------------------------------

Hi Ted, is there a new patch with the separated randomizer?

I see a lot of code checked in on Olivier's git branch. Can you update the same as a patch here?



> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Attachment: MAHOUT-228-3.patch
                sgd-derivation.pdf

Here is the patch with test files and a description of the derivation of the formulae.

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794314#action_12794314 ] 

Jake Mannix commented on MAHOUT-228:
------------------------------------

Ted, how do we get google-guava for this?  Maven doesn't find it anywhere... I can download it myself to try it out for now, I suppose.

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: MAHOUT-228-1.patch
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Attachment: MAHOUT-228-1.patch

Here is the actual patch file.

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: MAHOUT-228-1.patch
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Olivier Grisel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802165#action_12802165 ] 

Olivier Grisel edited comment on MAHOUT-228 at 1/19/10 9:46 AM:
----------------------------------------------------------------

bq. Are you sure that this is correct? The lazy regularization update should be applied before any coefficient is used for prediction or for update. Is eager regularization after the update necessary?

I made it eager only for the coefficients that have just been updated by the current train step; regularization of the remaining coefficients is still delayed until the next "classify(instance)" call that touches them.

If we do not do this (or find an equivalent workaround), the coefficients are only regularized upon the classify(instance) call and hence are marked as regularized for the current step value, while at the same time the training update makes the coefficients touched at the current step non-null, inducing a completely dense parameter set.

While this is not a big deal as long as beta uses a DenseMatrix representation, it prevents us from measuring the real impact of the lambda value via the sparsity of the parameters. On problems leading to very sparse models, using a SparseRowMatrix of some kind may be decisive performance-wise, and in that case the sparsity-inducing ability of L1 should be preserved.

Maybe lazy regularization could also be implemented in a simpler / more readable way by fully regularizing beta every "regularizationSkip" training steps (IIRC, this is what Leon Bottou's SvmSgd2 does, but it adds yet another hyperparameter to fiddle with).

There might also be a way to mostly keep the lazy regularization as it is and rethink the updateSteps bookkeeping to avoid breaking the sparsity of L1. Maybe it is just a matter of moving the step++; call after the classify(instance); call. I don't remember if I tried that in the first place...
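
To make the ordering issue concrete, here is a minimal sketch of lazy, truncation-based L1 for SGD logistic regression. This is illustrative only, not the patch's code: the names (beta, updateSteps, lambda) and the dense double[] representation are assumptions, and the comment on step++ marks the ordering question discussed above.

{code}
// Hypothetical sketch of lazy L1 regularization for sparse SGD updates.
public class LazyL1Sgd {
  private final double[] beta;       // model coefficients
  private final int[] updateSteps;   // step at which each coefficient was last regularized
  private final double lambda;       // L1 strength
  private final double learningRate;
  private int step;

  public LazyL1Sgd(int numFeatures, double lambda, double learningRate) {
    this.beta = new double[numFeatures];
    this.updateSteps = new int[numFeatures];
    this.lambda = lambda;
    this.learningRate = learningRate;
  }

  // Apply all the regularization owed to feature j since it was last touched.
  // Called before beta[j] is read for prediction or changed by training.
  private void regularize(int j) {
    int missed = step - updateSteps[j];
    if (missed > 0) {
      double shrink = missed * learningRate * lambda;
      // soft-thresholding: move toward zero, clamping at zero so sparsity survives
      beta[j] = Math.signum(beta[j]) * Math.max(0.0, Math.abs(beta[j]) - shrink);
      updateSteps[j] = step;
    }
  }

  // Train on one instance given as sparse (index, value) pairs, target in {0, 1}.
  public void train(int[] indices, double[] values, int target) {
    double score = 0.0;
    for (int k = 0; k < indices.length; k++) {
      regularize(indices[k]);                  // lazy reg before the coefficient is used
      score += beta[indices[k]] * values[k];
    }
    double p = 1.0 / (1.0 + Math.exp(-score)); // logistic link
    double gradient = target - p;
    for (int k = 0; k < indices.length; k++) {
      beta[indices[k]] += learningRate * gradient * values[k];
    }
    step++;  // incrementing *after* the update keeps just-touched coefficients
             // marked as regularized at the pre-update step -- the ordering
             // question raised above
  }
}
{code}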

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794323#action_12794323 ] 

Ted Dunning commented on MAHOUT-228:
------------------------------------

{quote}
Ted, how do we get google-guava for this? Maven doesn't find it anywhere... I can download myself to try it out for now, I suppose. 
{quote}

Hmm... I bet somebody published it to our company's internal repository (we use guava and collections in several systems).  Then it wound up in my local repository and the Mahout build picked it up from there.

Let me go back and remove the use of guava for now.  It is very nice to be able to read all the lines in a resource in one line of code, but it's not that important.
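
For what it's worth, the convenience in question is roughly the following. This is a hedged sketch, not the patch's code: the resource name "train.csv" and the class are illustrative, and it assumes a guava build that already ships com.google.common.io.Resources.

{code}
import com.google.common.base.Charsets;
import com.google.common.io.Resources;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class ResourceLines {
  // With guava (the dependency in question), one line does it:
  public static List<String> withGuava() throws IOException {
    return Resources.readLines(Resources.getResource("train.csv"), Charsets.UTF_8);
  }

  // Without guava, using only the JDK:
  public static List<String> withoutGuava() throws IOException {
    List<String> lines = new ArrayList<String>();
    // Note: getResourceAsStream returns null if the resource is missing.
    BufferedReader reader = new BufferedReader(new InputStreamReader(
        ResourceLines.class.getClassLoader().getResourceAsStream("train.csv"), "UTF-8"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    } finally {
      reader.close();
    }
    return lines;
  }
}
{code}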



> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: MAHOUT-228-1.patch
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Attachment:     (was: MAHOUT-228-2.patch)

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Fix Version/s: 0.3
           Status: Patch Available  (was: Open)

Here is an early implementation.  The learning step itself is implemented but not yet tested; most other aspects are reasonably well tested.

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-228:
-------------------------------

    Attachment:     (was: sgd-derivation.pdf)

> Need sequential logistic regression implementation using SGD techniques
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-228
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-228
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>            Reporter: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: logP.csv, MAHOUT-228-3.patch, r.csv, sgd-derivation.pdf, sgd-derivation.tex, sgd.csv
>
>
> Stochastic gradient descent (SGD) is often fast enough for highly scalable learning (see Vowpal Wabbit, http://hunch.net/~vw/).
> I often need to have a logistic regression in Java as well, so that is a reasonable place to start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.