Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2010/12/23 19:46:24 UTC

Vowpal Wabbit 5.0 with optimisations

Worth a look, maybe some new ideas for sgd package

Robin

http://hunch.net/?p=1594

Vowpal Wabbit, version 5.0, and the second heresy <http://hunch.net/?p=1594>
Tags: Announcements <http://hunch.net/?cat=4>, Code <http://hunch.net/?cat=42>,
Machine Learning <http://hunch.net/?cat=29>, by jl <http://hunch.net/~jl> @
9:52 pm

I’ve released version 5.0 <http://hunch.net/~vw/vowpal_wabbit_v5.0.tar.gz> of
the Vowpal Wabbit <http://hunch.net/~jl> online learning software. The major
number has changed since the last release <http://hunch.net/?p=1068> because
I regard all earlier versions as obsolete: there are several new algorithms
and features, including substantial changes and upgrades to the default
learning algorithm.

The biggest changes are new algorithms:

   1. Nikos <http://www.cs.cornell.edu/~nk/> and I improved the default
   algorithm. The basic update rule still uses gradient descent, but the size
   of the update is carefully controlled so that it’s impossible to overrun
   the label (a rough sketch of the capping idea follows this list). In
   addition, the normalization has changed. Computationally, these changes
   are virtually free and yield better results, sometimes much better. Less
   careful updates can be re-enabled with --loss_function classic, although
   results are still not identical to previous versions because of the
   normalization changes.
   2. Nikos also implemented per-feature learning rates as per these two
   papers <http://www.colt2010.org/papers/104mcmahan.pdf>
   <http://www.colt2010.org/papers/023Duchi.pdf>. Often, this works better
   than the default algorithm. It isn’t the default because it isn’t (yet) as
   adaptable in terms of learning rate decay. This is enabled with --adaptive,
   and learned regressors are compatible with the default. Computationally,
   you might see a factor of 4 slowdown if using ‘-q’. Nikos noticed that the
   phenomenal Quake inverse square root
   <http://en.wikipedia.org/wiki/Fast_inverse_square_root> hack applies,
   making this substantially faster than a naive implementation (see the
   adaptive-rate sketch after the list).
   3. Nikos and Daniel <http://cseweb.ucsd.edu/~djhsu/> also implemented
   active learning derived from this paper
   <http://cseweb.ucsd.edu/~djhsu/papers/iwal-cal.pdf>, usable via
   --active_simulation (to test parameters on an existing supervised dataset)
   or --active_learning (to do the real thing). This runs at full speed, which
   is much faster than is reasonable in any active learning scenario. We see
   this approach dominating supervised learning on all classification datasets
   so far, often with far fewer labeled examples required, as the theory
   predicts. The learned predictor is compatible with the default.
   4. Olivier <http://research.yahoo.com/Olivier_Chapelle> helped me implement
   preconditioned conjugate gradient based on Jonathan Shewchuk
   <http://www.cs.berkeley.edu/~jrs/>’s tutorial
   <http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf>.
   This is a *batch* algorithm and hence requires multiple passes over any
   dataset to do something useful. Each step of conjugate gradient requires 2
   passes. The advantage of cg is that it converges relatively quickly via the
   use of second derivative information, which can be particularly helpful if
   your features are of widely differing scales. The use of --regularization
   0.001 (or smaller) is almost required with --conjugate_gradient, as it will
   otherwise overfit hard. This implementation has two advantages over the
   basic approach: it implicitly computes a Hessian in *O(n)* time where *n*
   is the number of features, and it operates out of core, making it
   applicable to datasets that don’t conveniently fit in RAM (a textbook PCG
   sketch also appears after the list). The learned predictor is compatible
   with the default, although you’ll notice that a factor of 8 more RAM is
   required when learning.
   5. Matt Hoffman <http://www.cs.princeton.edu/~mdhoffma/> and I implemented
   Online Latent Dirichlet Allocation
   <http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf>. This
   code is still experimental and likely to change over the next week. It
   really does a minibatch update under the hood. The code appears to be
   substantially faster than Matt’s earlier Python implementation, making this
   probably the most efficient LDA anywhere. LDA is still much slower than
   online linear learning as it is quite computationally heavy in comparison;
   perhaps a good candidate for GPU optimization.
   6. Nikos, Daniel, and I have been experimenting with more online cluster
   parallel learning algorithms (--corrective, --backprop, --delayed_global).
   We aren’t yet satisfied with these, although they are improving. Details
   are at the LCCC workshop <http://lccc.eecs.berkeley.edu/>.
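
To make the "never overrun the label" idea in item 1 concrete, here is a
minimal Java sketch of a capped squared-loss step. It illustrates only the
capping principle, not VW’s actual 5.0 update (which also changes the
normalization); the class and method names are made up for illustration.

// Minimal sketch of a "never overrun the label" SGD step for squared loss.
// Illustrative only: it shows how capping the step keeps the prediction from
// crossing the label, nothing more.
final class CappedSgdStep {
  static void update(double[] w, double[] x, double y, double eta) {
    double p = 0.0;              // current prediction w . x
    double xNormSq = 0.0;        // ||x||^2 controls how far one step moves p
    for (int i = 0; i < w.length; i++) {
      p += w[i] * x[i];
      xNormSq += x[i] * x[i];
    }
    if (xNormSq == 0.0) {
      return;                    // nothing to update on an all-zero example
    }
    // A plain step of size eta moves the prediction by eta * (y - p) * ||x||^2.
    // Capping eta at 1 / ||x||^2 makes the new prediction land between p and y.
    double effectiveEta = Math.min(eta, 1.0 / xNormSq);
    double residual = y - p;
    for (int i = 0; i < w.length; i++) {
      w[i] += effectiveEta * residual * x[i];
    }
  }
}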
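
The per-feature rates in item 2 follow the adaptive-gradient idea of the two
linked papers: keep a running sum of squared gradients per feature and scale
each coordinate’s step by the inverse square root of that sum. The sketch
below combines that with the Quake-style fast inverse square root mentioned
above; it is an illustration in that spirit, not VW’s code, and all names are
invented.

// Sketch of per-feature adaptive learning rates scaled by 1/sqrt(sum of
// squared gradients), using the classic fast-inverse-square-root bit trick
// in place of 1/Math.sqrt. Squared loss, sparse examples as (indices, values).
final class AdaptiveSgdStep {
  final float[] w;               // weights
  final float[] gradSq;          // per-feature sum of squared gradients
  final float eta;               // base learning rate

  AdaptiveSgdStep(int dims, float eta) {
    this.w = new float[dims];
    this.gradSq = new float[dims];
    this.eta = eta;
  }

  // Quake-style approximation of 1/sqrt(x), with one Newton refinement step.
  static float fastInvSqrt(float x) {
    float half = 0.5f * x;
    int bits = Float.floatToRawIntBits(x);
    bits = 0x5f3759df - (bits >> 1);       // magic-constant initial guess
    float y = Float.intBitsToFloat(bits);
    return y * (1.5f - half * y * y);      // Newton step improves the guess
  }

  void update(int[] idx, float[] val, float label) {
    float p = 0f;
    for (int k = 0; k < idx.length; k++) {
      p += w[idx[k]] * val[k];
    }
    float residual = label - p;
    for (int k = 0; k < idx.length; k++) {
      float g = residual * val[k];         // negative gradient for this weight
      gradSq[idx[k]] += g * g;
      if (gradSq[idx[k]] > 0f) {
        // Per-feature step: eta / sqrt(sum of squared gradients seen so far).
        w[idx[k]] += eta * fastInvSqrt(gradSq[idx[k]]) * g;
      }
    }
  }
}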
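
For readers who have not seen preconditioned conjugate gradient (item 4)
before, here is a textbook-style sketch in the spirit of Shewchuk’s tutorial:
it solves A x = b using only matrix-vector products, with a simple diagonal
(Jacobi) preconditioner. VW never materializes the matrix; it streams over the
data to get the equivalent products, so treat this purely as a reference for
the linear-algebra loop.

// Textbook preconditioned conjugate gradient for A x = b (A symmetric
// positive definite), with a Jacobi (diagonal) preconditioner.
final class Pcg {
  static double[] solve(double[][] a, double[] b, int maxIters, double tol) {
    int n = b.length;
    double[] x = new double[n];
    double[] r = b.clone();                  // residual r = b - A x, with x = 0
    double[] invDiag = new double[n];        // Jacobi preconditioner M^-1
    for (int i = 0; i < n; i++) {
      invDiag[i] = a[i][i] != 0.0 ? 1.0 / a[i][i] : 1.0;
    }
    double[] z = scale(invDiag, r);          // z = M^-1 r
    double[] p = z.clone();                  // initial search direction
    double rz = dot(r, z);
    for (int it = 0; it < maxIters && Math.sqrt(dot(r, r)) > tol; it++) {
      double[] ap = matVec(a, p);
      double alpha = rz / dot(p, ap);        // step length along p
      for (int i = 0; i < n; i++) {
        x[i] += alpha * p[i];
        r[i] -= alpha * ap[i];
      }
      z = scale(invDiag, r);
      double rzNew = dot(r, z);
      double beta = rzNew / rz;              // keeps directions conjugate
      for (int i = 0; i < n; i++) {
        p[i] = z[i] + beta * p[i];
      }
      rz = rzNew;
    }
    return x;
  }

  static double dot(double[] u, double[] v) {
    double s = 0.0;
    for (int i = 0; i < u.length; i++) { s += u[i] * v[i]; }
    return s;
  }

  static double[] scale(double[] diag, double[] v) {
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++) { out[i] = diag[i] * v[i]; }
    return out;
  }

  static double[] matVec(double[][] a, double[] v) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; i++) { out[i] = dot(a[i], v); }
    return out;
  }
}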

In addition, Ariel <http://www.yendor.com/> added a test suite, Shravan
<http://research.yahoo.com/Shravan_Narayanamurthy> helped with ngrams, and
there are several other minor new features and bug fixes, including a very
subtle one caught by Vaclav <http://www.occamslab.com/petricek/>.

The documentation on the website hasn’t kept up with the code. I’m planning
to rectify that over the next week, and have a new tutorial starting at 2pm
in the LCCC <http://lccc.eecs.berkeley.edu/schedule.html> room for those
interested. Yes, I’ll not be skiing. :)

Re: Vowpal Wabbit 5.0 with optimisations

Posted by Ted Dunning <te...@gmail.com>.
Yeah... some of his changes are actually tracking some of our stuff (not by
looking at our stuff, but by looking at the common upstream idea sources). The
active learning stuff is a different case and would be excellent to have.
Essentially it involves sieving out training examples that "don't matter",
which, in my experience, is typically 99+% of the data. In practice,
down-sampling is the poor man's active learning, but real active learning
would be much, much better.
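
A minimal Java sketch of what that down-sampling amounts to: keep every
example of the rare class, keep only a small fraction of the dominant class,
and give the kept ones a larger weight so estimates stay unbiased. The class
name and the keep rate are made up; real active learning would instead decide
per example from the current model's uncertainty rather than from the label.

// "Poor man's active learning": drop most of the abundant class and reweight
// the survivors so the training objective stays (in expectation) unchanged.
import java.util.Random;

final class DownSampler {
  private final Random rnd = new Random(42);
  private final double negKeepRate;          // e.g. 0.01 keeps ~1% of negatives

  DownSampler(double negKeepRate) {
    this.negKeepRate = negKeepRate;
  }

  /** Returns the importance weight to train with, or -1 to drop the example. */
  double weightFor(boolean positive) {
    if (positive) {
      return 1.0;                            // rare class: always keep
    }
    if (rnd.nextDouble() < negKeepRate) {
      return 1.0 / negKeepRate;              // kept negative: reweight upward
    }
    return -1.0;                             // dropped: the "don't matter" bulk
  }
}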

On Thu, Dec 23, 2010 at 10:46 AM, Robin Anil <ro...@gmail.com> wrote:

> Worth a look, maybe some new ideas for sgd package
>
> Robin
>
> http://hunch.net/?p=1594