Posted to user@mahout.apache.org by Danny Leshem <dl...@gmail.com> on 2010/05/06 18:15:23 UTC

Library for scalable logistic regression

Hi!

I'm currently working on a rather large-scale dataset (~300M samples
represented as dense vectors of cardinality ~100).
The data lives on an EC2 Hadoop cluster and is pre-processed using MR jobs,
including heavy usage of Mahout (Lanczos decomposition, clustering, etc.).

I'm now looking for ways to learn a logistic regression model based on the
data.
So far I have postponed this part of the project, hoping for MAHOUT-228
<https://issues.apache.org/jira/browse/MAHOUT-228> to be ready... but
unfortunately I can't afford to wait any more :)

Looking around, I've found Google's sofia-ml
<http://code.google.com/p/sofia-ml/> and a UC Berkeley Hadoop-based
implementation
<http://berkeley-mltea.pbworks.com/Hadoop-for-Machine-Learning-Guide>.
Does anyone have experience with these, or know of / have used a good
library for logistic regression at this scale?
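
For concreteness, the kind of streaming API I'm hoping MAHOUT-228 will
provide looks roughly like this (a hypothetical sketch; the class and
method names below are my guesses, not a released API):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Hypothetical sketch of the hoped-for MAHOUT-228 API; names are guesses.
// Sequential SGD training over ~100 dense features, 2 classes, L1 prior.
int numFeatures = 100;
OnlineLogisticRegression lr =
    new OnlineLogisticRegression(2, numFeatures, new L1())
        .learningRate(1)
        .lambda(1e-4);                         // placeholder value, to tune

Vector x = new DenseVector(numFeatures);       // fill with one sample's features
int label = 1;                                 // 0 or 1
lr.train(label, x);                            // one SGD step per sample

double pOfOne = lr.classifyScalar(x);          // P(label == 1 | x)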

Thanks,
Danny

Re: Library for scalable logistic regression

Posted by "Mahout Chen (补丁象夫)" <ma...@gmail.com>.
Hi, Danny,

I remember that sofia-ml minimizes pair-wise ranking errors, so it
might not be the solution for you if AUC is not your evaluation
criterion. In addition, it only supports linear models; is that enough
for your problem?
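
Roughly, the difference between the two objectives (a schematic sketch,
where x_i are feature vectors, y_i in {0,1} are labels, and w is the
weight vector):

\[
L_{\mathrm{rank}}(w) = \sum_{i:\,y_i=1}\ \sum_{j:\,y_j=0}
  \mathbf{1}\left[\, w^\top x_i \le w^\top x_j \,\right]
\qquad\text{vs.}\qquad
L_{\mathrm{log}}(w) = \sum_i \left[ \log\left(1 + e^{w^\top x_i}\right)
  - y_i\, w^\top x_i \right]
\]

(In practice sofia-ml minimizes a convex surrogate of the first sum.) A
w that does well on one need not do well on the other, so it depends on
which criterion you actually care about.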



-- 
Blog of Mahout Chen:
http://blog.sina.com.cn/apachemahout

Re: Library for scalable logistic regression

Posted by Ted Dunning <te...@gmail.com>.
Glad to hear that you have made good use of Mahout so far.

My recommendations right now for scalable classifiers are generally in the
SGD area, the canonical example of which is Vowpal Wabbit. Another
benchmark implementation is glmnet, which does lasso and elastic-net
regularization. Vowpal Wabbit will definitely scale to the size you are
talking about, but it truly shines on very large feature spaces. Glmnet is
very, very good and very efficient, but it is an in-core implementation
right now, which limits its applicability to your problem.
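
To make "the SGD area" concrete, one SGD step for L2-regularized logistic
regression looks like this (a minimal self-contained sketch, not any
particular library's API; the learning rate and lambda values are
placeholders to tune):

/** Minimal SGD trainer for L2-regularized logistic regression (sketch). */
public class LogisticSgd {
  private final double[] w;          // one weight per feature
  private final double learningRate; // step size (tune in practice)
  private final double lambda;       // L2 regularization strength

  public LogisticSgd(int numFeatures, double learningRate, double lambda) {
    this.w = new double[numFeatures];
    this.learningRate = learningRate;
    this.lambda = lambda;
  }

  /** P(y = 1 | x) under the current weights. */
  public double predict(double[] x) {
    double z = 0;
    for (int i = 0; i < w.length; i++) {
      z += w[i] * x[i];
    }
    return 1.0 / (1.0 + Math.exp(-z));
  }

  /** One SGD step on a single example; y must be 0 or 1. */
  public void train(double[] x, int y) {
    double error = y - predict(x);   // gradient of the log-likelihood in z
    for (int i = 0; i < w.length; i++) {
      w[i] += learningRate * (error * x[i] - lambda * w[i]);
    }
  }
}

A sequential pass (or a few) of train() over 300M samples is the kind of
workload tools like Vowpal Wabbit are built to stream through.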

With only 100 features, my guess is that you can train a main-effects model
with a relatively small subset of your data, particularly if you have an
asymmetric target. You can also use the standard "train-on-errors"
technique to augment your original sampled dataset, so that you still have
a small training set that captures what you need from your larger dataset.
This might be particularly helpful if you want to train on interactions.

The general procedure there would be to (see the sketch after this list):

a) train a main-effects model on a balanced sample of about 1M examples
b) scan your full dataset and retain roughly the 1M samples with the worst
errors
c) build a fancy new model on the combined 2M samples
d) rinse and repeat while AUC improves
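
A rough sketch of that loop (Sample, Model, and the abstract helpers below
are hypothetical stand-ins for whatever training and scoring machinery you
already run on the cluster, not real Mahout or VW APIs):

import java.util.ArrayList;
import java.util.List;

/** Train-on-errors loop (sketch); the types and helpers are hypothetical. */
abstract class TrainOnErrors<Sample, Model> {

  abstract Model trainModel(List<Sample> trainingSet);
  abstract List<Sample> worstErrors(Model model, Iterable<Sample> data, int howMany);
  abstract double aucOnHoldout(Model model);

  Model run(List<Sample> balancedSample, Iterable<Sample> fullDataset) {
    // a) main-effects model on a balanced sample of about 1M examples
    List<Sample> trainingSet = new ArrayList<Sample>(balancedSample);
    Model model = trainModel(trainingSet);
    double bestAuc = aucOnHoldout(model);

    while (true) {
      // b) retain roughly the 1M samples the current model gets most wrong
      trainingSet.addAll(worstErrors(model, fullDataset, 1000000));

      // c) build a fancy new model on the combined samples
      Model candidate = trainModel(trainingSet);

      // d) rinse and repeat while AUC improves
      double auc = aucOnHoldout(candidate);
      if (auc <= bestAuc) {
        return model;   // AUC stopped improving; keep the previous model
      }
      model = candidate;
      bestAuc = auc;
    }
  }
}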

