Posted to user@mahout.apache.org by Andreas Bauer <bu...@gmx.net> on 2013/11/07 22:48:06 UTC

OnlineLogisticRegression: Are my settings sensible

Hi,  

I’m trying to use OnlineLogisticRegression for a two-class classification problem, but my classification results are not very good, so I wanted to ask for support to find out whether my settings are correct and whether I’m using Mahout correctly. Because if I’m doing it correctly, then probably my features are crap...

In total I have 12 features. All are continuous values and all are normalized/standardized (this has no effect on the classification performance at the moment).

Training samples keep flowing in at a constant rate (i.e. incremental training), but in total there won’t be more than a few thousand (class split positive/negative 30:70).

My performance measures do not really get good; e.g. with approx. 3600 training samples I get

f-measure(beta=0.5): 0.38
precision: 0.33
recall: 0.47

The parameters I use are

lambda=0.0001
offset=1000
alpha=1
decay_exponent=0.9
learning_rate=50


FEATURE_NUMBER = 100;
CATEGORIES_NUMBER = 2;



Java code snippet:

private OnlineLogisticRegression olr;
private ContinuousValueEncoder continousValueEncoder;

private static final FeatureVectorEncoder BIAS = new ConstantValueEncoder("Intercept");

…
public Training() {
    olr = new OnlineLogisticRegression(CATEGORIES_NUMBER, FEATURE_NUMBER, new L1()); // L2 or ElasticBandPrior do not affect the performance
    olr.lambda(lambda).learningRate(learning_rate).stepOffset(offset).decayExponent(decay_exponent);
    this.continousValueEncoder = new ContinuousValueEncoder("ContinuousValueEncoder");
    this.continousValueEncoder.setProbes(20);
    …
}


public void train(TrainingSample sample, int target) {
    DenseVector denseVector = new DenseVector(FEATURE_NUMBER);
    // sample.getFeatureValue1()..getFeatureValue15() return a double value
    this.continousValueEncoder.addToVector((byte[]) null, sample.getFeatureValue1(), denseVector);
    …
    this.continousValueEncoder.addToVector((byte[]) null, sample.getFeatureValue15(), denseVector);
    BIAS.addToVector((byte[]) null, 1, denseVector);
    olr.train(target, denseVector);
}

It is also interesting to note that when I use the model, both testing and classification always yield probabilities of 1.0 or 0.99xxx for either class.

result = this.olr.classifyFull(input);
LOGGER.debug("TrainingSink test: classify real category:"
+ realCategory + " olr classifier result: "
+ result.maxValueIndex() + " prob: " + result.maxValue());




Would be great if you could give me some advice.

Many thanks,

Andreas



Re: OnlineLogisticRegression: Are my settings sensible

Posted by Ted Dunning <te...@gmail.com>.
You are correct that it should work with smaller data as well, but the
trade-offs are going to be very different.

In particular, some algorithms are completely infeasible at large scale but
very effective at small scale.  Some, like those used in glmnet, inherently
require multiple passes through the data.

The Mahout committers have generally elected to spend time on larger scale
problems, especially where really good small-scale solutions already exist.

That could change if somebody wanted to come in and support some set of
algorithms (hint, hint).





Re: OnlineLogisticRegression: Are my settings sensible

Posted by Andreas Bauer <bu...@gmx.net>.
Ok, I'll have a look. Thanks! I know Mahout is intended for large-scale machine learning, but I guess it shouldn't have problems with such small data either.





Re: OnlineLogisticRegression: Are my settings sensible

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer <bu...@gmx.net> wrote:

> Hi,
>
> Thanks for your comments.
>
> I modified the examples from the mahout in action book,  therefore I used
> the hashed approach and that's why i used 100 features. I'll adjust the
> number.
>

Makes sense.  But the book was doing sparse features.



> You say that I'm using the same CVE for all features,  so you mean i
> should create 12 separate CVE for adding features to the vector like this?
>

Yes.  Otherwise you don't get different hashes.  With a CVE, the hashing
pattern is generated from the name of the variable.  For a word encoder,
the hashing pattern is generated from the name of the variable (specified at
construction of the encoder) and the word itself (specified at encode
time).  Text is just repeated words, except that the weights aren't
necessarily linear in the number of times a word appears.

In your case, you could have used a goofy trick with a word encoder where
the "word" is the variable name and the value of the variable is passed as
the weight of the word.

But all of this hashing is really just extra work for you.  Easier to just
pack your data into a dense vector.


> Finally, I thought online logistic regression meant that it is an online
> algorithm so it's fine to train only once. Does it mean, should i invoke
> the train method over and over again with the same training sample until
> the next one arrives or how should i make the model converge (or at least
> try to with the few samples) ?
>

What online really implies is that training data is measured in terms of
number of input records instead of in terms of passes through the data.  To
converge, you have to see enough data.  If that means you need to pass
through the data several times to fool the learner ... well, it means you
have to pass through the data several times.
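A minimal plain-Java sketch of the multiple-passes idea (a hand-rolled logistic learner stands in for Mahout's OnlineLogisticRegression here; the data, learning rate, and class name are all illustrative): replaying the same small training set for many epochs is what lets an online learner converge on it.

```java
// Hand-rolled logistic learner trained by SGD, one example at a time.
// The point is the training loop: with only a few samples, replay the
// stored data for several epochs instead of seeing each example once.
public class MultiPassDemo {
    static double[] w;  // weights, with the bias term at index 0

    static double predict(double[] x) {
        double z = w[0];  // bias
        for (int i = 0; i < x.length; i++) z += w[i + 1] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));  // logistic function
    }

    static void train(int target, double[] x) {
        double err = target - predict(x);  // gradient of the log-loss
        double rate = 0.5;                 // illustrative learning rate
        w[0] += rate * err;
        for (int i = 0; i < x.length; i++) w[i + 1] += rate * err * x[i];
    }

    static int runDemo() {
        w = new double[3];
        // A few linearly separable toy samples
        double[][] xs = {{0, 0}, {1, 1}, {0, 1}, {1, 0}, {0.9, 0.9}, {0.1, 0.2}};
        int[] ys = {0, 1, 0, 0, 1, 0};
        // Multiple passes over the same small training set
        for (int epoch = 0; epoch < 500; epoch++)
            for (int i = 0; i < xs.length; i++)
                train(ys[i], xs[i]);
        int correct = 0;
        for (int i = 0; i < xs.length; i++)
            if ((predict(xs[i]) > 0.5 ? 1 : 0) == ys[i]) correct++;
        return correct;
    }

    public static void main(String[] args) {
        System.out.println(runDemo() + "/6 correct after 500 passes");
    }
}
```

With a single pass over 6 samples the weights barely move; after many replays the learner separates the classes.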

Some online learners are exact in that they always have the exact result at
hand for all the data they have seen.  Welford's algorithm for computing
sample mean and variance is like that. Others approximate an answer.  Most
systems which are estimating some property of a distribution are
necessarily approximate.  In fact, even Welford's method for means is
really only approximating the mean of the distribution based on what it has
seen so far.  It happens that it gives you the best possible estimate so
far, but that is just because computing a mean is simple enough.  With
regularized logistic regression, the estimation is trickier and you can
only say that the algorithm will converge to the correct result eventually
rather than say that the answer is always as good as it can be.
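Welford's algorithm mentioned above can be sketched in a few lines (class and method names are illustrative):

```java
// Welford's online algorithm: exact running mean and variance, updated in
// O(1) time and O(1) extra memory per observation -- the "exact online
// learner" case described above.
public class Welford {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;  // sum of squared deviations from the current mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;           // update the running mean
        m2 += delta * (x - mean);    // uses both the old and the new mean
    }

    public double mean() { return mean; }

    public double sampleVariance() { return n > 1 ? m2 / (n - 1) : 0.0; }
}
```

After feeding it 1, 2, 3, 4, 5 the mean is exactly 3.0 and the sample variance exactly 2.5, no matter how many more values stream in afterwards.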

Another way to say it is that the key property of on-line learning is that
the learning takes a fixed amount of time and no additional memory for each
input example.


> What would you suggest to use for incremental training instead of OLR?  Is
> mahout perhaps the wrong library?
>

Well, for thousands of examples, anything at all will work quite well, even
R.  Just keep all the data around and fit the data whenever requested.

Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
learner.  A quick experiment indicates that it will handle 200K samples of
the sort you are looking at in about a second, with multiple levels of lambda
thrown into the bargain.  Versions are available in R, Matlab and Fortran (at
least).

http://www-stat.stanford.edu/~tibs/glmnet-matlab/

This kind of in-memory, single machine problem is just not what Mahout is
intended to solve.

Re: OnlineLogisticRegression: Are my settings sensible

Posted by Andreas Bauer <bu...@gmx.net>.
Hi, 

Thanks for your comments. 

I modified the examples from the Mahout in Action book, therefore I used the hashed approach, and that's why I used 100 features. I'll adjust the number.

You say that I'm using the same CVE for all features, so you mean I should create 12 separate CVEs for adding features to the vector like this?


BIAS.addToVector((byte[]) null, 1, denseVector);

this.cve1.addToVector((byte[]) null, sample.getFeatureValue1(), denseVector);
...
this.cve12.addToVector((byte[]) null, sample.getFeatureValue12(), denseVector);

The 12/15 is only a typo; it should be getFeatureValue12.

Finally, I thought online logistic regression meant that it is an online algorithm, so it's fine to train only once. Should I instead invoke the train method over and over again with the same training sample until the next one arrives, or how else should I make the model converge (or at least try to, with the few samples)?

What would you suggest using for incremental training instead of OLR? Is Mahout perhaps the wrong library?

Many thanks, 

Andreas 



Re: OnlineLogisticRegression: Are my settings sensible

Posted by Ted Dunning <te...@gmail.com>.
Why is FEATURE_NUMBER != 13?

With 12 features that are already lovely and continuous, just stick them in
elements 1..12 of a 13 long vector and put a constant value at the
beginning of it.  Hashed encoding is good for sparse stuff, but confusing
for your case.
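A sketch of that layout (a plain double[] stands in for Mahout's DenseVector, and the class name is illustrative; with Mahout you would set the same indices on the vector):

```java
// No hashed encoding at all: put a constant 1.0 (the intercept) at index 0
// and the 12 continuous features at indices 1..12 of a 13-long vector.
public class DensePacking {
    static final int FEATURE_NUMBER = 13;  // 12 features + 1 intercept

    static double[] encode(double[] features) {
        if (features.length != 12)
            throw new IllegalArgumentException("expected 12 features");
        double[] v = new double[FEATURE_NUMBER];
        v[0] = 1.0;                  // constant intercept term
        for (int i = 0; i < 12; i++)
            v[i + 1] = features[i];  // feature i goes to element i+1
        return v;
    }
}
```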

Also, it looks like you only pass through the (very small) training set
once.  The OnlineLogisticRegression is unlikely to converge very well with
such a small number of examples.

Finally, in the hashed representation that you are using, you use exactly
the same CVE to put all 15 (12?) of the variables into the vector.  Since
you are using the same CVE, all of these values will be put into exactly
the same location, which is going to kill performance, since you will get the
effect of summing all your variables together.
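A toy illustration of that effect (this is not Mahout's actual hashing, just a stand-in where the probe index depends only on the encoder's name, as it does for a CVE):

```java
// Toy model of why reusing one ContinuousValueEncoder is fatal: when the
// probe index depends only on the encoder's name, every feature lands in
// the same slot, so the model only ever sees the SUM of all features.
public class CollisionDemo {
    static int probe(String encoderName, int vectorSize) {
        // stand-in hash: a function of the name alone, as for a CVE
        return Math.abs(encoderName.hashCode()) % vectorSize;
    }

    static double[] encodeAll(String name, double[] features, int size) {
        double[] v = new double[size];
        for (double f : features)
            v[probe(name, size)] += f;  // same index every time -> values sum
        return v;
    }

    public static void main(String[] args) {
        double[] v = encodeAll("ContinuousValueEncoder",
                               new double[]{1.0, 2.0, 3.0}, 100);
        int nonZero = 0;
        for (double x : v) if (x != 0) nonZero++;
        System.out.println(nonZero + " non-zero slot(s)");
    }
}
```

With 12 separately named encoders the probe indices differ and the features stay distinguishable.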




