Posted to user@mahout.apache.org by hakeem <to...@indeed.com> on 2011/07/07 23:20:57 UTC

Logistic Regression: poor results on small data set

I'm attempting to run a logistic regression on a small data set: about 350
documents, 30 features.

I am using this toy data set for two reasons:
1) Confirm that my Mahout vector representation is sensible;
2) Confirm that Mahout logistic regression provides sensible results.

My end goal is to run the same procedure on a very large data set:
potentially billions of documents.

I began my investigation with OnlineLogisticRegression. The results were
poor (described in greater detail below), and I then stepped over to
AdaptiveLogisticRegression (again, poor results).

For validation, I am comparing the Mahout results to those obtained
using R glm (family=binomial). (Note: I previously validated the R results
with other methods, and I have a consensus on what is reasonable.)

Because I have so few documents, I run the set of documents through train()
in epochs -- up to 1000 times, shuffling the order of the documents on each
epoch.
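
For readers who want to reproduce the procedure outside Mahout, here is a
minimal sketch of the epoch loop described above: plain stochastic gradient
descent for logistic regression, reshuffling the documents on every epoch.
It is not Mahout's OnlineLogisticRegression (no prior, no learning-rate
annealing); the function names, the fixed learning rate, and the toy data
are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Clamp the score so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_epochs(rows, num_features, epochs=1000, learning_rate=0.1, seed=42):
    """rows: list of (label, features); label is 0 or 1, features a list of floats."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # new document order on every epoch
        for i in order:
            label, x = rows[i]
            p = sigmoid(sum(w * v for w, v in zip(weights, x)))
            # One gradient step on the log-likelihood of this example.
            for j, v in enumerate(x):
                weights[j] += learning_rate * (label - p) * v
    return weights


# Hypothetical toy data: feature 0 is a bias term, feature 1 predicts the
# label, feature 2 is unrelated to the label.
toy_rows = [(1, [1.0, 1.0, 0.0]), (1, [1.0, 1.0, 1.0]),
            (0, [1.0, 0.0, 0.0]), (0, [1.0, 0.0, 1.0])]
print(train_epochs(toy_rows, num_features=3, epochs=200))
```

With dense, collision-free features like these, the learned weight for the
predictive feature should clearly dominate the weight for the unrelated one,
which gives a baseline to compare a hashed encoding against.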

The Mahout results are poor. Mahout does a reasonable job of identifying the
features with positive weights (the top third of the features). However, it
does a very poor job of assigning weights to the features in the middle third
and bottom third of the weight rankings.
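
One way to put a number on how well two weight rankings agree is a rank
correlation. The sketch below computes Spearman's coefficient between two
weight lists, for example one from Mahout and one from R glm. It assumes
there are no tied values, and the sample weights are placeholders rather
than results from this data set.

```python
def ranks(values):
    """Rank of each value (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Placeholder weights, one entry per feature, in the same feature order.
reference_weights = [2.1, 0.4, -0.3, -1.7, 0.9]
candidate_weights = [1.8, -0.2, 0.1, -0.9, 1.1]
print(spearman(reference_weights, candidate_weights))
```

Computing the coefficient separately for the top, middle, and bottom thirds
of the reference ranking would show where the two models diverge.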

My questions:
1) Are these results surprising to you? Or, should they be expected given
the small size of my data set?
2) How might I tweak the OnlineLogisticRegression settings to accommodate my
small data set?



Thank you for your feedback.

--
View this message in context: http://lucene.472066.n3.nabble.com/Logistic-Regression-poor-results-on-small-data-set-tp3149694p3149694.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: Logistic Regression: poor results on small data set

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jul 7, 2011 at 2:20 PM, hakeem <to...@indeed.com> wrote:

> Because I have so few documents, I run the set of documents through train()
> in epochs -- up to 1000 times, shuffling the order of the documents on each
> epoch.
>

Fair.


> My questions:
> 1) Are these results surprising to you? Or, should they be expected given
> the small size of my data set?
>

They are surprising.


> 2) How might I tweak the OnlineLogisticRegression settings to accommodate my
> small data set?
>

You didn't mention how you encode the data, nor what kind of features you
have.  Is this a standard data set?  Can you post the data so that we can
turn it into a worked example?

Re: Logistic Regression: poor results on small data set

Posted by Ted Dunning <te...@gmail.com>.
If you keep the probes at 2, you should have better results with sparse
features and a large dimensionality reduction.

On Thu, Jul 7, 2011 at 5:58 PM, hakeem <to...@indeed.com> wrote:

> I increased the vector size substantially and reduced the number of probes
> to 1. With the collisions eliminated, I find much more reasonable results.
>
> I suppose the lesson is: Vectorization of the data has substantial
> computational performance benefits; however, a degradation in model accuracy
> is a potential trade-off.
>

Re: Logistic Regression: poor results on small data set

Posted by hakeem <to...@indeed.com>.
After further experimentation, I discovered that the vectorization of my
data was a major cause of the degradation in accuracy of the learned
weights. The majority of the features used the same encoder
(StaticWordValueEncoder) with 2 probes. One of the stronger features
collided with both of the weakest features, causing this strong feature
to assume ("learn") a weak weight.

I increased the vector size substantially and reduced the number of probes
to 1. With the collisions eliminated, I find much more reasonable results.
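
To check for collisions like the ones described above, the following sketch
hashes each feature name into one slot per probe and lists the feature pairs
that share a slot. MD5 is only a deterministic stand-in here, not the hash
Mahout's encoders use, and the feature names and vector sizes are
hypothetical.

```python
import hashlib
from itertools import combinations


def probe_slots(feature, vector_size, probes):
    """Return the vector slots a feature name maps to, one per probe."""
    slots = []
    for probe in range(probes):
        # Stand-in hash: MD5 of "name#probe", reduced modulo the vector size.
        digest = hashlib.md5(f"{feature}#{probe}".encode("utf-8")).digest()
        slots.append(int.from_bytes(digest[:8], "big") % vector_size)
    return slots


def colliding_pairs(features, vector_size, probes):
    """List every pair of features that shares at least one slot."""
    slots = {f: set(probe_slots(f, vector_size, probes)) for f in features}
    return [(a, b) for a, b in combinations(features, 2) if slots[a] & slots[b]]


# Hypothetical feature names; compare a small vector with 2 probes against
# a much larger vector with 1 probe.
feature_names = [f"feature_{i}" for i in range(30)]
for size, probes in [(50, 2), (1_000_000, 1)]:
    pairs = colliding_pairs(feature_names, size, probes)
    print(f"vector_size={size} probes={probes} colliding pairs={len(pairs)}")
```

Running the same check against the real encoder output before training would
show whether any strong feature shares all of its slots with other features.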

I suppose the lesson is: Vectorization of the data has substantial
computational performance benefits; however, a degradation in model accuracy
is a potential trade-off.  


--
View this message in context: http://lucene.472066.n3.nabble.com/Logistic-Regression-poor-results-on-small-data-set-tp3149694p3150228.html