Posted to user@mahout.apache.org by hakeem <to...@indeed.com> on 2011/07/07 23:20:57 UTC
Logistic Regression: poor results on small data set
I'm attempting to run a logistic regression on a small data set: about 350
documents, 30 features.
I am using this toy data set for two reasons:
1) Confirm that my Mahout vector representation is sensible;
2) Confirm that Mahout logistic regression provides sensible results.
My end goal is to run the same procedure on a very large data set:
potentially billions of documents.
I began my investigation with OnlineLogisticRegression. The results were
poor (described in greater detail below), and I then stepped over to
AdaptiveLogisticRegression (again, poor results).
For validation, I am comparing the Mahout results to those obtained
using R glm (family=binomial). (Note: I previously validated the R results
with other methods -- and, I have a consensus on what is reasonable).
Because I have so few documents, I run the set of documents through train()
in epochs -- up to 1000 times, shuffling the order of the documents on each
epoch.
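
The epoch procedure described above can be sketched in plain Java. This is only a minimal illustration and makes several assumptions: it uses a fixed learning rate and no regularization, so it is not Mahout's OnlineLogisticRegression, and the class name, method names and toy data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffledEpochSgd {
    /** Logistic function. */
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /**
     * Plain SGD for logistic regression (an assumption, not Mahout's learner).
     * Every epoch visits each row once, in a freshly shuffled order.
     * x[i] is a dense feature row and y[i] is 0 or 1.
     */
    static double[] train(double[][] x, int[] y, int epochs, double learningRate, long seed) {
        double[] w = new double[x[0].length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            order.add(i);
        }
        Random rng = new Random(seed);
        for (int epoch = 0; epoch < epochs; epoch++) {
            Collections.shuffle(order, rng);   // new document order on each epoch
            for (int i : order) {
                double z = 0.0;
                for (int j = 0; j < w.length; j++) {
                    z += w[j] * x[i][j];
                }
                double error = y[i] - sigmoid(z);   // gradient of the log-likelihood
                for (int j = 0; j < w.length; j++) {
                    w[j] += learningRate * error * x[i][j];
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Made-up toy data: column 0 is a bias term, column 1 matches the label,
        // column 2 is unrelated to the label.
        double[][] x = {{1, 1, 0}, {1, 1, 1}, {1, 0, 0}, {1, 0, 1}};
        int[] y = {1, 1, 0, 0};
        System.out.println(Arrays.toString(train(x, y, 1000, 0.1, 42L)));
    }
}
```

Comparing the ranking of the weights this returns against the R glm coefficients is the same kind of check described in this message, just without any hashing of the features.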
The Mahout results are poor. Mahout does a reasonable job of identifying the
features with positive weights (the top third of the features). However, it does
a very poor job of assigning weights to the features in the middle-third and
bottom-third of the weight rankings.
My questions:
1) Are these results surprising to you? Or, should they be expected given
the small size of my data set?
2) How might I tweak the OnlineLogisticRegression settings to accommodate my
small data set?
Thank you for your feedback.
--
View this message in context: http://lucene.472066.n3.nabble.com/Logistic-Regression-poor-results-on-small-data-set-tp3149694p3149694.html
Sent from the Mahout User List mailing list archive at Nabble.com.
Re: Logistic Regression: poor results on small data set
Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jul 7, 2011 at 2:20 PM, hakeem <to...@indeed.com> wrote:
> Because I have so few documents, I run the set of documents through train()
> in epochs -- up to 1000 times, shuffling the order of the documents on each
> epoch.
>
Fair.
> My questions:
> 1) Are these results surprising to you? Or, should they be expected given
> the small size of my data set?
>
They are surprising.
> 2) How might I tweak the OnlineLogisticRegression settings to accommodate my
> small data set?
>
You didn't mention how you encode the data, nor what kind of features you
have. Is this a standard data set? Can you post the data so that we can
turn it into a worked example?
Re: Logistic Regression: poor results on small data set
Posted by Ted Dunning <te...@gmail.com>.
If you keep the probes at 2, you should have better results with sparse
features and a large dimensionality reduction.
On Thu, Jul 7, 2011 at 5:58 PM, hakeem <to...@indeed.com> wrote:
> I increased the vector size substantially and reduced the number of probes
> to 1. With the collisions eliminated, I find much more reasonable results.
>
> I suppose the lesson is: Vectorization of the data has substantial
> computational performance benefits; however, a degradation in model accuracy
> is a potential trade-off.
>
Re: Logistic Regression: poor results on small data set
Posted by hakeem <to...@indeed.com>.
After further experimentation, I discovered that the vectorization of my
data was a major cause for the degradation in accuracy of the learned
weights. The majority of the features used the same encoder
(StaticWordValueEncoder) with 2 probes. One of the stronger features
collided with both of the weakest features -- thus, causing this strong
feature to assume ("learn") a weak weight.
I increased the vector size substantially and reduced the number of probes
to 1. With the collisions eliminated, I find much more reasonable results.
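
The collision effect can be illustrated with a small self-contained Java sketch. It rests on assumptions throughout: slot() is a hypothetical stand-in built on String.hashCode(), not the hash that StaticWordValueEncoder actually uses, and the feature names f0 to f29 are invented. It only counts how many features end up sharing a vector slot with some other feature.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ProbeCollisions {
    /** Hypothetical stand-in for a hashed encoder: the slot hit by one probe of one feature. */
    static int slot(String feature, int probe, int cardinality) {
        return Math.floorMod((feature + "#" + probe).hashCode(), cardinality);
    }

    /** Invented feature names f0, f1, ... used only for this illustration. */
    static String[] featureNames(int n) {
        String[] names = new String[n];
        for (int i = 0; i < n; i++) {
            names[i] = "f" + i;
        }
        return names;
    }

    /** Counts features that share at least one vector slot with a different feature. */
    static int collidingFeatures(String[] features, int probes, int cardinality) {
        Map<Integer, Set<String>> owners = new HashMap<>();
        for (String f : features) {
            for (int p = 0; p < probes; p++) {
                owners.computeIfAbsent(slot(f, p, cardinality), k -> new HashSet<>()).add(f);
            }
        }
        Set<String> colliding = new HashSet<>();
        for (Set<String> sharing : owners.values()) {
            if (sharing.size() > 1) {
                colliding.addAll(sharing);   // every feature in a shared slot is affected
            }
        }
        return colliding.size();
    }

    public static void main(String[] args) {
        String[] features = featureNames(30);
        // Small vector with 2 probes versus a much larger vector with 1 probe.
        System.out.println("2 probes, size 20:    " + collidingFeatures(features, 2, 20));
        System.out.println("1 probe,  size 2^20:  " + collidingFeatures(features, 1, 1 << 20));
    }
}
```

Any two features that share a slot also share the weight learned for that slot, which is how a strong feature can end up with a weak weight.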
I suppose the lesson is: Vectorization of the data has substantial
computational performance benefits; however, a degradation in model accuracy
is a potential trade-off.
--
View this message in context: http://lucene.472066.n3.nabble.com/Logistic-Regression-poor-results-on-small-data-set-tp3149694p3150228.html