Posted to user@mahout.apache.org by Stanley Xu <we...@gmail.com> on 2011/05/23 10:18:29 UTC

SGD didn't work well with high dimensions in a test on randomly generated data.

Dear All,

I am trying to evaluate the correctness of the SGD algorithm in Mahout. I
use a program to generate random weights, training data and test data, and
then use OnlineLogisticRegression and AdaptiveLogisticRegression to train
and classify. But it looks like SGD doesn't work well, and I am wondering
if I missed anything in using it.

I did the test with the following data set:

1. 10 feature dimensions, each with value 0 or 1. The weights are generated
randomly in the range -5 to 5. The training set is 10k records (or 100
records), with a 1:1 ratio of negative to positive targets.
The classification on both the training and test data looks fine to me:
false positives and false negatives are each under 100, which is less than 1%.

2. 100 feature dimensions, each with value 0 or 1. The weights are generated
randomly in the range -5 to 5. The training set is 100k to 1000k records,
with a 1:1 ratio of negative to positive targets.
The classification on both the training and test data is not very good:
false positives and false negatives are each close to 10%. But the AUC is
pretty good: about 90% with AdaptiveLogisticRegression and 85% with raw
OnlineLogisticRegression.

3. 100 feature dimensions, but with the negative-to-positive ratio changed
to 10:1 to match the real training set we will get.
With raw OnlineLogisticRegression, most positive targets (more than 90%)
are predicted as negative, and the AUC drops to 60%. Even worse, with
AdaptiveLogisticRegression, all positive targets are predicted as negative,
and the AUC drops to 58%.
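
To illustrate why a high AUC can coexist with poor accuracy here: with a
10:1 class ratio, a model can rank positives above negatives almost
perfectly while still scoring everything below the default 0.5 decision
threshold. A tiny self-contained sketch (the scores are made up for
illustration, not produced by the actual model):

```java
import java.util.Arrays;

public class ThresholdDemo {
    // count how many scores exceed a decision threshold
    static int countAbove(double[] scores, double threshold) {
        int c = 0;
        for (double s : scores) if (s > threshold) c++;
        return c;
    }

    public static void main(String[] args) {
        double[] pos = new double[10], neg = new double[100];
        Arrays.fill(pos, 0.3); // positives all score above ...
        Arrays.fill(neg, 0.1); // ... negatives, so the ranking (AUC) is perfect
        System.out.println(countAbove(pos, 0.5)); // 0 positives found at threshold 0.5
        System.out.println(countAbove(pos, 0.2)); // all 10 found at threshold 0.2
        System.out.println(countAbove(neg, 0.2)); // and still 0 false positives
    }
}
```

So part of the "all positives predicted negative" symptom may just be the
decision threshold, separate from any learning problem.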

The code to generate the data can be found here:
http://pastebin.com/GAA1di5z

The code to train and classify the data can be found here:
http://pastebin.com/EjMpGQ1h

The parameters there can be changed to generate different data sets.
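
For readers who don't want to open the pastebin, the generation scheme is
roughly the following (a simplified sketch of what I described above, not
the exact pastebin code; it assumes the labels are drawn from the logistic
probability under the random weight vector):

```java
import java.util.Random;

public class GenData {
    public static void main(String[] args) {
        int dim = 10, n = 1000;
        Random rnd = new Random(42);

        // random true weights in [-5, 5]
        double[] w = new double[dim];
        for (int i = 0; i < dim; i++) w[i] = rnd.nextDouble() * 10 - 5;

        int positives = 0;
        for (int r = 0; r < n; r++) {
            // binary features, each 0 or 1
            double dot = 0;
            for (int i = 0; i < dim; i++) if (rnd.nextBoolean()) dot += w[i];
            // label drawn from the logistic probability
            double p = 1.0 / (1.0 + Math.exp(-dot));
            if (rnd.nextDouble() < p) positives++;
        }
        System.out.println("dim=" + dim + " n=" + n + " positives=" + positives);
    }
}
```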

I think the error rate is unacceptably high, especially on data that a
hyperplane can separate perfectly. And the error rate is unusually high
even on the training data itself.

I know SGD is an approximate solution rather than an exact one, but isn't
a 20% classification error too high?

I understand that with unbalanced positives and negatives in the training
set, we could attach a weight to the training examples. I have tried this,
but it is hard to decide what weight to choose, and as I understand it, the
weight should also change dynamically with the current learning rate, since
a high learning rate combined with a high weight will mislead the model in
the wrong direction. We have tried several strategies, but the results are
not good. Any tips on how to set the weight for SGD, given that it is not a
globally convex optimization method like some other logistic regression
solvers?
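
To make concrete what I mean by adding a weight: the idea we tried scales
the gradient for the rare class by an importance factor. A minimal sketch
of one weighted SGD step for logistic regression (my own illustration, not
Mahout's API; posWeight is a hypothetical parameter):

```java
public class WeightedSgd {
    // one SGD step for logistic regression, with an importance weight
    // applied to positive (rare-class) examples
    static void update(double[] w, double[] x, int y, double lr, double posWeight) {
        double dot = 0;
        for (int i = 0; i < w.length; i++) dot += w[i] * x[i];
        double p = 1.0 / (1.0 + Math.exp(-dot)); // predicted probability
        double imp = (y == 1) ? posWeight : 1.0; // up-weight rare positives
        double g = imp * (y - p);                // weighted log-likelihood gradient
        for (int i = 0; i < w.length; i++) w[i] += lr * g * x[i];
    }

    public static void main(String[] args) {
        double[] w = new double[3];
        update(w, new double[]{1, 0, 1}, 1, 0.1, 10.0); // positive example, 10x weight
        // from w = 0: p = 0.5, so each active feature moves by 0.1 * 10 * 0.5 = 0.5
        System.out.println(w[0] + " " + w[1] + " " + w[2]);
    }
}
```

The coupling with the learning rate that worries me is visible here: the
effective step size is lr * posWeight, so a large weight behaves exactly
like a large learning rate.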

Thanks.


Best wishes,
Stanley Xu

Re: SGD didn't work well with high dimensions in a test on randomly generated data.

Posted by Hector Yee <he...@gmail.com>.
I'm curious, have you tried patch 702 (SGD passive-aggressive)? There's no
training wrapper for it yet, but the one for any other SGD should work.

I've had useful models from it before, even with much higher dimension and
less training data.
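
The update itself is simple, which is part of the appeal. Roughly (my
sketch of the standard passive-aggressive update from Crammer et al., not
the code in the patch):

```java
public class PassiveAggressive {
    // passive-aggressive update, hinge-loss variant: move w just enough
    // that the current example is classified with margin 1; y in {-1, +1}
    static void update(double[] w, double[] x, int y) {
        double dot = 0, norm2 = 0;
        for (int i = 0; i < w.length; i++) {
            dot += w[i] * x[i];
            norm2 += x[i] * x[i];
        }
        double loss = Math.max(0, 1 - y * dot); // hinge loss
        if (loss > 0 && norm2 > 0) {
            double tau = loss / norm2;          // smallest step that zeroes the loss
            for (int i = 0; i < w.length; i++) w[i] += tau * y * x[i];
        }
    }

    public static void main(String[] args) {
        double[] w = new double[2];
        update(w, new double[]{1, 1}, 1); // loss = 1, norm2 = 2, tau = 0.5
        System.out.println(w[0] + " " + w[1]);
    }
}
```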

-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)

Re: SGD didn't work well with high dimensions in a test on randomly generated data.

Posted by Ted Dunning <te...@gmail.com>.
Stanley,

This is interesting work.  The data you are generating is a bit outside the
data that the ALR is normally used to process, but the results do seem like
they should be better.  It is possible, however, that there is insufficient
data to train the model (100 dense dimensions are a pretty high dimensional
problem for only 100K examples).

In order to evaluate this more it would be good to use a system like glmnet
or bayesglm (in R) on the same data.  What results do you get with these
alternative algorithms?

I am not going to be able to look at this in more detail for a while due to
schedule conflicts; hopefully somebody else can comment.


Re: SGD didn't work well with high dimensions in a test on randomly generated data.

Posted by Ted Dunning <te...@gmail.com>.
I have done some unrelated tests and I think that SGD has suffered some
unknown decrease in accuracy.  20 newsgroups used to get to 86% accuracy
and now only gets to near 80%.  When I find time I will try to figure out
what has happened.

Your test results may be related.


Re: SGD didn't work well with high dimensions in a test on randomly generated data.

Posted by Stanley Xu <we...@gmail.com>.
It looks like if I set decay to 1 (no learning rate decay), remove the
regularization, use the raw OnlineLogisticRegression, and adjust the
learning rate, the performance is much better.
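
To see why the decay setting matters so much, here is a pure-Java sketch of
a Mahout-style learning-rate schedule (an approximation for illustration,
not Mahout's actual code): a per-step multiplicative decay alpha^t combined
with a polynomial decay.

```java
public class DecaySchedule {
    // Mahout-style per-step learning rate: multiplicative decay alpha^t
    // combined with polynomial decay (t + offset)^(-exponent)
    static double rate(double lr0, double alpha, double offset, double exp, long t) {
        return lr0 * Math.pow(alpha, t) * Math.pow(t + offset, -exp);
    }

    public static void main(String[] args) {
        // with alpha = 1 and exponent = 0, the rate never decays
        System.out.println(rate(0.5, 1.0, 10, 0.0, 100000) == 0.5);
        // with alpha just below 1, the rate has collapsed after 100k steps,
        // so later training examples barely move the model
        System.out.println(rate(0.5, 0.999, 10, 0.0, 100000) < 1e-20);
    }
}
```

With hundreds of thousands of examples, a decayed rate can effectively
freeze the model long before it has fit the data, which would explain why
setting decay to 1 helps here.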

Best wishes,
Stanley Xu


