Posted to user@mahout.apache.org by Sreejith S <sr...@gmail.com> on 2011/12/22 06:26:46 UTC

Mahout SGD / Bayes prediction results over 20newsgroups

Hi all,

I compared the SGD and Bayes classifiers on the 20news-bydate dataset:
http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

The classifier results and confusion matrices are a bit confusing, since
SGD is said to be better for small datasets and Bayes for large datasets.
Please check my test scenario: http://pastebin.com/K0cy0ayk

Even on a small dataset like 20news-bydate, Bayes gives 97% accuracy
while SGD gives 63% :(
Am I missing something? Please clarify.

Thank You,
-- 


*Sreejith.S*
http://srijiths.wordpress.com/
* *http://sreejiths.emurse.com/

tweet2sree@twitter <http://tweet2Sree>

Re:Re: How to apply my bayes-model on test data without labels ?

Posted by enyun <co...@126.com>.
hi Ramprakash,

Thanks very much for your help.
But I'm curious why the Mahout Bayes module does not provide a general-purpose tool for predicting the label of a new instance.

thanks,


 
 
 
> -----Original Message-----
> From: "Ramprakash Ramamoorthy" <yo...@gmail.com>
> Sent: Monday, December 26, 2011
> To: user@mahout.apache.org
> Cc: 
> Subject: Re: How to apply my bayes-model on test data without labels ?
> 
> Use the classifyDocument().
> 
> 
> Check out
> http://aredko.blogspot.com/2010/11/getting-started-with-apache-mahout.html .
> This should help.
> 
> -- 
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> Project Trainee,
> Zoho Corporation.
> +91 9626975420

Re: How to apply my bayes-model on test data without labels ?

Posted by Ramprakash Ramamoorthy <yo...@gmail.com>.
2011/12/26 enyun <co...@126.com>

> I.E. under bayes method, I want to get the prediction label with score,
> but testclassifier only provide overall confusion result instead of
> detailed result.
> how do you deal with it?

Use the classifyDocument().


Check out
http://aredko.blogspot.com/2010/11/getting-started-with-apache-mahout.html .
This should help.
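
For readers who want to see what a per-document prediction with a score looks like, here is a minimal Python sketch of a Naive Bayes classifier that labels an unlabeled document. It is a toy illustration only: the functions `train` and `classify`, the sample data, and the add-one smoothing are assumptions made for this example, and it is not Mahout's classifyDocument() API.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (label, tokens). Returns a toy model."""
    word_counts = defaultdict(Counter)   # label -> token counts
    doc_counts = Counter()               # label -> number of documents
    vocab = set()
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(model, tokens):
    """Return (best_label, scores); scores maps label -> log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)   # class prior
        for tok in tokens:
            # Add-one smoothing so an unseen word does not zero out the score.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical training data; the new document below carries no label.
training = [
    ("sci.space", "rocket launch orbit nasa".split()),
    ("sci.space", "orbit satellite launch".split()),
    ("rec.autos", "car engine wheel".split()),
    ("rec.autos", "engine oil car dealer".split()),
]
model = train(training)
label, scores = classify(model, "nasa rocket engine orbit".split())
print(label, scores)
```

The point is that classification needs only the model and the document's tokens; labels are required only when you want to measure accuracy.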

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Project Trainee,
Zoho Corporation.
+91 9626975420

Re:How to apply my bayes-model on test data without labels ?

Posted by enyun <co...@126.com>.
I.e., with the Bayes method, I want to get the predicted label together with its score, but testclassifier only provides the overall confusion matrix rather than per-document results.
How do you deal with this?

 
thanks,


 
> -----Original Message-----
> From: enyun <co...@126.com>
> Sent: Monday, December 26, 2011
> To: user@mahout.apache.org
> Cc: 
> Subject: How to apply my bayes-model on test data without labels ?
> 
> Hi guys,
> 
> I have finished training and obtained a Bayes model with the command: mahout trainclassifier.
> Now I want to apply the model to new test data, but this test data doesn't have labels.
> 
> I have checked "testclassifier", but found that the test data must have labels.
> Does anybody know how to apply my Bayes model to test data without labels?
> 
> thanks,
> 
> 

How to apply my bayes-model on test data without labels ?

Posted by enyun <co...@126.com>.
Hi guys,

I have finished training and obtained a Bayes model with the command: mahout trainclassifier.
Now I want to apply the model to new test data, but this test data doesn't have labels.

I have checked "testclassifier", but found that the test data must have labels.
Does anybody know how to apply my Bayes model to test data without labels?

thanks,



Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Lance Norskog <go...@gmail.com>.
The accuracy fell from round 8000 to round 10000. Yet, the SGD trainer
saved the final model. Is this a bug?

On Fri, Dec 30, 2011 at 1:56 AM, Lance Norskog <go...@gmail.com> wrote:
> /tmp/news-group-8000.model
> Correctly Classified Instances          :       4698       62.3739%
> Avg. Log-likelihood: -2.9454041622918785 25%-ile: -3.7975533569786766
> 75%-ile: -0.14104508309186575
> /tmp/news-group-10000.model
> Correctly Classified Instances          :       4634       61.5242%
> Avg. Log-likelihood: -3.161176354750601 25%-ile: -4.281455155523565
> 75%-ile: -0.16246336765931288



-- 
Lance Norskog
goksron@gmail.com

Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Josh Patterson <jo...@cloudera.com>.
Playing with classifying some tweets with LR/SGD is yielding results in
the 60s for me as well.

I'm running from the command line with "mahout runlogistic", a few
hundred training samples.

I'm continuing to play with the tuning params.

JP

On Fri, Dec 30, 2011 at 4:56 AM, Lance Norskog <go...@gmail.com> wrote:
> examples/bin/classify-20newsgroups.sh:
>
> Naive Bayes, N-grams = 1:
> 6 minutes
> 79.9% correct
>
> Naive Bayes, N-grams = 2:
> 20 minutes
> 81.3% correct
>
> SGD with leaktype 6 (3 and 6 do the same)
> 12 minutes
> 62.3% peak, then drops to 61%



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com

Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Lance Norskog <go...@gmail.com>.
examples/bin/classify-20newsgroups.sh:

Naive Bayes, N-grams = 1:
6 minutes
79.9% correct

Naive Bayes, N-grams = 2:
20 minutes
81.3% correct

SGD with leaktype 6 (3 and 6 do the same)
12 minutes
62.3% peak, then drops to 61%
SGD leaves a series of models after various numbers of iterations,
showing its progression until it stops improving:

/tmp/news-group-1000.model
Correctly Classified Instances          :       2859	    37.958%
Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
75%-ile: -2.5782489881172115
/tmp/news-group-1200.model
Correctly Classified Instances          :       2859	    37.958%
Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
75%-ile: -2.5782489881172115
/tmp/news-group-1400.model
Correctly Classified Instances          :       2859	    37.958%
Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
75%-ile: -2.5782489881172115
/tmp/news-group-1500.model
Correctly Classified Instances          :       2859	    37.958%
Avg. Log-likelihood: -3.8431954124766556 25%-ile: -5.840980481119284
75%-ile: -2.5782489881172115
/tmp/news-group-2000.model
Correctly Classified Instances          :       3351	   44.4902%
Avg. Log-likelihood: NaN 25%-ile: NaN 75%-ile: NaN
/tmp/news-group-2500.model
Correctly Classified Instances          :       3940	   52.3101%
Avg. Log-likelihood: -3.353904736632817 25%-ile: -4.70678014243904
75%-ile: -0.6282543277378467
/tmp/news-group-3000.model
Correctly Classified Instances          :       3940	   52.3101%
Avg. Log-likelihood: -3.353904736632817 25%-ile: -4.70678014243904
75%-ile: -0.6282543277378467
/tmp/news-group-4000.model
Correctly Classified Instances          :       3809	   50.5709%
Avg. Log-likelihood: -3.927160619166431 25%-ile: -5.511528690450845
75%-ile: -0.7226749027343783
/tmp/news-group-5000.model
Correctly Classified Instances          :       4386	   58.2315%
Avg. Log-likelihood: -3.153884339533505 25%-ile: -4.301429974183646
75%-ile: -0.24757357759053825
/tmp/news-group-6000.model
Correctly Classified Instances          :       4507	    59.838%
Avg. Log-likelihood: -3.112089198948625 25%-ile: -4.141184371965078
75%-ile: -0.18253005926770405
/tmp/news-group-7000.model
Correctly Classified Instances          :       4569	   60.6612%
Avg. Log-likelihood: -3.02017716448018 25%-ile: -3.921831347572432
75%-ile: -0.19148778067035277
/tmp/news-group-8000.model
Correctly Classified Instances          :       4698	   62.3739%
Avg. Log-likelihood: -2.9454041622918785 25%-ile: -3.7975533569786766
75%-ile: -0.14104508309186575
/tmp/news-group-10000.model
Correctly Classified Instances          :       4634	   61.5242%
Avg. Log-likelihood: -3.161176354750601 25%-ile: -4.281455155523565
75%-ile: -0.16246336765931288

This script prints the above sequence:
for f in /tmp/news-group-????.model /tmp/news-group-?????.model
do
	echo "$f"
	mahout org.apache.mahout.classifier.sgd.TestNewsGroups \
		--input /tmp/mahout-work-lancenorskog/20news-bydate/20news-bydate-test/ \
		--model "$f" 2>/dev/null | egrep "(Correctly|Log)"
done
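
Since accuracy peaks at the 8000 checkpoint and then drops, it can help to pick the best saved model from the report rather than the last one. The following is a minimal Python sketch under stated assumptions: the helper name `best_checkpoint` is hypothetical, and it expects text in the same shape as the output above (a model path on one line, followed by a "Correctly Classified Instances" line). The sample report reuses three entries from the results above.

```python
import re

def best_checkpoint(report):
    """Return (model_path, accuracy_percent) for the highest accuracy in the report."""
    best = None
    current_model = None
    for line in report.splitlines():
        line = line.strip()
        if line.endswith(".model"):
            current_model = line          # remember which model the next lines describe
            continue
        match = re.match(r"Correctly Classified Instances\s*:\s*(\d+)\s+([\d.]+)%", line)
        if match and current_model is not None:
            accuracy = float(match.group(2))
            if best is None or accuracy > best[1]:
                best = (current_model, accuracy)
    return best

report = """\
/tmp/news-group-7000.model
Correctly Classified Instances          :       4569       60.6612%
/tmp/news-group-8000.model
Correctly Classified Instances          :       4698       62.3739%
/tmp/news-group-10000.model
Correctly Classified Instances          :       4634       61.5242%
"""
print(best_checkpoint(report))
```

Note that choosing a checkpoint by test-set accuracy makes that accuracy optimistic; a separate validation split would be the cleaner way to select.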

On Wed, Dec 21, 2011 at 10:58 PM, Ted Dunning <te...@gmail.com> wrote:
> On Wed, Dec 21, 2011 at 10:46 PM, Sreejith S <sr...@gmail.com> wrote:
>
>> On Thu, Dec 22, 2011 at 12:04 PM, Lance Norskog <go...@gmail.com> wrote:
>>
>> > The Bayes in the examples doesn't work very well in the 20 newsgroups
>> > example. Something is wrong  in the data ETL, the tuning options, or
>> > the Bayes implementation.
>> >
>> > On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> > > 97% is not correct.  This sounds like you ran it on the training data.
>> >
>>
>> @Ted , yes i ran it on the same training data.
>>
>
> That isn't a valid test.
>
>
>>
>> > >
>> > > 63% also sounds low.  I don't know what happened there.
>> >
>>
>> Is any one tested same 20newsgrop with SGD and got better results ?
>>
>
> I remember getting mid 80's.  I think that some accuracy testing is in
> order, however, since I have seen hints that the auto-tuning is clamping
> down too soon.
>
> Also, vowpal wabbit has had excellent results using one round of SGD and
> additional rounds of L-BFGS.  That might make a very powerful version of
> SGD that doesn't need as much of the tuning as we currently have.



-- 
Lance Norskog
goksron@gmail.com

Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Dec 21, 2011 at 10:46 PM, Sreejith S <sr...@gmail.com> wrote:

> On Thu, Dec 22, 2011 at 12:04 PM, Lance Norskog <go...@gmail.com> wrote:
>
> > The Bayes in the examples doesn't work very well in the 20 newsgroups
> > example. Something is wrong  in the data ETL, the tuning options, or
> > the Bayes implementation.
> >
> > On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > 97% is not correct.  This sounds like you ran it on the training data.
> >
>
> @Ted , yes i ran it on the same training data.
>

That isn't a valid test.


>
> > >
> > > 63% also sounds low.  I don't know what happened there.
> >
>
> Has anyone tested the same 20newsgroups data with SGD and gotten better results?
>

I remember getting mid 80's.  I think that some accuracy testing is in
order, however, since I have seen hints that the auto-tuning is clamping
down too soon.

Also, vowpal wabbit has had excellent results using one round of SGD and
additional rounds of L-BFGS.  That might make a very powerful version of
SGD that doesn't need as much of the tuning as we currently have.
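
On why scoring a model against its own training data is not a valid test, here is a small Python illustration. It is a deliberately extreme sketch, not anything from Mahout: the hypothetical `train_memorizer` simply memorizes every training example and falls back to the majority label for anything it has not seen, and the data is made up.

```python
def train_memorizer(examples):
    """'Train' by memorizing each (text, label) pair, with the majority label as fallback."""
    table = dict(examples)
    labels = [label for _, label in examples]
    fallback = max(set(labels), key=labels.count)
    return table, fallback

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    table, fallback = model
    correct = sum(1 for text, label in examples if table.get(text, fallback) == label)
    return correct / len(examples)

# Hypothetical data: the test documents never appear in the training set.
train_set = [("rocket orbit", "space"), ("engine oil", "autos"), ("nasa launch", "space")]
test_set = [("satellite orbit", "space"), ("car dealer", "autos")]

model = train_memorizer(train_set)
print("accuracy on training data:", accuracy(model, train_set))
print("accuracy on held-out data:", accuracy(model, test_set))
```

The memorizer looks perfect on the data it was trained on and much worse on unseen documents, which is why only the held-out number says anything about how a classifier generalizes.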

Re:Re: how to deal with out-of-memory issue of bayes/testclassifier?

Posted by enyun <co...@126.com>.
Hi Lance,
 
I can't express how much I appreciate your help.
Thank you very much!
 
enyun
 
> -----Original Message-----
> From: "Lance Norskog" <go...@gmail.com>
> Sent: Sunday, January 1, 2012
> To: user@mahout.apache.org
> Cc: 
> Subject: Re: how to deal with out-of-memory issue of bayes/testclassifier?
> 
> There are two answers:
> 
> First answer: you are using the "old" Bayes classifier.
> TrainNaiveBayes and TestNaiveBayes are newer and apparently work
> better (I cannot tell you how). TrainNaiveBayes reads the entire model
> in one program at the end of the training pass, so this surprise will
> not happen. 'mahout trainnb' and 'mahout testnb'.
> 
> Second answer: you are training with too much data.  Try a smaller
> corpus, or use the minSupport and minDf parameters to limit the terms
> you train against.
> 
> 
> 2011/12/30 enyun <co...@126.com>:
> > hi all,
> >
> > I'm using a Mahout Bayes model to predict labels for some new data.
> > After building the model with 'trainclassifier', I found that it causes an out-of-memory error when I run 'testclassifier'.
> > I tried increasing the Java heap size to 4g, but it still did not work.
> > It seems strange that trainclassifier works fine while testclassifier does not.
> > Do you know how to deal with this issue?
> >
> > 'java.lang.OutOfMemoryError: Java heap space
> >    at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:435)
> >    at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:387)
> >    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.getFeatureID(InMemoryBayesDatastore.java:131)
> >    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.setSumFeatureWeight(InMemoryBayesDatastore.java:153)
> >    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:82)
> >    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
> >    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
> >    at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
> >    at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierMapper.configure(BayesClassifierMapper.java:130)
> >    ... 22 more'
> >
> > thanks,
> >
> >
> > mahout testclassifier -m /user/mahoutTest//bayes-model -d /user/enyun/mahoutTest//bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
> > MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > Running on hadoop, using HADOOP_HOME=/home/work/Programs/hadoop/hadoop-0.20.203.0/
> > HADOOP_CONF_DIR=/home/work/Programs/hadoop/hadoop-0.20.203.0//conf
> > MAHOUT-JOB: /home/work/code/hadoop/mahout/mahout-trunk/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
> > 11/12/30 16:22:59 WARN driver.MahoutDriver: No testclassifier.props found on classpath, will use command-line arguments only
> > 11/12/30 16:23:00 INFO common.HadoopUtil: Deleting /user/mahoutTest/bayes-test-input-output
> > 11/12/30 16:23:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> > 11/12/30 16:23:01 INFO mapred.FileInputFormat: Total input paths to process : 2
> > 11/12/30 16:23:02 INFO mapred.JobClient: Running job: job_201112231028_0058
> > 11/12/30 16:23:03 INFO mapred.JobClient:  map 0% reduce 0%
> > 11/12/30 16:23:28 INFO mapred.JobClient: Task Id : attempt_201112231028_0058_m_000000_0, Status : FAILED
> > java.lang.RuntimeException: Error in configuring object
> >    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> >    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> >    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> >    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
> >    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
> >    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
> >    at java.security.AccessController.doPrivileged(Native Method)
> >    at javax.security.auth.Subject.doAs(Subject.java:396)
> >    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> >    at org.apache.hadoop.mapred.Child.main(Child.java:253)
> > Caused by: java.lang.reflect.InvocationTargetException
> >    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >    at java.lang.reflect.Method.invoke(Method.java:597)
> >    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> >    ... 9 more
> > Caused by: java.lang.RuntimeException: Error in configuring object
> >    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> >    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> >    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> >    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> >    ... 14 more
> > Caused by: java.lang.reflect.InvocationTargetException
> >    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >    at java.lang.reflect.Method.invoke(Method.java:597)
> >    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> >    ... 17 more
> > Caused by: java.lang.OutOfMemoryError: Java heap space
> >    at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:435)
> >    at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:387)
> >    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.getFeatureID(InMemoryBayesDatastore.java:131)
> >    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.setSumFeatureWeight(InMemoryBayesDatastore.java:153)
> >    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:82)
> >    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
> >    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
> >    at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
> >    at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierMapper.configure(BayesClassifierMapper.java:130)
> >    ... 22 more
> >
> > 11/12/30 16:23:28 INFO mapred.JobClient: Task Id : attempt_201112231028_0058_m_000001_0, Status : FAILED
> > java.lang.RuntimeException: Error in configuring object
> >    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> >    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> >    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> >    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
> >    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
> >    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
> >    at java.security.AccessController.doPrivileged(Native Method)
> >    at javax.security.auth.Subject.doAs(Subject.java:396)
> >    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> >
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com

Re: how to deal with out-of-memory issue of bayes/testclassifier?

Posted by Lance Norskog <go...@gmail.com>.
There are two answers:

First answer: you are using the "old" Bayes classifier.
TrainNaiveBayes and TestNaiveBayes are newer and apparently work
better (I cannot tell you how). TrainNaiveBayes reads the entire model
in one program at the end of the training pass, so this surprise will
not happen. 'mahout trainnb' and 'mahout testnb'.

Second answer: you are training with too much data.  Try a smaller
corpus, or use the minSupport and minDf parameters to limit the terms
you train against.
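
To make the second answer concrete, here is a minimal Python sketch of the idea behind a minimum-document-frequency cutoff such as the minDf parameter mentioned above: terms that appear in too few documents are dropped, which shrinks the feature dictionary the model has to hold in memory. The function `prune_vocabulary` and the sample documents are assumptions for illustration; this is not Mahout's implementation.

```python
from collections import Counter

def prune_vocabulary(docs, min_df):
    """Keep only terms that occur in at least min_df documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))           # count each term once per document
    keep = {term for term, count in df.items() if count >= min_df}
    pruned = [[t for t in tokens if t in keep] for tokens in docs]
    return pruned, keep

# Hypothetical corpus of three tokenized documents.
docs = [
    "the rocket reached orbit".split(),
    "the car engine failed".split(),
    "the rocket engine fired".split(),
]
pruned, vocab = prune_vocabulary(docs, min_df=2)
print(sorted(vocab))
print(pruned)
```

With a cutoff of 2, the eight distinct terms shrink to three; on a real corpus, where most terms are rare, the reduction is usually far larger.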


2011/12/30 enyun <co...@126.com>:
> hi all,
>
> I'm using a Mahout Bayes model to predict labels for some new data.
> After building the model with 'trainclassifier', I found that it causes an out-of-memory error when I run 'testclassifier'.
> I tried increasing the Java heap size to 4g, but it still did not work.
> It seems strange that trainclassifier works fine while testclassifier does not.
> Do you know how to deal with this issue?
>
> 'java.lang.OutOfMemoryError: Java heap space
>    at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:435)
>    at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:387)
>    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.getFeatureID(InMemoryBayesDatastore.java:131)
>    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.setSumFeatureWeight(InMemoryBayesDatastore.java:153)
>    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:82)
>    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
>    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
>    at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
>    at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierMapper.configure(BayesClassifierMapper.java:130)
>    ... 22 more'
>
> thanks,
>
>
> mahout testclassifier -m /user/mahoutTest//bayes-model -d /user/enyun/mahoutTest//bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using HADOOP_HOME=/home/work/Programs/hadoop/hadoop-0.20.203.0/
> HADOOP_CONF_DIR=/home/work/Programs/hadoop/hadoop-0.20.203.0//conf
> MAHOUT-JOB: /home/work/code/hadoop/mahout/mahout-trunk/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
> 11/12/30 16:22:59 WARN driver.MahoutDriver: No testclassifier.props found on classpath, will use command-line arguments only
> 11/12/30 16:23:00 INFO common.HadoopUtil: Deleting /user/mahoutTest/bayes-test-input-output
> 11/12/30 16:23:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 11/12/30 16:23:01 INFO mapred.FileInputFormat: Total input paths to process : 2
> 11/12/30 16:23:02 INFO mapred.JobClient: Running job: job_201112231028_0058
> 11/12/30 16:23:03 INFO mapred.JobClient:  map 0% reduce 0%
> 11/12/30 16:23:28 INFO mapred.JobClient: Task Id : attempt_201112231028_0058_m_000000_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
>    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
>    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>    at java.security.AccessController.doPrivileged(Native Method)
>    at javax.security.auth.Subject.doAs(Subject.java:396)
>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>    at org.apache.hadoop.mapred.Child.main(Child.java:253)
> Caused by: java.lang.reflect.InvocationTargetException
>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>    at java.lang.reflect.Method.invoke(Method.java:597)
>    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>    ... 9 more
> Caused by: java.lang.RuntimeException: Error in configuring object
>    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>    ... 14 more
> Caused by: java.lang.reflect.InvocationTargetException
>    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>    at java.lang.reflect.Method.invoke(Method.java:597)
>    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>    ... 17 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
>    at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:435)
>    at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:387)
>    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.getFeatureID(InMemoryBayesDatastore.java:131)
>    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.setSumFeatureWeight(InMemoryBayesDatastore.java:153)
>    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:82)
>    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
>    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
>    at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
>    at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierMapper.configure(BayesClassifierMapper.java:130)
>    ... 22 more
>
> 11/12/30 16:23:28 INFO mapred.JobClient: Task Id : attempt_201112231028_0058_m_000001_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
>    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
>    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>    at java.security.AccessController.doPrivileged(Native Method)
>    at javax.security.auth.Subject.doAs(Subject.java:396)
>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>



-- 
Lance Norskog
goksron@gmail.com

how to deal with out-of-memory issue of bayes/testclassifier?

Posted by enyun <co...@126.com>.
Hi all,

I'm using a Mahout Bayes model to predict labels for some new data.
After building the model with 'trainclassifier', I found that loading it in 'testclassifier' causes an out-of-memory error.
I tried enlarging my Java heap size to 4g, but it still did not work.
It seems strange that trainclassifier works well while testclassifier does not.
Do you know how to deal with this issue?

'java.lang.OutOfMemoryError: Java heap space
    at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:435)
    at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:387)
    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.getFeatureID(InMemoryBayesDatastore.java:131)
    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.setSumFeatureWeight(InMemoryBayesDatastore.java:153)
    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:82)
    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
    at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
    at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierMapper.configure(BayesClassifierMapper.java:130)
    ... 22 more'

thanks,


mahout testclassifier -m /user/mahoutTest//bayes-model -d /user/enyun/mahoutTest//bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/work/Programs/hadoop/hadoop-0.20.203.0/
HADOOP_CONF_DIR=/home/work/Programs/hadoop/hadoop-0.20.203.0//conf
MAHOUT-JOB: /home/work/code/hadoop/mahout/mahout-trunk/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
11/12/30 16:22:59 WARN driver.MahoutDriver: No testclassifier.props found on classpath, will use command-line arguments only
11/12/30 16:23:00 INFO common.HadoopUtil: Deleting /user/mahoutTest/bayes-test-input-output
11/12/30 16:23:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/30 16:23:01 INFO mapred.FileInputFormat: Total input paths to process : 2
11/12/30 16:23:02 INFO mapred.JobClient: Running job: job_201112231028_0058
11/12/30 16:23:03 INFO mapred.JobClient:  map 0% reduce 0%
11/12/30 16:23:28 INFO mapred.JobClient: Task Id : attempt_201112231028_0058_m_000000_0, Status : FAILED
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 17 more
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.mahout.math.map.OpenObjectIntHashMap.rehash(OpenObjectIntHashMap.java:435)
    at org.apache.mahout.math.map.OpenObjectIntHashMap.put(OpenObjectIntHashMap.java:387)
    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.getFeatureID(InMemoryBayesDatastore.java:131)
    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.setSumFeatureWeight(InMemoryBayesDatastore.java:153)
    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:82)
    at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
    at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
    at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
    at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesClassifierMapper.configure(BayesClassifierMapper.java:130)
    ... 22 more

11/12/30 16:23:28 INFO mapred.JobClient: Task Id : attempt_201112231028_0058_m_000001_0, Status : FAILED
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:431)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
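
A note on the trace above: the OutOfMemoryError is raised under org.apache.hadoop.mapred.Child.main, while BayesClassifierMapper.configure is loading the model into InMemoryBayesDatastore. That suggests the heap that ran out belongs to the map task's JVM rather than to the client that launched the job, so enlarging the client heap alone may not help. On Hadoop releases of this era the task heap is normally taken from the mapred.child.java.opts property; whether testclassifier passes a command-line override through to the tasks is not confirmed here. The sketch below is not Mahout code. It is a small stand-alone check, with hypothetical names (HeapReport, maxHeapMegabytes, xmxSetting), that reports the heap limit of the JVM it runs in and reads the -Xmx value out of an options string.

```java
// Sketch only: reports the heap limit of whatever JVM runs it.
// HeapReport and its methods are illustrative names, not part of Mahout or Hadoop.
final class HeapReport {

    /** Returns the maximum heap this JVM may use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Pulls the -Xmx value out of a JVM options string, or returns null if absent. */
    static String xmxSetting(String jvmOpts) {
        for (String token : jvmOpts.trim().split("\\s+")) {
            if (token.startsWith("-Xmx")) {
                return token.substring(4);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("max heap of this JVM: " + maxHeapMegabytes() + " MB");
        // Hypothetical value for mapred.child.java.opts; not taken from the poster's cluster.
        System.out.println("task heap setting: " + xmxSetting("-server -Xmx200m"));
    }
}
```

Logging the same Runtime.getRuntime().maxMemory() value from inside a task would show whether a heap change actually reached the mapper.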


Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Ted Dunning <te...@gmail.com>.
Thanks.

This confirms my suspicions that the AdaptiveLogisticRegression has
regressed somehow.

I am munching on the pig interfaces right now and should get back to this
before too long.

On Fri, Dec 30, 2011 at 10:18 AM, Josh Patterson <jo...@cloudera.com> wrote:

> I'm on random text (tweets), which are just like blobs of text like
> the newsgroups dataset.
>
> I was stuck in the 60s as well and then tried playing with the
> parameters. What worked for me to get up into the upper 70s was to set
> the "-features" param higher (started at 20, moved up to 200 to get 76%).
>
> Hope that helps, playing with parameters is always an art in ML, can
> be time consuming.
>
> JP
>
> On Thu, Dec 22, 2011 at 1:46 AM, Sreejith S <sr...@gmail.com> wrote:
> > On Thu, Dec 22, 2011 at 12:04 PM, Lance Norskog <go...@gmail.com>
> wrote:
> >
> >> The Bayes in the examples doesn't work very well in the 20 newsgroups
> >> example. Something is wrong  in the data ETL, the tuning options, or
> >> the Bayes implementation.
> >>
> >> On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >> > 97% is not correct.  This sounds like you ran it on the training data.
> >>
> >
> > @Ted, yes, I ran it on the same training data.
> >
> >
> >> >
> >> > 63% also sounds low.  I don't know what happened there.
> >>
> >
> > Has anyone tested the same 20newsgroups data with SGD and got better results?
> >
> >> >
> >> > On Wed, Dec 21, 2011 at 9:26 PM, Sreejith S <sr...@gmail.com>
> >> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> I made a comparison between SGD and Bayes classifiers over
> 20news-bydate
> >> >> dataset.
> >> >>
> http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
> >> >>
> >> >> The classifier results and confusion matrix seems a bit confused,
> since
> >> it
> >> >> is said that SGD is better for small datasets and Bayes for large
> >> datasets.
> >> >> Pls check my test scenario http://pastebin.com/K0cy0ayk
> >> >>
> >> >> It seems that even in small dataset like 20news-bydate Bayes gives
> 97 %
> >> >> accuracy and SGD gives 63 % :(
> >> >> Am i missing something?? Pls clarify.
> >> >>
> >> >> Thank You,
> >> >> --
> >> >>
> >> >>
> >> >> *Sreejith.S*
> >> >> http://srijiths.wordpress.com/
> >> >> * *http://sreejiths.emurse.com/
> >> >>
> >> >> tweet2sree@twitter <http://tweet2Sree>
> >> >>
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >
> >
> >
> > --
> >
> >
> > *Sreejith.S*
> > http://srijiths.wordpress.com/
> > * *http://sreejiths.emurse.com/
> >
> > tweet2sree@twitter <http://tweet2Sree>
>
>
>
> --
> Twitter: @jpatanooga
> Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com
>

Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Josh Patterson <jo...@cloudera.com>.
I'm working on random text (tweets), which are blobs of text much like
the newsgroups dataset.

I was stuck in the 60s as well and then tried playing with the
parameters. What got me into the upper 70s was setting the "-features"
param higher (started at 20, moved up to 200 to get 76%).

Hope that helps. Playing with parameters is always an art in ML and can
be time consuming.

JP
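
One plausible reading of why a larger "-features" value helps, assuming the parameter sets the length of the hashed feature vector used by the SGD trainer: with only 20 slots, many distinct words land in the same slot and the model cannot tell them apart. The sketch below is not Mahout's encoder. It uses plain String.hashCode and made-up tokens, and the names HashedFeatureDemo, bucket and occupiedSlots are hypothetical, but it shows how few slots a vocabulary can occupy when the vector is short.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: plain hashing into a fixed-size vector, not Mahout's feature encoders.
final class HashedFeatureDemo {

    /** Maps a token to a slot index in a vector of the given size. */
    static int bucket(String token, int numFeatures) {
        return Math.floorMod(token.hashCode(), numFeatures);
    }

    /** Counts how many distinct slots a vocabulary occupies. */
    static int occupiedSlots(String[] vocabulary, int numFeatures) {
        Set<Integer> used = new HashSet<>();
        for (String token : vocabulary) {
            used.add(bucket(token, numFeatures));
        }
        return used.size();
    }

    public static void main(String[] args) {
        String[] vocabulary = new String[1000];
        for (int i = 0; i < vocabulary.length; i++) {
            vocabulary[i] = "word" + i; // made-up tokens standing in for real words
        }
        for (int numFeatures : new int[] {20, 200, 2000}) {
            System.out.println(numFeatures + " features -> "
                + occupiedSlots(vocabulary, numFeatures) + " occupied slots for "
                + vocabulary.length + " distinct tokens");
        }
    }
}
```

The number of occupied slots can never exceed the vector size, so at 20 features at least 980 of the 1000 tokens must share a slot with another token.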

On Thu, Dec 22, 2011 at 1:46 AM, Sreejith S <sr...@gmail.com> wrote:
> On Thu, Dec 22, 2011 at 12:04 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> The Bayes in the examples doesn't work very well in the 20 newsgroups
>> example. Something is wrong  in the data ETL, the tuning options, or
>> the Bayes implementation.
>>
>> On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > 97% is not correct.  This sounds like you ran it on the training data.
>>
>
> @Ted, yes, I ran it on the same training data.
>
>
>> >
>> > 63% also sounds low.  I don't know what happened there.
>>
>
> Has anyone tested the same 20newsgroups data with SGD and got better results?
>
>> >
>> > On Wed, Dec 21, 2011 at 9:26 PM, Sreejith S <sr...@gmail.com>
>> wrote:
>> >
>> >> Hi all,
>> >>
>> >> I made a comparison between SGD and Bayes classifiers over 20news-bydate
>> >> dataset.
>> >> http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
>> >>
>> >> The classifier results and confusion matrix seems a bit confused, since
>> it
>> >> is said that SGD is better for small datasets and Bayes for large
>> datasets.
>> >> Pls check my test scenario http://pastebin.com/K0cy0ayk
>> >>
>> >> It seems that even in small dataset like 20news-bydate Bayes gives 97 %
>> >> accuracy and SGD gives 63 % :(
>> >> Am i missing something?? Pls clarify.
>> >>
>> >> Thank You,
>> >> --
>> >>
>> >>
>> >> *Sreejith.S*
>> >> http://srijiths.wordpress.com/
>> >> * *http://sreejiths.emurse.com/
>> >>
>> >> tweet2sree@twitter <http://tweet2Sree>
>> >>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
>
>
> --
>
>
> *Sreejith.S*
> http://srijiths.wordpress.com/
> * *http://sreejiths.emurse.com/
>
> tweet2sree@twitter <http://tweet2Sree>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com

Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Sreejith S <sr...@gmail.com>.
On Thu, Dec 22, 2011 at 12:04 PM, Lance Norskog <go...@gmail.com> wrote:

> The Bayes in the examples doesn't work very well in the 20 newsgroups
> example. Something is wrong  in the data ETL, the tuning options, or
> the Bayes implementation.
>
> On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > 97% is not correct.  This sounds like you ran it on the training data.
>

@Ted, yes, I ran it on the same training data.


> >
> > 63% also sounds low.  I don't know what happened there.
>

Has anyone tested the same 20newsgroups data with SGD and got better results?

> >
> > On Wed, Dec 21, 2011 at 9:26 PM, Sreejith S <sr...@gmail.com>
> wrote:
> >
> >> Hi all,
> >>
> >> I made a comparison between SGD and Bayes classifiers over 20news-bydate
> >> dataset.
> >> http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
> >>
> >> The classifier results and confusion matrix seems a bit confused, since
> it
> >> is said that SGD is better for small datasets and Bayes for large
> datasets.
> >> Pls check my test scenario http://pastebin.com/K0cy0ayk
> >>
> >> It seems that even in small dataset like 20news-bydate Bayes gives 97 %
> >> accuracy and SGD gives 63 % :(
> >> Am i missing something?? Pls clarify.
> >>
> >> Thank You,
> >> --
> >>
> >>
> >> *Sreejith.S*
> >> http://srijiths.wordpress.com/
> >> * *http://sreejiths.emurse.com/
> >>
> >> tweet2sree@twitter <http://tweet2Sree>
> >>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>



-- 


*Sreejith.S*
http://srijiths.wordpress.com/
* *http://sreejiths.emurse.com/

tweet2sree@twitter <http://tweet2Sree>

Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Lance Norskog <go...@gmail.com>.
The Bayes in the examples doesn't work very well in the 20 newsgroups
example. Something is wrong in the data ETL, the tuning options, or
the Bayes implementation.

On Wed, Dec 21, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com> wrote:
> 97% is not correct.  This sounds like you ran it on the training data.
>
> 63% also sounds low.  I don't know what happened there.
>
> On Wed, Dec 21, 2011 at 9:26 PM, Sreejith S <sr...@gmail.com> wrote:
>
>> Hi all,
>>
>> I made a comparison between SGD and Bayes classifiers over 20news-bydate
>> dataset.
>> http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
>>
>> The classifier results and confusion matrix seems a bit confused, since it
>> is said that SGD is better for small datasets and Bayes for large datasets.
>> Pls check my test scenario http://pastebin.com/K0cy0ayk
>>
>> It seems that even in small dataset like 20news-bydate Bayes gives 97 %
>> accuracy and SGD gives 63 % :(
>> Am i missing something?? Pls clarify.
>>
>> Thank You,
>> --
>>
>>
>> *Sreejith.S*
>> http://srijiths.wordpress.com/
>> * *http://sreejiths.emurse.com/
>>
>> tweet2sree@twitter <http://tweet2Sree>
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Mahout SGD / Bayes prediction results over 20newsgroups

Posted by Ted Dunning <te...@gmail.com>.
97% is not correct.  This sounds like you ran it on the training data.

63% also sounds low.  I don't know what happened there.
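
To illustrate why accuracy measured on the training data is misleading, here is a minimal sketch with made-up toy data. It is not either Mahout classifier: the "model" simply memorizes each training document's label, and the names HeldOutDemo, train and accuracy are hypothetical. A model that memorizes scores perfectly on the documents it was trained on, and only a held-out set shows how it handles unseen text.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a memorizing "classifier" used to contrast training vs. held-out accuracy.
final class HeldOutDemo {

    /** Memorizes the label of every training document ({text, label} pairs). */
    static Map<String, String> train(String[][] labeled) {
        Map<String, String> memory = new HashMap<>();
        for (String[] example : labeled) {
            memory.put(example[0], example[1]);
        }
        return memory;
    }

    /** Fraction of examples whose predicted label equals the true label. */
    static double accuracy(Map<String, String> model, String[][] labeled, String fallback) {
        int correct = 0;
        for (String[] example : labeled) {
            // Unseen documents get the fallback label.
            String predicted = model.getOrDefault(example[0], fallback);
            if (predicted.equals(example[1])) {
                correct++;
            }
        }
        return (double) correct / labeled.length;
    }

    public static void main(String[] args) {
        // Made-up toy documents; the labels are borrowed from the newsgroup names.
        String[][] trainSet = {
            {"gpu driver crash", "comp.graphics"},
            {"playoff goal tonight", "rec.sport.hockey"},
            {"orbit launch window", "sci.space"},
        };
        String[][] testSet = {
            {"new gpu shader", "comp.graphics"},
            {"overtime goal", "rec.sport.hockey"},
            {"mars orbit", "sci.space"},
        };
        Map<String, String> model = train(trainSet);
        System.out.println("accuracy on training data: " + accuracy(model, trainSet, "sci.space"));
        System.out.println("accuracy on held-out data: " + accuracy(model, testSet, "sci.space"));
    }
}
```

The 20news-bydate archive is already split by date into separate training and test directories, so training on the first and evaluating on the second gives the held-out number.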

On Wed, Dec 21, 2011 at 9:26 PM, Sreejith S <sr...@gmail.com> wrote:

> Hi all,
>
> I made a comparison between SGD and Bayes classifiers over 20news-bydate
> dataset.
> http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
>
> The classifier results and confusion matrix seems a bit confused, since it
> is said that SGD is better for small datasets and Bayes for large datasets.
> Pls check my test scenario http://pastebin.com/K0cy0ayk
>
> It seems that even in small dataset like 20news-bydate Bayes gives 97 %
> accuracy and SGD gives 63 % :(
> Am i missing something?? Pls clarify.
>
> Thank You,
> --
>
>
> *Sreejith.S*
> http://srijiths.wordpress.com/
> * *http://sreejiths.emurse.com/
>
> tweet2sree@twitter <http://tweet2Sree>
>