Posted to user@mahout.apache.org by Sam Cunningham <sa...@yahoo.com> on 2011/12/09 20:57:03 UTC

PLEASE HELP! - MAHOUT CLASSIFICATION

I really need help. I am working on a project: I have a cron job that collects
RSS feeds from news sites (Reuters and Associated Press). I need to classify
these news items based on their content (just like the 20news example). The
categories are business, entertainment, health, politics, scitech, and sports. I
use half of the data for training and the other half for testing. Attached,
please find the training, testing, and model files in compressed form. As you
will see, when I test the model I get extremely good results for some topics
(business, sports, and entertainment) and really bad results (almost 0%) for
the others (health, scitech, and politics). What's wrong?
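
For reference, a half-and-half split like the one described above can be done per category, so that every topic is represented in both halves. This is only a minimal sketch: the (label, text) pair layout and the function name split_half_per_category are assumptions made for illustration, not the format of the attached files.

```python
import random
from collections import defaultdict

def split_half_per_category(docs, seed=0):
    """Split (label, text) pairs into train/test halves within each category."""
    by_label = defaultdict(list)
    for label, text in docs:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label, texts in sorted(by_label.items()):
        rng.shuffle(texts)
        cut = len(texts) // 2  # odd-sized categories put the extra document in test
        train += [(label, t) for t in texts[:cut]]
        test += [(label, t) for t in texts[cut:]]
    return train, test

# Made-up toy documents, not the real feed data.
docs = [("health", f"h{i}") for i in range(4)] + [("sports", f"s{i}") for i in range(6)]
train, test = split_half_per_category(docs)
print(len(train), len(test))  # 5 5
```

Splitting within each category avoids the case where a random split of the whole collection leaves a small topic with almost no training documents.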

What is more interesting is that I get really bad results for the "health" topic
even when I test the classifier against the training data, the very dataset used
to create the model. This is strange.
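
One way to narrow a problem like this down is to look at per-category accuracy and at which category the misclassified documents are assigned to, on the training data as well as the test data. The sketch below assumes the classifier output has already been reduced to (actual, predicted) label pairs; the function names and the toy pairs are hypothetical and are not results from this dataset.

```python
from collections import Counter

def per_category_accuracy(pairs):
    """Return {label: fraction correct} for (actual, predicted) label pairs."""
    total = Counter()
    correct = Counter()
    for actual, predicted in pairs:
        total[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    return {label: correct[label] / total[label] for label in total}

def most_common_mistake(pairs, label):
    """Return the category that documents of `label` are most often mistaken for."""
    wrong = Counter(p for a, p in pairs if a == label and p != label)
    return wrong.most_common(1)[0][0] if wrong else None

# Made-up toy pairs for illustration only.
pairs = [
    ("health", "business"),
    ("health", "business"),
    ("health", "health"),
    ("sports", "sports"),
]
accuracy = per_category_accuracy(pairs)
print(accuracy["sports"])                    # 1.0
print(most_common_mistake(pairs, "health"))  # business
```

If nearly all of a weak category's documents land in one strong category, that points at the model or the label handling rather than at noisy text.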

Please help.

Thank you,

Sam




Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Sam Cunningham <sa...@yahoo.com>.
Dmitriy Lyubimov <dlieu.7 <at> gmail.com> writes:

> 
> Sam, the list doesn't let attachments through.
> 

Hi Dmitriy,

Here is the link to the attachments along with the same message content. Please
let me know if you can't get the attachments. Thank you for your help,

http://lucene.472066.n3.nabble.com/PLEASE-HELP-MAHOUT-CLASSIFICATION-td3573905.html

Sam



Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Sam, the list doesn't let attachments through.


Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Suneel Marthi <su...@yahoo.com>.
Sorry, I stand corrected on the SGD command line tools; please look at TrainNewsGroups, as Ted suggests.




Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Suneel Marthi <su...@yahoo.com>.
Sam,

Per Ted's email below, please run against trunk for your work. Please look at Chapters 13-16 of the Mahout in Action book for sample code for classifying 20 newsgroups with SGD. There is presently no command line option that I am aware of (I could be wrong) for running the 20 newsgroups example with SGD.

The only command line tools for SGD, trainlogistic and runlogistic, expect input files in CSV format, which is not what you have.
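
If CSV input is the sticking point, labeled documents can be flattened into a CSV file with a short script. This is a rough sketch under stated assumptions: the column names "label" and "text" and the (label, text) pair layout are hypothetical, and whether trainlogistic handles a free-text column well is not confirmed here, so check the tool's own help output before relying on it.

```python
import csv
import io

def write_labeled_csv(docs, out):
    """Write (label, text) pairs as CSV rows under a header row."""
    writer = csv.writer(out)
    writer.writerow(["label", "text"])  # hypothetical column names
    for label, text in docs:
        # Collapse newlines and repeated spaces so each document stays on one row.
        writer.writerow([label, " ".join(text.split())])

# Made-up toy documents; in practice `out` would be an open file.
buf = io.StringIO()
write_labeled_csv(
    [("health", "Flu season\nstarts early"), ("sports", 'Team wins "big" game')],
    buf,
)
print(buf.getvalue())
```

The csv module takes care of quoting commas and double quotes inside article text, which naive string joining would get wrong.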

I have a sample program for classifying datasets (similar in format to yours) using SGD, which I can share with you later today.


Regards,
Suneel




Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Ted Dunning <te...@gmail.com>.
a) run with trunk

b) see https://github.com/tdunning/Chapter-16

c) also see org.apache.mahout.classifier.sgd.TrainNewsGroups

Your training data is tiny.  The bayes classifiers are designed for large
data.  Poor results are not very surprising at this data size.
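
A quick check of how many training documents each category has makes the data-size point concrete. The sketch below is illustrative only: it assumes the training labels are available as a plain list, and category_sizes is a hypothetical helper name.

```python
from collections import Counter

def category_sizes(labels):
    """Return {label: (count, count relative to the smallest category)}."""
    counts = Counter(labels)
    smallest = min(counts.values())
    return {label: (n, n / smallest) for label, n in sorted(counts.items())}

# Made-up toy labels for illustration only.
labels = ["business"] * 5 + ["health"] * 2
sizes = category_sizes(labels)
print(sizes)  # {'business': (5, 2.5), 'health': (2, 1.0)}
```

Categories with only a handful of documents, or far fewer than the others, are the first place to look when some topics score near zero.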

On Fri, Dec 9, 2011 at 8:03 PM, Sam Cunningham <sa...@yahoo.com> wrote:

> I am running Mahout distribution v0.5. Though, I am not sure what difference
> would that make? I ran my dataset with bayes/cbayes only. I don't have any
> sample code for SGD or its command option. Is there any SGD example for 20news
> dataset so that I can follow (for training and testing)?
>

Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Sam Cunningham <sa...@yahoo.com>.
Suneel Marthi <suneel_marthi <at> yahoo.com> writes:

> Hi Sam,
>
> I am assuming that you are running the latest code from the Mahout 0.6 trunk.
>
> Did you try running your dataset through SGD classifier for both training and
> testing?
>
> Suneel

Suneel,

I am running the Mahout v0.5 distribution, though I am not sure what difference
that would make. I ran my dataset with bayes/cbayes only. I don't have any
sample code for SGD, nor do I know its command options. Is there an SGD example
for the 20news dataset that I can follow (for training and testing)?

Sam


Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Suneel Marthi <su...@yahoo.com>.
Hi Sam,

I am assuming that you are running the latest code from the Mahout 0.6 trunk. 


Did you try running your dataset through SGD classifier for both training and testing?


Suneel



Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Sam Cunningham <sa...@yahoo.com>.
Suneel Marthi <suneel_marthi <at> yahoo.com> writes:

> 
> Which classifier are you running?
> 


Hi Suneel,

I am running cbayes. Here are the command options for the trainer:

$MAHOUT_HOME/bin/mahout trainclassifier -i /user/sayhan/articles-train \
  -o /user/sayhan/articles-model -type cbayes -ng 1 -source hdfs

I am running cbayes for testing the classifier as well.

Sam


Re: PLEASE HELP! - MAHOUT CLASSIFICATION

Posted by Suneel Marthi <su...@yahoo.com>.
Which classifier are you running?


