Posted to general@lucene.apache.org by Sam Cunningham <sa...@yahoo.com> on 2011/11/02 18:33:31 UTC

Mahout In Action - Bayes/CBayes Classification returns NaN

My objective is to classify news documents into these classes: Sports,
Entertainment, Politics, Business, etc. Here are the steps I took:

- Used the prepare20newsgroups command (page 277, Mahout In Action) to
prepare the training data set (one long document, ~5MB, per class).
- Moved the training dataset to HDFS, ran the trainclassifier command
(page 278), and created the model.
- Moved the model from HDFS to local FS and ran Classify.java (at
http://search-lucene.com/c/Mahout:/core/src/main/java/org/apache/mahout/classifier/Classify.java%7C%7Clucene)
on a sample document
- The result is NaN for all classes. It apparently can't assign any class
to this document, and it ends up labeling it with the default category:
unknown.

I know the program works with the 20news dataset. I also know I am training
correctly and my dataset is pretty realistic. What might be the reason that
it cannot classify? I tried a few other documents and the result is the
same: NaN.

One note: when I run the prepare20newsgroups command on my training
documents, it writes a single target variable followed by a single, very
long line of text for each class (Sports - tab - one long document). Could
this be the reason? I ask because the 20news dataset repeats each target
variable across a number of documents.
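
To make the difference in data shape concrete, here is a minimal Python sketch (not part of Mahout) that counts how many training lines each label has in a "label - tab - text" file, and shows one way to break a long per-class document into several shorter examples. The function names, the sample data, and the words_per_example parameter are hypothetical and used only for illustration. Whether the one-line-per-class layout is actually what causes the NaN scores is not established in this thread.

```python
from collections import Counter


def count_examples_per_label(lines):
    """Count training lines per label in 'label<TAB>text' data."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, sep, _text = line.partition("\t")
        if not sep:
            raise ValueError("line has no tab separator: %r" % line[:40])
        counts[label] += 1
    return counts


def split_into_examples(label, text, words_per_example=200):
    """Split one long document into several 'label<TAB>text' lines.

    Splitting on a fixed word count is only an illustration; splitting on
    real article boundaries would be preferable when they are available.
    """
    words = text.split()
    return [
        label + "\t" + " ".join(words[i:i + words_per_example])
        for i in range(0, len(words), words_per_example)
    ]


# Hypothetical data shaped like the file described above: one line per class.
sample = [
    "Sports\t" + "goal match team " * 100,
    "Politics\t" + "vote senate bill " * 100,
]
counts = count_examples_per_label(sample)
print(counts)

# Re-shape the single Sports line into several shorter training examples.
chunks = split_into_examples("Sports", "goal match team " * 100,
                             words_per_example=50)
print(count_examples_per_label(chunks))
```

Running the count on the real training file would show whether each class really has just one line, which is the first thing to confirm before comparing against the 20news layout.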

Please help. Thanks, 

--
View this message in context: http://lucene.472066.n3.nabble.com/Mahout-In-Action-Bayes-CBayes-Classification-returns-NaN-tp3474535p3474535.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Fwd: Mahout In Action - Bayes/CBayes Classification returns NaN

Posted by Ted Dunning <te...@gmail.com>.
Forwarded to mahout list instead of lucene.  Let's move the discussion
there.

