Posted to user@mahout.apache.org by Tyson Hamilton <ty...@gmail.com> on 2012/04/19 05:03:05 UTC

Always classified in final category

Hello Users,

I am attempting to write a very simple classification algorithm based on
the 20NewsGroup example. Going for barebones, I took out all the
overallcount and statistics-generating pieces. When I introduce simple
test data, only ten examples or so, the classification algorithm fails to
classify even the test data itself properly. In fact, everything goes into
the final category (there are only two categories in my tests).

To rule out the small sample size I used to train/test the algorithm as
the cause, I extended it to support the full 20NewsGroup dataset, with
very similar results: 95%+ of the classification results fell in the
final category. Here are my questions:

1. Can I use the training data in a classification test and expect a near
perfect classification success rate?
2. Can this training data be a small sample (<20 examples)?
3. Is there something obvious I may be doing wrong that is causing
nearly all test data to be classified in the last category?
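
For reference, the sanity check behind question 1 can be sketched roughly
as follows. This is only an illustration in Python, not the Mahout code in
question: the function name summarize_predictions, the category names, and
the toy labels are all hypothetical. It compares predictions against the
known labels and tallies how many predictions land in each category, which
makes a collapse into a single category easy to spot.

```python
from collections import Counter


def summarize_predictions(true_labels, predicted_labels):
    """Return (accuracy, per-category prediction counts).

    Hypothetical helper: both arguments are equal-length lists of
    category names, one entry per classified document.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one example")
    # Count the documents whose predicted category matches the known one.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    # Tally how often each category was predicted.
    counts = Counter(predicted_labels)
    return correct / len(true_labels), counts


# Made-up toy data: two categories, every prediction lands in the last one.
true_labels = ["cat_a"] * 5 + ["cat_b"] * 5
predicted_labels = ["cat_b"] * 10

accuracy, counts = summarize_predictions(true_labels, predicted_labels)
print("accuracy:", accuracy)
print("predictions per category:", dict(counts))
```

With balanced categories, an accuracy near 1/(number of categories)
combined with a tally concentrated in one category matches the symptom
described above.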

Thank you!
-- 
Tyson