Posted to dev@mahout.apache.org by "Drew Farris (JIRA)" <ji...@apache.org> on 2010/07/19 03:15:49 UTC

[jira] Created: (MAHOUT-442) Simple feature reduction options for Bayes classification

Simple feature reduction options for Bayes classification
---------------------------------------------------------

                 Key: MAHOUT-442
                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.3
            Reporter: Drew Farris
            Assignee: Drew Farris


Adding options to the Bayes TrainClassifier driver to filter features using a minimum document frequency (df) or term frequency (tf). Features that appear in only a handful of documents, or fewer than X times within the entire input set, will be removed from the training feature set entirely. This will allow the Bayes classifiers to scale to larger corpora.
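
As a rough illustration of the kind of pruning described above (this is not the code in the patch), here is a small Python sketch that counts document frequency and total occurrences per feature and drops anything below either threshold. The names `prune_features`, `min_df` and `min_support` are made up for the example, and it assumes that "support" means the total number of occurrences across the whole input set.

```python
from collections import Counter

def prune_features(docs, min_df=2, min_support=2):
    """Drop rare features. `docs` is a list of token lists (hypothetical input format)."""
    df = Counter()       # number of documents that contain each feature
    support = Counter()  # total occurrences of each feature in the corpus
    for tokens in docs:
        support.update(tokens)
        df.update(set(tokens))
    # Keep a feature only if it meets both thresholds.
    keep = {f for f in support if df[f] >= min_df and support[f] >= min_support}
    pruned = [[t for t in tokens if t in keep] for tokens in docs]
    return pruned, keep

docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "hat"],
    ["dog", "sat", "dog"],
]
pruned, kept = prune_features(docs)
print(sorted(kept))
```

In this toy input "dog" occurs twice but in only one document, so it fails the document-frequency threshold even though it passes the support threshold.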

More background: 

When running the Wikipedia example, I discovered that the number of features produced with -ng 1 was strikingly large: 9,500,000 using the default settings after running the following commands:

{code}
./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d wikipedia/enwiki-20100622-pages-articles.xml.bz2 -o wikipedia/chunks -c 64
./bin/mahout org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i wikipedia/chunks -o wikipedia/bayes-input -c examples/src/test/resources/country.txt
./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier -i wikipedia/bayes-input -o wikipedia/bayes-model -type cbayes -ng 1  -source hdfs
{code}

This of course makes testing the classifier tricky on machines of modest means, because TestClassifier attempts to load all features into memory on the machines where the mapper is running.

It appears that Grant ran into a similar issue last year: 
http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c

This patch will add --minDf and --minSupport options to TrainClassifier. It also adds --skipCleanup to prevent the deletion of the output of the BayesFeatureDriver, which is useful for inspecting the resulting feature set in order to tune the rules for feature production.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Drew Farris <dr...@gmail.com>.
On Thu, Jul 22, 2010 at 3:09 PM, Ted Dunning <te...@gmail.com> wrote:
> This is pretty massively over-trained.  I wouldn't draw any conclusions from
> this unless it is accuracy for held out data.
>

Ahh, yes, of course. Forgot that the 20news example doesn't hold out
anything. I'll have some more reasonable numbers soon.
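
For reference, a minimal Python sketch of one way to hold out a fixed number of documents per label before training. This is only an illustration under assumed names (`hold_out_per_label`, `n_test`, `seed`) and an assumed input of `(label, doc)` pairs; it is not how the 20news example or the patch does it.

```python
import random
from collections import defaultdict

def hold_out_per_label(labeled_docs, n_test, seed=0):
    """Split (label, doc) pairs so that each label contributes n_test test docs."""
    by_label = defaultdict(list)
    for label, doc in labeled_docs:
        by_label[label].append(doc)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label, docs in sorted(by_label.items()):
        rng.shuffle(docs)
        test += [(label, d) for d in docs[:n_test]]
        train += [(label, d) for d in docs[n_test:]]
    return train, test

# Toy data: two labels with five documents each.
data = [(label, f"{label}-{i}") for label in ("sci.space", "rec.autos") for i in range(5)]
train, test = hold_out_per_label(data, n_test=2)
print(len(train), len(test))
```

The classifier would then be trained on `train` only and scored on `test`, so the reported accuracy is for documents it has not seen.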

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Ted Dunning <te...@gmail.com>.
This is pretty massively over-trained.  I wouldn't draw any conclusions from
this unless it is accuracy for held out data.

Not trimming is definitely going to help test scores on the original
training data.  It may well help on held-out data.

On Thu, Jul 22, 2010 at 11:59 AM, Drew Farris (JIRA) <ji...@apache.org>wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Drew Farris updated MAHOUT-442:
> -------------------------------
>
>    Attachment: MAHOUT-442-20news-comparison.txt
>
> Here are the confusion matrices for an untrimmed run against 20-news and a run
> against 20-news with --minDf=2 and --minSupport=2.
>
> The trimmed version did not do as well as the untrimmed in this case:
>
> Untrimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :      18305       97.2222%
> Incorrectly Classified Instances        :        523        2.7778%
> Total Classified Instances              :      18828
>
> Trimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :      18085       96.0537%
> Incorrectly Classified Instances        :        743        3.9463%
> Total Classified Instances              :      18828

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Drew Farris <dr...@gmail.com>.
Does this look good to commit as-is in that case?

On Wed, Jul 28, 2010 at 12:20 PM, Robin Anil <ro...@gmail.com> wrote:
> That's too small a text to apply pruning. You should run it without pruning. It's
> good that it croaked when the code changed. It's a sanity check to see if
> things are running all right :)
>
>
> On Tue, Jul 27, 2010 at 9:42 PM, Drew Farris (JIRA) <ji...@apache.org> wrote:
>
>>
>>     [
>> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>
>> Drew Farris updated MAHOUT-442:
>> -------------------------------
>>
>>     Attachment: MAHOUT-442.patch
>>
>> The latest patch cleans up a couple of issues. Not too sure what to do about the
>> BayesClassifierSelfTest: when run with minDf and minSupport set to 2 it
>> produces some pretty nasty results, which is not necessarily a surprise,
>> but lessens the utility of changing the test to apply these parameters in
>> the first place.

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Robin Anil <ro...@gmail.com>.
That's too small a text to apply pruning. You should run it without pruning. It's
good that it croaked when the code changed. It's a sanity check to see if
things are running all right :)


On Tue, Jul 27, 2010 at 9:42 PM, Drew Farris (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Drew Farris updated MAHOUT-442:
> -------------------------------
>
>     Attachment: MAHOUT-442.patch
>
> The latest patch cleans up a couple of issues. Not too sure what to do about the
> BayesClassifierSelfTest: when run with minDf and minSupport set to 2 it
> produces some pretty nasty results, which is not necessarily a surprise,
> but lessens the utility of changing the test to apply these parameters in
> the first place.

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Robin Anil <ro...@gmail.com>.
Yeah, the result looks like the paper's. But unlike the paper, we have bigram
support, so the results might look a wee bit better with it, and even more so
with pruning.

Robin

On Thu, Jul 22, 2010 at 9:31 PM, Ted Dunning <te...@gmail.com> wrote:

> This looks much more in line with the figures in Rennie's paper (86% best
> score, if I remember) and the numbers that I get for the SGD system running
> on the bytime version of the 20 newsgroups (about 83-85%).  The bytime
> version of the corpus has test documents that were segregated by time which
> mirrors normal operations a little bit better than random selection.  It
> also has a few duplicate documents removed.
>
> On Thu, Jul 22, 2010 at 8:32 PM, Drew Farris (JIRA) <ji...@apache.org>
> wrote:
>
> >
> >     [
> >
> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
> >
> > Drew Farris updated MAHOUT-442:
> > -------------------------------
> >
> >    Attachment: MAHOUT-442-20news-comparison.txt
> >
> > Held back 100 documents from each newsgroup -- the results look a bit
> > better.
> >
> > Untrimmed:
> >
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :       1698          84.9%
> > Incorrectly Classified Instances        :        302          15.1%
> > Total Classified Instances              :       2000
> >
> > =======================================================
> >
> >
> > Trimmed:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :       1705         85.25%
> > Incorrectly Classified Instances        :        295         14.75%
> > Total Classified Instances              :       2000
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> >
> > > Simple feature reduction options for Bayes classifiers
> > > ------------------------------------------------------
> > >
> > >                 Key: MAHOUT-442
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
> > >             Project: Mahout
> > >          Issue Type: Improvement
> > >          Components: Classification
> > >    Affects Versions: 0.3
> > >            Reporter: Drew Farris
> > >            Assignee: Drew Farris
> > >         Attachments: MAHOUT-442-20news-comparison.txt,
> > MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch
> > >
> > >
> > > Adding options to the Bayes TrainClassifier driver to filter features
> > using minimum df or tf. Features that only appear in a handful of
> documents
> > or less than X times within the entire input set will be removed from the
> > training feature set entirely. This will allow the Bayes classifiers to
> > scale to larger corpora.
> > > More background:
> > > When running the wikipedia example, I discovered that the number of
> > features produced with -ng 1 was pretty outstanding: 9,500,000 using the
> > default settings after running the following commands:
> > > {code}
> > > ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d
> > wikipedia/enwiki-20100622-pages-articles.xml.bz2 -owikipedia/chunks -c 64
> > > ./bin/mahout
> > org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
> > wikipedia/chunks -o wikipedia/bayes-input -c
> > examples/src/test/resources/country.txt
> > > ./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier -i
> > wikipedia/bayes-input -o wikipedia/bayes-model -type cbayes -ng 1
>  -source
> > hdfs
> > > {code}
> > > This if course makes testing the classifier tricky on machines of
> modest
> > means because TestClassifier attempts to load all features into memory on
> > the machines the mapper is running on.
> > > It appears that Grant ran into a similar issue last year:
> > >
> >
> http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c
> > > This patch will add --minDf and --minSupport options to
> TrainClassifier.
> > Also --skipCleanup to prevent the deletion of the output of the
> > BayesFeatureDriver, which can be useful in order to allow inspection the
> > resulting feature set in order to tune rules for feature production.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Ted Dunning <te...@gmail.com>.
This looks much more in line with the figures in Rennie's paper (86% best
score, if I remember) and the numbers that I get for the SGD system running
on the bytime version of the 20 newsgroups (about 83-85%).  The bytime
version of the corpus has test documents that were segregated by time which
mirrors normal operations a little bit better than random selection.  It
also has a few duplicate documents removed.
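
As a quick sanity check on the held-out numbers quoted below, the percentages follow directly from the counts in the two summaries. A small Python sketch (the helper name `accuracy` is just for illustration):

```python
def accuracy(correct, total):
    """Return accuracy as a percentage of correctly classified instances."""
    return 100.0 * correct / total

total = 2000  # 100 held-back documents from each of 20 newsgroups
untrimmed = accuracy(1698, total)
trimmed = accuracy(1705, total)
print(f"untrimmed {untrimmed:.2f}%  trimmed {trimmed:.2f}%")
print(f"difference: {1705 - 1698} documents")
```

The gap between the two runs is 7 documents out of 2000, so it should be read as "roughly the same" rather than as a clear win for either setting.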

On Thu, Jul 22, 2010 at 8:32 PM, Drew Farris (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Drew Farris updated MAHOUT-442:
> -------------------------------
>
>    Attachment: MAHOUT-442-20news-comparison.txt
>
> Held back 100 documents from each newsgroup -- the results look a bit
> better.
>
> Untrimmed:
>
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :       1698          84.9%
> Incorrectly Classified Instances        :        302          15.1%
> Total Classified Instances              :       2000
>
> =======================================================
>
>
> Trimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :       1705         85.25%
> Incorrectly Classified Instances        :        295         14.75%
> Total Classified Instances              :       2000
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Robin Anil <ro...@gmail.com>.
Yes, makes perfect sense.
Try it with a test set != train set. The performance could improve due to lack
of overfitting. Otherwise it looks good to go.


On Jul 22, 2010 12:00 PM, "Drew Farris (JIRA)" <ji...@apache.org> wrote:
>
> [
https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Drew Farris updated MAHOUT-442:
> -------------------------------
>
> Attachment: MAHOUT-442-20news-comparison.txt
>
> Here are the confusion matrices for an untrimmed run against 20-news and a run
> against 20-news with --minDf=2 and --minSupport=2.
>
> The trimmed version did not do as well as the untrimmed in this case:
>
> Untrimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :      18305       97.2222%
> Incorrectly Classified Instances        :        523        2.7778%
> Total Classified Instances              :      18828
>
> Trimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :      18085       96.0537%
> Incorrectly Classified Instances        :        743        3.9463%
> Total Classified Instances              :      18828

[jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classification

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-442:
-------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-442:
-------------------------------

    Summary: Simple feature reduction options for Bayes classifiers  (was: Simple feature reduction options for Bayes classification)



[jira] Commented: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890979#action_12890979 ] 

Drew Farris commented on MAHOUT-442:
------------------------------------

Setting --minDf and --minSupport to 2 for all Wikipedia chunks and countries generated using the commands above causes a drop from 9.5M features to 2.7M features, and from 22.3M label/feature pairs to 15.6M label/feature pairs.
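
To put those counts in relative terms, here is a small Python sketch (plain arithmetic on the figures reported above, not part of the patch) that computes the percentage reductions:

```python
# Counts reported in the comment above, in millions.
features_before, features_after = 9.5, 2.7
pairs_before, pairs_after = 22.3, 15.6


def percent_reduction(before, after):
    """Return the relative drop from before to after, as a percentage."""
    return 100.0 * (before - after) / before


feature_drop = percent_reduction(features_before, features_after)
pair_drop = percent_reduction(pairs_before, pairs_after)

print(f"features: {feature_drop:.1f}% fewer")
print(f"label/feature pairs: {pair_drop:.1f}% fewer")
```

Roughly 6.8M features disappear while only about 6.7M label/feature pairs do, i.e. about one pair per removed feature. That is what one would expect if the pruned features are mostly terms seen once, under a single label.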


> Simple feature reduction options for Bayes classifiers
> ------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442.patch
>
>


[jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-442:
-------------------------------

           Status: Resolved  (was: Patch Available)
    Fix Version/s: 0.4
       Resolution: Fixed

> Simple feature reduction options for Bayes classifiers
> ------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>             Fix For: 0.4
>
>         Attachments: MAHOUT-442-20news-comparison.txt, MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch, MAHOUT-442.patch
>
>


[jira] Commented: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896729#action_12896729 ] 

Hudson commented on MAHOUT-442:
-------------------------------

Integrated in Mahout-Quality #175 (See [http://hudson.zones.apache.org/hudson/job/Mahout-Quality/175/])
    MAHOUT-442: Simple feature reduction options for Bayes classifiers (--minDf and --minSupport)


> Simple feature reduction options for Bayes classifiers
> ------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442-20news-comparison.txt, MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch, MAHOUT-442.patch
>
>


[jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-442:
-------------------------------

    Summary: Simple feature reduction options for Bayes classifiers  (was: Simple feature reduction options for Bayes classifiiers)

> Simple feature reduction options for Bayes classifiers
> ------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442.patch
>
>


[jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-442:
-------------------------------

    Attachment: MAHOUT-442-20news-comparison.txt

Held back 100 documents from each newsgroup for testing; on this split the trimmed run does slightly better than the untrimmed one.

Untrimmed:

=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       1698	      84.9%
Incorrectly Classified Instances        :        302	      15.1%
Total Classified Instances              :       2000

=======================================================


Trimmed:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       1705	     85.25%
Incorrectly Classified Instances        :        295	     14.75%
Total Classified Instances              :       2000

=======================================================
Confusion Matrix
-------------------------------------------------------
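
The comment does not say how the 100 documents per newsgroup were chosen. As a minimal sketch, assuming a random per-label hold-out (the function name, the seed, and the toy corpus are all hypothetical), the split could look like this in Python:

```python
import random
from collections import defaultdict


def hold_out_per_label(docs, per_label, seed=0):
    """Split (label, doc) pairs into train and test sets,
    holding back per_label documents of each label for testing."""
    by_label = defaultdict(list)
    for label, doc in docs:
        by_label[label].append(doc)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        test.extend((label, d) for d in items[:per_label])
        train.extend((label, d) for d in items[per_label:])
    return train, test


# Toy corpus: 20 labels with 150 documents each (made-up identifiers).
docs = [(f"group{g}", f"doc{g}-{i}") for g in range(20) for i in range(150)]
train, test = hold_out_per_label(docs, per_label=100)
print(len(train), len(test))  # prints "1000 2000"
```

With 20 newsgroups, holding back 100 documents each gives the 2000 classified instances shown in the summaries above.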

> Simple feature reduction options for Bayes classifiers
> ------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442-20news-comparison.txt, MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch
>
>


[jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-442:
-------------------------------

    Attachment: MAHOUT-442.patch

The latest patch cleans up a couple of issues. I'm not too sure what to do about the BayesClassifierSelfTest: when run with minDf and minSupport set to 2 it produces some pretty nasty results, which is not necessarily a surprise, but it lessens the utility of changing the test to apply these parameters in the first place.

> Simple feature reduction options for Bayes classifiers
> ------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442-20news-comparison.txt, MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch, MAHOUT-442.patch
>
>


[jira] Commented: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891268#action_12891268 ] 

Robin Anil commented on MAHOUT-442:
-----------------------------------

This is nice. Could you try the end-to-end test for the BayesClassifier with pruning on 20 newsgroups and see what the accuracy difference is?

> Simple feature reduction options for Bayes classifiers
> ------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442.patch
>
>


[jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-442:
-------------------------------

    Attachment: MAHOUT-442-20news-comparison.txt

Here are the confusion matrices for an untrimmed run against 20-news and a run against 20-news with --minDf=2 and --minSupport=2.

The trimmed version did not do as well as the untrimmed in this case:

Untrimmed:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      18305       97.2222%
Incorrectly Classified Instances        :        523        2.7778%
Total Classified Instances              :      18828

Trimmed:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :      18085       96.0537%
Incorrectly Classified Instances        :        743        3.9463%
Total Classified Instances              :      18828
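
For reference, the percentages and the gap between the two runs can be re-derived from the raw counts. This is a small Python sketch using only the numbers in the two summaries above:

```python
def accuracy(correct, total):
    """Return the share of correctly classified instances, as a percentage."""
    return 100.0 * correct / total


total = 18828
untrimmed = accuracy(18305, total)
trimmed = accuracy(18085, total)
extra_errors = 743 - 523  # additional misclassifications after trimming

print(f"untrimmed {untrimmed:.4f}%  trimmed {trimmed:.4f}%")
print(f"difference {untrimmed - trimmed:.4f} points, {extra_errors} more errors")
```

On this run, trimming costs a little over one percentage point of accuracy.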



> Simple feature reduction options for Bayes classifiers
> ------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch
>
>


[jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes clasification

Posted by "Drew Farris (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-442:
-------------------------------

    Attachment: MAHOUT-442.patch


Core changes:
   * BayesFeatureMapper now collects tf as FEATURE_TF. df is already collected as FEATURE_COUNT. 
   * Introduced BayesFeatureCombiner to do simple combination, since the reducer now performs more complex operations.
   * FeaturePartitioner ensures that all tuples for a given feature (term/ngram) are directed to the same reducer.
   * FeatureLabelComparator ensures that FEATURE_TF and FEATURE_COUNT arrive at the reducer prior to any other tuples, and that all tuples for a given feature are processed consecutively. 
   * BayesFeatureReducer now filters all tuples based on the TF and DF thresholds configured using --minSupport and --minDf, which are passed in as part of the BayesParameters object.
   * Deprecated the BayesParameters(ngramSize) constructor in favor of the setNgramSize, setMinDf, and setMinSupport methods.
   * Included unit test for BayesFeature mapreduce process.
   * All other unit tests pass.

Other changes:
   * A couple of fixes for cases where the BayesParameters weren't printing properly
   * Plumbing for the new command-line options.
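
The core changes above can be illustrated with a small in-memory simulation. This is a plain Python sketch, not the patch's Java code: the tuple layout, the WEIGHT stand-in, the function names, and the "keep if at or above the threshold" rule are assumptions made for illustration. A single global sort stands in for the work of FeaturePartitioner and FeatureLabelComparator.

```python
from collections import Counter
from itertools import groupby

FEATURE_TF, FEATURE_COUNT, WEIGHT = "FEATURE_TF", "FEATURE_COUNT", "WEIGHT"


def map_document(label, tokens):
    """Emit (key, value) pairs for one document; keys are (kind, feature, label)."""
    for feature, tf in Counter(tokens).items():
        yield (FEATURE_TF, feature, ""), tf     # occurrences within this document
        yield (FEATURE_COUNT, feature, ""), 1   # one more document contains it (df)
        yield (WEIGHT, feature, label), tf      # stand-in for the per-label tuples


def sort_key(key):
    """Order by feature, with the TF and DF tuples ahead of all others."""
    kind, feature, label = key
    is_other = 0 if kind in (FEATURE_TF, FEATURE_COUNT) else 1
    return (feature, is_other, kind, label)


def reduce_features(pairs, min_df=1, min_support=1):
    """Sum values per key and drop every tuple of a feature whose total
    tf is below min_support or whose df is below min_df."""
    kept = {}
    current, tf, df = None, 0, 0
    ordered = sorted(pairs, key=lambda kv: sort_key(kv[0]))
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        total = sum(value for _, value in group)
        kind, feature, label = key
        if feature != current:                  # first tuple of a new feature
            current, tf, df = feature, 0, 0
        if kind == FEATURE_TF:
            tf = total
        elif kind == FEATURE_COUNT:
            df = total
        elif tf >= min_support and df >= min_df:
            # Both totals are already known here because of the sort order.
            kept[(feature, label)] = total
    return kept


docs = [
    ("sports", "ball game ball".split()),
    ("sports", "game score".split()),
    ("politics", "vote game".split()),
    ("politics", "vote ballot".split()),
]
pairs = [kv for label, tokens in docs for kv in map_document(label, tokens)]
kept = reduce_features(pairs, min_df=2, min_support=2)
print(sorted(kept.items()))
```

In this toy input "ball" occurs twice but in only one document, so it passes the minSupport check and is still dropped by minDf; only "game" and "vote" survive. The ordering matters because a streaming reducer cannot look ahead: it has to know a feature's tf and df before it decides whether to emit that feature's remaining tuples.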







> Simple feature reduction options for Bayes clasification 
> ---------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442.patch
>
>