Posted to dev@mahout.apache.org by "Drew Farris (JIRA)" <ji...@apache.org> on 2010/07/19 03:15:49 UTC

[jira] Created: (MAHOUT-442) Simple feature reduction options for Bayes classification

Simple feature reduction options for Bayes classification
---------------------------------------------------------

                 Key: MAHOUT-442
                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.3
            Reporter: Drew Farris
            Assignee: Drew Farris


This issue adds options to the Bayes TrainClassifier driver to filter features by minimum document frequency (df) or term frequency (tf). Features that appear in only a handful of documents, or fewer than X times within the entire input set, will be removed from the training feature set entirely. This will allow the Bayes classifiers to scale to larger corpora.
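
As a rough illustration of the filtering described above (a minimal Python sketch, not the code in the patch), the idea is to count, for each feature, how many documents it appears in and how often it occurs overall, then drop anything below either threshold. The names `prune_features`, `min_df`, and `min_support` are hypothetical, and mapping them onto the proposed options is an assumption.

```python
from collections import Counter


def prune_features(docs, min_df=1, min_support=1):
    """Drop tokens that occur in fewer than min_df documents or fewer
    than min_support times overall. `docs` is a list of token lists.
    (Hypothetical helper; not Mahout code.)"""
    df = Counter()       # number of documents containing each token
    support = Counter()  # total occurrences of each token in the corpus
    for tokens in docs:
        support.update(tokens)
        df.update(set(tokens))
    keep = {t for t in support
            if df[t] >= min_df and support[t] >= min_support}
    pruned = [[t for t in tokens if t in keep] for tokens in docs]
    return pruned, keep


docs = [
    ["paris", "france", "paris"],
    ["france", "wine"],
    ["berlin", "berlin"],
]
pruned, vocab = prune_features(docs, min_df=2, min_support=2)
print(sorted(vocab))
```

With both thresholds at 2, only `france` survives in this toy corpus, since every other token appears in a single document.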

More background: 

When running the Wikipedia example, I discovered that the number of features produced with -ng 1 was pretty astounding: 9,500,000 using the default settings, after running the following commands:

{code}
./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d wikipedia/enwiki-20100622-pages-articles.xml.bz2 -o wikipedia/chunks -c 64
./bin/mahout org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i wikipedia/chunks -o wikipedia/bayes-input -c examples/src/test/resources/country.txt
./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier -i wikipedia/bayes-input -o wikipedia/bayes-model -type cbayes -ng 1 -source hdfs
{code}

This of course makes testing the classifier tricky on machines of modest means, because TestClassifier attempts to load all features into memory on the machines the mapper is running on.

It appears that Grant ran into a similar issue last year: 
http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c

This patch will add --minDf and --minSupport options to TrainClassifier. It also adds --skipCleanup, which prevents deletion of the BayesFeatureDriver output so that the resulting feature set can be inspected when tuning the rules for feature production.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Drew Farris <dr...@gmail.com>.

On Thu, Jul 22, 2010 at 3:09 PM, Ted Dunning <te...@gmail.com> wrote:
> This is pretty massively over-trained.  I wouldn't draw any conclusions from
> this unless it is accuracy for held-out data.
>

Ahh, yes, of course. Forgot that the 20news example doesn't hold out
anything. I'll have some more reasonable numbers soon.

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Ted Dunning <te...@gmail.com>.

This is pretty massively over-trained.  I wouldn't draw any conclusions from
this unless it is accuracy for held-out data.

Not trimming is definitely going to help test scores on the original
training data.  It may well help on held-out data.
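
For reference, holding documents out before training could look roughly like the sketch below. This is a minimal Python illustration, not part of Mahout or the 20news example; `hold_out_per_label` and the toy labels are hypothetical. It sets aside a fixed number of documents per label so the test set never overlaps the training set.

```python
import random
from collections import defaultdict


def hold_out_per_label(examples, n_held_out, seed=0):
    """Split (label, text) pairs into train and test lists, holding back
    n_held_out examples from every label. Labels with fewer examples
    than n_held_out end up entirely in the test list.
    (Hypothetical helper; not Mahout code.)"""
    by_label = defaultdict(list)
    for label, text in examples:
        by_label[label].append(text)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        texts = by_label[label][:]
        rng.shuffle(texts)
        test += [(label, t) for t in texts[:n_held_out]]
        train += [(label, t) for t in texts[n_held_out:]]
    return train, test


examples = [(label, f"{label}-doc-{i}")
            for label in ("rec.autos", "sci.space")
            for i in range(5)]
train, test = hold_out_per_label(examples, n_held_out=2)
print(len(train), len(test))
```

Accuracy measured on `test` then reflects documents the model has not seen, which is the figure worth comparing between trimmed and untrimmed runs.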

On Thu, Jul 22, 2010 at 11:59 AM, Drew Farris (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Drew Farris updated MAHOUT-442:
> -------------------------------
>
>    Attachment: MAHOUT-442-20news-comparison.txt
>
> Here are the confusion matrices for an untrimmed run against 20-news and a run
> against 20-news with --minDf=2 and --minSupport=2.
>
> The trimmed version did not do as well as the untrimmed in this case:
>
> Untrimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :      18305       97.2222%
> Incorrectly Classified Instances        :        523        2.7778%
> Total Classified Instances              :      18828
>
> Trimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :      18085       96.0537%
> Incorrectly Classified Instances        :        743        3.9463%
> Total Classified Instances              :      18828
>
>

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Drew Farris <dr...@gmail.com>.

Does this look good to commit as-is in that case?

On Wed, Jul 28, 2010 at 12:20 PM, Robin Anil <ro...@gmail.com> wrote:
> That's too small a text to apply pruning to. You should run it without pruning.
> It's good that it croaked when the code changed. It's a sanity check to see if
> things are running alright :)

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Robin Anil <ro...@gmail.com>.

That's too small a text to apply pruning to. You should run it without pruning.
It's good that it croaked when the code changed. It's a sanity check to see if
things are running alright :)


On Tue, Jul 27, 2010 at 9:42 PM, Drew Farris (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Drew Farris updated MAHOUT-442:
> -------------------------------
>
>     Attachment: MAHOUT-442.patch
>
> Latest patch cleans up a couple of issues. Not too sure what to do about the
> BayesClassifierSelfTest: when run with minDf and minSupport set to 2 it
> produces some pretty nasty results, which is not necessarily a surprise,
> but lessens the utility of changing the test to apply these parameters in
> the first place.

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Robin Anil <ro...@gmail.com>.

Yeah, the result looks like the paper's. But unlike the paper, we have bigram
support, so the results might look a wee bit better with it, and even more so
with pruning.
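
To make the bigram point concrete, here is a minimal Python sketch of n-gram feature generation. It is not Mahout's implementation: `ngram_features` is a hypothetical helper, and it assumes that an n-gram size of 2 emits both unigrams and bigrams, which the thread does not spell out.

```python
def ngram_features(tokens, max_n):
    """Return every n-gram of length 1..max_n as a space-joined string.
    (Hypothetical helper; assumes smaller n-grams are kept alongside
    the larger ones.)"""
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


tokens = ["space", "shuttle", "launch"]
unigrams = ngram_features(tokens, 1)
with_bigrams = ngram_features(tokens, 2)
print(len(unigrams), len(with_bigrams))
```

Bigrams add context such as `space shuttle`, but they also enlarge the feature set, which is where pruning rare features becomes more useful.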

Robin

On Thu, Jul 22, 2010 at 9:31 PM, Ted Dunning <te...@gmail.com> wrote:

> This looks much more in line with the figures in Rennie's paper (86% best
> score, if I remember) and the numbers that I get for the SGD system running
> on the bytime version of the 20 newsgroups (about 83-85%).  The bytime
> version of the corpus has test documents that were segregated by time which
> mirrors normal operations a little bit better than random selection.  It
> also has a few duplicate documents removed.

Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Posted by Ted Dunning <te...@gmail.com>.

This looks much more in line with the figures in Rennie's paper (86% best
score, if I remember) and the numbers that I get for the SGD system running
on the bytime version of the 20 newsgroups (about 83-85%).  The bytime
version of the corpus has test documents that were segregated by time which
mirrors normal operations a little bit better than random selection.  It
also has a few duplicate documents removed.
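
To put the two held-out figures quoted below in context, here is a minimal Python sketch that recomputes accuracy from the reported counts along with a simple binomial standard error. Treating each test document as an independent trial is an assumption, and `accuracy_with_stderr` is a hypothetical helper.

```python
import math


def accuracy_with_stderr(correct, total):
    """Return (accuracy, standard error), modelling each test document
    as an independent Bernoulli trial. (Hypothetical helper.)"""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)


# Counts taken from the held-out summaries in this thread.
results = {}
for name, correct in (("untrimmed", 1698), ("trimmed", 1705)):
    results[name] = accuracy_with_stderr(correct, 2000)
    acc, se = results[name]
    print(f"{name}: {acc:.2%} +/- {se:.2%}")
```

On 2000 test documents the standard error comes out at roughly 0.8 percentage points, so the 0.35-point gap between the two runs is small relative to the expected noise.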

On Thu, Jul 22, 2010 at 8:32 PM, Drew Farris (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Drew Farris updated MAHOUT-442:
> -------------------------------
>
>    Attachment: MAHOUT-442-20news-comparison.txt
>
> Held back 100 documents from each newsgroup -- the results look a bit
> better.
>
> Untrimmed:
>
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :       1698          84.9%
> Incorrectly Classified Instances        :        302          15.1%
> Total Classified Instances              :       2000
>
> =======================================================
>
>
> Trimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :       1705         85.25%
> Incorrectly Classified Instances        :        295         14.75%
> Total Classified Instances              :       2000
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------