Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/07/22 04:41:17 UTC

Getting Started with Classification

I have been doing some work on classification (of Wikipedia) and am
having a hard time actually running the TestClassifier.  I trained on
a couple of categories (history and science) on quite a few docs, but
now the model is so big that I can't load it, even with almost 3 GB of
memory.  I'm just wondering what people would recommend here.  One
thought is that our code is really String/Text based.  I also notice
we create the maps used to load the models with default initial sizes,
which probably means we are resizing a lot.  Should we stick with
Strings, or would it be better to have some custom Writables and keep
track of the actual terms separately (kind of like the doc clustering
does), as well as tracking the size up front so we can avoid resizing?
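To make the resizing point concrete, here is a minimal sketch (the loader
shape and the stored termCount are assumptions for illustration, not
Mahout's actual code): a map built at the default capacity rehashes itself
over and over on the way to millions of entries, while one pre-sized from
known metadata never does.

import java.util.HashMap;
import java.util.Map;

public class ModelLoadSketch {
  // Hypothetical loader; termCount would come from metadata written at
  // training time (an assumption -- the current code doesn't record it).
  static Map<String, Double> loadWeights(int termCount) {
    // new HashMap() starts at capacity 16 and doubles repeatedly while we
    // put() millions of entries; sizing past termCount / loadFactor up
    // front means no resize ever happens.
    Map<String, Double> weights =
        new HashMap<String, Double>((int) (termCount / 0.75f) + 1, 0.75f);
    // ... fill from the model files ...
    return weights;
  }
}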

Also, what is generally the size of training sets that people use for  
something like Naive Bayes (or complementary)?  Or, do I suck it up  
and just use more memory?

Thoughts?

-Grant

Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
It is unusual to have all that much training data for document classifiers,
but special cases exist where you have hundreds of thousands of training
examples.  With complementary naive Bayes, you effectively have a training
set the size of your (negative) corpus (I think).

But, again, model size for Naive Bayesian models should be proportional to
the number of terms modeled.  Even with lots of data, that shouldn't be all
*that* many.  You should also be able to trim out hapax legomena (terms that
occur only once) to moderate the size.
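For concreteness, a minimal sketch of that kind of trimming, assuming the
term counts sit in an in-memory map (Robin's minSupport suggestion later in
this thread is the same filter with a configurable threshold; none of this
is Mahout's actual code):

import java.util.Iterator;
import java.util.Map;

public class HapaxTrimSketch {
  // Drop terms seen fewer than minSupport times; minSupport = 2 removes
  // exactly the hapax legomena.
  static void prune(Map<String, Integer> termCounts, int minSupport) {
    Iterator<Map.Entry<String, Integer>> it = termCounts.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() < minSupport) {
        it.remove(); // iterator removal is safe during iteration
      }
    }
  }
}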

On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Also, what is generally the size of training sets that people use for
> something like Naive Bayes (or complementary)?  Or, do I suck it up and just
> use more memory?




-- 
Ted Dunning, CTO
DeepDyve

Re: Getting Started with Classification

Posted by Robin Anil <ro...@gmail.com>.
Yeah, I'm trying to do the same. I am removing the conflicts one by one. It's
refusing to build now; the code-style patch broke everything.

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
Sounds reasonable.  How about plain old file system?

Can you update the patch so that it applies cleanly?

On Jul 22, 2009, at 1:29 PM, Robin Anil wrote:

> Take a look at MAHOUT-124. DataStore is refactored out. If you want to,
> say, add MemCache DB as the Matrix storage backend, you can implement
> another DataStore interface and write the connectivity code, etc.
> Robin
>
> On Wed, Jul 22, 2009 at 10:51 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> OK, so how far away is the HBase approach?  Also, does it assume  
>> that HBase
>> is the only way?
>>
>> Also, no need to deprecate, I don't think.  We are < 1.0 and have  
>> made no
>> guarantees about API contracts yet, w/ the possible exception of  
>> Taste.
>>
>>
>>
>> On Jul 22, 2009, at 12:20 PM, Robin Anil wrote:
>>
>>> In the SequenceFileModelReader (Mahout 0.1 release) you can, as a hack,
>>> disable the call to loadFeatureWeights in the loadModel function. But that
>>> method is deprecated right now.  The current implementation (the HBase
>>> patch) does it correctly.
>>> Robin
>>>


Re: Getting Started with Classification

Posted by Robin Anil <ro...@gmail.com>.
Take a look at MAHOUT-124. DataStore is refactored out. If you want to, say,
add MemCache DB as the Matrix storage backend, you can implement another
DataStore interface and write the connectivity code, etc.
Robin

On Wed, Jul 22, 2009 at 10:51 PM, Grant Ingersoll <gs...@apache.org> wrote:

> OK, so how far away is the HBase approach?  Also, does it assume that HBase
> is the only way?
>
> Also, no need to deprecate, I don't think.  We are < 1.0 and have made no
> guarantees about API contracts yet, w/ the possible exception of Taste.
>
>
>
> On Jul 22, 2009, at 12:20 PM, Robin Anil wrote:
>
>> In the SequenceFileModelReader (Mahout 0.1 release) you can, as a hack,
>> disable the call to loadFeatureWeights in the loadModel function. But that
>> method is deprecated right now.  The current implementation (the HBase
>> patch) does it correctly.
>> Robin
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
OK, so how far away is the HBase approach?  Also, does it assume that  
HBase is the only way?

Also, no need to deprecate, I don't think.  We are < 1.0 and have made  
no guarantees about API contracts yet, w/ the possible exception of  
Taste.


On Jul 22, 2009, at 12:20 PM, Robin Anil wrote:

> In the SequenceFileModelReader (Mahout 0.1 release) you can, as a hack,
> disable the call to loadFeatureWeights in the loadModel function. But that
> method is deprecated right now.  The current implementation (the HBase
> patch) does it correctly.
> Robin

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Robin Anil <ro...@gmail.com>.
In the SequenceFileModelReader (Mahout 0.1 release) you can, as a hack,
disable the call to loadFeatureWeights in the loadModel function. But that
method is deprecated right now.  The current implementation (the HBase patch)
does it correctly.
Robin

Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
Also reasonable to hand-judge 20 docs from each of the four cells of the
confusion matrix.  That will give you a rough idea of what the error
processes are.
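For anyone who wants to script that, a tiny sketch of pulling such a sample,
assuming the classified doc ids have already been bucketed by confusion cell
(all of the names here are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JudgingSampleSketch {
  // cells maps a confusion cell label (e.g. "history->science") to the doc
  // ids that landed in that cell.
  static Map<String, List<String>> sample(Map<String, List<String>> cells,
                                          int perCell) {
    Map<String, List<String>> out = new HashMap<String, List<String>>();
    for (Map.Entry<String, List<String>> e : cells.entrySet()) {
      List<String> docs = new ArrayList<String>(e.getValue());
      Collections.shuffle(docs); // random pick keeps the judged set unbiased
      out.put(e.getKey(), docs.subList(0, Math.min(perCell, docs.size())));
    }
    return out;
  }
}

Calling sample(cells, 20) gives the 20-per-cell batch to hand-judge.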

On Wed, Jul 22, 2009 at 1:13 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:

> it is probably good to benchmark against standard datasets.  for text
> classification this tends to be the Reuters set:
>
> http://www.daviddlewis.com/resources/testcollections/
>
> this way you know if you are doing a good job
>
> Miles
>
> 2009/7/22 Grant Ingersoll <gs...@apache.org>
>
> > The model size is much smaller with unigrams.  :-)
> >
> > I'm not quite sure what constitutes good just yet, but, I can report the
> > following using the commands I reported earlier w/ the exception that I
> am
> > using unigrams:
> >
> > I have two categories:  History and Science
> >
> > 0. Splitter:
> > org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
> > --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir
> > /PATH/wikipedia/chunks -c 64
> >
> > Then prep:
> > org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
> > --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test
> > --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
> > (also do this for the training set)
> >
> > 1. Train set:
> > ls ../chunks
> > chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml
> >  chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml
> >  chunk-0033.xml  chunk-0037.xml
> > chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml
> >  chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml
> >  chunk-0034.xml  chunk-0038.xml
> > chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml
> >  chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml
> >  chunk-0035.xml  chunk-0039.xml
> > chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml
> >  chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml
> >  chunk-0036.xml
> >
> > 2. Test Set:
> >  ls
> > chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml
> >  chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
> > chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml
> >  chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml
> >
> > 3. Run the Trainer on the train set:
> > --input PATH/wikipedia/subjects/out --output
> PATH/wikipedia/subjects/model
> > --gramSize 1 --classifierType bayes
> >
> > 4. Run the TestClassifier.
> >
> > --model PATH/wikipedia/subjects/model --testDir
> > PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
> >
> > Output is:
> >
> > <snip>
> > 9/07/22 15:55:09 INFO bayes.TestClassifier:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :       4143       74.0615%
> > Incorrectly Classified Instances        :       1451       25.9385%
> > Total Classified Instances              :       5594
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a       b       <--Classified as
> > 3910    186      |  4096        a     = history
> > 1265    233      |  1498        b     = science
> > Default Category: unknown: 2
> > </snip>
> >
> > At least it's better than 50%, which is presumably a good thing ;-)  I
> have
> > no clue what the state of the art is these days, but it doesn't seem
> > _horrendous_ either.
> >
> > I'd love to see someone validate what I have done.  Let me know if you
> need
> > more details.  I'd also like to know how I can improve it.
> >
> > On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
> >
> >  Indeed.  I hadn't snapped to the fact you were using trigrams.
> >>
> >> 30 million features is quite plausible for that.  To effectively use
> long
> >> n-grams as features in classification of documents you really need to
> have
> >> the following:
> >>
> >> a) good statistical methods for resolving what is useful and what is
> not.
> >> Everybody here knows that my preference for a first hack is
> sparsification
> >> with log-likelihood ratios.
> >>
> >> b) some kind of smoothing using smaller n-grams
> >>
> >> c) some kind of smoothing over variants of n-grams.
> >>
> >> AFAIK, mahout doesn't have many or any of these in place.  You are
> likely
> >> to
> >> do better with unigrams as a result.
> >>
> >> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> >>
> >>  I suspect the explosion in the number of features, Ted, is due to the
> use
> >>> of n-grams producing a lot of unique terms.  I can try w/ gramSize = 1,
> >>> that
> >>> will likely reduce the feature set quite a bit.
> >>>
> >>>
> >>
> >>
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
> >>
> >
> >
> >
>
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
More info:

Classifying docs (same train/test set) as "Republicans" or "Democrats"  
yields:
      [java] Summary
      [java] -------------------------------------------------------
      [java] Correctly Classified Instances          :         56           76.7123%
      [java] Incorrectly Classified Instances        :         17           23.2877%
      [java] Total Classified Instances              :         73
      [java]
      [java] =======================================================
      [java] Confusion Matrix
      [java] -------------------------------------------------------
      [java] a           b       <--Classified as
      [java] 21          9        |  30          a     = democrats
      [java] 8           35       |  43          b     = republicans
      [java] Default Category: unknown: 2

For these, the training data was roughly equal in size (both about 1.5 MB),
and on the test set I got about 81% right for Republicans and 70% for the
Democrats (does this imply Repubs do a better job of sticking to message on
Wikipedia than Dems? :-) It would be interesting to train on a larger set).

-Grant

On Jul 22, 2009, at 9:50 PM, Robin Anil wrote:

> Did you try CBayes?  It's supposed to negate the class imbalance effect
> to some extent.
>
>
>
> On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning <te...@gmail.com> wrote:
>> Some learning algorithms deal with this better than others.  The  
>> problem is
>> particularly bad in information retrieval (negative examples  
>> include almost
>> the entire corpus, positives are a tiny fraction) and fraud (less  
>> than 1% of
>> the training data is typically fraud).
>>
>> Down-sampling the over-represented case is the simplest answer  
>> where you
>> have lots of data.  It doesn't help much to have more than 3x more data
>> for one case than for another anyway (at least in binary decisions).
>>
>> Another aspect of this is the cost of different errors.  For  
>> instance, in
>> fraud, verifying a transaction with a customer has low cost (but not
>> non-zero) while not detecting a fraud in progress can be very, very  
>> bad.
>> False negatives are thus more of a problem than false positives and  
>> the
>> models are tuned accordingly.
>>
>> On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <mi...@inf.ed.ac.uk>  
>> wrote:
>>
>>> this is the class imbalance problem  (ie you have many more  
>>> instances for
>>> one class than another one).
>>>
>>> in this case, you could ensure that the training set was balanced  
>>> (50:50);
>>> more interestingly, you can have a prior which corrects for this.   
>>> or, you
>>> could over-sample or even under-sample the training set, etc etc.
>>>
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
I did, and it performs worse, but maybe I did something wrong.

On Jul 22, 2009, at 9:50 PM, Robin Anil wrote:

> Did you try CBayes?  It's supposed to negate the class imbalance effect
> to some extent.
>
>
>
> On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning <te...@gmail.com> wrote:
>> Some learning algorithms deal with this better than others.  The  
>> problem is
>> particularly bad in information retrieval (negative examples  
>> include almost
>> the entire corpus, positives are a tiny fraction) and fraud (less  
>> than 1% of
>> the training data is typically fraud).
>>
>> Down-sampling the over-represented case is the simplest answer  
>> where you
>> have lots of data.  It doesn't help much to have more than 3x more data
>> for one case than for another anyway (at least in binary decisions).
>>
>> Another aspect of this is the cost of different errors.  For  
>> instance, in
>> fraud, verifying a transaction with a customer has low cost (but not
>> non-zero) while not detecting a fraud in progress can be very, very  
>> bad.
>> False negatives are thus more of a problem than false positives and  
>> the
>> models are tuned accordingly.
>>
>> On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <mi...@inf.ed.ac.uk>  
>> wrote:
>>
>>> this is the class imbalance problem  (ie you have many more  
>>> instances for
>>> one class than another one).
>>>
>>> in this case, you could ensure that the training set was balanced  
>>> (50:50);
>>> more interestingly, you can have a prior which corrects for this.   
>>> or, you
>>> could over-sample or even under-sample the training set, etc etc.
>>>
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Robin Anil <ro...@gmail.com>.
Did you try CBayes?  It's supposed to negate the class imbalance effect
to some extent.



On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning <te...@gmail.com> wrote:
> Some learning algorithms deal with this better than others.  The problem is
> particularly bad in information retrieval (negative examples include almost
> the entire corpus, positives are a tiny fraction) and fraud (less than 1% of
> the training data is typically fraud).
>
> Down-sampling the over-represented case is the simplest answer where you
> have lots of data.  It doesn't help much to have more than 3x more data for
> one case than for another anyway (at least in binary decisions).
>
> Another aspect of this is the cost of different errors.  For instance, in
> fraud, verifying a transaction with a customer has low cost (but not
> non-zero) while not detecting a fraud in progress can be very, very bad.
> False negatives are thus more of a problem than false positives and the
> models are tuned accordingly.
>
> On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>
>> this is the class imbalance problem  (ie you have many more instances for
>> one class than another one).
>>
>> in this case, you could ensure that the training set was balanced (50:50);
>> more interestingly, you can have a prior which corrects for this.  or, you
>> could over-sample or even under-sample the training set, etc etc.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
Some learning algorithms deal with this better than others.  The problem is
particularly bad in information retrieval (negative examples include almost
the entire corpus, positives are a tiny fraction) and fraud (less than 1% of
the training data is typically fraud).

Down-sampling the over-represented case is the simplest answer where you
have lots of data.  It doesn't help much to have more than 3x more data for
one case than for another anyway (at least in binary decisions).

Another aspect of this is the cost of different errors.  For instance, in
fraud, verifying a transaction with a customer has low cost (but not
non-zero) while not detecting a fraud in progress can be very, very bad.
False negatives are thus more of a problem than false positives and the
models are tuned accordingly.
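A rough sketch of that down-sampling step, assuming in-memory lists of
examples (the 3:1 cap mirrors the rule of thumb above; this is illustrative,
not Mahout code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DownSampleSketch {
  // Randomly shrink the majority class so it is at most ratio times the
  // size of the minority class.
  static <T> List<T> downSample(List<T> majority, int minoritySize, int ratio) {
    List<T> copy = new ArrayList<T>(majority);
    Collections.shuffle(copy); // random subset, not just the first N
    int keep = Math.min(copy.size(), minoritySize * ratio);
    return copy.subList(0, keep);
  }
}

For example, downSample(historyDocs, scienceDocs.size(), 3) before training.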

On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:

> this is the class imbalance problem  (ie you have many more instances for
> one class than another one).
>
> in this case, you could ensure that the training set was balanced (50:50);
> more interestingly, you can have a prior which corrects for this.  or, you
> could over-sample or even under-sample the training set, etc etc.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Getting Started with Classification

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
this is the class imbalance problem  (ie you have many more instances for
one class than another one).

in this case, you could ensure that the training set was balanced (50:50);
more interestingly, you can have a prior which corrects for this.  or, you
could over-sample or even under-sample the training set, etc etc.
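A minimal sketch of the prior-correction idea in a naive Bayes style scorer
(the structure is generic and illustrative, not Mahout's actual classes):

public class PriorCorrectionSketch {
  // Score = log P(class) + sum of the per-term log likelihoods.  Correcting
  // for imbalance means choosing the prior deliberately (e.g. uniform
  // 0.5/0.5 for two classes) instead of using the raw class frequencies.
  static double score(double classPrior, double[] termLogLikelihoods) {
    double s = Math.log(classPrior);
    for (double ll : termLogLikelihoods) {
      s += ll;
    }
    return s;
  }
}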

Miles

2009/7/22 Grant Ingersoll <gs...@apache.org>

> <done_basking>Grant</done_basking>
>
> Here's an interesting piece:
> 09/07/22 18:23:02 INFO bayes.TestClassifier:
> Testing:wikipedia/subjects/prepared-test/history.txt
> 09/07/22 18:23:07 INFO bayes.TestClassifier: history    95.458984375
>  3910/4096.0
> 09/07/22 18:23:07 INFO bayes.TestClassifier: --------------
> 09/07/22 18:23:07 INFO bayes.TestClassifier:
> Testing:/wikipedia/subjects/prepared-test/science.txt
> 09/07/22 18:23:08 INFO bayes.TestClassifier: science    15.554072096128172
>      233/1498.0
> 09/07/22 18:23:08 INFO bayes.TestClassifier:
> =======================================================
>
>
> In other words, I'm really good at predicting History as a category and
> really bad at predicting Science.
>
> I think the following might help explain why:
> ls -l
> total 245360
> -rwxrwxrwx  1 grantingersoll  staff  89518235 Jul 22 17:53 history.txt*
> -rwxrwxrwx  1 grantingersoll  staff  36099183 Jul 22 17:53 science.txt*
>
> The number of history examples is almost double the number of science
> examples, based on my test set.
>
> There is obviously a teaching moment here.  I know there is a lot out there
> about sample sizes, feature selection, etc.  Can we boil some of these down
> into some cogent recommendations for our users?
>
>
> -Grant
>
> On Jul 22, 2009, at 5:23 PM, Grant Ingersoll wrote:
>
>  <basking>Grant</basking>
>>
>> On Jul 22, 2009, at 4:46 PM, Ted Dunning wrote:
>>
>>  Getting something to run is a big step.  It is important to bask in the
>>> glow
>>> for a tiny moment.
>>>
>>> On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>
>>>  Confusion Matrix
>>>> -------------------------------------------------------
>>>> a       b       <--Classified as
>>>> 3910    186      |  4096        a     = history
>>>> 1265    233      |  1498        b     = science
>>>> Default Category: unknown: 2
>>>> </snip>
>>>>
>>>> At least it's better than 50%, which is presumably a good thing ;-)  I
>>>> have
>>>> no clue what the state of the art is these days, but it doesn't seem
>>>> _horrendous_ either.
>>>>
>>>> I'd love to see someone validate what I have done.  Let me know if you
>>>> need
>>>> more details.  I'd also like to know how I can improve it.
>>>>
>>>>
>>>
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>>
>>
>>
>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
<done_basking>Grant</done_basking>

Here's an interesting piece:
09/07/22 18:23:02 INFO bayes.TestClassifier: Testing:wikipedia/subjects/prepared-test/history.txt
09/07/22 18:23:07 INFO bayes.TestClassifier: history    95.458984375    3910/4096.0
09/07/22 18:23:07 INFO bayes.TestClassifier: --------------
09/07/22 18:23:07 INFO bayes.TestClassifier: Testing:/wikipedia/subjects/prepared-test/science.txt
09/07/22 18:23:08 INFO bayes.TestClassifier: science    15.554072096128172    233/1498.0
09/07/22 18:23:08 INFO bayes.TestClassifier: =======================================================


In other words, I'm really good at predicting History as a category  
and really bad at predicting Science.

I think the following might help explain why:
ls -l
total 245360
-rwxrwxrwx  1 grantingersoll  staff  89518235 Jul 22 17:53 history.txt*
-rwxrwxrwx  1 grantingersoll  staff  36099183 Jul 22 17:53 science.txt*

The number of history examples is almost double the number of science
examples, based on my test set.

There is obviously a teaching moment here.  I know there is a lot out there
about sample sizes, feature selection, etc.  Can we boil some of these down
into some cogent recommendations for our users?


-Grant

On Jul 22, 2009, at 5:23 PM, Grant Ingersoll wrote:

> <basking>Grant</basking>
>
> On Jul 22, 2009, at 4:46 PM, Ted Dunning wrote:
>
>> Getting something to run is a big step.  It is important to bask in  
>> the glow
>> for a tiny moment.
>>
>> On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>
>>> Confusion Matrix
>>> -------------------------------------------------------
>>> a       b       <--Classified as
>>> 3910    186      |  4096        a     = history
>>> 1265    233      |  1498        b     = science
>>> Default Category: unknown: 2
>>> </snip>
>>>
>>> At least it's better than 50%, which is presumably a good  
>>> thing ;-)  I have
>>> no clue what the state of the art is these days, but it doesn't seem
>>> _horrendous_ either.
>>>
>>> I'd love to see someone validate what I have done.  Let me know if  
>>> you need
>>> more details.  I'd also like to know how I can improve it.
>>>
>>
>>
>>
>> -- 
>> Ted Dunning, CTO
>> DeepDyve
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
<basking>Grant</basking>

On Jul 22, 2009, at 4:46 PM, Ted Dunning wrote:

> Getting something to run is a big step.  It is important to bask in  
> the glow
> for a tiny moment.
>
> On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> Confusion Matrix
>> -------------------------------------------------------
>> a       b       <--Classified as
>> 3910    186      |  4096        a     = history
>> 1265    233      |  1498        b     = science
>> Default Category: unknown: 2
>> </snip>
>>
>> At least it's better than 50%, which is presumably a good  
>> thing ;-)  I have
>> no clue what the state of the art is these days, but it doesn't seem
>> _horrendous_ either.
>>
>> I'd love to see someone validate what I have done.  Let me know if  
>> you need
>> more details.  I'd also like to know how I can improve it.
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve



Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
Getting something to run is a big step.  It is important to bask in the glow
for a tiny moment.

On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Confusion Matrix
> -------------------------------------------------------
> a       b       <--Classified as
> 3910    186      |  4096        a     = history
> 1265    233      |  1498        b     = science
> Default Category: unknown: 2
> </snip>
>
> At least it's better than 50%, which is presumably a good thing ;-)  I have
> no clue what the state of the art is these days, but it doesn't seem
> _horrendous_ either.
>
> I'd love to see someone validate what I have done.  Let me know if you need
> more details.  I'd also like to know how I can improve it.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Getting Started with Classification

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
you could just use Reuters for sanity checking (ie internally).

but i would also consider checking your results against Rainbow:

http://www.cs.cmu.edu/~mccallum/bow/

this does a pretty good job of text classification and it would be simple to
get it to run over your Wikipedia set.  so long as the data is labelled in
the same way, things like multiple categories etc. shouldn't matter that
much.

Miles

2009/7/22 Grant Ingersoll <gs...@apache.org>

>
> On Jul 22, 2009, at 4:13 PM, Miles Osborne wrote:
>
>  it is probably good to benchmark against standard datasets.  for text
>> classification this tends to be the Reuters set:
>>
>> http://www.daviddlewis.com/resources/testcollections/
>>
>> this way you know if you are doing a good job
>>
>
> Yeah, good point.  The only problem is that, for my demo, I am doing it all
> on Wikipedia, because I want coherent examples and don't want to have to
> introduce another dataset.  I know there are a few areas for error in the
> process: we are just picking a single category for a document even though
> documents can have multiple, and furthermore we are picking the first
> category that matches, even though multiple input categories might be
> present, or even both categories in one (i.e. History of Science).
>
> Still, good to try out w/ the Reuters collection as well.  Sigh, I'll put
> it on the list to do.
>
>
>
>
>> Miles
>>
>> 2009/7/22 Grant Ingersoll <gs...@apache.org>
>>
>>  The model size is much smaller with unigrams.  :-)
>>>
>>> I'm not quite sure what constitutes good just yet, but, I can report the
>>> following using the commands I reported earlier w/ the exception that I
>>> am
>>> using unigrams:
>>>
>>> I have two categories:  History and Science
>>>
>>> 0. Splitter:
>>> org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
>>> --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir
>>> /PATH/wikipedia/chunks -c 64
>>>
>>> Then prep:
>>> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
>>> --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test
>>> --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
>>> (also do this for the training set)
>>>
>>> 1. Train set:
>>> ls ../chunks
>>> chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml
>>> chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml
>>> chunk-0033.xml  chunk-0037.xml
>>> chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml
>>> chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml
>>> chunk-0034.xml  chunk-0038.xml
>>> chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml
>>> chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml
>>> chunk-0035.xml  chunk-0039.xml
>>> chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml
>>> chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml
>>> chunk-0036.xml
>>>
>>> 2. Test Set:
>>> ls
>>> chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml
>>> chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
>>> chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml
>>> chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml
>>>
>>> 3. Run the Trainer on the train set:
>>> --input PATH/wikipedia/subjects/out --output
>>> PATH/wikipedia/subjects/model
>>> --gramSize 1 --classifierType bayes
>>>
>>> 4. Run the TestClassifier.
>>>
>>> --model PATH/wikipedia/subjects/model --testDir
>>> PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
>>>
>>> Output is:
>>>
>>> <snip>
>>> 9/07/22 15:55:09 INFO bayes.TestClassifier:
>>> =======================================================
>>> Summary
>>> -------------------------------------------------------
>>> Correctly Classified Instances          :       4143       74.0615%
>>> Incorrectly Classified Instances        :       1451       25.9385%
>>> Total Classified Instances              :       5594
>>>
>>> =======================================================
>>> Confusion Matrix
>>> -------------------------------------------------------
>>> a       b       <--Classified as
>>> 3910    186      |  4096        a     = history
>>> 1265    233      |  1498        b     = science
>>> Default Category: unknown: 2
>>> </snip>
>>>
>>> At least it's better than 50%, which is presumably a good thing ;-)  I
>>> have
>>> no clue what the state of the art is these days, but it doesn't seem
>>> _horrendous_ either.
>>>
>>> I'd love to see someone validate what I have done.  Let me know if you
>>> need
>>> more details.  I'd also like to know how I can improve it.
>>>
>>> On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
>>>
>>> Indeed.  I hadn't snapped to the fact you were using trigrams.
>>>
>>>>
>>>> 30 million features is quite plausible for that.  To effectively use
>>>> long
>>>> n-grams as features in classification of documents you really need to
>>>> have
>>>> the following:
>>>>
>>>> a) good statistical methods for resolving what is useful and what is
>>>> not.
>>>> Everybody here knows that my preference for a first hack is
>>>> sparsification
>>>> with log-likelihood ratios.
>>>>
>>>> b) some kind of smoothing using smaller n-grams
>>>>
>>>> c) some kind of smoothing over variants of n-grams.
>>>>
>>>> AFAIK, mahout doesn't have many or any of these in place.  You are
>>>> likely
>>>> to
>>>> do better with unigrams as a result.
>>>>
>>>> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>>>
>>>>
>>>> I suspect the explosion in the number of features, Ted, is due to the
>>>> use
>>>>
>>>>> of n-grams producing a lot of unique terms.  I can try w/ gramSize = 1,
>>>>> that
>>>>> will likely reduce the feature set quite a bit.
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 22, 2009, at 4:13 PM, Miles Osborne wrote:

> it is probably good to benchmark against standard datasets.  for text
> classification this tends to be the Reuters set:
>
> http://www.daviddlewis.com/resources/testcollections/
>
> this way you know if you are doing a good job

Yeah, good point.  The only problem is that, for my demo, I am doing it all
on Wikipedia, because I want coherent examples and don't want to have to
introduce another dataset.  I know there are a few areas for error in the
process: we are just picking a single category for a document even though
documents can have multiple, and furthermore we are picking the first
category that matches, even though multiple input categories might be
present, or even both categories in one (i.e. History of Science).

Still, good to try out w/ the Reuters collection as well.  Sigh, I'll  
put it on the list to do.


>
> Miles
>
> 2009/7/22 Grant Ingersoll <gs...@apache.org>
>
>> The model size is much smaller with unigrams.  :-)
>>
>> I'm not quite sure what constitutes good just yet, but, I can  
>> report the
>> following using the commands I reported earlier w/ the exception  
>> that I am
>> using unigrams:
>>
>> I have two categories:  History and Science
>>
>> 0. Splitter:
>> org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
>> --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir
>> /PATH/wikipedia/chunks -c 64
>>
>> Then prep:
>> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
>> --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test
>> --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
>> (also do this for the training set)
>>
>> 1. Train set:
>> ls ../chunks
>> chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml
>> chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml
>> chunk-0033.xml  chunk-0037.xml
>> chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml
>> chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml
>> chunk-0034.xml  chunk-0038.xml
>> chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml
>> chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml
>> chunk-0035.xml  chunk-0039.xml
>> chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml
>> chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml
>> chunk-0036.xml
>>
>> 2. Test Set:
>> ls
>> chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml
>> chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
>> chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml
>> chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml
>>
>> 3. Run the Trainer on the train set:
>> --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model
>> --gramSize 1 --classifierType bayes
>>
>> 4. Run the TestClassifier.
>>
>> --model PATH/wikipedia/subjects/model --testDir
>> PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
>>
>> Output is:
>>
>> <snip>
>> 9/07/22 15:55:09 INFO bayes.TestClassifier:
>> =======================================================
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances          :       4143       74.0615%
>> Incorrectly Classified Instances        :       1451       25.9385%
>> Total Classified Instances              :       5594
>>
>> =======================================================
>> Confusion Matrix
>> -------------------------------------------------------
>> a       b       <--Classified as
>> 3910    186      |  4096        a     = history
>> 1265    233      |  1498        b     = science
>> Default Category: unknown: 2
>> </snip>
>>
>> At least it's better than 50%, which is presumably a good  
>> thing ;-)  I have
>> no clue what the state of the art is these days, but it doesn't seem
>> _horrendous_ either.
>>
>> I'd love to see someone validate what I have done.  Let me know if  
>> you need
>> more details.  I'd also like to know how I can improve it.
>>
>> On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
>>
>> Indeed.  I hadn't snapped to the fact you were using trigrams.
>>>
>>> 30 million features is quite plausible for that.  To effectively  
>>> use long
>>> n-grams as features in classification of documents you really need  
>>> to have
>>> the following:
>>>
>>> a) good statistical methods for resolving what is useful and what  
>>> is not.
>>> Everybody here knows that my preference for a first hack is  
>>> sparsification
>>> with log-likelihood ratios.
>>>
>>> b) some kind of smoothing using smaller n-grams
>>>
>>> c) some kind of smoothing over variants of n-grams.
>>>
>>> AFAIK, mahout doesn't have many or any of these in place.  You are  
>>> likely
>>> to
>>> do better with unigrams as a result.
>>>
>>> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>
>>> I suspect the explosion in the number of features, Ted, is due to  
>>> the use
>>>> of n-grams producing a lot of unique terms.  I can try w/  
>>>> gramSize = 1,
>>>> that
>>>> will likely reduce the feature set quite a bit.
>>>>
>>>>
>>>
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>>
>>
>>
>>
>
>
> -- 
> The University of Edinburgh is a charitable body, registered in  
> Scotland,
> with registration number SC005336.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
it is probably good to benchmark against standard datasets.  for text
classification this tends to be the Reuters set:

http://www.daviddlewis.com/resources/testcollections/

this way you know if you are doing a good job

Miles

2009/7/22 Grant Ingersoll <gs...@apache.org>

> The model size is much smaller with unigrams.  :-)
>
> I'm not quite sure what constitutes good just yet, but, I can report the
> following using the commands I reported earlier w/ the exception that I am
> using unigrams:
>
> I have two categories:  History and Science
>
> 0. Splitter:
> org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
> --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir
> /PATH/wikipedia/chunks -c 64
>
> Then prep:
> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
> --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test
> --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
> (also do this for the training set)
>
> 1. Train set:
> ls ../chunks
> chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml
>  chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml
>  chunk-0033.xml  chunk-0037.xml
> chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml
>  chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml
>  chunk-0034.xml  chunk-0038.xml
> chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml
>  chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml
>  chunk-0035.xml  chunk-0039.xml
> chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml
>  chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml
>  chunk-0036.xml
>
> 2. Test Set:
>  ls
> chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml
>  chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
> chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml
>  chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml
>
> 3. Run the Trainer on the train set:
> --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model
> --gramSize 1 --classifierType bayes
>
> 4. Run the TestClassifier.
>
> --model PATH/wikipedia/subjects/model --testDir
> PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
>
> Output is:
>
> <snip>
> 9/07/22 15:55:09 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :       4143       74.0615%
> Incorrectly Classified Instances        :       1451       25.9385%
> Total Classified Instances              :       5594
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a       b       <--Classified as
> 3910    186      |  4096        a     = history
> 1265    233      |  1498        b     = science
> Default Category: unknown: 2
> </snip>
>
> At least it's better than 50%, which is presumably a good thing ;-)  I have
> no clue what the state of the art is these days, but it doesn't seem
> _horrendous_ either.
>
> I'd love to see someone validate what I have done.  Let me know if you need
> more details.  I'd also like to know how I can improve it.
>
> On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
>
>  Indeed.  I hadn't snapped to the fact you were using trigrams.
>>
>> 30 million features is quite plausible for that.  To effectively use long
>> n-grams as features in classification of documents you really need to have
>> the following:
>>
>> a) good statistical methods for resolving what is useful and what is not.
>> Everybody here knows that my preference for a first hack is sparsification
>> with log-likelihood ratios.
>>
>> b) some kind of smoothing using smaller n-grams
>>
>> c) some kind of smoothing over variants of n-grams.
>>
>> AFAIK, mahout doesn't have many or any of these in place.  You are likely
>> to
>> do better with unigrams as a result.
>>
>> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>
>>  I suspect the explosion in the number of features, Ted, is due to the use
>>> of n-grams producing a lot of unique terms.  I can try w/ gramSize = 1,
>>> that
>>> will likely reduce the feature set quite a bit.
>>>
>>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
The model size is much smaller with unigrams.  :-)

I'm not quite sure what constitutes good just yet, but, I can report  
the following using the commands I reported earlier w/ the exception  
that I am using unigrams:

I have two categories:  History and Science

0. Splitter:
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
--dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir /PATH/wikipedia/chunks -c 64

Then prep:
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
--input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
(also do this for the training set)

1. Train set:
ls ../chunks
chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml   
chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml   
chunk-0033.xml  chunk-0037.xml
chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml   
chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml   
chunk-0034.xml  chunk-0038.xml
chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml   
chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml   
chunk-0035.xml  chunk-0039.xml
chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml   
chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml   
chunk-0036.xml

2. Test Set:
  ls
chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml   
chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml   
chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml

3. Run the Trainer on the train set:
--input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 1 --classifierType bayes

4. Run the TestClassifier.

--model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes

Output is:

<snip>
9/07/22 15:55:09 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       4143       74.0615%
Incorrectly Classified Instances        :       1451       25.9385%
Total Classified Instances              :       5594

=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
3910    186      |  4096        a     = history
1265    233      |  1498        b     = science
Default Category: unknown: 2
</snip>

At least it's better than 50%, which is presumably a good thing ;-)  I  
have no clue what the state of the art is these days, but it doesn't  
seem _horrendous_ either.

I'd love to see someone validate what I have done.  Let me know if you  
need more details.  I'd also like to know how I can improve it.
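For anyone checking the numbers, a tiny sketch of the arithmetic behind the
summary and the per-class figures that come up later in this thread (plain
Java, nothing Mahout-specific):

public class ConfusionMatrixSketch {
  public static void main(String[] args) {
    // Rows = actual class, columns = predicted class: {a = history, b = science}.
    int[][] m = { {3910, 186}, {1265, 233} };
    int correct = m[0][0] + m[1][1];                   // 4143
    int total = m[0][0] + m[0][1] + m[1][0] + m[1][1]; // 5594
    System.out.printf("accuracy:       %.4f%%%n", 100.0 * correct / total);               // 74.0615%
    System.out.printf("history recall: %.4f%%%n", 100.0 * m[0][0] / (m[0][0] + m[0][1])); // ~95.46%
    System.out.printf("science recall: %.4f%%%n", 100.0 * m[1][1] / (m[1][0] + m[1][1])); // ~15.55%
  }
}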

On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:

> Indeed.  I hadn't snapped to the fact you were using trigrams.
>
> 30 million features is quite plausible for that.  To effectively use  
> long
> n-grams as features in classification of documents you really need  
> to have
> the following:
>
> a) good statistical methods for resolving what is useful and what is  
> not.
> Everybody here knows that my preference for a first hack is  
> sparsification
> with log-likelihood ratios.
>
> b) some kind of smoothing using smaller n-grams
>
> c) some kind of smoothing over variants of n-grams.
>
> AFAIK, mahout doesn't have many or any of these in place.  You are  
> likely to
> do better with unigrams as a result.
>
> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> I suspect the explosion in the number of features, Ted, is due to  
>> the use
>> of n-grams producing a lot of unique terms.  I can try w/ gramSize  
>> = 1, that
>> will likely reduce the feature set quite a bit.
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve



Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
Indeed.  I hadn't snapped to the fact you were using trigrams.

30 million features is quite plausible for that.  To effectively use long
n-grams as features in classification of documents you really need to have
the following:

a) good statistical methods for resolving what is useful and what is not.
Everybody here knows that my preference for a first hack is sparsification
with log-likelihood ratios.

b) some kind of smoothing using smaller n-grams

c) some kind of smoothing over variants of n-grams.

AFAIK, mahout doesn't have many or any of these in place.  You are likely to
do better with unigrams as a result.
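For reference, a self-contained sketch of the LLR test behind (a): this is
the standard G^2 statistic for a 2x2 term-vs-class contingency table; the
helper names and the cutoff are illustrative assumptions:

public class LlrSketch {
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // k11: docs in the class containing the term, k12: docs in the class
  // without it, k21: docs outside the class with it, k22: the rest.
  static double llr(long k11, long k12, long k21, long k22) {
    long n = k11 + k12 + k21 + k22;
    double rowEntropy = xLogX(n) - xLogX(k11 + k12) - xLogX(k21 + k22);
    double colEntropy = xLogX(n) - xLogX(k11 + k21) - xLogX(k12 + k22);
    double matEntropy = xLogX(n) - xLogX(k11) - xLogX(k12) - xLogX(k21) - xLogX(k22);
    return 2.0 * (rowEntropy + colEntropy - matEntropy);
  }

  // Keep a term as a feature only when its association with the class is
  // strong; the threshold here is an arbitrary example.
  static boolean keep(long k11, long k12, long k21, long k22) {
    return llr(k11, k12, k21, k22) > 10.0;
  }
}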

On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gs...@apache.org> wrote:

> I suspect the explosion in the number of features, Ted, is due to the use
> of n-grams producing a lot of unique terms.  I can try w/ gramSize = 1, that
> will likely reduce the feature set quite a bit.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
I'm taking a pretty naive (pun intended) approach to this, based on the
viewpoint of someone coming in new to Mahout, and to ML for that matter
(I'll also admit I haven't done a lot of practical classification myself,
even if I've read many of the papers, so playing that role isn't a stretch
for me), who just wants to get started doing some basic classification that
works reasonably well and demonstrates the idea.

The code is all publicly available in Mahout.  The Wikipedia data set I'm
using is at http://people.apache.org/~gsingers/wikipedia/ (ignore the small
files; the big bz2 file is the one I've used).

I'm happy to share the commands I used:

1. WikipediaDatasetCreatorDriver: --input PATH/wikipedia/chunks/ --output PATH/wikipedia/subjects/out --categories PATH TO MAHOUT CODE/examples/src/test/resources/subjects.txt

2. TrainClassifier: --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 3 --classifierType bayes

3. TestClassifier: --model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 3 --classifierType bayes

The training data was produced by the Wikipedia Splitter (first 60 chunks)
and the test data was some other chunks not in the first 60 (I haven't
successfully completed a Test run yet, or at least not one that resulted in
even decent results).

I suspect the explosion in the number of features, Ted, is due to the use of
n-grams producing a lot of unique terms.  I can try w/ gramSize = 1; that
will likely reduce the feature set quite a bit.

I am using the WikipediaTokenizer from Lucene, which does a better job of
removing cruft from Wikipedia than the StandardAnalyzer.

This is all based on my piecing things together from the Wiki and the code,
not on any great insight on my end.

-Grant


On Jul 22, 2009, at 2:24 PM, Ted Dunning wrote:

> It is common to have more features than there are plausible words.
>
> If these features are common enough to provide some support for the
> statistical inferences, then they are fine to use as long as they  
> aren't
> target leaks.  If they are rare (page URL for instance), then they  
> have
> little utility and should be pruned.
>
> Pruning will generally improve accuracy as well as speed and memory  
> use.
>
> On Wed, Jul 22, 2009 at 11:19 AM, Robin Anil <ro...@gmail.com>  
> wrote:
>
>> Yes, I agree. Maybe we can add a prune step or a minSupport parameter
>> to prune. But then again, a lot depends on the tokenizer used. Numeral
>> plus string literal combinations like, say, 100-sanfrancisco-ugs are found
>> in Wikipedia data a lot.  They add more to the feature count than
>> English words do.
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
It is common to have more features than there are plausible words.

If these features are common enough to provide some support for the
statistical inferences, then they are fine to use as long as they aren't
target leaks.  If they are rare (page URL for instance), then they have
little utility and should be pruned.

Pruning will generally improve accuracy as well as speed and memory use.

On Wed, Jul 22, 2009 at 11:19 AM, Robin Anil <ro...@gmail.com> wrote:

> Yes, I agree. Maybe we can add a prune step or a minSupport parameter
> to prune. But then again, a lot depends on the tokenizer used. Numeral
> plus string literal combinations like, say, 100-sanfrancisco-ugs are found
> in Wikipedia data a lot.  They add more to the feature count than
> English words do.
>
>

Re: Getting Started with Classification

Posted by Robin Anil <ro...@gmail.com>.
Yes, I agree. Maybe we can add a prune step or a minSupport parameter
to prune. But then again, a lot depends on the tokenizer used. Numeral
plus string literal combinations like, say, 100-sanfrancisco-ugs are found
in Wikipedia data a lot.  They add more to the feature count than
English words do.

Robin

On Wed, Jul 22, 2009 at 11:41 PM, Ted Dunning <te...@gmail.com> wrote:
> I could be mis-reading this, but it looks like you are saying that you have
> 31 million features.  That is, to put it mildly, a bit absurd.  Something is
> whacked to get that many features.  At the very least, singletons should not
> be used as features.
>
> On Wed, Jul 22, 2009 at 9:14 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>>  Where are the <label,feature> values stored?
>>>>
>>>>
>>> tf-Idf Folder part-****
>>>
>>
>> That's 1.28 GB.  Count: 31,216,595
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
I could be mis-reading this, but it looks like you are saying that you have
31 million features.  That is, to put it mildly, a bit absurd.  Something is
whacked to get that many features.  At the very least, singletons should not
be used as features.

On Wed, Jul 22, 2009 at 9:14 AM, Grant Ingersoll <gs...@apache.org> wrote:

>  Where are the <label,feature> values stored?
>>>
>>>
>> tf-Idf Folder part-****
>>
>
> That's 1.28 GB.  Count: 31,216,595




-- 
Ted Dunning, CTO
DeepDyve

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 22, 2009, at 11:50 AM, Robin Anil wrote:

> On Wed, Jul 22, 2009 at 8:55 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
>>
>> On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:
>>
>>> Dear Grant, could you post some stats like the number of labels and
>>> features that you have, and the number of unique <label, feature> pairs?
>>>
>>
>> labels: history and science
>> Docs trained on: chunk 1 - 60 generated using the Wikipedia  
>> Splitter with
>> the WikipediaAnalyzer (MAHOUT-146) with chunk size set to 64
>>
>> Where are the <label,feature> values stored?
>>
>
> tf-Idf Folder part-****

That's 1.28 GB.  Count: 31216595

(FYI, I modified the SequenceFileDumper to spit out counts from a  
SeqFile)
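Presumably something along these lines (a minimal sketch against the plain
Hadoop SequenceFile API; the path and the reflective key/value instantiation
are assumptions, since the modified dumper isn't shown here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileCountSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // e.g. the tf-idf part-**** file
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long count = 0;
    while (reader.next(key, value)) { // one iteration per <label,feature> entry
      count++;
    }
    reader.close();
    System.out.println("Count: " + count);
  }
}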

>
>
>>
>>
>>> Both naive Bayes and complementary naive Bayes use the same data,
>>> except for the Sigma_j set.
>>>
>>
>> So, why do I need to load it or even calculate it if I am using  
>> Bayes?  I
>> think I would like to have the choice.  That is, if I plan on using  
>> both,
>> then I can calculate/load both.  At a minimum, when classifying  
>> with Bayes,
>> we should not be loading it, even if we did calculate it.

Thoughts on this?  Can I disable it for Bayes?

>>
>> Could you add some writeup on http://cwiki.apache.org/MAHOUT/bayesian.html
>> about the steps that are taken?  Also, I've read the CNB paper, but do
>> you have a reference for the NB part using many of these values?
>>
>
> Sure. I will ASAP
>
>>
>>
>>> But regardless, the matrix stored is sparse. I am not surprised that
>>> memory limits were crossed with a set as large as the one you have taken.
>>> Another thing: the number of unique terms in Wikipedia is quite large. So
>>> the best choice for you right now is to use the HBase solution; the large
>>> matrix is stored easily on it. I am currently writing the distributed
>>> version of the HBase classification for parallelizing.
>>>
>>>
>> HBase isn't an option right now, as it isn't committed and I'm  
>> putting
>> together a demo on current capabilities.
>>
>>
>>
>> Robin
>>>
>>> On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>>>
>>>> The other thing is, I don't think Sigma_J is even used for Bayes, only
>>>> for Complementary Bayes.
>>>>
>>>>
>>>>
>>>> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>>>>
>>>>> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>>>>> directory under the model.  For me, this file is 1.04 GB.  The values in
>>>>> this file are loaded into a List of Doubles (which brings with it a
>>>>> whole lot of auto-boxing, too).  It seems like that should fit in memory,
>>>>> especially since it is the first thing loaded, AFAICT.  I have not looked
>>>>> yet into the structure of the file itself.
>>>>>
>>>>> I guess I will have to dig deeper; this code has changed a lot from when
>>>>> I first wrote it as a very simple naive Bayes model to one that now
>>>>> appears to be weighted by TF-IDF, normalization, etc., and I need to
>>>>> understand it better.
>>>>>
>>>>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>>>>
>>>>>> This is kind of surprising.  It would seem that this model shouldn't
>>>>>> have more than a few doubles per unique term and there should be <half a
>>>>>> million terms.  Even with pretty evil data structures, this really
>>>>>> shouldn't be more than a few hundred megs for the model alone.
>>>>>>
>>>>>> Sparsity *is* a virtue with these models and I always try to eliminate
>>>>>> terms that might as well have zero value, but that doesn't sound like
>>>>>> the root problem here.
>>>>>>
>>>>>> Regarding strings or Writables, strings have the wonderful
>>>>>> characteristic
>>>>>> that they cache their hashed value.  This means that hash maps  
>>>>>> are
>>>>>> nearly
>>>>>> as
>>>>>> fast as arrays because you wind up indexing to nearly the right  
>>>>>> place
>>>>>> and
>>>>>> then do a few (or one) integer compare to find the right  
>>>>>> value.  Custom
>>>>>> data
>>>>>> types rarely do this and thus wind up slow.
>>>>>>
>>>>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <gsingers@apache.org
>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>
>>>>>> I trained on a couple of categories (history and science) on  
>>>>>> quite a
>>>>>> few
>>>>>>
>>>>>>> docs, but now the model is so big, I can't load it, even with  
>>>>>>> almost 3
>>>>>>> GB of
>>>>>>> memory.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ted Dunning, CTO
>>>>>> DeepDyve
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Robin Anil <ro...@gmail.com>.
On Wed, Jul 22, 2009 at 8:55 PM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:
>
>  Dear Grant, could you post some stats like the number of
>> labels and features that you have and the number of unique label,feature
>> pairs.
>>
>
> labels: history and science
> Docs trained on: chunk 1 - 60 generated using the Wikipedia Splitter with
> the WikipediaAnalyzer (MAHOUT-146) with chunk size set to 64
>
> Where are the <label,feature> values stored?
>

tf-Idf Folder part-****


>
>
>  Both Naive Bayes and Complementary Naive Bayes use the same data,
>> except for the Sigma_j set.
>>
>
> So, why do I need to load it or even calculate it if I am using Bayes?  I
> think I would like to have the choice.  That is, if I plan on using both,
> then I can calculate/load both.  At a minimum, when classifying with Bayes,
> we should not be loading it, even if we did calculate it.
>
> Could you add some writeup on http://cwiki.apache.org/MAHOUT/bayesian.html about
> the steps that are taken?  Also, I've read the CNB paper, but do you have a
> reference for the NB part using many of these values?
>

Sure. I will ASAP

>
>
>  But regardless, the matrix stored is sparse. I am not
>> surprised that the memory limit was crossed with a training set as large
>> as the one you used. Another thing: the number of unique terms in
>> Wikipedia is quite large. So the best choice for you right now is to use
>> the HBase solution. The large matrix is stored easily on it. I am
>> currently writing the distributed version of HBase classification for
>> parallelizing.
>>
>>
> HBase isn't an option right now, as it isn't committed and I'm putting
> together a demo on current capabilities.
>
>
>
>  Robin
>>
>> On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <gsingers@apache.org
>> >wrote:
>>
>>  The other thing is, I don't think Sigma_J is even used for Bayes,
>>> only
>>> Complementary Bayes.
>>>
>>>
>>>
>>> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>>>
>>> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>>>
>>>> directory under the model.  For me, this file is 1.04 GB.  The values in
>>>> this file are loaded into a List of Doubles (which brings with it a
>>>> whole
>>>> lot of auto-boxing, too).  It seems like that should fit in memory,
>>>> especially since it is the first thing loaded, AFAICT.  I have not
>>>> looked
>>>> yet into the structure of the file itself.
>>>>
>>>> I guess I will have to dig deeper; this code has changed a lot from when
>>>> I
>>>> first wrote it as a very simple naive bayes model to one that now
>>>> appears to
>>>> be weighted by TF-IDF, normalization, etc. and I need to understand it
>>>> better.
>>>>
>>>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>>>
>>>> This is kind of surprising.  It would seem that this model shouldn't
>>>> have
>>>>
>>>>> more than a few doubles per unique term and there should be <half a
>>>>> million
>>>>> terms.  Even with pretty evil data structures, this really shouldn't be
>>>>> more
>>>>> than a few hundred megs for the model alone.
>>>>>
>>>>> Sparsity *is* a virtue with these models and I always try to eliminate
>>>>> terms
>>>>> that might as well have zero value, but that doesn't sound like the
>>>>> root
>>>>> problem here.
>>>>>
>>>>> Regarding strings or Writables, strings have the wonderful
>>>>> characteristic
>>>>> that they cache their hashed value.  This means that hash maps are
>>>>> nearly
>>>>> as
>>>>> fast as arrays because you wind up indexing to nearly the right place
>>>>> and
>>>>> then do a few (or one) integer compare to find the right value.  Custom
>>>>> data
>>>>> types rarely do this and thus wind up slow.
>>>>>
>>>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <gsingers@apache.org
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>
>>>>> I trained on a couple of categories (history and science) on quite a
>>>>> few
>>>>>
>>>>>> docs, but now the model is so big, I can't load it, even with almost 3
>>>>>> GB of
>>>>>> memory.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ted Dunning, CTO
>>>>> DeepDyve
>>>>>
>>>>>
>>>>
>>>>
>>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:

> Dear Grant, could you post some stats like the number of
> labels and features that you have and the number of unique
> label,feature pairs.

labels: history and science
Docs trained on: chunk 1 - 60 generated using the Wikipedia Splitter  
with the WikipediaAnalyzer (MAHOUT-146) with chunk size set to 64

Where are the <label,feature> values stored?

> Both Naive Bayes and Complementary Naive Bayes use the same data,
> except for the Sigma_j set.

So, why do I need to load it or even calculate it if I am using  
Bayes?  I think I would like to have the choice.  That is, if I plan  
on using both, then I can calculate/load both.  At a minimum, when  
classifying with Bayes, we should not be loading it, even if we did  
calculate it.

Could you add some writeup on http://cwiki.apache.org/MAHOUT/bayesian.html 
  about the steps that are taken?  Also, I've read the CNB paper, but  
do you have a reference for the NB part using many of these values?

> But regardless, the matrix stored is sparse. I am not
> surprised that the memory limit was crossed with a training set as
> large as the one you used. Another thing: the number of unique terms
> in Wikipedia is quite large. So the best choice for you right now is
> to use the HBase solution. The large matrix is stored easily on it. I
> am currently writing the distributed version of HBase classification
> for parallelizing.
>

HBase isn't an option right now, as it isn't committed and I'm putting  
together a demo on current capabilities.


> Robin
>
> On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll  
> <gs...@apache.org>wrote:
>
>> The other thing is, I don't think Sigma_J is even used for
>> Bayes, only
>> Complementary Bayes.
>>
>>
>>
>> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>>
>> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>>> directory under the model.  For me, this file is 1.04 GB.  The  
>>> values in
>>> this file are loaded into a List of Doubles (which brings with it  
>>> a whole
>>> lot of auto-boxing, too).  It seems like that should fit in memory,
>>> especially since it is the first thing loaded, AFAICT.  I have not  
>>> looked
>>> yet into the structure of the file itself.
>>>
>>> I guess I will have to dig deeper; this code has changed a lot
>>> from when I
>>> first wrote it as a very simple naive bayes model to one that now  
>>> appears to
>>> be weighted by TF-IDF, normalization, etc. and I need to  
>>> understand it
>>> better.
>>>
>>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>>
>>> This is kind of surprising.  It would seem that this model  
>>> shouldn't have
>>>> more than a few doubles per unique term and there should be <half a
>>>> million
>>>> terms.  Even with pretty evil data structures, this really  
>>>> shouldn't be
>>>> more
>>>> than a few hundred megs for the model alone.
>>>>
>>>> Sparsity *is* a virtue with these models and I always try to  
>>>> eliminate
>>>> terms
>>>> that might as well have zero value, but that doesn't sound like  
>>>> the root
>>>> problem here.
>>>>
>>>> Regarding strings or Writables, strings have the wonderful  
>>>> characteristic
>>>> that they cache their hashed value.  This means that hash maps  
>>>> are nearly
>>>> as
>>>> fast as arrays because you wind up indexing to nearly the right  
>>>> place and
>>>> then do a few (or one) integer compare to find the right value.   
>>>> Custom
>>>> data
>>>> types rarely do this and thus wind up slow.
>>>>
>>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <gsingers@apache.org
>>>>> wrote:
>>>>
>>>> I trained on a couple of categories (history and science) on  
>>>> quite a few
>>>>> docs, but now the model is so big, I can't load it, even with  
>>>>> almost 3
>>>>> GB of
>>>>> memory.
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve
>>>>
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Robin Anil <ro...@gmail.com>.
Dear Grant, could you post some stats like the number of labels and
features that you have and the number of unique label,feature pairs. Both
Naive Bayes and Complementary Naive Bayes use the same data, except for
the Sigma_j set. But regardless, the matrix stored is sparse. I am not
surprised that the memory limit was crossed with a training set as large
as the one you used. Another thing: the number of unique terms in
Wikipedia is quite large. So the best choice for you right now is to use
the HBase solution. The large matrix is stored easily on it. I am
currently writing the distributed version of HBase classification for
parallelizing.

Robin

On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <gs...@apache.org>wrote:

> The other thing is, I don't think Sigma_J is even used for Bayes, only
> Complementary Bayes.
>
>
>
> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>
>  AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>> directory under the model.  For me, this file is 1.04 GB.  The values in
>> this file are loaded into a List of Doubles (which brings with it a whole
>> lot of auto-boxing, too).  It seems like that should fit in memory,
>> especially since it is the first thing loaded, AFAICT.  I have not looked
>> yet into the structure of the file itself.
>>
>> I guess I will have to dig deeper; this code has changed a lot from when I
>> first wrote it as a very simple naive bayes model to one that now appears to
>> be weighted by TF-IDF, normalization, etc. and I need to understand it
>> better.
>>
>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>
>>  This is kind of surprising.  It would seem that this model shouldn't have
>>> more than a few doubles per unique term and there should be <half a
>>> million
>>> terms.  Even with pretty evil data structures, this really shouldn't be
>>> more
>>> than a few hundred megs for the model alone.
>>>
>>> Sparsity *is* a virtue with these models and I always try to eliminate
>>> terms
>>> that might as well have zero value, but that doesn't sound like the root
>>> problem here.
>>>
>>> Regarding strings or Writables, strings have the wonderful characteristic
>>> that they cache their hashed value.  This means that hash maps are nearly
>>> as
>>> fast as arrays because you wind up indexing to nearly the right place and
>>> then do a few (or one) integer compare to find the right value.  Custom
>>> data
>>> types rarely do this and thus wind up slow.
>>>
>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <gsingers@apache.org
>>> >wrote:
>>>
>>>  I trained on a couple of categories (history and science) on quite a few
>>>> docs, but now the model is so big, I can't load it, even with almost 3
>>>> GB of
>>>> memory.
>>>>
>>>
>>>
>>>
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>>
>>
>>
>

Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
The other thing is, I don't think Sigma_J is even used for Bayes,
only Complementary Bayes.


On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:

> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
> directory under the model.  For me, this file is 1.04 GB.  The  
> values in this file are loaded into a List of Doubles (which brings  
> with it a whole lot of auto-boxing, too).  It seems like that should
> fit in memory, especially since it is the first thing loaded,  
> AFAICT.  I have not looked yet into the structure of the file itself.
>
> I guess I will have to dig deeper; this code has changed a lot from
> when I first wrote it as a very simple naive bayes model to one that  
> now appears to be weighted by TF-IDF, normalization, etc. and I need  
> to understand it better.
>
> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>
>> This is kind of surprising.  It would seem that this model  
>> shouldn't have
>> more than a few doubles per unique term and there should be <half a  
>> million
>> terms.  Even with pretty evil data structures, this really  
>> shouldn't be more
>> than a few hundred megs for the model alone.
>>
>> Sparsity *is* a virtue with these models and I always try to  
>> eliminate terms
>> that might as well have zero value, but that doesn't sound like the  
>> root
>> problem here.
>>
>> Regarding strings or Writables, strings have the wonderful  
>> characteristic
>> that they cache their hashed value.  This means that hash maps are  
>> nearly as
>> fast as arrays because you wind up indexing to nearly the right  
>> place and
>> then do a few (or one) integer compare to find the right value.   
>> Custom data
>> types rarely do this and thus wind up slow.
>>
>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll  
>> <gs...@apache.org>wrote:
>>
>>> I trained on a couple of categories (history and science) on quite  
>>> a few
>>> docs, but now the model is so big, I can't load it, even with  
>>> almost 3 GB of
>>> memory.
>>
>>
>>
>>
>> -- 
>> Ted Dunning, CTO
>> DeepDyve
>


Re: Getting Started with Classification

Posted by Grant Ingersoll <gs...@apache.org>.
AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
directory under the model.  For me, this file is 1.04 GB.  The values  
in this file are loaded into a List of Doubles (which brings with it a  
whole lot of auto-boxing, too).  It seems like that should fit in
memory, especially since it is the first thing loaded, AFAICT.  I have  
not looked yet into the structure of the file itself.
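
To put the boxing overhead in perspective, here is a toy comparison (plain
Java, not our code): each Double costs roughly 16 bytes of object plus a
reference in the backing array, against a flat 8 bytes per slot for a
primitive double[]:

import java.util.ArrayList;
import java.util.List;

public class BoxingCost {
  public static void main(String[] args) {
    int n = 10000000;

    // Boxed: a full Double object per element, plus a reference,
    // and an autoboxing allocation on every add().
    List<Double> boxed = new ArrayList<Double>(n);
    for (int i = 0; i < n; i++) {
      boxed.add((double) i);
    }

    // Primitive: one flat 8-byte slot per element, no boxing at all.
    double[] primitive = new double[n];
    for (int i = 0; i < n; i++) {
      primitive[i] = i;
    }

    System.out.println(boxed.size() + " boxed vs " + primitive.length + " primitive");
  }
}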

I guess I will have to dig deeper; this code has changed a lot from
when I first wrote it as a very simple naive bayes model to one that  
now appears to be weighted by TF-IDF, normalization, etc. and I need  
to understand it better.

On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:

> This is kind of surprising.  It would seem that this model shouldn't  
> have
> more than a few doubles per unique term and there should be <half a  
> million
> terms.  Even with pretty evil data structures, this really shouldn't  
> be more
> than a few hundred megs for the model alone.
>
> Sparsity *is* a virtue with these models and I always try to  
> eliminate terms
> that might as well have zero value, but that doesn't sound like the  
> root
> problem here.
>
> Regarding strings or Writables, strings have the wonderful  
> characteristic
> that they cache their hashed value.  This means that hash maps are  
> nearly as
> fast as arrays because you wind up indexing to nearly the right  
> place and
> then do a few (or one) integer compare to find the right value.   
> Custom data
> types rarely do this and thus wind up slow.
>
> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll  
> <gs...@apache.org>wrote:
>
>> I trained on a couple of categories (history and science) on quite  
>> a few
>> docs, but now the model is so big, I can't load it, even with  
>> almost 3 GB of
>> memory.
>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
This is kind of surprising.  It would seem that this model shouldn't have
more than a few doubles per unique term and there should be <half a million
terms.  Even with pretty evil data structures, this really shouldn't be more
than a few hundred megs for the model alone.

Sparsity *is* a virtue with these models and I always try to eliminate terms
that might as well have zero value, but that doesn't sound like the root
problem here.
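
The trimming can be as blunt as a threshold pass over the weights before
the model is written out.  A rough sketch (the names and the epsilon are
illustrative, not anything in Mahout):

import java.util.Iterator;
import java.util.Map;

public class ModelTrimmer {
  /** Drop terms whose weight is effectively zero; pick epsilon to taste. */
  public static void trim(Map<String, Double> termWeights, double epsilon) {
    Iterator<Map.Entry<String, Double>> it = termWeights.entrySet().iterator();
    while (it.hasNext()) {
      if (Math.abs(it.next().getValue()) < epsilon) {
        it.remove();  // this term contributes almost nothing to any score
      }
    }
  }
}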

Regarding strings or Writables, strings have the wonderful characteristic
that they cache their hashed value.  This means that hash maps are nearly as
fast as arrays because you wind up indexing to nearly the right place and
then do a few (or one) integer compare to find the right value.  Custom data
types rarely do this and thus wind up slow.
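
If someone does go the custom-key route, the String trick is easy to copy:
memoize the hash.  A sketch of such a key (illustrative only, not anything
in Mahout):

import java.util.Arrays;

public final class TermKey {
  private final char[] chars;
  private int hash;  // 0 until the first hashCode() call, then cached

  public TermKey(String s) {
    this.chars = s.toCharArray();
  }

  @Override
  public int hashCode() {
    int h = hash;
    if (h == 0 && chars.length > 0) {
      for (char c : chars) {
        h = 31 * h + c;  // same polynomial hash String uses
      }
      hash = h;          // cache it, just as java.lang.String does
    }
    return h;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof TermKey && Arrays.equals(chars, ((TermKey) o).chars);
  }
}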

On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <gs...@apache.org>wrote:

>  I trained on a couple of categories (history and science) on quite a few
> docs, but now the model is so big, I can't load it, even with almost 3 GB of
> memory.




-- 
Ted Dunning, CTO
DeepDyve

Re: Getting Started with Classification

Posted by Ted Dunning <te...@gmail.com>.
Resizing should only have log N cost, so it shouldn't account for what you
are seeing.  The old copies should disappear pretty quickly the next time
there is a gc after resizing.  Only the last resize or three will be around
long enough to make it into tenured space, and even then, this should
collect pretty quickly.
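
And if resizing still bothers you, presizing from a known term count avoids
every rehash -- plain java.util, nothing Mahout-specific:

import java.util.HashMap;
import java.util.Map;

public class Presized {
  /** A map sized so it never rehashes for up to expectedTerms entries. */
  public static Map<String, Double> weightMap(int expectedTerms) {
    // HashMap grows when size > capacity * loadFactor (0.75 by default),
    // so request expectedTerms / 0.75 buckets up front.
    int capacity = (int) (expectedTerms / 0.75f) + 1;
    return new HashMap<String, Double>(capacity);
  }
}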

This isn't a first-order effect.  Something else is causing your grief.

On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <gs...@apache.org>wrote:

>  I also notice we start with default values for the maps used to load the
> models, which probably means we are resizing a lot.  Should we use Strings
> or would it be better to have some custom Writables and then keep track of
> the actual terms separately kind of like the doc clustering does as well as
> tracking the size so we can avoid resizing?
>



-- 
Ted Dunning, CTO
DeepDyve