Posted to user@mahout.apache.org by Magesh Sarma <ma...@gmail.com> on 2012/12/26 17:01:57 UTC

Document Classification - Recommended Algorithms?

Hi:

Coming from the Weka world, I have a newbie question.

My problem is straightforward: I have to label a given document.  Each
document will have only one label.  I have hundreds of labels.  I have a
big training set (thousands of labeled documents).  Accuracy is important,
and so is the ability to train incrementally or, failing that, to rebuild
the model from scratch quickly.

I have used the J48 (based on C4.5) algorithm in Weka with a good degree of
success.  Accuracy is high, but training speed is very slow.  Plus, it does
not support incremental training.

Any recommendation on what algorithm(s) would be a good fit if I switch to
Mahout?

Cheers,
Magesh

Re: Document Classification - Recommended Algorithms?

Posted by Magesh Sarma <ma...@gmail.com>.
Ted:
Thanks for the helpful pointers.

> Do you have thousands of labeled documents for each category?
Yes, I have several years worth of human-classified documents.  I can
get my hands on as many labeled documents as needed.

> Are the categories groupable into very similar clusters?
I don't understand what you mean by this.  Each "document" in my case
will have one or more pages - typically 1 to 3 pages.  When testing,
any page may be fed in for classification, and the label needs to be
correctly applied.  So, for training purposes, I split a multi-page
document into single-page ones, and give each page the same category.
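
To make that page-splitting step concrete, here is a minimal Python sketch.  The form-feed page separator, the function name, and the sample document are all invented for illustration; only the idea (one training example per page, all carrying the parent document's label) comes from the thread.

```python
# Turn one multi-page labeled document into several single-page
# training examples, each carrying the same label.

def split_into_pages(doc_text, label, separator="\f"):
    """Yield (page_text, label) pairs, one per non-empty page."""
    for page in doc_text.split(separator):
        page = page.strip()
        if page:
            yield (page, label)

# A hypothetical 3-page document becomes three examples with one label.
doc = "page one text\fpage two text\fpage three text"
examples = list(split_into_pages(doc, "invoice"))
```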

All documents belong to the same business domain and are very similar
in terms used.  However, I'm not sure if that answers your question.

> Do categories come and go?
Very rarely.  When this happens, it will be a highly controlled event.

> What is high accuracy to you?
With J48, I was able to get upwards of 99.5% accurate predictions on a
5000 document test set.  It was as good as, if not better than, human
classification, assuming the human makes errors too.

> My first recommendation for text classification always is L_1 regularized
> logistic regression.  Since your training data is small, I would recommend
> that you start with glmnet on R with word level features.  If you have
> additional meta-data such as source of the text or time of day or whatnot,
> label that specially and see if including it helps.

There is no metadata - just OCR'd sheets of text.

>
> Whether you want a multinomial model or lots of binomial models is an open
> question.  Try each design if you can (glmnet will only do the binomial
> option).
>
> As an interesting tree-based alternative, I think that your data is small
> enough to use the standard random forest implementation.
>
> If you have usable category nesting, you might try training a top-level
> model, then taking the top few super-categories and trying a category
> specific model at that level.
>
> R should suffice as long as you have fewer than hundreds of thousands of
> documents.  Some algorithms in R work with larger data; most will not.

OK - I will give that a try.

>
> On Wed, Dec 26, 2012 at 8:01 AM, Magesh Sarma <ma...@gmail.com> wrote:
>
> > Hi:
> >
> > Coming from the Weka world, I have a newbie question.
> >
> > My problem is straightforward: I have to label a given document.  Each
> > document will have only one label.  I have hundreds of labels.  I have a
> > big training set (thousands of labeled documents).  Accuracy is important,
> > and so is the ability to train incrementally or, failing that, to rebuild
> > the model from scratch quickly.
> >
> > I have used the J48 (based on C4.5) algorithm in Weka with a good degree of
> > success.  Accuracy is high, but training speed is very slow.  Plus, it does
> > not support incremental training.
> >
> > Any recommendation on what algorithm(s) would be a good fit if I switch to
> > Mahout?
> >
> > Cheers,
> > Magesh
> >

Re: Document Classification - Recommended Algorithms?

Posted by Ted Dunning <te...@gmail.com>.
Glad this worked for you.

There is a random forest implementation in Mahout as well.  That might be
helpful at some point.

On Fri, Dec 28, 2012 at 5:23 AM, Magesh Sarma <ma...@gmail.com> wrote:

> On Wed, Dec 26, 2012 at 2:20 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > As an interesting tree-based alternative, I think that your data is small
> > enough to use the standard random forest implementation.
>
> Random forest appears to work quite well for my situation.  Training
> and prediction are both very fast, and accuracy is quite high - even
> better than J48 on a bigger test set.
>
> Thanks for the pointers, Ted.
>
> Cheers,
> Magesh
>

Re: Document Classification - Recommended Algorithms?

Posted by Magesh Sarma <ma...@gmail.com>.
On Wed, Dec 26, 2012 at 2:20 PM, Ted Dunning <te...@gmail.com> wrote:

> As an interesting tree-based alternative, I think that your data is small
> enough to use the standard random forest implementation.

Random forest appears to work quite well for my situation.  Training
and prediction are both very fast, and accuracy is quite high - even
better than J48 on a bigger test set.

Thanks for the pointers, Ted.

Cheers,
Magesh

Re: Document Classification - Recommended Algorithms?

Posted by Ted Dunning <te...@gmail.com>.
Do you have thousands of labeled documents for each category?

Are the categories groupable into very similar clusters?

Do categories come and go?

What is high accuracy to you?

My first recommendation for text classification always is L_1 regularized
logistic regression.  Since your training data is small, I would recommend
that you start with glmnet on R with word level features.  If you have
additional meta-data such as source of the text or time of day or whatnot,
label that specially and see if including it helps.
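
glmnet is an R package; as a rough Python analogue of this suggestion, scikit-learn's LogisticRegression also supports L1 regularization over word-level (bag-of-words) features.  The tiny corpus and the C value below are invented purely for illustration, not taken from the thread.

```python
# L1-regularized logistic regression on word-count features.
# The L1 penalty drives most word weights to exactly zero, which acts
# as built-in feature selection - useful with a large vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["invoice total amount due", "purchase order item quantity",
        "invoice payment due date", "order shipment item count"]
labels = ["invoice", "order", "invoice", "order"]

clf = make_pipeline(
    CountVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0))
clf.fit(docs, labels)
pred = clf.predict(["amount due on invoice"])[0]
```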

Whether you want a multinomial model or lots of binomial models is an open
question.  Try each design if you can (glmnet will only do the binomial
option).
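
The two designs being contrasted can both be expressed in scikit-learn (again a stand-in for glmnet here, and again with an invented corpus): a single joint multinomial model, versus one binomial model per label combined one-vs-rest.

```python
# Design 1: one multinomial model over all labels at once.
# Design 2: many binomial models, one per label, combined one-vs-rest
# (the design that glmnet's binomial mode forces on you).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

docs = ["invoice amount due", "invoice payment due",
        "purchase order item", "order shipment item",
        "contract terms signed", "contract agreement signed"]
labels = ["invoice", "invoice", "order", "order", "contract", "contract"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

multi = LogisticRegression(max_iter=1000).fit(X, labels)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, labels)

query = vec.transform(["agreement terms signed"])
m_pred, o_pred = multi.predict(query)[0], ovr.predict(query)[0]
```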

As an interesting tree-based alternative, I think that your data is small
enough to use the standard random forest implementation.
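
Mahout ships its own random forest implementation; the scikit-learn sketch below (with an invented corpus) only illustrates the idea of the tree-based alternative over the same bag-of-words features.

```python
# A random forest over word-count features: many randomized decision
# trees vote on the label.  Each tree trains quickly and independently.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

docs = ["invoice amount due", "invoice payment due",
        "purchase order item", "order shipment item"]
labels = ["invoice", "invoice", "order", "order"]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
pred = rf.predict(vec.transform(["invoice payment due"]).toarray())[0]
```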

If you have usable category nesting, you might try training a top-level
model, then taking the top few super-categories and trying a category
specific model at that level.
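
The nesting idea can be sketched as a two-stage classifier: a top-level model predicts a super-category, then a per-super-category model picks the fine label within it.  The categories, the grouping, and the corpus below are all invented examples, not from the thread.

```python
# Hierarchical classification: route via a super-category first.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["invoice amount due", "receipt payment cash",
        "purchase order item", "shipment tracking number"]
fine = ["invoice", "receipt", "order", "shipment"]
# Hypothetical nesting: billing = {invoice, receipt},
#                       logistics = {order, shipment}.
super_of = {"invoice": "billing", "receipt": "billing",
            "order": "logistics", "shipment": "logistics"}

vec = CountVectorizer().fit(docs)
X = vec.transform(docs)

# Top-level model: predict the super-category.
top = LogisticRegression().fit(X, [super_of[f] for f in fine])

# One fine-grained model per super-category, trained only on its slice.
subs = {}
for sup in ("billing", "logistics"):
    idx = [i for i, f in enumerate(fine) if super_of[f] == sup]
    subs[sup] = LogisticRegression().fit(X[idx], [fine[i] for i in idx])

def classify(text):
    x = vec.transform([text])
    return subs[top.predict(x)[0]].predict(x)[0]
```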

R should suffice as long as you have fewer than hundreds of thousands of
documents.  Some algorithms in R work with larger data; most will not.

On Wed, Dec 26, 2012 at 8:01 AM, Magesh Sarma <ma...@gmail.com> wrote:

> Hi:
>
> Coming from the Weka world, I have a newbie question.
>
> My problem is straightforward: I have to label a given document.  Each
> document will have only one label.  I have hundreds of labels.  I have a
> big training set (thousands of labeled documents).  Accuracy is important,
> and so is the ability to train incrementally or, failing that, to rebuild
> the model from scratch quickly.
>
> I have used the J48 (based on C4.5) algorithm in Weka with a good degree of
> success.  Accuracy is high, but training speed is very slow.  Plus, it does
> not support incremental training.
>
> Any recommendation on what algorithm(s) would be a good fit if I switch to
> Mahout?
>
> Cheers,
> Magesh
>