Posted to user@mahout.apache.org by Lyall Morrison <ly...@gmail.com> on 2011/11/16 13:51:14 UTC

Maximum number of categories in a Bayesian classifier

Hi everyone,

I'm trying to classify some unsorted text files into categories using a
Bayesian classifier. It works well until I run a classifier with more than
about 30 categories (the limit is somewhere between 27 and 32; I haven't
nailed it down yet).

Training appears to work fine with all ~150 categories I have identified,
but running the classifier against a model with too many categories causes
it to hang without reporting any errors.

Can anyone tell me if there is a known limit here or suggest an easy way to
diagnose this? My next resort is source diving, which I would prefer to
avoid if I can.

If I'm reading it correctly, I'm on Mahout 0.5-SNAPSHOT. I haven't been
keeping it up to date, as I prefer a static codebase while I'm mucking
around; at least that way, if something stops working, I know it's my
fault ;)

Thanks for your time,

Lyall Morrison

Re: Maximum number of categories in a Bayesian classifier

Posted by Ted Dunning <te...@gmail.com>.
Which classifier?

How are you running it?

Can you publish your data?

Re: Maximum number of categories in a Bayesian classifier

Posted by Ted Dunning <te...@gmail.com>.
I would recommend the SGD classifiers. For more than 40 categories or so, I
would also consider using SGD classifiers hierarchically.
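
For anyone unfamiliar with the hierarchical idea, here is a minimal,
library-agnostic sketch in Python. It is not Mahout code and makes no claim
about how Mahout's SGD classes should be wired together. The class names
(SgdLogReg, HierarchicalClassifier), the hand-made two-level grouping, the
learning rate, and the noisy one-hot vectors standing in for text features
are all assumptions made for illustration. A top-level model picks a group
of categories, then a per-group model picks the category within that group.

```python
import math
import random


class SgdLogReg:
    """Multinomial logistic regression trained by plain SGD (toy version)."""

    def __init__(self, num_classes, num_features, rate=0.5):
        # One weight row per class; the extra column is the bias.
        self.w = [[0.0] * (num_features + 1) for _ in range(num_classes)]
        self.rate = rate  # assumed learning rate, not tuned

    def probs(self, x):
        xb = list(x) + [1.0]
        scores = [sum(wi * xi for wi, xi in zip(row, xb)) for row in self.w]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def train(self, label, x):
        # One SGD step on the log-likelihood of a single example.
        xb = list(x) + [1.0]
        p = self.probs(x)
        for k, row in enumerate(self.w):
            err = (1.0 if k == label else 0.0) - p[k]
            for j, xj in enumerate(xb):
                row[j] += self.rate * err * xj

    def classify(self, x):
        p = self.probs(x)
        return p.index(max(p))


class HierarchicalClassifier:
    """Route an example to a group first, then to a category in that group."""

    def __init__(self, groups, num_features):
        # groups: list of lists of category names (hypothetical grouping).
        self.groups = groups
        self.top = SgdLogReg(len(groups), num_features)
        self.leaves = [SgdLogReg(len(g), num_features) for g in groups]
        self.where = {c: (gi, ci)
                      for gi, g in enumerate(groups)
                      for ci, c in enumerate(g)}

    def train(self, category, x):
        gi, ci = self.where[category]
        self.top.train(gi, x)         # the top model learns the group
        self.leaves[gi].train(ci, x)  # only that group's model sees the example

    def classify(self, x):
        gi = self.top.classify(x)
        return self.groups[gi][self.leaves[gi].classify(x)]


def demo():
    groups = [["sports", "politics"], ["cooking", "travel"]]
    names = [c for g in groups for c in g]
    n = len(names)
    rng = random.Random(0)
    model = HierarchicalClassifier(groups, num_features=n)
    for _ in range(200):
        for i, name in enumerate(names):
            # Noisy one-hot vector standing in for real text features.
            x = [(1.0 if j == i else 0.0) + rng.gauss(0.0, 0.05)
                 for j in range(n)]
            model.train(name, x)
    clean = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
    return [model.classify(x) for x in clean]


if __name__ == "__main__":
    print(demo())
```

The point is only that no single model has to separate all the categories at
once. How to choose the groups is a separate question that this sketch does
not address.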

On Fri, Dec 2, 2011 at 5:46 PM, Tom Pierce <tc...@cloudera.com> wrote:

> Hi,
>
> I've run into the same or a similar error; I've filed MAHOUT-911 with
> a set of Wikipedia categories you can use to trigger this condition
> using the Wikipedia/NaiveBayes example recipe (classifier application
> fails in either mapreduce or sequential mode).
>
> -tom
>

Re: Maximum number of categories in a Bayesian classifier

Posted by Tom Pierce <tc...@cloudera.com>.
Hi,

I've run into the same or a similar error; I've filed MAHOUT-911 with
a set of Wikipedia categories you can use to trigger this condition
using the Wikipedia/NaiveBayes example recipe (classifier application
fails in either mapreduce or sequential mode).

-tom
