
Posted to user@mahout.apache.org by Claudia Grieco <gr...@crmpa.unisa.it> on 2011/02/03 12:55:11 UTC

Re: Help with Mahout Classification

Thanks for your help.
I've tried implementing several "boolean" classifiers ("sport" or "not sport"), but they don't work very well: they tend to classify everything as "positive" or everything as "negative". Do you think the model needs to be trained with roughly equal amounts of "positive" and "negative" data in order to return meaningful classifications?

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Monday, January 31, 2011 16:43
To: user@mahout.apache.org
Subject: Re: Help with Mahout Classification

For 50 categories, yes.  For 5000, no.

If you have 50 categories, you probably also have inter-category constraints
(e.g. a document cannot be about football but not about sports).

To deal with that, training 50 independent models and then training 50
models that get to use the output of the first 50 models as inputs might
help (haven't tried this sort of thing for several years).
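
As a rough illustration of the two-level arrangement described above, here is a minimal sketch in plain Python. It does not use Mahout's API: the toy logistic-regression trainer, the three made-up features, and all names (`train_sgd`, `train_two_level`, `classify`) are assumptions invented for the example.

```python
import math
import random

def sigmoid(z):
    # Clamp the score so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, row):
    # The last weight is the bias term.
    return sigmoid(sum(w * x for w, x in zip(weights, row + [1.0])))

def train_sgd(rows, labels, epochs=200, rate=0.5, seed=0):
    """Binary logistic regression trained with plain SGD (toy implementation)."""
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0]) + 1)
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = labels[i] - predict(weights, rows[i])
            for j, x in enumerate(rows[i] + [1.0]):
                weights[j] += rate * error * x
    return weights

def targets(tags, category):
    # 1.0 for "category", 0.0 for "not category".
    return [1.0 if category in t else 0.0 for t in tags]

def train_two_level(rows, tags, categories):
    # Level 1: one independent binary model per category.
    level1 = {c: train_sgd(rows, targets(tags, c)) for c in categories}
    # Level 2: same targets, but every row is extended with all level-1 scores,
    # so each model can see what the other categories' models predicted.
    stacked = [row + [predict(level1[c], row) for c in categories] for row in rows]
    level2 = {c: train_sgd(stacked, targets(tags, c)) for c in categories}
    return level1, level2

def classify(level1, level2, categories, row):
    stacked = row + [predict(level1[c], row) for c in categories]
    return {c: predict(level2[c], stacked) for c in categories}

# Hypothetical features: [mentions "ball", mentions "goal", mentions "election"].
CATEGORIES = ["sport", "football"]
ROWS = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
TAGS = [{"sport", "football"}, {"sport"}, set(),
        set(), {"sport", "football"}, {"sport"}]

level1, level2 = train_two_level(ROWS, TAGS, CATEGORIES)
print(classify(level1, level2, CATEGORIES, [1.0, 1.0, 0.0]))
```

In a real setup the second-level models would normally be trained on first-level scores computed for held-out documents, so that they do not simply learn to trust overfitted first-level outputs.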

On Mon, Jan 31, 2011 at 2:55 AM, Claudia Grieco <gr...@crmpa.unisa.it> wrote:

> Hi,
> Just one more question about the SGD classifier.
> When you say "train one classifier per category", does it mean that for every
> possible tag (e.g. sport) I should create a classifier that classifies a
> document as "sport" or "not sport"? (sorry, English is not my first language)
> Do you think this approach is feasible for many categories (let's say 50)?
> Thanks again
> Claudia
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Friday, January 14, 2011 17:32
> To: user@mahout.apache.org
> Subject: Re: Help with Mahout Classification
>
> If you don't have truly massive volumes, then SGD is almost certainly a
> better choice because it is simpler.
>
> If you have more than 10 million training examples *per*model* and
> *after*downsampling* then you should consider alternatives but even up to
> about 50 million training examples, SGD will do very well.  SGD is currently
> also mostly appropriate for sparse feature vectors.
>
> Having multiple categories isn't a big deal.  The simplest solution is to
> train a classifier per category.  There are more advanced arrangements,
> though.  For instance, you can train one classifier per category (the first
> level models), then train another classifier per category where the inputs
> are the outputs of the first level models.  Which techniques will help is
> highly dependent on your particular problem.
>
> On Fri, Jan 14, 2011 at 7:10 AM, Claudia Grieco <grieco@crmpa.unisa.it> wrote:
>
> > Do you think SGD will be a better choice? New documents are added to the
> > training set very often and documents can belong to more than one category
> > (e.g. "sport", "italy")
>
>

Re: Help with Mahout Classification

Posted by Ted Dunning <te...@gmail.com>.

This usually means that something is wrong in the data or the classifier
itself.

Do you have some sample data?
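
Before digging into sample data, a few cheap checks can help tell a data problem from a classifier problem. The following is only a sketch in plain Python, not part of Mahout: the helper name `diagnose`, its arguments, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def diagnose(labels, scores, vectors, threshold=0.5):
    """Summarize labels, model scores, and feature vectors for one binary model.

    labels  -- list of 0/1 training or test labels
    scores  -- model outputs in [0, 1] for the same examples
    vectors -- the feature vectors (lists of numbers) fed to the model
    """
    n = len(labels)
    predicted = [1 if s >= threshold else 0 for s in scores]
    return {
        # How unbalanced is the data itself?
        "label_positive_rate": sum(labels) / n,
        # How often does the model say "positive"?
        "predicted_positive_rate": sum(predicted) / n,
        # A very narrow score range suggests the model is ignoring its inputs.
        "score_min": min(scores),
        "score_max": max(scores),
        # All-zero vectors often point to a bug in feature encoding.
        "empty_vectors": sum(1 for v in vectors if not any(v)),
        # True when every example gets the same predicted class.
        "degenerate": len(set(predicted)) == 1,
    }

report = diagnose(
    labels=[1, 0, 1, 0],
    scores=[0.9, 0.8, 0.95, 0.7],
    vectors=[[1, 0], [0, 0], [1, 1], [0, 1]],
)
print(report)
```

If the labels are reasonably mixed but the predictions are degenerate, the encoding of the feature vectors or the training loop is a more likely culprit than class balance.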

Re: Help with Mahout Classification

Posted by Ted Dunning <te...@gmail.com>.

Balancing can sometimes help, but it is mostly used to improve training
speed: when one class totally dominates the other, down-sampling the dominant
class makes little difference to the results.  Classifiers other than SGD
have more trouble with unbalanced data.
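
For reference, down-sampling the dominant class can be sketched in a few lines of plain Python. This is an illustration only, not Mahout code: the function name `downsample_majority` and its `max_ratio` and `seed` parameters are assumptions made up for the example.

```python
import random

def downsample_majority(rows, labels, max_ratio=1.0, seed=0):
    """Keep every minority example and at most max_ratio times as many
    majority examples, chosen at random. Labels are expected to be 0 or 1."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    keep = int(len(minority) * max_ratio)
    if keep < len(majority):
        majority = rng.sample(majority, keep)
    # Preserve the original order of the surviving examples.
    chosen = sorted(minority + majority)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]

# 5 "sport" documents against 95 "not sport" documents (placeholder rows).
rows = list(range(100))
labels = [1] * 5 + [0] * 95
small_rows, small_labels = downsample_majority(rows, labels, max_ratio=3.0)
print(len(small_rows), sum(small_labels))
```

Note that a model trained on down-sampled data sees a different class ratio than it will meet in production, so its raw scores are shifted toward the minority class and the decision threshold may need adjusting.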
