Posted to user@mahout.apache.org by Bhaskar Ghosh <bj...@yahoo.co.in> on 2010/09/26 18:37:48 UTC

What are the ways to train and run classifiers on text?

Dear All,

I need to classify a bunch of text files, that is, to determine which class each
of these texts falls into.


I have looked through the 20Newsgroups example. I see that the input text
files need to have a particular format:

<class-label> <tab> <unique features (words) associated with the class-label>


But the real question is: how do I get such a pre-processed input file? Do I need
to process the input text files myself to get them into the required format? That
would require extracting the unique words/features from the raw text, in
addition to assigning class labels.

OR

Is there some classifier class that can take raw input files? My input would be
something like:

<class-label1> <file1-text>
<class-label2> <file3-text>
<class-label1> <file2-text>
etc.
 

Thanks
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"

Re: What are the ways to train and run classifiers on text?

Posted by Ted Dunning <te...@gmail.com>.

Drew,

You do recall correctly.  This is a good example to follow for the Naive
Bayes side of the house.

On Sun, Sep 26, 2010 at 1:05 PM, Drew Farris <dr...@apache.org> wrote:

> The PrepareTwentyNewsgroups example converts a bunch of files organized
> into directories into the Bayes input format, iirc.
>

Re: What are the ways to train and run classifiers on text?

Posted by Drew Farris <dr...@apache.org>.

Hi Bhaskar,

Take a look at the latest from svn trunk:
https://svn.apache.org/repos/asf/mahout/trunk/, you'll find the
TrainNewsGroups class in the examples project. It is all pretty new,
so there are no docs on the wiki, but the code is very readable.

If you are interested in working with the Bayes classifiers, take a
look at the classifier.bayes.* package in the examples project. The
PrepareTwentyNewsgroups example converts a bunch of files organized
into directories into the Bayes input format, iirc.
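
For the input format described earlier in this thread (class label, a tab,
then the document's words), the directory-to-file conversion can be sketched
in plain Java. This is only an illustration under assumptions: it is not the
PrepareTwentyNewsgroups code, the names LabeledTextPrep, tokenize, toLine and
convert are made up for the example, and the tokenizer is a simple
lowercase-and-split stand-in rather than a real analyzer.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Mahout. */
final class LabeledTextPrep {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Builds one output line: the label, a tab, then the space-separated tokens. */
    static String toLine(String label, String text) {
        return label + "\t" + String.join(" ", tokenize(text));
    }

    /**
     * Assumes each subdirectory of root is a class label and each regular
     * file inside it is one document; writes one line per document.
     */
    static void convert(Path root, Path output) throws IOException {
        try (PrintWriter out = new PrintWriter(
                 Files.newBufferedWriter(output, StandardCharsets.UTF_8));
             DirectoryStream<Path> labelDirs = Files.newDirectoryStream(root)) {
            for (Path labelDir : labelDirs) {
                if (!Files.isDirectory(labelDir)) {
                    continue; // skip stray files next to the label directories
                }
                String label = labelDir.getFileName().toString();
                try (DirectoryStream<Path> docs = Files.newDirectoryStream(labelDir)) {
                    for (Path doc : docs) {
                        if (!Files.isRegularFile(doc)) {
                            continue;
                        }
                        String text = new String(Files.readAllBytes(doc), StandardCharsets.UTF_8);
                        out.println(toLine(label, text));
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LabeledTextPrep <input-root-dir> <output-file>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

With a layout such as root/sci.space/doc1.txt, running the class with the root
directory and an output file name as arguments should produce one
tab-separated line per document, labeled with its directory name.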

Drew

On Sun, Sep 26, 2010 at 1:17 PM, Bhaskar Ghosh <bj...@yahoo.co.in> wrote:
> Thanks, Ted, but I am unable to find the org.apache.mahout.classifier.sgd
> package. I could only locate the classifier.bayes.* packages.
>

Re: What are the ways to train and run classifiers on text?

Posted by Bhaskar Ghosh <bj...@yahoo.co.in>.

Thanks, Ted, but I am unable to find the org.apache.mahout.classifier.sgd
package. I could only locate the classifier.bayes.* packages.

Thanks
Bhaskar Ghosh

Re: What are the ways to train and run classifiers on text?

Posted by Ted Dunning <te...@gmail.com>.

Take a look also at TrainNewsGroups in the classifier.sgd package in
examples.

That shows how to parse documents for use with an SGD classifier (different
from NaiveBayes).

There is much more format flexibility with an API-oriented approach.
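
To illustrate the kind of flexibility an API gives, here is a rough sketch in
plain Java that reads a line in the "<class-label> <text>" form from the
original question and turns the text into a fixed-size numeric vector, the
sort of input an SGD-style learner consumes. Everything here is an assumption
made for illustration: the thread does not say how TrainNewsGroups encodes
documents, hashing tokens into slots is just one common choice, and
HashedDocEncoder, slot, encode and splitLabel are invented names, not Mahout
classes or methods.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical sketch in plain Java; none of these names come from Mahout. */
final class HashedDocEncoder {
    private final int numFeatures;

    HashedDocEncoder(int numFeatures) {
        if (numFeatures <= 0) {
            throw new IllegalArgumentException("numFeatures must be positive");
        }
        this.numFeatures = numFeatures;
    }

    /** Maps a token to a slot in the range [0, numFeatures). */
    int slot(String token) {
        return Math.floorMod(token.hashCode(), numFeatures);
    }

    /** Counts tokens per slot; different tokens may collide in the same slot. */
    double[] encode(String text) {
        double[] vector = new double[numFeatures];
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                vector[slot(t)] += 1.0;
            }
        }
        return vector;
    }

    /** Splits a "<label> <text>" line at the first run of whitespace. */
    static String[] splitLabel(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        return new String[] { parts[0], parts.length > 1 ? parts[1] : "" };
    }

    public static void main(String[] args) {
        String[] record = splitLabel("sci.space the shuttle launch was delayed");
        double[] vector = new HashedDocEncoder(16).encode(record[1]);
        System.out.println(record[0] + " -> " + Arrays.toString(vector));
    }
}
```

The vector size (16 in the demo) is arbitrary; a larger size means fewer
collisions between unrelated words at the cost of more memory.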
