Posted to user@mahout.apache.org by Neil Ghosh <ne...@gmail.com> on 2010/09/30 16:00:40 UTC

unknown test data twenty-newsgroups example

Hi,

In this example

https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html

The test is done on the already classified input text documents.

My question is: if I want to test unknown documents, do I need them in a
specific format, or can I just keep them (as raw text) in the input folder
while testing?

Thanks and Regards
Neil
http://neilghosh.com

Re: unknown test data twenty-newsgroups example

Posted by Ted Dunning <te...@gmail.com>.
I *love* it that all these people whose faces I wouldn't even know in a
crowd are jumping in and answering questions
before I even see the question.

Way to go Alex and Federico!

On Thu, Oct 21, 2010 at 10:40 AM, Alexander Hans <al...@ahans.de> wrote:

> Hi Neil,
>
> this is a JUnit test, so you need JUnit to run it. I'm using Eclipse IDE
> with the JUnit plugin and run it from there. From the commandline you can
> use the TestRunner, for instance. See
>
> http://www.junit.org/apidocs/junit/textui/TestRunner.html
>
> You will have to make sure that you have a JUnit jar file in your
> classpath.
>
>
> HTH,
>
> Alex
>
>

Re: unknown test data twenty-newsgroups example

Posted by Alexander Hans <al...@ahans.de>.
Hi Neil,

This is a JUnit test, so you need JUnit to run it. I'm using the Eclipse IDE
with the JUnit plugin and run it from there. From the command line you can
use the TestRunner, for instance. See

http://www.junit.org/apidocs/junit/textui/TestRunner.html

You will have to make sure that you have a JUnit jar file in your classpath.


HTH,

Alex


> Thanks Federico and Robin, got it now.
> Does anybody know the command-line parameters for running this?
>



Re: unknown test data twenty-newsgroups example

Posted by Neil Ghosh <ne...@gmail.com>.
Thanks Federico and Robin, got it now.
Does anybody know the command-line parameters for running this?

On Thu, Oct 21, 2010 at 11:02 PM, Federico Castanedo
<fc...@inf.uc3m.es>wrote:

> Hello Neil,
>
> The file is here:
>
> /core/src/test/java/org/apache/mahout/classifier/bayes
>
> Regards
>
>



-- 
Thanks and Regards
Neil
http://neilghosh.com

Re: unknown test data twenty-newsgroups example

Posted by Federico Castanedo <fc...@inf.uc3m.es>.
Hello Neil,

The file is here:

/core/src/test/java/org/apache/mahout/classifier/bayes

Regards


2010/10/21 Neil Ghosh <ne...@gmail.com>:
> Thanks Drew
> I could not find the file
>
> http://svn.apache.org/repos/asf/mahout/trunk/core/src/test/java/org/apache/mahout/classifier/bayes/BayesClassifierSelfTest.java
>
> In my mahout trunk in this directory
>
> neil@neil-laptop:~/trunk/core/src/main/java/org/apache/mahout/classifier/bayes$
> ll
> total 92
> drwxr-xr-x 11 neil neil  4096 2010-09-19 12:15 ./
> drwxr-xr-x  7 neil neil  4096 2010-09-19 12:15 ../
> drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 algorithm/
> drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 common/
> drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 datastore/
> drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 exceptions/
> drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 interfaces/
> drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 io/
> drwxr-xr-x  6 neil neil  4096 2010-09-19 12:15 mapreduce/
> drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 model/
> -rw-r--r--  1 neil neil  8249 2010-09-19 12:15 MultipleOutputFormat.java
> -rw-r--r--  1 neil neil  1441 2010-09-19 12:15 MultipleTextOutputFormat.java
> -rw-r--r--  1 neil neil  4133 2010-09-19 12:15 package.html
> drwxr-xr-x  6 neil neil  4096 2010-09-19 12:15 .svn/
> -rw-r--r--  1 neil neil 13066 2010-09-19 12:15 TestClassifier.java
> -rw-r--r--  1 neil neil  7660 2010-09-19 12:15 TrainClassifier.java
>
> Am I looking at the correct directory ?
> Any reference how to run this ?
>

Re: unknown test data twenty-newsgroups example

Posted by Neil Ghosh <ne...@gmail.com>.
Thanks Drew.
I could not find the file

http://svn.apache.org/repos/asf/mahout/trunk/core/src/test/java/org/apache/mahout/classifier/bayes/BayesClassifierSelfTest.java

in my Mahout trunk, in this directory:

neil@neil-laptop:~/trunk/core/src/main/java/org/apache/mahout/classifier/bayes$ ll
total 92
drwxr-xr-x 11 neil neil  4096 2010-09-19 12:15 ./
drwxr-xr-x  7 neil neil  4096 2010-09-19 12:15 ../
drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 algorithm/
drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 common/
drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 datastore/
drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 exceptions/
drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 interfaces/
drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 io/
drwxr-xr-x  6 neil neil  4096 2010-09-19 12:15 mapreduce/
drwxr-xr-x  3 neil neil  4096 2010-09-19 12:15 model/
-rw-r--r--  1 neil neil  8249 2010-09-19 12:15 MultipleOutputFormat.java
-rw-r--r--  1 neil neil  1441 2010-09-19 12:15 MultipleTextOutputFormat.java
-rw-r--r--  1 neil neil  4133 2010-09-19 12:15 package.html
drwxr-xr-x  6 neil neil  4096 2010-09-19 12:15 .svn/
-rw-r--r--  1 neil neil 13066 2010-09-19 12:15 TestClassifier.java
-rw-r--r--  1 neil neil  7660 2010-09-19 12:15 TrainClassifier.java

Am I looking at the correct directory?
Is there any reference on how to run this?

On Thu, Sep 30, 2010 at 11:58 PM, Drew Farris <dr...@apache.org> wrote:

> On Thu, Sep 30, 2010 at 10:00 AM, Neil Ghosh <ne...@gmail.com> wrote:
> >
> > My question is: if I want to test unknown documents, do I need them in a
> > specific format, or can I just keep them (as raw text) in the input folder
> > while testing?
>
> If I interpret your question correctly, you're saying "I've trained my
> classifier and tested it, now how do I use it in production?". I don't
> know that this is covered by the example.
>
> The unit test, in core/src/test/java --
> org.apache.mahout.classifier.bayes.BayesClassifierSelfTest provides a
> potentially useful example. Take a look at the testSelfTestBayes()
> method.
>
> In general, the operations involved include:
>   Create an instance of Algorithm and Datastore, configured as appropriate.
>   Create an instance of ClassifierContext (named classifier) using
>   the Algorithm and Datastore, then call initialize() on the context.
>   Generate tokens from your input document (either individual words
>   or ngrams, depending on how the data used to train the model was
>   processed).
>   Call classifier.classifyDocument(String[] tokens, String
>   defaultCat); this will return a ClassifierResult containing the top
>   classifications for the input document, ranked by score.
>
> HTH,
>
> Drew
>



-- 
Thanks and Regards
Neil
http://neilghosh.com

Re: unknown test data twenty-newsgroups example

Posted by Drew Farris <dr...@apache.org>.
On Thu, Sep 30, 2010 at 10:00 AM, Neil Ghosh <ne...@gmail.com> wrote:
>
> My question is: if I want to test unknown documents, do I need them in a
> specific format, or can I just keep them (as raw text) in the input folder
> while testing?

If I interpret your question correctly, you're saying "I've trained my
classifier and tested it, now how do I use it in production?". I don't
know that this is covered by the example.

The unit test, in core/src/test/java --
org.apache.mahout.classifier.bayes.BayesClassifierSelfTest provides a
potentially useful example. Take a look at the testSelfTestBayes()
method.

In general, the operations involved include:
   Create an instance of Algorithm and Datastore, configured as appropriate.
   Create an instance of ClassifierContext (named classifier) using
   the Algorithm and Datastore, then call initialize() on the context.
   Generate tokens from your input document (either individual words
   or ngrams, depending on how the data used to train the model was
   processed).
   Call classifier.classifyDocument(String[] tokens, String
   defaultCat); this will return a ClassifierResult containing the top
   classifications for the input document, ranked by score.
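
The token-generation step is the one that has to match training, so here is
a minimal sketch of it in plain Java. It does not call Mahout at all:
TokenSketch, words() and ngrams() are hypothetical names, and the
lower-casing, the split on non-alphanumeric characters and the
space-joined ngrams are all assumptions. Whatever was done to the training
data is what should be done here; the resulting String[] is the kind of
array the classifyDocument() call above expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Mahout: turns raw text into word or ngram tokens. */
class TokenSketch {

    /** Lower-cases the text and splits it on runs of non-alphanumeric characters (an assumption). */
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {   // a leading delimiter yields one empty string
                out.add(w);
            }
        }
        return out;
    }

    /** Returns every run of n consecutive words, joined by a single space; n = 1 gives plain words. */
    static String[] ngrams(String text, int n) {
        List<String> w = words(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= w.size(); i++) {
            out.add(String.join(" ", w.subList(i, i + n)));
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Print the bigrams of a sample sentence, one per line.
        for (String token : ngrams("The quick, brown fox!", 2)) {
            System.out.println(token);
        }
    }
}
```

If the model was trained on single words, ngrams(text, 1) would be the
array to pass; for a model trained with a larger gram size, use the same
size here.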

HTH,

Drew

Re: unknown test data twenty-newsgroups example

Posted by Ted Dunning <te...@gmail.com>.
A very good practice is to use a data set like this:
http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Segregating by date avoids problems with duplicate documents appearing in
both training and test.  It also gives you a standard split so that you can
compare to other people's results.

On Thu, Sep 30, 2010 at 7:00 AM, Neil Ghosh <ne...@gmail.com> wrote:

> Hi,
>
> In this example
>
> https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html
>
> The test is done on the already classified input text documents.
>
> My question is: if I want to test unknown documents, do I need them in a
> specific format, or can I just keep them (as raw text) in the input folder
> while testing?
>
> Thanks and Regards
> Neil
> http://neilghosh.com
>

Re: unknown test data twenty-newsgroups example

Posted by Bhaskar Ghosh <bj...@yahoo.co.in>.
Thanks Ted, Robin, and Neil. My doubts are cleared up now, and I will try the
approach.
Regards
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"




________________________________
From: Ted Dunning <te...@gmail.com>
To: user@mahout.apache.org
Cc: Bhaskar Ghosh <bj...@yahoo.co.in>; neil.ghosh@gmail.com
Sent: Sat, 2 October, 2010 12:11:53 AM
Subject: Re: unknown test data twenty-newsgroups example


Yes.  Instance = training example.

Your method of duplicating lines is just what Robin meant.


>



Re: unknown test data twenty-newsgroups example

Posted by Ted Dunning <te...@gmail.com>.
Yes.  Instance = training example.

Your method of duplicating lines is just what Robin meant.

On Fri, Oct 1, 2010 at 3:55 AM, Robin Anil <ro...@gmail.com> wrote:

> > Let me list what I understood. Pl confirm if I got it correct?
> >
> > Add duplicate extra lines many times in an extra file (conforming to the
> > format required by the Bayes Classifier) in the format
> > <class-name1><tab><word1> <word2>
> > If I want to increase the weight of word1 and word2, so that text with
> > those words have higher chance of getting classified as <class-name1>
> >
> > *
> > *
> >
> No. Duplicating lines increases DF and therefore decreases (IDF == inverse
> document frequency) So weight goes down. To increase weight of the word
> repeat the word in the same line
>
>
> Regards
> Robin
>

Re: unknown test data twenty-newsgroups example

Posted by Robin Anil <ro...@gmail.com>.
> Let me list what I understood. Pl confirm if I got it correct?
>
> Add duplicate extra lines many times in an extra file (conforming to the
> format required by the Bayes Classifier) in the format
> <class-name1><tab><word1> <word2>
> If I want to increase the weight of word1 and word2, so that text with
> those words have higher chance of getting classified as <class-name1>
>
> *
> *
>
No. Duplicating lines increases DF (document frequency) and therefore
decreases IDF (inverse document frequency), so the weight goes down. To
increase the weight of a word, repeat the word in the same line.
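
To make the DF/TF effect concrete, here is a small sketch using the
textbook weight tf * ln(N / df). That formula is an assumption for
illustration only; the weighting inside the Bayes trainer may be normalized
differently. TfIdfSketch and its methods are hypothetical names, and each
string stands for the words part of one training line (label omitted).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of classic tf-idf; not Mahout's own weighting code. */
class TfIdfSketch {

    /** Number of times term occurs among the whitespace-separated tokens of doc. */
    static int tf(String doc, String term) {
        int count = 0;
        for (String token : doc.trim().split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Number of documents in the corpus that contain term at least once. */
    static int df(List<String> corpus, String term) {
        int count = 0;
        for (String doc : corpus) {
            if (tf(doc, term) > 0) {
                count++;
            }
        }
        return count;
    }

    /** Classic weight tf * ln(N / df); returns 0 if the term never occurs. */
    static double weight(List<String> corpus, String doc, String term) {
        int df = df(corpus, term);
        if (df == 0) {
            return 0.0;
        }
        return tf(doc, term) * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<String> base = Arrays.asList(
                "hockey puck ice", "bat ball pitch", "ice rink skate", "ball glove");

        // Variant 1: the whole line is duplicated, so N and df both grow by one.
        List<String> duplicatedLine = new ArrayList<>(base);
        duplicatedLine.add("hockey puck ice");

        // Variant 2: the word is repeated inside the same line, so only tf grows.
        List<String> repeatedWord = new ArrayList<>(base);
        repeatedWord.set(0, "hockey puck puck ice");

        System.out.println("base:            " + weight(base, base.get(0), "puck"));
        System.out.println("duplicated line: " + weight(duplicatedLine, duplicatedLine.get(0), "puck"));
        System.out.println("repeated word:   " + weight(repeatedWord, repeatedWord.get(0), "puck"));
    }
}
```

Under this formula the duplicated-line variant scores lower than the base
case and the repeated-word variant scores higher.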


Regards
Robin

Re: unknown test data twenty-newsgroups example

Posted by Bhaskar Ghosh <bj...@yahoo.co.in>.
Hi Robin/Neil,

I was also trying the 20Newsgroups example, and was following your conversation. 
I am confused now with the use of the word 'instance'.
I actually could not get the meaning of these lines:

> extra file or extra line, duplicated instances (to decrease the weights) or
> duplicate feature in the same instance to increase the weights (classic
> tf-idf)
 
Let me list what I understood. Please confirm whether I got it right.

Add duplicate extra lines many times in an extra file (conforming to the
format required by the Bayes classifier) in the format
<class-name1><tab><word1> <word2>
if I want to increase the weight of word1 and word2, so that text with those
words has a higher chance of getting classified as <class-name1>.
Thanks
Bhaskar Ghosh
Hyderabad, India

http://www.google.com/profiles/bjgindia

"Ignorance is Bliss... Knowledge never brings Peace!!!"




________________________________
From: Robin Anil <ro...@gmail.com>
To: neil.ghosh@gmail.com
Cc: user@mahout.apache.org
Sent: Thu, 30 September, 2010 9:59:47 PM
Subject: Re: unknown test data twenty-newsgroups example

On Thu, Sep 30, 2010 at 9:45 PM, Neil Ghosh <ne...@gmail.com> wrote:
>
> Do you mean , I should 1st create the model with correct data in correct
> folder (Label).
>
>
Now you throw an instance at it and you will get the correct label, well
most of the time.



Re: unknown test data twenty-newsgroups example

Posted by Robin Anil <ro...@gmail.com>.
On Thu, Sep 30, 2010 at 9:45 PM, Neil Ghosh <ne...@gmail.com> wrote:
>
> Do you mean , I should 1st create the model with correct data in correct
> folder (Label).
>
>
Now you throw an instance at it and you will get the correct label, well
most of the time.

Re: unknown test data twenty-newsgroups example

Posted by Neil Ghosh <ne...@gmail.com>.
Do you mean I should first create the model with correct data in the correct
folder (label)?

Then randomly distribute the raw text files among two folders and
generate the input data?

And then run the tester on the mis-labelled data?

On Thu, Sep 30, 2010 at 9:37 PM, Robin Anil <ro...@gmail.com> wrote:

> You may split the dataset in 80/20 or some other ratio and try. You can
> split them after you have created the data in Bayes classifier format or
> split it into different folders and make them as described in the
> documentation.
>
>
> Robin
>
>
> On Thu, Sep 30, 2010 at 7:30 PM, Neil Ghosh <ne...@gmail.com> wrote:
>
>> Hi,
>>
>> In this example
>>
>> https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html
>>
>> The test is done on the already classified input text documents.
>>
>> My question is: if I want to test unknown documents, do I need them in a
>> specific format, or can I just keep them (as raw text) in the input folder
>> while testing?
>>
>> Thanks and Regards
>> Neil
>> http://neilghosh.com
>>
>
>


-- 
Thanks and Regards
Neil
http://neilghosh.com

Re: unknown test data twenty-newsgroups example

Posted by Robin Anil <ro...@gmail.com>.
You can split the dataset 80/20, or in some other ratio, and try that. Either
split it after you have created the data in Bayes classifier format, or
split the raw files into different folders first and prepare each one as
described in the documentation.
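
A split like that could be sketched as below. This is only an
illustration: SplitSketch and split() are hypothetical names, the items
can be file names or already-formatted training lines, and the fixed seed
is just there so the same split can be reproduced later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of Mahout: random train/test split of a list. */
class SplitSketch {

    /**
     * Shuffles a copy of items with the given seed and cuts it at trainFraction.
     * Returns a two-element list: index 0 is the training part, index 1 the test part.
     */
    static <T> List<List<T>> split(List<T> items, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(items);   // leave the caller's list untouched
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc" + i + ".txt");   // placeholder file names
        }
        List<List<String>> parts = split(docs, 0.8, 42L);
        System.out.println("train: " + parts.get(0));
        System.out.println("test:  " + parts.get(1));
    }
}
```

Running the split once per label folder, rather than over all documents
at once, should keep the class proportions roughly equal in both parts.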


Robin

On Thu, Sep 30, 2010 at 7:30 PM, Neil Ghosh <ne...@gmail.com> wrote:

> Hi,
>
> In this example
>
> https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html
>
> The test is done on the already classified input text documents.
>
> My question is: if I want to test unknown documents, do I need them in a
> specific format, or can I just keep them (as raw text) in the input folder
> while testing?
>
> Thanks and Regards
> Neil
> http://neilghosh.com
>