Posted to dev@mahout.apache.org by Joe Kumar <jo...@gmail.com> on 2010/09/15 06:56:09 UTC

Options in TrainClassifier.java

Hi all,

As I was going through the wikipedia example, I noticed that in
TrainClassifier some options that have documented default values are
actually marked as mandatory.
The documentation / command line help says that

   1. the default source (--datasource) is hdfs, but TrainClassifier
   has withRequired(true) while building the --datasource option. The code
   only checks whether the dataSourceType is hbase and otherwise sets it
   to hdfs, so ideally withRequired should be set to false
   2. the default --classifierType is bayes, but withRequired is set to
   true, and we have code like

if ("bayes".equalsIgnoreCase(classifierType)) {
  log.info("Training Bayes Classifier");
  trainNaiveBayes(inputPath, outputPath, params);
} else if ("cbayes".equalsIgnoreCase(classifierType)) {
  log.info("Training Complementary Bayes Classifier");
  // setup the HDFS and copy the files there, then run the trainer
  trainCNaiveBayes(inputPath, outputPath, params);
}

which should be changed to

if ("cbayes".equalsIgnoreCase(classifierType)) {
  log.info("Training Complementary Bayes Classifier");
  trainCNaiveBayes(inputPath, outputPath, params);
} else {
  log.info("Training Bayes Classifier");
  // setup the HDFS and copy the files there, then run the trainer
  trainNaiveBayes(inputPath, outputPath, params);
}
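
With that change, bayes becomes the effective default whenever --classifierType
is absent or unrecognized. A tiny stand-alone sketch of that dispatch
(pickTrainer is a hypothetical stand-in for illustration, not Mahout's actual
code):

```java
public class ClassifierTypeDefault {

  // Illustrative stand-in for the dispatch in TrainClassifier: only an
  // explicit "cbayes" selects the complementary trainer; anything else,
  // including a missing option value, falls back to plain "bayes".
  static String pickTrainer(String classifierType) {
    if ("cbayes".equalsIgnoreCase(classifierType)) {
      return "cbayes";
    } else {
      return "bayes";
    }
  }

  public static void main(String[] args) {
    System.out.println(pickTrainer("cbayes")); // cbayes
    System.out.println(pickTrainer("BAYES"));  // bayes (case-insensitive)
    System.out.println(pickTrainer(null));     // bayes
  }
}
```

Because the string literal is the receiver of equalsIgnoreCase, a null
(unset) option value is handled without a NullPointerException, which is
what lets withRequired(false) be safe here.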

Please let me know if this looks valid and I'll submit a patch for a JIRA
issue.

reg
Joe.

Re: Options in TrainClassifier.java

Posted by deneche abdelhakim <ad...@gmail.com>.
I don't know if it's related, but I remember getting a similar
exception a year ago when I was working on the implementation of
Random Forests. In my case it was caused by
SequenceFile.Sorter.merge(). I ended up writing my own merge function
because I really didn't need the output sorted.
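
For what it's worth, the difference between a sorting merge like
SequenceFile.Sorter.merge() and the kind of merge that suffices when output
order doesn't matter can be sketched without Hadoop at all (illustrative
Java only, not the actual Mahout/Hadoop code):

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatMerge {

  // Merge several already-written "part files" (modeled here as lists of
  // records) by plain concatenation: no comparator, no sort pass, just a
  // single O(total records) append.
  static <T> List<T> mergeUnsorted(List<List<T>> parts) {
    List<T> merged = new ArrayList<>();
    for (List<T> part : parts) {
      merged.addAll(part);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<List<String>> parts = List.of(
        List.of("rec3", "rec1"),
        List.of("rec2"));
    // Order within each part is preserved; no global ordering is imposed.
    System.out.println(mergeUnsorted(parts)); // [rec3, rec1, rec2]
  }
}
```

Skipping the sort is exactly the saving: a sorting merge has to compare and
interleave records, while concatenation only copies them through.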


Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

Just to eliminate the usual suspects: I am using Mac OS X 10.5.8, Mahout 0.4
(revision 986659), Hadoop 0.20.2, 2GB memory for Hadoop, and 80GB of free
space. These are the commands that I executed.

I had issues with my namenode, so I did a format using hadoop namenode
-format.
$MAHOUT_HOME/examples/src/test/resources/country.txt had just 1 entry
(spain). I haven't tried with multiple entries.

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d
$MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles10.xml -o
wikipedia/chunks -c 64

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
wikipedia/chunks -o wikipediainput -c
$MAHOUT_HOME/examples/src/test/resources/country.txt

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o
wikipediamodel  -type bayes -source hdfs

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
 wikipediainput  -ng 3 -type bayes -source hdfs

Please try the above and let me know. We'll try to find out what is going
wrong.
Reg,
Joe.

On Sun, Sep 19, 2010 at 11:13 PM, Gangadhar Nittala <npk.gangadhar@gmail.com
> wrote:

> Joe,
> Even I tried with reducing the number of countries in the country.txt.
> That didn't help. And in my case, I was monitoring the disk space and
> at no time did it reach 0%. So, I am not sure if that is the case. To
> remove the dependency on the number of countries, I even tried with
> the subjects.txt as the classification - that also did not help.
> I think this problem is due to the type of the data being processed,
> but what I am not sure of is what I need to change to get the data to
> be processed successfully.
>
> The experienced folks on Mahout will be able to tell us what is missing I
> guess.
>
> Thank you
> Gangadhar
>
> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> > Gangadhar,
> >
> > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just
> have
> > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > wikipediainput data set and then ran TrainClassifier and it worked. when
> I
> > ran TestClassifier as below, I got blank results in the output.
> >
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> >  wikipediainput  -ng 3 -type bayes -source hdfs
> >
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :          0         ?%
> > Incorrectly Classified Instances        :          0         ?%
> > Total Classified Instances              :          0
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a     <--Classified as
> > 0     |  0     a     = spain
> > Default Category: unknown: 1
> >
> > I am not sure if I am doing something wrong.. have to figure out why my
> o/p
> > is so blank.
> > I'll document these steps and mention about country.txt in the wiki.
> >
> > Question to all
> > Should we have 2 country.txt
> >
> >   1. country_full_list.txt - this is the existing list
> >   2. country_sample_list.txt - a list with 2 or 3 countries
> >
> > To get a flavor of the wikipedia bayes example, we can use
> > country_sample.txt. When new people want to just try out the example,
> they
> > can reference this txt file  as a parameter.
> > To run the example in a robust scalable infrastructure, we could use
> > country_full_list.txt.
> > any thots ?
> >
> > regards
> > Joe.
> >
> > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:
> >
> >> Gangadhar,
> >>
> >> After running TrainClassifier again, the map task just failed with the
> same
> >> exception and I am pretty sure it is an issue with disk space.
> >> As the map was progressing, I was monitoring my free disk space dropping
> >> from 81GB. It came down to 0 after almost 66% through the map task and
> then
> >> the exception happened. After the exception, another map task was
> resuming
> >> at 33% and I got close to 15GB free space (i guess the first map task
> freed
> >> up some space) and I am sure they would drop down to zero again and
> throw
> >> the same exception.
> >> I am going to modify the country.txt to just 1 country and recreate
> >> wikipediainput and run TrainClassifier. Will let you know how it goes..
> >>
> >> Do we have any benchmarks / system requirements for running this example
> ?
> >> Has anyone else had success running this example anytime. Would
> appreciate
> >> your inputs / thots.
> >>
> >> Should we look at tuning the code for handling these situations ? Any
> quick
> >> suggestions on where to start looking at ?
> >>
> >> regards,
> >> Joe.
> >>
> >>
> >>
> >>
> >
>

Re: Options in TrainClassifier.java

Posted by Ted Dunning <te...@gmail.com>.
There is a test program called TrainNewsGroups
in org.apache.mahout.classifier.sgd in the examples module.

I would love to work with you to get better documentation pulled together.

On Mon, Sep 20, 2010 at 8:13 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> Joe,
> I will try with the ngram setting of 1 and let you know how it goes.
> Robin, the ngram parameter is used to check the number of subsequences
> of characters isn't it ? Or is it evaluated differently w.r.t to the
> Bayesian classifier ?
>
> Ted, like Joe mentioned, if you could point us to some information on
> SGD we could try it and report back the results to the list.
>
> Thank you
> Gangadhar
>
> On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <jo...@gmail.com> wrote:
> > Robin / Gangadhar,
> > With ngram as 1 and all the countries in the country.txt , the model is
> > getting created without any issues.
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i
> wikipediainput
> > -o wikipediamodel -type bayes -source hdfs
> >
> > Robin,
> > Even for ngram parameter, the default value is mentioned as 1 but it is
> set
> > as a mandatory parameter in TrainClassifier. so i'll modify the code to
> set
> > the default ngram as 1 and make it as a non mandatory param.
> >
> > That aside, When I try to test the model, the summary is getting printed
> > like below.
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :          0         ?%
> > Incorrectly Classified Instances        :          0         ?%
> > Total Classified Instances              :          0
> > Need to figure out the reason..
> >
> > Since TestClassifier also has the same params and settings like
> > TrainClassifier, can i modify it to set the default values for ngram,
> > classifierType & dataSource ?
> >
> > reg,
> > Joe.
> >
> > On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <jo...@gmail.com> wrote:
> >
> >> Robin,
> >>
> >> Thanks for your tip.
> >> Will try it out and post updates.
> >>
> >> reg
> >> Joe.
> >>
> >>
> >> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com>
> wrote:
> >>
> >>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st.
> You
> >>> need atleast 2 countries. otherwise there is no classification.
> Secondly
> >>> ngram =3 is a bit too high. With wikipedia this will result in a huge
> >>> number
> >>> of features. Why dont you try with one and see.
> >>>
> >>> Robin
> >>>
> >>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com>
> wrote:
> >>>
> >>> > Hi Ted,
> >>> >
> >>> > sure. will keep digging..
> >>> >
> >>> > About SGD, I dont have an idea about how it works et al. If there is
> >>> some
> >>> > documentation / reference / quick summary to read about it that'll be
> >>> gr8.
> >>> > Just saw one reference in
> >>> >
> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >>> >
> >>> > I am assuming we should be able to create a model from wikipedia
> >>> articles
> >>> > and label the country of a new article. If so, could you please
> provide
> >>> a
> >>> > note on how to do this. We already have the wikipedia data being
> >>> extracted
> >>> > for specific countries using WikipediaDatasetCreatorDriver. How do we
> go
> >>> > about training the classifier using SGD ?
> >>> >
> >>> > thanks for your help,
> >>> > Joe.
> >>> >
> >>> >
> >>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <ted.dunning@gmail.com
> >
> >>> > wrote:
> >>> >
> >>> > > I am watching these efforts with interest, but have been unable to
> >>> > > contribute much to the process.  I would encourage Joe and others
> to
> >>> keep
> >>> > > whittling this problem down so that we can understand what is
> causing
> >>> it.
> >>> > >
> >>> > > In the meantime, I think that the SGD classifiers are close to
> >>> production
> >>> > > quality.  For problems with less than several million training
> >>> examples,
> >>> > > and
> >>> > > especially problems with many sparse features, I think that these
> >>> > > classifiers might be easier to get started with than the Naive
> Bayes
> >>> > > classifiers.  To make a virtue of a defect, the SGD based
> classifiers
> >>> to
> >>> > > not
> >>> > > use Hadoop for training.  This makes deployment of a classification
> >>> > > training
> >>> > > workflow easier, but limits the total size of data that can be
> >>> handled.
> >>> > >
> >>> > > What would you guys need to get started with trying these
> alternative
> >>> > > models?
> >>> > >

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Ted,

I've added the patch MAHOUT-509_1.patch in Jira [
https://issues.apache.org/jira/browse/MAHOUT-509 ] .

Thank you

On Thu, Oct 7, 2010 at 12:57 PM, Ted Dunning <te...@gmail.com> wrote:
> Can you attach the patch there?  The mailing list strips attachments.
>
> On Wed, Oct 6, 2010 at 9:22 PM, Gangadhar Nittala
> <np...@gmail.com>wrote:
>
>> I have attached a patch which has the modified testclassifier.props
>> and the fix with the parseInt. I think both these belong to
>> MAHOUT-509
>>
>

Re: Options in TrainClassifier.java

Posted by Ted Dunning <te...@gmail.com>.
Can you attach the patch there?  The mailing list strips attachments.

On Wed, Oct 6, 2010 at 9:22 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> I have attached a patch which has the modified testclassifier.props
> and the fix with the parseInt. I think both these belong to
> MAHOUT-509
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe / others,

I was finally able to test the changes that were done as part of
MAHOUT-509 [ https://issues.apache.org/jira/browse/MAHOUT-509 ] and to
follow the instructions in the wiki for the Bayes example [
https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
]. The instructions in the wiki work only if testclassifier.props has
values for the required options. Otherwise, the user needs to provide
values on the command line for the datasource, classifiertype and the
n-gram size. TestClassifier executed and printed a large matrix of
values (though I still don't know how to interpret the results :) )
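
For reference, the props file route needs entries for those three options
plus the gram size. A purely hypothetical sketch (the key names here are
guesses based on the long option names in this thread, not verified against
the file Mahout ships):

```properties
# Hypothetical testclassifier.props entries -- key names are assumptions.
# Keep values free of surrounding whitespace to avoid the parseInt issue.
gramSize=1
classifierType=bayes
dataSource=hdfs
```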

Also, I found a minor problem in TestClassifier.java, wherein
Integer.parseInt is called on a command line option as read. If there
are any leading / trailing spaces in testclassifier.props, this
results in a NumberFormatException. The attached patch does a trim on
the string before doing the parseInt.
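
The failure mode and the fix are easy to demonstrate in isolation:
Integer.parseInt does not tolerate surrounding whitespace, so trimming
first is enough. A stand-alone sketch (parseGramSize is a hypothetical
helper, not the actual TestClassifier code):

```java
public class TrimBeforeParse {

  // The patched behavior: trim the raw option value before parsing it.
  static int parseGramSize(String raw) {
    return Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    String fromProps = " 1 "; // value with stray spaces, as read from a .props file

    // The unpatched call rejects surrounding whitespace outright.
    try {
      Integer.parseInt(fromProps);
      System.out.println("unpatched: parsed");
    } catch (NumberFormatException e) {
      System.out.println("unpatched: NumberFormatException");
    }

    // The patched call parses the same value fine.
    System.out.println("patched: " + parseGramSize(fromProps));
  }
}
```

Run as-is, the unpatched call throws NumberFormatException while
parseGramSize returns 1.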

I have attached a patch which has the modified testclassifier.props
and the fix with the parseInt. I think both these belong to
MAHOUT-509. If you think the wiki can be modified to include the
parameters instead of having settings in a .props file (preferring
clarity for the user over ease of use), then I can modify the wiki
instructions and remove the .props file from the patch.

The fix for TestClassifier.java, though, I think is required: it
sanitizes the user input.

I am not sure what the preferred approach is for providing patches for
a resolved issue. Should I create a new issue just for this, or would
it be easier to add this patch to the existing issue itself? Please
let me know and I shall create a new issue and attach the modified
patch file to it.

Thank you
Gangadhar
p.s: I named the patch file with an underscore as the existing issue
already has a MAHOUT-509.patch

On Sun, Sep 26, 2010 at 9:28 AM, Gangadhar Nittala
<np...@gmail.com> wrote:
> Joe,
> I am out of town for this week and won't have access to my machine. I
> will check this during the weekend and will get back to you. Will
> follow the steps in the wiki.
>
> Thank you
>
> On Fri, Sep 24, 2010 at 8:44 AM, Joe Kumar <jo...@gmail.com> wrote:
>> Hi Gangadhar,
>>
>> I ran TestClassifier with similar parameters. It didnt take me 2 hrs though.
>>
>> I have documented the steps that worked for me at
>> https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
>> Can you please get the patch available at MAHOUT-509 and apply it and then
>> try the steps in the wiki.
>> Please let me know if you still face issues.
>>
>> reg
>> Joe.
>>
>>
>> On Thu, Sep 23, 2010 at 10:43 PM, Gangadhar Nittala <npk.gangadhar@gmail.com
>>> wrote:
>>
>>> Joe,
>>> Can you let me know what was the command you used to test the
>>> classifier ? With the ngrams set to 1 as suggested by Robin, I was
>>> able to train the classifier. The command:
>>> $HADOOP_HOME/bin/hadoop jar
>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1
>>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>>> bayes --dataSource hdfs
>>>
>>> After this, as per the wiki, we need to get the data from HDFS. I did that
>>> <HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10
>>>
>>> After this, the classifier is to be tested:
>>> $HADOOP_HOME/bin/hadoop jar
>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10
>>> -d wikipediainput10  -ng 1 -type bayes -source hdfs
>>>
>>> When I run this, this runs for close to 2 hours and after 2 hours, it
>>> errors out with a java.io.FileException saying that the logs_ is a
>>> directory in the wikipediainput10 folder. I am sorry I can't provide
>>> the stack trace right now because I accidentally closed the terminal
>>> window before I could copy it. I will run this again and send the
>>> stack trace.
>>>
>>> But, if you can send me the steps that you followed after running the
>>> classifier, I can repeat those and see if I am able to successfully
>>> execute the classifier.
>>>
>>> Thank you
>>> Gangadhar
>>>
>>>
>>> >>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>>> >>>> wikipediamodel
>>> >>>> > -d
>>> >>>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>>> >>>> > > > >
>>> >>>> > > > > Summary
>>> >>>> > > > > -------------------------------------------------------
>>> >>>> > > > > Correctly Classified Instances          :          0
>>> ?%
>>> >>>> > > > > Incorrectly Classified Instances        :          0
>>> ?%
>>> >>>> > > > > Total Classified Instances              :          0
>>> >>>> > > > >
>>> >>>> > > > > =======================================================
>>> >>>> > > > > Confusion Matrix
>>> >>>> > > > > -------------------------------------------------------
>>> >>>> > > > > a     <--Classified as
>>> >>>> > > > > 0     |  0     a     = spain
>>> >>>> > > > > Default Category: unknown: 1
>>> >>>> > > > >
>>> >>>> > > > > I am not sure if I am doing something wrong.. have to figure
>>> out
>>> >>>> why
>>> >>>> > my
>>> >>>> > > > o/p
>>> >>>> > > > > is so blank.
>>> >>>> > > > > I'll document these steps and mention about country.txt in the
>>> >>>> wiki.
>>> >>>> > > > >
>>> >>>> > > > > Question to all
>>> >>>> > > > > Should we have 2 country.txt
>>> >>>> > > > >
>>> >>>> > > > >   1. country_full_list.txt - this is the existing list
>>> >>>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>>> >>>> > > > >
>>> >>>> > > > > To get a flavor of the wikipedia bayes example, we can use
>>> >>>> > > > > country_sample.txt. When new people want to just try out the
>>> >>>> example,
>>> >>>> > > > they
>>> >>>> > > > > can reference this txt file  as a parameter.
>>> >>>> > > > > To run the example in a robust scalable infrastructure, we
>>> could
>>> >>>> use
>>> >>>> > > > > country_full_list.txt.
>>> >>>> > > > > any thots ?
>>> >>>> > > > >
>>> >>>> > > > > regards
>>> >>>> > > > > Joe.
>>> >>>> > > > >
>>> >>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <
>>> joekumar@gmail.com>
>>> >>>> > wrote:
>>> >>>> > > > >
>>> >>>> > > > >> Gangadhar,
>>> >>>> > > > >>
>>> >>>> > > > >> After running TrainClassifier again, the map task just failed
>>> >>>> with
>>> >>>> > the
>>> >>>> > > > same
>>> >>>> > > > >> exception and I am pretty sure it is an issue with disk
>>> space.
>>> >>>> > > > >> As the map was progressing, I was monitoring my free disk
>>> space
>>> >>>> > > dropping
>>> >>>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>>> >>>> task
>>> >>>> > and
>>> >>>> > > > then
>>> >>>> > > > >> the exception happened. After the exception, another map task
>>> was
>>> >>>> > > > resuming
>>> >>>> > > > >> at 33% and I got close to 15GB free space (i guess the first
>>> map
>>> >>>> > task
>>> >>>> > > > freed
>>> >>>> > > > >> up some space) and I am sure they would drop down to zero
>>> again
>>> >>>> and
>>> >>>> > > > throw
>>> >>>> > > > >> the same exception.
>>> >>>> > > > >> I am going to modify the country.txt to just 1 country and
>>> >>>> recreate
>>> >>>> > > > >> wikipediainput and run TrainClassifier. Will let you know how
>>> it
>>> >>>> > > goes..
>>> >>>> > > > >>
>>> >>>> > > > >> Do we have any benchmarks / system requirements for running
>>> this
>>> >>>> > > example
>>> >>>> > > > ?
>>> >>>> > > > >> Has anyone else had success running this example anytime.
>>> Would
>>> >>>> > > > appreciate
>>> >>>> > > > >> your inputs / thots.
>>> >>>> > > > >>
>>> >>>> > > > >> Should we look at tuning the code for handling these
>>> situations ?
>>> >>>> > Any
>>> >>>> > > > quick
>>> >>>> > > > >> suggestions on where to start looking at ?
>>> >>>> > > > >>
>>> >>>> > > > >> regards,
>>> >>>> > > > >> Joe.
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >
>>> >>>> > > >
>>> >>>> > >
>>> >>>> >
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >
>>>
>>
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
I am out of town for this week and won't have access to my machine. I
will check this during the weekend and will get back to you. Will
follow the steps in the wiki.

Thank you


Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Hi Gangadhar,

I ran TestClassifier with similar parameters. It didn't take me 2 hrs though.

I have documented the steps that worked for me at
https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
Can you please apply the patch available at MAHOUT-509 and then try the
steps in the wiki.
Please let me know if you still face issues.
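
For reference, the core of the MAHOUT-509 change discussed earlier in this
thread is making options such as --classifierType optional with a default,
instead of required. Below is only a hedged plain-Java sketch of that
fallback logic, not the actual patch (the real code uses Mahout's
commons-cli2 option builders; OptionDefaults, getWithDefault and
chooseTrainer are hypothetical names):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the MAHOUT-509 idea: options such as
// --classifierType fall back to a documented default ("bayes") when
// absent, rather than being declared required.
public class OptionDefaults {

    // Return the parsed option value, or the default when it is missing.
    static String getWithDefault(Map<String, String> parsed, String key, String def) {
        String v = parsed.get(key);
        return (v == null || v.isEmpty()) ? def : v;
    }

    // Mirrors the corrected branch order: only "cbayes" selects the
    // complementary trainer; anything else (including the default) is bayes.
    static String chooseTrainer(String classifierType) {
        if ("cbayes".equalsIgnoreCase(classifierType)) {
            return "trainCNaiveBayes";
        }
        return "trainNaiveBayes";
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new HashMap<String, String>();
        // --classifierType was not supplied on the command line:
        String type = getWithDefault(parsed, "classifierType", "bayes");
        System.out.println(type + " -> " + chooseTrainer(type));
    }
}
```

With this shape, omitting --classifierType trains plain bayes, which is
what the command-line help already promises.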

reg
Joe.



Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
Can you let me know what command you used to test the classifier ? With
the ngrams set to 1 as suggested by Robin, I was able to train the
classifier. The command:
$HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1
--input wikipediainput10 --output wikipediamodel10 --classifierType
bayes --dataSource hdfs
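
Robin's earlier point about --gramSize is easy to see with a quick sketch:
each position in the token stream yields one n-gram, and across a corpus
the trigram vocabulary is far larger than the unigram one because few
trigrams repeat, so gramSize 3 over Wikipedia produces a huge feature
space. This is only an illustration (NGramSketch and wordNGrams are
hypothetical helpers, not Mahout code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of why a larger gramSize inflates the
// feature count: every window of n consecutive tokens becomes a feature.
public class NGramSketch {

    static List<String> wordNGrams(String[] tokens, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder(tokens[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(tokens[i + j]);
            }
            grams.add(sb.toString());
        }
        return grams;
    }

    public static void main(String[] args) {
        String[] doc = "madrid is the capital of spain".split(" ");
        System.out.println(wordNGrams(doc, 1).size()); // 6 unigrams
        System.out.println(wordNGrams(doc, 3).size()); // 4 trigrams
    }
}
```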

After this, as per the wiki, we need to get the data from HDFS. I did that:
<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10

After this, the classifier is to be tested:
$HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10
-d wikipediainput10  -ng 1 -type bayes -source hdfs

When I run this, it runs for close to 2 hours and then errors out with a
java.io.FileException saying that logs_ is a directory in the
wikipediainput10 folder. I am sorry I can't provide the stack trace right
now because I accidentally closed the terminal window before I could copy
it. I will run this again and send the stack trace.
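
One possible cause (an assumption until we see the stack trace): job
output copied out of HDFS typically contains Hadoop side entries such as a
_logs directory and _SUCCESS marker alongside the part-* data files, and
opening one of those as a data file would fail. A hedged sketch of
skipping such entries when collecting input (SideFileFilter is a
hypothetical name, not the Mahout fix):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: skip Hadoop side entries (underscore- or
// dot-prefixed names) and subdirectories when gathering the test-set
// input, so a directory such as "_logs" is never opened as a data file.
public class SideFileFilter {

    static boolean isDataName(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    static List<File> dataFiles(File inputDir) {
        List<File> result = new ArrayList<File>();
        File[] entries = inputDir.listFiles();
        if (entries != null) {
            for (File f : entries) {
                if (f.isFile() && isDataName(f.getName())) {
                    result.add(f);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isDataName("part-00000")); // true
        System.out.println(isDataName("_logs"));      // false
    }
}
```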

But, if you can send me the steps that you followed after running the
classifier, I can repeat those and see if I am able to successfully
execute the classifier.

Thank you
Gangadhar


On Mon, Sep 20, 2010 at 11:13 PM, Gangadhar Nittala
<np...@gmail.com> wrote:
> Joe,
> I will try with the ngram setting of 1 and let you know how it goes.
> Robin, the ngram parameter is used to check the number of subsequences
> of characters isn't it ? Or is it evaluated differently w.r.t to the
> Bayesian classifier ?
>
> Ted, like Joe mentioned, if you could point us to some information on
> SGD we could try it and report back the results to the list.
>
> Thank you
> Gangadhar
>
> On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <jo...@gmail.com> wrote:
>> Robin / Gangadhar,
>> With ngram as 1 and all the countries in the country.txt , the model is
>> getting created without any issues.
>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
>> -o wikipediamodel -type bayes -source hdfs
>>
>> Robin,
>> Even for ngram parameter, the default value is mentioned as 1 but it is set
>> as a mandatory parameter in TrainClassifier. so i'll modify the code to set
>> the default ngram as 1 and make it as a non mandatory param.
>>
>> That aside, When I try to test the model, the summary is getting printed
>> like below.
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances          :          0         ?%
>> Incorrectly Classified Instances        :          0         ?%
>> Total Classified Instances              :          0
>> Need to figure out the reason..
>>
>> Since TestClassifier also has the same params and settings like
>> TrainClassifier, can i modify it to set the default values for ngram,
>> classifierType & dataSource ?
>>
>> reg,
>> Joe.
>>
>> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <jo...@gmail.com> wrote:
>>
>>> Robin,
>>>
>>> Thanks for your tip.
>>> Will try it out and post updates.
>>>
>>> reg
>>> Joe.
>>>
>>>
>>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com> wrote:
>>>
>>>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
>>>> need atleast 2 countries. otherwise there is no classification. Secondly
>>>> ngram =3 is a bit too high. With wikipedia this will result in a huge
>>>> number
>>>> of features. Why dont you try with one and see.
>>>>
>>>> Robin
>>>>
>>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:
>>>>
>>>> > Hi Ted,
>>>> >
>>>> > sure. will keep digging..
>>>> >
>>>> > About SGD, I dont have an idea about how it works et al. If there is
>>>> some
>>>> > documentation / reference / quick summary to read about it that'll be
>>>> gr8.
>>>> > Just saw one reference in
>>>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>>> >
>>>> > I am assuming we should be able to create a model from wikipedia
>>>> articles
>>>> > and label the country of a new article. If so, could you please provide
>>>> a
>>>> > note on how to do this. We already have the wikipedia data being
>>>> extracted
>>>> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
>>>> > about training the classifier using SGD ?
>>>> >
>>>> > thanks for your help,
>>>> > Joe.
>>>> >
>>>> >
>>>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > I am watching these efforts with interest, but have been unable to
>>>> > > contribute much to the process.  I would encourage Joe and others to
>>>> keep
>>>> > > whittling this problem down so that we can understand what is causing
>>>> it.
>>>> > >
>>>> > > In the meantime, I think that the SGD classifiers are close to
>>>> production
>>>> > > quality.  For problems with less than several million training
>>>> examples,
>>>> > > and
>>>> > > especially problems with many sparse features, I think that these
>>>> > > classifiers might be easier to get started with than the Naive Bayes
>>>> > > classifiers.  To make a virtue of a defect, the SGD based classifiers
>>>> to
>>>> > > not
>>>> > > use Hadoop for training.  This makes deployment of a classification
>>>> > > training
>>>> > > workflow easier, but limits the total size of data that can be
>>>> handled.
>>>> > >
>>>> > > What would you guys need to get started with trying these alternative
>>>> > > models?
>>>> > >
>>>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>>>> > > <np...@gmail.com>wrote:
>>>> > >
>>>> > > > Joe,
>>>> > > > Even I tried with reducing the number of countries in the
>>>> country.txt.
>>>> > > > That didn't help. And in my case, I was monitoring the disk space
>>>> and
>>>> > > > at no time did it reach 0%. So, I am not sure if that is the case.
>>>> To
>>>> > > > remove the dependency on the number of countries, I even tried with
>>>> > > > the subjects.txt as the classification - that also did not help.
>>>> > > > I think this problem is due to the type of the data being processed,
>>>> > > > but what I am not sure of is what I need to change to get the data
>>>> to
>>>> > > > be processed successfully.
>>>> > > >
>>>> > > > The experienced folks on Mahout will be able to tell us what is
>>>> missing
>>>> > I
>>>> > > > guess.
>>>> > > >
>>>> > > > Thank you
>>>> > > > Gangadhar
>>>> > > >
>>>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com>
>>>> wrote:
>>>> > > > > Gangadhar,
>>>> > > > >
>>>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>>>> > just
>>>> > > > have
>>>> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create
>>>> the
>>>> > > > > wikipediainput data set and then ran TrainClassifier and it
>>>> worked.
>>>> > > when
>>>> > > > I
>>>> > > > > ran TestClassifier as below, I got blank results in the output.
>>>> > > > >
>>>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>>>> wikipediamodel
>>>> > -d
>>>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>>>> > > > >
>>>> > > > > Summary
>>>> > > > > -------------------------------------------------------
>>>> > > > > Correctly Classified Instances          :          0         ?%
>>>> > > > > Incorrectly Classified Instances        :          0         ?%
>>>> > > > > Total Classified Instances              :          0
>>>> > > > >
>>>> > > > > =======================================================
>>>> > > > > Confusion Matrix
>>>> > > > > -------------------------------------------------------
>>>> > > > > a     <--Classified as
>>>> > > > > 0     |  0     a     = spain
>>>> > > > > Default Category: unknown: 1
>>>> > > > >
>>>> > > > > I am not sure if I am doing something wrong.. have to figure out
>>>> why
>>>> > my
>>>> > > > o/p
>>>> > > > > is so blank.
>>>> > > > > I'll document these steps and mention about country.txt in the
>>>> wiki.
>>>> > > > >
>>>> > > > > Question to all
>>>> > > > > Should we have 2 country.txt
>>>> > > > >
>>>> > > > >   1. country_full_list.txt - this is the existing list
>>>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>>>> > > > >
>>>> > > > > To get a flavor of the wikipedia bayes example, we can use
>>>> > > > > country_sample.txt. When new people want to just try out the
>>>> example,
>>>> > > > they
>>>> > > > > can reference this txt file  as a parameter.
>>>> > > > > To run the example in a robust scalable infrastructure, we could
>>>> use
>>>> > > > > country_full_list.txt.
>>>> > > > > any thots ?
>>>> > > > >
>>>> > > > > regards
>>>> > > > > Joe.
>>>> > > > >
>>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
>>>> > wrote:
>>>> > > > >
>>>> > > > >> Gangadhar,
>>>> > > > >>
>>>> > > > >> After running TrainClassifier again, the map task just failed
>>>> with
>>>> > the
>>>> > > > same
>>>> > > > >> exception and I am pretty sure it is an issue with disk space.
>>>> > > > >> As the map was progressing, I was monitoring my free disk space
>>>> > > dropping
>>>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>>>> task
>>>> > and
>>>> > > > then
>>>> > > > >> the exception happened. After the exception, another map task was
>>>> > > > resuming
>>>> > > > >> at 33% and I got close to 15GB free space (i guess the first map
>>>> > task
>>>> > > > freed
>>>> > > > >> up some space) and I am sure they would drop down to zero again
>>>> and
>>>> > > > throw
>>>> > > > >> the same exception.
>>>> > > > >> I am going to modify the country.txt to just 1 country and
>>>> recreate
>>>> > > > >> wikipediainput and run TrainClassifier. Will let you know how it
>>>> > > goes..
>>>> > > > >>
>>>> > > > >> Do we have any benchmarks / system requirements for running this
>>>> > > example
>>>> > > > ?
>>>> > > > >> Has anyone else had success running this example anytime. Would
>>>> > > > appreciate
>>>> > > > >> your inputs / thots.
>>>> > > > >>
>>>> > > > >> Should we look at tuning the code for handling these situations ?
>>>> > Any
>>>> > > > quick
>>>> > > > >> suggestions on where to start looking at ?
>>>> > > > >>
>>>> > > > >> regards,
>>>> > > > >> Joe.
>>>> > > > >>
>>>> > > > >>
>>>> > > > >>
>>>> > > > >>
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
I will try with the ngram setting of 1 and let you know how it goes.
Robin, the ngram parameter is used to set the number of character
subsequences, isn't it ? Or is it evaluated differently w.r.t. the
Bayesian classifier ?

Ted, like Joe mentioned, if you could point us to some information on
SGD, we could try it and report the results back to the list.

Thank you
Gangadhar

On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <jo...@gmail.com> wrote:
> Robin / Gangadhar,
> With ngram as 1 and all the countries in the country.txt , the model is
> getting created without any issues.
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
> -o wikipediamodel -type bayes -source hdfs
>
> Robin,
> Even for ngram parameter, the default value is mentioned as 1 but it is set
> as a mandatory parameter in TrainClassifier. so i'll modify the code to set
> the default ngram as 1 and make it as a non mandatory param.
>
> That aside, When I try to test the model, the summary is getting printed
> like below.
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0         ?%
> Incorrectly Classified Instances        :          0         ?%
> Total Classified Instances              :          0
> Need to figure out the reason..
>
> Since TestClassifier also has the same params and settings like
> TrainClassifier, can i modify it to set the default values for ngram,
> classifierType & dataSource ?
>
> reg,
> Joe.
>
> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <jo...@gmail.com> wrote:
>
>> Robin,
>>
>> Thanks for your tip.
>> Will try it out and post updates.
>>
>> reg
>> Joe.
>>
>>
>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com> wrote:
>>
>>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
>>> need atleast 2 countries. otherwise there is no classification. Secondly
>>> ngram =3 is a bit too high. With wikipedia this will result in a huge
>>> number
>>> of features. Why dont you try with one and see.
>>>
>>> Robin
>>>
>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:
>>>
>>> > Hi Ted,
>>> >
>>> > sure. will keep digging..
>>> >
>>> > About SGD, I dont have an idea about how it works et al. If there is
>>> some
>>> > documentation / reference / quick summary to read about it that'll be
>>> gr8.
>>> > Just saw one reference in
>>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>> >
>>> > I am assuming we should be able to create a model from wikipedia
>>> articles
>>> > and label the country of a new article. If so, could you please provide
>>> a
>>> > note on how to do this. We already have the wikipedia data being
>>> extracted
>>> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
>>> > about training the classifier using SGD ?
>>> >
>>> > thanks for your help,
>>> > Joe.
>>> >
>>> >
>>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
>>> > wrote:
>>> >
>>> > > I am watching these efforts with interest, but have been unable to
>>> > > contribute much to the process.  I would encourage Joe and others to
>>> keep
>>> > > whittling this problem down so that we can understand what is causing
>>> it.
>>> > >
>>> > > In the meantime, I think that the SGD classifiers are close to
>>> production
>>> > > quality.  For problems with less than several million training
>>> examples,
>>> > > and
>>> > > especially problems with many sparse features, I think that these
>>> > > classifiers might be easier to get started with than the Naive Bayes
>>> > > classifiers.  To make a virtue of a defect, the SGD based classifiers
>>> to
>>> > > not
>>> > > use Hadoop for training.  This makes deployment of a classification
>>> > > training
>>> > > workflow easier, but limits the total size of data that can be
>>> handled.
>>> > >
>>> > > What would you guys need to get started with trying these alternative
>>> > > models?
>>> > >
>>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>>> > > <np...@gmail.com>wrote:
>>> > >
>>> > > > Joe,
>>> > > > Even I tried with reducing the number of countries in the
>>> country.txt.
>>> > > > That didn't help. And in my case, I was monitoring the disk space
>>> and
>>> > > > at no time did it reach 0%. So, I am not sure if that is the case.
>>> To
>>> > > > remove the dependency on the number of countries, I even tried with
>>> > > > the subjects.txt as the classification - that also did not help.
>>> > > > I think this problem is due to the type of the data being processed,
>>> > > > but what I am not sure of is what I need to change to get the data
>>> to
>>> > > > be processed successfully.
>>> > > >
>>> > > > The experienced folks on Mahout will be able to tell us what is
>>> missing
>>> > I
>>> > > > guess.
>>> > > >
>>> > > > Thank you
>>> > > > Gangadhar
>>> > > >
>>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com>
>>> wrote:
>>> > > > > Gangadhar,
>>> > > > >
>>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>>> > just
>>> > > > have
>>> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create
>>> the
>>> > > > > wikipediainput data set and then ran TrainClassifier and it
>>> worked.
>>> > > when
>>> > > > I
>>> > > > > ran TestClassifier as below, I got blank results in the output.
>>> > > > >
>>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>>> wikipediamodel
>>> > -d
>>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>>> > > > >
>>> > > > > Summary
>>> > > > > -------------------------------------------------------
>>> > > > > Correctly Classified Instances          :          0         ?%
>>> > > > > Incorrectly Classified Instances        :          0         ?%
>>> > > > > Total Classified Instances              :          0
>>> > > > >
>>> > > > > =======================================================
>>> > > > > Confusion Matrix
>>> > > > > -------------------------------------------------------
>>> > > > > a     <--Classified as
>>> > > > > 0     |  0     a     = spain
>>> > > > > Default Category: unknown: 1
>>> > > > >
>>> > > > > I am not sure if I am doing something wrong.. have to figure out
>>> why
>>> > my
>>> > > > o/p
>>> > > > > is so blank.
>>> > > > > I'll document these steps and mention about country.txt in the
>>> wiki.
>>> > > > >
>>> > > > > Question to all
>>> > > > > Should we have 2 country.txt
>>> > > > >
>>> > > > >   1. country_full_list.txt - this is the existing list
>>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>>> > > > >
>>> > > > > To get a flavor of the wikipedia bayes example, we can use
>>> > > > > country_sample.txt. When new people want to just try out the
>>> example,
>>> > > > they
>>> > > > > can reference this txt file  as a parameter.
>>> > > > > To run the example in a robust scalable infrastructure, we could
>>> use
>>> > > > > country_full_list.txt.
>>> > > > > any thots ?
>>> > > > >
>>> > > > > regards
>>> > > > > Joe.
>>> > > > >
>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
>>> > wrote:
>>> > > > >
>>> > > > >> Gangadhar,
>>> > > > >>
>>> > > > >> After running TrainClassifier again, the map task just failed
>>> with
>>> > the
>>> > > > same
>>> > > > >> exception and I am pretty sure it is an issue with disk space.
>>> > > > >> As the map was progressing, I was monitoring my free disk space
>>> > > dropping
>>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>>> task
>>> > and
>>> > > > then
>>> > > > >> the exception happened. After the exception, another map task was
>>> > > > resuming
>>> > > > >> at 33% and I got close to 15GB free space (i guess the first map
>>> > task
>>> > > > freed
>>> > > > >> up some space) and I am sure they would drop down to zero again
>>> and
>>> > > > throw
>>> > > > >> the same exception.
>>> > > > >> I am going to modify the country.txt to just 1 country and
>>> recreate
>>> > > > >> wikipediainput and run TrainClassifier. Will let you know how it
>>> > > goes..
>>> > > > >>
>>> > > > >> Do we have any benchmarks / system requirements for running this
>>> > > example
>>> > > > ?
>>> > > > >> Has anyone else had success running this example anytime. Would
>>> > > > appreciate
>>> > > > >> your inputs / thots.
>>> > > > >>
>>> > > > >> Should we look at tuning the code for handling these situations ?
>>> > Any
>>> > > > quick
>>> > > > >> suggestions on where to start looking at ?
>>> > > > >>
>>> > > > >> regards,
>>> > > > >> Joe.
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>>
>>
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Robin / Gangadhar,
With ngram as 1 and all the countries in country.txt, the model is
created without any issues:
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
-o wikipediamodel -type bayes -source hdfs

Robin,
Even for the ngram parameter, the default value is documented as 1, but
it is set as a mandatory parameter in TrainClassifier. So I'll modify
the code to set the default ngram to 1 and make it a non-mandatory param.
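The change amounts to registering each option as optional with a
default instead of required. In argparse terms (an analogy only, not
the commons-cli builder code TrainClassifier actually uses; option
names mirror the commands in this thread):

```python
import argparse

parser = argparse.ArgumentParser(
    description="analogy for TrainClassifier's option defaults")
# required=False plus a default mirrors "documented default, non-mandatory":
parser.add_argument("--gramSize", "-ng", type=int, default=1,
                    required=False)
parser.add_argument("--classifierType", "-type", default="bayes",
                    required=False)
parser.add_argument("--dataSource", "-source", default="hdfs",
                    required=False)

# Omitting every option now falls back to the documented defaults.
args = parser.parse_args([])
print(args.gramSize, args.classifierType, args.dataSource)
```

With required options and "defaults" only in the help text, the two
disagree; making the defaults real keeps the help honest.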

That aside, when I try to test the model, the summary is printed like
below.
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0         ?%
Incorrectly Classified Instances        :          0         ?%
Total Classified Instances              :          0
I need to figure out the reason.
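The "?%" presumably comes from guarding a division by zero: with zero
total classified instances there is no percentage to report. A minimal
sketch of such a summary line (an assumption about the formatting, not
the actual TestClassifier code):

```python
def accuracy_line(label, count, total):
    """Format one summary line, printing '?' for the percentage when
    the total is zero instead of dividing by zero."""
    pct = "?" if total == 0 else "%.2f" % (100.0 * count / total)
    return "%-40s: %10d %9s%%" % (label, count, pct)

print(accuracy_line("Correctly Classified Instances", 0, 0))
print(accuracy_line("Correctly Classified Instances", 42, 50))
```

So the real question is why zero documents reached the classifier at
all, not the "?" itself.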

Since TestClassifier has the same params and settings as
TrainClassifier, can I modify it too to set the default values for
ngram, classifierType & dataSource ?

reg,
Joe.

On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <jo...@gmail.com> wrote:

> Robin,
>
> Thanks for your tip.
> Will try it out and post updates.
>
> reg
> Joe.
>
>
> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com> wrote:
>
>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
>> need atleast 2 countries. otherwise there is no classification. Secondly
>> ngram =3 is a bit too high. With wikipedia this will result in a huge
>> number
>> of features. Why dont you try with one and see.
>>
>> Robin
>>
>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:
>>
>> > Hi Ted,
>> >
>> > sure. will keep digging..
>> >
>> > About SGD, I dont have an idea about how it works et al. If there is
>> some
>> > documentation / reference / quick summary to read about it that'll be
>> gr8.
>> > Just saw one reference in
>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>> >
>> > I am assuming we should be able to create a model from wikipedia
>> articles
>> > and label the country of a new article. If so, could you please provide
>> a
>> > note on how to do this. We already have the wikipedia data being
>> extracted
>> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
>> > about training the classifier using SGD ?
>> >
>> > thanks for your help,
>> > Joe.
>> >
>> >
>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> >
>> > > I am watching these efforts with interest, but have been unable to
>> > > contribute much to the process.  I would encourage Joe and others to
>> keep
>> > > whittling this problem down so that we can understand what is causing
>> it.
>> > >
>> > > In the meantime, I think that the SGD classifiers are close to
>> production
>> > > quality.  For problems with less than several million training
>> examples,
>> > > and
>> > > especially problems with many sparse features, I think that these
>> > > classifiers might be easier to get started with than the Naive Bayes
>> > > classifiers.  To make a virtue of a defect, the SGD based classifiers
>> to
>> > > not
>> > > use Hadoop for training.  This makes deployment of a classification
>> > > training
>> > > workflow easier, but limits the total size of data that can be
>> handled.
>> > >
>> > > What would you guys need to get started with trying these alternative
>> > > models?
>> > >
>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>> > > <np...@gmail.com>wrote:
>> > >
>> > > > Joe,
>> > > > Even I tried with reducing the number of countries in the
>> country.txt.
>> > > > That didn't help. And in my case, I was monitoring the disk space
>> and
>> > > > at no time did it reach 0%. So, I am not sure if that is the case.
>> To
>> > > > remove the dependency on the number of countries, I even tried with
>> > > > the subjects.txt as the classification - that also did not help.
>> > > > I think this problem is due to the type of the data being processed,
>> > > > but what I am not sure of is what I need to change to get the data
>> to
>> > > > be processed successfully.
>> > > >
>> > > > The experienced folks on Mahout will be able to tell us what is
>> missing
>> > I
>> > > > guess.
>> > > >
>> > > > Thank you
>> > > > Gangadhar
>> > > >
>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com>
>> wrote:
>> > > > > Gangadhar,
>> > > > >
>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>> > just
>> > > > have
>> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create
>> the
>> > > > > wikipediainput data set and then ran TrainClassifier and it
>> worked.
>> > > when
>> > > > I
>> > > > > ran TestClassifier as below, I got blank results in the output.
>> > > > >
>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>> wikipediamodel
>> > -d
>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>> > > > >
>> > > > > Summary
>> > > > > -------------------------------------------------------
>> > > > > Correctly Classified Instances          :          0         ?%
>> > > > > Incorrectly Classified Instances        :          0         ?%
>> > > > > Total Classified Instances              :          0
>> > > > >
>> > > > > =======================================================
>> > > > > Confusion Matrix
>> > > > > -------------------------------------------------------
>> > > > > a     <--Classified as
>> > > > > 0     |  0     a     = spain
>> > > > > Default Category: unknown: 1
>> > > > >
>> > > > > I am not sure if I am doing something wrong.. have to figure out
>> why
>> > my
>> > > > o/p
>> > > > > is so blank.
>> > > > > I'll document these steps and mention about country.txt in the
>> wiki.
>> > > > >
>> > > > > Question to all
>> > > > > Should we have 2 country.txt
>> > > > >
>> > > > >   1. country_full_list.txt - this is the existing list
>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>> > > > >
>> > > > > To get a flavor of the wikipedia bayes example, we can use
>> > > > > country_sample.txt. When new people want to just try out the
>> example,
>> > > > they
>> > > > > can reference this txt file  as a parameter.
>> > > > > To run the example in a robust scalable infrastructure, we could
>> use
>> > > > > country_full_list.txt.
>> > > > > any thots ?
>> > > > >
>> > > > > regards
>> > > > > Joe.
>> > > > >
>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
>> > wrote:
>> > > > >
>> > > > >> Gangadhar,
>> > > > >>
>> > > > >> After running TrainClassifier again, the map task just failed
>> with
>> > the
>> > > > same
>> > > > >> exception and I am pretty sure it is an issue with disk space.
>> > > > >> As the map was progressing, I was monitoring my free disk space
>> > > dropping
>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>> task
>> > and
>> > > > then
>> > > > >> the exception happened. After the exception, another map task was
>> > > > resuming
>> > > > >> at 33% and I got close to 15GB free space (i guess the first map
>> > task
>> > > > freed
>> > > > >> up some space) and I am sure they would drop down to zero again
>> and
>> > > > throw
>> > > > >> the same exception.
>> > > > >> I am going to modify the country.txt to just 1 country and
>> recreate
>> > > > >> wikipediainput and run TrainClassifier. Will let you know how it
>> > > goes..
>> > > > >>
>> > > > >> Do we have any benchmarks / system requirements for running this
>> > > example
>> > > > ?
>> > > > >> Has anyone else had success running this example anytime. Would
>> > > > appreciate
>> > > > >> your inputs / thots.
>> > > > >>
>> > > > >> Should we look at tuning the code for handling these situations ?
>> > Any
>> > > > quick
>> > > > >> suggestions on where to start looking at ?
>> > > > >>
>> > > > >> regards,
>> > > > >> Joe.
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
>
>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Robin,

Thanks for your tip.
Will try it out and post updates.

reg
Joe.

On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com> wrote:

> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
> need atleast 2 countries. otherwise there is no classification. Secondly
> ngram =3 is a bit too high. With wikipedia this will result in a huge
> number
> of features. Why dont you try with one and see.
>
> Robin
>
> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > sure. will keep digging..
> >
> > About SGD, I dont have an idea about how it works et al. If there is some
> > documentation / reference / quick summary to read about it that'll be
> gr8.
> > Just saw one reference in
> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >
> > I am assuming we should be able to create a model from wikipedia articles
> > and label the country of a new article. If so, could you please provide a
> > note on how to do this. We already have the wikipedia data being
> extracted
> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
> > about training the classifier using SGD ?
> >
> > thanks for your help,
> > Joe.
> >
> >
> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > I am watching these efforts with interest, but have been unable to
> > > contribute much to the process.  I would encourage Joe and others to
> keep
> > > whittling this problem down so that we can understand what is causing
> it.
> > >
> > > In the meantime, I think that the SGD classifiers are close to
> production
> > > quality.  For problems with less than several million training
> examples,
> > > and
> > > especially problems with many sparse features, I think that these
> > > classifiers might be easier to get started with than the Naive Bayes
> > > classifiers.  To make a virtue of a defect, the SGD based classifiers
> to
> > > not
> > > use Hadoop for training.  This makes deployment of a classification
> > > training
> > > workflow easier, but limits the total size of data that can be handled.
> > >
> > > What would you guys need to get started with trying these alternative
> > > models?
> > >
> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> > > <np...@gmail.com>wrote:
> > >
> > > > Joe,
> > > > Even I tried with reducing the number of countries in the
> country.txt.
> > > > That didn't help. And in my case, I was monitoring the disk space and
> > > > at no time did it reach 0%. So, I am not sure if that is the case. To
> > > > remove the dependency on the number of countries, I even tried with
> > > > the subjects.txt as the classification - that also did not help.
> > > > I think this problem is due to the type of the data being processed,
> > > > but what I am not sure of is what I need to change to get the data to
> > > > be processed successfully.
> > > >
> > > > The experienced folks on Mahout will be able to tell us what is
> missing
> > I
> > > > guess.
> > > >
> > > > Thank you
> > > > Gangadhar
> > > >
> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com>
> wrote:
> > > > > Gangadhar,
> > > > >
> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
> > just
> > > > have
> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create
> the
> > > > > wikipediainput data set and then ran TrainClassifier and it worked.
> > > when
> > > > I
> > > > > ran TestClassifier as below, I got blank results in the output.
> > > > >
> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
> > -d
> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
> > > > >
> > > > > Summary
> > > > > -------------------------------------------------------
> > > > > Correctly Classified Instances          :          0         ?%
> > > > > Incorrectly Classified Instances        :          0         ?%
> > > > > Total Classified Instances              :          0
> > > > >
> > > > > =======================================================
> > > > > Confusion Matrix
> > > > > -------------------------------------------------------
> > > > > a     <--Classified as
> > > > > 0     |  0     a     = spain
> > > > > Default Category: unknown: 1
> > > > >
> > > > > I am not sure if I am doing something wrong.. have to figure out
> why
> > my
> > > > o/p
> > > > > is so blank.
> > > > > I'll document these steps and mention about country.txt in the
> wiki.
> > > > >
> > > > > Question to all
> > > > > Should we have 2 country.txt
> > > > >
> > > > >   1. country_full_list.txt - this is the existing list
> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
> > > > >
> > > > > To get a flavor of the wikipedia bayes example, we can use
> > > > > country_sample.txt. When new people want to just try out the
> example,
> > > > they
> > > > > can reference this txt file  as a parameter.
> > > > > To run the example in a robust scalable infrastructure, we could
> use
> > > > > country_full_list.txt.
> > > > > any thots ?
> > > > >
> > > > > regards
> > > > > Joe.
> > > > >
> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
> > wrote:
> > > > >
> > > > >> Gangadhar,
> > > > >>
> > > > >> After running TrainClassifier again, the map task just failed with
> > the
> > > > same
> > > > >> exception and I am pretty sure it is an issue with disk space.
> > > > >> As the map was progressing, I was monitoring my free disk space
> > > dropping
> > > > >> from 81GB. It came down to 0 after almost 66% through the map task
> > and
> > > > then
> > > > >> the exception happened. After the exception, another map task was
> > > > resuming
> > > > >> at 33% and I got close to 15GB free space (i guess the first map
> > task
> > > > freed
> > > > >> up some space) and I am sure they would drop down to zero again
> and
> > > > throw
> > > > >> the same exception.
> > > > >> I am going to modify the country.txt to just 1 country and
> recreate
> > > > >> wikipediainput and run TrainClassifier. Will let you know how it
> > > goes..
> > > > >>
> > > > >> Do we have any benchmarks / system requirements for running this
> > > example
> > > > ?
> > > > >> Has anyone else had success running this example anytime. Would
> > > > appreciate
> > > > >> your inputs / thots.
> > > > >>
> > > > >> Should we look at tuning the code for handling these situations ?
> > Any
> > > > quick
> > > > >> suggestions on where to start looking at ?
> > > > >>
> > > > >> regards,
> > > > >> Joe.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Options in TrainClassifier.java

Posted by Robin Anil <ro...@gmail.com>.
Hi guys, sorry about not replying. I see two possible problems. First, you
need at least two countries; otherwise there is no classification. Second,
ngram = 3 is a bit too high: with Wikipedia this will result in a huge number
of features. Why don't you try with one and see?
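The feature blowup Robin mentions is easy to see with a toy count of distinct word n-grams. This is a hedged sketch: plain whitespace tokenization stands in for Mahout's analyzer, so the numbers are only illustrative.

```java
import java.util.*;

public class NgramCount {

  // Count distinct word n-grams in a text, to illustrate why a larger n
  // inflates the feature space. (Whitespace tokenization is an assumption
  // here, a toy stand-in for Mahout's analyzer.)
  static int distinctNgrams(String text, int n) {
    String[] tokens = text.toLowerCase().split("\\s+");
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + n <= tokens.length; i++) {
      grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
    }
    return grams.size();
  }

  public static void main(String[] args) {
    String text = "the quick brown fox jumps over the lazy dog the quick brown fox";
    System.out.println(distinctNgrams(text, 1)); // small unigram vocabulary
    System.out.println(distinctNgrams(text, 3)); // already more distinct trigrams
  }
}
```

On a corpus the size of Wikipedia, the trigram vocabulary dwarfs the unigram one, which is why dropping --gramSize back to 1 is a sensible first experiment.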

Robin

On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:

> Hi Ted,
>
> sure. will keep digging..
>
> About SGD, I dont have an idea about how it works et al. If there is some
> documentation / reference / quick summary to read about it that'll be gr8.
> Just saw one reference in
> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>
> I am assuming we should be able to create a model from wikipedia articles
> and label the country of a new article. If so, could you please provide a
> note on how to do this. We already have the wikipedia data being extracted
> for specific countries using WikipediaDatasetCreatorDriver. How do we go
> about training the classifier using SGD ?
>
> thanks for your help,
> Joe.
>
>
> On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > I am watching these efforts with interest, but have been unable to
> > contribute much to the process.  I would encourage Joe and others to keep
> > whittling this problem down so that we can understand what is causing it.
> >
> > In the meantime, I think that the SGD classifiers are close to production
> > quality.  For problems with less than several million training examples,
> > and
> > especially problems with many sparse features, I think that these
> > classifiers might be easier to get started with than the Naive Bayes
> > classifiers.  To make a virtue of a defect, the SGD based classifiers to
> > not
> > use Hadoop for training.  This makes deployment of a classification
> > training
> > workflow easier, but limits the total size of data that can be handled.
> >
> > What would you guys need to get started with trying these alternative
> > models?
> >
> > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> > <np...@gmail.com>wrote:
> >
> > > Joe,
> > > Even I tried with reducing the number of countries in the country.txt.
> > > That didn't help. And in my case, I was monitoring the disk space and
> > > at no time did it reach 0%. So, I am not sure if that is the case. To
> > > remove the dependency on the number of countries, I even tried with
> > > the subjects.txt as the classification - that also did not help.
> > > I think this problem is due to the type of the data being processed,
> > > but what I am not sure of is what I need to change to get the data to
> > > be processed successfully.
> > >
> > > The experienced folks on Mahout will be able to tell us what is missing
> I
> > > guess.
> > >
> > > Thank you
> > > Gangadhar
> > >
> > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> > > > Gangadhar,
> > > >
> > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
> just
> > > have
> > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > > > wikipediainput data set and then ran TrainClassifier and it worked.
> > when
> > > I
> > > > ran TestClassifier as below, I got blank results in the output.
> > > >
> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
> -d
> > > >  wikipediainput  -ng 3 -type bayes -source hdfs
> > > >
> > > > Summary
> > > > -------------------------------------------------------
> > > > Correctly Classified Instances          :          0         ?%
> > > > Incorrectly Classified Instances        :          0         ?%
> > > > Total Classified Instances              :          0
> > > >
> > > > =======================================================
> > > > Confusion Matrix
> > > > -------------------------------------------------------
> > > > a     <--Classified as
> > > > 0     |  0     a     = spain
> > > > Default Category: unknown: 1
> > > >
> > > > I am not sure if I am doing something wrong.. have to figure out why
> my
> > > o/p
> > > > is so blank.
> > > > I'll document these steps and mention about country.txt in the wiki.
> > > >
> > > > Question to all
> > > > Should we have 2 country.txt
> > > >
> > > >   1. country_full_list.txt - this is the existing list
> > > >   2. country_sample_list.txt - a list with 2 or 3 countries
> > > >
> > > > To get a flavor of the wikipedia bayes example, we can use
> > > > country_sample.txt. When new people want to just try out the example,
> > > they
> > > > can reference this txt file  as a parameter.
> > > > To run the example in a robust scalable infrastructure, we could use
> > > > country_full_list.txt.
> > > > any thots ?
> > > >
> > > > regards
> > > > Joe.
> > > >
> > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
> wrote:
> > > >
> > > >> Gangadhar,
> > > >>
> > > >> After running TrainClassifier again, the map task just failed with
> the
> > > same
> > > >> exception and I am pretty sure it is an issue with disk space.
> > > >> As the map was progressing, I was monitoring my free disk space
> > dropping
> > > >> from 81GB. It came down to 0 after almost 66% through the map task
> and
> > > then
> > > >> the exception happened. After the exception, another map task was
> > > resuming
> > > >> at 33% and I got close to 15GB free space (i guess the first map
> task
> > > freed
> > > >> up some space) and I am sure they would drop down to zero again and
> > > throw
> > > >> the same exception.
> > > >> I am going to modify the country.txt to just 1 country and recreate
> > > >> wikipediainput and run TrainClassifier. Will let you know how it
> > goes..
> > > >>
> > > >> Do we have any benchmarks / system requirements for running this
> > example
> > > ?
> > > >> Has anyone else had success running this example anytime. Would
> > > appreciate
> > > >> your inputs / thots.
> > > >>
> > > >> Should we look at tuning the code for handling these situations ?
> Any
> > > quick
> > > >> suggestions on where to start looking at ?
> > > >>
> > > >> regards,
> > > >> Joe.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Hi Ted,

sure. will keep digging..

About SGD, I don't have an idea of how it works. If there is some
documentation / reference / quick summary to read about it, that would be
great. I just saw one reference at
https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.

I am assuming we should be able to create a model from Wikipedia articles
and label the country of a new article. If so, could you please provide a
note on how to do this? We already have the Wikipedia data being extracted
for specific countries using WikipediaDatasetCreatorDriver. How do we go
about training the classifier with SGD?

thanks for your help,
Joe.


On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com> wrote:

> I am watching these efforts with interest, but have been unable to
> contribute much to the process.  I would encourage Joe and others to keep
> whittling this problem down so that we can understand what is causing it.
>
> In the meantime, I think that the SGD classifiers are close to production
> quality.  For problems with less than several million training examples,
> and
> especially problems with many sparse features, I think that these
> classifiers might be easier to get started with than the Naive Bayes
> classifiers.  To make a virtue of a defect, the SGD based classifiers to
> not
> use Hadoop for training.  This makes deployment of a classification
> training
> workflow easier, but limits the total size of data that can be handled.
>
> What would you guys need to get started with trying these alternative
> models?
>
> On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> <np...@gmail.com>wrote:
>
> > Joe,
> > Even I tried with reducing the number of countries in the country.txt.
> > That didn't help. And in my case, I was monitoring the disk space and
> > at no time did it reach 0%. So, I am not sure if that is the case. To
> > remove the dependency on the number of countries, I even tried with
> > the subjects.txt as the classification - that also did not help.
> > I think this problem is due to the type of the data being processed,
> > but what I am not sure of is what I need to change to get the data to
> > be processed successfully.
> >
> > The experienced folks on Mahout will be able to tell us what is missing I
> > guess.
> >
> > Thank you
> > Gangadhar
> >
> > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> > > Gangadhar,
> > >
> > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just
> > have
> > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > > wikipediainput data set and then ran TrainClassifier and it worked.
> when
> > I
> > > ran TestClassifier as below, I got blank results in the output.
> > >
> > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> > >  wikipediainput  -ng 3 -type bayes -source hdfs
> > >
> > > Summary
> > > -------------------------------------------------------
> > > Correctly Classified Instances          :          0         ?%
> > > Incorrectly Classified Instances        :          0         ?%
> > > Total Classified Instances              :          0
> > >
> > > =======================================================
> > > Confusion Matrix
> > > -------------------------------------------------------
> > > a     <--Classified as
> > > 0     |  0     a     = spain
> > > Default Category: unknown: 1
> > >
> > > I am not sure if I am doing something wrong.. have to figure out why my
> > o/p
> > > is so blank.
> > > I'll document these steps and mention about country.txt in the wiki.
> > >
> > > Question to all
> > > Should we have 2 country.txt
> > >
> > >   1. country_full_list.txt - this is the existing list
> > >   2. country_sample_list.txt - a list with 2 or 3 countries
> > >
> > > To get a flavor of the wikipedia bayes example, we can use
> > > country_sample.txt. When new people want to just try out the example,
> > they
> > > can reference this txt file  as a parameter.
> > > To run the example in a robust scalable infrastructure, we could use
> > > country_full_list.txt.
> > > any thots ?
> > >
> > > regards
> > > Joe.
> > >
> > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:
> > >
> > >> Gangadhar,
> > >>
> > >> After running TrainClassifier again, the map task just failed with the
> > same
> > >> exception and I am pretty sure it is an issue with disk space.
> > >> As the map was progressing, I was monitoring my free disk space
> dropping
> > >> from 81GB. It came down to 0 after almost 66% through the map task and
> > then
> > >> the exception happened. After the exception, another map task was
> > resuming
> > >> at 33% and I got close to 15GB free space (i guess the first map task
> > freed
> > >> up some space) and I am sure they would drop down to zero again and
> > throw
> > >> the same exception.
> > >> I am going to modify the country.txt to just 1 country and recreate
> > >> wikipediainput and run TrainClassifier. Will let you know how it
> goes..
> > >>
> > >> Do we have any benchmarks / system requirements for running this
> example
> > ?
> > >> Has anyone else had success running this example anytime. Would
> > appreciate
> > >> your inputs / thots.
> > >>
> > >> Should we look at tuning the code for handling these situations ? Any
> > quick
> > >> suggestions on where to start looking at ?
> > >>
> > >> regards,
> > >> Joe.
> > >>
> > >>
> > >>
> > >>
> > >
> >
>

Re: Options in TrainClassifier.java

Posted by Ted Dunning <te...@gmail.com>.
I am watching these efforts with interest, but have been unable to
contribute much to the process.  I would encourage Joe and others to keep
whittling this problem down so that we can understand what is causing it.

In the meantime, I think that the SGD classifiers are close to production
quality.  For problems with fewer than several million training examples, and
especially problems with many sparse features, I think that these
classifiers might be easier to get started with than the Naive Bayes
classifiers.  To make a virtue of a defect, the SGD-based classifiers do not
use Hadoop for training.  This makes deployment of a classification training
workflow easier, but limits the total size of data that can be handled.

What would you guys need to get started with trying these alternative
models?
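To make the suggestion concrete: the core of an SGD-trained logistic model is just an online update loop over the examples, with no Hadoop job involved. The toy sketch below shows that shape only; none of these names come from Mahout's SGD API, which is considerably richer.

```java
public class TinySgd {

  // Toy two-feature logistic regression trained by stochastic gradient
  // descent. Illustrative only: Mahout's SGD classes handle sparse vectors,
  // regularization, and learning-rate schedules that are omitted here.
  double[] w = new double[3]; // bias + 2 weights
  double rate = 0.5;          // fixed learning rate (an assumption)

  static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

  double predict(double x1, double x2) {
    return sigmoid(w[0] + w[1] * x1 + w[2] * x2);
  }

  // One online update per example: this loop is the whole "training job".
  void train(double x1, double x2, int label) {
    double err = label - predict(x1, x2);
    w[0] += rate * err;
    w[1] += rate * err * x1;
    w[2] += rate * err * x2;
  }

  public static void main(String[] args) {
    TinySgd model = new TinySgd();
    for (int pass = 0; pass < 100; pass++) {
      model.train(0.0, 0.0, 0); // class-0 examples
      model.train(0.1, 0.2, 0);
      model.train(1.0, 1.0, 1); // class-1 examples
      model.train(0.9, 0.8, 1);
    }
    System.out.println(model.predict(0.05, 0.1) < 0.5); // low score for class 0
    System.out.println(model.predict(0.95, 0.9) > 0.5); // high score for class 1
  }
}
```

Because training is a plain in-process loop like this, an SGD workflow deploys like any other Java program, which is the trade-off Ted describes.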

On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> Joe,
> Even I tried with reducing the number of countries in the country.txt.
> That didn't help. And in my case, I was monitoring the disk space and
> at no time did it reach 0%. So, I am not sure if that is the case. To
> remove the dependency on the number of countries, I even tried with
> the subjects.txt as the classification - that also did not help.
> I think this problem is due to the type of the data being processed,
> but what I am not sure of is what I need to change to get the data to
> be processed successfully.
>
> The experienced folks on Mahout will be able to tell us what is missing I
> guess.
>
> Thank you
> Gangadhar
>
> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> > Gangadhar,
> >
> > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just
> have
> > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > wikipediainput data set and then ran TrainClassifier and it worked. when
> I
> > ran TestClassifier as below, I got blank results in the output.
> >
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> >  wikipediainput  -ng 3 -type bayes -source hdfs
> >
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :          0         ?%
> > Incorrectly Classified Instances        :          0         ?%
> > Total Classified Instances              :          0
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a     <--Classified as
> > 0     |  0     a     = spain
> > Default Category: unknown: 1
> >
> > I am not sure if I am doing something wrong.. have to figure out why my
> o/p
> > is so blank.
> > I'll document these steps and mention about country.txt in the wiki.
> >
> > Question to all
> > Should we have 2 country.txt
> >
> >   1. country_full_list.txt - this is the existing list
> >   2. country_sample_list.txt - a list with 2 or 3 countries
> >
> > To get a flavor of the wikipedia bayes example, we can use
> > country_sample.txt. When new people want to just try out the example,
> they
> > can reference this txt file  as a parameter.
> > To run the example in a robust scalable infrastructure, we could use
> > country_full_list.txt.
> > any thots ?
> >
> > regards
> > Joe.
> >
> > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:
> >
> >> Gangadhar,
> >>
> >> After running TrainClassifier again, the map task just failed with the
> same
> >> exception and I am pretty sure it is an issue with disk space.
> >> As the map was progressing, I was monitoring my free disk space dropping
> >> from 81GB. It came down to 0 after almost 66% through the map task and
> then
> >> the exception happened. After the exception, another map task was
> resuming
> >> at 33% and I got close to 15GB free space (i guess the first map task
> freed
> >> up some space) and I am sure they would drop down to zero again and
> throw
> >> the same exception.
> >> I am going to modify the country.txt to just 1 country and recreate
> >> wikipediainput and run TrainClassifier. Will let you know how it goes..
> >>
> >> Do we have any benchmarks / system requirements for running this example
> ?
> >> Has anyone else had success running this example anytime. Would
> appreciate
> >> your inputs / thots.
> >>
> >> Should we look at tuning the code for handling these situations ? Any
> quick
> >> suggestions on where to start looking at ?
> >>
> >> regards,
> >> Joe.
> >>
> >>
> >>
> >>
> >
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
I also tried reducing the number of countries in country.txt. That didn't
help, and in my case I was monitoring the disk space and at no time did it
reach 0%, so I am not sure that is the cause. To remove the dependency on
the number of countries, I even tried with subjects.txt as the
classification list - that also did not help.
I think this problem is due to the type of the data being processed,
but what I am not sure of is what I need to change to get the data to
be processed successfully.

The experienced folks on Mahout will be able to tell us what is missing I guess.

Thank you
Gangadhar

On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> Gangadhar,
>
> I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just have
> 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> wikipediainput data set and then ran TrainClassifier and it worked. when I
> ran TestClassifier as below, I got blank results in the output.
>
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
>  wikipediainput  -ng 3 -type bayes -source hdfs
>
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0         ?%
> Incorrectly Classified Instances        :          0         ?%
> Total Classified Instances              :          0
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a     <--Classified as
> 0     |  0     a     = spain
> Default Category: unknown: 1
>
> I am not sure if I am doing something wrong.. have to figure out why my o/p
> is so blank.
> I'll document these steps and mention about country.txt in the wiki.
>
> Question to all
> Should we have 2 country.txt
>
>   1. country_full_list.txt - this is the existing list
>   2. country_sample_list.txt - a list with 2 or 3 countries
>
> To get a flavor of the wikipedia bayes example, we can use
> country_sample.txt. When new people want to just try out the example, they
> can reference this txt file  as a parameter.
> To run the example in a robust scalable infrastructure, we could use
> country_full_list.txt.
> any thots ?
>
> regards
> Joe.
>
> On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:
>
>> Gangadhar,
>>
>> After running TrainClassifier again, the map task just failed with the same
>> exception and I am pretty sure it is an issue with disk space.
>> As the map was progressing, I was monitoring my free disk space dropping
>> from 81GB. It came down to 0 after almost 66% through the map task and then
>> the exception happened. After the exception, another map task was resuming
>> at 33% and I got close to 15GB free space (i guess the first map task freed
>> up some space) and I am sure they would drop down to zero again and throw
>> the same exception.
>> I am going to modify the country.txt to just 1 country and recreate
>> wikipediainput and run TrainClassifier. Will let you know how it goes..
>>
>> Do we have any benchmarks / system requirements for running this example ?
>> Has anyone else had success running this example anytime. Would appreciate
>> your inputs / thots.
>>
>> Should we look at tuning the code for handling these situations ? Any quick
>> suggestions on where to start looking at ?
>>
>> regards,
>> Joe.
>>
>>
>>
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to have just
one entry (spain) and used WikipediaDatasetCreatorDriver to create the
wikipediainput data set, then ran TrainClassifier and it worked. When I
ran TestClassifier as below, I got blank results in the output.

$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
 wikipediainput  -ng 3 -type bayes -source hdfs

Summary
-------------------------------------------------------
Correctly Classified Instances          :          0         ?%
Incorrectly Classified Instances        :          0         ?%
Total Classified Instances              :          0

=======================================================
Confusion Matrix
-------------------------------------------------------
a     <--Classified as
0     |  0     a     = spain
Default Category: unknown: 1

I am not sure if I am doing something wrong; I have to figure out why my
output is so blank.
I'll document these steps and mention country.txt in the wiki.

A question for everyone: should we have two country.txt files?

   1. country_full_list.txt - this is the existing list
   2. country_sample_list.txt - a list with 2 or 3 countries

To get a flavor of the Wikipedia Bayes example, we can use
country_sample_list.txt: when new people just want to try out the example,
they can reference this file as a parameter. To run the example on a robust,
scalable infrastructure, we could use country_full_list.txt.
Any thoughts?

regards
Joe.

On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:

> Gangadhar,
>
> After running TrainClassifier again, the map task just failed with the same
> exception and I am pretty sure it is an issue with disk space.
> As the map was progressing, I was monitoring my free disk space dropping
> from 81GB. It came down to 0 after almost 66% through the map task and then
> the exception happened. After the exception, another map task was resuming
> at 33% and I got close to 15GB free space (i guess the first map task freed
> up some space) and I am sure they would drop down to zero again and throw
> the same exception.
> I am going to modify the country.txt to just 1 country and recreate
> wikipediainput and run TrainClassifier. Will let you know how it goes..
>
> Do we have any benchmarks / system requirements for running this example ?
> Has anyone else had success running this example anytime. Would appreciate
> your inputs / thots.
>
> Should we look at tuning the code for handling these situations ? Any quick
> suggestions on where to start looking at ?
>
> regards,
> Joe.
>
>
>
>

Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

After running TrainClassifier again, the map task failed with the same
exception, and I am pretty sure it is a disk space issue.
As the map was progressing, I was monitoring my free disk space dropping
from 81GB. It came down to 0 when the map task was almost 66% through, and
then the exception happened. After the exception, another map task was
resuming at 33% and I had close to 15GB of free space (I guess the first map
task freed up some space), and I am sure it would drop to zero again and
throw the same exception.
I am going to modify country.txt to just one country, recreate
wikipediainput, and run TrainClassifier. Will let you know how it goes.

Do we have any benchmarks / system requirements for running this example?
Has anyone else had success running this example? I would appreciate your
input.

Should we look at tuning the code to handle these situations? Any quick
suggestions on where to start looking?

regards,
Joe.

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
I don't think disk space is the problem, because I did have enough (well,
not 81GB, but around 40GB free). I will try the suggestions in the thread
you mentioned and see if they make any difference. Will keep you posted.

Thank you

On Fri, Sep 17, 2010 at 11:33 PM, Joe Kumar <jo...@gmail.com> wrote:
> Gangadhar,
>
> I couldnt find any concrete reason behind this error. Some of them have
> reported this to happen very sporadic. As per some suggestions in this
> thread (
> http://www.mail-archive.com/core-user@hadoop.apache.org/msg09250.html) , I
> have changed the location of hadoop tmp dir. Also I have cleaned up some
> space in my laptop (now having 81GB of free space) and have started the job
> again. I m trying to see if freeing up space helps. I'll post any progress.
>
> Has anyone else faced similar issues. Would appreciate feedbacks / thots.
>
> reg
> Joe.
>
>
> On Fri, Sep 17, 2010 at 8:36 PM, Gangadhar Nittala
> <np...@gmail.com>wrote:
>
>> Thank you Joe for the confirmation. I am also checking the code to see
>> what is causing this issue. May be others in the list will know what
>> can cause this issue. I am guessing the root cause is not Mahout but
>> something in Hadoop.
>>
>> On Thu, Sep 16, 2010 at 11:34 PM, Joe Kumar <jo...@gmail.com> wrote:
>> > Gangadhar,
>> >
>> > After some system issues, I finally ran the TrainClassifier. After almost
>> > 65% into the map job, I got the same error that you have mentioned.
>> > INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0,
>> > Status : FAILED
>> > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
>> > valid local directory for
>> >
>> taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
>> > at
>> >
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>> > ...
>> > Havent yet analyzed the root cause / solution but just wanted to confirm
>> > that I am facing the same issue as you do.
>> > I'll try to search / analyze and post more details.
>> >
>> > reg,
>> > Joe.
>> >
>> > On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <jo...@gmail.com> wrote:
>> >
>> >> Hi Gangadhar,
>> >>
>> >> rite. I did the same to execute the TrainClassifier but then since the
>> >> default datasource is hdfs, we should not be mandated to provide this
>> >> parameter.
>> >> I havent completed executing the TrainClassifier yet. I'll do it tonite
>> and
>> >> let you know if I get into trouble.
>> >>
>> >> reg,
>> >> Joe.
>> >>
>> >>
>> >> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
>> >> npk.gangadhar@gmail.com> wrote:
>> >>
>> >>> I ran into the issue that Joe mentioned about the command line
>> >>> parameters. I just added the datasource to the command line to execute
>> >>> thus
>> >>>  $HADOOP_HOME/bin/hadoop jar
>> >>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> >>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
>> >>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>> >>> bayes --dataSource hdfs
>> >>>
>> >>> On a related note, Joe, were you able to run the TrainClassifier
>> >>> without any errors ? When I tried this, the map-reduce job would abort
>> >>> always at 99%. I tried the example that was given in the wiki with
>> >>> both subjects and countries. I even reduced the list of countries in
>> >>> the country.txt assuming that was what was causing the issue. No
>> >>> matter what, the classifier task fails. And the exception in the task
>> >>> log :
>> >>>
>> >>> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
>> >>> = 41271492; bufend = 58259002; bufvoid = 99614720
>> >>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
>> >>> = 196379; kvend = 130842; length = 327680
>> >>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
>> >>> Finished spill 287
>> >>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
>> >>> Starting flush of map output
>> >>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
>> >>> Finished spill 288
>> >>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
>> >>> Error running child
>> >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> >>> any valid local directory for
>> >>>
>> >>>
>> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>> >>>        at
>> >>>
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>> >>>        at
>> >>>
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>> >>>        at
>> >>>
>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>> >>>        at
>> >>>
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>> >>>        at
>> >>>
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>> >>>        at
>> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>> >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>> >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> >>>
>> >>> I checked the hadoop JIRA and this seems to be fixed already
>> >>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
>> >>> I am doing wrong. Any suggestions to what I need to change to get this
>> >>> fixed will be very helpful. I have been struggling with this for a
>> >>> while now.
>> >>>
>> >>> Thank you
>> >>>
>> >>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
>> >>> > Robin,
>> >>> >
>> >>> > sure. I'll submit a patch.
>> >>> >
>> >>> > The command line flag already has the default behavior specified.
>> >>> >  --classifierType (-type) classifierType    Type of classifier:
>> >>> > bayes|cbayes.
>> >>> >                                             Default: bayes
>> >>> >
>> >>> >  --dataSource (-source) dataSource          Location of model:
>> >>> hdfs|hbase.
>> >>> >
>> >>> >                                             Default Value: hdfs
>> >>> > So there is no change in the flag description.
>> >>> >
>> >>> > reg,
>> >>> > Joe.
>> >>> >
>> >>> >
>> >>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
>> >>> wrote:
>> >>> >
>> >>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com>
>> >>> wrote:
>> >>> >>
>> >>> >> > Hi all,
>> >>> >> >
>> >>> >> > As I was going through wikipedia example, I encountered a
>> situation
>> >>> with
>> >>> >> > TrainClassifier wherein some of the options with default values
>> are
>> >>> >> > actually
>> >>> >> > mandatory.
>> >>> >> > The documentation / command line help says that
>> >>> >> >
>> >>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
>> >>> >> >   has withRequired(true) while building the --datasource option.
>> We
>> >>> are
>> >>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
>> >>> >> >   ideally withRequired should be set to false
>> >>> >> >   2. default --classifierType is bayes but withRequired is set to
>> >>> true
>> >>> >> and
>> >>> >> >   we have code like
>> >>> >> >
>> >>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
>> >>> >> >        log.info("Training Bayes Classifier");
>> >>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >>> >> >
>> >>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>> >>> >> >        log.info("Training Complementary Bayes Classifier");
>> >>> >> >        // setup the HDFS and copy the files there, then run the
>> >>> trainer
>> >>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >>> >> >      }
>> >>> >> >
>> >>> >> > which should be changed to
>> >>> >> >
>> >>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>> >>> >> >        log.info("Training Complementary Bayes Classifier");
>> >>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >>> >> >
>> >>> >> >      } *else  {*
>> >>> >> >        log.info("Training  Bayes Classifier");
>> >>> >> >        // setup the HDFS and copy the files there, then run the
>> >>> trainer
>> >>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >>> >> >      }
>> >>> >> >
>> >>> >> > Please let me know if this looks valid and I'll submit a patch for
>> a
>> >>> JIRA
>> >>> >> > issue.
>> >>> >> >
>> >>> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write
>> >>> the
>> >>> >> default behavior in the flag description
>> >>> >>
>> >>> >>
>> >>> >> > reg
>> >>> >> > Joe.
>> >>> >> >
>> >>> >>
>> >>> >
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

I couldn't find any concrete reason behind this error. Some users have
reported it happening only sporadically. As per some suggestions in this
thread (
http://www.mail-archive.com/core-user@hadoop.apache.org/msg09250.html),
I have changed the location of the Hadoop tmp dir. I have also cleaned up
some space on my laptop (now 81GB free) and started the job again, to see
if freeing up space helps. I'll post any progress.

Has anyone else faced similar issues? I would appreciate any feedback /
thoughts.

reg
Joe.


On Fri, Sep 17, 2010 at 8:36 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> Thank you Joe for the confirmation. I am also checking the code to see
> what is causing this issue. May be others in the list will know what
> can cause this issue. I am guessing the root cause is not Mahout but
> something in Hadoop.
>
> On Thu, Sep 16, 2010 at 11:34 PM, Joe Kumar <jo...@gmail.com> wrote:
> > Gangadhar,
> >
> > After some system issues, I finally ran the TrainClassifier. After almost
> > 65% into the map job, I got the same error that you have mentioned.
> > INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0,
> > Status : FAILED
> > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> > valid local directory for
> >
> taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
> > at
> >
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> > ...
> > Havent yet analyzed the root cause / solution but just wanted to confirm
> > that I am facing the same issue as you do.
> > I'll try to search / analyze and post more details.
> >
> > reg,
> > Joe.
> >
> > On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <jo...@gmail.com> wrote:
> >
> >> Hi Gangadhar,
> >>
> >> rite. I did the same to execute the TrainClassifier but then since the
> >> default datasource is hdfs, we should not be mandated to provide this
> >> parameter.
> >> I havent completed executing the TrainClassifier yet. I'll do it tonite
> and
> >> let you know if I get into trouble.
> >>
> >> reg,
> >> Joe.
> >>
> >>
> >> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
> >> npk.gangadhar@gmail.com> wrote:
> >>
> >>> I ran into the issue that Joe mentioned about the command line
> >>> parameters. I just added the datasource to the command line to execute
> >>> thus
> >>>  $HADOOP_HOME/bin/hadoop jar
> >>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> >>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
> >>> --input wikipediainput10 --output wikipediamodel10 --classifierType
> >>> bayes --dataSource hdfs
> >>>
> >>> On a related note, Joe, were you able to run the TrainClassifier
> >>> without any errors ? When I tried this, the map-reduce job would abort
> >>> always at 99%. I tried the example that was given in the wiki with
> >>> both subjects and countries. I even reduced the list of countries in
> >>> the country.txt assuming that was what was causing the issue. No
> >>> matter what, the classifier task fails. And the exception in the task
> >>> log :
> >>>
> >>> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
> >>> = 41271492; bufend = 58259002; bufvoid = 99614720
> >>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
> >>> = 196379; kvend = 130842; length = 327680
> >>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
> >>> Finished spill 287
> >>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
> >>> Starting flush of map output
> >>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
> >>> Finished spill 288
> >>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
> >>> Error running child
> >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> >>> any valid local directory for
> >>>
> >>>
> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
> >>>        at
> >>>
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> >>>        at
> >>>
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >>>        at
> >>>
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
> >>>        at
> >>>
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
> >>>        at
> >>>
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
> >>>        at
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
> >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >>>
> >>> I checked the hadoop JIRA and this seems to be fixed already
> >>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
> >>> I am doing wrong. Any suggestions to what I need to change to get this
> >>> fixed will be very helpful. I have been struggling with this for a
> >>> while now.
> >>>
> >>> Thank you
> >>>
> >>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
> >>> > Robin,
> >>> >
> >>> > sure. I'll submit a patch.
> >>> >
> >>> > The command line flag already has the default behavior specified.
> >>> >  --classifierType (-type) classifierType    Type of classifier:
> >>> > bayes|cbayes.
> >>> >                                             Default: bayes
> >>> >
> >>> >  --dataSource (-source) dataSource          Location of model:
> >>> hdfs|hbase.
> >>> >
> >>> >                                             Default Value: hdfs
> >>> > So there is no change in the flag description.
> >>> >
> >>> > reg,
> >>> > Joe.
> >>> >
> >>> >
> >>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
> >>> wrote:
> >>> >
> >>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com>
> >>> wrote:
> >>> >>
> >>> >> > Hi all,
> >>> >> >
> >>> >> > As I was going through wikipedia example, I encountered a
> situation
> >>> with
> >>> >> > TrainClassifier wherein some of the options with default values
> are
> >>> >> > actually
> >>> >> > mandatory.
> >>> >> > The documentation / command line help says that
> >>> >> >
> >>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
> >>> >> >   has withRequired(true) while building the --datasource option.
> We
> >>> are
> >>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
> >>> >> >   ideally withRequired should be set to false
> >>> >> >   2. default --classifierType is bayes but withRequired is set to
> >>> true
> >>> >> and
> >>> >> >   we have code like
> >>> >> >
> >>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
> >>> >> >        log.info("Training Bayes Classifier");
> >>> >> >        trainNaiveBayes(inputPath, outputPath, params);
> >>> >> >
> >>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
> >>> >> >        log.info("Training Complementary Bayes Classifier");
> >>> >> >        // setup the HDFS and copy the files there, then run the
> >>> trainer
> >>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
> >>> >> >      }
> >>> >> >
> >>> >> > which should be changed to
> >>> >> >
> >>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
> >>> >> >        log.info("Training Complementary Bayes Classifier");
> >>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
> >>> >> >
> >>> >> >      } *else  {*
> >>> >> >        log.info("Training  Bayes Classifier");
> >>> >> >        // setup the HDFS and copy the files there, then run the
> >>> trainer
> >>> >> >        trainNaiveBayes(inputPath, outputPath, params);
> >>> >> >      }
> >>> >> >
> >>> >> > Please let me know if this looks valid and I'll submit a patch for
> a
> >>> JIRA
> >>> >> > issue.
> >>> >> >
> >>> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write
> >>> the
> >>> >> default behavior in the flag description
> >>> >>
> >>> >>
> >>> >> > reg
> >>> >> > Joe.
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
> >>
> >>
> >>
> >>
> >
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Thank you Joe for the confirmation. I am also checking the code to see
what is causing this issue. Maybe others on the list will know what can
cause it. I am guessing the root cause is not Mahout but something in
Hadoop.

On Thu, Sep 16, 2010 at 11:34 PM, Joe Kumar <jo...@gmail.com> wrote:
> Gangadhar,
>
> After some system issues, I finally ran the TrainClassifier. After almost
> 65% into the map job, I got the same error that you have mentioned.
> INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0,
> Status : FAILED
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> valid local directory for
> taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
> at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> ...
> Havent yet analyzed the root cause / solution but just wanted to confirm
> that I am facing the same issue as you do.
> I'll try to search / analyze and post more details.
>
> reg,
> Joe.
>
> On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <jo...@gmail.com> wrote:
>
>> Hi Gangadhar,
>>
>> rite. I did the same to execute the TrainClassifier but then since the
>> default datasource is hdfs, we should not be mandated to provide this
>> parameter.
>> I havent completed executing the TrainClassifier yet. I'll do it tonite and
>> let you know if I get into trouble.
>>
>> reg,
>> Joe.
>>
>>
>> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
>> npk.gangadhar@gmail.com> wrote:
>>
>>> I ran into the issue that Joe mentioned about the command line
>>> parameters. I just added the datasource to the command line to execute
>>> thus
>>>  $HADOOP_HOME/bin/hadoop jar
>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
>>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>>> bayes --dataSource hdfs
>>>
>>> On a related note, Joe, were you able to run the TrainClassifier
>>> without any errors ? When I tried this, the map-reduce job would abort
>>> always at 99%. I tried the example that was given in the wiki with
>>> both subjects and countries. I even reduced the list of countries in
>>> the country.txt assuming that was what was causing the issue. No
>>> matter what, the classifier task fails. And the exception in the task
>>> log :
>>>
>>> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
>>> = 41271492; bufend = 58259002; bufvoid = 99614720
>>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 196379; kvend = 130842; length = 327680
>>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
>>> Finished spill 287
>>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
>>> Starting flush of map output
>>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
>>> Finished spill 288
>>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
>>> Error running child
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>> any valid local directory for
>>>
>>> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>>>        at
>>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>>>        at
>>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>>>        at
>>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>>>        at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>>>        at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> I checked the hadoop JIRA and this seems to be fixed already
>>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
>>> I am doing wrong. Any suggestions to what I need to change to get this
>>> fixed will be very helpful. I have been struggling with this for a
>>> while now.
>>>
>>> Thank you
>>>
>>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
>>> > Robin,
>>> >
>>> > sure. I'll submit a patch.
>>> >
>>> > The command line flag already has the default behavior specified.
>>> >  --classifierType (-type) classifierType    Type of classifier:
>>> > bayes|cbayes.
>>> >                                             Default: bayes
>>> >
>>> >  --dataSource (-source) dataSource          Location of model:
>>> hdfs|hbase.
>>> >
>>> >                                             Default Value: hdfs
>>> > So there is no change in the flag description.
>>> >
>>> > reg,
>>> > Joe.
>>> >
>>> >
>>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
>>> wrote:
>>> >
>>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com>
>>> wrote:
>>> >>
>>> >> > Hi all,
>>> >> >
>>> >> > As I was going through wikipedia example, I encountered a situation
>>> with
>>> >> > TrainClassifier wherein some of the options with default values are
>>> >> > actually
>>> >> > mandatory.
>>> >> > The documentation / command line help says that
>>> >> >
>>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
>>> >> >   has withRequired(true) while building the --datasource option. We
>>> are
>>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
>>> >> >   ideally withRequired should be set to false
>>> >> >   2. default --classifierType is bayes but withRequired is set to
>>> true
>>> >> and
>>> >> >   we have code like
>>> >> >
>>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
>>> >> >        log.info("Training Bayes Classifier");
>>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>>> >> >
>>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>>> >> >        log.info("Training Complementary Bayes Classifier");
>>> >> >        // setup the HDFS and copy the files there, then run the
>>> trainer
>>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>>> >> >      }
>>> >> >
>>> >> > which should be changed to
>>> >> >
>>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>>> >> >        log.info("Training Complementary Bayes Classifier");
>>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>>> >> >
>>> >> >      } *else  {*
>>> >> >        log.info("Training  Bayes Classifier");
>>> >> >        // setup the HDFS and copy the files there, then run the
>>> trainer
>>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>>> >> >      }
>>> >> >
>>> >> > Please let me know if this looks valid and I'll submit a patch for a
>>> JIRA
>>> >> > issue.
>>> >> >
>>> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write
>>> the
>>> >> default behavior in the flag description
>>> >>
>>> >>
>>> >> > reg
>>> >> > Joe.
>>> >> >
>>> >>
>>> >
>>>
>>
>>
>>
>>
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

After some system issues, I finally ran the TrainClassifier. After almost
65% into the map job, I got the same error that you have mentioned.
INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0,
Status : FAILED
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
valid local directory for
taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
...
Haven't yet analyzed the root cause / solution, but I just wanted to confirm
that I am facing the same issue as you.
I'll try to search / analyze and post more details.

reg,
Joe.

On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <jo...@gmail.com> wrote:

> Hi Gangadhar,
>
> rite. I did the same to execute the TrainClassifier but then since the
> default datasource is hdfs, we should not be mandated to provide this
> parameter.
> I havent completed executing the TrainClassifier yet. I'll do it tonite and
> let you know if I get into trouble.
>
> reg,
> Joe.
>
>
> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
> npk.gangadhar@gmail.com> wrote:
>
>> I ran into the issue that Joe mentioned about the command line
>> parameters. I just added the datasource to the command line to execute
>> thus
>>  $HADOOP_HOME/bin/hadoop jar
>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>> bayes --dataSource hdfs
>>
>> On a related note, Joe, were you able to run the TrainClassifier
>> without any errors ? When I tried this, the map-reduce job would abort
>> always at 99%. I tried the example that was given in the wiki with
>> both subjects and countries. I even reduced the list of countries in
>> the country.txt assuming that was what was causing the issue. No
>> matter what, the classifier task fails. And the exception in the task
>> log :
>>
>> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
>> = 41271492; bufend = 58259002; bufvoid = 99614720
>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
>> = 196379; kvend = 130842; length = 327680
>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
>> Finished spill 287
>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
>> Starting flush of map output
>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
>> Finished spill 288
>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
>> Error running child
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> any valid local directory for
>>
>> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>>        at
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>>        at
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>>        at
>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>>        at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>>        at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> I checked the hadoop JIRA and this seems to be fixed already
>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
>> I am doing wrong. Any suggestions to what I need to change to get this
>> fixed will be very helpful. I have been struggling with this for a
>> while now.
>>
>> Thank you
>>
>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
>> > Robin,
>> >
>> > sure. I'll submit a patch.
>> >
>> > The command line flag already has the default behavior specified.
>> >  --classifierType (-type) classifierType    Type of classifier:
>> > bayes|cbayes.
>> >                                             Default: bayes
>> >
>> >  --dataSource (-source) dataSource          Location of model:
>> hdfs|hbase.
>> >
>> >                                             Default Value: hdfs
>> > So there is no change in the flag description.
>> >
>> > reg,
>> > Joe.
>> >
>> >
>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
>> wrote:
>> >
>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com>
>> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > As I was going through wikipedia example, I encountered a situation
>> with
>> >> > TrainClassifier wherein some of the options with default values are
>> >> > actually
>> >> > mandatory.
>> >> > The documentation / command line help says that
>> >> >
>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
>> >> >   has withRequired(true) while building the --datasource option. We
>> are
>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
>> >> >   ideally withRequired should be set to false
>> >> >   2. default --classifierType is bayes but withRequired is set to
>> true
>> >> and
>> >> >   we have code like
>> >> >
>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
>> >> >        log.info("Training Bayes Classifier");
>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >> >
>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>> >> >        log.info("Training Complementary Bayes Classifier");
>> >> >        // setup the HDFS and copy the files there, then run the
>> trainer
>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >> >      }
>> >> >
>> >> > which should be changed to
>> >> >
>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>> >> >        log.info("Training Complementary Bayes Classifier");
>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >> >
>> >> >      } *else  {*
>> >> >        log.info("Training  Bayes Classifier");
>> >> >        // setup the HDFS and copy the files there, then run the
>> trainer
>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >> >      }
>> >> >
>> >> > Please let me know if this looks valid and I'll submit a patch for a
>> JIRA
>> >> > issue.
>> >> >
>> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write
>> the
>> >> default behavior in the flag description
>> >>
>> >>
>> >> > reg
>> >> > Joe.
>> >> >
>> >>
>> >
>>
>
>
>
>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Hi Gangadhar,

Right. I did the same to execute the TrainClassifier, but since the
default datasource is hdfs, we should not be required to provide this
parameter.
I haven't completed executing the TrainClassifier yet. I'll do it tonight
and let you know if I run into trouble.
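To make the intent concrete, here is a minimal, hypothetical sketch of the
defaulting logic being proposed (the names mirror this thread, not the
actual Mahout code): neither flag needs to be required, because anything
other than "hbase" already falls back to hdfs, and anything other than
"cbayes" should fall back to plain Bayes.

```java
// Hypothetical sketch of the defaulting behavior discussed in this thread.
public class TrainClassifierDefaults {

    // Mirrors the existing check: anything other than "hbase" is treated as hdfs,
    // so --dataSource can safely be optional.
    static String resolveDataSource(String dataSource) {
        return "hbase".equalsIgnoreCase(dataSource) ? "hbase" : "hdfs";
    }

    // Mirrors the proposed if/else: "cbayes" is the explicit branch; everything
    // else (including a missing value) trains the plain Bayes classifier.
    static String resolveClassifierType(String classifierType) {
        return "cbayes".equalsIgnoreCase(classifierType) ? "cbayes" : "bayes";
    }

    public static void main(String[] args) {
        System.out.println(resolveDataSource(null));         // hdfs (default)
        System.out.println(resolveClassifierType(null));     // bayes (default)
        System.out.println(resolveClassifierType("cbayes")); // cbayes
    }
}
```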

reg,
Joe.

On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> I ran into the issue that Joe mentioned about the command line
> parameters. I just added the datasource to the command line to execute
> thus
>  $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
> --input wikipediainput10 --output wikipediamodel10 --classifierType
> bayes --dataSource hdfs
>
> On a related note, Joe, were you able to run the TrainClassifier
> without any errors ? When I tried this, the map-reduce job would abort
> always at 99%. I tried the example that was given in the wiki with
> both subjects and countries. I even reduced the list of countries in
> the country.txt assuming that was what was causing the issue. No
> matter what, the classifier task fails. And the exception in the task
> log :
>
> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
> = 41271492; bufend = 58259002; bufvoid = 99614720
> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
> = 196379; kvend = 130842; length = 327680
> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
> Finished spill 287
> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
> Starting flush of map output
> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
> Finished spill 288
> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
> Error running child
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> any valid local directory for
>
> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>        at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>        at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>        at
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> I checked the hadoop JIRA and this seems to be fixed already
> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
> I am doing wrong. Any suggestions to what I need to change to get this
> fixed will be very helpful. I have been struggling with this for a
> while now.
>
> Thank you
>
> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
> > Robin,
> >
> > sure. I'll submit a patch.
> >
> > The command line flag already has the default behavior specified.
> >  --classifierType (-type) classifierType    Type of classifier:
> > bayes|cbayes.
> >                                             Default: bayes
> >
> >  --dataSource (-source) dataSource          Location of model:
> hdfs|hbase.
> >
> >                                             Default Value: hdfs
> > So there is no change in the flag description.
> >
> > reg,
> > Joe.
> >
> >
> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
> wrote:
> >
> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > As I was going through wikipedia example, I encountered a situation
> with
> >> > TrainClassifier wherein some of the options with default values are
> >> > actually
> >> > mandatory.
> >> > The documentation / command line help says that
> >> >
> >> >   1. default source (--datasource) is hdfs but TrainClassifier
> >> >   has withRequired(true) while building the --datasource option. We
> are
> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
> >> >   ideally withRequired should be set to false
> >> >   2. default --classifierType is bayes but withRequired is set to true
> >> and
> >> >   we have code like
> >> >
> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
> >> >        log.info("Training Bayes Classifier");
> >> >        trainNaiveBayes(inputPath, outputPath, params);
> >> >
> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
> >> >        log.info("Training Complementary Bayes Classifier");
> >> >        // setup the HDFS and copy the files there, then run the
> trainer
> >> >        trainCNaiveBayes(inputPath, outputPath, params);
> >> >      }
> >> >
> >> > which should be changed to
> >> >
> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
> >> >        log.info("Training Complementary Bayes Classifier");
> >> >        trainCNaiveBayes(inputPath, outputPath, params);
> >> >
> >> >      } *else  {*
> >> >        log.info("Training  Bayes Classifier");
> >> >        // setup the HDFS and copy the files there, then run the
> trainer
> >> >        trainNaiveBayes(inputPath, outputPath, params);
> >> >      }
> >> >
> >> > Please let me know if this looks valid and I'll submit a patch for a
> JIRA
> >> > issue.
> >> >
> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write the
> >> default behavior in the flag description
> >>
> >>
> >> > reg
> >> > Joe.
> >> >
> >>
> >
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
I ran into the issue that Joe mentioned about the command line
parameters. I just added the datasource to the command line and executed
it thus:
 $HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
--input wikipediainput10 --output wikipediamodel10 --classifierType
bayes --dataSource hdfs

On a related note, Joe, were you able to run the TrainClassifier
without any errors? When I tried this, the map-reduce job would always
abort at 99%. I tried the example that was given in the wiki with
both subjects and countries. I even reduced the list of countries in
country.txt, assuming that was what was causing the issue. No
matter what, the classifier task fails. This is the exception in the task
log:

10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
= 41271492; bufend = 58259002; bufvoid = 99614720
2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 196379; kvend = 130842; length = 327680
2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 287
2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
Starting flush of map output
2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 288
2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
Error running child
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
any valid local directory for
taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
	at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

I checked the Hadoop JIRA and this seems to have been fixed already
(https://issues.apache.org/jira/browse/HADOOP-4963). I am not sure what
I am doing wrong. Any suggestions as to what I need to change to get this
fixed would be very helpful. I have been struggling with this for a
while now.
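One thing worth ruling out for this DiskErrorException is free space on the
partition backing Hadoop's local/scratch directories (hadoop.tmp.dir /
mapred.local.dir). A rough check, with illustrative paths since mine will
differ from yours:

```shell
# Free space on the partition Hadoop spills map output to; the scratch
# dir defaults to a subdirectory of /tmp unless hadoop.tmp.dir overrides it.
HADOOP_TMP="${HADOOP_TMP_DIR:-/tmp}"
df -h "$HADOOP_TMP"
# How much is already in use there (permission errors suppressed)
du -sh "$HADOOP_TMP" 2>/dev/null || true
```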

Thank you

On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
> Robin,
>
> sure. I'll submit a patch.
>
> The command line flag already has the default behavior specified.
>  --classifierType (-type) classifierType    Type of classifier:
> bayes|cbayes.
>                                             Default: bayes
>
>  --dataSource (-source) dataSource          Location of model: hdfs|hbase.
>
>                                             Default Value: hdfs
> So there is no change in the flag description.
>
> reg,
> Joe.
>
>
> On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com> wrote:
>
>> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com> wrote:
>>
>> > Hi all,
>> >
>> > As I was going through wikipedia example, I encountered a situation with
>> > TrainClassifier wherein some of the options with default values are
>> > actually
>> > mandatory.
>> > The documentation / command line help says that
>> >
>> >   1. default source (--datasource) is hdfs but TrainClassifier
>> >   has withRequired(true) while building the --datasource option. We are
>> >   checking if the dataSourceType is hbase else set it to hdfs. so
>> >   ideally withRequired should be set to false
>> >   2. default --classifierType is bayes but withRequired is set to true
>> and
>> >   we have code like
>> >
>> > if ("bayes".equalsIgnoreCase(classifierType)) {
>> >        log.info("Training Bayes Classifier");
>> >        trainNaiveBayes(inputPath, outputPath, params);
>> >
>> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>> >        log.info("Training Complementary Bayes Classifier");
>> >        // setup the HDFS and copy the files there, then run the trainer
>> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >      }
>> >
>> > which should be changed to
>> >
>> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>> >        log.info("Training Complementary Bayes Classifier");
>> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >
>> >      } *else  {*
>> >        log.info("Training  Bayes Classifier");
>> >        // setup the HDFS and copy the files there, then run the trainer
>> >        trainNaiveBayes(inputPath, outputPath, params);
>> >      }
>> >
>> > Please let me know if this looks valid and I'll submit a patch for a JIRA
>> > issue.
>> >
>> > +1, all valid. Go ahead and fix it, and in the cmdline flags write the
>> default behavior in the flag description
>>
>>
>> > reg
>> > Joe.
>> >
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Robin,

sure. I'll submit a patch.

The command line flag already has the default behavior specified.
  --classifierType (-type) classifierType    Type of classifier:
bayes|cbayes.
                                             Default: bayes

  --dataSource (-source) dataSource          Location of model: hdfs|hbase.

                                             Default Value: hdfs
So there is no change in the flag description.

reg,
Joe.


On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com> wrote:

> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com> wrote:
>
> > Hi all,
> >
> > As I was going through wikipedia example, I encountered a situation with
> > TrainClassifier wherein some of the options with default values are
> > actually
> > mandatory.
> > The documentation / command line help says that
> >
> >   1. default source (--datasource) is hdfs but TrainClassifier
> >   has withRequired(true) while building the --datasource option. We are
> >   checking if the dataSourceType is hbase else set it to hdfs. so
> >   ideally withRequired should be set to false
> >   2. default --classifierType is bayes but withRequired is set to true
> and
> >   we have code like
> >
> > if ("bayes".equalsIgnoreCase(classifierType)) {
> >        log.info("Training Bayes Classifier");
> >        trainNaiveBayes(inputPath, outputPath, params);
> >
> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
> >        log.info("Training Complementary Bayes Classifier");
> >        // setup the HDFS and copy the files there, then run the trainer
> >        trainCNaiveBayes(inputPath, outputPath, params);
> >      }
> >
> > which should be changed to
> >
> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
> >        log.info("Training Complementary Bayes Classifier");
> >        trainCNaiveBayes(inputPath, outputPath, params);
> >
> >      } *else  {*
> >        log.info("Training  Bayes Classifier");
> >        // setup the HDFS and copy the files there, then run the trainer
> >        trainNaiveBayes(inputPath, outputPath, params);
> >      }
> >
> > Please let me know if this looks valid and I'll submit a patch for a JIRA
> > issue.
> >
> > +1, all valid. Go ahead and fix it, and in the cmdline flags write the
> default behavior in the flag description
>
>
> > reg
> > Joe.
> >
>

Re: Options in TrainClassifier.java

Posted by Robin Anil <ro...@gmail.com>.
On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com> wrote:

> Hi all,
>
> As I was going through wikipedia example, I encountered a situation with
> TrainClassifier wherein some of the options with default values are
> actually
> mandatory.
> The documentation / command line help says that
>
>   1. default source (--datasource) is hdfs but TrainClassifier
>   has withRequired(true) while building the --datasource option. We are
>   checking if the dataSourceType is hbase else set it to hdfs. so
>   ideally withRequired should be set to false
>   2. default --classifierType is bayes but withRequired is set to true and
>   we have code like
>
> if ("bayes".equalsIgnoreCase(classifierType)) {
>        log.info("Training Bayes Classifier");
>        trainNaiveBayes(inputPath, outputPath, params);
>
>      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>        log.info("Training Complementary Bayes Classifier");
>        // setup the HDFS and copy the files there, then run the trainer
>        trainCNaiveBayes(inputPath, outputPath, params);
>      }
>
> which should be changed to
>
> *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>        log.info("Training Complementary Bayes Classifier");
>        trainCNaiveBayes(inputPath, outputPath, params);
>
>      } *else  {*
>        log.info("Training  Bayes Classifier");
>        // setup the HDFS and copy the files there, then run the trainer
>        trainNaiveBayes(inputPath, outputPath, params);
>      }
>
> Please let me know if this looks valid and I'll submit a patch for a JIRA
> issue.
>
> +1, all valid. Go ahead and fix it, and in the cmdline flags write the
default behavior in the flag description


> reg
> Joe.
>
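
For anyone skimming the thread, the dispatch order agreed above can be
sketched as a tiny standalone class (class and method names here are
illustrative, not the actual TrainClassifier code): only an explicit
"cbayes" selects the complementary trainer, and anything else, including
a missing --classifierType, falls through to plain Bayes.

```java
// Illustrative sketch only -- not the real TrainClassifier. It shows the
// dispatch order from the proposed patch, where "cbayes" is the special
// case and any other value (or a missing one) defaults to bayes.
public class DispatchSketch {

  static String chooseTrainer(String classifierType) {
    if ("cbayes".equalsIgnoreCase(classifierType)) {
      return "cbayes";   // complementary naive Bayes
    } else {
      return "bayes";    // default: plain naive Bayes
    }
  }

  public static void main(String[] args) {
    System.out.println(chooseTrainer("cbayes")); // cbayes
    System.out.println(chooseTrainer("BAYES"));  // bayes
    System.out.println(chooseTrainer(null));     // bayes (option omitted)
  }
}
```

Note that String.equalsIgnoreCase returns false for a null argument, so a
null classifierType safely lands in the default branch, which is what
makes withRequired(false) viable here.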