Posted to dev@mahout.apache.org by Joe Kumar <jo...@gmail.com> on 2010/09/15 06:56:09 UTC

Options in TrainClassifier.java

Hi all,

As I was going through the wikipedia example, I noticed that in
TrainClassifier some options that have documented default values are
actually marked as mandatory.
The documentation / command line help says that

   1. the default source (--datasource) is hdfs, but TrainClassifier
   has withRequired(true) while building the --datasource option. The code
   only checks whether the dataSourceType is hbase and otherwise sets it
   to hdfs, so ideally withRequired should be set to false
   2. the default --classifierType is bayes, but withRequired is set to
   true, and we have code like

if ("bayes".equalsIgnoreCase(classifierType)) {
  log.info("Training Bayes Classifier");
  trainNaiveBayes(inputPath, outputPath, params);
} else if ("cbayes".equalsIgnoreCase(classifierType)) {
  log.info("Training Complementary Bayes Classifier");
  // setup the HDFS and copy the files there, then run the trainer
  trainCNaiveBayes(inputPath, outputPath, params);
}

which should be changed to

if ("cbayes".equalsIgnoreCase(classifierType)) {
  log.info("Training Complementary Bayes Classifier");
  trainCNaiveBayes(inputPath, outputPath, params);
} else {
  log.info("Training Bayes Classifier");
  // setup the HDFS and copy the files there, then run the trainer
  trainNaiveBayes(inputPath, outputPath, params);
}
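
With that change, bayes becomes the effective default whenever --classifierType
is absent or unrecognized. A tiny stand-alone sketch of that dispatch
(pickTrainer is a hypothetical stand-in for illustration, not Mahout's actual
code):

```java
public class ClassifierTypeDefault {

  // Illustrative stand-in for the dispatch in TrainClassifier: only an
  // explicit "cbayes" selects the complementary trainer; anything else,
  // including a missing option value, falls back to plain "bayes".
  static String pickTrainer(String classifierType) {
    if ("cbayes".equalsIgnoreCase(classifierType)) {
      return "cbayes";
    } else {
      return "bayes";
    }
  }

  public static void main(String[] args) {
    System.out.println(pickTrainer("cbayes")); // cbayes
    System.out.println(pickTrainer("BAYES"));  // bayes (case-insensitive)
    System.out.println(pickTrainer(null));     // bayes
  }
}
```

Because the string literal is the receiver of equalsIgnoreCase, a null
(unset) option value is handled without a NullPointerException, which is
what lets withRequired(false) be safe here.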

Please let me know if this looks valid and I'll submit a patch for a JIRA
issue.

reg
Joe.

Re: Options in TrainClassifier.java

Posted by deneche abdelhakim <ad...@gmail.com>.
I don't know if it's related, but I remember getting a similar
exception a year ago when I was working on the implementation of
Random Forests. In my case it was caused by
SequenceFile.Sorter.merge(). I ended up writing my own merge function
because I really didn't need the output sorted.
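
For what it's worth, the difference between a sorting merge like
SequenceFile.Sorter.merge() and the kind of merge that suffices when output
order doesn't matter can be sketched without Hadoop at all (illustrative
Java only, not the actual Mahout/Hadoop code):

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatMerge {

  // Merge several already-written "part files" (modeled here as lists of
  // records) by plain concatenation: no comparator, no sort pass, just a
  // single O(total records) append.
  static <T> List<T> mergeUnsorted(List<List<T>> parts) {
    List<T> merged = new ArrayList<>();
    for (List<T> part : parts) {
      merged.addAll(part);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<List<String>> parts = List.of(
        List.of("rec3", "rec1"),
        List.of("rec2"));
    // Order within each part is preserved; no global ordering is imposed.
    System.out.println(mergeUnsorted(parts)); // [rec3, rec1, rec2]
  }
}
```

Skipping the sort is exactly the saving: a sorting merge has to compare and
interleave records, while concatenation only copies them through.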


Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

Just to eliminate the usual suspects: I am using Mac OS X 10.5.8, Mahout 0.4
(revision 986659), Hadoop 0.20.2, 2GB memory for Hadoop, and 80GB of free
space. These are the commands that I executed.

I had issues with my namenode, so I did a format using hadoop namenode
-format.
$MAHOUT_HOME/examples/src/test/resources/country.txt had just 1 entry
(spain). I haven't tried with multiple entries.

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d
$MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles10.xml -o
wikipedia/chunks -c 64

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
wikipedia/chunks -o wikipediainput -c
$MAHOUT_HOME/examples/src/test/resources/country.txt

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o
wikipediamodel  -type bayes -source hdfs

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
 wikipediainput  -ng 3 -type bayes -source hdfs

Please try the above and let me know. We'll try to find out what is going
wrong.
Reg,
Joe.

On Sun, Sep 19, 2010 at 11:13 PM, Gangadhar Nittala <npk.gangadhar@gmail.com
> wrote:

> Joe,
> Even I tried with reducing the number of countries in the country.txt.
> That didn't help. And in my case, I was monitoring the disk space and
> at no time did it reach 0%. So, I am not sure if that is the case. To
> remove the dependency on the number of countries, I even tried with
> the subjects.txt as the classification - that also did not help.
> I think this problem is due to the type of the data being processed,
> but what I am not sure of is what I need to change to get the data to
> be processed successfully.
>
> The experienced folks on Mahout will be able to tell us what is missing I
> guess.
>
> Thank you
> Gangadhar
>
> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> > Gangadhar,
> >
> > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just
> have
> > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > wikipediainput data set and then ran TrainClassifier and it worked. when
> I
> > ran TestClassifier as below, I got blank results in the output.
> >
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> >  wikipediainput  -ng 3 -type bayes -source hdfs
> >
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :          0         ?%
> > Incorrectly Classified Instances        :          0         ?%
> > Total Classified Instances              :          0
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a     <--Classified as
> > 0     |  0     a     = spain
> > Default Category: unknown: 1
> >
> > I am not sure if I am doing something wrong.. have to figure out why my
> o/p
> > is so blank.
> > I'll document these steps and mention about country.txt in the wiki.
> >
> > Question to all
> > Should we have 2 country.txt
> >
> >   1. country_full_list.txt - this is the existing list
> >   2. country_sample_list.txt - a list with 2 or 3 countries
> >
> > To get a flavor of the wikipedia bayes example, we can use
> > country_sample.txt. When new people want to just try out the example,
> they
> > can reference this txt file  as a parameter.
> > To run the example in a robust scalable infrastructure, we could use
> > country_full_list.txt.
> > any thots ?
> >
> > regards
> > Joe.
> >
> > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:
> >
> >> Gangadhar,
> >>
> >> After running TrainClassifier again, the map task just failed with the
> same
> >> exception and I am pretty sure it is an issue with disk space.
> >> As the map was progressing, I was monitoring my free disk space dropping
> >> from 81GB. It came down to 0 after almost 66% through the map task and
> then
> >> the exception happened. After the exception, another map task was
> resuming
> >> at 33% and I got close to 15GB free space (i guess the first map task
> freed
> >> up some space) and I am sure they would drop down to zero again and
> throw
> >> the same exception.
> >> I am going to modify the country.txt to just 1 country and recreate
> >> wikipediainput and run TrainClassifier. Will let you know how it goes..
> >>
> >> Do we have any benchmarks / system requirements for running this example
> ?
> >> Has anyone else had success running this example anytime. Would
> appreciate
> >> your inputs / thots.
> >>
> >> Should we look at tuning the code for handling these situations ? Any
> quick
> >> suggestions on where to start looking at ?
> >>
> >> regards,
> >> Joe.
> >>
> >>
> >>
> >>
> >
>

Re: Options in TrainClassifier.java

Posted by Ted Dunning <te...@gmail.com>.
There is a test program called TrainNewsGroups
in org.apache.mahout.classifier.sgd in the examples module.

I would love to work with you to get better documentation pulled together.

On Mon, Sep 20, 2010 at 8:13 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> Joe,
> I will try with the ngram setting of 1 and let you know how it goes.
> Robin, the ngram parameter is used to check the number of subsequences
> of characters isn't it ? Or is it evaluated differently w.r.t to the
> Bayesian classifier ?
>
> Ted, like Joe mentioned, if you could point us to some information on
> SGD we could try it and report back the results to the list.
>
> Thank you
> Gangadhar
>
> On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <jo...@gmail.com> wrote:
> > Robin / Gangadhar,
> > With ngram as 1 and all the countries in the country.txt , the model is
> > getting created without any issues.
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i
> wikipediainput
> > -o wikipediamodel -type bayes -source hdfs
> >
> > Robin,
> > Even for ngram parameter, the default value is mentioned as 1 but it is
> set
> > as a mandatory parameter in TrainClassifier. so i'll modify the code to
> set
> > the default ngram as 1 and make it as a non mandatory param.
> >
> > That aside, When I try to test the model, the summary is getting printed
> > like below.
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :          0         ?%
> > Incorrectly Classified Instances        :          0         ?%
> > Total Classified Instances              :          0
> > Need to figure out the reason..
> >
> > Since TestClassifier also has the same params and settings like
> > TrainClassifier, can i modify it to set the default values for ngram,
> > classifierType & dataSource ?
> >
> > reg,
> > Joe.
> >
> > On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <jo...@gmail.com> wrote:
> >
> >> Robin,
> >>
> >> Thanks for your tip.
> >> Will try it out and post updates.
> >>
> >> reg
> >> Joe.
> >>
> >>
> >> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com>
> wrote:
> >>
> >>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st.
> You
> >>> need atleast 2 countries. otherwise there is no classification.
> Secondly
> >>> ngram =3 is a bit too high. With wikipedia this will result in a huge
> >>> number
> >>> of features. Why dont you try with one and see.
> >>>
> >>> Robin
> >>>
> >>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com>
> wrote:
> >>>
> >>> > Hi Ted,
> >>> >
> >>> > sure. will keep digging..
> >>> >
> >>> > About SGD, I dont have an idea about how it works et al. If there is
> >>> some
> >>> > documentation / reference / quick summary to read about it that'll be
> >>> gr8.
> >>> > Just saw one reference in
> >>> >
> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >>> >
> >>> > I am assuming we should be able to create a model from wikipedia
> >>> articles
> >>> > and label the country of a new article. If so, could you please
> provide
> >>> a
> >>> > note on how to do this. We already have the wikipedia data being
> >>> extracted
> >>> > for specific countries using WikipediaDatasetCreatorDriver. How do we
> go
> >>> > about training the classifier using SGD ?
> >>> >
> >>> > thanks for your help,
> >>> > Joe.
> >>> >
> >>> >
> >>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <ted.dunning@gmail.com
> >
> >>> > wrote:
> >>> >
> >>> > > I am watching these efforts with interest, but have been unable to
> >>> > > contribute much to the process.  I would encourage Joe and others
> to
> >>> keep
> >>> > > whittling this problem down so that we can understand what is
> causing
> >>> it.
> >>> > >
> >>> > > In the meantime, I think that the SGD classifiers are close to
> >>> production
> >>> > > quality.  For problems with less than several million training
> >>> examples,
> >>> > > and
> >>> > > especially problems with many sparse features, I think that these
> >>> > > classifiers might be easier to get started with than the Naive
> Bayes
> >>> > > classifiers.  To make a virtue of a defect, the SGD based
> classifiers
> >>> to
> >>> > > not
> >>> > > use Hadoop for training.  This makes deployment of a classification
> >>> > > training
> >>> > > workflow easier, but limits the total size of data that can be
> >>> handled.
> >>> > >
> >>> > > What would you guys need to get started with trying these
> alternative
> >>> > > models?
> >>> > >

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Ted,

I've added the patch MAHOUT-509_1.patch in Jira [
https://issues.apache.org/jira/browse/MAHOUT-509 ] .

Thank you

On Thu, Oct 7, 2010 at 12:57 PM, Ted Dunning <te...@gmail.com> wrote:
> Can you attach the patch there?  The mailing list strips attachments.
>
> On Wed, Oct 6, 2010 at 9:22 PM, Gangadhar Nittala
> <np...@gmail.com>wrote:
>
>> I have attached a patch which has the modified testclassifier.props
>> and the fix with the parseInt. I think both these belong to
>> MAHOUT-509
>>
>

Re: Options in TrainClassifier.java

Posted by Ted Dunning <te...@gmail.com>.
Can you attach the patch there?  The mailing list strips attachments.

On Wed, Oct 6, 2010 at 9:22 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> I have attached a patch which has the modified testclassifier.props
> and the fix with the parseInt. I think both these belong to
> MAHOUT-509
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe / others,

I was finally able to test the changes that were done as part of
MAHOUT-509 [ https://issues.apache.org/jira/browse/MAHOUT-509 ] and to
follow the instructions in the wiki for the Bayes example [
https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
]. The instructions in the wiki work only if testclassifier.props has
values for the required options. Otherwise, the user needs to provide
values on the command line for the datasource, classifiertype and the
n-gram size. TestClassifier executed and printed a large matrix of
values (though I still don't know how to interpret the results :) )
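
For reference, the props file route needs entries for those three options
plus the gram size. A purely hypothetical sketch (the key names here are
guesses based on the long option names in this thread, not verified against
the file Mahout ships):

```properties
# Hypothetical testclassifier.props entries -- key names are assumptions.
# Keep values free of surrounding whitespace to avoid the parseInt issue.
gramSize=1
classifierType=bayes
dataSource=hdfs
```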

Also, I found a minor problem in TestClassifier.java, wherein
Integer.parseInt is called on a command line option as read. If there
are any leading / trailing spaces in testclassifier.props, this
results in a NumberFormatException. The attached patch does a trim on
the string before doing the parseInt.
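
The failure mode and the fix are easy to demonstrate in isolation:
Integer.parseInt does not tolerate surrounding whitespace, so trimming
first is enough. A stand-alone sketch (parseGramSize is a hypothetical
helper, not the actual TestClassifier code):

```java
public class TrimBeforeParse {

  // The patched behavior: trim the raw option value before parsing it.
  static int parseGramSize(String raw) {
    return Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    String fromProps = " 1 "; // value with stray spaces, as read from a .props file

    // The unpatched call rejects surrounding whitespace outright.
    try {
      Integer.parseInt(fromProps);
      System.out.println("unpatched: parsed");
    } catch (NumberFormatException e) {
      System.out.println("unpatched: NumberFormatException");
    }

    // The patched call parses the same value fine.
    System.out.println("patched: " + parseGramSize(fromProps));
  }
}
```

Run as-is, the unpatched call throws NumberFormatException while
parseGramSize returns 1.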

I have attached a patch which has the modified testclassifier.props
and the fix with the parseInt. I think both these belong to
MAHOUT-509. If you think the wiki can be modified to include the
parameters instead of having settings in a .props file (preferring
clarity for the user over ease of use), then I can modify the wiki
instructions and remove the .props file from the patch.

The fix for TestClassifier.java, though, I think is required: it
sanitizes the user input.

I am not sure what the preferred approach is for providing patches for
a resolved issue. Should I create a new issue just for this, or would
it be easier to add this patch to the existing issue itself? Please
let me know and I shall create a new issue and attach the modified
patch file to it.

Thank you
Gangadhar
p.s: I named the patch file with an underscore as the existing issue
already has a MAHOUT-509.patch

On Sun, Sep 26, 2010 at 9:28 AM, Gangadhar Nittala
<np...@gmail.com> wrote:
> Joe,
> I am out of town for this week and won't have access to my machine. I
> will check this during the weekend and will get back to you. Will
> follow the steps in the wiki.
>
> Thank you
>
> On Fri, Sep 24, 2010 at 8:44 AM, Joe Kumar <jo...@gmail.com> wrote:
>> Hi Gangadhar,
>>
>> I ran TestClassifier with similar parameters. It didnt take me 2 hrs though.
>>
>> I have documented the steps that worked for me at
>> https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
>> Can you please get the patch available at MAHOUT-509 and apply it and then
>> try the steps in the wiki.
>> Please let me know if you still face issues.
>>
>> reg
>> Joe.
>>
>>
>> On Thu, Sep 23, 2010 at 10:43 PM, Gangadhar Nittala <npk.gangadhar@gmail.com
>>> wrote:
>>
>>> Joe,
>>> Can you let me know what was the command you used to test the
>>> classifier ? With the ngrams set to 1 as suggested by Robin, I was
>>> able to train the classifier. The command:
>>> $HADOOP_HOME/bin/hadoop jar
>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1
>>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>>> bayes --dataSource hdfs
>>>
>>> After this, as per the wiki, we need to get the data from HDFS. I did that
>>> <HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10
>>>
>>> After this, the classifier is to be tested:
>>> $HADOOP_HOME/bin/hadoop jar
>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10
>>> -d wikipediainput10  -ng 1 -type bayes -source hdfs
>>>
>>> When I run this, this runs for close to 2 hours and after 2 hours, it
>>> errors out with a java.io.FileException saying that the logs_ is a
>>> directory in the wikipediainput10 folder. I am sorry I can't provide
>>> the stack trace right now because I accidentally closed the terminal
>>> window before I could copy it. I will run this again and send the
>>> stack trace.
>>>
>>> But, if you can send me the steps that you followed after running the
>>> classifier, I can repeat those and see if I am able to successfully
>>> execute the classifier.
>>>
>>> Thank you
>>> Gangadhar
>>>
>>>
>>> >>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>>> >>>> wikipediamodel
>>> >>>> > -d
>>> >>>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>>> >>>> > > > >
>>> >>>> > > > > Summary
>>> >>>> > > > > -------------------------------------------------------
>>> >>>> > > > > Correctly Classified Instances          :          0
>>> ?%
>>> >>>> > > > > Incorrectly Classified Instances        :          0
>>> ?%
>>> >>>> > > > > Total Classified Instances              :          0
>>> >>>> > > > >
>>> >>>> > > > > =======================================================
>>> >>>> > > > > Confusion Matrix
>>> >>>> > > > > -------------------------------------------------------
>>> >>>> > > > > a     <--Classified as
>>> >>>> > > > > 0     |  0     a     = spain
>>> >>>> > > > > Default Category: unknown: 1
>>> >>>> > > > >
>>> >>>> > > > > I am not sure if I am doing something wrong.. have to figure
>>> out
>>> >>>> why
>>> >>>> > my
>>> >>>> > > > o/p
>>> >>>> > > > > is so blank.
>>> >>>> > > > > I'll document these steps and mention about country.txt in the
>>> >>>> wiki.
>>> >>>> > > > >
>>> >>>> > > > > Question to all
>>> >>>> > > > > Should we have 2 country.txt
>>> >>>> > > > >
>>> >>>> > > > >   1. country_full_list.txt - this is the existing list
>>> >>>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>>> >>>> > > > >
>>> >>>> > > > > To get a flavor of the wikipedia bayes example, we can use
>>> >>>> > > > > country_sample.txt. When new people want to just try out the
>>> >>>> example,
>>> >>>> > > > they
>>> >>>> > > > > can reference this txt file  as a parameter.
>>> >>>> > > > > To run the example in a robust scalable infrastructure, we
>>> could
>>> >>>> use
>>> >>>> > > > > country_full_list.txt.
>>> >>>> > > > > any thots ?
>>> >>>> > > > >
>>> >>>> > > > > regards
>>> >>>> > > > > Joe.
>>> >>>> > > > >
>>> >>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <
>>> joekumar@gmail.com>
>>> >>>> > wrote:
>>> >>>> > > > >
>>> >>>> > > > >> Gangadhar,
>>> >>>> > > > >>
>>> >>>> > > > >> After running TrainClassifier again, the map task just failed
>>> >>>> with
>>> >>>> > the
>>> >>>> > > > same
>>> >>>> > > > >> exception and I am pretty sure it is an issue with disk
>>> space.
>>> >>>> > > > >> As the map was progressing, I was monitoring my free disk
>>> space
>>> >>>> > > dropping
>>> >>>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>>> >>>> task
>>> >>>> > and
>>> >>>> > > > then
>>> >>>> > > > >> the exception happened. After the exception, another map task
>>> was
>>> >>>> > > > resuming
>>> >>>> > > > >> at 33% and I got close to 15GB free space (i guess the first
>>> map
>>> >>>> > task
>>> >>>> > > > freed
>>> >>>> > > > >> up some space) and I am sure they would drop down to zero
>>> again
>>> >>>> and
>>> >>>> > > > throw
>>> >>>> > > > >> the same exception.
>>> >>>> > > > >> I am going to modify the country.txt to just 1 country and
>>> >>>> recreate
>>> >>>> > > > >> wikipediainput and run TrainClassifier. Will let you know how
>>> it
>>> >>>> > > goes..
>>> >>>> > > > >>
>>> >>>> > > > >> Do we have any benchmarks / system requirements for running
>>> this
>>> >>>> > > example
>>> >>>> > > > ?
>>> >>>> > > > >> Has anyone else had success running this example anytime.
>>> Would
>>> >>>> > > > appreciate
>>> >>>> > > > >> your inputs / thots.
>>> >>>> > > > >>
>>> >>>> > > > >> Should we look at tuning the code for handling these
>>> situations ?
>>> >>>> > Any
>>> >>>> > > > quick
>>> >>>> > > > >> suggestions on where to start looking at ?
>>> >>>> > > > >>
>>> >>>> > > > >> regards,
>>> >>>> > > > >> Joe.
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >>
>>> >>>> > > > >
>>> >>>> > > >
>>> >>>> > >
>>> >>>> >
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>> >
>>>
>>
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
I am out of town for this week and won't have access to my machine. I
will check this during the weekend and will get back to you. Will
follow the steps in the wiki.

Thank you


Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Hi Gangadhar,

I ran TestClassifier with similar parameters. It didn't take me 2 hrs though.

I have documented the steps that worked for me at
https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
Can you please apply the patch available at MAHOUT-509 and then try the
steps in the wiki.
Please let me know if you still face issues.
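
For reference, the core of the MAHOUT-509 change discussed earlier in this
thread is making options such as --classifierType optional with a default,
instead of required. Below is only a hedged plain-Java sketch of that
fallback logic, not the actual patch (the real code uses Mahout's
commons-cli2 option builders; OptionDefaults, getWithDefault and
chooseTrainer are hypothetical names):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the MAHOUT-509 idea: options such as
// --classifierType fall back to a documented default ("bayes") when
// absent, rather than being declared required.
public class OptionDefaults {

    // Return the parsed option value, or the default when it is missing.
    static String getWithDefault(Map<String, String> parsed, String key, String def) {
        String v = parsed.get(key);
        return (v == null || v.isEmpty()) ? def : v;
    }

    // Mirrors the corrected branch order: only "cbayes" selects the
    // complementary trainer; anything else (including the default) is bayes.
    static String chooseTrainer(String classifierType) {
        if ("cbayes".equalsIgnoreCase(classifierType)) {
            return "trainCNaiveBayes";
        }
        return "trainNaiveBayes";
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new HashMap<String, String>();
        // --classifierType was not supplied on the command line:
        String type = getWithDefault(parsed, "classifierType", "bayes");
        System.out.println(type + " -> " + chooseTrainer(type));
    }
}
```

With this shape, omitting --classifierType trains plain bayes, which is
what the command-line help already promises.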

reg
Joe.



Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
Can you let me know what command you used to test the classifier ? With
the ngrams set to 1 as suggested by Robin, I was able to train the
classifier. The command:
$HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1
--input wikipediainput10 --output wikipediamodel10 --classifierType
bayes --dataSource hdfs
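
Robin's earlier point about --gramSize is easy to see with a quick sketch:
each position in the token stream yields one n-gram, and across a corpus
the trigram vocabulary is far larger than the unigram one because few
trigrams repeat, so gramSize 3 over Wikipedia produces a huge feature
space. This is only an illustration (NGramSketch and wordNGrams are
hypothetical helpers, not Mahout code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of why a larger gramSize inflates the
// feature count: every window of n consecutive tokens becomes a feature.
public class NGramSketch {

    static List<String> wordNGrams(String[] tokens, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder(tokens[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(tokens[i + j]);
            }
            grams.add(sb.toString());
        }
        return grams;
    }

    public static void main(String[] args) {
        String[] doc = "madrid is the capital of spain".split(" ");
        System.out.println(wordNGrams(doc, 1).size()); // 6 unigrams
        System.out.println(wordNGrams(doc, 3).size()); // 4 trigrams
    }
}
```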

After this, as per the wiki, we need to get the data from HDFS. I did that:
<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10

After this, the classifier is to be tested:
$HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10
-d wikipediainput10  -ng 1 -type bayes -source hdfs

When I run this, it runs for close to 2 hours and then errors out with a
java.io.FileException saying that logs_ is a directory in the
wikipediainput10 folder. I am sorry I can't provide the stack trace right
now because I accidentally closed the terminal window before I could copy
it. I will run this again and send the stack trace.
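
One possible cause (an assumption until we see the stack trace): job
output copied out of HDFS typically contains Hadoop side entries such as a
_logs directory and _SUCCESS marker alongside the part-* data files, and
opening one of those as a data file would fail. A hedged sketch of
skipping such entries when collecting input (SideFileFilter is a
hypothetical name, not the Mahout fix):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: skip Hadoop side entries (underscore- or
// dot-prefixed names) and subdirectories when gathering the test-set
// input, so a directory such as "_logs" is never opened as a data file.
public class SideFileFilter {

    static boolean isDataName(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    static List<File> dataFiles(File inputDir) {
        List<File> result = new ArrayList<File>();
        File[] entries = inputDir.listFiles();
        if (entries != null) {
            for (File f : entries) {
                if (f.isFile() && isDataName(f.getName())) {
                    result.add(f);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isDataName("part-00000")); // true
        System.out.println(isDataName("_logs"));      // false
    }
}
```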

But, if you can send me the steps that you followed after running the
classifier, I can repeat those and see if I am able to successfully
execute the classifier.

Thank you
Gangadhar


On Mon, Sep 20, 2010 at 11:13 PM, Gangadhar Nittala
<np...@gmail.com> wrote:
> Joe,
> I will try with the ngram setting of 1 and let you know how it goes.
> Robin, the ngram parameter is used to check the number of subsequences
> of characters isn't it ? Or is it evaluated differently w.r.t to the
> Bayesian classifier ?
>
> Ted, like Joe mentioned, if you could point us to some information on
> SGD we could try it and report back the results to the list.
>
> Thank you
> Gangadhar
>
> On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <jo...@gmail.com> wrote:
>> Robin / Gangadhar,
>> With ngram as 1 and all the countries in the country.txt , the model is
>> getting created without any issues.
>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
>> -o wikipediamodel -type bayes -source hdfs
>>
>> Robin,
>> Even for ngram parameter, the default value is mentioned as 1 but it is set
>> as a mandatory parameter in TrainClassifier. so i'll modify the code to set
>> the default ngram as 1 and make it as a non mandatory param.
>>
>> That aside, When I try to test the model, the summary is getting printed
>> like below.
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances          :          0         ?%
>> Incorrectly Classified Instances        :          0         ?%
>> Total Classified Instances              :          0
>> Need to figure out the reason..
>>
>> Since TestClassifier also has the same params and settings like
>> TrainClassifier, can i modify it to set the default values for ngram,
>> classifierType & dataSource ?
>>
>> reg,
>> Joe.
>>
>> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <jo...@gmail.com> wrote:
>>
>>> Robin,
>>>
>>> Thanks for your tip.
>>> Will try it out and post updates.
>>>
>>> reg
>>> Joe.
>>>
>>>
>>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com> wrote:
>>>
>>>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
>>>> need atleast 2 countries. otherwise there is no classification. Secondly
>>>> ngram =3 is a bit too high. With wikipedia this will result in a huge
>>>> number
>>>> of features. Why dont you try with one and see.
>>>>
>>>> Robin
>>>>
>>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:
>>>>
>>>> > Hi Ted,
>>>> >
>>>> > sure. will keep digging..
>>>> >
>>>> > About SGD, I dont have an idea about how it works et al. If there is
>>>> some
>>>> > documentation / reference / quick summary to read about it that'll be
>>>> gr8.
>>>> > Just saw one reference in
>>>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>>> >
>>>> > I am assuming we should be able to create a model from wikipedia
>>>> articles
>>>> > and label the country of a new article. If so, could you please provide
>>>> a
>>>> > note on how to do this. We already have the wikipedia data being
>>>> extracted
>>>> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
>>>> > about training the classifier using SGD ?
>>>> >
>>>> > thanks for your help,
>>>> > Joe.
>>>> >
>>>> >
>>>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > I am watching these efforts with interest, but have been unable to
>>>> > > contribute much to the process.  I would encourage Joe and others to
>>>> keep
>>>> > > whittling this problem down so that we can understand what is causing
>>>> it.
>>>> > >
>>>> > > In the meantime, I think that the SGD classifiers are close to
>>>> production
>>>> > > quality.  For problems with less than several million training
>>>> examples,
>>>> > > and
>>>> > > especially problems with many sparse features, I think that these
>>>> > > classifiers might be easier to get started with than the Naive Bayes
>>>> > > classifiers.  To make a virtue of a defect, the SGD based classifiers
>>>> to
>>>> > > not
>>>> > > use Hadoop for training.  This makes deployment of a classification
>>>> > > training
>>>> > > workflow easier, but limits the total size of data that can be
>>>> handled.
>>>> > >
>>>> > > What would you guys need to get started with trying these alternative
>>>> > > models?
>>>> > >
>>>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>>>> > > <np...@gmail.com>wrote:
>>>> > >
>>>> > > > Joe,
>>>> > > > Even I tried with reducing the number of countries in the
>>>> country.txt.
>>>> > > > That didn't help. And in my case, I was monitoring the disk space
>>>> and
>>>> > > > at no time did it reach 0%. So, I am not sure if that is the case.
>>>> To
>>>> > > > remove the dependency on the number of countries, I even tried with
>>>> > > > the subjects.txt as the classification - that also did not help.
>>>> > > > I think this problem is due to the type of the data being processed,
>>>> > > > but what I am not sure of is what I need to change to get the data
>>>> to
>>>> > > > be processed successfully.
>>>> > > >
>>>> > > > The experienced folks on Mahout will be able to tell us what is
>>>> missing
>>>> > I
>>>> > > > guess.
>>>> > > >
>>>> > > > Thank you
>>>> > > > Gangadhar
>>>> > > >
>>>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com>
>>>> wrote:
>>>> > > > > Gangadhar,
>>>> > > > >
>>>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>>>> > just
>>>> > > > have
>>>> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create
>>>> the
>>>> > > > > wikipediainput data set and then ran TrainClassifier and it
>>>> worked.
>>>> > > when
>>>> > > > I
>>>> > > > > ran TestClassifier as below, I got blank results in the output.
>>>> > > > >
>>>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>>>> wikipediamodel
>>>> > -d
>>>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>>>> > > > >
>>>> > > > > Summary
>>>> > > > > -------------------------------------------------------
>>>> > > > > Correctly Classified Instances          :          0         ?%
>>>> > > > > Incorrectly Classified Instances        :          0         ?%
>>>> > > > > Total Classified Instances              :          0
>>>> > > > >
>>>> > > > > =======================================================
>>>> > > > > Confusion Matrix
>>>> > > > > -------------------------------------------------------
>>>> > > > > a     <--Classified as
>>>> > > > > 0     |  0     a     = spain
>>>> > > > > Default Category: unknown: 1
>>>> > > > >
>>>> > > > > I am not sure if I am doing something wrong.. have to figure out
>>>> why
>>>> > my
>>>> > > > o/p
>>>> > > > > is so blank.
>>>> > > > > I'll document these steps and mention about country.txt in the
>>>> wiki.
>>>> > > > >
>>>> > > > > Question to all
>>>> > > > > Should we have 2 country.txt
>>>> > > > >
>>>> > > > >   1. country_full_list.txt - this is the existing list
>>>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>>>> > > > >
>>>> > > > > To get a flavor of the wikipedia bayes example, we can use
>>>> > > > > country_sample.txt. When new people want to just try out the
>>>> example,
>>>> > > > they
>>>> > > > > can reference this txt file  as a parameter.
>>>> > > > > To run the example in a robust scalable infrastructure, we could
>>>> use
>>>> > > > > country_full_list.txt.
>>>> > > > > any thots ?
>>>> > > > >
>>>> > > > > regards
>>>> > > > > Joe.
>>>> > > > >
>>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
>>>> > wrote:
>>>> > > > >
>>>> > > > >> Gangadhar,
>>>> > > > >>
>>>> > > > >> After running TrainClassifier again, the map task just failed
>>>> with
>>>> > the
>>>> > > > same
>>>> > > > >> exception and I am pretty sure it is an issue with disk space.
>>>> > > > >> As the map was progressing, I was monitoring my free disk space
>>>> > > dropping
>>>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>>>> task
>>>> > and
>>>> > > > then
>>>> > > > >> the exception happened. After the exception, another map task was
>>>> > > > resuming
>>>> > > > >> at 33% and I got close to 15GB free space (i guess the first map
>>>> > task
>>>> > > > freed
>>>> > > > >> up some space) and I am sure they would drop down to zero again
>>>> and
>>>> > > > throw
>>>> > > > >> the same exception.
>>>> > > > >> I am going to modify the country.txt to just 1 country and
>>>> recreate
>>>> > > > >> wikipediainput and run TrainClassifier. Will let you know how it
>>>> > > goes..
>>>> > > > >>
>>>> > > > >> Do we have any benchmarks / system requirements for running this
>>>> > > example
>>>> > > > ?
>>>> > > > >> Has anyone else had success running this example anytime. Would
>>>> > > > appreciate
>>>> > > > >> your inputs / thots.
>>>> > > > >>
>>>> > > > >> Should we look at tuning the code for handling these situations ?
>>>> > Any
>>>> > > > quick
>>>> > > > >> suggestions on where to start looking at ?
>>>> > > > >>
>>>> > > > >> regards,
>>>> > > > >> Joe.
>>>> > > > >>
>>>> > > > >>
>>>> > > > >>
>>>> > > > >>
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
I will try with the ngram setting of 1 and let you know how it goes.
Robin, the ngram parameter is used to set the number of character
subsequences, isn't it ? Or is it evaluated differently w.r.t. the
Bayesian classifier ?

Ted, like Joe mentioned, if you could point us to some information on
SGD, we could try it and report the results back to the list.

Thank you
Gangadhar

On Mon, Sep 20, 2010 at 10:30 PM, Joe Kumar <jo...@gmail.com> wrote:
> Robin / Gangadhar,
> With ngram as 1 and all the countries in the country.txt , the model is
> getting created without any issues.
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
> -o wikipediamodel -type bayes -source hdfs
>
> Robin,
> Even for ngram parameter, the default value is mentioned as 1 but it is set
> as a mandatory parameter in TrainClassifier. so i'll modify the code to set
> the default ngram as 1 and make it as a non mandatory param.
>
> That aside, When I try to test the model, the summary is getting printed
> like below.
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0         ?%
> Incorrectly Classified Instances        :          0         ?%
> Total Classified Instances              :          0
> Need to figure out the reason..
>
> Since TestClassifier also has the same params and settings like
> TrainClassifier, can i modify it to set the default values for ngram,
> classifierType & dataSource ?
>
> reg,
> Joe.
>
> On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <jo...@gmail.com> wrote:
>
>> Robin,
>>
>> Thanks for your tip.
>> Will try it out and post updates.
>>
>> reg
>> Joe.
>>
>>
>> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com> wrote:
>>
>>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
>>> need atleast 2 countries. otherwise there is no classification. Secondly
>>> ngram =3 is a bit too high. With wikipedia this will result in a huge
>>> number
>>> of features. Why dont you try with one and see.
>>>
>>> Robin
>>>
>>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:
>>>
>>> > Hi Ted,
>>> >
>>> > sure. will keep digging..
>>> >
>>> > About SGD, I dont have an idea about how it works et al. If there is
>>> some
>>> > documentation / reference / quick summary to read about it that'll be
>>> gr8.
>>> > Just saw one reference in
>>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>>> >
>>> > I am assuming we should be able to create a model from wikipedia
>>> articles
>>> > and label the country of a new article. If so, could you please provide
>>> a
>>> > note on how to do this. We already have the wikipedia data being
>>> extracted
>>> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
>>> > about training the classifier using SGD ?
>>> >
>>> > thanks for your help,
>>> > Joe.
>>> >
>>> >
>>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
>>> > wrote:
>>> >
>>> > > I am watching these efforts with interest, but have been unable to
>>> > > contribute much to the process.  I would encourage Joe and others to
>>> keep
>>> > > whittling this problem down so that we can understand what is causing
>>> it.
>>> > >
>>> > > In the meantime, I think that the SGD classifiers are close to
>>> production
>>> > > quality.  For problems with less than several million training
>>> examples,
>>> > > and
>>> > > especially problems with many sparse features, I think that these
>>> > > classifiers might be easier to get started with than the Naive Bayes
>>> > > classifiers.  To make a virtue of a defect, the SGD based classifiers
>>> to
>>> > > not
>>> > > use Hadoop for training.  This makes deployment of a classification
>>> > > training
>>> > > workflow easier, but limits the total size of data that can be
>>> handled.
>>> > >
>>> > > What would you guys need to get started with trying these alternative
>>> > > models?
>>> > >
>>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>>> > > <np...@gmail.com>wrote:
>>> > >
>>> > > > Joe,
>>> > > > Even I tried with reducing the number of countries in the
>>> country.txt.
>>> > > > That didn't help. And in my case, I was monitoring the disk space
>>> and
>>> > > > at no time did it reach 0%. So, I am not sure if that is the case.
>>> To
>>> > > > remove the dependency on the number of countries, I even tried with
>>> > > > the subjects.txt as the classification - that also did not help.
>>> > > > I think this problem is due to the type of the data being processed,
>>> > > > but what I am not sure of is what I need to change to get the data
>>> to
>>> > > > be processed successfully.
>>> > > >
>>> > > > The experienced folks on Mahout will be able to tell us what is
>>> missing
>>> > I
>>> > > > guess.
>>> > > >
>>> > > > Thank you
>>> > > > Gangadhar
>>> > > >
>>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com>
>>> wrote:
>>> > > > > Gangadhar,
>>> > > > >
>>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>>> > just
>>> > > > have
>>> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create
>>> the
>>> > > > > wikipediainput data set and then ran TrainClassifier and it
>>> worked.
>>> > > when
>>> > > > I
>>> > > > > ran TestClassifier as below, I got blank results in the output.
>>> > > > >
>>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>>> wikipediamodel
>>> > -d
>>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>>> > > > >
>>> > > > > Summary
>>> > > > > -------------------------------------------------------
>>> > > > > Correctly Classified Instances          :          0         ?%
>>> > > > > Incorrectly Classified Instances        :          0         ?%
>>> > > > > Total Classified Instances              :          0
>>> > > > >
>>> > > > > =======================================================
>>> > > > > Confusion Matrix
>>> > > > > -------------------------------------------------------
>>> > > > > a     <--Classified as
>>> > > > > 0     |  0     a     = spain
>>> > > > > Default Category: unknown: 1
>>> > > > >
>>> > > > > I am not sure if I am doing something wrong.. have to figure out
>>> why
>>> > my
>>> > > > o/p
>>> > > > > is so blank.
>>> > > > > I'll document these steps and mention about country.txt in the
>>> wiki.
>>> > > > >
>>> > > > > Question to all
>>> > > > > Should we have 2 country.txt
>>> > > > >
>>> > > > >   1. country_full_list.txt - this is the existing list
>>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>>> > > > >
>>> > > > > To get a flavor of the wikipedia bayes example, we can use
>>> > > > > country_sample.txt. When new people want to just try out the
>>> example,
>>> > > > they
>>> > > > > can reference this txt file  as a parameter.
>>> > > > > To run the example in a robust scalable infrastructure, we could
>>> use
>>> > > > > country_full_list.txt.
>>> > > > > any thots ?
>>> > > > >
>>> > > > > regards
>>> > > > > Joe.
>>> > > > >
>>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
>>> > wrote:
>>> > > > >
>>> > > > >> Gangadhar,
>>> > > > >>
>>> > > > >> After running TrainClassifier again, the map task just failed
>>> with
>>> > the
>>> > > > same
>>> > > > >> exception and I am pretty sure it is an issue with disk space.
>>> > > > >> As the map was progressing, I was monitoring my free disk space
>>> > > dropping
>>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>>> task
>>> > and
>>> > > > then
>>> > > > >> the exception happened. After the exception, another map task was
>>> > > > resuming
>>> > > > >> at 33% and I got close to 15GB free space (i guess the first map
>>> > task
>>> > > > freed
>>> > > > >> up some space) and I am sure they would drop down to zero again
>>> and
>>> > > > throw
>>> > > > >> the same exception.
>>> > > > >> I am going to modify the country.txt to just 1 country and
>>> recreate
>>> > > > >> wikipediainput and run TrainClassifier. Will let you know how it
>>> > > goes..
>>> > > > >>
>>> > > > >> Do we have any benchmarks / system requirements for running this
>>> > > example
>>> > > > ?
>>> > > > >> Has anyone else had success running this example anytime. Would
>>> > > > appreciate
>>> > > > >> your inputs / thots.
>>> > > > >>
>>> > > > >> Should we look at tuning the code for handling these situations ?
>>> > Any
>>> > > > quick
>>> > > > >> suggestions on where to start looking at ?
>>> > > > >>
>>> > > > >> regards,
>>> > > > >> Joe.
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>>
>>
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Robin / Gangadhar,
With ngram as 1 and all the countries in country.txt, the model is
created without any issues:
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput
-o wikipediamodel -type bayes -source hdfs

Robin,
Even for the ngram parameter, the default value is documented as 1, but
it is set as a mandatory parameter in TrainClassifier. So I'll modify
the code to set the default ngram to 1 and make it a non-mandatory param.
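The change amounts to registering each option as optional with a
default instead of required. In argparse terms (an analogy only, not
the commons-cli builder code TrainClassifier actually uses; option
names mirror the commands in this thread):

```python
import argparse

parser = argparse.ArgumentParser(
    description="analogy for TrainClassifier's option defaults")
# required=False plus a default mirrors "documented default, non-mandatory":
parser.add_argument("--gramSize", "-ng", type=int, default=1,
                    required=False)
parser.add_argument("--classifierType", "-type", default="bayes",
                    required=False)
parser.add_argument("--dataSource", "-source", default="hdfs",
                    required=False)

# Omitting every option now falls back to the documented defaults.
args = parser.parse_args([])
print(args.gramSize, args.classifierType, args.dataSource)
```

With required options and "defaults" only in the help text, the two
disagree; making the defaults real keeps the help honest.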

That aside, when I try to test the model, the summary is printed like
below.
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0         ?%
Incorrectly Classified Instances        :          0         ?%
Total Classified Instances              :          0
I need to figure out the reason.
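The "?%" presumably comes from guarding a division by zero: with zero
total classified instances there is no percentage to report. A minimal
sketch of such a summary line (an assumption about the formatting, not
the actual TestClassifier code):

```python
def accuracy_line(label, count, total):
    """Format one summary line, printing '?' for the percentage when
    the total is zero instead of dividing by zero."""
    pct = "?" if total == 0 else "%.2f" % (100.0 * count / total)
    return "%-40s: %10d %9s%%" % (label, count, pct)

print(accuracy_line("Correctly Classified Instances", 0, 0))
print(accuracy_line("Correctly Classified Instances", 42, 50))
```

So the real question is why zero documents reached the classifier at
all, not the "?" itself.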

Since TestClassifier has the same params and settings as
TrainClassifier, can I modify it too to set the default values for
ngram, classifierType & dataSource ?

reg,
Joe.

On Mon, Sep 20, 2010 at 1:09 PM, Joe Kumar <jo...@gmail.com> wrote:

> Robin,
>
> Thanks for your tip.
> Will try it out and post updates.
>
> reg
> Joe.
>
>
> On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com> wrote:
>
>> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
>> need atleast 2 countries. otherwise there is no classification. Secondly
>> ngram =3 is a bit too high. With wikipedia this will result in a huge
>> number
>> of features. Why dont you try with one and see.
>>
>> Robin
>>
>> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:
>>
>> > Hi Ted,
>> >
>> > sure. will keep digging..
>> >
>> > About SGD, I dont have an idea about how it works et al. If there is
>> some
>> > documentation / reference / quick summary to read about it that'll be
>> gr8.
>> > Just saw one reference in
>> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>> >
>> > I am assuming we should be able to create a model from wikipedia
>> articles
>> > and label the country of a new article. If so, could you please provide
>> a
>> > note on how to do this. We already have the wikipedia data being
>> extracted
>> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
>> > about training the classifier using SGD ?
>> >
>> > thanks for your help,
>> > Joe.
>> >
>> >
>> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> >
>> > > I am watching these efforts with interest, but have been unable to
>> > > contribute much to the process.  I would encourage Joe and others to
>> keep
>> > > whittling this problem down so that we can understand what is causing
>> it.
>> > >
>> > > In the meantime, I think that the SGD classifiers are close to
>> production
>> > > quality.  For problems with less than several million training
>> examples,
>> > > and
>> > > especially problems with many sparse features, I think that these
>> > > classifiers might be easier to get started with than the Naive Bayes
>> > > classifiers.  To make a virtue of a defect, the SGD based classifiers
>> to
>> > > not
>> > > use Hadoop for training.  This makes deployment of a classification
>> > > training
>> > > workflow easier, but limits the total size of data that can be
>> handled.
>> > >
>> > > What would you guys need to get started with trying these alternative
>> > > models?
>> > >
>> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
>> > > <np...@gmail.com>wrote:
>> > >
>> > > > Joe,
>> > > > Even I tried with reducing the number of countries in the
>> country.txt.
>> > > > That didn't help. And in my case, I was monitoring the disk space
>> and
>> > > > at no time did it reach 0%. So, I am not sure if that is the case.
>> To
>> > > > remove the dependency on the number of countries, I even tried with
>> > > > the subjects.txt as the classification - that also did not help.
>> > > > I think this problem is due to the type of the data being processed,
>> > > > but what I am not sure of is what I need to change to get the data
>> to
>> > > > be processed successfully.
>> > > >
>> > > > The experienced folks on Mahout will be able to tell us what is
>> missing
>> > I
>> > > > guess.
>> > > >
>> > > > Thank you
>> > > > Gangadhar
>> > > >
>> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com>
>> wrote:
>> > > > > Gangadhar,
>> > > > >
>> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
>> > just
>> > > > have
>> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create
>> the
>> > > > > wikipediainput data set and then ran TrainClassifier and it
>> worked.
>> > > when
>> > > > I
>> > > > > ran TestClassifier as below, I got blank results in the output.
>> > > > >
>> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m
>> wikipediamodel
>> > -d
>> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
>> > > > >
>> > > > > Summary
>> > > > > -------------------------------------------------------
>> > > > > Correctly Classified Instances          :          0         ?%
>> > > > > Incorrectly Classified Instances        :          0         ?%
>> > > > > Total Classified Instances              :          0
>> > > > >
>> > > > > =======================================================
>> > > > > Confusion Matrix
>> > > > > -------------------------------------------------------
>> > > > > a     <--Classified as
>> > > > > 0     |  0     a     = spain
>> > > > > Default Category: unknown: 1
>> > > > >
>> > > > > I am not sure if I am doing something wrong.. have to figure out
>> why
>> > my
>> > > > o/p
>> > > > > is so blank.
>> > > > > I'll document these steps and mention about country.txt in the
>> wiki.
>> > > > >
>> > > > > Question to all
>> > > > > Should we have 2 country.txt
>> > > > >
>> > > > >   1. country_full_list.txt - this is the existing list
>> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
>> > > > >
>> > > > > To get a flavor of the wikipedia bayes example, we can use
>> > > > > country_sample.txt. When new people want to just try out the
>> example,
>> > > > they
>> > > > > can reference this txt file  as a parameter.
>> > > > > To run the example in a robust scalable infrastructure, we could
>> use
>> > > > > country_full_list.txt.
>> > > > > any thots ?
>> > > > >
>> > > > > regards
>> > > > > Joe.
>> > > > >
>> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
>> > wrote:
>> > > > >
>> > > > >> Gangadhar,
>> > > > >>
>> > > > >> After running TrainClassifier again, the map task just failed
>> with
>> > the
>> > > > same
>> > > > >> exception and I am pretty sure it is an issue with disk space.
>> > > > >> As the map was progressing, I was monitoring my free disk space
>> > > dropping
>> > > > >> from 81GB. It came down to 0 after almost 66% through the map
>> task
>> > and
>> > > > then
>> > > > >> the exception happened. After the exception, another map task was
>> > > > resuming
>> > > > >> at 33% and I got close to 15GB free space (i guess the first map
>> > task
>> > > > freed
>> > > > >> up some space) and I am sure they would drop down to zero again
>> and
>> > > > throw
>> > > > >> the same exception.
>> > > > >> I am going to modify the country.txt to just 1 country and
>> recreate
>> > > > >> wikipediainput and run TrainClassifier. Will let you know how it
>> > > goes..
>> > > > >>
>> > > > >> Do we have any benchmarks / system requirements for running this
>> > > example
>> > > > ?
>> > > > >> Has anyone else had success running this example anytime. Would
>> > > > appreciate
>> > > > >> your inputs / thots.
>> > > > >>
>> > > > >> Should we look at tuning the code for handling these situations ?
>> > Any
>> > > > quick
>> > > > >> suggestions on where to start looking at ?
>> > > > >>
>> > > > >> regards,
>> > > > >> Joe.
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
>
>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Robin,

Thanks for your tip.
Will try it out and post updates.

reg
Joe.

On Mon, Sep 20, 2010 at 6:31 AM, Robin Anil <ro...@gmail.com> wrote:

> Hi Guys, Sorry about not replying, I see two problems(possible). 1st. You
> need atleast 2 countries. otherwise there is no classification. Secondly
> ngram =3 is a bit too high. With wikipedia this will result in a huge
> number
> of features. Why dont you try with one and see.
>
> Robin
>
> On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > sure. will keep digging..
> >
> > About SGD, I dont have an idea about how it works et al. If there is some
> > documentation / reference / quick summary to read about it that'll be
> gr8.
> > Just saw one reference in
> > https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
> >
> > I am assuming we should be able to create a model from wikipedia articles
> > and label the country of a new article. If so, could you please provide a
> > note on how to do this. We already have the wikipedia data being
> extracted
> > for specific countries using WikipediaDatasetCreatorDriver. How do we go
> > about training the classifier using SGD ?
> >
> > thanks for your help,
> > Joe.
> >
> >
> > On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > I am watching these efforts with interest, but have been unable to
> > > contribute much to the process.  I would encourage Joe and others to
> keep
> > > whittling this problem down so that we can understand what is causing
> it.
> > >
> > > In the meantime, I think that the SGD classifiers are close to
> production
> > > quality.  For problems with less than several million training
> examples,
> > > and
> > > especially problems with many sparse features, I think that these
> > > classifiers might be easier to get started with than the Naive Bayes
> > > classifiers.  To make a virtue of a defect, the SGD based classifiers
> to
> > > not
> > > use Hadoop for training.  This makes deployment of a classification
> > > training
> > > workflow easier, but limits the total size of data that can be handled.
> > >
> > > What would you guys need to get started with trying these alternative
> > > models?
> > >
> > > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> > > <np...@gmail.com>wrote:
> > >
> > > > Joe,
> > > > Even I tried with reducing the number of countries in the
> country.txt.
> > > > That didn't help. And in my case, I was monitoring the disk space and
> > > > at no time did it reach 0%. So, I am not sure if that is the case. To
> > > > remove the dependency on the number of countries, I even tried with
> > > > the subjects.txt as the classification - that also did not help.
> > > > I think this problem is due to the type of the data being processed,
> > > > but what I am not sure of is what I need to change to get the data to
> > > > be processed successfully.
> > > >
> > > > The experienced folks on Mahout will be able to tell us what is
> missing
> > I
> > > > guess.
> > > >
> > > > Thank you
> > > > Gangadhar
> > > >
> > > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com>
> wrote:
> > > > > Gangadhar,
> > > > >
> > > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
> > just
> > > > have
> > > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create
> the
> > > > > wikipediainput data set and then ran TrainClassifier and it worked.
> > > when
> > > > I
> > > > > ran TestClassifier as below, I got blank results in the output.
> > > > >
> > > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
> > -d
> > > > >  wikipediainput  -ng 3 -type bayes -source hdfs
> > > > >
> > > > > Summary
> > > > > -------------------------------------------------------
> > > > > Correctly Classified Instances          :          0         ?%
> > > > > Incorrectly Classified Instances        :          0         ?%
> > > > > Total Classified Instances              :          0
> > > > >
> > > > > =======================================================
> > > > > Confusion Matrix
> > > > > -------------------------------------------------------
> > > > > a     <--Classified as
> > > > > 0     |  0     a     = spain
> > > > > Default Category: unknown: 1
> > > > >
> > > > > I am not sure if I am doing something wrong.. have to figure out
> why
> > my
> > > > o/p
> > > > > is so blank.
> > > > > I'll document these steps and mention about country.txt in the
> wiki.
> > > > >
> > > > > Question to all
> > > > > Should we have 2 country.txt
> > > > >
> > > > >   1. country_full_list.txt - this is the existing list
> > > > >   2. country_sample_list.txt - a list with 2 or 3 countries
> > > > >
> > > > > To get a flavor of the wikipedia bayes example, we can use
> > > > > country_sample.txt. When new people want to just try out the
> example,
> > > > they
> > > > > can reference this txt file  as a parameter.
> > > > > To run the example in a robust scalable infrastructure, we could
> use
> > > > > country_full_list.txt.
> > > > > any thots ?
> > > > >
> > > > > regards
> > > > > Joe.
> > > > >
> > > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
> > wrote:
> > > > >
> > > > >> Gangadhar,
> > > > >>
> > > > >> After running TrainClassifier again, the map task just failed with
> > the
> > > > same
> > > > >> exception and I am pretty sure it is an issue with disk space.
> > > > >> As the map was progressing, I was monitoring my free disk space
> > > dropping
> > > > >> from 81GB. It came down to 0 after almost 66% through the map task
> > and
> > > > then
> > > > >> the exception happened. After the exception, another map task was
> > > > resuming
> > > > >> at 33% and I got close to 15GB free space (i guess the first map
> > task
> > > > freed
> > > > >> up some space) and I am sure they would drop down to zero again
> and
> > > > throw
> > > > >> the same exception.
> > > > >> I am going to modify the country.txt to just 1 country and
> recreate
> > > > >> wikipediainput and run TrainClassifier. Will let you know how it
> > > goes..
> > > > >>
> > > > >> Do we have any benchmarks / system requirements for running this
> > > example
> > > > ?
> > > > >> Has anyone else had success running this example anytime. Would
> > > > appreciate
> > > > >> your inputs / thots.
> > > > >>
> > > > >> Should we look at tuning the code for handling these situations ?
> > Any
> > > > quick
> > > > >> suggestions on where to start looking at ?
> > > > >>
> > > > >> regards,
> > > > >> Joe.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Options in TrainClassifier.java

Posted by Robin Anil <ro...@gmail.com>.
Hi guys, sorry about not replying. I see two possible problems. First, you
need at least two countries; otherwise there is no classification. Second,
ngram = 3 is a bit too high: with Wikipedia this will result in a huge number
of features. Why don't you try with one and see?
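The feature blowup Robin mentions is easy to see with a toy count of distinct word n-grams. This is a hedged sketch: plain whitespace tokenization stands in for Mahout's analyzer, so the numbers are only illustrative.

```java
import java.util.*;

public class NgramCount {

  // Count distinct word n-grams in a text, to illustrate why a larger n
  // inflates the feature space. (Whitespace tokenization is an assumption
  // here, a toy stand-in for Mahout's analyzer.)
  static int distinctNgrams(String text, int n) {
    String[] tokens = text.toLowerCase().split("\\s+");
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + n <= tokens.length; i++) {
      grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
    }
    return grams.size();
  }

  public static void main(String[] args) {
    String text = "the quick brown fox jumps over the lazy dog the quick brown fox";
    System.out.println(distinctNgrams(text, 1)); // small unigram vocabulary
    System.out.println(distinctNgrams(text, 3)); // already more distinct trigrams
  }
}
```

On a corpus the size of Wikipedia, the trigram vocabulary dwarfs the unigram one, which is why dropping --gramSize back to 1 is a sensible first experiment.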

Robin

On Mon, Sep 20, 2010 at 12:08 PM, Joe Kumar <jo...@gmail.com> wrote:

> Hi Ted,
>
> sure. will keep digging..
>
> About SGD, I dont have an idea about how it works et al. If there is some
> documentation / reference / quick summary to read about it that'll be gr8.
> Just saw one reference in
> https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.
>
> I am assuming we should be able to create a model from wikipedia articles
> and label the country of a new article. If so, could you please provide a
> note on how to do this. We already have the wikipedia data being extracted
> for specific countries using WikipediaDatasetCreatorDriver. How do we go
> about training the classifier using SGD ?
>
> thanks for your help,
> Joe.
>
>
> On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > I am watching these efforts with interest, but have been unable to
> > contribute much to the process.  I would encourage Joe and others to keep
> > whittling this problem down so that we can understand what is causing it.
> >
> > In the meantime, I think that the SGD classifiers are close to production
> > quality.  For problems with less than several million training examples,
> > and
> > especially problems with many sparse features, I think that these
> > classifiers might be easier to get started with than the Naive Bayes
> > classifiers.  To make a virtue of a defect, the SGD based classifiers to
> > not
> > use Hadoop for training.  This makes deployment of a classification
> > training
> > workflow easier, but limits the total size of data that can be handled.
> >
> > What would you guys need to get started with trying these alternative
> > models?
> >
> > On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> > <np...@gmail.com>wrote:
> >
> > > Joe,
> > > Even I tried with reducing the number of countries in the country.txt.
> > > That didn't help. And in my case, I was monitoring the disk space and
> > > at no time did it reach 0%. So, I am not sure if that is the case. To
> > > remove the dependency on the number of countries, I even tried with
> > > the subjects.txt as the classification - that also did not help.
> > > I think this problem is due to the type of the data being processed,
> > > but what I am not sure of is what I need to change to get the data to
> > > be processed successfully.
> > >
> > > The experienced folks on Mahout will be able to tell us what is missing
> I
> > > guess.
> > >
> > > Thank you
> > > Gangadhar
> > >
> > > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> > > > Gangadhar,
> > > >
> > > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to
> just
> > > have
> > > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > > > wikipediainput data set and then ran TrainClassifier and it worked.
> > when
> > > I
> > > > ran TestClassifier as below, I got blank results in the output.
> > > >
> > > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel
> -d
> > > >  wikipediainput  -ng 3 -type bayes -source hdfs
> > > >
> > > > Summary
> > > > -------------------------------------------------------
> > > > Correctly Classified Instances          :          0         ?%
> > > > Incorrectly Classified Instances        :          0         ?%
> > > > Total Classified Instances              :          0
> > > >
> > > > =======================================================
> > > > Confusion Matrix
> > > > -------------------------------------------------------
> > > > a     <--Classified as
> > > > 0     |  0     a     = spain
> > > > Default Category: unknown: 1
> > > >
> > > > I am not sure if I am doing something wrong.. have to figure out why
> my
> > > o/p
> > > > is so blank.
> > > > I'll document these steps and mention about country.txt in the wiki.
> > > >
> > > > Question to all
> > > > Should we have 2 country.txt
> > > >
> > > >   1. country_full_list.txt - this is the existing list
> > > >   2. country_sample_list.txt - a list with 2 or 3 countries
> > > >
> > > > To get a flavor of the wikipedia bayes example, we can use
> > > > country_sample.txt. When new people want to just try out the example,
> > > they
> > > > can reference this txt file  as a parameter.
> > > > To run the example in a robust scalable infrastructure, we could use
> > > > country_full_list.txt.
> > > > any thots ?
> > > >
> > > > regards
> > > > Joe.
> > > >
> > > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com>
> wrote:
> > > >
> > > >> Gangadhar,
> > > >>
> > > >> After running TrainClassifier again, the map task just failed with
> the
> > > same
> > > >> exception and I am pretty sure it is an issue with disk space.
> > > >> As the map was progressing, I was monitoring my free disk space
> > dropping
> > > >> from 81GB. It came down to 0 after almost 66% through the map task
> and
> > > then
> > > >> the exception happened. After the exception, another map task was
> > > resuming
> > > >> at 33% and I got close to 15GB free space (i guess the first map
> task
> > > freed
> > > >> up some space) and I am sure they would drop down to zero again and
> > > throw
> > > >> the same exception.
> > > >> I am going to modify the country.txt to just 1 country and recreate
> > > >> wikipediainput and run TrainClassifier. Will let you know how it
> > goes..
> > > >>
> > > >> Do we have any benchmarks / system requirements for running this
> > example
> > > ?
> > > >> Has anyone else had success running this example anytime. Would
> > > appreciate
> > > >> your inputs / thots.
> > > >>
> > > >> Should we look at tuning the code for handling these situations ?
> Any
> > > quick
> > > >> suggestions on where to start looking at ?
> > > >>
> > > >> regards,
> > > >> Joe.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Hi Ted,

sure. will keep digging..

About SGD, I don't have an idea of how it works. If there is some
documentation / reference / quick summary to read about it, that would be
great. I just saw one reference at
https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.

I am assuming we should be able to create a model from Wikipedia articles
and label the country of a new article. If so, could you please provide a
note on how to do this? We already have the Wikipedia data being extracted
for specific countries using WikipediaDatasetCreatorDriver. How do we go
about training the classifier with SGD?

thanks for your help,
Joe.


On Sun, Sep 19, 2010 at 11:25 PM, Ted Dunning <te...@gmail.com> wrote:

> I am watching these efforts with interest, but have been unable to
> contribute much to the process.  I would encourage Joe and others to keep
> whittling this problem down so that we can understand what is causing it.
>
> In the meantime, I think that the SGD classifiers are close to production
> quality.  For problems with less than several million training examples,
> and
> especially problems with many sparse features, I think that these
> classifiers might be easier to get started with than the Naive Bayes
> classifiers.  To make a virtue of a defect, the SGD based classifiers to
> not
> use Hadoop for training.  This makes deployment of a classification
> training
> workflow easier, but limits the total size of data that can be handled.
>
> What would you guys need to get started with trying these alternative
> models?
>
> On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
> <np...@gmail.com>wrote:
>
> > Joe,
> > Even I tried with reducing the number of countries in the country.txt.
> > That didn't help. And in my case, I was monitoring the disk space and
> > at no time did it reach 0%. So, I am not sure if that is the case. To
> > remove the dependency on the number of countries, I even tried with
> > the subjects.txt as the classification - that also did not help.
> > I think this problem is due to the type of the data being processed,
> > but what I am not sure of is what I need to change to get the data to
> > be processed successfully.
> >
> > The experienced folks on Mahout will be able to tell us what is missing I
> > guess.
> >
> > Thank you
> > Gangadhar
> >
> > On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> > > Gangadhar,
> > >
> > > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just
> > have
> > > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > > wikipediainput data set and then ran TrainClassifier and it worked.
> when
> > I
> > > ran TestClassifier as below, I got blank results in the output.
> > >
> > > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> > >  wikipediainput  -ng 3 -type bayes -source hdfs
> > >
> > > Summary
> > > -------------------------------------------------------
> > > Correctly Classified Instances          :          0         ?%
> > > Incorrectly Classified Instances        :          0         ?%
> > > Total Classified Instances              :          0
> > >
> > > =======================================================
> > > Confusion Matrix
> > > -------------------------------------------------------
> > > a     <--Classified as
> > > 0     |  0     a     = spain
> > > Default Category: unknown: 1
> > >
> > > I am not sure if I am doing something wrong.. have to figure out why my
> > o/p
> > > is so blank.
> > > I'll document these steps and mention about country.txt in the wiki.
> > >
> > > Question to all
> > > Should we have 2 country.txt
> > >
> > >   1. country_full_list.txt - this is the existing list
> > >   2. country_sample_list.txt - a list with 2 or 3 countries
> > >
> > > To get a flavor of the wikipedia bayes example, we can use
> > > country_sample.txt. When new people want to just try out the example,
> > they
> > > can reference this txt file  as a parameter.
> > > To run the example in a robust scalable infrastructure, we could use
> > > country_full_list.txt.
> > > any thots ?
> > >
> > > regards
> > > Joe.
> > >
> > > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:
> > >
> > >> Gangadhar,
> > >>
> > >> After running TrainClassifier again, the map task just failed with the
> > same
> > >> exception and I am pretty sure it is an issue with disk space.
> > >> As the map was progressing, I was monitoring my free disk space
> dropping
> > >> from 81GB. It came down to 0 after almost 66% through the map task and
> > then
> > >> the exception happened. After the exception, another map task was
> > resuming
> > >> at 33% and I got close to 15GB free space (i guess the first map task
> > freed
> > >> up some space) and I am sure they would drop down to zero again and
> > throw
> > >> the same exception.
> > >> I am going to modify the country.txt to just 1 country and recreate
> > >> wikipediainput and run TrainClassifier. Will let you know how it
> goes..
> > >>
> > >> Do we have any benchmarks / system requirements for running this
> example
> > ?
> > >> Has anyone else had success running this example anytime. Would
> > appreciate
> > >> your inputs / thots.
> > >>
> > >> Should we look at tuning the code for handling these situations ? Any
> > quick
> > >> suggestions on where to start looking at ?
> > >>
> > >> regards,
> > >> Joe.
> > >>
> > >>
> > >>
> > >>
> > >
> >
>

Re: Options in TrainClassifier.java

Posted by Ted Dunning <te...@gmail.com>.
I am watching these efforts with interest, but have been unable to
contribute much to the process.  I would encourage Joe and others to keep
whittling this problem down so that we can understand what is causing it.

In the meantime, I think that the SGD classifiers are close to production
quality.  For problems with fewer than several million training examples, and
especially problems with many sparse features, I think that these
classifiers might be easier to get started with than the Naive Bayes
classifiers.  To make a virtue of a defect, the SGD-based classifiers do not
use Hadoop for training.  This makes deployment of a classification training
workflow easier, but limits the total size of data that can be handled.

What would you guys need to get started with trying these alternative
models?
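To make the suggestion concrete: the core of an SGD-trained logistic model is just an online update loop over the examples, with no Hadoop job involved. The toy sketch below shows that shape only; none of these names come from Mahout's SGD API, which is considerably richer.

```java
public class TinySgd {

  // Toy two-feature logistic regression trained by stochastic gradient
  // descent. Illustrative only: Mahout's SGD classes handle sparse vectors,
  // regularization, and learning-rate schedules that are omitted here.
  double[] w = new double[3]; // bias + 2 weights
  double rate = 0.5;          // fixed learning rate (an assumption)

  static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

  double predict(double x1, double x2) {
    return sigmoid(w[0] + w[1] * x1 + w[2] * x2);
  }

  // One online update per example: this loop is the whole "training job".
  void train(double x1, double x2, int label) {
    double err = label - predict(x1, x2);
    w[0] += rate * err;
    w[1] += rate * err * x1;
    w[2] += rate * err * x2;
  }

  public static void main(String[] args) {
    TinySgd model = new TinySgd();
    for (int pass = 0; pass < 100; pass++) {
      model.train(0.0, 0.0, 0); // class-0 examples
      model.train(0.1, 0.2, 0);
      model.train(1.0, 1.0, 1); // class-1 examples
      model.train(0.9, 0.8, 1);
    }
    System.out.println(model.predict(0.05, 0.1) < 0.5); // low score for class 0
    System.out.println(model.predict(0.95, 0.9) > 0.5); // high score for class 1
  }
}
```

Because training is a plain in-process loop like this, an SGD workflow deploys like any other Java program, which is the trade-off Ted describes.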

On Sun, Sep 19, 2010 at 8:13 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> Joe,
> Even I tried with reducing the number of countries in the country.txt.
> That didn't help. And in my case, I was monitoring the disk space and
> at no time did it reach 0%. So, I am not sure if that is the case. To
> remove the dependency on the number of countries, I even tried with
> the subjects.txt as the classification - that also did not help.
> I think this problem is due to the type of the data being processed,
> but what I am not sure of is what I need to change to get the data to
> be processed successfully.
>
> The experienced folks on Mahout will be able to tell us what is missing I
> guess.
>
> Thank you
> Gangadhar
>
> On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> > Gangadhar,
> >
> > I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just
> have
> > 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> > wikipediainput data set and then ran TrainClassifier and it worked. when
> I
> > ran TestClassifier as below, I got blank results in the output.
> >
> > $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
> >  wikipediainput  -ng 3 -type bayes -source hdfs
> >
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :          0         ?%
> > Incorrectly Classified Instances        :          0         ?%
> > Total Classified Instances              :          0
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a     <--Classified as
> > 0     |  0     a     = spain
> > Default Category: unknown: 1
> >
> > I am not sure if I am doing something wrong.. have to figure out why my
> o/p
> > is so blank.
> > I'll document these steps and mention about country.txt in the wiki.
> >
> > Question to all
> > Should we have 2 country.txt
> >
> >   1. country_full_list.txt - this is the existing list
> >   2. country_sample_list.txt - a list with 2 or 3 countries
> >
> > To get a flavor of the wikipedia bayes example, we can use
> > country_sample.txt. When new people want to just try out the example,
> they
> > can reference this txt file  as a parameter.
> > To run the example in a robust scalable infrastructure, we could use
> > country_full_list.txt.
> > any thots ?
> >
> > regards
> > Joe.
> >
> > On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:
> >
> >> Gangadhar,
> >>
> >> After running TrainClassifier again, the map task just failed with the
> same
> >> exception and I am pretty sure it is an issue with disk space.
> >> As the map was progressing, I was monitoring my free disk space dropping
> >> from 81GB. It came down to 0 after almost 66% through the map task and
> then
> >> the exception happened. After the exception, another map task was
> resuming
> >> at 33% and I got close to 15GB free space (i guess the first map task
> freed
> >> up some space) and I am sure they would drop down to zero again and
> throw
> >> the same exception.
> >> I am going to modify the country.txt to just 1 country and recreate
> >> wikipediainput and run TrainClassifier. Will let you know how it goes..
> >>
> >> Do we have any benchmarks / system requirements for running this example
> ?
> >> Has anyone else had success running this example anytime. Would
> appreciate
> >> your inputs / thots.
> >>
> >> Should we look at tuning the code for handling these situations ? Any
> quick
> >> suggestions on where to start looking at ?
> >>
> >> regards,
> >> Joe.
> >>
> >>
> >>
> >>
> >
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
I also tried reducing the number of countries in country.txt. That didn't
help, and in my case I was monitoring the disk space and at no time did it
reach 0%, so I am not sure that is the cause. To remove the dependency on
the number of countries, I even tried with subjects.txt as the
classification list - that also did not help.
I think this problem is due to the type of the data being processed,
but what I am not sure of is what I need to change to get the data to
be processed successfully.

The experienced folks on Mahout will be able to tell us what is missing I guess.

Thank you
Gangadhar

On Sun, Sep 19, 2010 at 8:06 AM, Joe Kumar <jo...@gmail.com> wrote:
> Gangadhar,
>
> I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to just have
> 1 entry (spain) and used WikipediaDatasetCreatorDriver to create the
> wikipediainput data set and then ran TrainClassifier and it worked. when I
> ran TestClassifier as below, I got blank results in the output.
>
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
>  wikipediainput  -ng 3 -type bayes -source hdfs
>
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0         ?%
> Incorrectly Classified Instances        :          0         ?%
> Total Classified Instances              :          0
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a     <--Classified as
> 0     |  0     a     = spain
> Default Category: unknown: 1
>
> I am not sure if I am doing something wrong.. have to figure out why my o/p
> is so blank.
> I'll document these steps and mention about country.txt in the wiki.
>
> Question to all
> Should we have 2 country.txt
>
>   1. country_full_list.txt - this is the existing list
>   2. country_sample_list.txt - a list with 2 or 3 countries
>
> To get a flavor of the wikipedia bayes example, we can use
> country_sample.txt. When new people want to just try out the example, they
> can reference this txt file  as a parameter.
> To run the example in a robust scalable infrastructure, we could use
> country_full_list.txt.
> any thots ?
>
> regards
> Joe.
>
> On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:
>
>> Gangadhar,
>>
>> After running TrainClassifier again, the map task just failed with the same
>> exception and I am pretty sure it is an issue with disk space.
>> As the map was progressing, I was monitoring my free disk space dropping
>> from 81GB. It came down to 0 after almost 66% through the map task and then
>> the exception happened. After the exception, another map task was resuming
>> at 33% and I got close to 15GB free space (i guess the first map task freed
>> up some space) and I am sure they would drop down to zero again and throw
>> the same exception.
>> I am going to modify the country.txt to just 1 country and recreate
>> wikipediainput and run TrainClassifier. Will let you know how it goes..
>>
>> Do we have any benchmarks / system requirements for running this example ?
>> Has anyone else had success running this example anytime. Would appreciate
>> your inputs / thots.
>>
>> Should we look at tuning the code for handling these situations ? Any quick
>> suggestions on where to start looking at ?
>>
>> regards,
>> Joe.
>>
>>
>>
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to have just
one entry (spain) and used WikipediaDatasetCreatorDriver to create the
wikipediainput data set, then ran TrainClassifier and it worked. When I
ran TestClassifier as below, I got blank results in the output.

$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d
 wikipediainput  -ng 3 -type bayes -source hdfs

Summary
-------------------------------------------------------
Correctly Classified Instances          :          0         ?%
Incorrectly Classified Instances        :          0         ?%
Total Classified Instances              :          0

=======================================================
Confusion Matrix
-------------------------------------------------------
a     <--Classified as
0     |  0     a     = spain
Default Category: unknown: 1

I am not sure if I am doing something wrong; I have to figure out why my
output is so blank.
I'll document these steps and mention country.txt in the wiki.

A question for everyone: should we have two country.txt files?

   1. country_full_list.txt - this is the existing list
   2. country_sample_list.txt - a list with 2 or 3 countries

To get a flavor of the Wikipedia Bayes example, we can use
country_sample_list.txt: when new people just want to try out the example,
they can reference this file as a parameter. To run the example on a robust,
scalable infrastructure, we could use country_full_list.txt.
Any thoughts?

regards
Joe.

On Sat, Sep 18, 2010 at 8:57 PM, Joe Kumar <jo...@gmail.com> wrote:

> Gangadhar,
>
> After running TrainClassifier again, the map task just failed with the same
> exception and I am pretty sure it is an issue with disk space.
> As the map was progressing, I was monitoring my free disk space dropping
> from 81GB. It came down to 0 after almost 66% through the map task and then
> the exception happened. After the exception, another map task was resuming
> at 33% and I got close to 15GB free space (i guess the first map task freed
> up some space) and I am sure they would drop down to zero again and throw
> the same exception.
> I am going to modify the country.txt to just 1 country and recreate
> wikipediainput and run TrainClassifier. Will let you know how it goes..
>
> Do we have any benchmarks / system requirements for running this example ?
> Has anyone else had success running this example anytime. Would appreciate
> your inputs / thots.
>
> Should we look at tuning the code for handling these situations ? Any quick
> suggestions on where to start looking at ?
>
> regards,
> Joe.
>
>
>
>

Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

After running TrainClassifier again, the map task failed with the same
exception, and I am pretty sure it is a disk space issue.
As the map was progressing, I was monitoring my free disk space dropping
from 81GB. It came down to 0 when the map task was almost 66% through, and
then the exception happened. After the exception, another map task was
resuming at 33% and I had close to 15GB of free space (I guess the first map
task freed up some space), and I am sure it would drop to zero again and
throw the same exception.
I am going to modify country.txt to just one country, recreate
wikipediainput, and run TrainClassifier. Will let you know how it goes.

Do we have any benchmarks / system requirements for running this example?
Has anyone else had success running this example? I would appreciate your
input.

Should we look at tuning the code to handle these situations? Any quick
suggestions on where to start looking?

regards,
Joe.

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Joe,
I don't think disk space is the problem, because I did have enough (well,
not 81GB, but around 40GB free). I will try the suggestions in the thread
you mentioned and see if they make any difference. Will keep you posted.

Thank you

On Fri, Sep 17, 2010 at 11:33 PM, Joe Kumar <jo...@gmail.com> wrote:
> Gangadhar,
>
> I couldnt find any concrete reason behind this error. Some of them have
> reported this to happen very sporadic. As per some suggestions in this
> thread (
> http://www.mail-archive.com/core-user@hadoop.apache.org/msg09250.html) , I
> have changed the location of hadoop tmp dir. Also I have cleaned up some
> space in my laptop (now having 81GB of free space) and have started the job
> again. I m trying to see if freeing up space helps. I'll post any progress.
>
> Has anyone else faced similar issues. Would appreciate feedbacks / thots.
>
> reg
> Joe.
>
>
> On Fri, Sep 17, 2010 at 8:36 PM, Gangadhar Nittala
> <np...@gmail.com>wrote:
>
>> Thank you Joe for the confirmation. I am also checking the code to see
>> what is causing this issue. May be others in the list will know what
>> can cause this issue. I am guessing the root cause is not Mahout but
>> something in Hadoop.
>>
>> On Thu, Sep 16, 2010 at 11:34 PM, Joe Kumar <jo...@gmail.com> wrote:
>> > Gangadhar,
>> >
>> > After some system issues, I finally ran the TrainClassifier. After almost
>> > 65% into the map job, I got the same error that you have mentioned.
>> > INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0,
>> > Status : FAILED
>> > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
>> > valid local directory for
>> >
>> taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
>> > at
>> >
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>> > ...
>> > Havent yet analyzed the root cause / solution but just wanted to confirm
>> > that I am facing the same issue as you do.
>> > I'll try to search / analyze and post more details.
>> >
>> > reg,
>> > Joe.
>> >
>> > On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <jo...@gmail.com> wrote:
>> >
>> >> Hi Gangadhar,
>> >>
>> >> rite. I did the same to execute the TrainClassifier but then since the
>> >> default datasource is hdfs, we should not be mandated to provide this
>> >> parameter.
>> >> I havent completed executing the TrainClassifier yet. I'll do it tonite
>> and
>> >> let you know if I get into trouble.
>> >>
>> >> reg,
>> >> Joe.
>> >>
>> >>
>> >> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
>> >> npk.gangadhar@gmail.com> wrote:
>> >>
>> >>> I ran into the issue that Joe mentioned about the command line
>> >>> parameters. I just added the datasource to the command line to execute
>> >>> thus
>> >>>  $HADOOP_HOME/bin/hadoop jar
>> >>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> >>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
>> >>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>> >>> bayes --dataSource hdfs
>> >>>
>> >>> On a related note, Joe, were you able to run the TrainClassifier
>> >>> without any errors ? When I tried this, the map-reduce job would abort
>> >>> always at 99%. I tried the example that was given in the wiki with
>> >>> both subjects and countries. I even reduced the list of countries in
>> >>> the country.txt assuming that was what was causing the issue. No
>> >>> matter what, the classifier task fails. And the exception in the task
>> >>> log :
>> >>>
>> >>> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
>> >>> = 41271492; bufend = 58259002; bufvoid = 99614720
>> >>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
>> >>> = 196379; kvend = 130842; length = 327680
>> >>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
>> >>> Finished spill 287
>> >>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
>> >>> Starting flush of map output
>> >>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
>> >>> Finished spill 288
>> >>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
>> >>> Error running child
>> >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> >>> any valid local directory for
>> >>>
>> >>>
>> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>> >>>        at
>> >>>
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>> >>>        at
>> >>>
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>> >>>        at
>> >>>
>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>> >>>        at
>> >>>
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>> >>>        at
>> >>>
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>> >>>        at
>> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>> >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>> >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> >>>
>> >>> I checked the hadoop JIRA and this seems to be fixed already
>> >>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
>> >>> I am doing wrong. Any suggestions to what I need to change to get this
>> >>> fixed will be very helpful. I have been struggling with this for a
>> >>> while now.
>> >>>
>> >>> Thank you
>> >>>
>> >>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
>> >>> > Robin,
>> >>> >
>> >>> > sure. I'll submit a patch.
>> >>> >
>> >>> > The command line flag already has the default behavior specified.
>> >>> >  --classifierType (-type) classifierType    Type of classifier:
>> >>> > bayes|cbayes.
>> >>> >                                             Default: bayes
>> >>> >
>> >>> >  --dataSource (-source) dataSource          Location of model:
>> >>> hdfs|hbase.
>> >>> >
>> >>> >                                             Default Value: hdfs
>> >>> > So there is no change in the flag description.
>> >>> >
>> >>> > reg,
>> >>> > Joe.
>> >>> >
>> >>> >
>> >>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
>> >>> wrote:
>> >>> >
>> >>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com>
>> >>> wrote:
>> >>> >>
>> >>> >> > Hi all,
>> >>> >> >
>> >>> >> > As I was going through wikipedia example, I encountered a
>> situation
>> >>> with
>> >>> >> > TrainClassifier wherein some of the options with default values
>> are
>> >>> >> > actually
>> >>> >> > mandatory.
>> >>> >> > The documentation / command line help says that
>> >>> >> >
>> >>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
>> >>> >> >   has withRequired(true) while building the --datasource option.
>> We
>> >>> are
>> >>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
>> >>> >> >   ideally withRequired should be set to false
>> >>> >> >   2. default --classifierType is bayes but withRequired is set to
>> >>> true
>> >>> >> and
>> >>> >> >   we have code like
>> >>> >> >
>> >>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
>> >>> >> >        log.info("Training Bayes Classifier");
>> >>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >>> >> >
>> >>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>> >>> >> >        log.info("Training Complementary Bayes Classifier");
>> >>> >> >        // setup the HDFS and copy the files there, then run the
>> >>> trainer
>> >>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >>> >> >      }
>> >>> >> >
>> >>> >> > which should be changed to
>> >>> >> >
>> >>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>> >>> >> >        log.info("Training Complementary Bayes Classifier");
>> >>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >>> >> >
>> >>> >> >      } *else  {*
>> >>> >> >        log.info("Training  Bayes Classifier");
>> >>> >> >        // setup the HDFS and copy the files there, then run the
>> >>> trainer
>> >>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >>> >> >      }
>> >>> >> >
>> >>> >> > Please let me know if this looks valid and I'll submit a patch for
>> a
>> >>> JIRA
>> >>> >> > issue.
>> >>> >> >
>> >>> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write
>> >>> the
>> >>> >> default behavior in the flag description
>> >>> >>
>> >>> >>
>> >>> >> > reg
>> >>> >> > Joe.
>> >>> >> >
>> >>> >>
>> >>> >
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

I couldn't find any concrete reason behind this error. Some users have
reported it happening only sporadically. As per some suggestions in this
thread (
http://www.mail-archive.com/core-user@hadoop.apache.org/msg09250.html),
I have changed the location of the Hadoop tmp dir. I have also cleaned up
some space on my laptop (now 81GB free) and started the job again, to see
if freeing up space helps. I'll post any progress.

Has anyone else faced similar issues? I would appreciate any feedback /
thoughts.

reg
Joe.


On Fri, Sep 17, 2010 at 8:36 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> Thank you Joe for the confirmation. I am also checking the code to see
> what is causing this issue. May be others in the list will know what
> can cause this issue. I am guessing the root cause is not Mahout but
> something in Hadoop.
>
> On Thu, Sep 16, 2010 at 11:34 PM, Joe Kumar <jo...@gmail.com> wrote:
> > Gangadhar,
> >
> > After some system issues, I finally ran the TrainClassifier. After almost
> > 65% into the map job, I got the same error that you have mentioned.
> > INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0,
> > Status : FAILED
> > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> > valid local directory for
> >
> taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
> > at
> >
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> > ...
> > Havent yet analyzed the root cause / solution but just wanted to confirm
> > that I am facing the same issue as you do.
> > I'll try to search / analyze and post more details.
> >
> > reg,
> > Joe.
> >
> > On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <jo...@gmail.com> wrote:
> >
> >> Hi Gangadhar,
> >>
> >> rite. I did the same to execute the TrainClassifier but then since the
> >> default datasource is hdfs, we should not be mandated to provide this
> >> parameter.
> >> I havent completed executing the TrainClassifier yet. I'll do it tonite
> and
> >> let you know if I get into trouble.
> >>
> >> reg,
> >> Joe.
> >>
> >>
> >> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
> >> npk.gangadhar@gmail.com> wrote:
> >>
> >>> I ran into the issue that Joe mentioned about the command line
> >>> parameters. I just added the datasource to the command line to execute
> >>> thus
> >>>  $HADOOP_HOME/bin/hadoop jar
> >>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> >>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
> >>> --input wikipediainput10 --output wikipediamodel10 --classifierType
> >>> bayes --dataSource hdfs
> >>>
> >>> On a related note, Joe, were you able to run the TrainClassifier
> >>> without any errors ? When I tried this, the map-reduce job would abort
> >>> always at 99%. I tried the example that was given in the wiki with
> >>> both subjects and countries. I even reduced the list of countries in
> >>> the country.txt assuming that was what was causing the issue. No
> >>> matter what, the classifier task fails. And the exception in the task
> >>> log :
> >>>
> >>> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
> >>> = 41271492; bufend = 58259002; bufvoid = 99614720
> >>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
> >>> = 196379; kvend = 130842; length = 327680
> >>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
> >>> Finished spill 287
> >>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
> >>> Starting flush of map output
> >>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
> >>> Finished spill 288
> >>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
> >>> Error running child
> >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> >>> any valid local directory for
> >>>
> >>>
> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
> >>>        at
> >>>
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> >>>        at
> >>>
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >>>        at
> >>>
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
> >>>        at
> >>>
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
> >>>        at
> >>>
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
> >>>        at
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
> >>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >>>
> >>> I checked the hadoop JIRA and this seems to be fixed already
> >>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
> >>> I am doing wrong. Any suggestions to what I need to change to get this
> >>> fixed will be very helpful. I have been struggling with this for a
> >>> while now.
> >>>
> >>> Thank you
> >>>
> >>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
> >>> > Robin,
> >>> >
> >>> > sure. I'll submit a patch.
> >>> >
> >>> > The command line flag already has the default behavior specified.
> >>> >  --classifierType (-type) classifierType    Type of classifier:
> >>> > bayes|cbayes.
> >>> >                                             Default: bayes
> >>> >
> >>> >  --dataSource (-source) dataSource          Location of model:
> >>> hdfs|hbase.
> >>> >
> >>> >                                             Default Value: hdfs
> >>> > So there is no change in the flag description.
> >>> >
> >>> > reg,
> >>> > Joe.
> >>> >
> >>> >
> >>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
> >>> wrote:
> >>> >
> >>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com>
> >>> wrote:
> >>> >>
> >>> >> > Hi all,
> >>> >> >
> >>> >> > As I was going through wikipedia example, I encountered a
> situation
> >>> with
> >>> >> > TrainClassifier wherein some of the options with default values
> are
> >>> >> > actually
> >>> >> > mandatory.
> >>> >> > The documentation / command line help says that
> >>> >> >
> >>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
> >>> >> >   has withRequired(true) while building the --datasource option.
> We
> >>> are
> >>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
> >>> >> >   ideally withRequired should be set to false
> >>> >> >   2. default --classifierType is bayes but withRequired is set to
> >>> true
> >>> >> and
> >>> >> >   we have code like
> >>> >> >
> >>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
> >>> >> >        log.info("Training Bayes Classifier");
> >>> >> >        trainNaiveBayes(inputPath, outputPath, params);
> >>> >> >
> >>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
> >>> >> >        log.info("Training Complementary Bayes Classifier");
> >>> >> >        // setup the HDFS and copy the files there, then run the
> >>> trainer
> >>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
> >>> >> >      }
> >>> >> >
> >>> >> > which should be changed to
> >>> >> >
> >>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
> >>> >> >        log.info("Training Complementary Bayes Classifier");
> >>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
> >>> >> >
> >>> >> >      } *else  {*
> >>> >> >        log.info("Training  Bayes Classifier");
> >>> >> >        // setup the HDFS and copy the files there, then run the
> >>> trainer
> >>> >> >        trainNaiveBayes(inputPath, outputPath, params);
> >>> >> >      }
> >>> >> >
> >>> >> > Please let me know if this looks valid and I'll submit a patch for
> a
> >>> JIRA
> >>> >> > issue.
> >>> >> >
> >>> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write
> >>> the
> >>> >> default behavior in the flag description
> >>> >>
> >>> >>
> >>> >> > reg
> >>> >> > Joe.
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
> >>
> >>
> >>
> >>
> >
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
Thank you Joe for the confirmation. I am also checking the code to see
what is causing this issue. Maybe others on the list will know what can
cause it. I am guessing the root cause is not Mahout but something in
Hadoop.

On Thu, Sep 16, 2010 at 11:34 PM, Joe Kumar <jo...@gmail.com> wrote:
> Gangadhar,
>
> After some system issues, I finally ran the TrainClassifier. After almost
> 65% into the map job, I got the same error that you have mentioned.
> INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0,
> Status : FAILED
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> valid local directory for
> taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
> at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
> ...
> Havent yet analyzed the root cause / solution but just wanted to confirm
> that I am facing the same issue as you do.
> I'll try to search / analyze and post more details.
>
> reg,
> Joe.
>
> On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <jo...@gmail.com> wrote:
>
>> Hi Gangadhar,
>>
>> rite. I did the same to execute the TrainClassifier but then since the
>> default datasource is hdfs, we should not be mandated to provide this
>> parameter.
>> I havent completed executing the TrainClassifier yet. I'll do it tonite and
>> let you know if I get into trouble.
>>
>> reg,
>> Joe.
>>
>>
>> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
>> npk.gangadhar@gmail.com> wrote:
>>
>>> I ran into the issue that Joe mentioned about the command line
>>> parameters. I just added the datasource to the command line to execute
>>> thus
>>>  $HADOOP_HOME/bin/hadoop jar
>>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
>>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>>> bayes --dataSource hdfs
>>>
>>> On a related note, Joe, were you able to run the TrainClassifier
>>> without any errors ? When I tried this, the map-reduce job would abort
>>> always at 99%. I tried the example that was given in the wiki with
>>> both subjects and countries. I even reduced the list of countries in
>>> the country.txt assuming that was what was causing the issue. No
>>> matter what, the classifier task fails. And the exception in the task
>>> log :
>>>
>>> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
>>> = 41271492; bufend = 58259002; bufvoid = 99614720
>>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
>>> = 196379; kvend = 130842; length = 327680
>>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
>>> Finished spill 287
>>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
>>> Starting flush of map output
>>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
>>> Finished spill 288
>>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
>>> Error running child
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>> any valid local directory for
>>>
>>> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>>>        at
>>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>>>        at
>>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>>>        at
>>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>>>        at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>>>        at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> I checked the hadoop JIRA and this seems to be fixed already
>>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
>>> I am doing wrong. Any suggestions to what I need to change to get this
>>> fixed will be very helpful. I have been struggling with this for a
>>> while now.
>>>
>>> Thank you
>>>
>>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
>>> > Robin,
>>> >
>>> > sure. I'll submit a patch.
>>> >
>>> > The command line flag already has the default behavior specified.
>>> >  --classifierType (-type) classifierType    Type of classifier:
>>> > bayes|cbayes.
>>> >                                             Default: bayes
>>> >
>>> >  --dataSource (-source) dataSource          Location of model:
>>> hdfs|hbase.
>>> >
>>> >                                             Default Value: hdfs
>>> > So there is no change in the flag description.
>>> >
>>> > reg,
>>> > Joe.
>>> >
>>> >
>>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
>>> wrote:
>>> >
>>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com>
>>> wrote:
>>> >>
>>> >> > Hi all,
>>> >> >
>>> >> > As I was going through wikipedia example, I encountered a situation
>>> with
>>> >> > TrainClassifier wherein some of the options with default values are
>>> >> > actually
>>> >> > mandatory.
>>> >> > The documentation / command line help says that
>>> >> >
>>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
>>> >> >   has withRequired(true) while building the --datasource option. We
>>> are
>>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
>>> >> >   ideally withRequired should be set to false
>>> >> >   2. default --classifierType is bayes but withRequired is set to
>>> true
>>> >> and
>>> >> >   we have code like
>>> >> >
>>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
>>> >> >        log.info("Training Bayes Classifier");
>>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>>> >> >
>>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>>> >> >        log.info("Training Complementary Bayes Classifier");
>>> >> >        // setup the HDFS and copy the files there, then run the
>>> trainer
>>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>>> >> >      }
>>> >> >
>>> >> > which should be changed to
>>> >> >
>>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>>> >> >        log.info("Training Complementary Bayes Classifier");
>>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>>> >> >
>>> >> >      } *else  {*
>>> >> >        log.info("Training  Bayes Classifier");
>>> >> >        // setup the HDFS and copy the files there, then run the
>>> trainer
>>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>>> >> >      }
>>> >> >
>>> >> > Please let me know if this looks valid and I'll submit a patch for a
>>> JIRA
>>> >> > issue.
>>> >> >
>>> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write
>>> the
>>> >> default behavior in the flag description
>>> >>
>>> >>
>>> >> > reg
>>> >> > Joe.
>>> >> >
>>> >>
>>> >
>>>
>>
>>
>>
>>
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Gangadhar,

After some system issues, I finally ran the TrainClassifier. After almost
65% into the map job, I got the same error that you have mentioned.
INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0,
Status : FAILED
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
valid local directory for
taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
...
Haven't yet analyzed the root cause / solution, but I just wanted to confirm
that I am facing the same issue as you.
I'll try to search / analyze and post more details.

reg,
Joe.

On Wed, Sep 15, 2010 at 10:20 PM, Joe Kumar <jo...@gmail.com> wrote:

> Hi Gangadhar,
>
> rite. I did the same to execute the TrainClassifier but then since the
> default datasource is hdfs, we should not be mandated to provide this
> parameter.
> I havent completed executing the TrainClassifier yet. I'll do it tonite and
> let you know if I get into trouble.
>
> reg,
> Joe.
>
>
> On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala <
> npk.gangadhar@gmail.com> wrote:
>
>> I ran into the issue that Joe mentioned about the command line
>> parameters. I just added the datasource to the command line to execute
>> thus
>>  $HADOOP_HOME/bin/hadoop jar
>> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
>> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
>> --input wikipediainput10 --output wikipediamodel10 --classifierType
>> bayes --dataSource hdfs
>>
>> On a related note, Joe, were you able to run the TrainClassifier
>> without any errors ? When I tried this, the map-reduce job would abort
>> always at 99%. I tried the example that was given in the wiki with
>> both subjects and countries. I even reduced the list of countries in
>> the country.txt assuming that was what was causing the issue. No
>> matter what, the classifier task fails. And the exception in the task
>> log :
>>
>> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
>> = 41271492; bufend = 58259002; bufvoid = 99614720
>> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
>> = 196379; kvend = 130842; length = 327680
>> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
>> Finished spill 287
>> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
>> Starting flush of map output
>> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
>> Finished spill 288
>> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
>> Error running child
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> any valid local directory for
>>
>> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>>        at
>> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>>        at
>> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>>        at
>> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>>        at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>>        at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> I checked the hadoop JIRA and this seems to be fixed already
>> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
>> I am doing wrong. Any suggestions to what I need to change to get this
>> fixed will be very helpful. I have been struggling with this for a
>> while now.
>>
>> Thank you
>>
>> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
>> > Robin,
>> >
>> > sure. I'll submit a patch.
>> >
>> > The command line flag already has the default behavior specified.
>> >  --classifierType (-type) classifierType    Type of classifier:
>> > bayes|cbayes.
>> >                                             Default: bayes
>> >
>> >  --dataSource (-source) dataSource          Location of model:
>> hdfs|hbase.
>> >
>> >                                             Default Value: hdfs
>> > So there is no change in the flag description.
>> >
>> > reg,
>> > Joe.
>> >
>> >
>> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
>> wrote:
>> >
>> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com>
>> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > As I was going through wikipedia example, I encountered a situation
>> with
>> >> > TrainClassifier wherein some of the options with default values are
>> >> > actually
>> >> > mandatory.
>> >> > The documentation / command line help says that
>> >> >
>> >> >   1. default source (--datasource) is hdfs but TrainClassifier
>> >> >   has withRequired(true) while building the --datasource option. We
>> are
>> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
>> >> >   ideally withRequired should be set to false
>> >> >   2. default --classifierType is bayes but withRequired is set to
>> true
>> >> and
>> >> >   we have code like
>> >> >
>> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
>> >> >        log.info("Training Bayes Classifier");
>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >> >
>> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>> >> >        log.info("Training Complementary Bayes Classifier");
>> >> >        // setup the HDFS and copy the files there, then run the
>> trainer
>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >> >      }
>> >> >
>> >> > which should be changed to
>> >> >
>> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>> >> >        log.info("Training Complementary Bayes Classifier");
>> >> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >> >
>> >> >      } *else  {*
>> >> >        log.info("Training  Bayes Classifier");
>> >> >        // setup the HDFS and copy the files there, then run the
>> trainer
>> >> >        trainNaiveBayes(inputPath, outputPath, params);
>> >> >      }
>> >> >
>> >> > Please let me know if this looks valid and I'll submit a patch for a
>> JIRA
>> >> > issue.
>> >> >
>> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write
>> the
>> >> default behavior in the flag description
>> >>
>> >>
>> >> > reg
>> >> > Joe.
>> >> >
>> >>
>> >
>>
>
>
>
>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Hi Gangadhar,

Right. I did the same to execute the TrainClassifier, but since the
default datasource is hdfs, we should not be required to provide this
parameter.
I haven't completed executing the TrainClassifier yet. I'll do it tonight
and let you know if I run into trouble.
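To make the intent concrete, here is a minimal, hypothetical sketch of the
defaulting logic being proposed (the names mirror this thread, not the
actual Mahout code): neither flag needs to be required, because anything
other than "hbase" already falls back to hdfs, and anything other than
"cbayes" should fall back to plain Bayes.

```java
// Hypothetical sketch of the defaulting behavior discussed in this thread.
public class TrainClassifierDefaults {

    // Mirrors the existing check: anything other than "hbase" is treated as hdfs,
    // so --dataSource can safely be optional.
    static String resolveDataSource(String dataSource) {
        return "hbase".equalsIgnoreCase(dataSource) ? "hbase" : "hdfs";
    }

    // Mirrors the proposed if/else: "cbayes" is the explicit branch; everything
    // else (including a missing value) trains the plain Bayes classifier.
    static String resolveClassifierType(String classifierType) {
        return "cbayes".equalsIgnoreCase(classifierType) ? "cbayes" : "bayes";
    }

    public static void main(String[] args) {
        System.out.println(resolveDataSource(null));         // hdfs (default)
        System.out.println(resolveClassifierType(null));     // bayes (default)
        System.out.println(resolveClassifierType("cbayes")); // cbayes
    }
}
```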

reg,
Joe.

On Wed, Sep 15, 2010 at 9:41 PM, Gangadhar Nittala
<np...@gmail.com>wrote:

> I ran into the issue that Joe mentioned about the command line
> parameters. I just added the datasource to the command line to execute
> thus
>  $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
> --input wikipediainput10 --output wikipediamodel10 --classifierType
> bayes --dataSource hdfs
>
> On a related note, Joe, were you able to run the TrainClassifier
> without any errors ? When I tried this, the map-reduce job would abort
> always at 99%. I tried the example that was given in the wiki with
> both subjects and countries. I even reduced the list of countries in
> the country.txt assuming that was what was causing the issue. No
> matter what, the classifier task fails. And the exception in the task
> log :
>
> 10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
> = 41271492; bufend = 58259002; bufvoid = 99614720
> 2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
> = 196379; kvend = 130842; length = 327680
> 2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
> Finished spill 287
> 2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
> Starting flush of map output
> 2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
> Finished spill 288
> 2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
> Error running child
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> any valid local directory for
>
> taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
>        at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
>        at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>        at
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> I checked the hadoop JIRA and this seems to be fixed already
> https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what
> I am doing wrong. Any suggestions to what I need to change to get this
> fixed will be very helpful. I have been struggling with this for a
> while now.
>
> Thank you
>
> On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
> > Robin,
> >
> > sure. I'll submit a patch.
> >
> > The command line flag already has the default behavior specified.
> >  --classifierType (-type) classifierType    Type of classifier:
> > bayes|cbayes.
> >                                             Default: bayes
> >
> >  --dataSource (-source) dataSource          Location of model:
> hdfs|hbase.
> >
> >                                             Default Value: hdfs
> > So there is no change in the flag description.
> >
> > reg,
> > Joe.
> >
> >
> > On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com>
> wrote:
> >
> >> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > As I was going through wikipedia example, I encountered a situation
> with
> >> > TrainClassifier wherein some of the options with default values are
> >> > actually
> >> > mandatory.
> >> > The documentation / command line help says that
> >> >
> >> >   1. default source (--datasource) is hdfs but TrainClassifier
> >> >   has withRequired(true) while building the --datasource option. We
> are
> >> >   checking if the dataSourceType is hbase else set it to hdfs. so
> >> >   ideally withRequired should be set to false
> >> >   2. default --classifierType is bayes but withRequired is set to true
> >> and
> >> >   we have code like
> >> >
> >> > if ("bayes".equalsIgnoreCase(classifierType)) {
> >> >        log.info("Training Bayes Classifier");
> >> >        trainNaiveBayes(inputPath, outputPath, params);
> >> >
> >> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
> >> >        log.info("Training Complementary Bayes Classifier");
> >> >        // setup the HDFS and copy the files there, then run the
> trainer
> >> >        trainCNaiveBayes(inputPath, outputPath, params);
> >> >      }
> >> >
> >> > which should be changed to
> >> >
> >> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
> >> >        log.info("Training Complementary Bayes Classifier");
> >> >        trainCNaiveBayes(inputPath, outputPath, params);
> >> >
> >> >      } *else  {*
> >> >        log.info("Training  Bayes Classifier");
> >> >        // setup the HDFS and copy the files there, then run the
> trainer
> >> >        trainNaiveBayes(inputPath, outputPath, params);
> >> >      }
> >> >
> >> > Please let me know if this looks valid and I'll submit a patch for a
> JIRA
> >> > issue.
> >> >
> >> > +1 all valid. , Go ahead and fix it and in the cmdline flags write the
> >> default behavior in the flag description
> >>
> >>
> >> > reg
> >> > Joe.
> >> >
> >>
> >
>

Re: Options in TrainClassifier.java

Posted by Gangadhar Nittala <np...@gmail.com>.
I ran into the issue that Joe mentioned about the command line
parameters. I just added the datasource to the command line and executed
it thus:
 $HADOOP_HOME/bin/hadoop jar
$MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
--input wikipediainput10 --output wikipediamodel10 --classifierType
bayes --dataSource hdfs

On a related note, Joe, were you able to run the TrainClassifier
without any errors? When I tried this, the map-reduce job would always
abort at 99%. I tried the example that was given in the wiki with
both subjects and countries. I even reduced the list of countries in
country.txt, assuming that was what was causing the issue. No
matter what, the classifier task fails. This is the exception in the task
log:

10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart
= 41271492; bufend = 58259002; bufvoid = 99614720
2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart
= 196379; kvend = 130842; length = 327680
2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 287
2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask:
Starting flush of map output
2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask:
Finished spill 288
2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker:
Error running child
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
any valid local directory for
taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
	at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

I checked the Hadoop JIRA and this seems to have been fixed already
(https://issues.apache.org/jira/browse/HADOOP-4963). I am not sure what
I am doing wrong. Any suggestions as to what I need to change to get this
fixed would be very helpful. I have been struggling with this for a
while now.
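One thing worth ruling out for this DiskErrorException is free space on the
partition backing Hadoop's local/scratch directories (hadoop.tmp.dir /
mapred.local.dir). A rough check, with illustrative paths since mine will
differ from yours:

```shell
# Free space on the partition Hadoop spills map output to; the scratch
# dir defaults to a subdirectory of /tmp unless hadoop.tmp.dir overrides it.
HADOOP_TMP="${HADOOP_TMP_DIR:-/tmp}"
df -h "$HADOOP_TMP"
# How much is already in use there (permission errors suppressed)
du -sh "$HADOOP_TMP" 2>/dev/null || true
```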

Thank you

On Wed, Sep 15, 2010 at 1:16 AM, Joe Kumar <jo...@gmail.com> wrote:
> Robin,
>
> sure. I'll submit a patch.
>
> The command line flag already has the default behavior specified.
>  --classifierType (-type) classifierType    Type of classifier:
> bayes|cbayes.
>                                             Default: bayes
>
>  --dataSource (-source) dataSource          Location of model: hdfs|hbase.
>
>                                             Default Value: hdfs
> So there is no change in the flag description.
>
> reg,
> Joe.
>
>
> On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com> wrote:
>
>> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com> wrote:
>>
>> > Hi all,
>> >
>> > As I was going through wikipedia example, I encountered a situation with
>> > TrainClassifier wherein some of the options with default values are
>> > actually
>> > mandatory.
>> > The documentation / command line help says that
>> >
>> >   1. default source (--datasource) is hdfs but TrainClassifier
>> >   has withRequired(true) while building the --datasource option. We are
>> >   checking if the dataSourceType is hbase else set it to hdfs. so
>> >   ideally withRequired should be set to false
>> >   2. default --classifierType is bayes but withRequired is set to true
>> and
>> >   we have code like
>> >
>> > if ("bayes".equalsIgnoreCase(classifierType)) {
>> >        log.info("Training Bayes Classifier");
>> >        trainNaiveBayes(inputPath, outputPath, params);
>> >
>> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>> >        log.info("Training Complementary Bayes Classifier");
>> >        // setup the HDFS and copy the files there, then run the trainer
>> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >      }
>> >
>> > which should be changed to
>> >
>> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>> >        log.info("Training Complementary Bayes Classifier");
>> >        trainCNaiveBayes(inputPath, outputPath, params);
>> >
>> >      } *else  {*
>> >        log.info("Training  Bayes Classifier");
>> >        // setup the HDFS and copy the files there, then run the trainer
>> >        trainNaiveBayes(inputPath, outputPath, params);
>> >      }
>> >
>> > Please let me know if this looks valid and I'll submit a patch for a JIRA
>> > issue.
>> >
>> > +1, all valid. Go ahead and fix it, and in the cmdline flags write the
>> default behavior in the flag description
>>
>>
>> > reg
>> > Joe.
>> >
>>
>

Re: Options in TrainClassifier.java

Posted by Joe Kumar <jo...@gmail.com>.
Robin,

sure. I'll submit a patch.

The command line flag already has the default behavior specified.
  --classifierType (-type) classifierType    Type of classifier:
bayes|cbayes.
                                             Default: bayes

  --dataSource (-source) dataSource          Location of model: hdfs|hbase.

                                             Default Value: hdfs
So there is no change in the flag description.

reg,
Joe.


On Wed, Sep 15, 2010 at 1:10 AM, Robin Anil <ro...@gmail.com> wrote:

> On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com> wrote:
>
> > Hi all,
> >
> > As I was going through wikipedia example, I encountered a situation with
> > TrainClassifier wherein some of the options with default values are
> > actually
> > mandatory.
> > The documentation / command line help says that
> >
> >   1. default source (--datasource) is hdfs but TrainClassifier
> >   has withRequired(true) while building the --datasource option. We are
> >   checking if the dataSourceType is hbase else set it to hdfs. so
> >   ideally withRequired should be set to false
> >   2. default --classifierType is bayes but withRequired is set to true
> and
> >   we have code like
> >
> > if ("bayes".equalsIgnoreCase(classifierType)) {
> >        log.info("Training Bayes Classifier");
> >        trainNaiveBayes(inputPath, outputPath, params);
> >
> >      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
> >        log.info("Training Complementary Bayes Classifier");
> >        // setup the HDFS and copy the files there, then run the trainer
> >        trainCNaiveBayes(inputPath, outputPath, params);
> >      }
> >
> > which should be changed to
> >
> > *if ("cbayes".equalsIgnoreCase(classifierType)) {*
> >        log.info("Training Complementary Bayes Classifier");
> >        trainCNaiveBayes(inputPath, outputPath, params);
> >
> >      } *else  {*
> >        log.info("Training  Bayes Classifier");
> >        // setup the HDFS and copy the files there, then run the trainer
> >        trainNaiveBayes(inputPath, outputPath, params);
> >      }
> >
> > Please let me know if this looks valid and I'll submit a patch for a JIRA
> > issue.
> >
> > +1, all valid. Go ahead and fix it, and in the cmdline flags write the
> default behavior in the flag description
>
>
> > reg
> > Joe.
> >
>

Re: Options in TrainClassifier.java

Posted by Robin Anil <ro...@gmail.com>.
On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <jo...@gmail.com> wrote:

> Hi all,
>
> As I was going through wikipedia example, I encountered a situation with
> TrainClassifier wherein some of the options with default values are
> actually
> mandatory.
> The documentation / command line help says that
>
>   1. default source (--datasource) is hdfs but TrainClassifier
>   has withRequired(true) while building the --datasource option. We are
>   checking if the dataSourceType is hbase else set it to hdfs. so
>   ideally withRequired should be set to false
>   2. default --classifierType is bayes but withRequired is set to true and
>   we have code like
>
> if ("bayes".equalsIgnoreCase(classifierType)) {
>        log.info("Training Bayes Classifier");
>        trainNaiveBayes(inputPath, outputPath, params);
>
>      } else if ("cbayes".equalsIgnoreCase(classifierType)) {
>        log.info("Training Complementary Bayes Classifier");
>        // setup the HDFS and copy the files there, then run the trainer
>        trainCNaiveBayes(inputPath, outputPath, params);
>      }
>
> which should be changed to
>
> *if ("cbayes".equalsIgnoreCase(classifierType)) {*
>        log.info("Training Complementary Bayes Classifier");
>        trainCNaiveBayes(inputPath, outputPath, params);
>
>      } *else  {*
>        log.info("Training  Bayes Classifier");
>        // setup the HDFS and copy the files there, then run the trainer
>        trainNaiveBayes(inputPath, outputPath, params);
>      }
>
> Please let me know if this looks valid and I'll submit a patch for a JIRA
> issue.
>
> +1, all valid. Go ahead and fix it, and in the cmdline flags write the
default behavior in the flag description


> reg
> Joe.
>
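
For anyone skimming the thread, the dispatch order agreed above can be
sketched as a tiny standalone class (class and method names here are
illustrative, not the actual TrainClassifier code): only an explicit
"cbayes" selects the complementary trainer, and anything else, including
a missing --classifierType, falls through to plain Bayes.

```java
// Illustrative sketch only -- not the real TrainClassifier. It shows the
// dispatch order from the proposed patch, where "cbayes" is the special
// case and any other value (or a missing one) defaults to bayes.
public class DispatchSketch {

  static String chooseTrainer(String classifierType) {
    if ("cbayes".equalsIgnoreCase(classifierType)) {
      return "cbayes";   // complementary naive Bayes
    } else {
      return "bayes";    // default: plain naive Bayes
    }
  }

  public static void main(String[] args) {
    System.out.println(chooseTrainer("cbayes")); // cbayes
    System.out.println(chooseTrainer("BAYES"));  // bayes
    System.out.println(chooseTrainer(null));     // bayes (option omitted)
  }
}
```

Note that String.equalsIgnoreCase returns false for a null argument, so a
null classifierType safely lands in the default branch, which is what
makes withRequired(false) viable here.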