Posted to user@mahout.apache.org by Jakub Stransky <st...@gmail.com> on 2014/12/01 17:09:55 UTC

Insights to Naive Bayes classifier example - 20news groups

Hello Mahout experts,

I am trying to follow some examples provided with Mahout and some features
are not clear to me. It would be great if someone could clarify a bit more.

To prepare the data (train and test), the following sequence of steps is
performed (taken from the Mahout Cookbook):

All input is merged into a single dir:
cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

Converted to a Hadoop sequence file and then vectorized:
./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors \
  -lnorm -nv -wt tfidf

Divided into test and train data:
./mahout split \
  -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
  --trainingOutput ${WORK_DIR}/20news-train-vectors \
  --testOutput ${WORK_DIR}/20news-test-vectors \
  --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

The model is trained:
./mahout trainnb \
  -i ${WORK_DIR}/20news-train-vectors -el \
  -o ${WORK_DIR}/model \
  -li ${WORK_DIR}/labelindex \
  -ow


What I am missing here, and what is the subject of my question, is: where is
the category assigned to the data used to train the categorization? What I
would expect is a vector which says that this document belongs to a
particular category. It seems to me that this has been erased by the
first step, where we mixed all the data to create our corpus. I would still
expect this information to be retained somewhere. Instead, the
messages look as follows:

From: yeoy@a.cs.okstate.edu (YEO YEK CHONG)
Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
Organization: Oklahoma State University
Lines: 7

>From article <a4...@vicuna.ocunix.on.ca>, by Steve Frampton <
frampton@vicuna.ocunix.on.ca>:
> I was wondering, is the "Kermit" package (the actual package, not a

Yes!  In the usual ftp sites.

Yek CHong


There is no indication of which group this text belongs to. What the heck!

Could someone please clarify what's going on? When cross-validation is
performed, the confusion matrix does take those categories into account.

Thanks a lot for helping me out
Jakub

RE: Insights to Naive Bayes classifier example - 20news groups

Posted by Andrew Palumbo <ap...@outlook.com>.
> All input is merged into single dir:
> cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

Also, the line above should read as follows:
$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
see: http://mahout.apache.org/users/classification/twenty-newsgroups.html

 		 	   		  

RE: Insights to Naive Bayes classifier example - 20news groups

Posted by Andrew Palumbo <ap...@outlook.com>.

> Date: Tue, 2 Dec 2014 14:06:44 +0100
> Subject: Re: Insights to Naive Bayes classifier example - 20news groups
> From: stransky.ja@gmail.com
> To: user@mahout.apache.org
> 
> Hi Andrew,
> 
> many thanks for the final clarification! Now I have one last question,
> probably the most obvious one, but I must have missed it somewhere. All the
> examples end with testing the classifier and displaying the confusion
> matrix. So the state is:
> We have a trained and tested model, and now we would like to use the model
> to classify unseen, unknown data, i.e. actually use the classifier. It is
> clear how to prepare the input (vectorize etc.). What is not clear to
> me at the moment is how to call the trained model with new vectorized data
> as input. Or maybe even the vectorization itself, because we probably need
> the dictionary used by the model to produce valid vectors. And what
> about terms which were not in the training set?
> 
> Is there any documentation regarding this aspect?

As of Mahout 0.9 there are no CLI drivers available to vectorize and classify new documents. There is a ticket open for Mahout 1.0 regarding this. Currently you'll have to write a utility class to vectorize and classify new documents. As you mentioned, you'll need to use the same dictionary.file-0 that seq2sparse created for training. Also, if you're using TF-IDF weights, you'll need to use the same df-count file to compute the IDF. Both are located in the directory output by seq2sparse. You'll also want to use the same maxNgramSize as you used to train the model. If you want to keep it simple by using unigrams, you can avoid Lucene integration and just keep a count of the occurrences of tokenized terms. Terms unseen by the training set can be rejected.
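
The vectorization described above could be sketched roughly as follows. This is a minimal, unigram-only illustration in plain Python, not Mahout code: the `dictionary` and `df_counts` mappings stand in for the contents of dictionary.file-0 and the df-count file (reading the actual sequence files is not shown), `vectorize` is a hypothetical helper, all values are made up, and the tf * log(N/df) weighting is the textbook formula, which may differ from the one Mahout applies.

```python
import math
from collections import Counter

def vectorize(tokens, dictionary, df_counts, num_docs):
    """Return a sparse {term_index: tfidf_weight} mapping for one tokenized document."""
    # Terms that are not in the training dictionary are rejected.
    counts = Counter(t for t in tokens if t in dictionary)
    vector = {}
    for term, tf in counts.items():
        idx = dictionary[term]
        df = df_counts.get(idx, 1)
        # Textbook TF-IDF weight (an assumption, see the note above).
        vector[idx] = tf * math.log(num_docs / df)
    return vector

# Hypothetical stand-ins for the training-time dictionary and document frequencies.
dictionary = {"kermit": 0, "windows": 1, "ftp": 2}
df_counts = {0: 2, 1: 50, 2: 10}
num_docs = 100

doc = "is kermit available for windows in the usual ftp sites kermit".split()
vec = vectorize(doc, dictionary, df_counts, num_docs)
```

Only the three dictionary terms end up in `vec`; every other token is dropped, which is the "reject unseen terms" behaviour mentioned above.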

Once the document is vectorized, you can use BayesUtils.readModelFromDir(..) to retrieve your model, BayesUtils.readLabelIndex(..) [1] to read the label index, and StandardNaiveBayesClassifier.classifyFull(...) [2] (or the same method on ComplementaryNaiveBayesClassifier) to classify your vector. You can also look at TestNaiveBayesDriver.AnalyzeResults [3] to see how labels are assigned.
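
The last step, turning the classifier output into a label, can be sketched as below. This is plain Python for illustration only, under the assumption that the classifier yields one score per label index and that the highest score wins; `best_label`, the label index contents and the scores are all made up.

```python
def best_label(scores, label_index):
    """Return the label whose score is the highest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return label_index[best]

# Hypothetical label index (index -> category) and per-label scores.
label_index = {0: "comp.graphics", 1: "rec.autos", 2: "sci.space"}
scores = [-120.5, -98.2, -143.0]

predicted = best_label(scores, label_index)
```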

There's no documentation on the Mahout site at the moment. There is a good blog post here that can give you an idea of how to get started:

https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/BayesUtils.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/StandardNaiveBayesClassifier.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/test/TestNaiveBayesDriver.java

     

Re: Insights to Naive Bayes classifier example - 20news groups

Posted by Jakub Stransky <st...@gmail.com>.
Hi Andrew,

many thanks for the final clarification! Now I have one last question,
probably the most obvious one, but I must have missed it somewhere. All the
examples end with testing the classifier and displaying the confusion
matrix. So the state is:
We have a trained and tested model, and now we would like to use the model
to classify unseen, unknown data, i.e. actually use the classifier. It is
clear how to prepare the input (vectorize etc.). What is not clear to
me at the moment is how to call the trained model with new vectorized data
as input. Or maybe even the vectorization itself, because we probably need
the dictionary used by the model to produce valid vectors. And what
about terms which were not in the training set?

Is there any documentation regarding this aspect?

Thx
Jakub



-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky

RE: Insights to Naive Bayes classifier example - 20news groups

Posted by Andrew Palumbo <ap...@outlook.com>.


> However, the sequence of steps as described in the Mahout Cookbook seems
> incorrect to me:

This is entirely possible; that book may be out of date. The end-to-end instructions on the website for the 20 newsgroups example are up to date, though, as is the example script.

You don't want to merge all of the files into one directory; rather, you want to merge the training and testing sets in 20news-bydate while maintaining their directory structure.

> After the data set is downloaded and extracted, the data are merged via the command:
> cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
> 
> which essentially copies the files to a single location, the 20news-all folder.

This should not copy all of the *files* individually into the 20news-all folder, but rather the directories containing the files:

    $ ls 20news-all/
    alt.atheism               rec.autos           sci.space
    comp.graphics             rec.motorcycles     soc.religion.christian
    {...}
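
To see why the directory layout matters, here is a small plain-Python sketch (not Mahout code). It assumes, as described elsewhere in this thread, that the sequence file key is the file's path relative to the input directory with a leading slash; `keys_for` is a hypothetical helper that only mimics that naming.

```python
import os
import tempfile

def keys_for(input_dir):
    """Build keys of the form /<relative path> for every file under input_dir."""
    keys = []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), input_dir)
            keys.append("/" + rel.replace(os.sep, "/"))
    return sorted(keys)

with tempfile.TemporaryDirectory() as work:
    # Layout with category directories preserved.
    nested = os.path.join(work, "nested")
    os.makedirs(os.path.join(nested, "comp.graphics"))
    open(os.path.join(nested, "comp.graphics", "67399"), "w").close()

    # Layout where the files were copied individually into one folder.
    flat = os.path.join(work, "flat")
    os.makedirs(flat)
    open(os.path.join(flat, "67399"), "w").close()

    nested_keys = keys_for(nested)
    flat_keys = keys_for(flat)
```

With the category directories kept, the key carries the category (`/comp.graphics/67399`); with the flattened layout only `/67399` remains.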
 
> ./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq
> converts the 20news-all dir, where all the files were copied and the
> classification into folders was effectively lost, to a Hadoop sequence
> directory. We can peek inside a created seq file via
> hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more
> which prints the following result:
> 
> */67399* From:xxx
> Subject: Re: Imake-TeX: looking for beta testers
> Organization: CS Department, Dortmund University, Germany
> Lines: 59
> Distribution: world
> NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
> In article <xxxxx>,
> yyy writes:
> |> As I announced at the X Technical Conference in January, I would like
> |> to
> |> make Imake-TeX, the Imake support for using the TeX typesetting system,
> |> publically available. Currently Imake-TeX is in beta test here at the
> |> computer science department of Dortmund University, and I am looking
> ...
> 
> To my understanding, the number after the slash (in bold) is the key of the
> sequence file, right?

Correct, though it should read something like:

    /comp.graphics/67399 {...}

where comp.graphics is the category as well as the directory that it was read in from.

> Then seq2sparse is performed:
> 
> ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
> 
> 
> *Conclusions which I would like to verify:*
> - The sequence of steps as described is incorrect, particularly the
> conversion to a sequence file, as the key doesn't contain the folder name
> describing the category of the training data. Or am I still missing
> something here?

Yes, it looks like you are copying the individual files rather than the directories into 20news-all.

> 
> - mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o
> ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
>   What are the exact mechanics of label extraction? E.g. is a key of the
> form /category/docID resolved just to the category?

Yes.

> Is the last part after the slash always dropped to obtain the category?
> Or is it possible to define the strategy somewhere?

The hard-coded convention as of Mahout 0.9 is to extract the label as the first string after the key is split on "/". This makes category organization by directory and sequence file conversion with seqdirectory straightforward. The new Scala DSL Naive Bayes, which is currently in development, will allow the user more flexibility in extracting the label.

The label extraction process can be found here:
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapper.java

and could be modified if need be.
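
The convention above can be illustrated with a few lines of plain Python. This is a sketch of the rule as described in this message, not the code from IndexInstancesMapper; `extract_label` is a hypothetical name.

```python
def extract_label(key):
    """Split the key on "/" and return the first string after the leading slash."""
    # "/comp.graphics/67399".split("/") gives ["", "comp.graphics", "67399"];
    # index 0 is the empty string before the leading slash.
    return key.split("/")[1]

label_ok = extract_label("/comp.graphics/67399")
label_flat = extract_label("/67399")
```

For a key of the form `/comp.graphics/67399` this yields the category. For a key like `/67399`, produced when the files are copied without their directories, the same rule yields the document id, so no usable category is available.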
   

Re: Insights to Naive Bayes classifier example - 20news groups

Posted by Jakub Stransky <st...@gmail.com>.
Hi Andrew,

thanks for your response, which points me to the missing piece of the
puzzle! However, there is still something which is not clear to me. Either
the sequence of the commands is not correct, or I haven't fully grasped
the elementary mechanics here. I understand seqdirectory and seq2sparse
as described here:
http://mahout.apache.org/users/basics/creating-vectors-from-text.html

However, the sequence of steps as described in the Mahout Cookbook seems
incorrect to me:

After the data set is downloaded and extracted, the data are merged via the command:
cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

which essentially copies the files to a single location, the 20news-all folder.

./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq
converts the 20news-all dir, where all the files were copied and the
classification into folders was effectively lost, to a Hadoop sequence
directory. We can peek inside a created seq file via
hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more
which prints the following result:

*/67399* From:xxx
Subject: Re: Imake-TeX: looking for beta testers
Organization: CS Department, Dortmund University, Germany
Lines: 59
Distribution: world
NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
In article <xxxxx>,
yyy writes:
|> As I announced at the X Technical Conference in January, I would like
|> to
|> make Imake-TeX, the Imake support for using the TeX typesetting system,
|> publically available. Currently Imake-TeX is in beta test here at the
|> computer science department of Dortmund University, and I am looking
...

To my understanding, the number after the slash (in bold) is the key of the
sequence file, right?

Then seq2sparse is performed:

./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf


*Conclusions which I would like to verify:*
- The sequence of steps as described is incorrect, particularly the conversion
to a sequence file, because the key doesn't contain the folder name that
describes the category of the training data. Or am I still missing something
here?

- mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o
${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
  What are the exact mechanics of label extraction, e.g. how is a key such as
/category/docID resolved to just the category? Is the last part after the
slash always dropped to obtain the category? Or is it possible to define the
strategy somewhere?
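
To spell out the two strategies I am asking about, here is a shell sketch.
Both functions are hypothetical illustrations written for this question; I do
not know which of them (if either) matches what trainnb -el actually does.

```shell
#!/usr/bin/env bash
# Sketch only: two hypothetical ways to turn a key into a label.
# Neither is claimed to be Mahout's implementation.
set -euo pipefail

# Strategy 1: take the first path component after the leading slash.
first_component() {
  local key="${1#/}"            # strip the leading slash
  printf '%s\n' "${key%%/*}"    # drop everything from the next slash on
}

# Strategy 2: drop the last path component (the document id).
drop_last_component() {
  local key="${1#/}"
  case "$key" in
    */*) printf '%s\n' "${key%/*}" ;;
    *)   printf '\n' ;;         # no category component at all
  esac
}

a1=$(first_component "/comp.windows.x/67399")            # comp.windows.x
a2=$(drop_last_component "/comp.windows.x/67399")        # comp.windows.x
b1=$(first_component "/train/comp.windows.x/67399")      # train
b2=$(drop_last_component "/train/comp.windows.x/67399")  # train/comp.windows.x
c1=$(first_component "/67399")                           # 67399
c2=$(drop_last_component "/67399")                       # (empty)

printf '%s | %s\n' "$a1" "$a2" "$b1" "$b2" "$c1" "$c2"
```

The two strategies agree for keys of the form /category/docID and differ as
soon as the key has more or fewer path components, which is why I would like
to know which rule applies.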

Thanks
Jakub

On 1 December 2014 at 17:43, Andrew Palumbo <ap...@outlook.com> wrote:

> Hi Jakub,
>
> The step that you are missing is `$mahout seqdir ...`. In this step each
> file in each directory (where the directory is the Category) is converted
> into a sequence file of the form <Text,Text>, where the Text key is
> /Category/doc_id.
>
> `$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...`
> into a sequence file of the form <Text, VectorWritable>, leaving the Keys
> unchanged.
>
> `$mahout trainnb ... -el ...` then extracts the label from the Keys of the
> training data, i.e. the "Category" from /Category/doc_id.
>
> Please see
> http://mahout.apache.org/users/classification/twenty-newsgroups.html
> and http://mahout.apache.org/users/classification/bayesian.html
> for more information.



-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky

RE: Insights to Naive Bayes classifier example - 20news groups

Posted by Andrew Palumbo <ap...@outlook.com>.
Hi Jakub,

The step that you are missing is `$mahout seqdir ...`. In this step each file in each directory (where the directory is the Category) is converted into a sequence file of the form <Text,Text>, where the Text key is /Category/doc_id.

`$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...` into a sequence file of the form <Text, VectorWritable>, leaving the Keys unchanged.

`$mahout trainnb ... -el ...` then extracts the label from the Keys of the training data, i.e. the "Category" from /Category/doc_id.

Please see http://mahout.apache.org/users/classification/twenty-newsgroups.html
and http://mahout.apache.org/users/classification/bayesian.html
for more information.
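
As a rough illustration of the label extraction step, the shell sketch below
takes a few keys of the form /Category/doc_id, extracts the Category part, and
numbers the distinct labels. The document ids are made up, and the numbering
scheme is only an assumption about what a label index could look like; the
actual format written by `-li` is not shown in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: derive labels from keys of the form /Category/doc_id and
# number the distinct labels. Not Mahout's implementation.
set -euo pipefail

# Hypothetical keys (the document ids are invented for illustration).
keys='/alt.atheism/51060
/comp.graphics/37261
/alt.atheism/51119
/sci.space/59848'

# Label = second "/"-separated field, i.e. the part right after the leading slash.
labels=$(printf '%s\n' "$keys" | cut -d/ -f2 | LC_ALL=C sort -u)

# Assign ids starting from 0 to the sorted distinct labels (assumed scheme).
label_index=$(printf '%s\n' "$labels" | awk '{ print NR-1 "\t" $0 }')

printf '%s\n' "$label_index"
```

Because the keys pass through seq2sparse unchanged, the same extraction can be
applied to the vectorized data without any separate label file.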

> Date: Mon, 1 Dec 2014 17:09:55 +0100
> Subject: Insights to Naive Bayes classifier example - 20news groups
> From: stransky.ja@gmail.com
> To: user@mahout.apache.org
> 
> Hello Mahout experts,
> 
> I am trying to follow some examples provided with Mahout and some features
> are not clear to me. It would be great if someone could clarify a bit more.
> 
> To prepare the data (train and test), the following sequence of steps is
> performed (taken from the Mahout Cookbook):
> 
> All input is merged into single dir:
> *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
> 
> Converted to a Hadoop sequence file and then vectorized:
> *./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors
> -lnorm -nv -wt tfidf*
> 
> Divided into test and train data:
> *./mahout split*
> *-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
> *--trainingOutput ${WORK_DIR}/20news-train-vectors*
> *--testOutput ${WORK_DIR}/20news-test-vectors*
> *--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
> 
> Model is trained:
> *./mahout trainnb*
> *-i ${WORK_DIR}/20news-train-vectors -el*
> *-o ${WORK_DIR}/model*
> *-li ${WORK_DIR}/labelindex*
> *-ow*
> 
> 
> What I am missing here, and what my question is about: where is the
> category assigned to the testing data used to train the categorization? What I
> would expect is a vector that says that this document
> belongs to a particular category. It seems to me this has been erased by the
> first step, where we mixed all the data to create our corpus. I would still
> expect this information to be retained somewhere. Instead, the
> messages look as follows:
> 
> From: yeoy@a.cs.okstate.edu (YEO YEK CHONG)
> Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> Organization: Oklahoma State University
> Lines: 7
> 
> From article <a4...@vicuna.ocunix.on.ca>, by Steve Frampton <
> frampton@vicuna.ocunix.on.ca>:
> > I was wondering, is the "Kermit" package (the actual package, not a
> 
> Yes!  In the usual ftp sites.
> 
> Yek CHong
> 
> 
> There is no indication of which group this text belongs to. What the heck!
> 
> Could someone please clarify a bit what's going on? When cross-validation
> is performed, the confusion matrix takes those categories into consideration.
> 
> Thanks a lot for helping me out
> Jakub