Posted to user@mahout.apache.org by Wei Zhang <we...@us.ibm.com> on 2014/09/04 20:10:00 UTC

RE: any pointer to run wikipedia bayes example

Hi Andrew,

Finally I figured out that it probably doesn't have anything to do with HDFS;
it failed because the local disk filled up (during the phase between map and
reduce).

It seems the collocation driver (CollocDriver) generates too much output even
though I am only using 2-grams (on the full Wiki dataset). 60GB per local node
(22 nodes in total) is not enough to hold the temp data.
So I am using unigrams instead; I hope this is also more closely aligned with
the K-means vectorization. The vectorization then worked (it takes roughly 5
hours on the 22-node cluster).
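
For reference, a minimal sketch of the kind of seq2sparse call this
corresponds to (the paths are placeholders and the flag set is from the stock
Mahout CLI, not the exact script line):

   # hypothetical paths; -ng 1 keeps vectorization to unigrams, so the
   # CollocDriver/CollocReducer n-gram pass (and its large temp output) is skipped
   mahout seq2sparse \
     -i ${WORK_DIR}/wikipediainput \
     -o ${WORK_DIR}/wikipediaVecs \
     -wt tfidf -lnorm -nv -ow \
     -ng 1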

Thanks again!

Wei



From:	Wei Zhang/Watson/IBM@IBMUS
To:	user@mahout.apache.org
Date:	08/29/2014 04:19 PM
Subject:	RE: any pointer to run wikipedia bayes example



Thanks a lot Andrew for the pointers!

I also tried a category file with 25 subjects (Art Culture Economics
Education Event Health History Industry Sports Geography ...). On the 1GB
medium dataset, it generated roughly 50K data points, with ~65% accuracy.
If I scale that by 40 (i.e., to the full-size dataset), that gives me about 2
million data points of relatively high dimension, which should be fine for
me.

In the past couple of days, I have been trying the NB example on the full wiki
dataset (i.e., an 11GB compressed file, ~44GB unzipped).

The cluster that we own (a bit old) has 1.5TB of space (replication factor of
3, so effectively 0.5TB of free space). The cluster has 22 nodes, and each
node has a 30GB-50GB tmp directory.
But NB (on the full-size Wikipedia dump) repeatedly failed in
org.apache.mahout.vectorizer.collocations.llr.CollocReducer, with the
complaint "No space left".

Partial exception stack looks like this:
Error: java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:356)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:198)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:93)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:137)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
    at java.io.DataOutputStream.write(DataOutputStream.java:118)
    at org.apache.hadoop.mapred.IFileOutputStream.write(IFileOutputStream.java:84)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
    at java.io.DataOutputStream.write(DataOutputStream.java:118)
    at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:218)
    at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:157)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2659)

Another exception stack, from a different JVM, looks like this:
Creation of /tmp/hadoop-xxx/mapred/local/userlogs/job_201408261641_0027/attempt_201408261641_0027_r_000000_1.cleanup failed.
    at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:104)
    at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
    at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)

It is entirely possible that it is time for us to move to a larger cluster. I
am just curious: how much disk space should we expect to use for NB on the
full wiki dataset?

Thanks!

Wei


Andrew Palumbo ---08/27/2014 05:17:56 PM---Subject: RE: any pointer to run
wikipedia bayes example To: user@mahout.apache.org

From: Andrew Palumbo <ap...@outlook.com>
To: "user@mahout.apache.org" <us...@mahout.apache.org>
Date: 08/27/2014 05:17 PM
Subject: RE: any pointer to run wikipedia bayes example





Subject: RE: any pointer to run wikipedia bayes example
To: user@mahout.apache.org
From: weiz@us.ibm.com
Date: Tue, 26 Aug 2014 18:12:52 -0400


Hello Andrew,



I have given NB a try on the medium-size Wikipedia dataset (~1GB of data after
decompression, roughly 1/50 of the full Wikipedia size) with two categories
(US/UK). I examined the tf-idf vectors generated.



I have two questions:

(1) It seems there are (only) 11683 data points (i.e., documents) generated,
albeit each data point has relatively high dimension. 10K data points do not
seem very exciting; even if I multiply by 50 (to the full extent of the
Wikipedia dataset), the data points are not particularly many.




I suspect that many of the documents are not categorized as either US or UK
and thus are not included in the training set. On a 20-node cluster (8 cores
each, albeit a quite old one, 5 years old), it took 45 minutes to
label/vectorize the dataset, but only 3 minutes to train the NB.






If you used option (2) from the classify-wiki.sh script, seq2sparse will be
vectorizing the data using 4-grams, which takes much longer and gives you a
much larger feature set.  Option (1) uses bigrams.






I am wondering whether there is a way to get a larger dataset that can stress
the NB training (instead of the label/vectorization part), either by providing
a more inclusive category file or by choosing another dataset?

You could run on the full country set:


https://github.com/apache/mahout/blob/master/examples/src/test/resources/country.txt



By editing line 101 or 107 to read:


   cp $MAHOUT_HOME/examples/src/test/resources/country.txt ${WORK_DIR}/country.txt


However, on the medium data set this only yields ~38200 documents, so it
still probably will not be the size that you are looking for.
Alternatively, you could create your own category.txt file to use and pass
it to the -c argument.

As well, you could try turning on the -all option, which, as we discussed
before, will likely skew the categories into an "unknown" category but will
not reject any documents.
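
For example, a rough sketch of such a run (the input path and
my-categories.txt are hypothetical placeholders; the flags are the seqwiki
options listed later in this thread):

   # -c points at a custom category file; -all keeps otherwise-unmatched
   # documents in an "unknown" category instead of dropping them
   mahout seqwiki \
     -i ${WORK_DIR}/wikixml/enwiki-latest-pages-articles.xml \
     -o ${WORK_DIR}/wikipediainput \
     -c ${WORK_DIR}/my-categories.txt \
     -all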






With a more inclusive category file, I can potentially get a larger
dataset, but I don't know how to handle the case where a document has two
labels in that category file.






Currently, the WikipediaMapper labels the document with the first matching
category that it finds, but you can customize this however you'd like.






(2) I am wondering: if I use the Wikipedia dataset as the input to K-means
clustering (thus with no need to label the data), then I can get a relatively
large dataset, and both K-means and NB use the SequenceFile format.






I believe this should work. You could remove the labeling section, basically
lines 79-85 of WikipediaMapper.java:



https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/WikipediaMapper.java



and write out something like (K=document_title,V=document) to the sequence
file.



and then run this sequence file through seq2sparse and kmeans as is done in
the cluster-reuters.sh example (starting at line 109):


https://github.com/andrewpalumbo/mahout/blob/master/examples/bin/cluster-reuters.sh
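
Assuming the mapper is modified as above to emit unlabeled (title, document)
pairs, the clustering leg would look roughly like the following sketch,
modeled on the cluster-reuters.sh pattern (the paths, k, and the distance
measure are illustrative, not the script's exact values):

   # vectorize the unlabeled sequence file
   mahout seq2sparse -i ${WORK_DIR}/wiki-seqfiles -o ${WORK_DIR}/wiki-sparse \
     -wt tfidf -lnorm -nv -ow

   # cluster the resulting tf-idf vectors
   mahout kmeans -i ${WORK_DIR}/wiki-sparse/tfidf-vectors \
     -c ${WORK_DIR}/wiki-kmeans-clusters \
     -o ${WORK_DIR}/wiki-kmeans \
     -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
     -x 10 -k 20 -ow --clustering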






It seems that I would just need to bypass the labeling part and go directly to
the vectorization; I am not sure whether that is feasible?




Thanks a lot !



Wei



Andrew Palumbo ---08/21/2014 02:28:45 PM---Hello, Yes, If you work off of
the current trunk, you can use the classify-wiki.sh example.  There i



From: Andrew Palumbo <ap...@outlook.com>
To: "user@mahout.apache.org" <us...@mahout.apache.org>
Date: 08/21/2014 02:28 PM
Subject: RE: any pointer to run wikipedia bayes example

Hello,



Yes, if you work off of the current trunk, you can use the classify-wiki.sh
example.  There is currently no documentation on the Mahout site for this.



You can run this script to build and test an NB classifier for either option
(1), 10 arbitrary countries, or option (2), 2 countries (United States and
United Kingdom).



By default the script is set to run on a medium-sized Wikipedia XML dump.
To run on the full set you'll have to change the download by commenting out
line 78 and uncommenting line 80 [1].  *Be sure to clean your work
directory when changing datasets (option (3)).*





The step-by-step process for creating a Naive Bayes classifier for the
Wikipedia XML dump is very similar to creating the 20 Newsgroups classifier.
The only difference is that instead of running $mahout seqdirectory on the
unzipped 20 Newsgroups files, you run $mahout seqwiki on the unzipped
Wikipedia XML dump.
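
Side by side, that difference looks roughly like this (the paths are
placeholders; the seqdirectory/seqwiki flags are the standard Mahout CLI
ones):

   # 20 Newsgroups: a directory tree of plain-text files
   mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow

   # Wikipedia: a single XML dump, split per article and labeled by category
   mahout seqwiki -i ${WORK_DIR}/wikixml/enwiki-dump.xml \
     -o ${WORK_DIR}/wikipediainput -c ${WORK_DIR}/country10.txt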



$ mahout seqwiki invokes WikipediaToSequenceFile.java, which accepts a text
file of categories [2] and starts an MR job to parse each document in the XML
file.  This process will seek to extract documents whose category (exactly, if
the exactMatchOnly option is set) matches a line in the category file.  If no
match is found and the -all option is set, the document will be dumped into an
"unknown" category.

The documents will then be written out as a <Text,Text> sequence file of the
form (K: /category/document_title, V: document).
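
To sanity-check that output, Mahout's seqdumper utility will print the
key/value pairs in the sequence file (the path is a placeholder for wherever
seqwiki wrote its output):

   # inspect the (/category/document_title, document) pairs written by seqwiki
   mahout seqdumper -i ${WORK_DIR}/wikipediainput | less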



There are 3 different example category files available in the
/examples/src/test/resources directory: country.txt, country10.txt and
country2.txt.



The CLI options for seqwiki are as follows:



   -input (-i)              input pathname String
   -output (-o)             the output pathname String
   -categories (-c)         the file containing the Wikipedia categories
   -exactMatchOnly (-e)     if set, the Wikipedia category must match exactly
                            instead of simply containing the category string
   -all (-all)              if set, select all categories



From there you just need to run seq2sparse, split, trainnb, and testnb as in
the example script (a rough sketch of those steps follows below).



Especially for the binary classification problem, you should get better
results using 3- or 4-grams and a low maxDF cutoff like 30.
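
A hedged sketch of those four steps in the style of the example scripts (the
flag names are from the Mahout 0.9 CLI; the paths, split percentage, and the
-ng/-x values shown are illustrative, not the script's exact settings):

   # vectorize with 4-grams and a maxDFPercent cutoff of 30
   mahout seq2sparse -i ${WORK_DIR}/wikipediainput -o ${WORK_DIR}/wikipediaVecs \
     -wt tfidf -lnorm -nv -ow -ng 4 -x 30

   # hold out 20% of the tf-idf vectors for testing
   mahout split -i ${WORK_DIR}/wikipediaVecs/tfidf-vectors \
     --trainingOutput ${WORK_DIR}/train-vectors --testOutput ${WORK_DIR}/test-vectors \
     --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

   # train and test the (complementary) naive Bayes model
   mahout trainnb -i ${WORK_DIR}/train-vectors -el -o ${WORK_DIR}/model \
     -li ${WORK_DIR}/labelindex -ow -c
   mahout testnb -i ${WORK_DIR}/test-vectors -m ${WORK_DIR}/model \
     -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/nb-output -c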



[1]
https://github.com/apache/mahout/blob/master/examples/bin/classify-wiki.sh

[2]
https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt






Subject: Re: any pointer to run wikipedia bayes example
To: user@mahout.apache.org
From: weiz@us.ibm.com
Date: Wed, 20 Aug 2014 09:50:42 -0400

hi,

After doing a bit more searching, I found
https://issues.apache.org/jira/browse/MAHOUT-1527

The version of Mahout that I have been working with is Mahout 0.9 (from
http://mahout.apache.org/general/downloads.html), which I downloaded in
April.

Although it is the latest stable release, it doesn't include the patch
mentioned in https://issues.apache.org/jira/browse/MAHOUT-1527

Then I realized that had I cloned the latest Mahout, I would have gotten the
classify-wiki.sh script and could probably start from there.

Sorry for the spam!

Thanks,

Wei


Wei Zhang---08/19/2014 06:18:09 PM---Hi, I have been able to run the bayesian network 20news group example provided

From:  Wei Zhang/Watson/IBM@IBMUS
To:  user@mahout.apache.org
Date:  08/19/2014 06:18 PM
Subject:  any pointer to run wikipedia bayes example

Hi,

I have been able to run the bayesian network 20news group example provided at
the Mahout website.

I am interested in running the Wikipedia bayes example, as it is a much
larger dataset.

From several googling attempts, I figured it is a bit different workflow than
running the 20news group example -- e.g., I would need to provide a
categories.txt file, invoke WikipediaXmlSplitter, call
wikipediaDataSetCreator, etc.

I am wondering whether there is a document somewhere that describes the
process of running the Wikipedia bayes example?
https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html seems to no
longer work.

Greatly appreciated!

Wei



RE: any pointer to run wikipedia bayes example

Posted by Andrew Palumbo <ap...@outlook.com>.
Hi Wei,

Thanks for posting your findings!

Andy

Subject: RE: any pointer to run wikipedia bayes example
To: user@mahout.apache.org
From: weiz@us.ibm.com
Date: Thu, 4 Sep 2014 14:10:00 -0400

