Posted to user@mahout.apache.org by Diederik van Liere <Di...@Rotman.Utoronto.Ca> on 2011/07/12 18:51:17 UTC

Using Mahout XmlInputFormat with Hadoop Streaming

Hi Mahout list,

I've got a quick question: is it possible to use Mahout's XmlInputFormat<http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java> in combination with Hadoop Streaming? I would like to replace Hadoop's xmlrecordreader. I have been looking for an example but couldn't find any, so maybe this is not possible (but I just want to make sure that I am not missing anything).
Thanks for your help.

Best,
Diederik


Re: Using Mahout XmlInputFormat with Hadoop Streaming

Posted by Sean Owen <sr...@gmail.com>.
Streaming is expecting a constructor with a certain signature, and
XmlInputFormat declares no such constructor.
I suspect you are not meant to use the input format with Streaming in
this way, but I don't know the exact nature of what you need to do.
See the comment in StreamInputFormat's javadoc:

/** An input format that selects a RecordReader based on a JobConf property.
 *  This should be used only for non-standard record reader such as
 *  StreamXmlRecordReader. For all other standard
 *  record readers, the appropriate input format classes should be used.
 */
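
To make that concrete: the NoSuchMethodException in the quoted log below names the exact constructor StreamInputFormat looks up reflectively, namely (FSDataInputStream, FileSplit, Reporter, JobConf, FileSystem). A reader that Streaming can load through -inputreader has to declare that constructor. A minimal sketch of what such a class might look like, purely for illustration (the class name and the parsing stub are placeholders, not Mahout or Hadoop source):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class StreamableXmlRecordReader implements RecordReader<Text, Text> {

  private final FSDataInputStream in;
  private final long start;
  private final long end;

  // This is the constructor Streaming reflects on. Mahout's XmlInputFormat is an
  // InputFormat, not a record reader, and declares no such constructor -- hence the error.
  public StreamableXmlRecordReader(FSDataInputStream in, FileSplit split,
      Reporter reporter, JobConf job, FileSystem fs) throws IOException {
    this.in = in;
    this.start = split.getStart();
    this.end = start + split.getLength();
    in.seek(start);
  }

  @Override
  public boolean next(Text key, Text value) throws IOException {
    // Real logic would scan between start and end for the configured begin/end
    // tags (e.g. <page> ... </page>); this stub just reports end-of-input.
    return false;
  }

  @Override public Text createKey() { return new Text(); }
  @Override public Text createValue() { return new Text(); }
  @Override public long getPos() throws IOException { return in.getPos(); }
  @Override public float getProgress() { return 0.0f; }
  @Override public void close() throws IOException { in.close(); }
}

The StreamXmlRecordReader named in that javadoc is the reader this mechanism was written around and, as far as I recall, it already has that constructor, so pointing -inputreader at it with the same begin/end arguments is the route Streaming expects; Mahout's XmlInputFormat is an InputFormat rather than a record reader, which is why the reflective lookup fails.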


On Thu, Jul 14, 2011 at 8:48 PM, Diederik van Liere
<Di...@rotman.utoronto.ca> wrote:
>
> Hi,
> Sandeep, thanks so much for your reply. Yes, I am aware of that blog post, but it does not explain how to use Mahout with Hadoop's streaming interface.
> I issued the following command:
>
> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar
>        -Dmapred.reduce.tasks=0
>        -Dmapred.child.java.opts=-Xmx1024m
>        -Dmapred.child.ulimit=3145728
>        -libjars /usr/local/mahout/examples/target/mahout-examples-0.6-SNAPSHOT.jar
>        -input /usr/hadoop/enwiki-20110405-pages-meta-history5.xml
>        -output /usr/hadoop/out
>        -mapper ~/wikihadoop/xml_streamer_simulated.py
>        -inputreader "org.apache.mahout.classifier.bayes.XmlInputFormat,begin=<page,end=</page>"
>
> And I got the following output:
>
> 2011-07-14 17:45:58,894 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2011-07-14 17:45:59,231 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/jars/.job.jar.crc <- /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/attempt_201107131716_0002_m_000000_0/work/.job.jar.crc
> 2011-07-14 17:45:59,242 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/jars/job.jar <- /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/attempt_201107131716_0002_m_000000_0/work/job.jar
> 2011-07-14 17:45:59,328 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2011-07-14 17:45:59,552 INFO org.apache.hadoop.mapred.FileInputFormat: getRecordReader start.....split=hdfs://beta:54310/usr/hadoop/enwiki-20110405-pages-meta-history5.xml:0+67108864
> 2011-07-14 17:45:59,778 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2011-07-14 17:45:59,802 WARN org.apache.hadoop.mapred.Child: Error running child
> java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.mahout.classifier.bayes.XmlInputFormat.<init>(org.apache.hadoop.fs.FSDataInputStream, org.apache.hadoop.mapred.FileSplit, org.apache.hadoop.mapred.Reporter, org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.FileSystem)
>        at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:69)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>        at org.apache.hadoop.mapred.Child.main(Child.java:262)
> Caused by: java.lang.NoSuchMethodException: org.apache.mahout.classifier.bayes.XmlInputFormat.<init>(org.apache.hadoop.fs.FSDataInputStream, org.apache.hadoop.mapred.FileSplit, org.apache.hadoop.mapred.Reporter, org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.FileSystem)
>        at java.lang.Class.getConstructor0(Class.java:2706)
>        at java.lang.Class.getConstructor(Class.java:1657)
>        at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:66)
>        ... 7 more
> 2011-07-14 17:45:59,806 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
> How can I fix this problem? Or does this mean that Mahout's XmlInputFormat cannot be used with Hadoop Streaming?
> Any other suggestions are welcome too!
>
> Best,
> Diederik
>
>
> -----Original Message-----
> From: Sandeep Parikh [mailto:sandeep@locusatx.com] On Behalf Of Sandeep Parikh
> Sent: July-12-11 5:42 PM
> To: user@mahout.apache.org
> Subject: Re: Using Mahout XmlInputFormat with Hadoop Streaming
>
> There's an old post on http://xmlandhadoop.blogspot.com/ that provides some direction on using Mahout's XmlInputFormat to read XML from HDFS. As I recall, the code itself contains some errors, but in general it should be sufficient to get you started using this input format.
>
> If your XML records look like the following snippet <element key="value"> <child/> <child/> </element>
>
> Then you'll need to set "xmlinput.start" and "xmlinput.end" to "<element" and "</element>", respectively, when configuring your job. That little nit cost me a few minutes when I last used this format.
>
> -Sandeep
>
> On Tuesday, July 12, 2011 at 11:51 AM, Diederik van Liere wrote:
>
> > Hi Mahout list,
> >
> > I've got a quick question: is it possible to use Mahout's XmlInputFormat<http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java> in combination with Hadoop Streaming? I would like to replace Hadoop's xmlrecordreader. I have been looking for an example but couldn't find any so maybe this is not possible (but just want to make sure that I am not missing anything).
> > Thanks for your help.
> >
> > Best,
> > Diederik
>

RE: Using Mahout XmlInputFormat with Hadoop Streaming

Posted by Diederik van Liere <Di...@Rotman.Utoronto.Ca>.
Hi, 
Sandeep, thanks so much for your reply. Yes, I am aware of that blog post, but it does not explain how to use Mahout with Hadoop's streaming interface.
I issued the following command:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar 
	-Dmapred.reduce.tasks=0 
	-Dmapred.child.java.opts=-Xmx1024m 
	-Dmapred.child.ulimit=3145728 
	-libjars /usr/local/mahout/examples/target/mahout-examples-0.6-SNAPSHOT.jar 
	-input /usr/hadoop/enwiki-20110405-pages-meta-history5.xml 
	-output /usr/hadoop/out 
	-mapper ~/wikihadoop/xml_streamer_simulated.py  
	-inputreader "org.apache.mahout.classifier.bayes.XmlInputFormat,begin=<page,end=</page>"

And I got the following output:

2011-07-14 17:45:58,894 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2011-07-14 17:45:59,231 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/jars/.job.jar.crc <- /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/attempt_201107131716_0002_m_000000_0/work/.job.jar.crc
2011-07-14 17:45:59,242 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/jars/job.jar <- /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/attempt_201107131716_0002_m_000000_0/work/job.jar
2011-07-14 17:45:59,328 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2011-07-14 17:45:59,552 INFO org.apache.hadoop.mapred.FileInputFormat: getRecordReader start.....split=hdfs://beta:54310/usr/hadoop/enwiki-20110405-pages-meta-history5.xml:0+67108864
2011-07-14 17:45:59,778 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-07-14 17:45:59,802 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.mahout.classifier.bayes.XmlInputFormat.<init>(org.apache.hadoop.fs.FSDataInputStream, org.apache.hadoop.mapred.FileSplit, org.apache.hadoop.mapred.Reporter, org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.FileSystem)
        at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:69)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.NoSuchMethodException: org.apache.mahout.classifier.bayes.XmlInputFormat.<init>(org.apache.hadoop.fs.FSDataInputStream, org.apache.hadoop.mapred.FileSplit, org.apache.hadoop.mapred.Reporter, org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.FileSystem)
        at java.lang.Class.getConstructor0(Class.java:2706)
        at java.lang.Class.getConstructor(Class.java:1657)
        at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:66)
        ... 7 more
2011-07-14 17:45:59,806 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

How can I fix this problem? Or does this mean that Mahout's XmlInputFormat cannot be used with Hadoop Streaming?
Any other suggestions are welcome too!

Best,
Diederik


-----Original Message-----
From: Sandeep Parikh [mailto:sandeep@locusatx.com] On Behalf Of Sandeep Parikh
Sent: July-12-11 5:42 PM
To: user@mahout.apache.org
Subject: Re: Using Mahout XmlInputFormat with Hadoop Streaming

There's an old post on http://xmlandhadoop.blogspot.com/ that provides some direction on using Mahout's XmlInputFormat to read XML from HDFS. As I recall, the code itself contains some errors, but in general it should be sufficient to get you started using this input format.

If your XML records look like the following snippet <element key="value"> <child/> <child/> </element>

Then you'll need to set "xmlinput.start" and "xmlinput.end" to "<element" and "</element>", respectively, when configuring your job. That little nit cost me a few minutes when I last used this format.

-Sandeep

On Tuesday, July 12, 2011 at 11:51 AM, Diederik van Liere wrote:

> Hi Mahout list,
> 
> I've got a quick question: is it possible to use Mahout's XmlInputFormat<http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java> in combination with Hadoop Streaming? I would like to replace Hadoop's xmlrecordreader. I have been looking for an example but couldn't find any so maybe this is not possible (but just want to make sure that I am not missing anything).
> Thanks for your help.
> 
> Best,
> Diederik


Re: Using Mahout XmlInputFormat with Hadoop Streaming

Posted by Sandeep Parikh <sa...@raveldata.com>.
There's an old post on http://xmlandhadoop.blogspot.com/ that provides some direction on using Mahout's XmlInputFormat to read XML from HDFS. As I recall, the code itself contains some errors, but in general it should be sufficient to get you started using this input format.

If your XML records look like the following snippet
<element key="value">
<child/>
<child/>
</element>

Then you'll need to set "xmlinput.start" and "xmlinput.end" to "<element" and "</element>", respectively, when configuring your job. That little nit cost me a few minutes when I last used this format.
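
For what it's worth, here is a rough sketch of the driver-side wiring for a plain (non-Streaming) Java MapReduce job using those two properties. It assumes the new-API (org.apache.hadoop.mapreduce) flavor of the XmlInputFormat linked earlier and LongWritable/Text records; the mapper, job name, and paths are placeholders, not code from the blog post:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlJobDriver {

  // Placeholder mapper: each input value should be one complete
  // <element>...</element> record; here it is just passed through.
  public static class PassThroughXmlMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The two properties discussed above: the literal strings that mark the
    // start and end of each XML record. The start string has no closing '>'
    // so that elements with attributes still match.
    conf.set("xmlinput.start", "<element");
    conf.set("xmlinput.end", "</element>");

    Job job = new Job(conf, "xml-input-example");
    job.setJarByClass(XmlJobDriver.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(PassThroughXmlMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With that in place, each map() call should receive one complete <element>...</element> block as its value.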

-Sandeep

On Tuesday, July 12, 2011 at 11:51 AM, Diederik van Liere wrote:

> Hi Mahout list,
> 
> I've got a quick question: is it possible to use Mahout's XmlInputFormat<http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java> in combination with Hadoop Streaming? I would like to replace Hadoop's xmlrecordreader. I have been looking for an example but couldn't find any so maybe this is not possible (but just want to make sure that I am not missing anything).
> Thanks for your help.
> 
> Best,
> Diederik