
Posted to mapreduce-user@hadoop.apache.org by Bibek Paudel <et...@gmail.com> on 2010/10/08 13:48:22 UTC

question about processing XML file

Hi,
I use Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode on a
single-node cluster. I have already run MapReduce programs for
wordcount and for building an inverted index.

I am trying to run the wordcount program on a Wikipedia dump. It is a
single XML file where each line contains a Wikipedia page in the
following format:

<page>     <title>Main Page</title>    <text>Some text goes here.</text>    </page>

I want to do a word count of the text contained between the <text>
and </text> tags. Please let me know the correct way of doing this.
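
To make the goal concrete, here is a minimal plain-Java sketch (no Hadoop
involved) of the per-record logic: take one line of the dump, keep only what
sits between <text> and </text>, and count whitespace-separated tokens. The
class and method names are hypothetical, and it assumes each record holds a
single <text> element, as in the sample line above.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: illustrates the intended per-record logic.
class TextTagWordCount {

    // Returns the text between the first <text> and the following </text>,
    // or an empty string if either tag is missing from the record.
    static String extractText(String record) {
        final String begin = "<text>";
        final String end = "</text>";
        int start = record.indexOf(begin);
        if (start < 0) {
            return "";
        }
        start += begin.length();
        int stop = record.indexOf(end, start);
        if (stop < 0) {
            return "";
        }
        return record.substring(start, stop);
    }

    // Counts whitespace-separated tokens inside the <text> element only.
    // Punctuation is left attached to tokens, as in the simplest wordcount.
    static Map<String, Integer> countWords(String record) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : extractText(record).split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String page = "<page>     <title>Main Page</title>    "
                + "<text>Some text goes here.</text>    </page>";
        System.out.println(countWords(page));
    }
}
```

In a MapReduce job the extraction step would sit in the mapper (or be done by
the record reader), and the counting would be split between map and reduce as
in the usual wordcount.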

When I enter the following command, I get an error. The jar file, the
WordCount class, and the input file all exist.

$HADOOP_HOME/bin/hadoop jar WordCount.jar -inputformat
"org.apache.hadoop.mapreduce.StreamInputFormat"
-Dstream.recordreader.class=org.apache.hadoop.streaming.StreamXmlRecordReader
 -inputreader "StreamXmlRecordReader,begin=<text>,end=</text>"
WordCount wikixml wikixml-op2

Error:
-----------
Exception in thread "main" java.lang.ClassNotFoundException: -inputformat
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:247)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

What used to work:
----------------------------
$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2
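
The visible difference between the two commands is what comes right after the
jar name. The stack trace shows RunJar.main calling Class.forName, and the
exception message is "-inputformat", which suggests that the first argument
after the jar is being treated as the main class name. The plain-Java sketch
below imitates only that one step; the class and method names are
hypothetical and this is not the actual RunJar code.

```java
// Hypothetical imitation of a launcher that treats the first argument
// after the jar file as the name of the class to run.
class MainClassArgSketch {

    // Tries to load the first argument as a class. Returns the class name
    // on success, or a description of the failure otherwise.
    static String resolveMainClass(String[] argsAfterJar) {
        String candidate = argsAfterJar[0];
        try {
            Class.forName(candidate);
            return candidate;
        } catch (ClassNotFoundException e) {
            return "ClassNotFoundException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // An option in first position is looked up as if it were a class.
        System.out.println(resolveMainClass(
                new String[] {"-inputformat", "WordCount", "wikixml", "wikixml-op2"}));
        // A real class name in first position loads normally.
        System.out.println(resolveMainClass(
                new String[] {"java.lang.String", "wikixml", "wikixml-op2"}));
    }
}
```

If that reading is right, streaming-style options such as -inputformat and
-inputreader are only understood when the streaming jar itself is the jar
being run, and a custom jar would need the class name first, with any options
handled by the driver's own code.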

Thanks for any help,
Bibek

Re: question about processing XML file

Posted by Bibek Paudel <et...@gmail.com>.

Straight out of the documentation, the following also works:

$HADOOP_HOME/bin/hadoop jar
contrib/streaming/hadoop-0.20.2-streaming.jar -inputreader
"StreamXmlRecordReader,begin=<text>,end=</text>" -input wiki_head
-output wiki_head_op -mapper /bin/cat -reducer /usr/bin/wc

What I am interested in doing is:
1. Use my Java classes (written earlier for normal text files) as
mapper and reducer (and driver).
2. If possible, pass the configuration options, like the begin and end
tags of the XML, from inside my Java program itself.
3. If possible, specify my intent to use StreamXmlRecordReader from
inside the Java program itself.
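
For points 2 and 3, one possible direction is to set from the driver the same
job properties that the -inputreader option presumably sets. The sketch below
is plain Java with no Hadoop dependency: it only turns an -inputreader spec
into key/value pairs. The property names (stream.recordreader.class,
stream.recordreader.begin, stream.recordreader.end) and the default package
are assumptions about how the streaming jar handles this option, not verified
against 0.20.3-dev; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps an -inputreader spec to the job properties
// it is assumed to stand for. Not part of Hadoop.
class InputReaderSpecSketch {

    // Assumed property prefix used by the streaming record readers.
    static final String PREFIX = "stream.recordreader.";
    // Assumed default package for an unqualified reader class name.
    static final String DEFAULT_PACKAGE = "org.apache.hadoop.streaming.";

    static Map<String, String> toProperties(String spec) {
        String[] parts = spec.split(",");
        Map<String, String> props = new LinkedHashMap<>();
        // First element is the reader class; qualify it if no package is given.
        String reader = parts[0].contains(".") ? parts[0] : DEFAULT_PACKAGE + parts[0];
        props.put(PREFIX + "class", reader);
        // Remaining elements are name=value pairs such as begin=<text>.
        for (int i = 1; i < parts.length; i++) {
            String[] nameValue = parts[i].split("=", 2);
            props.put(PREFIX + nameValue[0], nameValue.length > 1 ? nameValue[1] : "");
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props =
                toProperties("StreamXmlRecordReader,begin=<text>,end=</text>");
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

In a driver, each pair would then be applied with set(key, value) on the job
configuration, and org.apache.hadoop.streaming.StreamInputFormat would be
chosen as the input format, with the streaming jar on the job's classpath.
Two caveats, both unverified here: that input format appears to be written
against the older org.apache.hadoop.mapred API, so it may not plug directly
into a driver built on org.apache.hadoop.mapreduce, and the mapper would
receive each matched XML fragment as the record key rather than as the value.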

Please let me know what I should read/do to solve these issues.

Bibek
