Posted to common-user@hadoop.apache.org by Bibek Paudel <et...@gmail.com> on 2010/10/12 14:02:55 UTC

question about processing XML file

Hi,
I am running Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode
on a single-node cluster. I have already run MapReduce programs for
word count and for building an inverted index.

I am trying to run the word-count program on a Wikipedia dump. The dump
is a single XML file containing page data in the following form:

  <page>
    <title>Amr El Halwani</title>
    <id>16000008</id>
    <revision>
      <id>368385014</id>
      <timestamp>2010-06-16T13:32:28Z</timestamp>
      <text xml:space="preserve">
              Some multi-line text goes here.
      </text>
    </revision>
  </page>


I want to do a word count of the text contained between the <text>
and </text> tags. Please let me know the correct way of doing this.

What works:
------------------
$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2

Straight out of the documentation, the following also works:
---------------------------------------------------------------------------------
$HADOOP_HOME/bin/hadoop jar
contrib/streaming/hadoop-0.20.2-streaming.jar -inputreader
"StreamXmlRecordReader,begin=<text>,end=</text>" -input wiki_head
-output wiki_head_op -mapper /bin/cat -reducer /usr/bin/wc

What I am interested in doing is:
-------------------------------------------------
1. Use my own Java classes in WordCount.jar (or something similar) as
the mapper, reducer, and driver.
2. If possible, pass the configuration options, such as the begin and
end tags of the XML, from inside my Java program itself.
3. If possible, specify my intent to use StreamXmlRecordReader from
inside the Java program itself (see the sketch after this list).
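
In other words, I imagine a driver roughly like the sketch below. It is
untested and partly guesswork on my part: WordCountMapper and
WordCountReducer are stand-ins for my own classes, and I am assuming
that StreamInputFormat and StreamXmlRecordReader pick up their settings
from the stream.recordreader.* configuration keys (so the streaming jar
would have to be on the classpath), and that these classes use the old
org.apache.hadoop.mapred API.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlWordCount {

  // Stand-in for my real mapper. I assume StreamXmlRecordReader delivers
  // each <text>...</text> chunk as the Text key, with an empty Text value.
  public static class WordCountMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(Text xmlChunk, Text ignored,
                    OutputCollector<Text, IntWritable> out, Reporter r)
        throws IOException {
      for (String w : xmlChunk.toString().split("\\s+")) {
        if (w.length() > 0) out.collect(new Text(w), ONE);
      }
    }
  }

  // Stand-in for my real reducer: sums the counts per word.
  public static class WordCountReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter r)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlWordCount.class);
    conf.setJobName("xml wordcount");

    // The same settings the streaming command line passes via -inputreader:
    // which record reader to use, and which tags delimit one record.
    conf.set("stream.recordreader.class",
             "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<text>");
    conf.set("stream.recordreader.end", "</text>");
    conf.setInputFormat(StreamInputFormat.class);

    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}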

Please let me know what I should read/do to solve these issues.

Bibek

Re: question about processing XML file

Posted by Bibek Paudel <et...@gmail.com>.
On Tue, Oct 12, 2010 at 7:28 PM, Paul Ingles <pa...@oobaloo.co.uk> wrote:
> I found that we needed to 'borrow' Mahout's XmlInputFormat to get this working correctly. I posted a short blog article on it a while back: http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
>
> You could either add the dependency on the Mahout jars or copy the class source and compile it in your tree.
>

I have read your post. Thanks.

Could you please tell me how I can "add the dependency on the Mahout
jars"? Is it by using the "-libjars" option on the command line?

Thanks,
Bibek


Re: question about processing XML file

Posted by Paul Ingles <pa...@oobaloo.co.uk>.
I found that we needed to 'borrow' Mahout's XmlInputFormat to get this working correctly. I posted a short blog article on it a while back: http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

You could either add the dependency on the Mahout jars or copy the class source and compile it in your tree.
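
From memory, the job setup comes out roughly like the sketch below.
Treat it as a sketch rather than working code: the mapper and reducer
are illustrative stand-ins, and the package of XmlInputFormat
(org.apache.mahout.classifier.bayes in the version I used) may differ
in your Mahout. Note that the tags are matched byte-for-byte, which is
why the sketch matches on the prefix "<text": the Wikipedia dump's
<text> tag carries attributes.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class WikiXmlWordCount {

  // XmlInputFormat hands the mapper each matched chunk as the Text value,
  // keyed by its byte offset; the chunk includes the begin/end tags.
  public static class PageMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    protected void map(LongWritable offset, Text chunk, Context ctx)
        throws IOException, InterruptedException {
      for (String w : chunk.toString().split("\\s+")) {
        if (w.length() > 0) ctx.write(new Text(w), ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // XmlInputFormat reads its record delimiters from these two keys.
    conf.set("xmlinput.start", "<text");
    conf.set("xmlinput.end", "</text>");

    Job job = new Job(conf, "wiki xml wordcount");
    job.setJarByClass(WikiXmlWordCount.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(PageMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}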

Hth,
Paul

Sent from my iPhone


Re: question about processing XML file

Posted by Steve Lewis <lo...@gmail.com>.
Look at the classes org.apache.hadoop.mapreduce.lib.input.LineRecordReader
and org.apache.hadoop.mapreduce.lib.input.TextInputFormat

What you need to do is copy those and change the LineRecordReader to
look for the <page> tag.
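
Schematically, the changed reader comes out something like the sketch
below. It is only a sketch under simplifying assumptions: the class and
helper names are invented, records straddling split boundaries are not
handled (do what LineRecordReader does, or make isSplitable() return
false in your InputFormat), and you still need a small copy of
TextInputFormat whose createRecordReader() returns this reader.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PageRecordReader extends RecordReader<LongWritable, Text> {
  private static final byte[] START = "<page>".getBytes();
  private static final byte[] END = "</page>".getBytes();

  private FSDataInputStream in;
  private long start, end, pos;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext ctx)
      throws IOException {
    FileSplit fileSplit = (FileSplit) split;
    start = fileSplit.getStart();
    end = start + fileSplit.getLength();
    Path file = fileSplit.getPath();
    in = file.getFileSystem(ctx.getConfiguration()).open(file);
    in.seek(start);
    pos = start;
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (pos >= end) return false;
    // Skip ahead to the next <page>, then copy everything through </page>.
    if (!readUntilMatch(START, null)) return false;
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    buf.write(START);
    if (!readUntilMatch(END, buf)) return false;
    key.set(pos);
    value.set(buf.toByteArray(), 0, buf.size());
    pos = in.getPos();
    return true;
  }

  // Scan byte by byte until `match` is seen; if buf != null, copy the
  // scanned bytes (including the match itself) into it.
  private boolean readUntilMatch(byte[] match, ByteArrayOutputStream buf)
      throws IOException {
    int i = 0;
    int b;
    while ((b = in.read()) != -1) {
      if (buf != null) buf.write(b);
      if (b == match[i]) {
        if (++i == match.length) return true;
      } else {
        i = (b == match[0]) ? 1 : 0;  // restart, re-checking the current byte
      }
    }
    return false;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() {
    return end == start ? 0.0f
        : Math.min(1.0f, (pos - start) / (float) (end - start));
  }
  @Override public void close() throws IOException { in.close(); }
}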


-- 
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA