Posted to common-user@hadoop.apache.org by Kayla Jay <ka...@yahoo.com> on 2008/06/23 21:38:34 UTC

Working with XML / XQuery in hadoop

Hi,

I'm wondering whether anyone out there works with, manipulates, and stores XML data using Hadoop. I've seen some threads about XML RecordReaders and people who use StreamXmlRecordReader to do splits. But has anyone implemented a query framework that uses the Hadoop layer to query the XML in their map/reduce jobs?

I want to know if anyone has executed an XQuery or XPath expression within a Hadoop job to find something in the XML stored in Hadoop.

I can't find any samples, or anyone else out there who uses XML data rather than traditional log text data.

Are there any use cases of using Hadoop to work with XML and then query that XML in a distributed manner?

Thanks.

Re: Working with XML / XQuery in hadoop

Posted by Stefan Groschupf <sg...@101tec.com>.
Yep, we do.
We have an XML Writable that uses XUM behind the scenes. It has
getDom and getNode(xquery) methods. In readIn we read the byte array
and create the XUM DOM object from it.
Write simply triggers BinaryCodec.serialize, and we write the bytes
out.
The same would work if you de/serialize the XML as text. We found
that to be slower than XUM but pretty stable, whereas XUM has other
issues (you need to use BinaryCodec as a JVM singleton, etc.).
In general this works pretty well.
Stefan




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com
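
The text-serialization variant Stefan mentions could look roughly like the sketch below. It is not his implementation: the class name XmlWritableSketch is hypothetical, the write/readFields pair only mirrors the shape of Hadoop's Writable interface without importing Hadoop (so the example stays self-contained), the XML travels as length-prefixed UTF-8 bytes rather than through XUM's binary codec, and XPath stands in for XQuery because the JDK ships no XQuery engine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical Writable-style wrapper around an XML document.
class XmlWritableSketch {
    private byte[] xmlBytes = new byte[0];
    private Document dom; // parsed lazily, dropped whenever new bytes are read

    XmlWritableSketch() {}

    XmlWritableSketch(String xml) {
        this.xmlBytes = xml.getBytes(StandardCharsets.UTF_8);
    }

    // Same shape as Writable.write(DataOutput): length prefix, then raw bytes.
    public void write(DataOutput out) throws IOException {
        out.writeInt(xmlBytes.length);
        out.write(xmlBytes);
    }

    // Same shape as Writable.readFields(DataInput).
    public void readFields(DataInput in) throws IOException {
        xmlBytes = new byte[in.readInt()];
        in.readFully(xmlBytes);
        dom = null;
    }

    // Parse the stored bytes into a DOM on first use.
    public Document getDom() throws Exception {
        if (dom == null) {
            dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
        }
        return dom;
    }

    // Return the first node matching the XPath expression, or null if none matches.
    public Node getNode(String xpath) throws Exception {
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, getDom(), XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        XmlWritableSketch original = new XmlWritableSketch("<book><title>Hadoop</title></book>");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        XmlWritableSketch copy = new XmlWritableSketch();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(copy.getNode("/book/title").getTextContent());
    }
}
```

Parsing lazily means a job that only shuffles the records never pays the DOM-building cost; only a mapper or reducer that actually calls getDom or getNode does.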



Re: Working with XML / XQuery in hadoop

Posted by Brian Vargas <br...@ardvaark.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

Kayla,

When I first started playing with Hadoop, I created an InputFormat and
RecordReader that, given an XML file, created a series of key-value
pairs where the XPath of the node in the document was the key and the
value of the node (if it had one) was the value.  At the time, it seemed
like a good idea, but turned out to be horribly slow, due to the insane
number of keys that were created.  It also sucked to code against.
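
To make the key explosion concrete, here is a rough sketch of that kind of flattening, written against the JDK's DOM parser rather than a Hadoop RecordReader. The class XPathKeyFlattener and its flatten method are hypothetical names, and the sketch assumes that only leaf elements with non-empty text produce a pair (attributes are ignored).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: turns one XML document into (XPath, text) pairs.
class XPathKeyFlattener {

    // Each returned array is {path, value}, in document order.
    static List<String[]> flatten(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        List<String[]> pairs = new ArrayList<>();
        walk(root, "/" + root.getTagName() + "[1]", pairs);
        return pairs;
    }

    private static void walk(Element element, String path, List<String[]> pairs) {
        Map<String, Integer> seen = new HashMap<>(); // per-name sibling counters
        boolean hasChildElement = false;
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                hasChildElement = true;
                String name = ((Element) child).getTagName();
                int position = seen.merge(name, 1, Integer::sum);
                walk((Element) child, path + "/" + name + "[" + position + "]", pairs);
            }
        }
        // Only leaf elements that carry text become a key-value pair.
        if (!hasChildElement) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                pairs.add(new String[] {path, text});
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<log><entry><id>1</id><msg>a</msg></entry>"
                + "<entry><id>2</id><msg>b</msg></entry></log>";
        for (String[] pair : flatten(xml)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Every leaf gets its own distinct key, so the number of keys grows with the number of nodes in the document, and any logic that needs two related values has to reassemble them from separate records.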

It turned out to be way faster, and way easier to code, to just pass in
the names of the files to be loaded and run them through your favorite
parser inside the Map implementation.  Alternatively, if
the files are small enough, you could load the XML bytes into a sequence
file, and then just read them out as BytesWritable - again, into your
favorite parser.  (In fact, if you're dealing with XML files below the
block size of HDFS, that's probably the better way to do it.)
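
As an illustration of the second approach, the sketch below parses a whole document's bytes inside a map-style function and emits one pair per XPath match. It is only a stand-in: XmlQueryMapSketch and its map method are hypothetical, a plain byte[] takes the place of the BytesWritable value, and a BiConsumer takes the place of Hadoop's output collector, so that the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for a Mapper that queries whole XML documents.
class XmlQueryMapSketch {
    private final DocumentBuilder builder;
    private final XPathExpression query;

    // Build the parser and compile the query once, not once per record.
    XmlQueryMapSketch(String xpath) throws Exception {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        query = XPathFactory.newInstance().newXPath().compile(xpath);
    }

    // key: e.g. the file name; xmlBytes: the full document read from the sequence file.
    void map(String key, byte[] xmlBytes, BiConsumer<String, String> emit) throws Exception {
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        NodeList hits = (NodeList) query.evaluate(doc, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
            emit.accept(key, hits.item(i).getTextContent());
        }
    }

    public static void main(String[] args) throws Exception {
        XmlQueryMapSketch mapper = new XmlQueryMapSketch("//error/message");
        byte[] doc = ("<log><error><message>disk full</message></error>"
                + "<info/><error><message>timeout</message></error></log>")
                .getBytes(StandardCharsets.UTF_8);
        mapper.map("a.xml", doc, (k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Because each record is a complete document, the query sees the whole tree and the job emits only the matches, rather than one record per node.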

Brian

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (MingW32)
Comment: What is this? http://pgp.ardvaark.net

iD8DBQFIYAYI3YdPnMKx1eMRA1d+AKDNfYB/oR42NONht2BT4zuHZP0SXQCgjoA7
3G0oVCxBw9Fij1nWvV58zoo=
=5Eyk
-----END PGP SIGNATURE-----