Posted to common-user@hadoop.apache.org by Joerg Rieger <jo...@mni.fh-giessen.de> on 2009/08/10 21:07:00 UTC

Re: XML files in HDFS

Hello,

while flipping through the cloud9 collections, I came across an XML  
InputFormat class:

http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/api/edu/umd/cloud9/collection/XMLInputFormat.html

I haven't used it myself, but it might be worth a try.
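
If it follows the usual configurable start-tag/end-tag pattern, wiring
it into a job might look roughly like the sketch below. Untested: the
xmlinput.start / xmlinput.end keys and the LongWritable/Text types are
my guesses from similar readers, so check the javadoc first.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    import edu.umd.cloud9.collection.XMLInputFormat;

    public class XmlRecordsJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(XmlRecordsJob.class);
        conf.setJobName("xml-records");

        // Guessed config keys: the reader presumably needs to know
        // which tag pair delimits one logical record.
        conf.set("xmlinput.start", "<page>");
        conf.set("xmlinput.end", "</page>");

        conf.setInputFormat(XMLInputFormat.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }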


Joerg


On 30.07.2009, at 14:16, Hyunsik Choi wrote:

> Hi,
>
> Actually, I don't know of any well-made XML InputFormat or
> RecordReader.
> To the best of my knowledge, StreamXmlRecordReader (
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html
> ) from Hadoop Streaming is the only solution.
>
> Good luck!
>
> --
> Hyunsik Choi
> Database & Information Systems Group, Korea University
> http://diveintodata.org
>
>
>
> On Thu, Jul 30, 2009 at 5:30 PM, Wasim Bari <wa...@msn.com> wrote:
>>
>>
>>
>> Hi All,
>>
>>       I am looking to store some really big XML files in HDFS and
>> then process them using MapReduce.
>>
>>
>>
>> Do we have some utility that uploads the XML files to HDFS while
>> making sure the split-up of a file into blocks doesn't break an
>> element (i.e. half an element in one block and half in another)?
>>
>>
>>
>> Any suggestions to work this out will be greatly appreciated.
>>
>>
>>
>> Thanks
>>
>>
>>
>> Bari
>>
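
For the StreamXmlRecordReader route Hyunsik mentions above, the
invocation would look something like this (paths, mapper/reducer, and
the <page> tags are placeholders for your own):

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input /data/pages.xml \
        -output /data/pages-out \
        -mapper /bin/cat \
        -reducer /usr/bin/wc \
        -inputreader "StreamXmlRecord,begin=<page>,end=</page>"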





Re: XML files in HDFS

Posted by Aaron Kimball <aa...@cloudera.com>.
Wasim,

RecordReader implementations should never require that a record fit
within a single block. The start and end offsets of an InputSplit are
soft limits, not hard ones. The RecordReader implementations that ship
with Hadoop behave this way, and any that you write should do the same.
If a logical record continues past the split's end offset, the reader
keeps reading into the next block until it finds the end of the record.
Similarly, if a RecordReader has a start offset > 0, it first scans
forward to the first end-of-record followed by a beginning-of-record
marker, skipping that data (it was already processed by the previous
split's reader), and only then does it begin feeding records to its
map task.
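
As a standalone illustration of that rule (plain Java, not the actual
RecordReader API, with a made-up <record> tag pair as the record
marker):

    /**
     * Sketch of the split-boundary rule: a record belongs to the
     * split in which it *starts*, the end offset is a soft limit,
     * and a split starting past 0 skips data the previous split read.
     */
    public class SplitBoundarySketch {

      // Made-up record markers; a real XML reader makes these
      // configurable (cf. StreamXmlRecordReader's begin/end strings).
      static final String BEGIN = "<record>";
      static final String END = "</record>";

      /** Emit every record that starts inside [start, end). */
      static void readSplit(String data, int start, int end) {
        // Scan forward to the first record starting at or after our
        // offset; a record straddling `start` was already consumed
        // by the previous split's reader.
        int from = data.indexOf(BEGIN, start);
        while (from >= 0 && from < end) {
          // Keep reading past `end` until the record closes: the end
          // offset is a soft limit, not a hard one.
          int to = data.indexOf(END, from);
          if (to < 0) {
            break; // truncated trailing record
          }
          System.out.println(data.substring(from, to + END.length()));
          from = data.indexOf(BEGIN, to + END.length());
        }
      }

      public static void main(String[] args) {
        String data =
            "<record>a</record><record>b</record><record>c</record>";
        int mid = data.length() / 2; // pretend block boundary in "b"
        readSplit(data, 0, mid);
        readSplit(data, mid, data.length());
        // Each record prints exactly once, even though the boundary
        // falls in the middle of record "b".
      }
    }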

- Aaron


On Mon, Aug 10, 2009 at 12:07 PM, Joerg Rieger <
joerg.rieger@mni.fh-giessen.de> wrote:

> Hello,
>
> while flipping through the cloud9 collections, I came across an XML
> InputFormat class:
>
>
> http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/api/edu/umd/cloud9/collection/XMLInputFormat.html
>
> I haven't used it myself, but it might be worth a try.
>
>
> Joerg
>
>
>
> On 30.07.2009, at 14:16, Hyunsik Choi wrote:
>
>> Hi,
>>
>> Actually, I don't know of any well-made XML InputFormat or
>> RecordReader.
>> To the best of my knowledge, StreamXmlRecordReader (
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html
>> ) from Hadoop Streaming is the only solution.
>>
>> Good luck!
>>
>> --
>> Hyunsik Choi
>> Database & Information Systems Group, Korea University
>> http://diveintodata.org
>>
>>
>>
>> On Thu, Jul 30, 2009 at 5:30 PM, Wasim Bari <wa...@msn.com> wrote:
>>
>>>
>>>
>>>
>>> Hi All,
>>>
>>>      I am looking to store some really big XML files in HDFS and then
>>> process them using MapReduce.
>>>
>>>
>>>
>>> Do we have some utility that uploads the XML files to HDFS while making
>>> sure the split-up of a file into blocks doesn't break an element (i.e.
>>> half an element in one block and half in another)?
>>>
>>>
>>>
>>> Any suggestions to work this out will be greatly appreciated.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> Bari
>>>
>>>