Posted to common-user@hadoop.apache.org by Steve Gao <st...@yahoo.com> on 2009/10/29 20:41:29 UTC

What if an XML file is across the boundary of HDFS chunks?

Has anybody run into a similar issue? If you store XML files in HDFS, how can you make sure that a chunk read by a mapper does not contain partial data of an XML segment?

For example:

<title>
<book>book1</book>
<author>me</author>
..............what if this is the boundary of a chunk?...................
<year>2009</year>
<book>book2</book>
<author>me</author>
<year>2009</year>
<book>book3</book>
<author>me</author>
<year>2009</year>
</title>

Re: What if an XML file is across the boundary of HDFS chunks?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Steve,

I think I've run across code in SVN that is a splitter for XML entries  
like this.  Look at StreamXmlRecordReader, I think it does what you  
want.

Brian
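
For readers who want to see why a record reader of this kind avoids partial records, here is a minimal, self-contained sketch of the general idea. It is not the StreamXmlRecordReader source, and the names read_split, split_ranges and DATA are made up for illustration. It assumes flat, non-nested records delimited by a begin string and an end string: a split owns every record whose begin tag starts inside it, and the reader keeps reading past the end of its split until it finds the end tag. The reader of the next split simply scans forward to the first begin tag it sees, so the tail of the previous record is skipped rather than emitted as a fragment.

```python
# Sample data modeled on the example in the original post.
DATA = (
    b"<title>\n"
    b"<book>book1</book>\n<author>me</author>\n<year>2009</year>\n"
    b"<book>book2</book>\n<author>me</author>\n<year>2009</year>\n"
    b"<book>book3</book>\n<author>me</author>\n<year>2009</year>\n"
    b"</title>\n"
)


def split_ranges(total, chunk):
    """Cut [0, total) into consecutive byte ranges of at most `chunk` bytes."""
    return [(s, min(s + chunk, total)) for s in range(0, total, chunk)]


def read_split(data, start, end, begin_tag, end_tag):
    """Return the records whose begin tag starts inside [start, end).

    Assumption: records are flat, so a begin tag never appears inside
    another record.
    """
    records = []
    pos = start
    while True:
        b = data.find(begin_tag, pos)
        # A record belongs to this split only if it starts before `end`.
        if b == -1 or b >= end:
            break
        e = data.find(end_tag, b + len(begin_tag))
        if e == -1:
            break  # unterminated record at end of file: dropped in this sketch
        e += len(end_tag)
        # `e` may lie beyond `end`: the reader reads on into the next chunk.
        records.append(data[b:e])
        pos = e
    return records


# Byte 47 falls right after "<author>me</author>" of book1, which is the
# boundary position asked about in the original post.
for start, end in split_ranges(len(DATA), 47):
    print((start, end), read_split(DATA, start, end, b"<book>", b"</year>"))
```

With this convention every record is produced exactly once, whatever the chunk size, at the cost of one reader fetching a little data that physically lives in the next chunk.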

On Oct 29, 2009, at 4:12 PM, Amandeep Khurana wrote:

> Store the entire xml in one line...


Re: What if an XML file is across the boundary of HDFS chunks?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Steve,

Look at the mailing list archives: at least two different people have suggested a specialized input splitter that you could use.

Brian

On Nov 16, 2009, at 2:02 PM, Steve Gao wrote:

> Thanks, but this is not a neat solution when the XML block is very large.
> Does anybody have another solution? Thanks!


Re: What if an XML file is across the boundary of HDFS chunks?

Posted by Steve Gao <st...@yahoo.com>.
Thanks, but this is not a neat solution when the XML block is very large.
Does anybody have another solution? Thanks!

--- On Thu, 10/29/09, Amandeep Khurana <am...@gmail.com> wrote:

From: Amandeep Khurana <am...@gmail.com>
Subject: Re: What if an XML file is accross boundary of HDFS chunks?
To: common-user@hadoop.apache.org
Date: Thursday, October 29, 2009, 5:12 PM

Store the entire xml in one line...


Re: What if an XML file is across the boundary of HDFS chunks?

Posted by Amandeep Khurana <am...@gmail.com>.
Store the entire xml in one line...
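
One way to read this suggestion, sketched here under assumptions: preprocess the file so that each record sits on a single line, and let ordinary line-oriented input handle the chunk boundaries. The function name one_record_per_line and the choice of "<book>" and "</year>" as record delimiters are hypothetical, picked to match the example in the original post. The sketch also collapses all whitespace inside a record, which is only acceptable if that whitespace carries no meaning.

```python
import re


def one_record_per_line(text, begin_tag, end_tag):
    """Yield each begin_tag ... end_tag record flattened onto one line."""
    pattern = re.compile(
        re.escape(begin_tag) + r".*?" + re.escape(end_tag), re.DOTALL
    )
    for match in pattern.finditer(text):
        # Collapse every run of whitespace, newlines included, into one space.
        yield " ".join(match.group(0).split())


SAMPLE = """<title>
<book>book1</book>
<author>me</author>
<year>2009</year>
<book>book2</book>

<author>me</author>

<year>2009</year>
</title>
"""

for line in one_record_per_line(SAMPLE, "<book>", "</year>"):
    print(line)
```

Each output line is then a complete record, though very large records turn into very long lines.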

-- 


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz