Posted to common-user@hadoop.apache.org by Steve Gao <st...@yahoo.com> on 2009/10/29 20:41:29 UTC
What if an XML file is across boundary of HDFS chunks?
Does anybody have a similar issue? If you store XML files in HDFS, how can you make sure that a chunk read by a mapper does not contain partial data of an XML segment?
For example:
<title>
<book>book1</book>
<author>me</author>
..............what if this is the boundary of a chunk?...................
<year>2009</year>
<book>book2</book>
<author>me</author>
<year>2009</year>
<book>book3</book>
<author>me</author>
<year>2009</year>
</title>
Re: What if an XML file is across boundary of HDFS chunks?
Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Steve,
I think I've run across code in SVN that is a splitter for XML entries
like this. Look at StreamXmlRecordReader, I think it does what you
want.
Brian
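
For anyone wondering how a reader of this kind can cope with a record that straddles a chunk boundary, here is a minimal Python sketch of the general idea. It is not the StreamXmlRecordReader source. It assumes each record is wrapped in its own element (a hypothetical <record> tag, which the example in the question does not have), and the names read_records and read_all_splits and the in-memory bytes buffer are illustrative only. The rule it applies: a record belongs to the split in which its begin tag starts, and the reader may read past the end of its split to finish that record.

```python
def read_records(data, start, end, begin_tag, end_tag):
    """Yield each record whose begin tag starts inside the split [start, end)."""
    pos = start
    while True:
        begin = data.find(begin_tag, pos)
        # Stop once the next begin tag is missing or belongs to a later split.
        if begin == -1 or begin >= end:
            return
        close = data.find(end_tag, begin + len(begin_tag))
        if close == -1:
            return  # unterminated record at end of file; nothing complete to emit
        close += len(end_tag)
        # `close` may lie beyond `end`: the reader finishes the record it started.
        yield data[begin:close]
        pos = close


def read_all_splits(data, split_size, begin_tag, end_tag):
    """Simulate one reader per fixed-size split; collect (split_start, record) pairs."""
    out = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        for record in read_records(data, start, end, begin_tag, end_tag):
            out.append((start, record))
    return out


# Hypothetical sample data: three <record> elements inside a <title> root.
doc = b"<title>" + b"".join(
    b"<record><book>book%d</book><author>me</author><year>2009</year></record>" % i
    for i in (1, 2, 3)
) + b"</title>"

for split_start, record in read_all_splits(doc, 100, b"<record>", b"</record>"):
    print(split_start, record.decode())
```

Under this rule each record is emitted exactly once wherever the boundaries fall: the split that begins in the middle of a record skips ahead to the next begin tag, and the previous split has already read that record to its end.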
On Oct 29, 2009, at 4:12 PM, Amandeep Khurana wrote:
> Store the entire xml in one line...
Re: What if an XML file is across boundary of HDFS chunks?
Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Steve,
Look at the mailing list archives: at least two different people have suggested a specialized input splitter that you could use.
Brian
On Nov 16, 2009, at 2:02 PM, Steve Gao wrote:
> Thanks, but this is not a neat solution when the XML block is very large.
> Does anybody have another solution? Thanks!
Re: What if an XML file is across boundary of HDFS chunks?
Posted by Steve Gao <st...@yahoo.com>.
Thanks, but this is not a neat solution when the XML block is very large.
Does anybody have another solution? Thanks!
--- On Thu, 10/29/09, Amandeep Khurana <am...@gmail.com> wrote:
Store the entire xml in one line...
Re: What if an XML file is across boundary of HDFS chunks?
Posted by Amandeep Khurana <am...@gmail.com>.
Store the entire xml in one line...
--
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
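
The "one line" suggestion could be carried out as a preprocessing step before the data is loaded into HDFS, so that a line-oriented reader never sees half a record. The Python sketch below is only an illustration under stated assumptions: each record sits in its own element directly under the root (a hypothetical <record> tag), the XML is data-oriented with no mixed content, and the helper names are made up for this example. It parses the whole document in memory, so it does not by itself address the concern about very large XML blocks.

```python
import xml.etree.ElementTree as ET


def strip_layout_whitespace(elem):
    """Drop whitespace-only text used for indentation (assumes no mixed content)."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None


def records_as_lines(xml_text, record_tag):
    """Yield each direct child named `record_tag` serialized on a single line."""
    root = ET.fromstring(xml_text)
    for rec in root.findall(record_tag):
        strip_layout_whitespace(rec)
        rec.tail = None  # do not carry text that follows the record
        line = ET.tostring(rec, encoding="unicode")
        # Newlines left inside text content become character references,
        # so the line stays intact and still parses back to the same text.
        yield line.replace("\r", "&#13;").replace("\n", "&#10;")


# Hypothetical sample input with indentation and one multi-line value.
sample = """<title>
  <record>
    <book>book1</book>
    <author>me</author>
    <year>2009</year>
  </record>
  <record>
    <book>book2</book>
    <author>first
second</author>
    <year>2009</year>
  </record>
</title>"""

for line in records_as_lines(sample, "record"):
    print(line)
```

Each output line is a complete, well-formed record, so any reader that splits input on newlines hands a mapper whole records only.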