Posted to common-dev@hadoop.apache.org by Steve Gao <st...@yahoo.com> on 2009/10/29 21:32:21 UTC

What if an XML file cross boundary of HDFS chunks?

Has anybody run into a similar issue? If you store XML files in HDFS, how can you make sure that a chunk read by a mapper does not contain only part of an XML segment?

For example:

<title>
<book>book1</book>
<author>me</author>
..............what if this is the boundary of a chunk?...................
<year>2009</year>
<book>book2</book>
<author>me</author>
<year>2009</year>
<book>book3</book>
<author>me</author>
<year>2009</year>
</title>

Re: What if an XML file cross boundary of HDFS chunks?

Posted by Jason Venner <ja...@gmail.com>.
I use the StreamXmlRecordReader from the streaming contrib package, and it
works very well. Your key becomes the stanza you are looking for.
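
A rough way to picture how a record reader of this kind can cope with a
stanza that straddles a chunk boundary: each split emits only the stanzas
whose begin tag starts inside the split, and reads past the end of the split
to finish the last one. The sketch below is a simplified in-memory model of
that rule, not the actual StreamXmlRecordReader source. The class name
XmlSplitSketch, the method readSplit, the <entry> tags and the split size of
40 characters are all assumptions made for illustration, and it assumes
stanzas with the same tag are never nested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it works on a String, not on an HDFS stream.
final class XmlSplitSketch {

    // Returns the records whose begin tag starts inside [start, end).
    // A record that starts inside the split is read through its end tag,
    // even when that end tag lies beyond `end`.
    static List<String> readSplit(String data, int start, int end,
                                  String beginTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = start;
        while (true) {
            int recStart = data.indexOf(beginTag, pos);
            if (recStart < 0 || recStart >= end) {
                break; // the next record, if any, belongs to a later split
            }
            int recEnd = data.indexOf(endTag, recStart + beginTag.length());
            if (recEnd < 0) {
                break; // unterminated record: this sketch simply drops it
            }
            recEnd += endTag.length();
            records.add(data.substring(recStart, recEnd));
            pos = recEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<entry><book>book1</book><year>2009</year></entry>"
                    + "<entry><book>book2</book><year>2009</year></entry>"
                    + "<entry><book>book3</book><year>2009</year></entry>";
        int splitSize = 40; // deliberately smaller than one record
        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            List<String> recs = readSplit(data, start, end, "<entry>", "</entry>");
            System.out.println("split [" + start + "," + end + "): " + recs);
        }
    }
}
```

Because a begin tag belongs to exactly one split, every stanza is emitted
exactly once, and a split that starts in the middle of a stanza skips ahead
to the next begin tag.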


Re: What if an XML file cross boundary of HDFS chunks?

Posted by "Oliver B. Fischer" <o....@swe-blog.net>.

Hello Jeff,

does that mean there is no programmatic way to define where a logical file
is split, independent of how its blocks are distributed in HDFS?

Regards

Oliver

Jeff Zhang wrote:
> Hi Steve,
> 
> When you want to read xml, you should provide your custom InputFormat which
> extends FileInputFormat.
> 
> and override the method isSplitable to not split a file , that means one xml
> file for one mapper.
> 
> 
>   protected boolean isSplitable(FileSystem fs, Path filename) {
>     return false;
>   }


-- 
Oliver B. Fischer, Schönhauser Allee 64, 10437 Berlin
Tel. +49 30 44793251, Mobil: +49 178 7903538
Mail: o.b.fischer@swe-blog.net Blog: http://www.swe-blog.net


Re: What if an XML file cross boundary of HDFS chunks?

Posted by Vamc <kr...@gmail.com>.
Hi Steve,

I am new to this forum and new to Hadoop. I have the same kind of problem:
my input file cannot be treated as a text file.

Can't we handle it like this:

Define our own InputFormat, InputSplit, and RecordReader.


Thanks 
Vamsi 







Re: What if an XML file cross boundary of HDFS chunks?

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Steve,

When you want to read XML, you should provide a custom InputFormat that
extends FileInputFormat, and override its isSplitable method so that a file
is never split. That means one mapper per XML file.


  // Returning false keeps each file in a single split.
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }
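
To make the trade-off concrete, here is a minimal sketch of how split
planning changes once a file is marked as not splitable. It is a simplified
model, not FileInputFormat's real getSplits logic: it ignores the minimum and
maximum split-size settings and the slack FileInputFormat allows on the last
split. The names SplitPlanSketch and planSplits are hypothetical, and the
200 MB file and 64 MB split size are example figures only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it does not use any Hadoop APIs.
final class SplitPlanSketch {

    // Returns {offset, length} pairs, one per planned input split.
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splitable) {
        List<long[]> splits = new ArrayList<>();
        if (!splitable || fileLength <= splitSize) {
            // Not splitable (or small enough): one split covers the whole file.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Splitable: the file is divided into several splits, one mapper each.
        System.out.println(planSplits(200 * mb, 64 * mb, true).size());
        // Not splitable: a single split, so a single mapper reads the whole file.
        System.out.println(planSplits(200 * mb, 64 * mb, false).size());
    }
}
```

With these example figures the splitable plan has four splits and the
non-splitable plan has one. A single mapper then reads every block of the
file, including blocks stored on other nodes, so this approach suits many
moderately sized XML files better than one very large file.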



Best Regards,

Jeff zhang


