Posted to common-dev@hadoop.apache.org by Steve Gao <st...@yahoo.com> on 2009/10/29 21:32:21 UTC
What if an XML file cross boundary of HDFS chunks?
Does anybody have a similar issue? If you store XML files in HDFS, how can you make sure a chunk read by a mapper does not contain partial data of an XML segment?
For example:
<title>
<book>book1</book>
<author>me</author>
..............what if this is the boundary of a chunk?...................
<year>2009</year>
<book>book2</book>
<author>me</author>
<year>2009</year>
<book>book3</book>
<author>me</author>
<year>2009</year>
</title>
Re: What if an XML file cross boundary of HDFS chunks?
Posted by Jason Venner <ja...@gmail.com>.
I use the StreamXmlRecordReader from the streaming contrib package; it
works very well. Your key becomes the stanza you are looking for.
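To illustrate the idea behind a tag-delimited record reader, here is a minimal sketch in plain Java with no Hadoop dependencies. It is a toy model, not the StreamXmlRecordReader source: the class name SplitRecordSketch, the method readSplit, and the <item> wrapper tag are hypothetical, and it works on character offsets in a String rather than on a byte stream. The assumed rule is that a split owns every record whose begin tag starts inside it, and reads past the end of the split if necessary to reach the end tag.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a tag-delimited record reader working on one split of a file. */
final class SplitRecordSketch {

    /**
     * Returns every record whose begin tag starts inside [splitStart, splitEnd).
     * Such a record is read through to its end tag even when that tag lies
     * beyond splitEnd. A record that began in an earlier split is skipped,
     * because its begin tag is not found inside this split.
     */
    static List<String> readSplit(String data, int splitStart, int splitEnd,
                                  String beginTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (pos < splitEnd) {
            int begin = data.indexOf(beginTag, pos);
            if (begin < 0 || begin >= splitEnd) {
                break; // the next record, if any, belongs to a later split
            }
            int end = data.indexOf(endTag, begin + beginTag.length());
            if (end < 0) {
                break; // unterminated record: nothing more to emit
            }
            end += endTag.length();
            records.add(data.substring(begin, end));
            pos = end; // may now be past splitEnd, which ends the loop
        }
        return records;
    }

    public static void main(String[] args) {
        // Three 18-character records; the split boundary at offset 27
        // falls in the middle of the second record.
        String data = "<item>book1</item><item>book2</item><item>book3</item>";
        System.out.println(readSplit(data, 0, 27, "<item>", "</item>"));
        System.out.println(readSplit(data, 27, data.length(), "<item>", "</item>"));
    }
}
```

In this run the first split yields book1 and book2 (the second record is completed by reading past offset 27), and the second split yields only book3, so no record is lost or emitted twice.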
On Sat, Oct 31, 2009 at 7:38 AM, Oliver B. Fischer <o.b.fischer@swe-blog.net> wrote:
> Hello Jeff,
>
> does that mean there is no programmatic way to define where a logical
> file is split, independent of the distribution of its blocks in HDFS?
>
> Regards
>
> Oliver
>
> Jeff Zhang wrote:
> > Hi Steve,
> >
> > When you want to read XML, you should provide a custom InputFormat
> > that extends FileInputFormat, and override the isSplitable method so
> > that the file is not split. That means one XML file per mapper.
> >
> > protected boolean isSplitable(FileSystem fs, Path filename) {
> >     return false;
> > }
Re: What if an XML file cross boundary of HDFS chunks?
Posted by "Oliver B. Fischer" <o....@swe-blog.net>.
Hello Jeff,
does that mean there is no programmatic way to define where a logical
file is split, independent of the distribution of its blocks in HDFS?
Regards
Oliver
Jeff Zhang wrote:
> Hi Steve,
>
> When you want to read XML, you should provide a custom InputFormat that
> extends FileInputFormat, and override the isSplitable method so that the
> file is not split. That means one XML file per mapper.
>
> protected boolean isSplitable(FileSystem fs, Path filename) {
>     return false;
> }
--
Oliver B. Fischer, Schönhauser Allee 64, 10437 Berlin
Tel. +49 30 44793251, Mobil: +49 178 7903538
Mail: o.b.fischer@swe-blog.net Blog: http://www.swe-blog.net
Re: What if an XML file cross boundary of HDFS chunks?
Posted by Vamc <kr...@gmail.com>.
Hi Steve,
I am new to this forum and a beginner with Hadoop. I have the same kind of
problem, where the input file cannot be treated as a text file.
Can't we do it like this: define our own InputFormat, InputSplit, and
RecordReader?
Thanks
Vamsi
Jeff Zhang-4 wrote:
>
> Hi Steve,
>
> When you want to read XML, you should provide a custom InputFormat that
> extends FileInputFormat, and override the isSplitable method so that the
> file is not split. That means one XML file per mapper.
>
> protected boolean isSplitable(FileSystem fs, Path filename) {
>     return false;
> }
>
> Best Regards,
>
> Jeff Zhang
Re: What if an XML file cross boundary of HDFS chunks?
Posted by Jeff Zhang <zj...@gmail.com>.
Hi Steve,
When you want to read XML, you should provide a custom InputFormat that
extends FileInputFormat, and override the isSplitable method so that the
file is not split. That means one XML file per mapper.
protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
}
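To make the consequence of returning false concrete, here is a small toy model of split planning in plain Java. It is only a sketch and not Hadoop's FileInputFormat code: the class SplitPlanSketch and the method planSplits are hypothetical, sizes are in arbitrary units, and details of the real split calculation (minimum and maximum split sizes, block locations) are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of how a file is cut into input splits. */
final class SplitPlanSketch {

    /**
     * Returns {offset, length} pairs describing the planned splits.
     * When splitable is false the whole file becomes a single split,
     * so exactly one mapper would read it from start to end.
     */
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splitable) {
        if (fileLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and splitSize > 0");
        }
        List<long[]> splits = new ArrayList<>();
        if (fileLength == 0) {
            return splits; // this toy model plans nothing for an empty file
        }
        if (!splitable) {
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 200-unit file with an assumed split size of 64 units.
        System.out.println(planSplits(200, 64, true).size());  // 4 splits, 4 mappers
        System.out.println(planSplits(200, 64, false).size()); // 1 split, 1 mapper
    }
}
```

The trade-off is that an unsplit file can never contain a partial XML record, but a single large file is then processed by one mapper with no parallelism.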
Best Regards,
Jeff Zhang
On Thu, Oct 29, 2009 at 12:32 PM, Steve Gao <st...@yahoo.com> wrote:
>
> Does anybody have a similar issue? If you store XML files in HDFS, how
> can you make sure a chunk read by a mapper does not contain partial data of
> an XML segment?
>
> For example:
>
> <title>
> <book>book1</book>
> <author>me</author>
> ..............what if this is the boundary of a chunk?...................
> <year>2009</year>
> <book>book2</book>
> <author>me</author>
> <year>2009</year>
> <book>book3</book>
> <author>me</author>
> <year>2009</year>
> </title>