Posted to common-dev@hadoop.apache.org by Vamc <kr...@gmail.com> on 2010/05/17 13:12:43 UTC
Re: What if an XML file cross boundary of HDFS chunks?
Hi Steve,
I am new to this forum and a beginner with Hadoop. I have the same kind of problem,
where the input file cannot be treated as a plain text file.
Can't we do it like this:
define our own InputFormat, InputSplit and RecordReader?
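A rough sketch of what the record-reading side of that idea could look like, in plain Java with no Hadoop classes so it can run on its own. It assumes each record is wrapped in a hypothetical <entry>...</entry> element (that wrapper is not in Steve's sample), and XmlSplitScanner and recordsForSplit are names made up for illustration. The rule it tries to show: a split owns every record whose start tag begins inside it, and it reads past the end of the split to finish the last record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation (no Hadoop API) of split-aware record reading. */
public class XmlSplitScanner {

    /** Index of the first occurrence of pattern in data at or after from, or -1. */
    public static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns every record whose start tag begins inside [splitStart, splitEnd).
     * A record is read through its end tag even if that tag lies past splitEnd,
     * so a record crossing the boundary is emitted once, by the earlier split.
     */
    public static List<String> recordsForSplit(byte[] data, int splitStart, int splitEnd,
                                               String startTag, String endTag) {
        byte[] open = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int begin = indexOf(data, open, pos);
            if (begin < 0 || begin >= splitEnd) {
                break; // the next record, if any, belongs to a later split
            }
            int endTagAt = indexOf(data, close, begin + open.length);
            if (endTagAt < 0) {
                break; // unterminated record: nothing more to emit
            }
            int end = endTagAt + close.length;
            records.add(new String(data, begin, end - begin, StandardCharsets.UTF_8));
            pos = end;
        }
        return records;
    }
}
```

In a real job this logic would presumably sit inside the RecordReader returned by the custom InputFormat, reading from the HDFS stream instead of an in-memory byte array.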
Thanks
Vamsi
Jeff Zhang-4 wrote:
>
> Hi Steve,
>
> When you want to read XML, you should provide a custom InputFormat that
> extends FileInputFormat,
>
> and override the method isSplitable so that a file is not split, which
> means one XML file per mapper.
>
>
> // Never split: each XML file goes whole to a single mapper.
> protected boolean isSplitable(FileSystem fs, Path filename) {
>     return false;
> }
>
>
>
> Best Regards,
>
> Jeff zhang
>
>
>
> On Thu, Oct 29, 2009 at 12:32 PM, Steve Gao <st...@yahoo.com> wrote:
>
>>
>> Has anybody had a similar issue? If you store XML files in HDFS, how
>> can you make sure a chunk read by a mapper does not contain partial data
>> of an XML segment?
>>
>> For example:
>>
>> <title>
>> <book>book1</book>
>> <author>me</author>
>> ..............what if this is the boundary of a chunk?...................
>> <year>2009</year>
>> <book>book2</book>
>>
>> <author>me</author>
>>
>> <year>2009</year>
>> <book>book3</book>
>>
>> <author>me</author>
>>
>> <year>2009</year>
>> </title>
>>
>>
>>
>>
>>
>>
>>
>
>
--
View this message in context: http://old.nabble.com/What-if-an-XML-file-cross-boundary-of-HDFS-chunks--tp26120236p28582046.html
Sent from the Hadoop core-dev mailing list archive at Nabble.com.