Posted to common-dev@hadoop.apache.org by Vamc <kr...@gmail.com> on 2010/05/17 13:12:43 UTC

Re: What if an XML file cross boundary of HDFS chunks?

Hi Steve,

I am new to this forum and a beginner with Hadoop. I have the same kind of
problem: my input file cannot be treated as a plain text file.

Can't we do it like this:

Define our own InputFormat, InputSplit, and RecordReader?


Thanks 
Vamsi 
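
To make the RecordReader idea concrete, here is a minimal sketch of the boundary rule such a reader could follow. It is plain Java over a byte array, not Hadoop code: the class XmlSplitSketch and its methods readRecords and indexOf are hypothetical names made up for this illustration, and the "file" and split offsets are simulated. The assumed rule is that a record belongs to the split in which its start tag begins, and the reader keeps reading past the end of its split until it reaches the matching end tag. It also assumes the record tag is not nested inside itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, stdlib-only simulation of split-aware XML record reading.
final class XmlSplitSketch {

    // Index of the first match of pattern that STARTS in [from, limit), or -1.
    // The match itself may extend beyond limit, but never beyond the data.
    static int indexOf(byte[] data, byte[] pattern, int from, int limit) {
        for (int i = from; i < limit && i + pattern.length <= data.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    // Returns the records owned by the split [start, end): every record whose
    // start tag begins inside the split, read through to its end tag even if
    // that end tag lies beyond the split boundary.
    static List<String> readRecords(byte[] data, int start, int end,
                                    String startTag, String endTag) {
        byte[] open = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = start;
        while (true) {
            // The start tag must begin before the end of this split.
            int recStart = indexOf(data, open, pos, end);
            if (recStart < 0) {
                break;
            }
            // The end tag may be anywhere up to the end of the data.
            int closeAt = indexOf(data, close, recStart + open.length, data.length);
            if (closeAt < 0) {
                break; // unterminated record: skipped in this sketch
            }
            int recEnd = closeAt + close.length;
            records.add(new String(data, recStart, recEnd - recStart, StandardCharsets.UTF_8));
            pos = recEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        String doc = "<lib><book>book1</book><book>book2</book><book>book3</book></lib>";
        byte[] data = doc.getBytes(StandardCharsets.UTF_8);
        int boundary = 30; // simulated chunk boundary, falls inside the second record
        System.out.println(readRecords(data, 0, boundary, "<book>", "</book>"));
        System.out.println(readRecords(data, boundary, data.length, "<book>", "</book>"));
    }
}
```

With this rule, a reader whose split starts in the middle of a record skips ahead to the next start tag, and the reader of the previous split finishes that record, so each record is read exactly once no matter where the boundary falls. In a real job the same logic would sit inside a custom RecordReader that seeks to the split start in the HDFS stream rather than indexing into an in-memory array.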




Jeff Zhang-4 wrote:
> 
> Hi Steve,
> 
> When you want to read XML, you should provide a custom InputFormat that
> extends FileInputFormat and overrides the isSplitable method so that a
> file is never split. That means one XML file per mapper.
> 
> 
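>   // Old (org.apache.hadoop.mapred) API signature. Returning false keeps
>   // each file in a single split, so one mapper reads the whole file.
>   @Override
>   protected boolean isSplitable(FileSystem fs, Path filename) {
>     return false;
>   }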
> 
> 
> 
> Best Regards,
> 
> Jeff Zhang
> 
> 
> 
> On Thu, Oct 29, 2009 at 12:32 PM, Steve Gao <st...@yahoo.com> wrote:
> 
>>
>> Does anybody have a similar issue? If you store XML files in HDFS, how
>> can you make sure a chunk read by a mapper does not contain partial data
>> of an XML segment?
>>
>> For example:
>>
>> <title>
>> <book>book1</book>
>> <author>me</author>
>> ..............what if this is the boundary of a chunk?...................
>> <year>2009</year>
>> <book>book2</book>
>> <author>me</author>
>> <year>2009</year>
>> <book>book3</book>
>> <author>me</author>
>> <year>2009</year>
>> </title>
>>
> 
> 
