Posted to hdfs-user@hadoop.apache.org by Harsh J <ha...@cloudera.com> on 2012/05/11 14:19:25 UTC

Re: How to maintain record boundaries

Shreya,

This has been asked several times before. The way TextInputFormat (for
one example) handles it is explained in the Map section of
http://wiki.apache.org/hadoop/HadoopMapReduce. If you are writing a
custom reader, feel free to follow the same steps - you basically need
to keep reading into the next block until you hit the end-of-record
marker, rather than limiting yourself to one-block reads.

All input formats provided in MR handle this already for you, and you
needn't worry about this unless you're implementing a whole new reader
from scratch.
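
To make the steps concrete, here is a minimal sketch in plain Java (no
Hadoop classes) of the boundary rule a line-oriented reader can follow.
The class and method names are made up for illustration and the file is
just a byte array, so treat it as an approximation of the idea rather
than the actual reader code: a split that does not start at offset 0
discards the partial line it lands in, and every split is allowed to
read past its end to finish its last record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: simulates reading newline-terminated records
// from one split [start, end) of a file held in memory.
final class SplitBoundaryDemo {

    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Back up one byte and skip through the first newline. If the
            // previous byte is a newline, the line at 'start' is ours;
            // otherwise the partial line belongs to the previous split.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }

        List<String> records = new ArrayList<>();
        // A record belongs to this split if it BEGINS before 'end'.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            // Deliberately not bounded by 'end': the last record may run
            // into the next block, and we follow it to its terminator.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            records.add(new String(data, lineStart, pos - lineStart,
                    StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 8; // pretend block size, cuts records in the middle
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("[" + start + "," + end + ") -> "
                    + readSplit(data, start, end));
        }
    }
}
```

Taken together, the splits return every record exactly once even though
the split boundaries fall in the middle of records. A split whose range
contains no record start simply returns nothing.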

On Fri, May 11, 2012 at 5:45 PM,  <Sh...@cognizant.com> wrote:
> Hi
>
> When we store data in HDFS, it gets broken into small pieces and distributed across the cluster based on the block size for the file.
> While processing the data with an MR program I want each record as a whole, without it being split across nodes, but the data was already split and stored in HDFS when I loaded it.
> How do I make sure that my record doesn't get split? How would my input format make a difference now?
>
> Regards
> Shreya



-- 
Harsh J

Re: How to maintain record boundaries

Posted by Shi Yu <sh...@uchicago.edu>.
Here is some quick code for you (based on Tom's book). You can
override TextInputFormat's isSplitable method to prevent splitting, so
that each file is read whole by a single mapper. This is pretty
important and useful when processing sequence data.

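//Old API (org.apache.hadoop.mapred)

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {

     // Returning false keeps each input file as a single split.
     @Override
     protected boolean isSplitable(FileSystem fs, Path file){
         return false;
     }

}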


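//New API (org.apache.hadoop.mapreduce)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormatNewAPI extends TextInputFormat {

     // Returning false keeps each input file as a single split.
     @Override
     protected boolean isSplitable(JobContext context, Path file){
         return false;
     }

}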
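
To see what returning false changes, here is a simplified sketch in
plain Java of how split planning differs between the two cases. The
class and method names are hypothetical, and as far as I know the real
FileInputFormat also honours min/max split size settings and a small
slop factor for the last split, all of which are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: plans {offset, length} splits for one file.
final class SplitPlanDemo {

    static List<long[]> planSplits(long fileLength, long splitSize,
                                   boolean splittable) {
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // One split covering the whole file, hence one mapper.
            splits.add(new long[] {0, fileLength});
            return splits;
        }
        // Otherwise cut the file into splitSize-sized ranges.
        for (long off = 0; off < fileLength; off += splitSize) {
            splits.add(new long[] {off, Math.min(splitSize, fileLength - off)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 300L * 1024 * 1024; // assumed 300 MB file
        long splitSize = 128L * 1024 * 1024;  // assumed 128 MB split size
        System.out.println("splittable:     "
                + planSplits(fileLength, splitSize, true).size() + " split(s)");
        System.out.println("not splittable: "
                + planSplits(fileLength, splitSize, false).size() + " split(s)");
    }
}
```

The trade-off is parallelism: a non-splittable file is processed by one
mapper no matter how many blocks it spans, and most of those blocks
will likely be read over the network rather than locally.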

