Posted to common-user@hadoop.apache.org by Mark <st...@gmail.com> on 2010/08/20 03:47:56 UTC

InputSplit and RecordReader

From what I understand, an InputSplit is a byte slice of a particular
file which is then handed off to an individual mapper for processing. Is
the size of the InputSplit equal to the Hadoop block size, i.e. 64/128 MB?
If not, what is the size?

Now the RecordReader takes in bytes from the InputSplit and transforms
them into a record-oriented structure suitable for use within a mapper,
i.e. key/value pairs, correct? Now the wiki says it's the RecordReader's
job to respect record boundaries. How is this accomplished? Say I have an
InputSplit which is 100 KB in size and each record is approximately 30 KB
in size. What happens to the last 10 KB in this example? I believe I read
somewhere that it will read past that boundary, but how is that possible
if the RecordReader has only been presented with 100 KB?

Can someone please clarify these issues for me? Thanks.

Re: InputSplit and RecordReader

Posted by Gang Luo <lg...@yahoo.com.cn>.
Right.

-Gang




----- Original Message -----
From: Mark <st...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2010/8/20 (Fri) 1:58:29 AM
Subject: Re: InputSplit and RecordReader

On 8/19/10 7:18 PM, Gang Luo wrote:
> The size of an input split can be different from a block. You can specify
> the max/min size of input splits.
>
> InputSplit is actually metadata indicating the start point in a file, the
> length of the split, etc. It doesn't present you with the real data. A
> mapper, when assigned a split to process, will read the input as specified
> in the InputSplit. It can read across the boundary if needed.
>
> -Gang
>
>
>
>
> ----- Original Message -----
> From: Mark <st...@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: 2010/8/19 (Thu) 9:47:56 PM
> Subject: InputSplit and RecordReader
>
> From what I understand, an InputSplit is a byte slice of a particular file
> which is then handed off to an individual mapper for processing. Is the
> size of the InputSplit equal to the Hadoop block size, i.e. 64/128 MB? If
> not, what is the size?
>
> Now the RecordReader takes in bytes from the InputSplit and transforms
> them into a record-oriented structure suitable for use within a mapper,
> i.e. key/value pairs, correct? Now the wiki says it's the RecordReader's
> job to respect record boundaries. How is this accomplished? Say I have an
> InputSplit which is 100 KB in size and each record is approximately 30 KB
> in size. What happens to the last 10 KB in this example? I believe I read
> somewhere that it will read past that boundary, but how is that possible
> if the RecordReader has only been presented with 100 KB?
>
> Can someone please clarify these issues for me? Thanks.

Ok, so that makes a little more sense. Basically an InputSplit says
"start at offset x and read about y bytes," and then the RecordReader
would extend that range a little to finish the last record. Is this
along the right lines?


Re: InputSplit and RecordReader

Posted by Mark <st...@gmail.com>.
On 8/19/10 7:18 PM, Gang Luo wrote:
> The size of an input split can be different from a block. You can specify
> the max/min size of input splits.
>
> InputSplit is actually metadata indicating the start point in a file, the
> length of the split, etc. It doesn't present you with the real data. A
> mapper, when assigned a split to process, will read the input as specified
> in the InputSplit. It can read across the boundary if needed.
>
> -Gang
>
>
>
>
> ----- Original Message -----
> From: Mark <st...@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: 2010/8/19 (Thu) 9:47:56 PM
> Subject: InputSplit and RecordReader
>
> From what I understand, an InputSplit is a byte slice of a particular file
> which is then handed off to an individual mapper for processing. Is the
> size of the InputSplit equal to the Hadoop block size, i.e. 64/128 MB? If
> not, what is the size?
>
> Now the RecordReader takes in bytes from the InputSplit and transforms
> them into a record-oriented structure suitable for use within a mapper,
> i.e. key/value pairs, correct? Now the wiki says it's the RecordReader's
> job to respect record boundaries. How is this accomplished? Say I have an
> InputSplit which is 100 KB in size and each record is approximately 30 KB
> in size. What happens to the last 10 KB in this example? I believe I read
> somewhere that it will read past that boundary, but how is that possible
> if the RecordReader has only been presented with 100 KB?
>
> Can someone please clarify these issues for me? Thanks.

Ok, so that makes a little more sense. Basically an InputSplit says
"start at offset x and read about y bytes," and then the RecordReader
would extend that range a little to finish the last record. Is this
along the right lines?
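
That is essentially what Hadoop's line-oriented record readers do. Below is
a simplified sketch of the two boundary rules (illustrative code, not the
real LineRecordReader): unless the split starts at byte 0, the first,
probably partial, record is skipped because it belongs to the previous
split's reader; and the last read is allowed to run past the split's end to
finish the record it started.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

// Simplified boundary handling for one split of a plain-text file.
public class SplitBoundaryDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path file = new Path(args[0]);              // input file
        long start = Long.parseLong(args[1]);       // split start offset
        long end = start + Long.parseLong(args[2]); // split end offset

        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = fs.open(file);
        in.seek(start);
        LineReader reader = new LineReader(in, conf);
        Text line = new Text();
        long pos = start;

        // Rule 1: a split that doesn't start at byte 0 almost certainly
        // starts mid-record; that partial line is the previous reader's
        // responsibility, so skip it.
        if (start != 0) {
            pos += reader.readLine(line);
        }

        // Rule 2: emit every line that *starts* inside [start, end); the
        // final readLine may run past 'end' to finish its record, which is
        // why the reader can cross the split boundary.
        while (pos < end) {
            int bytes = reader.readLine(line);
            if (bytes == 0) break;                  // end of file
            System.out.println(line);
            pos += bytes;
        }
        reader.close();
    }
}

The extra bytes this reader consumes past 'end' are exactly the bytes the
next split's reader throws away under rule 1, so every record is read
exactly once, by exactly one mapper.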

Re: InputSplit and RecordReader

Posted by Gang Luo <lg...@yahoo.com.cn>.
The size of an input split can be different from a block. You can specify
the max/min size of input splits.
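
For example, with the 0.20-era org.apache.hadoop.mapreduce API the bounds
can be set per job. A minimal sketch (the job name and sizes here are just
illustrative; depending on the release, the underlying properties are
mapred.min.split.size / mapred.max.split.size or
mapreduce.input.fileinputformat.split.minsize / .maxsize):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        // FileInputFormat sizes each split roughly as
        // max(minSize, min(maxSize, blockSize)), so raising the minimum or
        // lowering the maximum makes splits diverge from the block size.
        Job job = new Job(new Configuration(), "split-size-demo");
        FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);  // 32 MB floor
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB ceiling
    }
}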


InputSplit is actually metadata indicating the start point in a file, the
length of the split, etc. It doesn't present you with the real data. A
mapper, when assigned a split to process, will read the input as specified
in the InputSplit. It can read across the boundary if needed.
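
To make the "metadata, not data" point concrete, here is a rough sketch
with FileSplit, the concrete InputSplit used for files (the path and host
names below are made up): the split carries only a file path, a start
offset, a length, and the hosts that store those bytes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitMetadataDemo {
    public static void main(String[] args) throws Exception {
        // A FileSplit describes a byte range; it holds no file contents.
        FileSplit split = new FileSplit(
                new Path("hdfs:///data/input.txt"), // which file (illustrative)
                0L,                                 // start offset in bytes
                100L * 1024,                        // length: 100 KB
                new String[] {"host1", "host2"});   // hosts storing these bytes
        System.out.println(split.getPath() + " @ " + split.getStart()
                + ", " + split.getLength() + " bytes");
    }
}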

-Gang




----- Original Message -----
From: Mark <st...@gmail.com>
To: common-user@hadoop.apache.org
Sent: 2010/8/19 (Thu) 9:47:56 PM
Subject: InputSplit and RecordReader

From what I understand, an InputSplit is a byte slice of a particular file
which is then handed off to an individual mapper for processing. Is the
size of the InputSplit equal to the Hadoop block size, i.e. 64/128 MB? If
not, what is the size?

Now the RecordReader takes in bytes from the InputSplit and transforms
them into a record-oriented structure suitable for use within a mapper,
i.e. key/value pairs, correct? Now the wiki says it's the RecordReader's
job to respect record boundaries. How is this accomplished? Say I have an
InputSplit which is 100 KB in size and each record is approximately 30 KB
in size. What happens to the last 10 KB in this example? I believe I read
somewhere that it will read past that boundary, but how is that possible
if the RecordReader has only been presented with 100 KB?

Can someone please clarify these issues for me? Thanks.