Posted to hdfs-user@hadoop.apache.org by Jeff LI <un...@gmail.com> on 2012/12/02 23:03:03 UTC

Input splits for sequence file input

Hello,

I was reading about the relationship between input splits and HDFS blocks,
and a question came up:

If a logical record crosses an HDFS block boundary, say between block#1 and
block#2, does the mapper assigned to this input split ask for (1) both
blocks, or (2) block#1 plus just the part of block#2 that this logical
record extends into, or (3) block#1 plus the part of block#2 up to some sync
point that covers this particular logical record?  Note the input is a
sequence file.

I guess my question really is: does Hadoop operate on a block basis, or does
it respect some sort of logical structure within a block when it is trying
to feed the mappers their input data?

Cheers

Jeff

Re: Input splits for sequence file input

Posted by Harsh J <ha...@cloudera.com>.
Hi Jeff,

This has been asked several times before (check out
http://search-hadoop.com please).

The answer is (3) for SequenceFiles (thanks to their embedded sync markers)
and (2) as a general rule (e.g. for text files, which have no sync markers).
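To make the sync-point behavior concrete, here is a toy Python model (an illustration only, not Hadoop's actual API or on-disk format): a reader for a split starts at the first sync marker at or after the split's start offset, and keeps reading until it reaches the first sync marker at or after the split's end. A record that straddles the block boundary is therefore read in full by exactly one mapper.

```python
# Toy model of sync-marker-based split reading (illustration only, not Hadoop code).
# A "file" is a list of (offset, payload) records; sync markers sit at known offsets.

def records_for_split(records, sync_offsets, split_start, split_end):
    """Return the records a reader for byte range [split_start, split_end) emits.

    records: list of (offset, payload) pairs, ordered by offset.
    sync_offsets: offsets at which a sync marker precedes a record.
    The reader starts at the first sync >= split_start and stops at the
    first sync >= split_end, so it may read past split_end to finish a
    record that crosses the boundary.
    """
    start = min((s for s in sync_offsets if s >= split_start), default=None)
    if start is None:
        return []  # no sync marker in or after this split: nothing to read
    stop = min((s for s in sync_offsets if s >= split_end), default=float("inf"))
    return [payload for (off, payload) in records if start <= off < stop]

# The record at offset 90 straddles the 100-byte "block" boundary; sync
# markers were written at offsets 0 and 120.
recs = [(0, "r1"), (40, "r2"), (90, "r3-straddles-boundary"), (120, "r4")]
syncs = [0, 120]

first = records_for_split(recs, syncs, 0, 100)     # split for block #1
second = records_for_split(recs, syncs, 100, 200)  # split for block #2
```

In this model the first mapper reads r1, r2, and the straddling record (reading a little past its split end into block#2), while the second mapper starts at the sync marker after the boundary and only sees r4, so no record is read twice or lost.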

On Mon, Dec 3, 2012 at 3:33 AM, Jeff LI <un...@gmail.com> wrote:
> Hello,
>
> I was reading on the relationship between input splits and HDFS blocks and a
> question came up to me:
>
> If a logical record crosses HDFS block boundary, let's say block#1 and
> block#2, does the mapper assigned with this input split asks for (1) both
> blocks, or (2) block#1 and just the part of block#2 that this logical record
> extends to, or (3) block#1 and part of block#2 up to some sync point that
> covers this particular logical record?  Note the input is sequence file.
>
> I guess my question really is: does Hadoop operate on a block basis or does
> it respect some sort of logical structure within a block when it's trying to
> feed the mappers with input data.
>
> Cheers
>
> Jeff
>



-- 
Harsh J

Re: Input splits for sequence file input

Posted by Mahesh Balija <ba...@gmail.com>.
Hi Jeff,

            Beyond the HDFS blocks there is a logical structure called an
*InputSplit* (a *FileSplit*, for file-based input).
            A mapper operates on its InputSplit through a RecordReader, and
that RecordReader is specific to the InputFormat.
            The InputFormat parses the input and generates key-value pairs.

            The InputFormat also handles records that are split across a
FileSplit boundary (i.e., across different blocks).

            Please check this link for more information:
http://wiki.apache.org/hadoop/HadoopMapReduce
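A minimal sketch of that boundary handling for a line-oriented format (a toy model, not Hadoop's actual LineRecordReader): by convention, a reader whose split does not start at byte 0 skips the partial first line, because the previous split's reader finishes that line by reading past its own split end.

```python
# Toy sketch of how a line-oriented record reader handles records that
# cross a split boundary (illustration only, not Hadoop's LineRecordReader).

def lines_for_split(data, split_start, split_end):
    """Yield the full lines a reader for bytes [split_start, split_end) emits."""
    pos = split_start
    if split_start > 0:
        nl = data.find("\n", split_start)
        if nl == -1:
            return
        pos = nl + 1  # skip the partial first line; the previous reader owns it
    while pos < split_end:
        nl = data.find("\n", pos)
        if nl == -1:
            yield data[pos:]
            return
        yield data[pos:nl]  # may extend past split_end: that is the point
        pos = nl + 1

text = "aaaa\nbbbb\ncccc-crosses\ndddd\n"
# Pretend the block/split boundary falls at byte 14, inside "cccc-crosses".
first = list(lines_for_split(text, 0, 14))
second = list(lines_for_split(text, 14, len(text)))
```

The first reader emits "cccc-crosses" in full even though it extends past its split, and the second reader skips that partial line and starts at "dddd", so every line is read exactly once.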

Best,
Mahesh Balija,
Calsoft Labs.

On Mon, Dec 3, 2012 at 3:33 AM, Jeff LI <un...@gmail.com> wrote:

> Hello,
>
> I was reading on the relationship between input splits and HDFS blocks and
> a question came up to me:
>
> If a logical record crosses HDFS block boundary, let's say block#1 and
> block#2, does the mapper assigned with this input split asks for (1) both
> blocks, or (2) block#1 and just the part of block#2 that this logical
> record extends to, or (3) block#1 and part of block#2 up to some sync point
> that covers this particular logical record?  Note the input is sequence
> file.
>
> I guess my question really is: does Hadoop operate on a block basis or
> does it respect some sort of logical structure within a block when it's
> trying to feed the mappers with input data.
>
> Cheers
>
> Jeff
>
>

Re: Input splits for sequence file input

Posted by Jay Vyas <ja...@gmail.com>.
This question is fundamentally flawed: it assumes that a mapper will ask
for anything.

The mapper class's "run" method reads from a record reader. The question you
really should ask is:

How does a RecordReader read records across block boundaries?
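That "run" loop looks roughly like this (a simplified Python rendering of the Java Mapper.run pattern; the class and helper names here are illustrative, not Hadoop's real API): the mapper never touches blocks, it just pulls key-value pairs from whatever RecordReader the InputFormat handed it.

```python
# Simplified sketch of the Mapper "run" pattern (illustration only):
# the mapper loops over a RecordReader and calls map() once per record.

class ListRecordReader:
    """A stand-in RecordReader backed by an in-memory list of pairs."""
    def __init__(self, pairs):
        self._pairs = iter(pairs)
        self.key = self.value = None

    def next_key_value(self):
        """Advance to the next record; return False when input is exhausted."""
        try:
            self.key, self.value = next(self._pairs)
            return True
        except StopIteration:
            return False

def run_mapper(reader, map_fn):
    """The Mapper.run skeleton: pull records until the reader runs out."""
    out = []
    while reader.next_key_value():
        out.extend(map_fn(reader.key, reader.value))
    return out

# Word-count style usage: keys are byte offsets, values are lines.
reader = ListRecordReader([(0, "hello world"), (12, "hello hadoop")])
counts = run_mapper(reader, lambda k, v: [(w, 1) for w in v.split()])
```

Because all block-boundary logic lives inside the reader, swapping in a different RecordReader (sync-marker based, line based, etc.) changes nothing in the mapper loop.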

Jay Vyas 
http://jayunit100.blogspot.com

On Dec 2, 2012, at 9:08 PM, Jeff Zhang <zj...@gmail.com> wrote:

> method createRecordReader will handle the record boundary issue. You can check the code for details
> 
> On Mon, Dec 3, 2012 at 6:03 AM, Jeff LI <un...@gmail.com> wrote:
>> Hello,
>> 
>> I was reading on the relationship between input splits and HDFS blocks and a question came up to me:
>> 
>> If a logical record crosses HDFS block boundary, let's say block#1 and block#2, does the mapper assigned with this input split asks for (1) both blocks, or (2) block#1 and just the part of block#2 that this logical record extends to, or (3) block#1 and part of block#2 up to some sync point that covers this particular logical record?  Note the input is sequence file.
>> 
>> I guess my question really is: does Hadoop operate on a block basis or does it respect some sort of logical structure within a block when it's trying to feed the mappers with input data.
>> 
>> Cheers
>> 
>> Jeff
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang

Re: Input splits for sequence file input

Posted by Jeff Zhang <zj...@gmail.com>.
The createRecordReader method handles the record-boundary issue; you can
check the code for details.

On Mon, Dec 3, 2012 at 6:03 AM, Jeff LI <un...@gmail.com> wrote:

> Hello,
>
> I was reading on the relationship between input splits and HDFS blocks and
> a question came up to me:
>
> If a logical record crosses HDFS block boundary, let's say block#1 and
> block#2, does the mapper assigned with this input split asks for (1) both
> blocks, or (2) block#1 and just the part of block#2 that this logical
> record extends to, or (3) block#1 and part of block#2 up to some sync point
> that covers this particular logical record?  Note the input is sequence
> file.
>
> I guess my question really is: does Hadoop operate on a block basis or
> does it respect some sort of logical structure within a block when it's
> trying to feed the mappers with input data.
>
> Cheers
>
> Jeff
>
>


-- 
Best Regards

Jeff Zhang
