Posted to common-user@hadoop.apache.org by Mehal Patel <me...@gmail.com> on 2013/02/09 01:40:58 UTC

How MapReduce selects data blocks for processing user request

Hello All,

I am confused about how MapReduce tasks select data blocks for processing
user requests.

Since replication places copies of a single data block on multiple
datanodes, how are data blocks uniquely selected during job processing?
How does the framework guarantee that the same block is not chosen twice
or thrice for different mapper tasks?


Thank you

-Mehal

Re: How MapReduce selects data blocks for processing user request

Posted by Harsh J <ha...@cloudera.com>.
Hi Mehal,

> I am confused about how MapReduce tasks select data blocks for processing user requests.

I suggest reading chapter 6 of Tom White's Hadoop: The Definitive
Guide, titled "How MapReduce Works". It explains almost everything you
need to know in very clear language, and this and other such good
books should help you more generally.

> Since replication places copies of a single data block on multiple datanodes, how are data blocks uniquely selected during job processing?

The first point to clear up is that MapReduce is not hard-tied to
HDFS. It generates splits on any filesystem, and the splits are unique
for your given input path. Each split corresponds to exactly one task,
so the task's input is fixed at submit time itself. Each split is
defined by its path, its start offset into the file, and the length to
be processed from that offset - "uniquely" defining itself.

> How does the framework guarantee that the same block is not chosen twice or thrice for different mapper tasks?

See above - each "block" (a "split" in MR terms) is defined by its
start offset and length. No two splits generated for a single file
are ever the same, because we generate them that way, so the case
you're worried about cannot arise.
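
As a purely hypothetical illustration, feeding a 250 MB file and a
128 MB split size into the sketch above yields exactly two splits,
and every byte of the file falls into exactly one of them:

    class SplitDemo {
        public static void main(String[] args) {
            long mb = 1024L * 1024L;
            // Hypothetical file name and sizes, for illustration only.
            for (FileSplitSketch s : SplitGenerator.getSplits(
                    "/data/foo.txt", 250 * mb, 128 * mb)) {
                // Prints start=0 length=134217728, then
                // start=134217728 length=127926272 -
                // consecutive and never overlapping.
                System.out.println(s.path + " start=" + s.start
                        + " length=" + s.length);
            }
        }
    }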

On Sat, Feb 9, 2013 at 6:10 AM, Mehal Patel <me...@gmail.com> wrote:
> Hello All,
>
> I am confused about how MapReduce tasks select data blocks for processing
> user requests.
>
> Since replication places copies of a single data block on multiple
> datanodes, how are data blocks uniquely selected during job processing?
> How does the framework guarantee that the same block is not chosen twice
> or thrice for different mapper tasks?
>
>
> Thank you
>
> -Mehal



--
Harsh J

Re: How MapReduce selects data blocks for processing user request

Posted by Rishi Yadav <ri...@infoobjects.com>.
Hi Mehal,

When a client makes a read request for a certain file, say foo.txt, the
namenode sends back information about the first block (its block ID) and
the datanodes it resides on.

It's the client that decides which datanode to pull the data from. If the
first request fails, it can retry against another datanode holding a
replica of the block. This process repeats until all the data is read.
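
In rough sketch form (plain Java; this is only the idea, not the real
DFSClient code - readBlock and fetchFromDatanode are made-up names),
the per-block retry loop looks like this:

    import java.io.IOException;
    import java.util.List;

    class BlockReadSketch {
        // Try the replica locations the namenode returned, in order,
        // until one datanode serves the block successfully.
        static byte[] readBlock(long blockId, List<String> datanodes)
                throws IOException {
            IOException lastFailure = null;
            for (String datanode : datanodes) {
                try {
                    return fetchFromDatanode(blockId, datanode);
                } catch (IOException e) {
                    lastFailure = e;  // replica failed; try the next one
                }
            }
            throw new IOException("All replicas failed for block "
                    + blockId, lastFailure);
        }

        // Placeholder for the actual block transfer over the wire.
        static byte[] fetchFromDatanode(long blockId, String datanode)
                throws IOException {
            throw new IOException("not implemented in this sketch");
        }
    }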

Thanks and Regards,

Rishi Yadav

(o) 408.988.2000x113 || (f) 408.716.2726

InfoObjects Inc || http://www.infoobjects.com (Big Data Solutions)

INC 500 Fastest growing company in 2012 || 2011

Best Place to work in Bay Area 2012 - SF Business Times and the Silicon
Valley / San Jose Business Journal

2041 Mission College Boulevard, #280 || Santa Clara, CA 95054

On Fri, Feb 8, 2013 at 4:40 PM, Mehal Patel <me...@gmail.com> wrote:

> Hello All,
>
> I am confused about how MapReduce tasks select data blocks for processing
> user requests.
>
> Since replication places copies of a single data block on multiple
> datanodes, how are data blocks uniquely selected during job processing?
> How does the framework guarantee that the same block is not chosen twice
> or thrice for different mapper tasks?
>
>
> Thank you
>
> -Mehal
>
