You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hadoop.apache.org by jin keyao <ke...@gmail.com> on 2013/03/19 08:30:05 UTC

question about scheduler rack awareness

hi,

I have a question about the scheduler rack awareness:

if each map task handles several blocks once, how the map task will be
created ?

for example, if I have a file with 100 blocks, tends to use 10 map task to
do wordCount.

my question is : will the map task created on the node which access his 10
blocks most fastest ?

thanks a lot

Re: question about scheduler rack awareness

Posted by Harsh J <ha...@cloudera.com>.

You can use CombineFileInputFormat to achieve this with good locality
hints. It tries to pack together block splits that belong to one node
as much as possible.

On Tue, Mar 19, 2013 at 7:43 PM, jin keyao <ke...@gmail.com> wrote:
> hi Jens,
>
> thanks for your quick reply.
>
> actually I am afraid I didn't describe my question very well.  Usually for
> example, we make 1 Block as an input, such as we are benchmarking Terasort,
> which is good for scheduling the map on the node contain the data.
>
> But in a typical scenario I am working on, I want to make the sequential
> blocks as an input, if I take the original example, what I want to do is to
> group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
> be handled by map tasks
>
> each 10 blocks may distribute on several nodes, based on the block placement
> policy in Hadoop.  so in this case, will the map task be scheduled on the
> node which have a nearest way to access those 10 blocks?
>
> or will hadoop only make sure the first block it's going to access is a
> local one?
>
>
> 2013/3/19 Jens Scheidtmann <je...@gmail.com>
>>
>> Dear Jin,
>>
>> you wrote:
>> > my question is : will the map task created on the node which access his
>> > 10 blocks most fastest ?
>>
>> hadoop tries hard to run the map tasks on the node, where the data is
>> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
>> what happens for creation of map jvms. Sorry, I was not able to relocate
>> them on the web, yet (well, safaribooksonline.com ;-).
>>
>> Depending on the specific data layout (e.g. record lengths), the map tasks
>> may need to read other blocks anyway, which may be off-node.
>>
>> On how many nodes is your 100 blocks file stored? on 10?
>>
>> If it is on one node, then you're likely running into map slot limits or
>> container limits.
>>
>> Best regards,
>>
>> Jens
>
>



-- 
Harsh J

Re: question about scheduler rack awareness

Posted by Harsh J <ha...@cloudera.com>.

You can use CombineFileInputFormat to achieve this with good locality
hints. It tries to pack together block splits that belong to one node
as much as possible.

On Tue, Mar 19, 2013 at 7:43 PM, jin keyao <ke...@gmail.com> wrote:
> hi Jens,
>
> thanks for your quick reply.
>
> actually I am afraid I didn't describe my question very well.  Usually for
> example, we make 1 Block as an input, such as we are benchmarking Terasort,
> which is good for scheduling the map on the node contain the data.
>
> But in a typical scenario I am working on, I want to make the sequential
> blocks as an input, if I take the original example, what I want to do is to
> group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
> be handled by map tasks
>
> each 10 blocks may distribute on several nodes, based on the block placement
> policy in Hadoop.  so in this case, will the map task be scheduled on the
> node which have a nearest way to access those 10 blocks?
>
> or will hadoop only make sure the first block it's going to access is a
> local one?
>
>
> 2013/3/19 Jens Scheidtmann <je...@gmail.com>
>>
>> Dear Jin,
>>
>> you wrote:
>> > my question is : will the map task created on the node which access his
>> > 10 blocks most fastest ?
>>
>> hadoop tries hard to run the map tasks on the node, where the data is
>> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
>> what happens for creation of map jvms. Sorry, I was not able to relocate
>> them on the web, yet (well, safaribooksonline.com ;-).
>>
>> Depending on the specific data layout (e.g. record lengths), the map tasks
>> may need to read other blocks anyway, which may be off-node.
>>
>> On how many nodes is your 100 blocks file stored? on 10?
>>
>> If it is on one node, then you're likely running into map slot limits or
>> container limits.
>>
>> Best regards,
>>
>> Jens
>
>



-- 
Harsh J

Re: question about scheduler rack awareness

Posted by Harsh J <ha...@cloudera.com>.

You can use CombineFileInputFormat to achieve this with good locality
hints. It tries to pack together block splits that belong to one node
as much as possible.

On Tue, Mar 19, 2013 at 7:43 PM, jin keyao <ke...@gmail.com> wrote:
> hi Jens,
>
> thanks for your quick reply.
>
> actually I am afraid I didn't describe my question very well.  Usually for
> example, we make 1 Block as an input, such as we are benchmarking Terasort,
> which is good for scheduling the map on the node contain the data.
>
> But in a typical scenario I am working on, I want to make the sequential
> blocks as an input, if I take the original example, what I want to do is to
> group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
> be handled by map tasks
>
> each 10 blocks may distribute on several nodes, based on the block placement
> policy in Hadoop.  so in this case, will the map task be scheduled on the
> node which have a nearest way to access those 10 blocks?
>
> or will hadoop only make sure the first block it's going to access is a
> local one?
>
>
> 2013/3/19 Jens Scheidtmann <je...@gmail.com>
>>
>> Dear Jin,
>>
>> you wrote:
>> > my question is : will the map task created on the node which access his
>> > 10 blocks most fastest ?
>>
>> hadoop tries hard to run the map tasks on the node, where the data is
>> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
>> what happens for creation of map jvms. Sorry, I was not able to relocate
>> them on the web, yet (well, safaribooksonline.com ;-).
>>
>> Depending on the specific data layout (e.g. record lengths), the map tasks
>> may need to read other blocks anyway, which may be off-node.
>>
>> On how many nodes is your 100 blocks file stored? on 10?
>>
>> If it is on one node, then you're likely running into map slot limits or
>> container limits.
>>
>> Best regards,
>>
>> Jens
>
>



-- 
Harsh J

Re: question about scheduler rack awareness

Posted by Harsh J <ha...@cloudera.com>.

You can use CombineFileInputFormat to achieve this with good locality
hints. It tries to pack together block splits that belong to one node
as much as possible.

On Tue, Mar 19, 2013 at 7:43 PM, jin keyao <ke...@gmail.com> wrote:
> hi Jens,
>
> thanks for your quick reply.
>
> actually I am afraid I didn't describe my question very well.  Usually for
> example, we make 1 Block as an input, such as we are benchmarking Terasort,
> which is good for scheduling the map on the node contain the data.
>
> But in a typical scenario I am working on, I want to make the sequential
> blocks as an input, if I take the original example, what I want to do is to
> group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
> be handled by map tasks
>
> each 10 blocks may distribute on several nodes, based on the block placement
> policy in Hadoop.  so in this case, will the map task be scheduled on the
> node which have a nearest way to access those 10 blocks?
>
> or will hadoop only make sure the first block it's going to access is a
> local one?
>
>
> 2013/3/19 Jens Scheidtmann <je...@gmail.com>
>>
>> Dear Jin,
>>
>> you wrote:
>> > my question is : will the map task created on the node which access his
>> > 10 blocks most fastest ?
>>
>> hadoop tries hard to run the map tasks on the node, where the data is
>> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
>> what happens for creation of map jvms. Sorry, I was not able to relocate
>> them on the web, yet (well, safaribooksonline.com ;-).
>>
>> Depending on the specific data layout (e.g. record lengths), the map tasks
>> may need to read other blocks anyway, which may be off-node.
>>
>> On how many nodes is your 100 blocks file stored? on 10?
>>
>> If it is on one node, then you're likely running into map slot limits or
>> container limits.
>>
>> Best regards,
>>
>> Jens
>
>



-- 
Harsh J

Re: question about scheduler rack awareness

Posted by jin keyao <ke...@gmail.com>.

hi Jens,

thanks for your quick reply.

actually I am afraid I didn't describe my question very well.  Usually for
example, we make 1 Block as an input, such as we are benchmarking Terasort,
which is good for scheduling the map on the node contain the data.

But in a typical scenario I am working on, I want to make the sequential
blocks as an input, if I take the original example, what I want to do is to
group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
be handled by map tasks

each 10 blocks may distribute on several nodes, based on the block
placement policy in Hadoop.  so in this case, will the map task be
scheduled on the node which have a nearest way to access those 10 blocks?

or will hadoop only make sure the first block it's going to access is a
local one?

2013/3/19 Jens Scheidtmann <je...@gmail.com>

> Dear Jin,
>
> you wrote:
> > my question is : will the map task created on the node which access his
> 10 blocks most fastest ?
>
> hadoop tries hard to run the map tasks on the node, where the data is
> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
> what happens for creation of map jvms. Sorry, I was not able to relocate
> them on the web, yet (well, safaribooksonline.com ;-).
>
> Depending on the specific data layout (e.g. record lengths), the map tasks
> may need to read other blocks anyway, which may be off-node.
>
> On how many nodes is your 100 blocks file stored? on 10?
>
> If it is on one node, then you're likely running into map slot limits or
> container limits.
>
> Best regards,
>
> Jens
>

Re: question about scheduler rack awareness

Posted by jin keyao <ke...@gmail.com>.

hi Jens,

thanks for your quick reply.

actually I am afraid I didn't describe my question very well.  Usually for
example, we make 1 Block as an input, such as we are benchmarking Terasort,
which is good for scheduling the map on the node contain the data.

But in a typical scenario I am working on, I want to make the sequential
blocks as an input, if I take the original example, what I want to do is to
group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
be handled by map tasks

each 10 blocks may distribute on several nodes, based on the block
placement policy in Hadoop.  so in this case, will the map task be
scheduled on the node which have a nearest way to access those 10 blocks?

or will hadoop only make sure the first block it's going to access is a
local one?

2013/3/19 Jens Scheidtmann <je...@gmail.com>

> Dear Jin,
>
> you wrote:
> > my question is : will the map task created on the node which access his
> 10 blocks most fastest ?
>
> hadoop tries hard to run the map tasks on the node, where the data is
> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
> what happens for creation of map jvms. Sorry, I was not able to relocate
> them on the web, yet (well, safaribooksonline.com ;-).
>
> Depending on the specific data layout (e.g. record lengths), the map tasks
> may need to read other blocks anyway, which may be off-node.
>
> On how many nodes is your 100 blocks file stored? on 10?
>
> If it is on one node, then you're likely running into map slot limits or
> container limits.
>
> Best regards,
>
> Jens
>

Re: question about scheduler rack awareness

Posted by jin keyao <ke...@gmail.com>.

hi Jens,

thanks for your quick reply.

actually I am afraid I didn't describe my question very well.  Usually for
example, we make 1 Block as an input, such as we are benchmarking Terasort,
which is good for scheduling the map on the node contain the data.

But in a typical scenario I am working on, I want to make the sequential
blocks as an input, if I take the original example, what I want to do is to
group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
be handled by map tasks

each 10 blocks may distribute on several nodes, based on the block
placement policy in Hadoop.  so in this case, will the map task be
scheduled on the node which have a nearest way to access those 10 blocks?

or will hadoop only make sure the first block it's going to access is a
local one?

2013/3/19 Jens Scheidtmann <je...@gmail.com>

> Dear Jin,
>
> you wrote:
> > my question is : will the map task created on the node which access his
> 10 blocks most fastest ?
>
> hadoop tries hard to run the map tasks on the node, where the data is
> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
> what happens for creation of map jvms. Sorry, I was not able to relocate
> them on the web, yet (well, safaribooksonline.com ;-).
>
> Depending on the specific data layout (e.g. record lengths), the map tasks
> may need to read other blocks anyway, which may be off-node.
>
> On how many nodes is your 100 blocks file stored? on 10?
>
> If it is on one node, then you're likely running into map slot limits or
> container limits.
>
> Best regards,
>
> Jens
>

Re: question about scheduler rack awareness

Posted by jin keyao <ke...@gmail.com>.

hi Jens,

thanks for your quick reply.

actually I am afraid I didn't describe my question very well.  Usually for
example, we make 1 Block as an input, such as we are benchmarking Terasort,
which is good for scheduling the map on the node contain the data.

But in a typical scenario I am working on, I want to make the sequential
blocks as an input, if I take the original example, what I want to do is to
group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
be handled by map tasks

each 10 blocks may distribute on several nodes, based on the block
placement policy in Hadoop.  so in this case, will the map task be
scheduled on the node which have a nearest way to access those 10 blocks?

or will hadoop only make sure the first block it's going to access is a
local one?

2013/3/19 Jens Scheidtmann <je...@gmail.com>

> Dear Jin,
>
> you wrote:
> > my question is : will the map task created on the node which access his
> 10 blocks most fastest ?
>
> hadoop tries hard to run the map tasks on the node, where the data is
> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
> what happens for creation of map jvms. Sorry, I was not able to relocate
> them on the web, yet (well, safaribooksonline.com ;-).
>
> Depending on the specific data layout (e.g. record lengths), the map tasks
> may need to read other blocks anyway, which may be off-node.
>
> On how many nodes is your 100 blocks file stored? on 10?
>
> If it is on one node, then you're likely running into map slot limits or
> container limits.
>
> Best regards,
>
> Jens
>

Re: question about scheduler rack awareness

Posted by Jens Scheidtmann <je...@gmail.com>.

Dear Jin,

you wrote:
> my question is : will the map task created on the node which access his
10 blocks most fastest ?

hadoop tries hard to run the map tasks on the node, where the data is
stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
what happens for creation of map jvms. Sorry, I was not able to relocate
them on the web, yet (well, safaribooksonline.com ;-).

Depending on the specific data layout (e.g. record lengths), the map tasks
may need to read other blocks anyway, which may be off-node.

On how many nodes is your 100 blocks file stored? on 10?

If it is on one node, then you're likely running into map slot limits or
container limits.

Best regards,

Jens

Re: question about scheduler rack awareness

Posted by Jens Scheidtmann <je...@gmail.com>.

Dear Jin,

you wrote:
> my question is : will the map task created on the node which access his
10 blocks most fastest ?

hadoop tries hard to run the map tasks on the node, where the data is
stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
what happens for creation of map jvms. Sorry, I was not able to relocate
them on the web, yet (well, safaribooksonline.com ;-).

Depending on the specific data layout (e.g. record lengths), the map tasks
may need to read other blocks anyway, which may be off-node.

On how many nodes is your 100 blocks file stored? on 10?

If it is on one node, then you're likely running into map slot limits or
container limits.

Best regards,

Jens

Re: question about scheduler rack awareness

Posted by Jens Scheidtmann <je...@gmail.com>.

Dear Jin,

you wrote:
> my question is : will the map task created on the node which access his
10 blocks most fastest ?

hadoop tries hard to run the map tasks on the node, where the data is
stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
what happens for creation of map jvms. Sorry, I was not able to relocate
them on the web, yet (well, safaribooksonline.com ;-).

Depending on the specific data layout (e.g. record lengths), the map tasks
may need to read other blocks anyway, which may be off-node.

On how many nodes is your 100 blocks file stored? on 10?

If it is on one node, then you're likely running into map slot limits or
container limits.

Best regards,

Jens

Re: question about scheduler rack awareness

Posted by Jens Scheidtmann <je...@gmail.com>.

Dear Jin,

you wrote:
> my question is : will the map task created on the node which access his
10 blocks most fastest ?

hadoop tries hard to run the map tasks on the node, where the data is
stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
what happens for creation of map jvms. Sorry, I was not able to relocate
them on the web, yet (well, safaribooksonline.com ;-).

Depending on the specific data layout (e.g. record lengths), the map tasks
may need to read other blocks anyway, which may be off-node.

On how many nodes is your 100 blocks file stored? on 10?

If it is on one node, then you're likely running into map slot limits or
container limits.

Best regards,

Jens