Posted to common-user@hadoop.apache.org by centerqi hu <ce...@gmail.com> on 2012/10/07 15:56:59 UTC

What is the difference between Rack-local map tasks and Data-local map tasks?

hi all

When I run "hadoop job -status xxx", the output includes the following:

Rack-local map tasks=124
Data-local map tasks=6

What is the difference between Rack-local map tasks and Data-local map
tasks?
-- 
centerqi@gmail.com|Sam

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by paritosh ranjan <pa...@gmail.com>.

One thing to look at would be the block size and the input split size. If
the input split size is greater than the block size, the task might read
blocks that are not on the same node. So keeping the input split size
less than or equal to the block size might help.
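
To make this concrete, here is a minimal sketch in Python (not Hadoop code).
It assumes a simplified model in which a task can only be fully data-local
on a node that holds a replica of every block in its split. The helper
names, the 64 MB block size, and the block placement are all made up for
illustration.

```python
# Simplified model (an assumption, not Hadoop's actual split logic):
# a split is a byte range of a file, and each block has a list of replica nodes.

def blocks_for_split(split_start, split_length, block_size):
    """Return the indices of the blocks touched by [start, start + length)."""
    first = split_start // block_size
    last = (split_start + split_length - 1) // block_size
    return list(range(first, last + 1))


def nodes_holding_whole_split(block_indices, block_locations):
    """Return the nodes that store a replica of every block in the split."""
    node_sets = [set(block_locations[i]) for i in block_indices]
    return set.intersection(*node_sets)


MB = 1024 * 1024
block_size = 64 * MB

# Hypothetical placement of three blocks, three replicas each.
block_locations = {
    0: ["node1", "node2", "node3"],
    1: ["node2", "node4", "node5"],
    2: ["node1", "node5", "node6"],
}

small_split = blocks_for_split(0, 64 * MB, block_size)    # split == one block
large_split = blocks_for_split(0, 128 * MB, block_size)   # split spans two blocks

print(small_split)  # [0]
print(large_split)  # [0, 1]
print(sorted(nodes_holding_whole_split(small_split, block_locations)))
# ['node1', 'node2', 'node3']
print(sorted(nodes_holding_whole_split(large_split, block_locations)))
# ['node2']
```

With a split the size of one block, any of the three replica nodes can read
all of it locally. With a split spanning two blocks, only one node in this
made-up layout holds both, and a split spanning all three blocks would have
no such node at all.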

HTH,
Paritosh

On Mon, Oct 8, 2012 at 1:01 AM, Bertrand Dechoux <de...@gmail.com> wrote:

> Basically, more replicas.
>
> The second solution would be to use a 'smarter' scheduler. In theory, the
> jobtracker should be able to say "postpone this task until a data-local
> task can be created". But I don't think any stable, publicly available
> scheduler does that at the moment. This would allow you to have less traffic,
> but the whole job might be slower due to the wait. It might be a good trade-off
> if you have multiple jobs running at the same time and if your hot data is
> uniformly distributed. But in practice this is of course not always the
> case, and you also need to consider SLAs for the users, so the whole thing is
> not trivial.
>
> Regards
>
> Bertrand
>
>
> On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu <ce...@gmail.com> wrote:
>
>> Very good explanation.
>> If there were a way to reduce the rack-local map tasks
>> and increase the data-local map tasks,
>> would that improve performance?
>>
>> 2012/10/7 Michael Segel <mi...@hotmail.com>
>>
>>> Rack local means that while the data isn't local to the node running the
>>> task, it is still on the same rack.
>>> (It's meaningless unless you've set up rack awareness, because otherwise
>>> all of the machines are on the default rack.)
>>>
>>> Data local means that the task is running on the machine that
>>> contains the actual data.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
>>>
>>>
>>> hi all
>>>
>>> When I run "hadoop job -status xxx", the output includes the following:
>>>
>>> Rack-local map tasks=124
>>> Data-local map tasks=6
>>>
>>> What is the difference between Rack-local map tasks and Data-local map
>>> tasks?
>>> --
>>> centerqi@gmail.com|Sam
>>>
>>>
>>>
>>
>>
>> --
>> centerqi@gmail.com|齐忠
>>
>
>
>
> --
> Bertrand Dechoux
>
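
The data-local / rack-local distinction quoted above can be sketched roughly
as follows. This is an illustrative Python model, not Hadoop's
implementation: the function name, the node and rack names, and the
"off-rack" label for the remaining case are assumptions. The "/default-rack"
fallback models the point that, without rack awareness configured, every
node ends up on the same default rack.

```python
DEFAULT_RACK = "/default-rack"  # rack assumed for nodes with no topology mapping


def classify_locality(task_node, replica_nodes, rack_of):
    """Classify a map task by where it runs relative to its input data.

    task_node     -- node the task runs on
    replica_nodes -- nodes holding a replica of the task's input block
    rack_of       -- dict mapping node -> rack; unmapped nodes fall back to
                     DEFAULT_RACK (i.e. no rack awareness for that node)
    """
    if task_node in replica_nodes:
        return "data-local"
    task_rack = rack_of.get(task_node, DEFAULT_RACK)
    if any(rack_of.get(node, DEFAULT_RACK) == task_rack for node in replica_nodes):
        return "rack-local"
    return "off-rack"


# Hypothetical topology with two racks.
rack_of = {"node1": "/rack1", "node2": "/rack1", "node3": "/rack2"}

print(classify_locality("node1", ["node1"], rack_of))  # data-local
print(classify_locality("node2", ["node1"], rack_of))  # rack-local
print(classify_locality("node3", ["node1"], rack_of))  # off-rack

# With no topology configured, every node is on the default rack, so any
# task that is not data-local is reported as rack-local.
print(classify_locality("node3", ["node1"], {}))       # rack-local
```

The last call shows why the rack-local count says little on a cluster without
rack awareness: every non-data-local task falls into that bucket.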

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by Bertrand Dechoux <de...@gmail.com>.

@Harsh: I didn't know that. That's good to hear. I will check out
mapred.fairscheduler.locality.delay in FairScheduler,
and I will also look at YARN-80 for my own information.

Thanks!

Bertrand

On Mon, Oct 8, 2012 at 2:13 AM, Michael Segel <mi...@hotmail.com> wrote:

> Ok,
>
> So what would be the use case for this feature?
>
> I mean, when would locality take precedence over job completion time?
>
> On Oct 7, 2012, at 5:46 PM, Harsh J <ha...@cloudera.com> wrote:
>
> > Bertrand,
> >
> > FairScheduler does support delay scheduling for locality via
> > mapred.fairscheduler.locality.delay config prop. MR2's
> > CapacityScheduler recently got similar support for better locality
> > scheduling as well (see YARN-80). Is this not what you're talking of?
> >
> > On Mon, Oct 8, 2012 at 1:01 AM, Bertrand Dechoux <de...@gmail.com> wrote:
> >> Basically, more replicas.
> >>
> >> The second solution would be to use a 'smarter' scheduler. In theory, the
> >> jobtracker should be able to say "postpone this task until a data-local task
> >> can be created". But I don't think any stable, publicly available scheduler
> >> does that at the moment. This would allow you to have less traffic, but the
> >> whole job might be slower due to the wait. It might be a good trade-off if you
> >> have multiple jobs running at the same time and if your hot data is
> >> uniformly distributed. But in practice this is of course not always the case,
> >> and you also need to consider SLAs for the users, so the whole thing is not
> >> trivial.
> >>
> >> Regards
> >>
> >> Bertrand
> >>
> >>
> >> On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu <ce...@gmail.com> wrote:
> >>>
> >>> Very good explanation.
> >>> If there were a way to reduce the rack-local map tasks
> >>> and increase the data-local map tasks,
> >>> would that improve performance?
> >>>
> >>> 2012/10/7 Michael Segel <mi...@hotmail.com>
> >>>>
> >>>> Rack local means that while the data isn't local to the node running the
> >>>> task, it is still on the same rack.
> >>>> (It's meaningless unless you've set up rack awareness, because otherwise
> >>>> all of the machines are on the default rack.)
> >>>>
> >>>> Data local means that the task is running on the machine that
> >>>> contains the actual data.
> >>>>
> >>>> HTH
> >>>>
> >>>> -Mike
> >>>>
> >>>> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
> >>>>
> >>>>
> >>>> hi all
> >>>>
> >>>> When I run "hadoop job -status xxx", the output includes the following:
> >>>>
> >>>> Rack-local map tasks=124
> >>>> Data-local map tasks=6
> >>>>
> >>>> What is the difference between Rack-local map tasks and Data-local map
> >>>> tasks?
> >>>>
> >>>> --
> >>>> centerqi@gmail.com|Sam
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> centerqi@gmail.com|齐忠
> >>
> >>
> >>
> >>
> >> --
> >> Bertrand Dechoux
> >
> >
> >
> > --
> > Harsh J
> >
>
>


-- 
Bertrand Dechoux

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by Michael Segel <mi...@hotmail.com>.

Ok, 

So what would be the use case for this feature?

I mean, when would locality take precedence over job completion time?

On Oct 7, 2012, at 5:46 PM, Harsh J <ha...@cloudera.com> wrote:

> Bertrand,
> 
> FairScheduler does support delay scheduling for locality via
> mapred.fairscheduler.locality.delay config prop. MR2's
> CapacityScheduler recently got similar support for better locality
> scheduling as well (see YARN-80). Is this not what you're talking of?
> 
> On Mon, Oct 8, 2012 at 1:01 AM, Bertrand Dechoux <de...@gmail.com> wrote:
>> Basically, more replicas.
>> 
>> The second solution would be to use a 'smarter' scheduler. In theory, the
>> jobtracker should be able to say "postpone this task until a data-local task
>> can be created". But I don't think any stable, publicly available scheduler
>> does that at the moment. This would allow you to have less traffic, but the
>> whole job might be slower due to the wait. It might be a good trade-off if you
>> have multiple jobs running at the same time and if your hot data is
>> uniformly distributed. But in practice this is of course not always the case,
>> and you also need to consider SLAs for the users, so the whole thing is not
>> trivial.
>> 
>> Regards
>> 
>> Bertrand
>> 
>> 
>> On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu <ce...@gmail.com> wrote:
>>> 
>>> Very good explanation.
>>> If there were a way to reduce the rack-local map tasks
>>> and increase the data-local map tasks,
>>> would that improve performance?
>>> 
>>> 2012/10/7 Michael Segel <mi...@hotmail.com>
>>>> 
>>>> Rack local means that while the data isn't local to the node running the
>>>> task, it is still on the same rack.
>>>> (It's meaningless unless you've set up rack awareness, because otherwise
>>>> all of the machines are on the default rack.)
>>>>
>>>> Data local means that the task is running on the machine that
>>>> contains the actual data.
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
>>>> 
>>>> 
>>>> hi all
>>>> 
>>>> When I run "hadoop job -status xxx", the output includes the following:
>>>> 
>>>> Rack-local map tasks=124
>>>> Data-local map tasks=6
>>>> 
>>>> What is the difference between Rack-local map tasks and Data-local map
>>>> tasks?
>>>> 
>>>> --
>>>> centerqi@gmail.com|Sam
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> centerqi@gmail.com|齐忠
>> 
>> 
>> 
>> 
>> --
>> Bertrand Dechoux
> 
> 
> 
> -- 
> Harsh J
>

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by Harsh J <ha...@cloudera.com>.

Bertrand,

FairScheduler does support delay scheduling for locality via
mapred.fairscheduler.locality.delay config prop. MR2's
CapacityScheduler recently got similar support for better locality
scheduling as well (see YARN-80). Is this not what you're talking of?
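
As a rough illustration of the idea behind delay scheduling, here is a small
Python sketch. It is not the FairScheduler code: the function and parameter
names are hypothetical, the delay is assumed to be expressed in milliseconds,
and the real scheduler also distinguishes a rack-local level that is left out
here for brevity.

```python
def pick_task(free_node, pending_tasks, waited_ms, locality_delay_ms):
    """Decide which pending task, if any, to launch on a node with a free slot.

    free_node         -- node that just reported a free map slot
    pending_tasks     -- list of (task_id, replica_nodes) in submission order
    waited_ms         -- how long this job has been waiting for a local slot
    locality_delay_ms -- how long the job is willing to wait (stand-in for
                         a setting such as mapred.fairscheduler.locality.delay)
    """
    # Prefer any task whose input data lives on the free node.
    for task_id, replica_nodes in pending_tasks:
        if free_node in replica_nodes:
            return task_id

    # No data-local task: give up on locality only after waiting long enough.
    if pending_tasks and waited_ms >= locality_delay_ms:
        return pending_tasks[0][0]

    # Otherwise skip this slot and hope a better node frees up soon.
    return None


pending = [("t1", {"node1"}), ("t2", {"node2"})]

print(pick_task("node2", pending, 0, 5000))     # t2
print(pick_task("node3", pending, 1000, 5000))  # None
print(pick_task("node3", pending, 5000, 5000))  # t1
```

The second call is the trade-off in a nutshell: the slot on node3 is left
unused for now, which saves network traffic if a local slot appears soon but
costs time if it does not.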

On Mon, Oct 8, 2012 at 1:01 AM, Bertrand Dechoux <de...@gmail.com> wrote:
> Basically, more replicas.
>
> The second solution would be to use a 'smarter' scheduler. In theory, the
> jobtracker should be able to say "postpone this task until a data-local task
> can be created". But I don't think any stable, publicly available scheduler
> does that at the moment. This would allow you to have less traffic, but the
> whole job might be slower due to the wait. It might be a good trade-off if you
> have multiple jobs running at the same time and if your hot data is
> uniformly distributed. But in practice this is of course not always the case,
> and you also need to consider SLAs for the users, so the whole thing is not
> trivial.
>
> Regards
>
> Bertrand
>
>
> On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu <ce...@gmail.com> wrote:
>>
>> Very good explanation.
>> If there were a way to reduce the rack-local map tasks
>> and increase the data-local map tasks,
>> would that improve performance?
>>
>> 2012/10/7 Michael Segel <mi...@hotmail.com>
>>>
>>> Rack local means that while the data isn't local to the node running the
>>> task, it is still on the same rack.
>>> (It's meaningless unless you've set up rack awareness, because otherwise
>>> all of the machines are on the default rack.)
>>>
>>> Data local means that the task is running on the machine that
>>> contains the actual data.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
>>>
>>>
>>> hi all
>>>
>>> When I run "hadoop job -status xxx", the output includes the following:
>>>
>>> Rack-local map tasks=124
>>> Data-local map tasks=6
>>>
>>> What is the difference between Rack-local map tasks and Data-local map
>>> tasks?
>>>
>>> --
>>> centerqi@gmail.com|Sam
>>>
>>>
>>
>>
>>
>> --
>> centerqi@gmail.com|齐忠
>
>
>
>
> --
> Bertrand Dechoux



-- 
Harsh J

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by paritosh ranjan <pa...@gmail.com>.

One thing to check is the block size versus the input split size. If the
input split size is greater than the block size, a task may have to read
blocks that are not on the same node. So keeping the input split size
less than or equal to the block size might help.
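
To make the split-size point concrete, here is a minimal sketch. It assumes the rule used by FileInputFormat in the newer MapReduce API, which as far as I know is max(minSize, min(maxSize, blockSize)). The function names are illustrative, not Hadoop identifiers, and the block count is a simplification that assumes the split starts on a block boundary.

```python
import math

MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size rule (assumed): max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def blocks_touched(split_size, block_size):
    """Number of blocks one split spans when it starts on a block boundary."""
    return math.ceil(split_size / block_size)


block = 64 * MB

# Default limits: the split equals the block, so one node can hold all of it.
default_split = compute_split_size(block)

# A large minimum split size forces each split to span several blocks,
# which are unlikely to all have a replica on the same node.
big_split = compute_split_size(block, min_size=256 * MB)

print(default_split // MB, blocks_touched(default_split, block))
print(big_split // MB, blocks_touched(big_split, block))
```

With a split spanning four blocks, at most some of those blocks can be read locally by whichever node runs the task, so the rest has to come over the network.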

HTH,
Paritosh

On Mon, Oct 8, 2012 at 1:01 AM, Bertrand Dechoux <de...@gmail.com> wrote:

> Basically, more replicas.
>
> The second solution would be to use a 'smarter' scheduler. In theory, the
> jobtracker should be able to say "postpone this task until a data-local
> task can be created". But I don't think any stable and public available
> scheduler do that at the moment. This would allow you to have less traffic
> but the whole job might be slower due to the wait. It might be a good trade
> if you have multiple jobs running at the same time and if your hot data is
> uniformly distributed. But in practice this is of course not always the
> case and you also need to consider sla for the users so the whole is not
> trivial.
>
> Regards
>
> Bertrand
>
>
> On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu <ce...@gmail.com> wrote:
>
>> Very good explanation,
>> If there is a way to reduce Rack-local map tasks
>> but can increase the Data-local map tasks ,
>> Whether to increase performance？
>>
>> 2012/10/7 Michael Segel <mi...@hotmail.com>
>>
>>> Rack local means that while the data isn't local to the node running the
>>> task, it is still on the same rack.
>>> (Its meaningless unless you've set up rack awareness because all of the
>>> machines are on the default rack. )
>>>
>>> Data local means that the task is running local to the machine that
>>> contains the actual data.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
>>>
>>>
>>> hi all
>>>
>>> When I run "hadoop job -status xxx",Output the following some list.
>>>
>>> Rack-local map tasks=124
>>> Data-local map tasks=6
>>>
>>> What is the difference between Rack-local map tasks and Data-local map
>>> tasks?
>>> --
>>> centerqi@gmail.com|Sam
>>>
>>>
>>>
>>
>>
>> --
>> centerqi@gmail.com|齐忠
>>
>
>
>
> --
> Bertrand Dechoux
>

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by Harsh J <ha...@cloudera.com>.

Bertrand,

FairScheduler does support delay scheduling for locality via
mapred.fairscheduler.locality.delay config prop. MR2's
CapacityScheduler recently got similar support for better locality
scheduling as well (see YARN-80). Is this not what you're talking of?
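
For anyone unfamiliar with the idea, below is a toy sketch of delay scheduling. It is not FairScheduler code. As I understand it, the real mapred.fairscheduler.locality.delay setting is a time-based wait, whereas this simplified version counts skipped slot offers. All names (assign_with_delay, max_skips, the node names) are hypothetical.

```python
def assign_with_delay(offers, local_nodes, max_skips):
    """Pick the slot offer a task accepts under a simple delay-scheduling rule.

    offers      -- node names offering a free slot, in arrival order
    local_nodes -- nodes that hold a replica of the task's input
    max_skips   -- how many non-local offers the task may decline first
    Returns (node, locality), or None if no offer was accepted.
    """
    skipped = 0
    for node in offers:
        if node in local_nodes:
            return node, "data-local"
        if skipped >= max_skips:
            # Waited long enough: give up on locality and take this slot.
            return node, "non-local"
        skipped += 1
    return None


offers = ["n4", "n5", "n1"]
local_nodes = {"n1", "n2"}

print(assign_with_delay(offers, local_nodes, max_skips=0))
print(assign_with_delay(offers, local_nodes, max_skips=2))
```

With no delay the task takes the first free slot and runs non-local; allowing two skipped offers lets it wait for n1 and run data-local, at the cost of starting later.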

On Mon, Oct 8, 2012 at 1:01 AM, Bertrand Dechoux <de...@gmail.com> wrote:
> Basically, more replicas.
>
> The second solution would be to use a 'smarter' scheduler. In theory, the
> jobtracker should be able to say "postpone this task until a data-local task
> can be created". But I don't think any stable and public available scheduler
> do that at the moment. This would allow you to have less traffic but the
> whole job might be slower due to the wait. It might be a good trade if you
> have multiple jobs running at the same time and if your hot data is
> uniformly distributed. But in practice this is of course not always the case
> and you also need to consider sla for the users so the whole is not trivial.
>
> Regards
>
> Bertrand
>
>
> On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu <ce...@gmail.com> wrote:
>>
>> Very good explanation,
>> If there is a way to reduce Rack-local map tasks
>> but can increase the Data-local map tasks ,
>> Whether to increase performance？
>>
>> 2012/10/7 Michael Segel <mi...@hotmail.com>
>>>
>>> Rack local means that while the data isn't local to the node running the
>>> task, it is still on the same rack.
>>> (Its meaningless unless you've set up rack awareness because all of the
>>> machines are on the default rack. )
>>>
>>> Data local means that the task is running local to the machine that
>>> contains the actual data.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
>>>
>>>
>>> hi all
>>>
>>> When I run "hadoop job -status xxx",Output the following some list.
>>>
>>> Rack-local map tasks=124
>>> Data-local map tasks=6
>>>
>>> What is the difference between Rack-local map tasks and Data-local map
>>> tasks?
>>>
>>> --
>>> centerqi@gmail.com|Sam
>>>
>>>
>>
>>
>>
>> --
>> centerqi@gmail.com|齐忠
>
>
>
>
> --
> Bertrand Dechoux



-- 
Harsh J

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by Bertrand Dechoux <de...@gmail.com>.

Basically, more replicas.

The second solution would be to use a 'smarter' scheduler. In theory, the
jobtracker should be able to say "postpone this task until a data-local
task can be created". But I don't think any stable, publicly available
scheduler does that at the moment. This would give you less network traffic,
but the whole job might be slower because of the wait. It might be a good
trade-off if you have multiple jobs running at the same time and your hot
data is uniformly distributed. In practice, of course, that is not always
the case, and you also need to consider SLAs for the users, so the whole
thing is not trivial.
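
On the first point (more replicas), a rough back-of-the-envelope sketch may help. It assumes replicas sit on distinct nodes and that the nodes with a free slot are a uniformly random subset of the cluster, which a real cluster will not strictly satisfy. The function name and the numbers are made up for illustration.

```python
from math import comb


def p_some_free_node_is_local(nodes, replicas, free_nodes):
    """Chance that at least one of `free_nodes` randomly chosen distinct
    nodes holds one of the `replicas` copies of a block."""
    none_local = comb(nodes - replicas, free_nodes) / comb(nodes, free_nodes)
    return 1 - none_local


# 20-node cluster, one node currently has a free map slot.
print(p_some_free_node_is_local(nodes=20, replicas=3, free_nodes=1))
print(p_some_free_node_is_local(nodes=20, replicas=6, free_nodes=1))
```

Under these assumptions, doubling the replication factor doubles the chance that the single free slot is data-local, which is the intuition behind "more replicas". The price is the extra disk space and write traffic for the additional copies.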

Regards

Bertrand

On Sun, Oct 7, 2012 at 5:28 PM, centerqi hu <ce...@gmail.com> wrote:

> Very good explanation,
> If there is a way to reduce Rack-local map tasks
> but can increase the Data-local map tasks ,
> Whether to increase performance？
>
> 2012/10/7 Michael Segel <mi...@hotmail.com>
>
>> Rack local means that while the data isn't local to the node running the
>> task, it is still on the same rack.
>> (Its meaningless unless you've set up rack awareness because all of the
>> machines are on the default rack. )
>>
>> Data local means that the task is running local to the machine that
>> contains the actual data.
>>
>> HTH
>>
>> -Mike
>>
>> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
>>
>>
>> hi all
>>
>> When I run "hadoop job -status xxx",Output the following some list.
>>
>> Rack-local map tasks=124
>> Data-local map tasks=6
>>
>> What is the difference between Rack-local map tasks and Data-local map
>> tasks?
>> --
>> centerqi@gmail.com|Sam
>>
>>
>>
>
>
> --
> centerqi@gmail.com|齐忠
>

-- 
Bertrand Dechoux

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by Bejoy KS <be...@gmail.com>.

Definitely. If more of the map tasks are data-local, performance will improve considerably.

Ideally, if data is uniformly distributed across DataNodes (DNs) and you have enough map task slots on the co-located TaskTrackers (TTs), most of your map tasks should be data-local. You may still have a few non-data-local map tasks when the number of input splits/map tasks is large, which is quite common.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: centerqi hu <ce...@gmail.com>
Date: Sun, 7 Oct 2012 23:28:55 
To: <us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Re: What is the difference between Rack-local map tasks and
 Data-local map tasks?

Very good explanation,
If there is a way to reduce Rack-local map tasks
but can increase the Data-local map tasks ,
Whether to increase performance？

2012/10/7 Michael Segel <mi...@hotmail.com>

> Rack local means that while the data isn't local to the node running the
> task, it is still on the same rack.
> (Its meaningless unless you've set up rack awareness because all of the
> machines are on the default rack. )
>
> Data local means that the task is running local to the machine that
> contains the actual data.
>
> HTH
>
> -Mike
>
> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
>
>
> hi all
>
> When I run "hadoop job -status xxx",Output the following some list.
>
> Rack-local map tasks=124
> Data-local map tasks=6
>
> What is the difference between Rack-local map tasks and Data-local map
> tasks?
> --
> centerqi@gmail.com|Sam
>
>
>


-- 
centerqi@gmail.com|齐忠

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by centerqi hu <ce...@gmail.com>.

Very good explanation.
If there were a way to reduce the Rack-local map tasks
and increase the Data-local map tasks,
would that improve performance?

2012/10/7 Michael Segel <mi...@hotmail.com>

> Rack local means that while the data isn't local to the node running the
> task, it is still on the same rack.
> (Its meaningless unless you've set up rack awareness because all of the
> machines are on the default rack. )
>
> Data local means that the task is running local to the machine that
> contains the actual data.
>
> HTH
>
> -Mike
>
> On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:
>
>
> hi all
>
> When I run "hadoop job -status xxx",Output the following some list.
>
> Rack-local map tasks=124
> Data-local map tasks=6
>
> What is the difference between Rack-local map tasks and Data-local map
> tasks?
> --
> centerqi@gmail.com|Sam
>
>
>


-- 
centerqi@gmail.com|齐忠

Re: What is the difference between Rack-local map tasks and Data-local map tasks?

Posted by Michael Segel <mi...@hotmail.com>.

Rack local means that while the data isn't local to the node running the task, it is still on the same rack. 
(It's meaningless unless you've set up rack awareness, because otherwise all of the machines are on the default rack.)

Data local means that the task is running local to the machine that contains the actual data. 
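
A small sketch of the distinction, for illustration only. This is not Hadoop code: classify_locality, the node names and the rack map are all hypothetical. The "/default-rack" name is, as far as I know, what Hadoop assigns to every node when no rack awareness is configured.

```python
DEFAULT_RACK = "/default-rack"


def classify_locality(task_host, replica_hosts, rack_of):
    """Classify a map task by where it runs relative to its input block.

    task_host     -- node running the task
    replica_hosts -- nodes holding a replica of the block
    rack_of       -- node -> rack mapping; unknown nodes fall on DEFAULT_RACK
    """
    if task_host in replica_hosts:
        return "data-local"
    task_rack = rack_of.get(task_host, DEFAULT_RACK)
    if any(rack_of.get(h, DEFAULT_RACK) == task_rack for h in replica_hosts):
        return "rack-local"
    return "off-rack"


racks = {"n1": "/rack1", "n2": "/rack1", "n3": "/rack2"}
replicas = {"n1"}

print(classify_locality("n1", replicas, racks))  # task runs on the replica's node
print(classify_locality("n2", replicas, racks))  # same rack, different node
print(classify_locality("n3", replicas, racks))  # different rack
print(classify_locality("n3", replicas, {}))     # no rack awareness configured
```

The last call shows why the counter says little without rack awareness: with every node on the default rack, any task that is not data-local is counted as rack-local.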

HTH

-Mike

On Oct 7, 2012, at 8:56 AM, centerqi hu <ce...@gmail.com> wrote:

> 
> hi all
> When I run "hadoop job -status xxx",Output the following some list.
> 
> Rack-local map tasks=124
> Data-local map tasks=6
> What is the difference between Rack-local map tasks and Data-local map tasks?
> 
> -- 
> centerqi@gmail.com|Sam
