Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2007/10/09 16:08:51 UTC

[jira] Created: (HADOOP-2014) Job Tracker should not clobber the data locality of tasks

Job Tracker should not clobber the data locality of tasks
---------------------------------------------------------

                 Key: HADOOP-2014
                 URL: https://issues.apache.org/jira/browse/HADOOP-2014
             Project: Hadoop
          Issue Type: Bug
            Reporter: Runping Qi



Currently, when the Job Tracker assigns a mapper task to a task tracker and no split is local to that task tracker, the
job tracker finds the first runnable task in the master task list and assigns it to the task tracker.
The split for that task is, of course, not local to the task tracker. However, the split may be local to other task trackers.
Assigning that task to that task tracker may decrease the potential number of mapper attempts with data locality.
The desired behavior in this situation is to choose a task whose split is not local to any task tracker,
and to resort to the current behavior only if no such task is found.

In general, it will be useful to know the number of task trackers to which each split is local.
To assign a task to a task tracker, the job tracker should first try to pick a task that is local to the task tracker and that has the minimal number of task trackers to which it is local. If no task is local to the task tracker, the job tracker should try to pick a task that has the minimal number of task trackers to which it is local.
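
The selection rule above could be sketched roughly as follows. This is only an illustration in Python, not the job tracker's actual code: the function name pick_map_task, the pending dictionary, and the tracker names are all hypothetical, and ties are broken by submission order as an arbitrary choice.

```python
from typing import Dict, Optional, Set


def pick_map_task(tracker: str,
                  pending: Dict[str, Set[str]]) -> Optional[str]:
    """Pick a pending map task for `tracker` (hypothetical helper).

    `pending` maps a task id to the set of task trackers that hold a
    replica of its split; an empty set means no tracker is local.
    """
    if not pending:
        return None

    def locality_count(task_id: str) -> int:
        # Number of task trackers to which this task's split is local.
        return len(pending[task_id])

    # Prefer tasks whose split is local to the requesting tracker.
    local = [t for t, hosts in pending.items() if tracker in hosts]
    candidates = local if local else list(pending)
    # Among the candidates, take the one with the fewest local trackers.
    # min() returns the first minimal element, so ties keep dict order.
    return min(candidates, key=locality_count)


pending = {
    "m1": {"tt1", "tt2", "tt3"},
    "m2": {"tt1"},
    "m3": set(),
    "m4": {"tt2"},
}
print(pick_map_task("tt1", pending))  # m2: local to tt1 and to nothing else
print(pick_map_task("tt9", pending))  # m3: not local to any tracker
```

With this rule, a tracker that has no local split takes m3 first, which leaves m1, m2 and m4 available for the trackers that can run them locally.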

It is worthwhile to instrument the job tracker code to report the number of splits that are local to at least one task tracker.
That number is the maximum number of tasks that can run with data locality. By comparing it with the actual number of
data-local mappers launched, we can measure the effectiveness of the job tracker scheduling.
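
A minimal sketch of that measurement, under assumptions: the names locality_upper_bound and locality_effectiveness are hypothetical, the split-to-host map is supplied by the caller, and returning 1.0 when no split can be local is an arbitrary convention.

```python
from typing import Dict, Set


def locality_upper_bound(split_hosts: Dict[str, Set[str]],
                         trackers: Set[str]) -> int:
    """Count splits that are local to at least one known task tracker."""
    return sum(1 for hosts in split_hosts.values() if hosts & trackers)


def locality_effectiveness(data_local_launched: int,
                           upper_bound: int) -> float:
    """Fraction of the achievable data-local tasks actually launched."""
    if upper_bound == 0:
        # No split could have been local, so nothing was lost (assumption).
        return 1.0
    return data_local_launched / upper_bound


split_hosts = {
    "s1": {"tt1"},
    "s2": {"tt2", "tt3"},
    "s3": set(),     # no replica location known
    "s4": {"dn9"},   # replica on a node that runs no task tracker
}
trackers = {"tt1", "tt2", "tt3"}

bound = locality_upper_bound(split_hosts, trackers)
print(bound)                             # 2
print(locality_effectiveness(1, bound))  # 0.5
```

Here only s1 and s2 can ever run data-local, so launching one data-local mapper gives an effectiveness of 0.5.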

When we introduce rack locality, we should apply the same principle.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-2014) Job Tracker should not clobber the data locality of tasks

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi reassigned HADOOP-2014:
----------------------------------

    Assignee: Devaraj Das



[jira] Updated: (HADOOP-2014) Job Tracker should not clobber the data locality of tasks

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Runping Qi updated HADOOP-2014:
-------------------------------

    Component/s: mapred
    Description: 
Currently, when the Job Tracker assigns a mapper task to a task tracker and no split is local to that task tracker, the
job tracker finds the first runnable task in the master task list and assigns it to the task tracker.
The split for that task is, of course, not local to the task tracker. However, the split may be local to other task trackers.
Assigning that task to that task tracker may decrease the potential number of mapper attempts with data locality.
The desired behavior in this situation is to choose a task whose split is not local to any task tracker,
and to resort to the current behavior only if no such task is found.

In general, it will be useful to know the number of task trackers to which each split is local.
To assign a task to a task tracker, the job tracker should first try to pick a task that is local to the task tracker and that has the minimal number of task trackers to which it is local. If no task is local to the task tracker, the job tracker should try to pick a task that has the minimal number of task trackers to which it is local.

It is worthwhile to instrument the job tracker code to report the number of splits that are local to at least one task tracker.
That number is the maximum number of tasks that can run with data locality. By comparing it with the actual number of
data-local mappers launched, we can measure the effectiveness of the job tracker scheduling.

When we introduce rack locality, we should apply the same principle.





[jira] Commented: (HADOOP-2014) Job Tracker should not clobber the data locality of tasks

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558155#action_12558155 ] 

eric baldeschwieler commented on HADOOP-2014:
---------------------------------------------

An ideal solution would maintain some sort of prioritized list of maps / node / rack so that we first execute work that is unlikely to find another efficient location in which to run.

It would also make sense to place some non-local work early, since these tasks run slowly, on nodes that are likely to run out of local work relatively early.

One could also pay attention to IO load on each source node...

At a minimum, we should track maps that have no local option and schedule them first when a node has no local option (as Doug Cutting suggested in HADOOP-2560).
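
The minimal variant in the last paragraph might look something like the sketch below. It is an assumption-laden illustration, not a proposal for the actual data structures: the MapQueue class, its fields, and the host names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set


class MapQueue:
    """Hypothetical pending-map index that tracks maps with no local option."""

    def __init__(self, split_hosts: Dict[str, Set[str]]):
        self.order: List[str] = list(split_hosts)   # submission order
        self.pending: Set[str] = set(split_hosts)
        self.by_host: Dict[str, List[str]] = {}     # host -> local maps
        self.no_local: Deque[str] = deque()         # maps local to no host
        for task, hosts in split_hosts.items():
            if not hosts:
                self.no_local.append(task)
            for host in hosts:
                self.by_host.setdefault(host, []).append(task)

    def _take(self, task: str) -> str:
        self.pending.discard(task)
        return task

    def next_for(self, host: str) -> Optional[str]:
        # 1. A map whose split is local to the requesting host.
        for task in self.by_host.get(host, []):
            if task in self.pending:
                return self._take(task)
        # 2. A map that has no local option anywhere.
        if self.no_local:
            return self._take(self.no_local.popleft())
        # 3. Fall back to the first pending map in submission order.
        for task in self.order:
            if task in self.pending:
                return self._take(task)
        return None


q = MapQueue({"m1": {"a"}, "m2": set(), "m3": {"b"}})
print(q.next_for("a"))  # m1: local to host a
print(q.next_for("a"))  # m2: no local option on any host
print(q.next_for("a"))  # m3: fallback, steals work that is local to host b
```

The separate no_local queue is what keeps step 2 cheap: a host with no local work does not have to scan every pending map to find one that nobody can run locally.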



[jira] Issue Comment Edited: (HADOOP-2014) Job Tracker should not clobber the data locality of tasks

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558155#action_12558155 ] 

eric14 edited comment on HADOOP-2014 at 1/11/08 2:41 PM:
----------------------------------------------------------------------

An ideal solution would maintain some sort of prioritized list of maps / node / rack so that we first execute work that is unlikely to find another efficient location in which to run.

It would also make sense to place some non-local work early, since these tasks run slowly, on nodes that are likely to run out of local work relatively early.

One could also pay attention to IO load on each source node...

At a minimum, we should track maps that have no local option and schedule them first when a node has no local option (as Doug Cutting suggested in HADOOP-2560).
