
Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2008/09/26 21:57:44 UTC

[jira] Created: (HADOOP-4295) mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

mapred.map.tasks.maximum and mapred.reduce.tasks.maximum
--------------------------------------------------------

                 Key: HADOOP-4295
                 URL: https://issues.apache.org/jira/browse/HADOOP-4295
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Christian Kunz


Right now mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum are set at the tasktracker level.

In the absence of a smart tasktracker that monitors resources and decides adaptively how many tasks can run simultaneously, it would be nice to move these two configuration options to the job level. This would make it easier to optimize the performance of a batch of jobs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635421#action_12635421 ] 

Christian Kunz commented on HADOOP-4295:
----------------------------------------

bq. I am not sure if Doug was suggesting we use HADOOP-4035 to implement the functionality proposed in this JIRA. I understood it to mean that the approach should be the same.
This was my understanding as well.

bq. That said, I also think we'll need to consider unifying mechanisms of resource management at some time (maybe in the near future, smile).
The sooner, the better, *smile*. Currently one has to restart the framework whenever mapred.tasktracker.map.tasks.maximum or mapred.tasktracker.reduce.tasks.maximum is changed.

[jira] Commented: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635004#action_12635004 ] 

Doug Cutting commented on HADOOP-4295:
--------------------------------------

I think these are appropriately set at the tasktracker level, since they're meant to correspond to the resources of the tasktracker, e.g., the number of cores.  If one has a mixed cluster, with some 2-core nodes and some 4-core nodes, then one might reasonably set these differently on different nodes.  The memory limits of HADOOP-2765 and HADOOP-4035 can be used to control things on a per-job basis.
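As a rough illustration of the per-node setting described above, the sketch below derives the two tasktracker properties from a node's core count. The property names come from the issue; the slots_for_node helper, the node names, and the rule of one map slot per core and one reduce slot per two cores are assumptions made purely for illustration, not Hadoop defaults.

```python
def slots_for_node(cores):
    """Return hypothetical tasktracker slot settings for a node with `cores` cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return {
        # Assumed rule: one map slot per core.
        "mapred.tasktracker.map.tasks.maximum": cores,
        # Assumed rule: one reduce slot per two cores, never fewer than one.
        "mapred.tasktracker.reduce.tasks.maximum": max(1, cores // 2),
    }


# A mixed cluster with 2-core and 4-core nodes gets different values per node.
mixed_cluster = {"node-2core": 2, "node-4core": 4}
per_node_config = {name: slots_for_node(c) for name, c in mixed_cluster.items()}
```

Because the values are computed per node, a 4-core node can advertise more slots than a 2-core node, which is the property that a purely job-level setting would lose.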

[jira] Commented: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646402#action_12646402 ] 

Christian Kunz commented on HADOOP-4295:
----------------------------------------

I talked with Sameer offline and we agreed to use a work-around based on the scheduler till a more general solution for resource monitoring and utilization is available.

[jira] Updated: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-4295:
-----------------------------------

    Summary: job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum   (was: mapred.map.tasks.maximum and mapred.reduce.tasks.maximum)

[jira] Commented: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635059#action_12635059 ] 

Christian Kunz commented on HADOOP-4295:
----------------------------------------

The situation becomes more complicated when some applications in a batch are pipes applications and some are not. Among pipes applications, some might produce a large amount of data to shuffle, requiring the Java tasks to sort intensively, and some might not.
In summary, the mapping from the number of cores to mapred.map.tasks.maximum and mapred.reduce.tasks.maximum is not always straightforward.


[jira] Resolved: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz resolved HADOOP-4295.
------------------------------------

    Resolution: Won't Fix

[jira] Commented: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635280#action_12635280 ] 

Hemanth Yamijala commented on HADOOP-4295:
------------------------------------------

bq. Should the title be changed to something like
bq. Modify the capacity scheduler (HADOOP-3445) to take job limitations concerning number of simultaneous tasks per node into account when scheduling tasks?

I am not sure if Doug was suggesting we use HADOOP-4035 to implement the functionality proposed in this JIRA. I understood it to mean that the approach should be the same. Either way, I think it would be nice to have it handled separately, since HADOOP-4035 specifically addresses only memory-based parameters in job control.

That said, I also think we'll need to consider unifying mechanisms of resource management at some time (maybe in the near future, *smile*). We already seem to have *slightly* different ways of dealing with cores, memory, and disk (a.k.a. HADOOP-657): specifying, measuring, reporting and scheduling.

[jira] Commented: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635643#action_12635643 ] 

Vinod K V commented on HADOOP-4295:
-----------------------------------

Then, maybe, similar to the configuration knob mapred.tasks.maxmemory w.r.t. memory, we can have mapred.job.{map|reduce}.tasks to specify the number of tasks a job occupies, while mapred.tasktracker.tasks.maxmemory maps to mapred.tasktracker.{map|reduce}.tasks.maximum. After that, similar to how HADOOP-4035 wishes to proceed, a scheduler can compare the job's requirements for number of tasks with the tasktracker's limits and schedule accordingly.

Notes:
 - Maybe we should use the term "cores" in mapred.tasktracker.{map|reduce}.tasks.maximum. We clearly need to redefine and distinguish tasks, slots and cores, once and for all.
 - Should we also rename mapred.tasks.maxmemory to mapred.job.tasks.maxmemory?
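
The comparison proposed above could look roughly like the following sketch. It reads the proposed mapred.job.{map|reduce}.tasks as the number of slots each task of the job occupies, which is one possible interpretation. The class names, field names and can_schedule function are hypothetical and do not correspond to actual Hadoop scheduler code.

```python
from dataclasses import dataclass


@dataclass
class TrackerStatus:
    max_slots: int   # e.g. the value of mapred.tasktracker.map.tasks.maximum
    used_slots: int  # slots currently occupied on this tasktracker


@dataclass
class JobRequirement:
    slots_per_task: int  # assumed meaning of the proposed mapred.job.map.tasks


def can_schedule(job, tracker):
    """Return True if one more task of `job` fits within the tracker's free slots."""
    free_slots = tracker.max_slots - tracker.used_slots
    return job.slots_per_task <= free_slots
```

For example, a job whose tasks each occupy 2 slots fits on a tracker with 4 slots of which 2 are in use, but not on one with 3 in use. This mirrors the way a memory-based check would compare a job's per-task memory requirement with the tasktracker's memory limit.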

[jira] Commented: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635062#action_12635062 ] 

Doug Cutting commented on HADOOP-4295:
--------------------------------------

I'm not arguing that these are perfect, but permitting them to vary per node is a feature that we shouldn't toss out.  Adding a different parameter that limits the number of tasks that a job would actually run simultaneously on a node might be reasonable.  Thus I think extending the scheduler, as is done in HADOOP-4035, is more like what we'd want here rather than to change these existing parameters.
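
A minimal sketch of the idea of an additional job-level parameter, assuming the scheduler knows how many tasks of the job already run on a node: the tasktracker's own maximum stays in force, and the job-level limit can only reduce what is assigned further. The function and parameter names are hypothetical and are not part of Hadoop.

```python
def tasks_to_assign(tracker_max, running_total, job_running_on_node, job_node_limit=None):
    """Return how many more tasks of a job may be started on one node.

    tracker_max:         the node's existing per-tasktracker maximum
    running_total:       tasks of any job currently running on the node
    job_running_on_node: tasks of this job currently running on the node
    job_node_limit:      hypothetical per-job cap on simultaneous tasks per node
    """
    free_slots = max(0, tracker_max - running_total)
    if job_node_limit is None:
        # No job-level limit given: only the node-level maximum applies.
        return free_slots
    job_headroom = max(0, job_node_limit - job_running_on_node)
    return min(free_slots, job_headroom)
```

With a tracker maximum of 4, one task running, and a job-level limit of 2 for a job that already has one task on the node, this returns 1. Without the job-level limit it returns 3.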

[jira] Commented: (HADOOP-4295) job-level configurable mapred.map.tasks.maximum and mapred.reduce.tasks.maximum

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635076#action_12635076 ] 

Christian Kunz commented on HADOOP-4295:
----------------------------------------

Okay, my bad. I went too far by requesting to move the configuration parameters to the job level instead of just adding job-level control.

Should the title be changed to something like 
Modify the capacity scheduler (HADOOP-3445) to take job limitations concerning number of simultaneous tasks per node into account when scheduling tasks?
