Posted to common-dev@hadoop.apache.org by "Vivek Ratan (JIRA)" <ji...@apache.org> on 2008/10/21 12:03:46 UTC

[jira] Created: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()? 
---------------------------------------------------------------------------------------

                 Key: HADOOP-4472
                 URL: https://issues.apache.org/jira/browse/HADOOP-4472
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
            Reporter: Vivek Ratan


JobInProgress.initTasks() creates TIPs for map and reduce tasks, and also for the newly introduced setup and cleanup tasks. initTasks() is called by the schedulers because, to save memory, a scheduler may choose when to initialize M/R tasks (the Capacity Scheduler, for example, calls initTasks() only when it considers a job for running). One can say that Schedulers 'own' the initialization of M/R tasks in a job. The JT, on the other hand, 'owns' the setup and cleanup tasks (it schedules them, and Schedulers are unaware of these tasks). This causes a problematic dependency between the JT and a Scheduler. For example, the Capacity Scheduler calls initTasks() and immediately calls JobInProgress.obtainNewMapTask for a map task. This is a problem today, because no map or reduce task can run before the setup task has run, and the Capacity Scheduler is not aware of the setup task.

Either all Schedulers are explicitly aware of setup/cleanup tasks and their dependencies with M/R tasks (in which case, Schedulers 'own' the creation and scheduling of all these tasks correctly), or the JT 'owns' the setup/cleanup tasks and Schedulers are completely unaware of them (in which case, the creation of setup/cleanup tasks must be moved out of initTasks into a separate method which is called by the JT). 

I think the latter is the right way to go (unless we implement HADOOP-4421, in which case the former option may be viable as well). 
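
As a rough illustration of the second option (the JT 'owns' setup/cleanup), the split could look like the sketch below. The class SketchJobInProgress, the method name initSetupCleanupTasks(), and the string TIP labels are all hypothetical stand-ins, not the real JobInProgress code; only initTasks() is a name taken from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for JobInProgress; not the real Hadoop class.
class SketchJobInProgress {
    private final int numMaps;
    private final int numReduces;
    final List<String> setupCleanupTips = new ArrayList<>();
    final List<String> mapReduceTips = new ArrayList<>();

    SketchJobInProgress(int numMaps, int numReduces) {
        this.numMaps = numMaps;
        this.numReduces = numReduces;
    }

    // Assumed new method, meant to be called only by the JT:
    // creates the setup and cleanup TIPs and nothing else.
    synchronized void initSetupCleanupTasks() {
        if (!setupCleanupTips.isEmpty()) {
            return; // already created
        }
        setupCleanupTips.add("setup");
        setupCleanupTips.add("cleanup");
    }

    // Called by a scheduler whenever it chooses:
    // creates only the map and reduce TIPs.
    synchronized void initTasks() {
        if (!mapReduceTips.isEmpty()) {
            return; // already initialized
        }
        for (int i = 0; i < numMaps; i++) {
            mapReduceTips.add("map_" + i);
        }
        for (int i = 0; i < numReduces; i++) {
            mapReduceTips.add("reduce_" + i);
        }
    }
}
```

With this shape, a scheduler can keep deferring initTasks() for memory reasons without ever touching the setup/cleanup TIPs.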


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642908#action_12642908 ] 

Owen O'Malley commented on HADOOP-4472:
---------------------------------------

{quote}
The IDs of setup and cleanup tasks can be -1 and -2.
{quote}

I don't think that is right. I think we should have a SetupTask and a CleanupTask that both have id = 0. The isMap boolean needs to be replaced with an enumeration: {SETUP, MAP, REDUCE, CLEANUP}. There should be TIPs associated with both setup and cleanup.
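
A minimal sketch of that idea, assuming hypothetical names (SketchTaskType, SketchTaskId) rather than Hadoop's actual classes: once the task type is part of a task's identity, the setup and cleanup tasks can both use id 0 without colliding.

```java
import java.util.Objects;

// Hypothetical names; a sketch of the idea, not Hadoop's actual classes.
enum SketchTaskType { SETUP, MAP, REDUCE, CLEANUP }

final class SketchTaskId {
    final SketchTaskType type; // replaces the isMap boolean
    final int id;

    SketchTaskId(SketchTaskType type, int id) {
        this.type = Objects.requireNonNull(type);
        this.id = id;
    }

    // Two ids are equal only if both the type and the number match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchTaskId)) {
            return false;
        }
        SketchTaskId other = (SketchTaskId) o;
        return type == other.type && id == other.id;
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, id);
    }

    @Override
    public String toString() {
        return type + "_" + id;
    }
}
```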

{quote}
JobTracker does not inform the listeners when the job is submitted, and it waits for the setup completion.
{quote}

The Scheduler should be in control of when the job is initialized. Therefore it must be notified when the job is submitted. Furthermore, it should be notified again when the setup task is finished and the rest of the job is runnable.

{quote}
JT can poll the waiting jobs to see if setup is complete for them
{quote}

I don't see why the JT should ever poll the state. It knows the state changed via the heartbeat. Do you mean the scheduler? That should be done via an event.
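
To make the event-based alternative concrete, here is a small sketch. The interface SketchJobListener, its two callbacks, and the class SketchTracker are assumptions made for illustration; they are not the existing listener API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface; the method names are assumptions.
interface SketchJobListener {
    void jobSubmitted(String jobId);
    void jobRunnable(String jobId);
}

// Hypothetical, minimal stand-in for the JobTracker side of the exchange.
class SketchTracker {
    private final List<SketchJobListener> listeners = new ArrayList<>();

    void addListener(SketchJobListener listener) {
        listeners.add(listener);
    }

    // The scheduler hears about the job as soon as it is submitted,
    // so it stays in control of when the job is initialized.
    void submitJob(String jobId) {
        for (SketchJobListener l : listeners) {
            l.jobSubmitted(jobId);
        }
    }

    // Meant to be called while processing the heartbeat that reports the
    // setup task as finished. Nobody polls: the state change raises the event.
    void setupTaskCompleted(String jobId) {
        for (SketchJobListener l : listeners) {
            l.jobRunnable(jobId);
        }
    }
}
```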


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643127#action_12643127 ] 

Devaraj Das commented on HADOOP-4472:
-------------------------------------

If we think that doing HADOOP-4513 addresses the issue to do with initializing too many jobs simultaneously in the capacity scheduler, then I propose that we don't fix this issue at all for now. Instead we should look at HADOOP-4421. Thoughts?


[jira] Assigned: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu reassigned HADOOP-4472:
-----------------------------------------------

    Assignee: Amareshwari Sriramadasu


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642884#action_12642884 ] 

Amareshwari Sriramadasu commented on HADOOP-4472:
-------------------------------------------------

The following is the approach for separating initialization of setup/cleanup tasks out of initTasks:
1. Remove the initialization from initTasks and add it to the constructor of JobInProgress. The IDs of setup and cleanup tasks can be -1 and -2.
2. JobTracker does not inform the listeners when the job is submitted, and it waits for the setup completion.
JT can poll the waiting jobs to see if setup is complete for them, but this would be done in the heartbeat, which becomes expensive. Alternatively, the JIP can tell the JT that setup is complete through an API, JobTracker.setupComplete, through which the JT informs the other listeners about the job.
The jobs will then be added to the listeners in the order of setup completion.
3. Once initTasks is done, the job moves to the RUNNING state. Since initTasks can be called by anyone, the caller has to inform the JobTracker about the state change. This will be easier once HADOOP-4521 is in. For now we can have a package-private API in JobTracker.

Thoughts?

Moreover, if initTasks is done asynchronously as a result of HADOOP-4513, we wouldn't need this change.


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642076#action_12642076 ] 

Hemanth Yamijala commented on HADOOP-4472:
------------------------------------------

As an update, I had a discussion with Owen, and we think it is better to make the schedulers explicitly aware of setup tasks. However, we do need consensus and more thought on this. And more importantly, it might be a very big change if we want this fixed for Hadoop 0.19.1. So for the interim, the option proposed here will suffice.

Also note that the separation of setup / cleanup from initTasks makes sense even if we make the schedulers aware of the tasks. The discussion is only about the jobAdded part that I have proposed before.


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641761#action_12641761 ] 

Hemanth Yamijala commented on HADOOP-4472:
------------------------------------------

If we choose to make the schedulers unaware of the setup/cleanup tasks, one more consideration is whether the JT should add the job to the scheduler only after the setup task is complete.

Consider this example: two jobs, J1 and J2, are added. Say J1's setup task takes a long time to complete, and J2's setup task completes very soon. If the scheduler is informed about the two jobs before their setup tasks have completed (as happens today), the scheduler would look at J1 and initialize its tasks. But since J1's setup has not completed, the scheduler would move on to initialize the second job as well. Initializing the first job at this point wastefully occupies JT memory, as no M/R task can run until setup is complete. Put another way, letting task initialization happen only after setup has completed will help reduce the memory footprint of the JT.

The flip side of this approach is that jobs would be given to the scheduler in order of completion of their setup tasks (irrespective of other aspects like submission time, priority, etc.). However, I believe at some point (in 0.19) setup was being done as part of the job client, before it was graduated to a task. That implicitly meant that jobs were given to the scheduler in order of setup completion. So, it may not be that bad.

Does this make sense?
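
The J1/J2 example can be sketched as follows. SketchDeferredAdder is a hypothetical class that only models the ordering effect: jobs wait in submission order, and the scheduler sees a job only once its setup has completed. The method name setupComplete echoes the JobTracker.setupComplete API suggested elsewhere in this thread; its body here is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the JT holds jobs back until their setup task completes.
class SketchDeferredAdder {
    // Jobs whose setup is still running, in submission order.
    private final Set<String> awaitingSetup = new LinkedHashSet<>();
    // Jobs the scheduler has been told about, in the order it was told.
    private final List<String> schedulerQueue = new ArrayList<>();

    void submit(String jobId) {
        awaitingSetup.add(jobId);
    }

    // Hand the job to the scheduler only once its setup task is done.
    void setupComplete(String jobId) {
        if (awaitingSetup.remove(jobId)) {
            schedulerQueue.add(jobId);
        }
    }

    // Returns a copy so callers cannot modify the internal queue.
    List<String> schedulerQueue() {
        return new ArrayList<>(schedulerQueue);
    }
}
```

Submitting J1 and then J2, and completing J2's setup first, leaves the scheduler with J2 ahead of J1, which is exactly the reordering described above.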


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Vivek Ratan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643425#action_12643425 ] 

Vivek Ratan commented on HADOOP-4472:
-------------------------------------

bq. If we think that doing HADOOP-4513 addresses the issue to do with initializing too many jobs simultaneously in the capacity scheduler, then I propose that we don't fix this issue at all for now. Instead we should look at HADOOP-4421. Thoughts?

Agreed. Just to sum up: a job can start running (i.e., we can schedule the first of its map or reduce tasks) when its M/R tasks are initialized and its setup task is complete. Each of these steps (running the setup task, calling initTasks()) can potentially take a while. The benefit of moving the creation of the setup task outside of initTasks() is the ability to run the setup task independently of when initTasks() is called. In particular, we'd like to run the setup task as early as possible, or at the very least avoid running it only after initTasks(), as the two are independent. This approach is beneficial if either the setup task or initTasks(), or both, take a long time.

However, achieving this separation is not easy. If we want to preserve current semantics and not change any interfaces, then we need to update a job's state to RUNNING either when the setup task is done or when initTasks() completes. Since the job's state change requires notifications to listeners, and initTasks() does not have access to listeners, this notification is going to be hard to do without changing some interfaces. See HADOOP-4521 for a related issue. 
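
The difficulty can be seen in a small sketch. SketchJobReadiness is a hypothetical class, not Hadoop code: whichever of the two events arrives last has to perform the move to RUNNING, so either code path may be the one that needs to notify listeners.

```java
// Hypothetical sketch of the two independent preconditions; not Hadoop code.
class SketchJobReadiness {
    enum State { PREP, RUNNING }

    private boolean tasksInitialized;
    private boolean setupDone;
    private State state = State.PREP;

    // Each method returns true only for the call that moves the job to
    // RUNNING, i.e. the point at which listeners would have to be notified.
    synchronized boolean onInitTasksDone() {
        tasksInitialized = true;
        return maybeStart();
    }

    synchronized boolean onSetupDone() {
        setupDone = true;
        return maybeStart();
    }

    synchronized State state() {
        return state;
    }

    // Transition once, and only when both preconditions hold.
    private boolean maybeStart() {
        if (state == State.PREP && tasksInitialized && setupDone) {
            state = State.RUNNING;
            return true;
        }
        return false;
    }
}
```

Because the caller of onInitTasksDone() may be a scheduler, that caller would need some way to reach the listeners, which is the interface change discussed above.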

Given that the long-term solution is not clear, and could be quite different, we should drop this proposal and look at a larger solution in the context of HADOOP-4421. 


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643406#action_12643406 ] 

Amareshwari Sriramadasu commented on HADOOP-4472:
-------------------------------------------------

bq. I don't think that is right. I think we should have a SetupTask and a CleanupTask that are both id = 0. The isMap boolean needs to be replaced with an enumeration. {SETUP, MAP, REDUCE, CLEANUP}. There should be tips associated with both setup and cleanup.
This is proposed in HADOOP-4421.

bq. I don't see why the JT should ever poll the state. It knows the state changed via the heartbeat.
Currently there is no state change for setup. Both initTasks and setup happen in the PREP state.

bq. Furthermore, it should be notified again when the setup task is finished and the rest of the job is runnable.
I think it makes sense to have a RUNNABLE state that the job enters when setup completes.

bq. I propose that we don't fix this issue at all for now. Instead we should look at HADOOP-4421.
+1. We can also add the RUNNABLE state through HADOOP-4421.


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641801#action_12641801 ] 

Devaraj Das commented on HADOOP-4472:
-------------------------------------

+1. As we discussed offline, I think it makes sense not to initialize all tasks of a job before its setupTask completes. And yes, earlier the JobClient used to run setup for the job, and purely from that point of view it doesn't seem too bad to order jobs in a Scheduler's WaitingQueue in the order of setupTask completion. What do others think?


[jira] Commented: (HADOOP-4472) Should we move out the creation of setup/cleanup tasks from JobInProgress.initTasks()?

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642062#action_12642062 ] 

Amar Kamat commented on HADOOP-4472:
------------------------------------

+1
