Posted to common-dev@hadoop.apache.org by "Karam Singh (JIRA)" <ji...@apache.org> on 2008/10/31 15:43:44 UTC

[jira] Created: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other
--------------------------------------------------------------------------------------

                 Key: HADOOP-4558
                 URL: https://issues.apache.org/jira/browse/HADOOP-4558
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/capacity-sched
    Affects Versions: 0.19.0
         Environment: Cluster Capacity Maps=Reduces =210 each
Two Queues: 
Q1:  default, GC (%) =40, GC=84 (Maps and Reduces each). Reclaim time = 3 mins.
Q2: test_q1, GC (%) =60, GC=126 (Maps and Reduces each) Reclaim time = 2 mins


            Reporter: Karam Singh


The scheduler fails to reclaim capacity if jobs are submitted to queues one after the other.
The first job is submitted with as many tasks as the cluster's M/R capacity.
The second job is submitted to a different queue while all tasks of the first job are running; the scheduler then fails to reclaim capacity for the second job.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat reassigned HADOOP-4558:
----------------------------------

    Assignee: Amar Kamat


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652321#action_12652321 ] 

Sreekanth Ramakrishnan commented on HADOOP-4558:
------------------------------------------------

+1 to patch.


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Vivek Ratan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644672#action_12644672 ] 

Vivek Ratan commented on HADOOP-4558:
-------------------------------------

The Capacity Scheduler updates its queue-based data structures on every heartbeat, when a task needs to be assigned. What happened here was that a job was submitted, but all TaskTrackers (TTs) were running long-running maps and no call to assignTasks() was made. Hence the scheduler never updated its structures and was unaware of the second job being added. It's a fairly simple fix: update the data structures when a job is added or removed.
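
A minimal sketch of the fix described above, not the actual capacity scheduler code. The class QueueRefreshSketch and its methods are hypothetical; the only point is that the per-queue bookkeeping is refreshed from the job add/remove callbacks instead of waiting for a heartbeat-driven assignTasks() call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the scheduler's per-queue bookkeeping. */
class QueueRefreshSketch {
    // Pending task counts of the jobs known in each queue (one entry per job).
    private final Map<String, List<Integer>> jobsByQueue = new HashMap<>();
    // Cached per-queue totals; in the bug, this kind of cache was refreshed only on heartbeats.
    private final Map<String, Integer> pendingByQueue = new HashMap<>();

    /** Called when a job is submitted; refreshes the cached total immediately. */
    void jobAdded(String queue, int pendingTasks) {
        jobsByQueue.computeIfAbsent(queue, q -> new ArrayList<>()).add(pendingTasks);
        refreshQueueStats(queue);
    }

    /** Called when a job finishes or is killed; refreshes the cached total immediately. */
    void jobRemoved(String queue, int pendingTasks) {
        List<Integer> jobs = jobsByQueue.get(queue);
        if (jobs != null) {
            jobs.remove(Integer.valueOf(pendingTasks)); // remove by value, not by index
        }
        refreshQueueStats(queue);
    }

    /** Recomputes the cached total for one queue from the job list. */
    private void refreshQueueStats(String queue) {
        int total = 0;
        for (int pending : jobsByQueue.getOrDefault(queue, Collections.<Integer>emptyList())) {
            total += pending;
        }
        pendingByQueue.put(queue, total);
    }

    /** What the reclaim logic would read. */
    int pendingTasks(String queue) {
        return pendingByQueue.getOrDefault(queue, 0);
    }
}
```

With this shape, a job added to test_q1 is visible to the reclaim logic even if no TaskTracker has a free slot.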


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Karam Singh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644308#action_12644308 ] 

Karam Singh commented on HADOOP-4558:
-------------------------------------

Submitted sleep job J1 to Q1 with map and reduce tasks = 210 each.
Sleep time per map = 10 mins and per reduce = 20 mins.
Once all maps and reduces of J1 were running, submitted sleep job J2 to Q2 with map and reduce tasks = 210 each.

J2 started only after 10 mins, when the maps of J1 started finishing.
J2 should have got its slots after 2 mins (i.e. after its reclaim time limit).
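
The numbers in this scenario can be checked with a few lines of arithmetic. This is only an illustration; the class and method names are made up, and the percentages and slot counts come from the Environment field of the issue.

```java
/** Worked arithmetic for the reported setup; all names here are illustrative. */
class ReclaimArithmetic {
    /** Guaranteed capacity (GC) in slots for a queue, given its percentage share. */
    static int guaranteedSlots(int clusterSlots, int gcPercent) {
        return clusterSlots * gcPercent / 100;
    }

    public static void main(String[] args) {
        int clusterSlots = 210;                      // map slots (same for reduces)
        int q1 = guaranteedSlots(clusterSlots, 40);  // default
        int q2 = guaranteedSlots(clusterSlots, 60);  // test_q1
        int j1Excess = clusterSlots - q1;            // slots J1 holds beyond Q1's guarantee
        // prints "Q1 GC = 84, Q2 GC = 126, J1 excess = 126"
        System.out.println("Q1 GC = " + q1 + ", Q2 GC = " + q2 + ", J1 excess = " + j1Excess);
    }
}
```

So J1 holds exactly as many slots over its guarantee as Q2 is entitled to, and those are the slots expected to be reclaimed once Q2's 2-minute limit expires.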




[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652794#action_12652794 ] 

Hemanth Yamijala commented on HADOOP-4558:
------------------------------------------

This patch no longer applies cleanly to trunk, possibly due to HADOOP-4035. Can you please create a new patch?


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653271#action_12653271 ] 

Amar Kamat commented on HADOOP-4558:
------------------------------------

Tested this patch on 50 nodes and the capacity gets reclaimed as expected.


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645402#action_12645402 ] 

Amar Kamat commented on HADOOP-4558:
------------------------------------

Looks like the structure for running tasks is maintained only if speculation is _ON_. Hence, with speculation turned off, we don't see any tasks getting killed. We have 3 options here:
1. Maintain a list of running tasks per job in the capacity scheduler and use that to kill tasks instead. The drawbacks of this approach are:
   - The scheduler will do the same bookkeeping as is done by the JIP (JobInProgress).
   - The scheduler now needs to know about task completions.

2. Maintain the list of running tasks irrespective of speculation. The only drawback of this approach is that it will modify the (framework) code path for jobs with speculation turned _OFF_ and hence will require benchmarking.

3. For jobs with speculation turned _OFF_, we walk over the map structure, find the least progressed maps and kill them. The benefit of this approach is that the framework code remains unchanged and there is no code duplication. The drawback is that this approach does a linear scan every time.

Thoughts?

I am still investigating why the reclaim didn't happen as expected.
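
For illustration, option 3 could look roughly like the sketch below. RunningTask and pickTasksToKill are hypothetical names, not framework classes, and the sketch sorts a copy of the task list for brevity, which is O(n log n) rather than a single linear pass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of option 3: choose the least progressed tasks as kill candidates. */
class LeastProgressedSketch {
    /** Illustrative view of a running task: an id and its progress in [0, 1]. */
    static final class RunningTask {
        final String id;
        final double progress;

        RunningTask(String id, double progress) {
            this.id = id;
            this.progress = progress;
        }
    }

    /** Returns up to count tasks with the lowest progress, lowest first. */
    static List<RunningTask> pickTasksToKill(List<RunningTask> running, int count) {
        List<RunningTask> sorted = new ArrayList<>(running); // leave the caller's list untouched
        sorted.sort(Comparator.comparingDouble(t -> t.progress));
        int n = Math.min(Math.max(count, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, n));
    }
}
```

Killing the least progressed tasks wastes the least completed work, which is the usual reason for preferring them.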


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646190#action_12646190 ] 

Devaraj Das commented on HADOOP-4558:
-------------------------------------

+1 on #2


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Sreekanth Ramakrishnan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652288#action_12652288 ] 

Sreekanth Ramakrishnan commented on HADOOP-4558:
------------------------------------------------

Can you please remove the assignment of pending tasks in TaskSchedulingMgr.jobAdded()? Also, the patch does not seem to apply cleanly on trunk; can you please check that as well?

The rest of the patch looks fine.


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Vivek Ratan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647277#action_12647277 ] 

Vivek Ratan commented on HADOOP-4558:
-------------------------------------

It's true that queue information is updated at the beginning of assignTasks(). It can be done at the end too, but it won't help much. In fact, we may call updateQSIObjects() once every few heartbeats, if the call is expensive. Any code that requires exact information about the state of the queues should call updateQSIObjects() and should not rely on when this method is called by assignTasks(). Hence, a better solution is for reclaimCapacity() to call updateQSIObjects() at the beginning. 
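
A rough sketch of that suggestion, under the assumption that the scheduler keeps a cached per-queue snapshot next to some authoritative running-task count. The method names updateQSIObjects() and reclaimCapacity() are taken from this thread, but their bodies and the surrounding class are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: reclaimCapacity() refreshes queue state before using it. */
class ReclaimRefreshSketch {
    // Authoritative running-task counts per queue (illustrative).
    private final Map<String, Integer> actualRunning = new HashMap<>();
    // Scheduler's cached snapshot; may be stale between refreshes.
    private final Map<String, Integer> cachedRunning = new HashMap<>();
    // Guaranteed capacity per queue, in slots.
    private final Map<String, Integer> guaranteed = new HashMap<>();

    void setGuaranteed(String queue, int slots) {
        guaranteed.put(queue, slots);
    }

    void taskStarted(String queue) {
        actualRunning.merge(queue, 1, Integer::sum);
    }

    /** Stand-in for updateQSIObjects(): copy the authoritative counts into the cache. */
    void updateQSIObjects() {
        cachedRunning.clear();
        cachedRunning.putAll(actualRunning);
    }

    /** Returns how many slots the queue holds over its guarantee, after refreshing the cache. */
    int reclaimCapacity(String queue) {
        updateQSIObjects(); // do not rely on assignTasks() having refreshed the state
        int running = cachedRunning.getOrDefault(queue, 0);
        return Math.max(0, running - guaranteed.getOrDefault(queue, 0));
    }
}
```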


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654073#action_12654073 ] 

Hudson commented on HADOOP-4558:
--------------------------------

Integrated in Hadoop-trunk #680 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/])
    

> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4558
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Cluster Capacity Maps=Reduces =210 each
> Two Queues: 
> Q1:  default, GC (%) =40, GC=84 (Maps and Reduces each). Reclaim time = 3 mins.
> Q2: test_q1, GC (%) =60, GC=126 (Maps and Reduces each) Reclaim time = 2 mins
>            Reporter: Karam Singh
>            Assignee: Amar Kamat
>             Fix For: 0.20.0
>
>         Attachments: 4558.1.patch, HADOOP-4558-v1.4.patch, HADOOP-4558-v1.5.patch, HADOOP-4558-v1.7.patch
>
>
> The scheduler fails to reclaim capacity if jobs are submitted to queues one after the other.
> The first job is submitted with as many tasks as the cluster's M/R capacity.
> The second job is submitted to a different queue while all tasks of the first job are running; the scheduler then fails to reclaim capacity for the second job.


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646223#action_12646223 ] 

Amar Kamat commented on HADOOP-4558:
------------------------------------

{quote}
Here J1 is still using 12 extra map and 1 extra reduce slots
It took nearly two more minutes to when j1 and j2 both starts using MR slots equal to their GCs.
{quote}
The reason is as follows:
When job2 gets added, a {{ReclaimedResource}} object is added to the reclaim queue. After _whenToKill_ units of time, tasks from job1 are killed. But at this point job2 is not yet set up and hence cannot schedule tasks, so job1 is again selected for scheduling tasks. Once job2 finishes setup, a reclaim request is added for the (extra) scheduled tasks. Hence the observation that there are some extra killings and that the guaranteed capacity is allocated only after a few mins.

I think the issue is more involved. Here are the choices:
1) Let it be: Since the setup task took time to schedule and finish, it's OK to keep it as it is. What we guarantee here is that the slots will be allocated to the queue as soon as a request is made.
2) Delay: One way to avoid the _thrashing_ is to delay the reclaim until the job/queue that wants it actually needs it. The obvious problem with this is that it will take some time to kill the tasks and hence there will be a little delay in the reclaim. Also, the _sla_ needs to be redefined.

Note that this issue also depends on how set-up tasks are handled in future and when the job actually becomes _RUNNING_.
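
The difference between the two choices can be sketched as two predicates over a reclaim request. ReclaimRequestSketch is a hypothetical class loosely modelled on the ReclaimedResource and whenToKill names mentioned above; the real fields and logic may differ.

```java
/** Hypothetical reclaim request with a kill deadline, for comparing the two choices. */
class ReclaimRequestSketch {
    final long whenToKillMillis; // time after which over-capacity tasks may be killed
    final int slotsWanted;       // slots the requesting queue is waiting for

    ReclaimRequestSketch(long createdAtMillis, long reclaimTimeMillis, int slotsWanted) {
        this.whenToKillMillis = createdAtMillis + reclaimTimeMillis;
        this.slotsWanted = slotsWanted;
    }

    /** Choice 1 (let it be): kill as soon as the deadline has passed. */
    boolean shouldKill(long nowMillis) {
        return nowMillis >= whenToKillMillis;
    }

    /** Choice 2 (delay): also require that the requesting job can actually use the slots. */
    boolean shouldKillDelayed(long nowMillis, boolean requesterReady) {
        return shouldKill(nowMillis) && requesterReady;
    }
}
```

Under choice 2, a request whose job is still in setup would not trigger kills, at the cost of reclaiming later than the configured limit.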


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653191#action_12653191 ] 

Amar Kamat commented on HADOOP-4558:
------------------------------------

Result of _test-patch_ on my box
{noformat}
     [exec] +1 overall.
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
{noformat}


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647273#action_12647273 ] 

Amar Kamat commented on HADOOP-4558:
------------------------------------

Note that once a task is assigned, the capacity scheduler is not updated to reflect the change. It waits for a heartbeat to update itself via {{updateQSIObjects()}}. I feel it's better to update the scheduler after the tasks are assigned so that the scheduler is up to date. This needs benchmarking and discussion. The use case is as follows:
- job1 is added
- job1 takes up one slot more than guaranteed
- job2 is added
- ideally the reclaim thread should detect the capacity violation and kill that one extra task, but it will not do so because the count is stale and one less than the actual value.
- upon the next heartbeat the scheduler will detect that job1 has violated its capacity and hence will start the reclaim process.
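
The use case above can be reproduced with a toy model. StaleCountSketch is hypothetical and tracks a single queue; it only contrasts a count refreshed on heartbeat with one refreshed at assignment time.

```java
/** Hypothetical single-queue model of a stale running-slot count. */
class StaleCountSketch {
    private final int guaranteedSlots;
    private int assignedSlots; // slots actually handed out to the queue's jobs
    private int cachedSlots;   // count that the reclaim logic reads

    StaleCountSketch(int guaranteedSlots) {
        this.guaranteedSlots = guaranteedSlots;
    }

    /** Assigns one task; optionally refreshes the cached count right away. */
    void assignTask(boolean refreshImmediately) {
        assignedSlots++;
        if (refreshImmediately) {
            cachedSlots = assignedSlots;
        }
    }

    /** Heartbeat-driven refresh, standing in for updateQSIObjects(). */
    void heartbeat() {
        cachedSlots = assignedSlots;
    }

    /** True if the reclaim logic can see that the queue is over its guarantee. */
    boolean violationVisible() {
        return cachedSlots > guaranteedSlots;
    }
}
```

With a guarantee of 2 slots, a third assignment made without an immediate refresh stays invisible to the reclaim check until the next heartbeat.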


[jira] Updated: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-4558:
-------------------------------

    Attachment: HADOOP-4558-v1.5.patch

Attaching a patch with changes.
Result of _test-patch_
{noformat}
     [exec] +1 overall.
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
{noformat}


[jira] Updated: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-4558:
-------------------------------

    Attachment: HADOOP-4558-v1.4.patch

Attaching a patch that changes the following:
- {{jobAdded()}} now updates the _pendingTasks_ count
- {{reclaimCapacity()}} now uses the updated QSI info
- _reclaimTime_ is now converted to milliseconds
- Added a test case for the fix
- Changed tests in {{TestCapacityScheduler}} to reflect the change
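
To illustrate the unit conversion in the list above, here is a minimal sketch, not the patch itself. It assumes the reclaim time is configured in seconds and compared against millisecond timestamps; the class {{ReclaimDeadlineSketch}} and its methods are hypothetical names, not part of the capacity scheduler.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for per-queue reclaim bookkeeping; not the real Hadoop class.
final class ReclaimDeadlineSketch {
    private final long reclaimTimeMillis;

    // Assumption: the reclaim time is configured in seconds.
    ReclaimDeadlineSketch(long reclaimTimeSeconds) {
        this.reclaimTimeMillis = TimeUnit.SECONDS.toMillis(reclaimTimeSeconds);
    }

    long reclaimTimeMillis() {
        return reclaimTimeMillis;
    }

    // True once the queue has waited at least the reclaim time for its capacity.
    boolean reclaimExpired(long startedMillis, long nowMillis) {
        return nowMillis - startedMillis >= reclaimTimeMillis;
    }

    public static void main(String[] args) {
        ReclaimDeadlineSketch q2 = new ReclaimDeadlineSketch(120); // test_q1: 2 mins
        long start = System.currentTimeMillis();
        System.out.println(q2.reclaimExpired(start, start + 119_999)); // false
        System.out.println(q2.reclaimExpired(start, start + 120_000)); // true
    }
}
```

Keeping the stored limit in the same unit as the timestamps means the comparison needs no further scaling at the point where the deadline is checked.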

Result of _test-patch_ is as follows:
{code}
     [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

{code}


[jira] Updated: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-4558:
-------------------------------

    Attachment: HADOOP-4558-v1.7.patch

Updating to trunk.


[jira] Updated: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Vivek Ratan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vivek Ratan updated HADOOP-4558:
--------------------------------

    Attachment: 4558.1.patch

Attached patch (4558.1.patch) has a fairly simple fix. We update the number of pending tasks for the queue when a job is added. Also made jobAdded() and jobRemoved() synchronized, as we're updating common data structures. 
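
The idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code in 4558.1.patch: the class {{PendingTaskTracker}}, its map, and the method signatures are hypothetical, and a removed job is assumed to take all of its pending tasks with it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-queue pending-task bookkeeping; not the real scheduler class.
final class PendingTaskTracker {
    private final Map<String, Integer> pendingByQueue = new HashMap<>();

    // Synchronized because listener callbacks and the scheduler share this map.
    synchronized void jobAdded(String queue, int pendingTasks) {
        pendingByQueue.merge(queue, pendingTasks, Integer::sum);
    }

    // Assumption: the removed job's remaining pending tasks are passed in.
    synchronized void jobRemoved(String queue, int pendingTasks) {
        pendingByQueue.computeIfPresent(queue, (q, n) -> Math.max(0, n - pendingTasks));
    }

    synchronized int pending(String queue) {
        return pendingByQueue.getOrDefault(queue, 0);
    }

    public static void main(String[] args) {
        PendingTaskTracker tracker = new PendingTaskTracker();
        tracker.jobAdded("test_q1", 126);
        // The queue shows demand as soon as the job is added,
        // without waiting for one of its tasks to be scheduled.
        System.out.println(tracker.pending("test_q1")); // 126
    }
}
```

With the count updated at submission time, a queue whose job has not yet been given any slot still registers as having demand, which is the condition a reclaim check would need to see.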


[jira] Resolved: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemanth Yamijala resolved HADOOP-4558.
--------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.20.0
     Hadoop Flags: [Reviewed]

I ran the capacity scheduler tests, and they passed. Since the patch does not touch any other component, there was no need to run any other tests.

I just committed this. Thanks, Amar!

> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4558
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Cluster Capacity Maps=Reduces =210 each
> Two Queues: 
> Q1:  default, GC (%) =40, GC=84 (Maps and Reduces each). Reclaim time = 3 mins.
> Q2: test_q1, GC (%) =60, GC=126 (Maps and Reduces each) Reclaim time = 2 mins
>            Reporter: Karam Singh
>            Assignee: Amar Kamat
>             Fix For: 0.20.0
>
>         Attachments: 4558.1.patch, HADOOP-4558-v1.4.patch, HADOOP-4558-v1.5.patch, HADOOP-4558-v1.7.patch
>
>
> Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other.
> First job submitted with tasks equal to cluster's M/R Capacity
> Second is submitted to different queue when all tasks of First Job are running, scheduler fails to reclaim capacity for second job.


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Karam Singh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645241#action_12645241 ] 

Karam Singh commented on HADOOP-4558:
-------------------------------------

Repeated the test case described above in the same environment after applying 4558.1.patch.

Observations:
1. With speculative execution = true.
   Job J2 starts running 2 mins after submission.
   Both jobs then run with:
     J1 maps = 96 and reduces = 85
     J2 maps = 114 and reduces = 125
     Here J1 is still using 12 extra map slots and 1 extra reduce slot.
     It took nearly two more minutes before J1 and J2 were both using M/R slots equal to their GCs.

2. With speculative execution = false.
   J2 starts running only when the maps of J1 start finishing.
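
The slot figures in observation 1 can be checked against the queue settings in the issue environment (210 slots, GC of 40% and 60%). The following is a small worked example; the class and method names are hypothetical and exist only for this calculation.

```java
// Hypothetical helper for checking the observed slot counts against guaranteed capacity (GC).
final class CapacityArithmetic {
    // GC in slots for a queue holding the given percentage of the cluster.
    static int guaranteedSlots(int clusterSlots, int gcPercent) {
        return clusterSlots * gcPercent / 100;
    }

    // Slots a queue is running beyond its GC (0 if at or below it).
    static int excessSlots(int running, int guaranteed) {
        return Math.max(0, running - guaranteed);
    }

    public static void main(String[] args) {
        int cluster = 210;
        int gcDefault = guaranteedSlots(cluster, 40); // queue "default" (J1)
        int gcTestQ1 = guaranteedSlots(cluster, 60);  // queue "test_q1" (J2)
        System.out.println(gcDefault + " " + gcTestQ1);    // 84 126
        System.out.println(excessSlots(96, gcDefault));    // 12 extra map slots for J1
        System.out.println(excessSlots(85, gcDefault));    // 1 extra reduce slot for J1
        System.out.println(gcTestQ1 - 114);                // 12 map slots J2 is still short
        System.out.println(gcTestQ1 - 125);                // 1 reduce slot J2 is still short
    }
}
```

The excess held by J1 matches the shortfall of J2 exactly, so at that point the cluster is full and J2 can only reach its GC by taking slots back from J1.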



[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Vivek Ratan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646273#action_12646273 ] 

Vivek Ratan commented on HADOOP-4558:
-------------------------------------

We should leave this as is. The right solution depends on how we handle HADOOP-4421. If the schedulers are aware of setup tasks and handle them directly, the solution differs from the one needed if setup tasks are handled outside of the Scheduler. 


[jira] Commented: (HADOOP-4558) Scheduler fails to reclaim capacity if Jobs are submitted to queue one after the other

Posted by "Vivek Ratan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646169#action_12646169 ] 

Vivek Ratan commented on HADOOP-4558:
-------------------------------------

I'd go with #2 (yes, you need to make sure that no code relies on the data structures for running tasks being empty when speculative execution is turned off). Granted, you're keeping extra state for jobs with speculative execution turned off, but the number of running tasks cannot exceed the cluster capacity, so you're bounded. Option #1 duplicates code between the Capacity Scheduler and JobInProgress, and Option #3 is expensive, though we do a linear scan only when killing tasks, which shouldn't happen very often. 
