Posted to common-dev@hadoop.apache.org by "Vinod K V (JIRA)" <ji...@apache.org> on 2009/01/21 14:33:59 UTC

[jira] Commented: (HADOOP-4834) Have end to end tests based on MiniMRCluster to verify correct behaviour of slot reclamation by queues.

    [ https://issues.apache.org/jira/browse/HADOOP-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665824#action_12665824 ] 

Vinod K V commented on HADOOP-4834:
-----------------------------------


While testing, I came across the following problem with the ReclaimCapacity functionality.
 - When the reclaim-capacity interval is sufficiently small (1 or 2 seconds; the default is 5), the following exception shows up repeatedly in the log. The exception is fatal to the iteration of reclaim capacity in which it occurs. The cause is that TaskStatus is populated only after a TT reports back that it has launched the task, but TaskSchedulingMgr.killTasksFromQueue has no null checks for TaskStatus. The problem is not visible with larger reclaim intervals because the TTs report back within that time, so TaskStatus is never observed to be null.

   {code}
   09/01/21 12:14:35 ERROR mapred.CapacityTaskScheduler: Error in redistributing capacity:
   java.lang.NullPointerException
        at java.util.TreeMap.getEntry(TreeMap.java:341)
        at java.util.TreeMap.get(TreeMap.java:272)
        at org.apache.hadoop.mapred.TaskInProgress.killTask(TaskInProgress.java:741)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.killTasksFromJob(CapacityTaskScheduler.java:878)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.killTasksFromQueue(CapacityTaskScheduler.java:612)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.killTasks(CapacityTaskScheduler.java:594)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.reclaimCapacity(CapacityTaskScheduler.java:531)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$800(CapacityTaskScheduler.java:362)
        at org.apache.hadoop.mapred.CapacityTaskScheduler.reclaimCapacity(CapacityTaskScheduler.java:1216)
        at org.apache.hadoop.mapred.CapacityTaskScheduler$ReclaimCapacity.run(CapacityTaskScheduler.java:1001)
        at java.lang.Thread.run(Thread.java:636)
   {code}

Inserting null checks prevents this.
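
As a rough illustration of the kind of guard meant here, the sketch below skips tasks whose status has not been reported yet instead of dereferencing it. This is not the CapacityTaskScheduler code: the class ReclaimNullCheckSketch, the Status type, and the killReported method are hypothetical stand-ins, and the TreeMap merely mimics a lookup keyed by attempt ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch only; none of these names exist in Hadoop.
class ReclaimNullCheckSketch {

    // Stand-in for a task status. It stays null until the tracker
    // reports that the task has been launched.
    static final class Status {
        final String attemptId;

        Status(String attemptId) {
            this.attemptId = attemptId;
        }
    }

    // Marks every task with a known status as killed and returns how many
    // were killed. Entries whose status is still null are skipped.
    static int killReported(List<Status> statuses, TreeMap<String, Boolean> running) {
        int killed = 0;
        for (Status status : statuses) {
            if (status == null) {
                // Not reported yet: leave it for a later reclaim iteration.
                continue;
            }
            // Without the check above, status.attemptId would throw a
            // NullPointerException and abort the whole iteration.
            Boolean isRunning = running.get(status.attemptId);
            if (Boolean.TRUE.equals(isRunning)) {
                running.put(status.attemptId, false);
                killed++;
            }
        }
        return killed;
    }

    public static void main(String[] args) {
        TreeMap<String, Boolean> running = new TreeMap<>();
        running.put("attempt_0", true);
        running.put("attempt_1", true);

        List<Status> statuses = new ArrayList<>();
        statuses.add(new Status("attempt_0"));
        statuses.add(null); // attempt_1 has not been reported yet

        System.out.println("killed: " + killReported(statuses, running));
    }
}
```

With a guard like this, a task that has not been reported is simply picked up in a later pass once its status exists, rather than failing the current one.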

> Have end to end tests based on MiniMRCluster to verify correct behaviour of slot reclamation by queues.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4834
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4834
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>            Reporter: Vinod K V
>            Assignee: Vinod K V
>
> We should have a test that submits long running jobs to different queues one after the other, and ensures that queues get required capacity or get back taken-away capacity after killing tasks within the specified amount of time.
