Posted to common-dev@hadoop.apache.org by "Karam Singh (JIRA)" <ji...@apache.org> on 2009/05/08 15:17:45 UTC

[jira] Created: (HADOOP-5794) Sometimes job does not get removed from scheduler queue after it is killed

Sometimes job does not get removed from scheduler queue after it is killed
--------------------------------------------------------------------------

                 Key: HADOOP-5794
                 URL: https://issues.apache.org/jira/browse/HADOOP-5794
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/capacity-sched
    Affects Versions: 0.20.0
            Reporter: Karam Singh


Sometimes when we kill a job, it does not get removed from the waiting queue, even though the job status is "Killed" with Job Setup and Cleanup: "Successful".
The JobTracker web UI also shows the job under the failed jobs list, and both hadoop job -list all and hadoop queue <queuename> -showJobs show the job with state=5.
Prior to the kill, the job state was "Running".


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-5794) Sometimes job does not get removed from scheduler queue after it is killed

Posted by "Vinod K V (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712051#action_12712051 ] 

Vinod K V commented on HADOOP-5794:
-----------------------------------

Beautiful! (Sorry couldn't resist myself..)



[jira] Commented: (HADOOP-5794) Sometimes job does not get removed from scheduler queue after it is killed

Posted by "rahul k singh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712013#action_12712013 ] 

rahul k singh commented on HADOOP-5794:
---------------------------------------

Analysis of the problem:
When the JobTracker is restarted, RecoveryManager tries to recover each job from the job history. RecoveryManager instantiates the JobInProgress (JIP) object, which sets its startTime to System.currentTimeMillis(). The JobInProgress constructor also sets the JobStatus startTime to the JIP's startTime. RecoveryManager then fetches the startTime from the job history and updates the JIP's startTime, but this change is not propagated to the JobStatus, so the JobStatus still holds the old startTime. These JobStatus objects are used in JobQueuesManager to categorize jobs by the state they are in. The waitingJobs data structure in JobQueuesManager uses startTime in its comparator, so the entry it holds was inserted with the old startTime.
Whenever "hadoop job -list" is run, the JobTracker's getJobStatus method is called, which sets the JobStatus startTime to the JobInProgress startTime. At this point the startTime values in the JIP and the JobStatus are consistent, but the startTime of the entry in waitingJobs in JobQueuesManager is stale. Hence, when we try to remove a finished job (completed, killed, or failed; for example after issuing "hadoop job -kill <>") from waitingJobs, nothing is removed, because the startTime seen by the comparator has changed.
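
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual Hadoop code: the Status and QueueKey classes, the BY_START_TIME comparator, and the use of a TreeMap for waitingJobs are all assumptions made for illustration. It only shows that a sorted map cannot find an entry once the value its comparator depends on has changed after insertion.

```java
import java.util.Comparator;
import java.util.TreeMap;

public class StaleStartTimeDemo {

    /** Hypothetical stand-in for a job status whose startTime can be rewritten. */
    static final class Status {
        final String jobId;
        long startTime;

        Status(String jobId, long startTime) {
            this.jobId = jobId;
            this.startTime = startTime;
        }
    }

    /** Hypothetical immutable snapshot used as the sort key of the waiting queue. */
    static final class QueueKey {
        final String jobId;
        final long startTime;

        QueueKey(Status status) {
            this.jobId = status.jobId;
            this.startTime = status.startTime;
        }
    }

    /** Orders keys by startTime, breaking ties by job id. */
    static final Comparator<QueueKey> BY_START_TIME =
        Comparator.comparingLong((QueueKey k) -> k.startTime)
                  .thenComparing((QueueKey k) -> k.jobId);

    /** Returns true if the job is removed using a key built AFTER startTime changed. */
    public static boolean removeWithFreshKey() {
        TreeMap<QueueKey, Status> waitingJobs = new TreeMap<>(BY_START_TIME);
        Status status = new Status("job_1", 2000L);   // start time assigned at recovery
        waitingJobs.put(new QueueKey(status), status);

        status.startTime = 1000L;                     // later overwritten from job history

        // The lookup key now carries 1000, the stored key still carries 2000,
        // so the comparator never reports a match.
        return waitingJobs.remove(new QueueKey(status)) != null;
    }

    /** Returns true if the job is removed using the key captured at insertion time. */
    public static boolean removeWithOriginalKey() {
        TreeMap<QueueKey, Status> waitingJobs = new TreeMap<>(BY_START_TIME);
        Status status = new Status("job_1", 2000L);
        QueueKey inserted = new QueueKey(status);
        waitingJobs.put(inserted, status);

        status.startTime = 1000L;

        // The original key still compares equal to the stored entry.
        return waitingJobs.remove(inserted) != null;
    }

    public static void main(String[] args) {
        System.out.println("remove with fresh key:    " + removeWithFreshKey());
        System.out.println("remove with original key: " + removeWithOriginalKey());
    }
}
```

In this sketch the first removal fails and the second succeeds. That suggests two possible directions for a fix: keep the startTime in JobStatus in sync before the job is ever queued, or remove and re-insert the queue entry whenever startTime is updated. Which one fits the scheduler is not decided here.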



[jira] Commented: (HADOOP-5794) Sometimes job does not get removed from scheduler queue after it is killed

Posted by "Karam Singh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707336#action_12707336 ] 

Karam Singh commented on HADOOP-5794:
-------------------------------------

Cluster setup:
Cluster Capacity = 204 maps, 204 reduces
4 queues 
Q1 Capacity Percent= 40
Q2 Capacity Percent= 40
Q3 Capacity Percent= 40
Q4 Capacity Percent= 40

Each queue has user limit=100%
Submitted 8 jobs to each queue. In total, 32 sleep jobs were submitted, each with maps=10000 (sleep time 5 secs) and reduces=2 (sleep time 1 min).
All jobs were initialized, out of which maps of 4 jobs started running. When at least 1000 maps of each job had completed, restarted the JobTracker.
After the JobTracker recovered, waited until 4 jobs had completed, then killed all remaining 28 jobs.
All jobs were killed successfully.
The JobTracker web UI displayed all killed jobs under the failed jobs list. hadoop job -list all also displayed the status of the 28 killed jobs as 5.
While browsing the jobqueue_details.jsp pages of the queues, found that 2 of the killed jobs had not been removed from the capacity scheduler's queue. Maps of both jobs were running before the kill was sent to them.
To check whether the cluster was blocked because of this, submitted 3 more jobs to each queue where the 2 killed jobs were listed, and verified that the newly submitted jobs ran successfully.
Waited up to 20 mins before shutting down the cluster.


