Posted to yarn-issues@hadoop.apache.org by "Devaraj K (JIRA)" <ji...@apache.org> on 2014/04/29 11:39:16 UTC

[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins

    [ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984151#comment-13984151 ] 

Devaraj K commented on YARN-1408:
---------------------------------

bq. So in some race conditions, it is possible that a container can get KILLED by preemption even before it reaches the RUNNING state.
This scenario can be avoided if preemption skips containers that have not yet reached the RUNNING state.
Maybe in a following cycle the container will reach the RUNNING state and can then be considered for preemption.

I think we don't need to wait for the container to move to the RUNNING state before preempting it, as long as it is otherwise eligible. If the container is eligible for preemption, its resources can be released in the current preemption cycle instead of waiting for a later cycle in which the container has reached RUNNING. That avoids the waste of launching a container only to kill it.
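
To make the two options concrete, here is a minimal sketch of a selection loop with a flag that switches between them. Everything in it (ContainerInfo, PreemptionSketch, selectForPreemption, the skipNotRunning flag) is hypothetical and only for illustration; it is not the actual ProportionalCapacityPreemptionPolicy code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified container states; the real RMContainer has more.
enum ContainerState { ALLOCATED, ACQUIRED, RUNNING }

// Hypothetical value class standing in for a scheduler-side container record.
final class ContainerInfo {
    final String id;
    final ContainerState state;
    final int memoryMb;

    ContainerInfo(String id, ContainerState state, int memoryMb) {
        this.id = id;
        this.state = state;
        this.memoryMb = memoryMb;
    }
}

final class PreemptionSketch {
    /**
     * Walks the candidates in order and picks containers until at least
     * memoryToFreeMb has been selected.
     *
     * skipNotRunning = true  : ignore containers that are not RUNNING yet
     *                          (they may be picked up in a later cycle).
     * skipNotRunning = false : take any eligible container in this cycle,
     *                          so it is never launched just to be killed.
     */
    static List<String> selectForPreemption(List<ContainerInfo> candidates,
                                            int memoryToFreeMb,
                                            boolean skipNotRunning) {
        List<String> selected = new ArrayList<>();
        int freedMb = 0;
        for (ContainerInfo c : candidates) {
            if (freedMb >= memoryToFreeMb) {
                break;
            }
            if (skipNotRunning && c.state != ContainerState.RUNNING) {
                continue;
            }
            selected.add(c.id);
            freedMb += c.memoryMb;
        }
        return selected;
    }
}
```

With one ALLOCATED and one RUNNING container of equal size, skipNotRunning=false releases the ALLOCATED one right away, while skipNotRunning=true kills the RUNNING one and lets the other go through launch first.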

> Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
> ----------------------------------------------------------------------------------------------
>
>                 Key: YARN-1408
>                 URL: https://issues.apache.org/jira/browse/YARN-1408
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Sunil G
>             Fix For: 2.5.0
>
>         Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.patch
>
>
> Capacity preemption is enabled as follows.
>  *  yarn.resourcemanager.scheduler.monitor.enable=true
>  *  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
> Queue = a,b
> Capacity of Queue A = 80%
> Capacity of Queue B = 20%
> Step 1: Submit a big jobA to queue a which uses the full cluster capacity
> Step 2: Submit a jobB to queue b which would use less than 20% of the cluster capacity
> The jobA tasks that use queue b's capacity are preempted and killed.
> This caused the problem below:
> 1. A new container was allocated for jobA in queue a as per a node update from an NM.
> 2. This container was preempted immediately by the preemption policy.
> Here the "ACQUIRED at KILLED" invalid state exception occurred when the next AM heartbeat reached the RM.
> ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED
> This also caused the task to time out after 30 minutes, as this container had already been killed by preemption.
> attempt_1380289782418_0003_m_000000_0 Timed out after 1800 secs
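
The sequence in the description above (container allocated, killed by preemption, then an ACQUIRED event arriving with the next AM heartbeat) can be reproduced with a toy state machine. This is a sketch under assumptions: the class, enum names, and transition table are hypothetical and much smaller than the real RMContainerImpl state machine, and the exception type is a plain IllegalStateException rather than Hadoop's InvalidStateTransitonException.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, reduced set of states and events for illustration only.
enum SketchState { ALLOCATED, ACQUIRED, RUNNING, KILLED }
enum SketchEvent { ACQUIRED, LAUNCHED, KILL }

final class ContainerStateMachineSketch {
    // Transition table: current state -> (event -> next state).
    private static final Map<SketchState, Map<SketchEvent, SketchState>> TRANSITIONS =
        new EnumMap<>(SketchState.class);

    static {
        add(SketchState.ALLOCATED, SketchEvent.ACQUIRED, SketchState.ACQUIRED);
        add(SketchState.ALLOCATED, SketchEvent.KILL, SketchState.KILLED);
        add(SketchState.ACQUIRED, SketchEvent.LAUNCHED, SketchState.RUNNING);
        add(SketchState.ACQUIRED, SketchEvent.KILL, SketchState.KILLED);
        add(SketchState.RUNNING, SketchEvent.KILL, SketchState.KILLED);
        // No transitions are registered for KILLED, so any event there is invalid.
    }

    private static void add(SketchState from, SketchEvent on, SketchState to) {
        TRANSITIONS
            .computeIfAbsent(from, k -> new EnumMap<SketchEvent, SketchState>(SketchEvent.class))
            .put(on, to);
    }

    private SketchState state = SketchState.ALLOCATED;

    SketchState getState() {
        return state;
    }

    // Applies the event, or throws if the table has no entry for (state, event).
    void handle(SketchEvent event) {
        Map<SketchEvent, SketchState> row = TRANSITIONS.get(state);
        SketchState next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + state);
        }
        state = next;
    }

    public static void main(String[] args) {
        ContainerStateMachineSketch container = new ContainerStateMachineSketch();
        container.handle(SketchEvent.KILL);          // preemption kills it while ALLOCATED
        try {
            container.handle(SketchEvent.ACQUIRED);  // AM heartbeat pulls the container later
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the failure comes purely from event ordering: KILL is accepted in ALLOCATED, and the later ACQUIRED event then has no registered transition out of KILLED.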



--
This message was sent by Atlassian JIRA
(v6.2#6252)