Posted to yarn-issues@hadoop.apache.org by "Wangda Tan (JIRA)" <ji...@apache.org> on 2015/11/04 02:02:27 UTC

[jira] [Commented] (YARN-3946) Allow fetching exact reason as to why a submitted app is in ACCEPTED state.

    [ https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988640#comment-14988640 ] 

Wangda Tan commented on YARN-3946:
----------------------------------

Hi [~Naganarasimha], 
Thanks for working on this. The general idea of this approach looks good. A few suggestions about what to show:
- AM launch diagnostics should have an initial value after the application is added to the scheduler:
For unmanaged AM, it should be "User launched the Application Master since it's unmanaged"
For managed AM, it should be "Added to scheduler, waiting to be scheduled" with some general suggestions about configurations to look at, such as user-limit, am-percent, queue-limit, etc.
- Looping over all applications when the queue exceeds its limit is too costly. I'd prefer to do nothing when this happens.
- After the application moves to the activated state, if the scheduler traverses it but cannot allocate any resource, you may put something like "Trying to allocate to AM on node=x, etc.". After YARN-4091 we should be able to get more detailed information about why this happened.
- Not caused by your patch: isWaitingForAMContainer checks whether the master container has been created; you may also need to check whether the application is in the recovering state, because the AM could contact the RM before the RM has recovered the AM container.
- Similar to the above, you may need to set a diagnostic message while the AM is being recovered by the RM.
- After the AM is launched, the diagnostics could be something like "AM is launched", which is better than empty text.
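
The messages suggested above could be collected in one place, roughly as in the sketch below. This is only an illustration: the class, enum and method names are hypothetical and not taken from the patch, and the wording of the configuration hint and of the recovery message is an assumption, since the comment only quotes the unmanaged, "Added to scheduler", "Trying to allocate" and "AM is launched" texts.

```java
// Hypothetical sketch; none of these names come from the YARN-3946 patch.
final class AmLaunchDiagnosticsSketch {

    /** Assumed stages of an attempt from scheduler admission to AM launch. */
    enum Stage { ADDED_TO_SCHEDULER, ACTIVATED_NOT_ALLOCATED, RECOVERING, LAUNCHED }

    /** Initial value set as soon as the attempt is added to the scheduler. */
    static String initialMessage(boolean unmanagedAM) {
        if (unmanagedAM) {
            return "User launched the Application Master since it's unmanaged";
        }
        // The configuration hint below is assumed wording.
        return "Added to scheduler, waiting to be scheduled. "
            + "Configurations to look at: user-limit, am-percent, queue-limit.";
    }

    /** Picks the diagnostic text for a stage; node is only used while allocating. */
    static String messageFor(Stage stage, boolean unmanagedAM, String node) {
        switch (stage) {
            case ADDED_TO_SCHEDULER:
                return initialMessage(unmanagedAM);
            case ACTIVATED_NOT_ALLOCATED:
                return "Trying to allocate to AM on node=" + node;
            case RECOVERING:
                // Assumed wording; the comment only asks for some message here.
                return "AM is being recovered by RM";
            case LAUNCHED:
                return "AM is launched";
            default:
                throw new IllegalArgumentException("Unknown stage: " + stage);
        }
    }
}
```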

Regarding the implementation:
- Since RMAppAttempt and SchedulerApplicationAttempt have a 1-to-1 relationship, we can save a reference to the RMAppAttempt in SchedulerApplicationAttempt, which avoids getting it from {{RMContext.getRMApps()...}}
- Since String is immutable, amLaunchDiagnostics could be volatile so we don't need to acquire locks.
- Suggest adding this to the REST API / web UI together with this patch if the changes are not complex.
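
A minimal sketch of the first two points, using simplified stand-in types rather than the real YARN classes (the interface, class and method names below are hypothetical; only the field name amLaunchDiagnostics comes from the comment):

```java
import java.util.Objects;

// Hypothetical stand-in for RMAppAttempt; the real interface is much larger.
interface RMAppAttemptSketch {
    String getAppAttemptId();
}

// Hypothetical stand-in for SchedulerApplicationAttempt.
final class SchedulerApplicationAttemptSketch {

    // 1-to-1 relationship: hold the attempt directly instead of looking it
    // up through the RM context every time it is needed.
    private final RMAppAttemptSketch rmAppAttempt;

    // String is immutable, so replacing the reference through a volatile
    // field makes the latest message visible to other threads without a lock.
    private volatile String amLaunchDiagnostics = "";

    SchedulerApplicationAttemptSketch(RMAppAttemptSketch rmAppAttempt) {
        this.rmAppAttempt = Objects.requireNonNull(rmAppAttempt);
    }

    RMAppAttemptSketch getRMAppAttempt() {
        return rmAppAttempt;
    }

    // A plain overwrite is safe with volatile; appending to the old value
    // (read-modify-write) would still need synchronization.
    void updateAMLaunchDiagnostics(String message) {
        this.amLaunchDiagnostics = message;
    }

    String getAMLaunchDiagnostics() {
        return amLaunchDiagnostics;
    }
}
```

Note that volatile only covers whole-value replacement. If the diagnostics were ever built up by appending to the previous text, a lock or an atomic reference would still be needed.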

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
> ---------------------------------------------------------------------------
>
>                 Key: YARN-3946
>                 URL: https://issues.apache.org/jira/browse/YARN-3946
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sumit Nigam
>            Assignee: Naganarasimha G R
>         Attachments: YARN-3946.v1.001.patch, YARN3946_attemptDiagnistic message.png
>
>
> Currently there is no direct way to get the exact reason as to why a submitted app is still in ACCEPTED state. It should be possible to know through RM REST API as to what aspect is not being met - say, queue limits being reached, or core/ memory requirement not being met, or AM limit being reached, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)