Posted to yarn-issues@hadoop.apache.org by "Shane Kumpf (JIRA)" <ji...@apache.org> on 2018/04/24 12:48:00 UTC

[jira] [Comment Edited] (YARN-2674) Distributed shell AM may re-launch containers if RM work preserving restart happens

    [ https://issues.apache.org/jira/browse/YARN-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449773#comment-16449773 ] 

Shane Kumpf edited comment on YARN-2674 at 4/24/18 12:47 PM:
-------------------------------------------------------------

I've spent some time looking into which issues are already open for the dshell tests, and most of the flaky tests are already being tracked.

YARN-7771 - Intermittent failures of tests that leverage TestDistributedShell#testDSShell
YARN-8078 - TestDistributedShell#testDSShellWithoutDomainV2 fails on trunk
YARN-6479 - TestDistributedShell.testDSShellWithoutDomainV1_5 fails
YARN-4385 - TestDistributedShell times out

Even with these known flaky tests commented out, I have yet to get 20 successful runs of the dshell tests. I'll continue to look into the tests as time permits, but I think we can move forward with this patch in the meantime.


was (Author: shanekumpf@gmail.com):
I've spent some time looking into which issues are already open for the dshell tests, and most of the flaky tests are already being tracked.

YARN-7771 - Intermittent failures of tests that leverage TestDistributedShell#testDSShell
YARN-8078 - TestDistributedShell#testDSShellWithoutDomainV2 fails on trunk
YARN-6479 - TestDistributedShell.testDSShellWithoutDomainV1_5 fails
YARN-4385 - TestDistributedShell times out
YARN-4350 - TestDistributedShell fails for V2 scenarios

Even with these known flaky tests commented out, I have yet to get 20 successful runs of the dshell tests. I'll continue to look into the tests as time permits, but I think we can move forward with this patch in the meantime.

> Distributed shell AM may re-launch containers if RM work preserving restart happens
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-2674
>                 URL: https://issues.apache.org/jira/browse/YARN-2674
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: applications, resourcemanager
>            Reporter: Chun Chen
>            Assignee: Shane Kumpf
>            Priority: Major
>              Labels: oct16-easy
>         Attachments: YARN-2674.1.patch, YARN-2674.2.patch, YARN-2674.3.patch, YARN-2674.4.patch, YARN-2674.5.patch, YARN-2674.6.patch
>
>
> Currently, if an RM work-preserving restart happens while distributed shell is running, the distributed shell AM may re-launch all of the containers, including new, running, and completed ones. We must make sure it won't re-launch the running or completed containers.
> We need to remove allocated containers from AMRMClientImpl#remoteRequestsTable once the AM receives them from the RM.
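
To illustrate the idea in the description above, here is a minimal, self-contained sketch in plain Java. It is a toy model only, not the Hadoop code and not the attached patch: the class PendingRequestTable, its methods, and the string "shape" keys are all hypothetical stand-ins for AMRMClientImpl#remoteRequestsTable. It assumes that after an RM restart the client re-sends whatever is still in its table, so a request that is not removed once its container arrives would be asked for again.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy model (hypothetical, not Hadoop code) of an AM-side table of
 * outstanding container requests.
 */
class PendingRequestTable {
    // Key: a request "shape" (for example priority plus memory size).
    // Value: number of containers of that shape still wanted.
    private final Map<String, Integer> outstanding = new LinkedHashMap<>();

    /** Record that the AM wants `count` more containers of the given shape. */
    void addRequest(String shape, int count) {
        outstanding.merge(shape, count, Integer::sum);
    }

    /**
     * Called when the RM hands the AM a container. The matching request is
     * satisfied, so decrement it and drop the entry once it reaches zero.
     */
    void onContainerAllocated(String shape) {
        outstanding.computeIfPresent(shape, (k, v) -> v > 1 ? v - 1 : null);
    }

    /**
     * Assumed resync behaviour: after an RM restart the AM re-sends a copy
     * of everything that is still in the table.
     */
    Map<String, Integer> resendAfterRmRestart() {
        return new LinkedHashMap<>(outstanding);
    }

    public static void main(String[] args) {
        PendingRequestTable table = new PendingRequestTable();
        table.addRequest("priority=0,mem=1024", 3);

        // Two of the three containers arrive before the RM restarts.
        table.onContainerAllocated("priority=0,mem=1024");
        table.onContainerAllocated("priority=0,mem=1024");

        // Only the one unsatisfied request is re-sent.
        System.out.println(table.resendAfterRmRestart());
    }
}
```

If onContainerAllocated were never called in this model, all three requests would be re-sent after the restart, which corresponds to the re-launch behaviour the issue describes.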


