Posted to yarn-dev@hadoop.apache.org by "Daniel Templeton (JIRA)" <ji...@apache.org> on 2016/11/30 19:15:58 UTC

[jira] [Resolved] (YARN-4665) Asynch submit can lose application submissions

     [ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Templeton resolved YARN-4665.
------------------------------------
    Resolution: Invalid

I'm closing this issue as invalid.  It turns out that what I was seeing was actually a set of quirks in RM failover, which are now addressed by YARN-5677 and YARN-5694.

> Asynch submit can lose application submissions
> ----------------------------------------------
>
>                 Key: YARN-4665
>                 URL: https://issues.apache.org/jira/browse/YARN-4665
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.1.0-beta
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>
> The change introduced in YARN-514 opens up a hole into which applications can fall and be lost.  Prior to YARN-514, the {{submitApplication()}} call did not complete until the application state was persisted to the state store.  After YARN-514, the {{submitApplication()}} call is asynchronous, with the application state being saved later.
> If the state store is slow or unresponsive, an application's state may not be persisted for quite a while.  During that time, if the RM fails (or fails over), all applications that have not yet been persisted to the state store will be lost.  If the active RM loses ZK connectivity, a significant number of job submissions can pile up before the ZK connection times out, resulting in a large batch of client failures when it finally does.
> This issue is inherent in the design of YARN-514.  I see three solutions:
> 1. Add a WAL to the state store. HBase does it, so we know how to do it. It seems like a heavy solution to the original problem, however.  It's certainly not a trivial change.
> 2. Revert YARN-514 and update the RPC layer to allow a connection to be parked if it's doing something that may take a while. This is a generally useful feature but could be a deep rabbit hole.
> 3. Revert YARN-514 and add back-pressure to the job submission. For example, we set a maximum number of threads that can simultaneously be assigned to handle job submissions.  When that threshold is reached, new job submissions get a try-again-later response. This is also a generally useful feature and should be a fairly constrained set of changes.
> I think the third option is the most approachable.  It's the smallest change, and it adds useful behavior beyond solving the original issue.
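
For illustration only, the back-pressure idea in option 3 could be sketched roughly as below. This is not YARN code and not a proposed patch: the class BoundedSubmitter, the Result enum, and the Consumer standing in for the state store are all hypothetical names invented for this sketch. It assumes the submit call persists the application synchronously (as before YARN-514) and that a fixed number of permits caps how many submissions may be in flight at once.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

/** Hypothetical sketch of option 3 (bounded, synchronous submission); not YARN code. */
final class BoundedSubmitter {
    /** Hypothetical result type. ACCEPTED means the state was persisted before returning. */
    enum Result { ACCEPTED, TRY_AGAIN_LATER }

    private final Semaphore slots;             // one permit per in-flight submission
    private final Consumer<String> stateStore; // stand-in for the real state store

    BoundedSubmitter(int maxConcurrentSubmissions, Consumer<String> stateStore) {
        this.slots = new Semaphore(maxConcurrentSubmissions);
        this.stateStore = stateStore;
    }

    Result submit(String applicationId) {
        // Non-blocking: if every slot is busy, push back on the client
        // instead of queueing the submission in memory.
        if (!slots.tryAcquire()) {
            return Result.TRY_AGAIN_LATER;
        }
        try {
            // Persist before acknowledging, so an acknowledged application
            // cannot be lost if the process fails afterwards.
            stateStore.accept(applicationId);
            return Result.ACCEPTED;
        } finally {
            slots.release();
        }
    }
}
```

In this sketch a slow state store only ties up the configured number of handlers; any submission beyond that limit is rejected immediately and can be retried by the client, rather than being acknowledged and held unpersisted.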



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
