Posted to yarn-dev@hadoop.apache.org by "Daniel Templeton (JIRA)" <ji...@apache.org> on 2016/02/02 17:43:40 UTC

[jira] [Created] (YARN-4665) Asynch submit can lose application submissions

Daniel Templeton created YARN-4665:
--------------------------------------

             Summary: Asynch submit can lose application submissions
                 Key: YARN-4665
                 URL: https://issues.apache.org/jira/browse/YARN-4665
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.1.0-beta
            Reporter: Daniel Templeton
            Assignee: Daniel Templeton
            Priority: Critical


The change introduced in YARN-514 opens up a hole into which applications can fall and be lost.  Prior to YARN-514, the {{submitApplication()}} call did not complete until the application state was persisted to the state store.  After YARN-514, the {{submitApplication()}} call is asynchronous, with the application state being saved later.

If the state store is slow or unresponsive, an application's state may not be persisted for quite a while.  During that time, if the RM fails (over), all applications that have not yet been persisted to the state store will be lost without the client being aware.
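To make the window concrete, here is a minimal sketch of the failure mode. It is not YARN code: {{AsyncSubmitDemo}}, {{lostOnFailover()}}, and the in-memory "state store" are hypothetical stand-ins, and the unresponsive store is simulated with a latch that is never released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of an asynchronous submit path; not the real RM classes.
class AsyncSubmitDemo {

    /** Returns how many acknowledged submissions were never persisted. */
    static int lostOnFailover(int submissions) {
        Set<String> stateStore = ConcurrentHashMap.newKeySet();
        List<String> acknowledged = new ArrayList<>();
        // Never counted down: models a state store that is unresponsive.
        CountDownLatch storeAvailable = new CountDownLatch(1);
        ExecutorService storeWriter = Executors.newSingleThreadExecutor();

        for (int i = 0; i < submissions; i++) {
            String appId = "application_" + i;
            // The write is queued and happens "later".
            storeWriter.execute(() -> {
                try {
                    storeAvailable.await();
                    stateStore.add(appId);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // The call returns here, so the client believes the submit succeeded.
            acknowledged.add(appId);
        }

        // Simulated RM failure: queued and in-flight writes are discarded.
        storeWriter.shutdownNow();
        try {
            storeWriter.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // Whatever was acknowledged but never stored is lost after failover.
        acknowledged.removeAll(stateStore);
        return acknowledged.size();
    }

    public static void main(String[] args) {
        System.out.println("lost submissions: " + lostOnFailover(3));
    }
}
```

Every submission is acknowledged before its write runs, so in this model nothing submitted while the store is stuck survives the failure.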

This issue is inherent in the design of YARN-514.  I see three solutions:

1. Add a write-ahead log (WAL) to the state store. HBase does it, so we know how to do it. It seems like a heavy solution to the original problem, however.  It's certainly not a trivial change.

2. Revert YARN-514 and update the RPC layer to allow a connection to be parked if it's doing something that may take a while. This is a generally useful feature but could be a deep rabbit hole.

3. Revert YARN-514 and add back-pressure to the job submission. For example, we could set a maximum number of threads that can simultaneously be assigned to handle job submissions.  When that threshold is reached, new job submissions get a try-again-later response. This is also a generally useful feature and should be a fairly constrained set of changes.  The downside is that it impacts the API.
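As a rough illustration of the third option, the sketch below bounds the number of in-flight synchronous submissions with a semaphore. All names here ({{BoundedSubmitter}}, {{StateStore}}, {{Result.TRY_AGAIN_LATER}}) are hypothetical and chosen for illustration; none of them are existing YARN APIs, and a real change would have to define how the try-again-later response is carried over RPC.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of back-pressure on a synchronous submit path.
class BoundedSubmitter {

    enum Result { ACCEPTED, TRY_AGAIN_LATER }

    /** Stand-in for the RM state store; store() blocks until the write is durable. */
    interface StateStore {
        void store(String appId);
    }

    private final Semaphore slots;
    private final StateStore store;

    BoundedSubmitter(int maxConcurrentSubmissions, StateStore store) {
        this.slots = new Semaphore(maxConcurrentSubmissions);
        this.store = store;
    }

    Result submitApplication(String appId) {
        // Do not queue: if every handler slot is busy, push back on the client.
        if (!slots.tryAcquire()) {
            return Result.TRY_AGAIN_LATER;
        }
        try {
            // Synchronous persist, as before YARN-514: return only once stored.
            store.store(appId);
            return Result.ACCEPTED;
        } finally {
            slots.release();
        }
    }

    public static void main(String[] args) {
        BoundedSubmitter submitter = new BoundedSubmitter(2, id -> { });
        System.out.println(submitter.submitApplication("application_1"));
    }
}
```

In this model an ACCEPTED response always means the state was stored, and a slow store shows up as TRY_AGAIN_LATER responses rather than as silently lost applications.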

I think the third option is the most approachable.  It's the smallest change, and it adds useful behavior beyond solving the original issue.  And I don't think the API impact is significant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)