Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2015/07/23 08:49:06 UTC

[jira] [Commented] (SPARK-9256) Message delay causes Master crash upon registering application

    [ https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638311#comment-14638311 ] 

Liang-Chi Hsieh commented on SPARK-9256:
----------------------------------------

Since FileSystemPersistenceEngine persists each ApplicationInfo to disk under a name based on an incremental appId generated in the Master, the file should not already exist: a second registration should get a different appId and therefore a different filename.
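
The argument above can be illustrated with a small sketch. This is not Spark's code: `MasterSketch`, `FileSystemPersistenceSketch`, the `app-NNNN` id format, and the `app_` file prefix are all hypothetical names chosen for illustration. It only assumes what the comment states, namely that the id is incremental and that the filename is derived from it.

```python
import os
import tempfile
from itertools import count


class FileSystemPersistenceSketch:
    """Toy stand-in for a file-backed persistence engine (not Spark's class)."""

    def __init__(self, directory):
        self.directory = directory

    def persist(self, name, payload):
        path = os.path.join(self.directory, name)
        # Mode "x" raises FileExistsError if the file is already present,
        # standing in for the "file already exists" failure in the issue.
        with open(path, "x") as f:
            f.write(payload)
        return path


class MasterSketch:
    """Hypothetical master that hands out a fresh incremental id per message."""

    def __init__(self, engine):
        self.engine = engine
        self._next_app_number = count()

    def register_application(self, client_address):
        # Assumption from the comment: every registration gets a new id,
        # so the filename derived from it is new as well.
        app_id = "app-%04d" % next(self._next_app_number)
        self.engine.persist("app_" + app_id, client_address)
        return app_id


def demo():
    with tempfile.TemporaryDirectory() as d:
        master = MasterSketch(FileSystemPersistenceSketch(d))
        first = master.register_application("client-1")
        second = master.register_application("client-1")  # duplicate message
        return first, second, sorted(os.listdir(d))
```

Under these assumptions a duplicate registration from the same client produces two distinct files rather than a name collision. A collision would only occur if the same id were persisted twice.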

> Message delay causes Master crash upon registering application
> --------------------------------------------------------------
>
>                 Key: SPARK-9256
>                 URL: https://issues.apache.org/jira/browse/SPARK-9256
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Colin Scott
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> This bug occurs when `spark.deploy.recoveryMode` is set to "FILESYSTEM", and I believe it is only possible to trigger in production when the AppClient and Master are on different machines.
> As part of initialization, the AppClient [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124] with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response.
> If the RegisteredApplication response is delayed by at least REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the RegisterApplication RPC), it is possible for the Master to receive *two* RegisterApplication messages for the same AppClient.
> Upon receiving the second RegisterApplication message, the master [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274] to persist the ApplicationInfo to disk. Since the file already exists, FileSystemPersistenceEngine [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59] an IllegalStateException, and the Master crashes.
> Incidentally, it appears that there is already a [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266] in the code to handle this scenario.
> I have a reproducing scenario for this bug on an old version of Spark (1.0.1), but upon inspecting the latest version of the code it appears that it is still possible to trigger it. Let me know if you would like reproducing steps for triggering it on the old version of Spark.
> It should be possible to trigger this bug even if the underlying transport protocol is TCP, since TCP guarantees in-order delivery only within each direction of a connection, not across the two directions.
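
The retry behaviour described in the issue can be sketched as a small timing model. It is a simplification under stated assumptions: the function name, the 20-second interval, and the limit of 3 attempts are illustrative values rather than figures taken from this thread, and transit time toward the Master is ignored. It counts how many RegisterApplication messages a client sends before the RegisteredApplication response to its first message arrives.

```python
def register_messages_sent(response_delay, retry_interval, max_attempts):
    """Count RegisterApplication messages sent before the response arrives.

    The client sends at t = 0, retry_interval, 2 * retry_interval, ... and
    stops once the response to its first message has arrived, which happens
    at t = response_delay. A response delayed by at least one retry interval
    therefore allows a second message to go out.
    """
    sent = 0
    for attempt in range(max_attempts):
        if attempt * retry_interval > response_delay:
            break  # the response arrived before this retry was due
        sent += 1
    return sent


# Illustrative values only (hypothetical, in seconds).
prompt_reply = register_messages_sent(response_delay=5, retry_interval=20, max_attempts=3)
late_reply = register_messages_sent(response_delay=25, retry_interval=20, max_attempts=3)
```

With a prompt reply the Master sees one message. With a reply delayed past the retry interval it sees two, which is the duplicate-registration case the issue describes.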



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
