Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 05:37:28 UTC

[jira] [Resolved] (SPARK-1747) check for Spark on Yarn ApplicationMaster split brain

     [ https://issues.apache.org/jira/browse/SPARK-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-1747.
---------------------------------
    Resolution: Incomplete

> check for Spark on Yarn ApplicationMaster split brain
> -----------------------------------------------------
>
>                 Key: SPARK-1747
>                 URL: https://issues.apache.org/jira/browse/SPARK-1747
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.0.0
>            Reporter: Thomas Graves
>            Priority: Major
>              Labels: bulk-closed
>
> On YARN, an application can end up in a state referred to as "split brain". One ApplicationMaster (AM) is running, and then something such as a network split leaves it unable to talk to the ResourceManager (RM). After some time the RM assumes the old attempt failed and starts a new application attempt, so you end up with two AMs. Note that the network split can prevent the first AM from talking to the RM while it keeps running and contacting its executors.
> If the previous AM does not need any more resources from the RM, it could try to commit. This causes problems when the second AM finishes and tries to commit too, and could potentially result in data corruption.
> I believe the same issue can happen in Spark, since it uses the Hadoop output formats. One class with this issue is FileOutputCommitter. It first writes to a temporary directory (task commit) and then moves the files to the final directory (job commit). The first AM could finish the job commit and tell the user it is done; the user then starts a downstream job, but the second AM comes in to do its own job commit, and files the downstream job is processing could disappear until the second AM finishes the job commit.
> This was fixed in MR by https://issues.apache.org/jira/browse/MAPREDUCE-4832
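
Editor's illustration (not part of the original report): the two-step commit described above, and one possible way to stop a second attempt from repeating the job commit, can be sketched on a local filesystem. This is only an analogy under stated assumptions. It is not Spark code, not the FileOutputCommitter implementation, and not necessarily how MAPREDUCE-4832 solved the problem. The function names, the `_COMMIT_STARTED` marker name, and the directory layout are all hypothetical, and HDFS semantics differ from a local disk.

```python
import os
import shutil
import tempfile


def try_job_commit(attempt_id, temp_dir, final_dir, marker_path):
    """Move task output into final_dir only if this attempt wins the marker.

    Returns True if this attempt performed the job commit, False if another
    attempt had already started one and this attempt backed off.
    """
    try:
        # O_CREAT | O_EXCL fails if the marker already exists, so at most one
        # attempt can get past this point (hypothetical guard, local FS only).
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as marker:
        marker.write(attempt_id)

    # "Job commit": move everything from the temporary directory to the
    # final output directory.
    os.makedirs(final_dir, exist_ok=True)
    for name in sorted(os.listdir(temp_dir)):
        shutil.move(os.path.join(temp_dir, name), os.path.join(final_dir, name))
    return True


def demo():
    """Simulate two application attempts racing to commit the same job."""
    root = tempfile.mkdtemp()
    temp_dir = os.path.join(root, "_temporary")
    final_dir = os.path.join(root, "output")
    marker = os.path.join(root, "_COMMIT_STARTED")  # hypothetical marker name

    # "Task commit": output lands in the temporary directory first.
    os.makedirs(temp_dir)
    with open(os.path.join(temp_dir, "part-00000"), "w") as f:
        f.write("data\n")

    first = try_job_commit("attempt_1", temp_dir, final_dir, marker)
    second = try_job_commit("attempt_2", temp_dir, final_dir, marker)
    files = sorted(os.listdir(final_dir))
    shutil.rmtree(root)
    return first, second, files
```

In this sketch the first attempt commits and the second backs off without touching the output directory, so a downstream reader never sees files disappear. A real fix would also have to handle an attempt that dies after creating the marker but before finishing the move, which this sketch does not address.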



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
