Posted to issues@spark.apache.org by "Marcelo Vanzin (JIRA)" <ji...@apache.org> on 2017/06/16 23:47:00 UTC

[jira] [Resolved] (SPARK-2971) Orphaned YARN ApplicationMaster lingers forever

     [ https://issues.apache.org/jira/browse/SPARK-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-2971.
-----------------------------------
    Resolution: Not A Problem

Pretty sure this was fixed at some point after 1.0; there's both the code above and a timeout in {{ApplicationMaster.waitForSparkDriver}}.
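
For readers unfamiliar with the pattern, here is a rough sketch of what a bounded wait of this kind looks like. It is not Spark's code: the function name {{wait_for_driver}}, its parameters, and the default values are all hypothetical and chosen only for illustration. The idea is that the retry loop seen in the logs below gives up once a deadline passes, so the AM can exit instead of lingering.

```python
import socket
import time


def wait_for_driver(host, port, total_wait_s=100.0, retry_interval_s=0.1,
                    connect=socket.create_connection):
    """Retry connecting to host:port until it answers or a deadline passes.

    Illustrative only: the names and defaults here are assumptions, not
    values taken from Spark. Returns True if a connection succeeded and
    False if the deadline expired first.
    """
    deadline = time.monotonic() + total_wait_s
    while True:
        try:
            conn = connect((host, port), timeout=retry_interval_s)
            conn.close()
            return True
        except OSError:
            # Covers refused connections and timeouts alike.
            if time.monotonic() >= deadline:
                # Give up instead of retrying forever, so the caller can
                # shut down cleanly.
                return False
            time.sleep(retry_interval_s)
```

A caller would check the return value and exit with a failure status on False, which is the behaviour that keeps an orphaned process from sticking around.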

> Orphaned YARN ApplicationMaster lingers forever
> -----------------------------------------------
>
>                 Key: SPARK-2971
>                 URL: https://issues.apache.org/jira/browse/SPARK-2971
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.0.2
>         Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
>            Reporter: Shay Rojansky
>
> We have cases where if CTRL-C is hit during a Spark job startup, a YARN ApplicationMaster is created but cannot connect to the driver (presumably because the driver has terminated). Once an AM enters this state it never exits it, and has to be manually killed in YARN.
> Here's an excerpt from the AM logs:
> {noformat}
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
> 14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(roji)
> 14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
> 14/08/11 16:29:40 INFO Remoting: Starting remoting
> 14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkYarnAM@g024.grid.eaglerd.local:34075]
> 14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkYarnAM@g024.grid.eaglerd.local:34075]
> 14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at master.grid.eaglerd.local/192.168.41.100:8030
> 14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: appattempt_1407759736957_0014_000001
> 14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
> 14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be reachable.
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
> 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ...
> {noformat}


