Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:21:28 UTC

[jira] [Updated] (SPARK-11228) Job stuck in Executor failure loop when NettyTransport failed to bind

     [ https://issues.apache.org/jira/browse/SPARK-11228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11228:
---------------------------------
    Labels: bulk-closed  (was: )

> Job stuck in Executor failure loop when NettyTransport failed to bind
> ---------------------------------------------------------------------
>
>                 Key: SPARK-11228
>                 URL: https://issues.apache.org/jira/browse/SPARK-11228
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.5.1
>         Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>            Reporter: Romi Kuntsman
>            Priority: Major
>              Labels: bulk-closed
>
> I changed my network connection while a local Spark cluster was running. On port 8080, I can see that the master and worker are still up.
> I'm running Spark from Java in client mode, so the driver runs inside my IDE. When I try to start a job on the local cluster, I get an endless loop of the errors shown below at #1.
> It only stops when I kill the application manually.
> The worker log shows an endless loop of the errors shown below at #2.
> The expected behaviour would be to fail the job after a few failed retries or a timeout, rather than retrying forever. (A sketch of a possible workaround appears after the worker log below.)
> (IP anonymized to 1.2.3.4)
> 1. Errors seen on driver:
> 2015-10-21 11:20:54,793 INFO  [org.apache.spark.scheduler.TaskSchedulerImpl] Adding task set 0.0 with 2 tasks
> 2015-10-21 11:20:55,847 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/1 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:55,847 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/1 removed: Command exited with code 1
> 2015-10-21 11:20:55,848 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 1
> 2015-10-21 11:20:55,848 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: app-20151021112052-0005/2 on worker-20151021090623-1.2.3.4-57305 (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:55,848 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted executor ID app-20151021112052-0005/2 on hostPort 1.2.3.4:57305 with 1 cores, 4.9 GB RAM
> 2015-10-21 11:20:55,849 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/2 is now LOADING
> 2015-10-21 11:20:55,852 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/2 is now RUNNING
> 2015-10-21 11:20:57,165 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/2 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:57,165 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/2 removed: Command exited with code 1
> 2015-10-21 11:20:57,166 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 2
> 2015-10-21 11:20:57,166 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: app-20151021112052-0005/3 on worker-20151021090623-1.2.3.4-57305 (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:57,167 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted executor ID app-20151021112052-0005/3 on hostPort 1.2.3.4:57305 with 1 cores, 4.9 GB RAM
> 2015-10-21 11:20:57,167 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/3 is now LOADING
> 2015-10-21 11:20:57,169 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/3 is now RUNNING
> 2015-10-21 11:20:58,531 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/3 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:58,531 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/3 removed: Command exited with code 1
> 2015-10-21 11:20:58,532 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 3
> 2015-10-21 11:20:58,532 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: app-20151021112052-0005/4 on worker-20151021090623-1.2.3.4-57305 (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:58,532 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted executor ID app-20151021112052-0005/4 on hostPort 1.2.3.4:57305 with 1 cores, 4.9 GB RAM
> 2015-10-21 11:20:58,533 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/4 is now LOADING
> 2015-10-21 11:20:58,535 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/4 is now RUNNING
> 2015-10-21 11:20:59,932 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/4 is now EXITED (Command exited with code 1)
> 2015-10-21 11:20:59,933 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/4 removed: Command exited with code 1
> 2015-10-21 11:20:59,933 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 4
> 2015-10-21 11:20:59,933 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor added: app-20151021112052-0005/5 on worker-20151021090623-1.2.3.4-57305 (1.2.3.4:57305) with 1 cores
> 2015-10-21 11:20:59,934 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Granted executor ID app-20151021112052-0005/5 on hostPort 1.2.3.4:57305 with 1 cores, 4.9 GB RAM
> 2015-10-21 11:20:59,935 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/5 is now LOADING
> 2015-10-21 11:20:59,937 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/5 is now RUNNING
> 2015-10-21 11:21:01,338 INFO  [org.apache.spark.deploy.client.AppClient$ClientEndpoint] Executor updated: app-20151021112052-0005/5 is now EXITED (Command exited with code 1)
> 2015-10-21 11:21:01,338 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Executor app-20151021112052-0005/5 removed: Command exited with code 1
> 2015-10-21 11:21:01,339 INFO  [org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend] Asked to remove non-existent executor 5
> 2. Errors seen on workers:
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting down Netty transport
> 15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on port 0. Attempting port 1.
> 15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated abrubtly. Attempting to shut down transports
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
> 15/10/21 11:20:53 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting down Netty transport
> 15/10/21 11:20:53 WARN Utils: Service 'driverPropsFetcher' could not bind on port 0. Attempting port 1.
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:53 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:53 ERROR Remoting: Remoting system has been terminated abrubtly. Attempting to shut down transports
> 15/10/21 11:20:53 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
> 15/10/21 11:20:53 INFO Remoting: Starting remoting
> 15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting down Netty transport
> 15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on port 0. Attempting port 1.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
> 15/10/21 11:20:54 INFO Remoting: Starting remoting
> 15/10/21 11:20:54 ERROR NettyTransport: failed to bind to /1.2.3.4:0, shutting down Netty transport
> 15/10/21 11:20:54 WARN Utils: Service 'driverPropsFetcher' could not bind on port 0. Attempting port 1.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
> 15/10/21 11:20:54 ERROR Remoting: Remoting system has been terminated abrubtly. Attempting to shut down transports
> 15/10/21 11:20:54 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
> 15/10/21 11:20:54 INFO Slf4jLogger: Slf4jLogger started
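> A possible (unverified) workaround sketch: pin the addresses Spark binds to and advertises, so that a network change does not leave the cluster holding a stale IP. On the worker side that would mean exporting SPARK_LOCAL_IP (e.g. 127.0.0.1 for a single-machine cluster) in conf/spark-env.sh before starting the master and worker; on the driver side, spark.driver.host can be pinned in the SparkConf. The master URL, address, and app name below are assumed placeholders for such a setup, not values taken from this report:
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
>
> public class StaleAddressWorkaround {
>     public static void main(String[] args) {
>         SparkConf conf = new SparkConf()
>             .setAppName("stale-address-workaround")      // placeholder app name
>             .setMaster("spark://localhost:7077")         // assumed master URL for a local standalone cluster
>             .set("spark.driver.host", "127.0.0.1");      // pin the address executors use to reach the driver
>         JavaSparkContext sc = new JavaSparkContext(conf);
>         // ... submit the job as usual ...
>         sc.stop();
>     }
> }
> Even with the addresses pinned, the underlying issue remains that executors keep being relaunched indefinitely; if I read the docs correctly, later Spark releases added spark.deploy.maxExecutorRetries to cap how many executor failures the standalone master tolerates before giving up on an application.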


