Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/09/24 07:45:00 UTC

[jira] [Updated] (SPARK-32975) [K8S] - executor fails to be restarted after it goes to ERROR/Failure state

     [ https://issues.apache.org/jira/browse/SPARK-32975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-32975:
---------------------------------
    Priority: Major  (was: Critical)

> [K8S] - executor fails to be restarted after it goes to ERROR/Failure state
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-32975
>                 URL: https://issues.apache.org/jira/browse/SPARK-32975
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.4.4
>            Reporter: Shenson Joseph
>            Priority: Major
>
> We are using version v1beta2-1.1.2-2.4.5 of the operator with Spark 2.4.4.
> Spark executors keep getting killed with exit code 1, and we see the following exception in the executor that goes to the ERROR state. Once this error happens, the driver does not restart the executor.
>  
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
> at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
> at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
> at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> ... 4 more
> Caused by: java.io.IOException: Failed to connect to act-pipeline-app-1600187491917-driver-svc.default.svc:7078
> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
> at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: act-pipeline-app-1600187491917-driver-svc.default.svc
> at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
> at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> at java.net.InetAddress.getByName(InetAddress.java:1077)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
> at java.security.AccessController.doPrivileged(Native Method)
> at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
> at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
> at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
> at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
> at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
> at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
> at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
> at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
> at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
> at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
> at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
> at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
> at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
> at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
> at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> ... 1 more
> CodeCache: size=245760Kb used=4762Kb max_used=4763Kb free=240997Kb
> bounds [0x00007f49f5000000, 0x00007f49f54b0000, 0x00007f4a04000000]
> total_blobs=1764 nmethods=1356 adapters=324
> compilation: enabled
>  
> *Additional information:*
> *The status of spark application shows it is RUNNING:*
> kubectl describe sparkapplications.sparkoperator.k8s.io act-pipeline-app
> ...
> ...
> Status:
>   Application State:
>     State:  RUNNING
>   Driver Info:
>     Pod Name:             act-pipeline-app-driver
>     Web UI Address:       10.233.57.201:40550
>     Web UI Port:          40550
>     Web UI Service Name:  act-pipeline-app-ui-svc
>   Execution Attempts:     1
>   Executor State:
>     act-pipeline-app-1600097064694-exec-1:  RUNNING
>   Last Submission Attempt Time:             2020-09-14T15:24:26Z
>   Spark Application Id:                     spark-942bb2e500c54f92ac357b818c712558
>   Submission Attempts:                      1
>   Submission ID:                            4ecdb6ca-d237-4524-b05e-c42cfcc73dc7
>   Termination Time:                         <nil>
> Events:                                     <none>
>  
> *The executor pod is reporting that it is Terminated:*
> kubectl describe pod -l sparkoperator.k8s.io/app-name=act-pipeline-app,spark-role=executor
> ...
> ...
> Containers:
>   executor:
>     Container ID:  docker://9aa5b585e8fb7390b87a4771f3ed1402cae41f0fe55905d0172ed6e90dde34e6
> ...
>     Ports:         7079/TCP, 8090/TCP
>     Host Ports:    0/TCP, 0/TCP
>     Args:
>       executor
>     State:          Terminated
>       Reason:       Error
>       Exit Code:    1
>       Started:      Mon, 14 Sep 2020 11:25:35 -0400
>       Finished:     Mon, 14 Sep 2020 11:25:39 -0400
>     Ready:          False
>     Restart Count:  0
> ...
> Conditions:
>   Type              Status
>   Initialized       True
>   Ready             False
>   ContainersReady   False
>   PodScheduled      True
> ...
> QoS Class:       Burstable
> Node-Selectors:  <none>
> Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
>                  node.kubernetes.io/unreachable:NoExecute for 300s
> Events:          <none>
> In the early stage of the driver's life, the failed executor is not detected (it is assumed to be running), and therefore it is not restarted.
>  
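The mismatch described above (the container is Terminated with exit code 1, yet the SparkApplication status still reports the executor as RUNNING) can be illustrated with a minimal reconciliation check. This is only a sketch of the kind of comparison a controller could perform, not the Spark operator's actual logic; the dict below is a hypothetical structure modeled on the `kubectl describe` output quoted in this report.

```python
# Sketch: flag an executor pod whose container terminated with a nonzero
# exit code while the operator still believes the executor is RUNNING.
# The dict layout loosely mirrors the Kubernetes pod status API
# (containerStatuses); the sample values come from the report above.

def executor_needs_restart(pod_status: dict, reported_state: str) -> bool:
    """Return True when a container terminated with a nonzero exit code
    but the reported executor state is still RUNNING."""
    for cs in pod_status.get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return reported_state == "RUNNING"
    return False

# Example mirroring the report: container exited with code 1 (Error),
# yet the SparkApplication status shows the executor as RUNNING.
pod_status = {
    "containerStatuses": [
        {
            "name": "executor",
            "state": {"terminated": {"exitCode": 1, "reason": "Error"}},
            "ready": False,
            "restartCount": 0,
        }
    ]
}
print(executor_needs_restart(pod_status, "RUNNING"))  # True
```

A check along these lines would surface the failed executor even when the failure happens early in the driver's life, before the driver itself has registered the executor.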



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org