Posted to issues@spark.apache.org by "Suman Somasundar (JIRA)" <ji...@apache.org> on 2019/06/20 00:25:00 UTC

[jira] [Commented] (SPARK-25128) multiple simultaneous job submissions against k8s backend cause driver pods to hang

    [ https://issues.apache.org/jira/browse/SPARK-25128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868138#comment-16868138 ] 

Suman Somasundar commented on SPARK-25128:
------------------------------------------

I have the same issue. When multiple jobs are submitted, the driver pods start and then the executor pods start, but the executors fail because they cannot resolve the driver service. The driver is stuck in the running state with the warning "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources".

Error in executor pod:

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
 at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
 at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
 at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
 at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
 ... 4 more
Caused by: java.io.IOException: Failed to connect to t-f5d67725474036458526157f70bc999c-driver-svc.spark-namespace.svc:7078
 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
 at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
 at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
 at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: t-f5d67725474036458526157f70bc999c-driver-svc.spark-namespace.svc
 at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
 at java.net.InetAddress.getAllByName(InetAddress.java:1192)
 at java.net.InetAddress.getAllByName(InetAddress.java:1126)
 at java.net.InetAddress.getByName(InetAddress.java:1076)
 at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
 at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
 at java.security.AccessController.doPrivileged(Native Method)
 at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
 at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
 at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
 at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
 at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
 at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
 at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
 at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
 at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
 at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
 at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
 at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
 at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:485)
 at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
 at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:103)
 at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
 at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:982)
 at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:516)
 at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:427)
 at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:486)
 at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
 at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:474)
 at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
 at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
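
For anyone hitting the same UnknownHostException: the failure in the trace above is plain DNS, i.e. the executor never gets a record for the driver's headless service. A quick way to confirm that (and to see whether the record eventually shows up) is to run a small resolution probe from a pod in the same namespace. This is only a diagnostic sketch, not Spark code; the hostname is copied from the trace above, and the retry count and delay are arbitrary:

import java.net.{InetAddress, UnknownHostException}

// Polls DNS for the driver's headless service until it resolves or we give up.
object DriverDnsProbe {
  def main(args: Array[String]): Unit = {
    val host = args.headOption.getOrElse(
      "t-f5d67725474036458526157f70bc999c-driver-svc.spark-namespace.svc")
    val maxAttempts = 30     // arbitrary choice for the sketch
    val delayMillis = 2000L  // arbitrary choice for the sketch

    var attempt = 1
    var resolved = false
    while (attempt <= maxAttempts && !resolved) {
      try {
        val addr = InetAddress.getByName(host)
        println(s"attempt $attempt: $host -> ${addr.getHostAddress}")
        resolved = true
      } catch {
        case _: UnknownHostException =>
          println(s"attempt $attempt: $host not resolvable yet")
          Thread.sleep(delayMillis)
          attempt += 1
      }
    }
    if (!resolved) println(s"$host never resolved after $maxAttempts attempts")
  }
}

If the name eventually resolves, the executor is simply racing the DNS record for the newly created service; if it never resolves, the service was most likely never created for that driver.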

> multiple simultaneous job submissions against k8s backend cause driver pods to hang
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-25128
>                 URL: https://issues.apache.org/jira/browse/SPARK-25128
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Erik Erlandson
>            Priority: Minor
>              Labels: kubernetes
>
> User is reporting that multiple "simultaneous" (or in rapid succession) job submissions against the k8s back-end are causing driver pods to hang in the "Waiting: PodInitializing" state. They filed an associated question at [stackoverflow|https://stackoverflow.com/questions/51843212/spark-driver-pod-stuck-in-waiting-podinitializing-state-in-kubernetes].
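
Until the underlying race is fixed, one way to sidestep it may be to avoid truly simultaneous submissions and stagger them instead, for example by driving spark-submit through the launcher API and pausing between jobs. The sketch below is only a workaround outline under assumed values: the master URL, namespace, container image, jar paths, main class and the 30-second gap are placeholders, not values from this report.

import org.apache.spark.launcher.SparkLauncher

// Submits the jobs one at a time instead of all at once.
object StaggeredSubmit {
  def main(args: Array[String]): Unit = {
    val appJars = Seq("job-a.jar", "job-b.jar", "job-c.jar")  // placeholder jars
    appJars.foreach { jar =>
      val submit = new SparkLauncher()
        .setMaster("k8s://https://kubernetes.default.svc")     // placeholder API server
        .setDeployMode("cluster")
        .setMainClass("com.example.Main")                      // placeholder main class
        .setAppResource(s"local:///opt/spark/jars/$jar")
        .setConf("spark.kubernetes.namespace", "spark-namespace")
        .setConf("spark.kubernetes.container.image", "my-spark:2.3.0")  // placeholder image
        // Let spark-submit return once the driver pod is created rather than
        // waiting for the whole application to finish.
        .setConf("spark.kubernetes.submission.waitAppCompletion", "false")
        .launch()
      submit.waitFor()       // wait for spark-submit itself to exit
      Thread.sleep(30000L)   // arbitrary gap before the next submission
    }
  }
}

This does not address the race itself; it only tries to give each driver's service time to be created and to get a DNS record before the next submission starts.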


