Posted to issues@spark.apache.org by "Jean-Yves STEPHAN (Jira)" <ji...@apache.org> on 2020/06/03 08:29:00 UTC

[jira] [Created] (SPARK-31898) [K8S] Driver may launch an uncontrolled number of executors if executors can't talk to the driver

Jean-Yves STEPHAN created SPARK-31898:
-----------------------------------------

             Summary: [K8S] Driver may launch an uncontrolled number of executors if executors can't talk to the driver
                 Key: SPARK-31898
                 URL: https://issues.apache.org/jira/browse/SPARK-31898
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.0.0
            Reporter: Jean-Yves STEPHAN


We launched a simple SparkPi job on k8s with dynamic allocation enabled (with shuffle tracking, as introduced in 3.0). For a reason we're still investigating, the Spark executors would start but could not talk to the driver (executor log attached below). The Spark driver kept requesting executor pods from k8s, since from its perspective the executors it had asked for never registered.
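
For context, here is a minimal sketch of the kind of submission we used. The master URL, image, and jar path are placeholders, not our exact values; the executor sizing matches the log below, and the config keys are the standard Spark 3.0 ones:
{code:bash}
# Hypothetical spark-submit: SparkPi on k8s with dynamic allocation and
# shuffle tracking (Spark 3.0). <k8s-apiserver> and <spark-3.0.0-image>
# are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name sparkpi2 \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark-apps \
  --conf spark.kubernetes.container.image=<spark-3.0.0-image> \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.executor.memory=3g \
  --conf spark.executor.cores=2 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar 10000
{code}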

The end result was that we launched an unbounded number of Spark executor pods that sat in the Running state doing nothing. The spark.dynamicAllocation.maxExecutors parameter doesn't help: the Spark app eventually filled up the k8s cluster once the autoscaler had grown the node pool to its maximum capacity.
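
Until there's a fix on the Spark side, a k8s ResourceQuota on the namespace is one way to cap the blast radius independently of Spark's own accounting. A sketch; the limits are arbitrary illustrative numbers:
{code:bash}
# Hypothetical quota capping how many pods (and how much cpu/memory) the
# spark-apps namespace can hold, so a runaway app can't consume the cluster.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-apps-quota
  namespace: spark-apps
spec:
  hard:
    pods: "50"
    requests.cpu: "100"
    requests.memory: 200Gi
EOF
{code}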

We may be able to fix this issue and https://issues.apache.org/jira/browse/SPARK-26423 at the same time: in both cases the fix amounts to cleaning up Spark executors that can't talk to the driver.
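
In the meantime, the orphaned executor pods can be cleaned up by hand. A sketch, assuming the standard labels Spark on k8s puts on executor pods (spark-role, spark-app-selector) and using the app id from the log below:
{code:bash}
# Delete all executor pods for one Spark app in the spark-apps namespace.
# The app id here comes from our run; substitute your own.
kubectl delete pods -n spark-apps \
  -l spark-role=executor,spark-app-selector=spark-fc2da45b9e1549edac73739d8132aa2d
{code}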

 

Executor log:
{code:java}
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-8/bin/java -Dspark.driver.blockManager.port=7079 -Dspark.driver.port=7078 -Xms3g -Xmx3g -cp ':/opt/spark/jars/*' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc:7078 --executor-id 55 --cores 2 --app-id spark-fc2da45b9e1549edac73739d8132aa2d --hostname 10.0.4.3
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/02 18:39:00 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 13@sparkpi2-20200602-183440-y42hy-3392e47276506f56-exec-55
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for TERM
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for HUP
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for INT
20/06/02 18:39:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/06/02 18:39:01 INFO SecurityManager: Changing view acls to: root
20/06/02 18:39:01 INFO SecurityManager: Changing modify acls to: root
20/06/02 18:39:01 INFO SecurityManager: Changing view acls groups to:
20/06/02 18:39:01 INFO SecurityManager: Changing modify acls groups to:
20/06/02 18:39:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:254)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:244)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:227)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:272)
    at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:270)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc:7078
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc
    at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at java.net.InetAddress.getByName(InetAddress.java:1077)
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
    at java.security.AccessController.doPrivileged(Native Method)
    at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
    at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
    at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
    at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
    at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
    at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
    at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
    at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:202)
    at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:48)
    at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:182)
    at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:168)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
    at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
    at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
    at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:985)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:505)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:416)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:475)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
    at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
{code}
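
The trace bottoms out in java.net.UnknownHostException on the driver's headless service name, so a first check is whether that name resolves from inside the namespace at all. A sketch (the pod name and image are arbitrary):
{code:bash}
# One-off pod that tries to resolve the driver's headless service name
# from inside the spark-apps namespace.
kubectl run dns-check -n spark-apps --rm -it --restart=Never --image=busybox -- \
  nslookup sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc
{code}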



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org