Posted to issues@spark.apache.org by "Petri (Jira)" <ji...@apache.org> on 2022/01/26 09:22:00 UTC

[jira] [Resolved] (SPARK-37999) Spark executor self-exiting due to driver disassociated in Kubernetes

     [ https://issues.apache.org/jira/browse/SPARK-37999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Petri resolved SPARK-37999.
---------------------------
    Resolution: Not A Bug

> Spark executor self-exiting due to driver disassociated in Kubernetes
> ---------------------------------------------------------------------
>
>                 Key: SPARK-37999
>                 URL: https://issues.apache.org/jira/browse/SPARK-37999
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0
>            Reporter: Petri
>            Priority: Major
>
> I have a Spark driver running in a Kubernetes pod in client deploy-mode. I have created a headless K8S service named 'lola' at port 7077 which targets the driver pod.
> The driver pod launches successfully and tries to start an executor, but the executor eventually fails with the error:
> {code:java}
> Executor self-exiting due to : Driver lola.mni-system:7077 disassociated! Shutting down.{code}
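> For reference, a minimal sketch of the client-mode settings in play (the master, host, and port values are taken from the configuration dump further below; the main class and application jar are placeholders, not the real ones):
> {code:bash}
> # Sketch only: in client mode the executors connect back to the driver via
> # spark.driver.host/spark.driver.port, so 'lola.mni-system' must resolve to
> # the driver pod through the headless service.
> spark-submit \
>   --master k8s://https://k8s-apiserver.bcmt.cluster.local:8443 \
>   --deploy-mode client \
>   --conf spark.driver.host=lola.mni-system \
>   --conf spark.driver.port=7077 \
>   --class <main-class> <application-jar>{code}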
> The driver itself stays up and running and attempts to start another executor, which fails with the same error; this goes on and on, with the driver spawning new failing executors.
> In the driver pod I see only the following errors (when using 'grep ERROR'):
> {code:java}
> 22/01/24 13:41:12 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.82.105:
> 22/01/24 13:41:56 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.82.106:
> 22/01/24 13:42:12 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.47.80: The executor with ID 7 (registered at 1643031697505 ms) was not found in the cluster at the polling time (1643031731509 ms) which is after the accepted detect delta time (30000 ms) configured by `spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been deleted but the driver missed the deletion event. Marking this executor as failed.
> 22/01/24 13:42:38 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.82.103:
> 22/01/24 13:45:30 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.50.220:{code}
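> The accepted detect delta mentioned in the executor-7 error is tunable. A hedged example of widening it (the 2min value is illustrative only; I have not verified that this helps here):
> {code:bash}
> # Default is 30s, matching the "accepted detect delta time (30000 ms)" in the log above.
> --conf spark.kubernetes.executor.missingPodDetectDelta=2min{code}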
>  
> Full log from the executor:
> {code:java}
> + source /opt/spark/bin/common.sh
> + cp /etc/group /tmp/group
> + cp /etc/passwd /tmp/passwd
> ++ id -u
> + myuid=1501
> ++ id -g
> + mygid=0
> + myuname=cspk
> + fsgid=
> + fsgrpname=cspk
> + set +e
> ++ getent passwd 1501
> + uidentry=
> ++ cat /etc/machine-id
> cat: /etc/machine-id: No such file or directory
> + export SYSTEMID=
> + SYSTEMID=
> + set -e
> + '[' -z '' ']'
> + '[' -w /tmp/group ']'
> + echo cspk:x::
> + cp /etc/passwd /tmp/passwd.template
> + '[' -z '' ']'
> + '[' -w /tmp/passwd.template ']'
> + echo 'cspk:x:1501:0:anonymous uid:/opt/spark:/bin/false'
> + envsubst
> + export LD_PRELOAD=/usr/lib64/libnss_wrapper.so
> + LD_PRELOAD=/usr/lib64/libnss_wrapper.so
> + export NSS_WRAPPER_PASSWD=/tmp/passwd
> + NSS_WRAPPER_PASSWD=/tmp/passwd
> + export NSS_WRAPPER_GROUP=/tmp/group
> + NSS_WRAPPER_GROUP=/tmp/group
> + SPARK_K8S_CMD=executor
> + case "$SPARK_K8S_CMD" in
> + shift 1
> + SPARK_CLASSPATH='/var/local/streaming_engine/*:/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sort -t_ -k4 -n
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> + env
> + sort -t_ -k4 -n
> + grep SPARK_AUTH_OPT_
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_AUTH_OPTS
> + env
> + grep SPARK_NET_CRYPTO_OPT_
> + sort -t_ -k4 -n
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_NET_CRYPTO_OPTS
> + '[' -n '' ']'
> + '[' -z ']'
> + set +x
> TLS Not enabled for WebServer
> + CMD=(${JAVA_HOME}/bin/java $EXTRAJAVAOPTS "${SPARK_EXECUTOR_JAVA_OPTS[@]}" "${SPARK_AUTH_OPTS[@]}" "${SPARK_NET_CRYPTO_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
> + exec /usr/bin/tini -s -- /etc/alternatives/jre_openjdk//bin/java -Dcom.nokia.rtna.jmx1= -Dcom.nokia.rtna.jmx2=10100 -Dlog4j.configurationFile=http://192.168.80.89:8888/log4j2.xml -Dlog4j.configuration=http://192.168.80.89:8888/log4j2.xml -Dcom.nokia.rtna.app=LolaStreamingApp -Dspark.driver.port=7077 -Xms4096m -Xmx4096m -cp '/var/local/streaming_engine/*:/opt/spark/jars/*' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@lola.mni-system:7077 --executor-id 10 --cores 3 --app-id spark-application-1643031611044 --hostname 192.168.82.121
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/var/local/streaming_engine/log4j-slf4j-impl-2.13.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/var/local/streaming_engine/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> {"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2022-01-24T13:49:16.606Z", "timezone":"UTC", "class":"dispatcher-Executor", "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", "log":"Executor self-exiting due to : Driver lola.mni-system:7077 disassociated! Shutting down.\n"}
> {"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2022-01-25T13:48:39.734Z", "timezone":"UTC", "class":"shutdown-hook-0", "method":"spark.util.Utils.logError(94)", "log":"Uncaught exception in thread shutdown-hook-0\njava.lang.InterruptedException\n\tat java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)\n\tat java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1454)\n\tat java.base/java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(Executors.java:709)\n\tat org.apache.spark.rpc.netty.MessageLoop.stop(MessageLoop.scala:60)\n\tat org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1(Dispatcher.scala:197)\n\tat org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1$adapted(Dispatcher.scala:194)\n\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n\tat org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:194)\n\tat org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:331)\n\tat org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:309)\n\tat org.apache.spark.SparkEnv.stop(SparkEnv.scala:96)\n\tat org.apache.spark.executor.Executor.stop(Executor.scala:335)\n\tat org.apache.spark.executor.Executor.$anonfun$new$2(Executor.scala:76)\n\tat org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)\n\tat org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n\tat org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)\n\tat org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)\n\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n\tat scala.util.Try$.apply(Try.scala:213)\n\tat org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)\n\tat org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n"}
> java.lang.InterruptedException: null
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
>         at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1454) ~[?:?]
>         at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(Executors.java:709) ~[?:?]
>         at org.apache.spark.rpc.netty.MessageLoop.stop(MessageLoop.scala:60) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1(Dispatcher.scala:197) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1$adapted(Dispatcher.scala:194) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at scala.collection.Iterator.foreach(Iterator.scala:943) ~[scala-library-2.12.13.jar:?]
>         at scala.collection.Iterator.foreach$(Iterator.scala:943) ~[scala-library-2.12.13.jar:?]
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) ~[scala-library-2.12.13.jar:?]
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[scala-library-2.12.13.jar:?]
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[scala-library-2.12.13.jar:?]
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[scala-library-2.12.13.jar:?]
>         at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:194) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:331) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:309) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.SparkEnv.stop(SparkEnv.scala:96) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.executor.Executor.stop(Executor.scala:335) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.executor.Executor.$anonfun$new$2(Executor.scala:76) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.13.jar:?]
>         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) ~[spark-core_2.12-3.1.2.jar:3.1.2]
>         at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [scala-library-2.12.13.jar:?]
>         at scala.util.Try$.apply(Try.scala:213) [scala-library-2.12.13.jar:?]
>         at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) [spark-core_2.12-3.1.2.jar:3.1.2]
>         at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) [spark-core_2.12-3.1.2.jar:3.1.2]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>         at java.lang.Thread.run(Thread.java:829) [?:?]
>   {code}
>  
> Here is the Spark configuration used for spark-submit:
> {code:java}
> 22/01/25 12:21:25 INFO SparkContext: Running Spark version 3.2.0
> 22/01/25 12:21:25 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
> 22/01/25 12:21:25 INFO ResourceUtils: ==============================================================
> 22/01/25 12:21:25 INFO ResourceUtils: No custom resources configured for spark.driver.
> 22/01/25 12:21:25 INFO ResourceUtils: ==============================================================
> 22/01/25 12:21:25 INFO SparkContext: Submitted application: LolaStreamingApp
> 22/01/25 12:21:25 INFO SparkContext: Spark configuration:
> spark.app.name=LolaStreamingApp
> spark.app.startTime=1643113285920
> spark.checkpoint.compress=true
> spark.cleaner.periodicGC.interval=6min
> spark.driver.cores=2
> spark.driver.extraJavaOptions=-Dcom.nokia.rtna.jmx1=10000 -Dcom.nokia.rtna.jmx2=10100 -Dlog4j.configurationFile=http://192.168.82.107:8888/log4j2.xml -Dlog4j.configuration=http://192.168.82.107:8888/log4j2.xml
> spark.driver.host=lola.mni-system
> spark.driver.memory=4g
> spark.driver.port=7077
> spark.driver.userClassPathFirst=false
> spark.dynamicAllocation.enabled=false
> spark.eventLog.dir=hdfs://namenodeHA/apaas/spark-history-server
> spark.eventLog.enabled=false
> spark.executor.cores=3
> spark.executor.extraClassPath=/var/local/streaming_engine/*
> spark.executor.extraJavaOptions=-Dcom.nokia.rtna.jmx1= -Dcom.nokia.rtna.jmx2=10100 -Dlog4j.configurationFile=http://192.168.82.107:8888/log4j2.xml -Dlog4j.configuration=http://192.168.82.107:8888/log4j2.xml -Dcom.nokia.rtna.app="LolaStreamingApp"
> spark.executor.heartbeatInterval=20s
> spark.executor.instances=1
> spark.executor.memory=4g
> spark.executorEnv.HDFSLOCAL=true
> spark.executorEnv.IGNITE_ALARM_THRESHOLD_PCT=95
> spark.executorEnv.IGNITE_CACHE_SIZE_MB=3100
> spark.executorEnv.NODLKMODE=true
> spark.executorEnv.SPARK_APPNAME=LolaStreamingApp
> spark.executorEnv.SPARK_LOCAL_DIRS=/tmp/spark-local
> spark.extraListeners=com.nokia.rtna.framework.util.FwkSparkListener
> spark.jars=*********(redacted)
> spark.kryo.classesToRegister=org.apache.avro.generic.GenericData,org.apache.avro.generic.GenericData$Record,io.confluent.kafka.schemaregistry.client.rest.entities.Schema,com.nokia.rtna.framework.data.Tuple,com.nokia.rtna.framework.data.AvroData,com.nokia.rtna.framework.processor.context.ContextState,com.nokia.rtna.framework.processor.counter.recordcounter.GenericCounterKey,com.nokia.rtna.framework.processor.counter.recordcounter.GenericCounterRecord,com.nokia.rtna.framework.data.reduced.ReducedRecord,com.nokia.rtna.framework.data.reduced.ReducedArray
> spark.kubernetes.allocation.batch.size=60
> spark.kubernetes.container.image=rtna-docker-inprogress.repo.lab.pl.alcatel-lucent.com/ca4mn/ca4mn-se-engine-driver-and-executor:0.7.7-RTNA12538.206
> spark.kubernetes.container.image.pullPolicy=Always
> spark.kubernetes.driver.annotation.prometheus.io/port=1235
> spark.kubernetes.driver.annotation.prometheus.io/scrape=true
> spark.kubernetes.driver.label.LolaStreamingApp=LolaStreamingApp-driver
> spark.kubernetes.driver.label.StreamingEngineFunction=driver
> spark.kubernetes.driver.label.driverlabel=lola-lolaengine-797b666464-jcnrc
> spark.kubernetes.driver.limit.cores=4
> spark.kubernetes.driver.pod.name=lola-lolaengine-797b666464-jcnrc
> spark.kubernetes.driver.request.cores=2
> spark.kubernetes.driverEnv.IGNITE_ALARM_THRESHOLD_PCT=95
> spark.kubernetes.driverEnv.IGNITE_CACHE_SIZE_MB=3100
> spark.kubernetes.driverEnv.SPARK_APPNAME=LolaStreamingApp
> spark.kubernetes.driverEnv.SPARK_LOCAL_DIRS=/tmp/spark-local
> spark.kubernetes.executor.annotation.prometheus.io/port=1234
> spark.kubernetes.executor.annotation.prometheus.io/scrape=true
> spark.kubernetes.executor.label.StreamingEngineFunction=executor
> spark.kubernetes.executor.limit.cores=5
> spark.kubernetes.executor.memoryOverhead=3800
> spark.kubernetes.executor.request.cores=2
> spark.kubernetes.executor.volumes.persistentVolumeClaim.glustervol.mount.path=/mnt/distributedDisk
> spark.kubernetes.executor.volumes.persistentVolumeClaim.glustervol.options.claimName=lolastreamingapp
> spark.kubernetes.executor.volumes.persistentVolumeClaim.glustervol.readOnly=false
> spark.kubernetes.file.upload.path=/var/local/streaming_engine
> spark.kubernetes.memoryOverheadFactor=0.99
> spark.kubernetes.namespace=mni-system
> spark.kubernetes.shuffle.labels=app=spark-shuffle-service,spark-version=2.2.0
> spark.kubernetes.shuffle.namespace=mni-system
> spark.local.dir=/tmp/spark-local
> spark.locality.wait=3
> spark.logConf=true
> spark.master=k8s://https://k8s-apiserver.bcmt.cluster.local:8443
> spark.memory.offHeap.enabled=true
> spark.memory.offHeap.size=300000000
> spark.memory.storageFraction=0.8
> spark.repl.local.jars=*********(redacted)
> spark.scheduler.listenerbus.eventqueue.size=15000
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.shuffle.service.enabled=false
> spark.stage.maxConsecutiveAttempts=16
> spark.streaming.backpressure.enabled=false
> spark.streaming.blockInterval=200ms
> spark.streaming.kafka.consumer.cache.enabled=false
> spark.streaming.kafka.maxRatePerPartition=100000
> spark.submit.deployMode=client
> spark.submit.pyFiles=
> spark.task.maxFailures=16
> spark.ui.killEnabled=false
> spark.ui.showConsoleProgress=false {code}
>  
> Here is the headless service resource spec:
> {code:java}
> Name:              lola
> Namespace:         mni-system
> Labels:            <none>
> Annotations:       helm.fluxcd.io/antecedent: mni-system:helmrelease/lola
> Selector:          app=lolaengine
> Type:              ClusterIP
> IP:                None
> Port:              driver-rpc  7077/TCP
> TargetPort:        7077/TCP
> Endpoints:         192.168.47.85:7077
> Session Affinity:  None
> Events:            <none>
> {code}
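> For completeness, an equivalent manifest for that service, reconstructed from the describe output above (a sketch; it assumes the driver pod actually carries the label app=lolaengine named in the selector):
> {code:bash}
> # Headless service (clusterIP: None) so that lola.mni-system resolves
> # directly to the driver pod IP listed under Endpoints above.
> cat <<'EOF' | kubectl apply -n mni-system -f -
> apiVersion: v1
> kind: Service
> metadata:
>   name: lola
> spec:
>   clusterIP: None
>   selector:
>     app: lolaengine
>   ports:
>     - name: driver-rpc
>       port: 7077
>       targetPort: 7077
> EOF{code}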
>  
> Earlier I had a Scala-version-related error in the first executor, followed by the driver-disassociated error in the subsequent executors. I have since fixed the Scala version error; now the first executor also self-exits with the driver-disassociated error.



