Posted to issues@spark.apache.org by "Petri (Jira)" <ji...@apache.org> on 2022/01/24 13:58:00 UTC

[jira] [Created] (SPARK-37999) Spark executor self-exiting due to driver disassociated in Kubernetes

Petri created SPARK-37999:
-----------------------------

             Summary: Spark executor self-exiting due to driver disassociated in Kubernetes
                 Key: SPARK-37999
                 URL: https://issues.apache.org/jira/browse/SPARK-37999
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.2.0
            Reporter: Petri


I have a Spark driver running in a Kubernetes pod in client deploy mode. I have created a headless K8S service named 'lola' at port 7077 that targets the driver pod.
The driver pod launches successfully and tries to start an executor, but the executor eventually fails with the error:
{code:java}
Executor self-exiting due to : Driver lola.mni-system:7077 disassociated! Shutting down.{code}
The driver then stays up and attempts to start another executor, which fails with the same error. This repeats indefinitely, with the driver spawning new failing executors.
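
Since executors connect back to the driver through the headless service, one basic check is whether the service name resolves and accepts TCP connections from inside an executor-side pod. The following is a minimal sketch using only the Python standard library; the helper name `can_reach` is hypothetical, and a successful connection does not by itself rule out other causes of the disassociation.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    DNS failures, refused connections, and timeouts all raise OSError
    (or a subclass), so they are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from a pod in the same namespace as the executors, `can_reach("lola.mni-system", 7077)` would show whether the address from the executor's `--driver-url` is reachable at the TCP level.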

In the driver pod, I see only the following errors (when using 'grep ERROR'):
{code:java}
22/01/24 13:41:12 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.82.105:
22/01/24 13:41:56 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.82.106:
22/01/24 13:42:12 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.47.80: The executor with ID 7 (registered at 1643031697505 ms) was not found in the cluster at the polling time (1643031731509 ms) which is after the accepted detect delta time (30000 ms) configured by `spark.kubernetes.executor.missingPodDetectDelta`. The executor may have been deleted but the driver missed the deletion event. Marking this executor as failed.
22/01/24 13:42:38 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.82.103:
22/01/24 13:45:30 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.50.220:{code}
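
The message for executor 7 can be checked by hand: the gap between its registration time and the polling time is compared against the 30000 ms value of `spark.kubernetes.executor.missingPodDetectDelta`. A short sketch of that arithmetic, using only the numbers from the log line above (the variable names are illustrative):

```python
# Values copied from the "Lost executor 7" log line.
registered_ms = 1643031697505    # executor registration time (epoch ms)
polled_ms = 1643031731509        # polling time (epoch ms)
detect_delta_ms = 30000          # spark.kubernetes.executor.missingPodDetectDelta

elapsed_ms = polled_ms - registered_ms
exceeded = elapsed_ms > detect_delta_ms

print(elapsed_ms)  # 34004
print(exceeded)    # True
```

So the executor pod was already missing about 34 seconds after it registered, which is past the 30-second delta and is why the driver marked it as failed.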
 

Full log from the executor:
{code:java}
+ source /opt/spark/bin/common.sh
+ cp /etc/group /tmp/group
+ cp /etc/passwd /tmp/passwd
++ id -u
+ myuid=1501
++ id -g
+ mygid=0
+ myuname=cspk
+ fsgid=
+ fsgrpname=cspk
+ set +e
++ getent passwd 1501
+ uidentry=
++ cat /etc/machine-id
cat: /etc/machine-id: No such file or directory
+ export SYSTEMID=
+ SYSTEMID=
+ set -e
+ '[' -z '' ']'
+ '[' -w /tmp/group ']'
+ echo cspk:x::
+ cp /etc/passwd /tmp/passwd.template
+ '[' -z '' ']'
+ '[' -w /tmp/passwd.template ']'
+ echo 'cspk:x:1501:0:anonymous uid:/opt/spark:/bin/false'
+ envsubst
+ export LD_PRELOAD=/usr/lib64/libnss_wrapper.so
+ LD_PRELOAD=/usr/lib64/libnss_wrapper.so
+ export NSS_WRAPPER_PASSWD=/tmp/passwd
+ NSS_WRAPPER_PASSWD=/tmp/passwd
+ export NSS_WRAPPER_GROUP=/tmp/group
+ NSS_WRAPPER_GROUP=/tmp/group
+ SPARK_K8S_CMD=executor
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH='/var/local/streaming_engine/*:/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ env
+ sort -t_ -k4 -n
+ grep SPARK_AUTH_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_AUTH_OPTS
+ env
+ grep SPARK_NET_CRYPTO_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_NET_CRYPTO_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ set +x
TLS Not enabled for WebServer
+ CMD=(${JAVA_HOME}/bin/java $EXTRAJAVAOPTS "${SPARK_EXECUTOR_JAVA_OPTS[@]}" "${SPARK_AUTH_OPTS[@]}" "${SPARK_NET_CRYPTO_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /etc/alternatives/jre_openjdk//bin/java -Dcom.nokia.rtna.jmx1= -Dcom.nokia.rtna.jmx2=10100 -Dlog4j.configurationFile=http://192.168.80.89:8888/log4j2.xml -Dlog4j.configuration=http://192.168.80.89:8888/log4j2.xml -Dcom.nokia.rtna.app=LolaStreamingApp -Dspark.driver.port=7077 -Xms4096m -Xmx4096m -cp '/var/local/streaming_engine/*:/opt/spark/jars/*' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@lola.mni-system:7077 --executor-id 10 --cores 3 --app-id spark-application-1643031611044 --hostname 192.168.82.121
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/var/local/streaming_engine/log4j-slf4j-impl-2.13.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/var/local/streaming_engine/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
{"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2022-01-24T13:49:16.606Z", "timezone":"UTC", "class":"dispatcher-Executor", "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", "log":"Executor self-exiting due to : Driver lola.mni-system:7077 disassociated! Shutting down.\n"}
 {code}
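
One detail visible in the executor log is that `spark-unsafe_2.12-3.1.2.jar` is loaded from `/var/local/streaming_engine/`, while this issue lists 3.2.0 as the affected version and `/opt/spark/jars/*` is also on the classpath. Whether this is related to the disassociation is not established here. As a sketch, the Spark artifact versions mentioned in a log can be listed with a small standard-library script; the function name `spark_jar_versions` and the regular expression are assumptions made for illustration.

```python
import re

# Matches names such as spark-unsafe_2.12-3.1.2.jar and captures
# (artifact, Scala version, Spark version).
SPARK_JAR = re.compile(r"(spark-[a-z-]+)_(\d+\.\d+)-(\d+\.\d+\.\d+)\.jar")


def spark_jar_versions(log_text: str) -> list[tuple[str, str, str]]:
    """Return the sorted, de-duplicated Spark jar versions found in log_text."""
    return sorted(set(SPARK_JAR.findall(log_text)))


sample = (
    "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform "
    "(file:/var/local/streaming_engine/spark-unsafe_2.12-3.1.2.jar) "
    "to constructor java.nio.DirectByteBuffer(long,int)"
)
print(spark_jar_versions(sample))  # [('spark-unsafe', '2.12', '3.1.2')]
```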


