Posted to user@spark.apache.org by Philipp Kraus <ph...@gmail.com> on 2021/05/19 09:18:18 UTC

Spark Executor dies in K8 cluster

Hello,

I have got the following first testing setup:



Kubernetes cluster 1.20 (4 nodes, each with a 120 GB hard disk, 4 CPUs, and 40 GB memory)

Spark installed via the Bitnami Helm chart https://artifacthub.io/packages/helm/bitnami/spark (chart version 5.4.2 / Spark 3.1.1)

Using GeoSpark version 1.3.2-SNAPSHOT (not Apache Sedona, because of migration issues) with the cluster setup from https://sedona.apache.org/download/cluster/, i.e. spark.driver.memory 10g, spark.network.timeout 1000s, spark.driver.maxResultSize 5g

Creating a fat-jar Spring Boot application which runs some Spark algorithms on AdoptOpenJDK 1.8 (latest Docker image) with Spark 3.1.1 and Scala 2.12

Using the NFS Server Provisioner Helm chart https://artifacthub.io/packages/helm/kvaps/nfs-server-provisioner to create a ReadWriteMany volume for the Spark workers and the application. On all pods this volume is mounted under /sparkdir, so the fat-jar file is stored there (see the small sanity check after this list)

Spark workers are configured via Helm as a ReplicaSet with autoscaling, so at 75% CPU usage a new worker should be spawned; by default 2 worker pods are running

The Spark master UI shows the workers with the correct memory and CPU resources (4 cores and 10 GB memory for each worker)

Application and Spark are running in the same namespace
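
For the shared /sparkdir mount, a small sanity check in the application before building the Spark config could look like this (just a sketch; the jar path is the one from the setup above):

import java.nio.file.Files;
import java.nio.file.Paths;

// verify that the fat jar is actually visible on the shared NFS mount inside this pod
if ( !Files.isReadable( Paths.get( "/sparkdir/myspringapp.jar" ) ) )
    throw new IllegalStateException( "fat jar is not readable under /sparkdir" );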




In the Spring-Boot application (running as a Docker image) I create a Spark config as follows (Helm release name "test"):

import java.util.stream.Stream;
import org.apache.spark.SparkConf;

final String l_jar = "/sparkdir/myspringapp.jar";

// standalone master service created by the Helm release "test"
final SparkConf l_conf = new SparkConf()
        .setMaster( "spark://test-spark-master-svc:7077" )
        .setAppName( "mySpringBootApp" )
        // ship the fat jar from the shared NFS mount to the executors
        .setJars( Stream.of( l_jar ).toArray( String[]::new ) )
        .set( "spark.jars", l_jar )
        // userClassPathFirst expects a boolean value, not the jar path
        .set( "spark.driver.userClassPathFirst", "true" )
        .set( "spark.kubernetes.container.image", "bitnami/spark:3.1.1" )
        .set( "spark.submit.deployMode", "cluster" )
        .set( "spark.driver.memory", "10G" )
        .set( "spark.executor.memory", "4G" )
        .set( "spark.network.timeout", "1000s" )
        .set( "spark.driver.maxResultSize", "5G" );

If I start the application and run the Spark execution, the master gets the job and passes it to the workers; this works fine, but on the workers I get an error on the executors. See the log of one worker:

This script is deprecated, use start-worker.sh
starting org.apache.spark.deploy.worker.Worker, logging to /opt/bitnami/spark/logs/spark--org.apache.spark.deploy.worker.Worker-1-test-spark-worker-0.out
Spark Command: /opt/bitnami/java/bin/java -cp /opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://test-spark-master-svc:7077
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/05/18 18:56:11 INFO Worker: Started daemon with process name: 41@test-spark-worker-0
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for TERM
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for HUP
21/05/18 18:56:11 INFO SignalUtils: Registering signal handler for INT
21/05/18 18:56:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/05/18 18:56:11 INFO SecurityManager: Changing view acls to: spark
21/05/18 18:56:11 INFO SecurityManager: Changing modify acls to: spark
21/05/18 18:56:11 INFO SecurityManager: Changing view acls groups to: 
21/05/18 18:56:11 INFO SecurityManager: Changing modify acls groups to: 
21/05/18 18:56:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
21/05/18 18:56:11 INFO Utils: Successfully started service 'sparkWorker' on port 35561.
21/05/18 18:56:11 INFO Worker: Worker decommissioning not enabled, SIGPWR will result in exiting.
21/05/18 18:56:12 INFO Worker: Starting Spark worker 10.223.130.87:35561 with 4 cores, 10.0 GiB RAM
21/05/18 18:56:12 INFO Worker: Running Spark version 3.1.1
21/05/18 18:56:12 INFO Worker: Spark home: /opt/bitnami/spark
21/05/18 18:56:12 INFO ResourceUtils: ==============================================================
21/05/18 18:56:12 INFO ResourceUtils: No custom resources configured for spark.worker.
21/05/18 18:56:12 INFO ResourceUtils: ==============================================================
21/05/18 18:56:12 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
21/05/18 18:56:12 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://test-spark-worker-0.test-spark-headless.workflow.svc.cluster.local:8081
21/05/18 18:56:12 INFO Worker: Connecting to master test-spark-master-svc:7077...
21/05/18 18:56:12 INFO TransportClientFactory: Successfully created connection to test-spark-master-svc/10.233.8.202:7077 after 31 ms (0 ms spent in bootstraps)
21/05/18 18:56:12 INFO Worker: Successfully registered with master spark://test-spark-master-0.test-spark-headless.workflow.svc.cluster.local:7077



---------- the next lines are shown on all workers in an infinite loop until I kill the application on the Spark master ----------

21/05/18 20:46:55 INFO Worker: Asked to launch executor app-20210518204655-0000/1 for f212b4b4-05df-4f22-a580-87cbe5fb9356
21/05/18 20:46:55 INFO SecurityManager: Changing view acls to: spark
21/05/18 20:46:55 INFO SecurityManager: Changing modify acls to: spark
21/05/18 20:46:55 INFO SecurityManager: Changing view acls groups to: 
21/05/18 20:46:55 INFO SecurityManager: Changing modify acls groups to: 
21/05/18 20:46:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()

21/05/18 20:46:55 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx4096M" "-Dspark.network.timeout=1000s" "-Dspark.driver.port=41904" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@test-workflowengine-567454667d-6c7p7:41904" "--executor-id" "1" "--hostname" "10.223.130.87" "--cores" "4" "--app-id" "app-20210518204655-0000" "--worker-url" "spark://Worker@10.223.130.87:35561"

21/05/18 20:46:56 INFO Worker: Executor app-20210518204655-0000/1 finished with state EXITED message Command exited with code 1 exitStatus 1
21/05/18 20:46:56 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 1
21/05/18 20:46:56 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20210518204655-0000, execId=1)




I have set up an equivalent Spark structure with docker-compose using the same configuration values (also in cluster mode there), and it works. On the K8 setup, however, the executor fails, and I don't know how to find out what goes wrong or how to fix it. I would appreciate some help in getting more information about what goes wrong and what I can do to fix this issue; I don't know whether this is an error in my K8 configuration, in the application code for the Spark initialization, or in my worker / Spark configuration.
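
For reference, one thing I can do on the driver side to compare the two environments is to dump the effective Spark configuration (a small sketch using the l_conf variable from the snippet above):

import java.util.Arrays;

// print every key/value pair of the resolved Spark configuration
Arrays.stream( l_conf.getAll() )
      .forEach( i -> System.out.println( i._1() + " = " + i._2() ) );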

Thanks for your help

Phil