Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/10/28 21:39:00 UTC

[jira] [Created] (SPARK-33276) Fix K8s IT Flakiness

Dongjoon Hyun created SPARK-33276:
-------------------------------------

             Summary: Fix K8s IT Flakiness
                 Key: SPARK-33276
                 URL: https://issues.apache.org/jira/browse/SPARK-33276
             Project: Spark
          Issue Type: Sub-task
          Components: Kubernetes, Tests
    Affects Versions: 3.0.1, 3.1.0
            Reporter: Dongjoon Hyun


The following two consecutive runs use the same git hash, a744fea3be12f1a53ab553040b95da730210bc88:
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/646/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/647/

However, the second one fails while the first one succeeds.
{code}
KubernetesSuite:
- Run SparkPi with no resources *** FAILED ***
  The code passed to eventually never returned normally. Attempted 190 times over 3.002699493366667 minutes. Last failure message: false was not true. (KubernetesSuite.scala:383)
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup *** FAILED ***
  The code passed to eventually never returned normally. Attempted 184 times over 3.0172133493666666 minutes. Last failure message: "++ id -u
  + myuid=185
  ++ id -g
  + mygid=0
  + set +e
  ++ getent passwd 185
  + uidentry=
  + set -e
  + '[' -z '' ']'
  + '[' -w /etc/passwd ']'
  + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
  + SPARK_CLASSPATH=':/opt/spark/jars/*'
  + env
  + grep SPARK_JAVA_OPT_
  + sort -t_ -k4 -n
  + sed 's/[^=]*=\(.*\)/\1/g'
  + readarray -t SPARK_EXECUTOR_JAVA_OPTS
  + '[' -n '' ']'
  + '[' 3 == 3 ']'
  ++ python3 -V
  + pyv3='Python 3.7.3'
  + export PYTHON_VERSION=3.7.3
  + PYTHON_VERSION=3.7.3
  + export PYSPARK_PYTHON=python3
  + PYSPARK_PYTHON=python3
  + export PYSPARK_DRIVER_PYTHON=python3
  + PYSPARK_DRIVER_PYTHON=python3
  + '[' -n '' ']'
  + '[' -z ']'
  + case "$1" in
  + shift 1
  + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
  + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/tests/decommissioning_cleanup.py
  20/10/28 19:47:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  Starting decom test
  Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  20/10/28 19:47:29 INFO SparkContext: Running Spark version 3.1.0-SNAPSHOT
  20/10/28 19:47:29 INFO ResourceUtils: ==============================================================
  20/10/28 19:47:29 INFO ResourceUtils: No custom resources configured for spark.driver.
  20/10/28 19:47:29 INFO ResourceUtils: ==============================================================
  20/10/28 19:47:29 INFO SparkContext: Submitted application: DecomTest
  20/10/28 19:47:29 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
  20/10/28 19:47:29 INFO ResourceProfile: Limiting resource is cpus at 1 tasks per executor
  20/10/28 19:47:29 INFO ResourceProfileManager: Added ResourceProfile id: 0
  20/10/28 19:47:29 INFO SecurityManager: Changing view acls to: 185,jenkins
  20/10/28 19:47:29 INFO SecurityManager: Changing modify acls to: 185,jenkins
  20/10/28 19:47:29 INFO SecurityManager: Changing view acls groups to: 
  20/10/28 19:47:29 INFO SecurityManager: Changing modify acls groups to: 
  20/10/28 19:47:29 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(185, jenkins); groups with view permissions: Set(); users  with modify permissions: Set(185, jenkins); groups with modify permissions: Set()
  20/10/28 19:47:29 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
  20/10/28 19:47:29 INFO SparkEnv: Registering MapOutputTracker
  20/10/28 19:47:29 INFO SparkEnv: Registering BlockManagerMaster
  20/10/28 19:47:29 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
  20/10/28 19:47:29 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
  20/10/28 19:47:29 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
  20/10/28 19:47:29 INFO DiskBlockManager: Created local directory at /var/data/spark-3d1c3c4d-9961-4aa3-a386-4cb183112d0b/blockmgr-353b0ccf-1847-45c8-a709-8a862432b76a
  20/10/28 19:47:29 INFO MemoryStore: MemoryStore started with capacity 593.9 MiB
  20/10/28 19:47:29 INFO SparkEnv: Registering OutputCommitCoordinator
  20/10/28 19:47:30 INFO Utils: Successfully started service 'SparkUI' on port 4040.
  20/10/28 19:47:30 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:4040
  20/10/28 19:47:30 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
  20/10/28 19:47:31 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes.
  20/10/28 19:47:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
  20/10/28 19:47:31 INFO NettyBlockTransferService: Server created on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079
  20/10/28 19:47:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
  20/10/28 19:47:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc, 7079, None)
  20/10/28 19:47:31 INFO BlockManagerMasterEndpoint: Registering block manager spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 with 593.9 MiB RAM, BlockManagerId(driver, spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc, 7079, None)
  20/10/28 19:47:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc, 7079, None)
  20/10/28 19:47:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc, 7079, None)
  20/10/28 19:47:31 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
  20/10/28 19:47:32 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
  20/10/28 19:47:32 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
  20/10/28 19:47:36 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.5:53038) with ID 1,  ResourceProfileId 0
  20/10/28 19:47:36 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.6:48612) with ID 2,  ResourceProfileId 0
  20/10/28 19:47:36 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.5:44019 with 593.9 MiB RAM, BlockManagerId(1, 172.17.0.5, 44019, None)
  20/10/28 19:47:36 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.6:41183 with 593.9 MiB RAM, BlockManagerId(2, 172.17.0.6, 41183, None)
  20/10/28 19:48:01 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
  20/10/28 19:48:02 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
  20/10/28 19:48:02 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
  20/10/28 19:48:03 INFO SparkContext: Starting job: collect at /opt/spark/tests/decommissioning_cleanup.py:47
  20/10/28 19:48:03 INFO DAGScheduler: Registering RDD 7 (groupByKey at /opt/spark/tests/decommissioning_cleanup.py:46) as input to shuffle 0
  20/10/28 19:48:03 INFO DAGScheduler: Got job 0 (collect at /opt/spark/tests/decommissioning_cleanup.py:47) with 5 output partitions
  20/10/28 19:48:03 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /opt/spark/tests/decommissioning_cleanup.py:47)
  20/10/28 19:48:03 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
  20/10/28 19:48:03 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
  20/10/28 19:48:03 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[7] at groupByKey at /opt/spark/tests/decommissioning_cleanup.py:46), which has no missing parents
  20/10/28 19:48:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 10.2 KiB, free 593.9 MiB)
  20/10/28 19:48:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.3 KiB, free 593.9 MiB)
  20/10/28 19:48:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 (size: 6.3 KiB, free: 593.9 MiB)
  20/10/28 19:48:03 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1348
  20/10/28 19:48:03 INFO DAGScheduler: Submitting 5 missing tasks from ShuffleMapStage 0 (PairwiseRDD[7] at groupByKey at /opt/spark/tests/decommissioning_cleanup.py:46) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
  20/10/28 19:48:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 5 tasks resource profile 0
  20/10/28 19:48:03 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (172.17.0.6, executor 2, partition 0, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:03 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (172.17.0.5, executor 1, partition 1, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.17.0.5:44019 (size: 6.3 KiB, free: 593.9 MiB)
  20/10/28 19:48:04 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.17.0.6:41183 (size: 6.3 KiB, free: 593.9 MiB)
  20/10/28 19:48:05 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (172.17.0.6, executor 2, partition 2, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:05 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1918 ms on 172.17.0.6 (executor 2) (1/5)
  20/10/28 19:48:05 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 46789
  20/10/28 19:48:05 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (172.17.0.5, executor 1, partition 3, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:05 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1992 ms on 172.17.0.5 (executor 1) (2/5)
  20/10/28 19:48:05 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (172.17.0.6, executor 2, partition 4, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:05 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 137 ms on 172.17.0.6 (executor 2) (3/5)
  20/10/28 19:48:06 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 114 ms on 172.17.0.5 (executor 1) (4/5)
  20/10/28 19:48:06 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 91 ms on 172.17.0.6 (executor 2) (5/5)
  20/10/28 19:48:06 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
  20/10/28 19:48:06 INFO DAGScheduler: ShuffleMapStage 0 (groupByKey at /opt/spark/tests/decommissioning_cleanup.py:46) finished in 2.341 s
  20/10/28 19:48:06 INFO DAGScheduler: looking for newly runnable stages
  20/10/28 19:48:06 INFO DAGScheduler: running: Set()
  20/10/28 19:48:06 INFO DAGScheduler: waiting: Set(ResultStage 1)
  20/10/28 19:48:06 INFO DAGScheduler: failed: Set()
  20/10/28 19:48:06 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[10] at collect at /opt/spark/tests/decommissioning_cleanup.py:47), which has no missing parents
  20/10/28 19:48:06 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 9.3 KiB, free 593.9 MiB)
  20/10/28 19:48:06 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.4 KiB, free 593.9 MiB)
  20/10/28 19:48:06 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:06 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1348
  20/10/28 19:48:06 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 1 (PythonRDD[10] at collect at /opt/spark/tests/decommissioning_cleanup.py:47) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
  20/10/28 19:48:06 INFO TaskSchedulerImpl: Adding task set 1.0 with 5 tasks resource profile 0
  20/10/28 19:48:06 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 5) (172.17.0.6, executor 2, partition 0, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:06 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 6) (172.17.0.5, executor 1, partition 1, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:06 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.17.0.5:44019 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:06 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.17.0.6:41183 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 172.17.0.6:48612
  20/10/28 19:48:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 172.17.0.5:53038
  20/10/28 19:48:06 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 7) (172.17.0.6, executor 2, partition 3, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:06 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 5) in 692 ms on 172.17.0.6 (executor 2) (1/5)
  20/10/28 19:48:06 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 8) (172.17.0.5, executor 1, partition 2, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:06 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 6) in 711 ms on 172.17.0.5 (executor 1) (2/5)
  20/10/28 19:48:06 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 8) in 84 ms on 172.17.0.5 (executor 1) (3/5)
  20/10/28 19:48:06 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 9) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:06 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 7) in 138 ms on 172.17.0.6 (executor 2) (4/5)
  20/10/28 19:48:06 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 9) in 80 ms on 172.17.0.6 (executor 2) (5/5)
  20/10/28 19:48:06 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
  20/10/28 19:48:06 INFO DAGScheduler: ResultStage 1 (collect at /opt/spark/tests/decommissioning_cleanup.py:47) finished in 0.929 s
  20/10/28 19:48:06 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
  20/10/28 19:48:06 INFO TaskSchedulerImpl: Killing all running tasks in stage 1: Stage finished
  20/10/28 19:48:06 INFO DAGScheduler: Job 0 finished: collect at /opt/spark/tests/decommissioning_cleanup.py:47, took 3.418954 s
  20/10/28 19:48:07 INFO SparkContext: Starting job: collect at /opt/spark/tests/decommissioning_cleanup.py:48
  20/10/28 19:48:07 INFO DAGScheduler: Registering RDD 2 (groupByKey at /opt/spark/tests/decommissioning_cleanup.py:43) as input to shuffle 1
  20/10/28 19:48:07 INFO DAGScheduler: Got job 1 (collect at /opt/spark/tests/decommissioning_cleanup.py:48) with 5 output partitions
  20/10/28 19:48:07 INFO DAGScheduler: Final stage: ResultStage 3 (collect at /opt/spark/tests/decommissioning_cleanup.py:48)
  20/10/28 19:48:07 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
  20/10/28 19:48:07 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 2)
  20/10/28 19:48:07 INFO DAGScheduler: Submitting ShuffleMapStage 2 (PairwiseRDD[2] at groupByKey at /opt/spark/tests/decommissioning_cleanup.py:43), which has no missing parents
  20/10/28 19:48:07 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 10.6 KiB, free 593.9 MiB)
  20/10/28 19:48:07 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 6.5 KiB, free 593.9 MiB)
  20/10/28 19:48:07 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1348
  20/10/28 19:48:07 INFO DAGScheduler: Submitting 5 missing tasks from ShuffleMapStage 2 (PairwiseRDD[2] at groupByKey at /opt/spark/tests/decommissioning_cleanup.py:43) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
  20/10/28 19:48:07 INFO TaskSchedulerImpl: Adding task set 2.0 with 5 tasks resource profile 0
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 10) (172.17.0.6, executor 2, partition 0, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 11) (172.17.0.5, executor 1, partition 1, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 172.17.0.5:44019 in memory (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO BlockManagerInfo: Removed broadcast_1_piece0 on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 in memory (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 172.17.0.6:41183 in memory (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.17.0.6:41183 (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.17.0.5:44019 (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 12) (172.17.0.6, executor 2, partition 2, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 10) in 138 ms on 172.17.0.6 (executor 2) (1/5)
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 13) (172.17.0.5, executor 1, partition 3, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 11) in 148 ms on 172.17.0.5 (executor 1) (2/5)
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 4.0 in stage 2.0 (TID 14) (172.17.0.6, executor 2, partition 4, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 12) in 100 ms on 172.17.0.6 (executor 2) (3/5)
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 13) in 104 ms on 172.17.0.5 (executor 1) (4/5)
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 4.0 in stage 2.0 (TID 14) in 94 ms on 172.17.0.6 (executor 2) (5/5)
  20/10/28 19:48:07 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
  20/10/28 19:48:07 INFO DAGScheduler: ShuffleMapStage 2 (groupByKey at /opt/spark/tests/decommissioning_cleanup.py:43) finished in 0.368 s
  20/10/28 19:48:07 INFO DAGScheduler: looking for newly runnable stages
  20/10/28 19:48:07 INFO DAGScheduler: running: Set()
  20/10/28 19:48:07 INFO DAGScheduler: waiting: Set(ResultStage 3)
  20/10/28 19:48:07 INFO DAGScheduler: failed: Set()
  20/10/28 19:48:07 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[11] at collect at /opt/spark/tests/decommissioning_cleanup.py:48), which has no missing parents
  20/10/28 19:48:07 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 9.3 KiB, free 593.9 MiB)
  20/10/28 19:48:07 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 5.4 KiB, free 593.9 MiB)
  20/10/28 19:48:07 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1348
  20/10/28 19:48:07 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 3 (PythonRDD[11] at collect at /opt/spark/tests/decommissioning_cleanup.py:48) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
  20/10/28 19:48:07 INFO TaskSchedulerImpl: Adding task set 3.0 with 5 tasks resource profile 0
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 15) (172.17.0.6, executor 2, partition 0, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 16) (172.17.0.5, executor 1, partition 1, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.17.0.5:44019 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 172.17.0.6:41183 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:07 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 172.17.0.5:53038
  20/10/28 19:48:07 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 172.17.0.6:48612
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 17) (172.17.0.5, executor 1, partition 2, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 16) in 113 ms on 172.17.0.5 (executor 1) (1/5)
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 18) (172.17.0.6, executor 2, partition 3, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 15) in 123 ms on 172.17.0.6 (executor 2) (2/5)
  20/10/28 19:48:07 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID 19) (172.17.0.5, executor 1, partition 4, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 17) in 73 ms on 172.17.0.5 (executor 1) (3/5)
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID 18) in 75 ms on 172.17.0.6 (executor 2) (4/5)
  20/10/28 19:48:07 INFO TaskSetManager: Finished task 4.0 in stage 3.0 (TID 19) in 78 ms on 172.17.0.5 (executor 1) (5/5)
  20/10/28 19:48:07 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
  20/10/28 19:48:07 INFO DAGScheduler: ResultStage 3 (collect at /opt/spark/tests/decommissioning_cleanup.py:48) finished in 0.277 s
  20/10/28 19:48:07 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
  20/10/28 19:48:07 INFO TaskSchedulerImpl: Killing all running tasks in stage 3: Stage finished
  20/10/28 19:48:07 INFO DAGScheduler: Job 1 finished: collect at /opt/spark/tests/decommissioning_cleanup.py:48, took 0.665306 s
  1st accumulator value is: 100
  Waiting to give nodes time to finish migration, decom exec 1.
  ...
  20/10/28 19:48:08 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.5: The executor with id 1 was deleted by a user or the framework.
  20/10/28 19:48:08 INFO DAGScheduler: Executor lost: 1 (epoch 2)
  20/10/28 19:48:08 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
  20/10/28 19:48:08 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 172.17.0.5, 44019, None)
  20/10/28 19:48:08 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
  20/10/28 19:48:08 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 2)
  20/10/28 19:48:09 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
  20/10/28 19:48:09 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
  20/10/28 19:48:13 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.5:53252) with ID 3,  ResourceProfileId 0
  20/10/28 19:48:13 INFO BlockManagerMasterEndpoint: Registering block manager 172.17.0.5:43069 with 593.9 MiB RAM, BlockManagerId(3, 172.17.0.5, 43069, None)
  20/10/28 19:48:37 INFO BlockManagerInfo: Removed broadcast_2_piece0 on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 in memory (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:37 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 172.17.0.6:41183 in memory (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:37 INFO SparkContext: Starting job: count at /opt/spark/tests/decommissioning_cleanup.py:53
  20/10/28 19:48:37 INFO DAGScheduler: Got job 2 (count at /opt/spark/tests/decommissioning_cleanup.py:53) with 5 output partitions
  20/10/28 19:48:37 INFO DAGScheduler: Final stage: ResultStage 5 (count at /opt/spark/tests/decommissioning_cleanup.py:53)
  20/10/28 19:48:37 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
  20/10/28 19:48:37 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 4)
  20/10/28 19:48:37 INFO DAGScheduler: Submitting ShuffleMapStage 4 (PairwiseRDD[2] at groupByKey at /opt/spark/tests/decommissioning_cleanup.py:43), which has no missing parents
  20/10/28 19:48:37 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 10.6 KiB, free 593.9 MiB)
  20/10/28 19:48:37 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.5 KiB, free 593.9 MiB)
  20/10/28 19:48:37 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:37 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1348
  20/10/28 19:48:37 INFO BlockManagerInfo: Removed broadcast_3_piece0 on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 in memory (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:37 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 4 (PairwiseRDD[2] at groupByKey at /opt/spark/tests/decommissioning_cleanup.py:43) (first 15 tasks are for partitions Vector(1, 3))
  20/10/28 19:48:37 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks resource profile 0
  20/10/28 19:48:37 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 20) (172.17.0.6, executor 2, partition 1, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:37 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 172.17.0.6:41183 in memory (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:37 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 21) (172.17.0.5, executor 3, partition 3, PROCESS_LOCAL, 7341 bytes) taskResourceAssignments Map()
  20/10/28 19:48:37 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 172.17.0.6:41183 (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:37 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 20) in 119 ms on 172.17.0.6 (executor 2) (1/2)
  20/10/28 19:48:38 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 172.17.0.5:43069 (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:39 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 21) in 1595 ms on 172.17.0.5 (executor 3) (2/2)
  20/10/28 19:48:39 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
  20/10/28 19:48:39 INFO DAGScheduler: ShuffleMapStage 4 (groupByKey at /opt/spark/tests/decommissioning_cleanup.py:43) finished in 1.618 s
  20/10/28 19:48:39 INFO DAGScheduler: looking for newly runnable stages
  20/10/28 19:48:39 INFO DAGScheduler: running: Set()
  20/10/28 19:48:39 INFO DAGScheduler: waiting: Set(ResultStage 5)
  20/10/28 19:48:39 INFO DAGScheduler: failed: Set()
  20/10/28 19:48:39 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] at count at /opt/spark/tests/decommissioning_cleanup.py:53), which has no missing parents
  20/10/28 19:48:39 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 10.6 KiB, free 593.9 MiB)
  20/10/28 19:48:39 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.9 KiB, free 593.9 MiB)
  20/10/28 19:48:39 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 (size: 5.9 KiB, free: 593.9 MiB)
  20/10/28 19:48:39 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1348
  20/10/28 19:48:39 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 5 (PythonRDD[12] at count at /opt/spark/tests/decommissioning_cleanup.py:53) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
  20/10/28 19:48:39 INFO TaskSchedulerImpl: Adding task set 5.0 with 5 tasks resource profile 0
  20/10/28 19:48:39 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 22) (172.17.0.5, executor 3, partition 0, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:39 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 23) (172.17.0.6, executor 2, partition 1, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:39 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 172.17.0.6:41183 (size: 5.9 KiB, free: 593.9 MiB)
  20/10/28 19:48:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 172.17.0.6:48612
  20/10/28 19:48:39 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 172.17.0.5:43069 (size: 5.9 KiB, free: 593.9 MiB)
  20/10/28 19:48:39 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 23) in 173 ms on 172.17.0.6 (executor 2) (1/5)
  20/10/28 19:48:39 INFO TaskSetManager: Starting task 2.0 in stage 5.0 (TID 24) (172.17.0.6, executor 2, partition 2, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:39 INFO TaskSetManager: Starting task 3.0 in stage 5.0 (TID 25) (172.17.0.6, executor 2, partition 3, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:39 INFO TaskSetManager: Finished task 2.0 in stage 5.0 (TID 24) in 93 ms on 172.17.0.6 (executor 2) (2/5)
  20/10/28 19:48:39 INFO TaskSetManager: Starting task 4.0 in stage 5.0 (TID 26) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:39 INFO TaskSetManager: Finished task 3.0 in stage 5.0 (TID 25) in 77 ms on 172.17.0.6 (executor 2) (3/5)
  20/10/28 19:48:39 INFO TaskSetManager: Finished task 4.0 in stage 5.0 (TID 26) in 80 ms on 172.17.0.6 (executor 2) (4/5)
  20/10/28 19:48:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 172.17.0.5:53252
  20/10/28 19:48:40 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 22) in 605 ms on 172.17.0.5 (executor 3) (5/5)
  20/10/28 19:48:40 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
  20/10/28 19:48:40 INFO DAGScheduler: ResultStage 5 (count at /opt/spark/tests/decommissioning_cleanup.py:53) finished in 0.630 s
  20/10/28 19:48:40 INFO DAGScheduler: Job 2 is finished. Cancelling potential speculative or zombie tasks for this job
  20/10/28 19:48:40 INFO TaskSchedulerImpl: Killing all running tasks in stage 5: Stage finished
  20/10/28 19:48:40 INFO DAGScheduler: Job 2 finished: count at /opt/spark/tests/decommissioning_cleanup.py:53, took 2.264185 s
  20/10/28 19:48:40 INFO SparkContext: Starting job: collect at /opt/spark/tests/decommissioning_cleanup.py:54
  20/10/28 19:48:40 INFO DAGScheduler: Got job 3 (collect at /opt/spark/tests/decommissioning_cleanup.py:54) with 5 output partitions
  20/10/28 19:48:40 INFO DAGScheduler: Final stage: ResultStage 7 (collect at /opt/spark/tests/decommissioning_cleanup.py:54)
  20/10/28 19:48:40 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)
  20/10/28 19:48:40 INFO DAGScheduler: Missing parents: List()
  20/10/28 19:48:40 INFO DAGScheduler: Submitting ResultStage 7 (PythonRDD[11] at collect at /opt/spark/tests/decommissioning_cleanup.py:48), which has no missing parents
  20/10/28 19:48:40 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 9.3 KiB, free 593.9 MiB)
  20/10/28 19:48:40 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.4 KiB, free 593.9 MiB)
  20/10/28 19:48:40 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1348
  20/10/28 19:48:40 INFO BlockManagerInfo: Removed broadcast_4_piece0 on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 in memory (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO BlockManagerInfo: Removed broadcast_4_piece0 on 172.17.0.6:41183 in memory (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 7 (PythonRDD[11] at collect at /opt/spark/tests/decommissioning_cleanup.py:48) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
  20/10/28 19:48:40 INFO TaskSchedulerImpl: Adding task set 7.0 with 5 tasks resource profile 0
  20/10/28 19:48:40 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 27) (172.17.0.5, executor 3, partition 0, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:40 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 28) (172.17.0.6, executor 2, partition 1, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:40 INFO BlockManagerInfo: Removed broadcast_4_piece0 on 172.17.0.5:43069 in memory (size: 6.5 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO BlockManagerInfo: Removed broadcast_5_piece0 on spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:7079 in memory (size: 5.9 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO BlockManagerInfo: Removed broadcast_5_piece0 on 172.17.0.5:43069 in memory (size: 5.9 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 172.17.0.5:43069 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO BlockManagerInfo: Removed broadcast_5_piece0 on 172.17.0.6:41183 in memory (size: 5.9 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 172.17.0.6:41183 (size: 5.4 KiB, free: 593.9 MiB)
  20/10/28 19:48:40 INFO TaskSetManager: Starting task 2.0 in stage 7.0 (TID 29) (172.17.0.6, executor 2, partition 2, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:40 INFO TaskSetManager: Finished task 1.0 in stage 7.0 (TID 28) in 99 ms on 172.17.0.6 (executor 2) (1/5)
  20/10/28 19:48:40 INFO TaskSetManager: Starting task 3.0 in stage 7.0 (TID 30) (172.17.0.5, executor 3, partition 3, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:40 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 27) in 103 ms on 172.17.0.5 (executor 3) (2/5)
  20/10/28 19:48:40 INFO TaskSetManager: Starting task 4.0 in stage 7.0 (TID 31) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) taskResourceAssignments Map()
  20/10/28 19:48:40 INFO TaskSetManager: Finished task 2.0 in stage 7.0 (TID 29) in 81 ms on 172.17.0.6 (executor 2) (3/5)
  20/10/28 19:48:40 INFO TaskSetManager: Finished task 3.0 in stage 7.0 (TID 30) in 80 ms on 172.17.0.5 (executor 3) (4/5)
  20/10/28 19:48:40 INFO TaskSetManager: Finished task 4.0 in stage 7.0 (TID 31) in 73 ms on 172.17.0.6 (executor 2) (5/5)
  20/10/28 19:48:40 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool 
  20/10/28 19:48:40 INFO DAGScheduler: ResultStage 7 (collect at /opt/spark/tests/decommissioning_cleanup.py:54) finished in 0.272 s
  20/10/28 19:48:40 INFO DAGScheduler: Job 3 is finished. Cancelling potential speculative or zombie tasks for this job
  20/10/28 19:48:40 INFO TaskSchedulerImpl: Killing all running tasks in stage 7: Stage finished
  20/10/28 19:48:40 INFO DAGScheduler: Job 3 finished: collect at /opt/spark/tests/decommissioning_cleanup.py:54, took 0.285746 s
  Final accumulator value is: 140
  Finished waiting, stopping Spark.
  20/10/28 19:48:40 INFO SparkUI: Stopped Spark web UI at http://spark-test-app-e7aaec7570c02f36-driver-svc.d73933b6103b4a1ca6896efb735e36d1.svc:4040
  20/10/28 19:48:40 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
  20/10/28 19:48:40 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
  20/10/28 19:48:40 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
  20/10/28 19:48:40 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
  20/10/28 19:48:40 INFO MemoryStore: MemoryStore cleared
  20/10/28 19:48:40 INFO BlockManager: BlockManager stopped
  20/10/28 19:48:40 INFO BlockManagerMaster: BlockManagerMaster stopped
  20/10/28 19:48:40 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
  20/10/28 19:48:40 INFO SparkContext: Successfully stopped SparkContext
  Done, exiting Python
  20/10/28 19:48:41 INFO ShutdownHookManager: Shutdown hook called
  20/10/28 19:48:41 INFO ShutdownHookManager: Deleting directory /var/data/spark-3d1c3c4d-9961-4aa3-a386-4cb183112d0b/spark-346a6a32-0c4b-4644-8378-e4608301526d/pyspark-b0512933-1fc1-45f2-b403-68900a01075b
  20/10/28 19:48:41 INFO ShutdownHookManager: Deleting directory /var/data/spark-3d1c3c4d-9961-4aa3-a386-4cb183112d0b/spark-346a6a32-0c4b-4644-8378-e4608301526d
  20/10/28 19:48:41 INFO ShutdownHookManager: Deleting directory /tmp/spark-93e828ac-2756-4e4d-bf5c-21cb7b345c82
  " did not contain "Decommission executors" The application did not complete, did not find str Decommission executors. (KubernetesSuite.scala:387)
- Test decommissioning with dynamic allocation & shuffle cleanups
Run completed in 20 minutes, 14 seconds.
Total number of tests run: 20
Suites: completed 2, aborted 0
Tests: succeeded 18, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***
{code}
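
Both failures are ScalaTest eventually timeouts: the suite repeatedly polls for a condition (the SparkPi result, or the string "Decommission executors" in the driver log) and gives up after roughly three minutes, which produces the "Attempted N times over 3.0... minutes" messages above. The real assertions live in KubernetesSuite.scala (lines 383 and 387 per the stack traces); the snippet below is only a minimal sketch of that polling pattern, with a hypothetical fetchLog accessor standing in for the suite's pod-log helper.
{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Minutes, Seconds, Span}

// Minimal sketch of the flaky pattern (not the suite's actual code):
// re-run the assertion every second until it passes, or fail once the
// three-minute budget is exhausted, the failure mode reported above.
// fetchLog is a hypothetical stand-in for reading the driver pod log.
def expectInDriverLog(fetchLog: () => String, expected: String): Unit =
  eventually(timeout(Span(3, Minutes)), interval(Span(1, Seconds))) {
    assert(fetchLog().contains(expected),
      s"driver log did not (yet) contain: $expected")
  }
{code}
Any run where the pods are slow to schedule, or where the expected log line is emitted after the budget expires, fails this check even though nothing is functionally wrong, which matches the same-commit pass/fail pair linked above.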

Usually, the following tests fail:
- Run SparkPi with no resources (the first test in the suite)
- Test basic decommissioning with shuffle cleanup

Sometimes, the second test also fails along with the first:
- Run SparkPi with a very long application name.

In addition, the SparkR test has been disabled for a long time due to its failure on Jenkins, even though the SparkR K8s IT passes when we run the K8s IT suite locally.


