Posted to issues@spark.apache.org by "Holden Karau (Jira)" <ji...@apache.org> on 2022/08/12 23:35:00 UTC

[jira] [Resolved] (SPARK-38969) Graceful decomissionning on Kubernetes fails / decom script error

     [ https://issues.apache.org/jira/browse/SPARK-38969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Holden Karau resolved SPARK-38969.
----------------------------------
    Fix Version/s: 3.4.0
         Assignee: Holden Karau
       Resolution: Fixed

Updated the decommissioning script to be more resilient and to block for as long as it takes the executor to exit. K8s will still kill the pod if it exceeds the graceful shutdown time limit, so we don't have to worry too much about blocking forever there.
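
The blocking behaviour boils down to something like the sketch below. This is an illustration, not the actual script from the PR; the `wait_for_exit` helper and the use of SIGPWR as the decommission trigger are assumptions here.

```shell
#!/bin/sh
# Illustrative sketch of a decommission hook that blocks until the
# executor process is gone. Kubernetes enforces
# terminationGracePeriodSeconds and will SIGKILL the pod regardless,
# so blocking "forever" here is acceptable.

wait_for_exit() {
  pid="$1"
  # Ask the executor to decommission; the signal and this helper are
  # illustrative, not taken from the real script.
  kill -s PWR "$pid" 2>/dev/null || true
  # Poll rather than `wait`, since the JVM is usually not our child process.
  while kill -0 "$pid" 2>/dev/null; do
    sleep 1
  done
}
```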

 

Also updated how we tag executor loss reasons for executors that decommission too "quickly".

 

See https://github.com/apache/spark/pull/36434/files
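
For context on the `decom.sh` exit status mentioned in the report below: 137 is 128 + 9, i.e. the process was killed with SIGKILL, which is consistent with the kubelet force-killing the pod once it outlives its grace period. A quick shell check:

```shell
# A process killed by SIGKILL (signal 9) exits with status 128 + 9 = 137.
code=$(sh -c 'kill -9 $$'; echo $?)
echo "$code"   # 137
```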

> Graceful decomissionning on Kubernetes fails / decom script error
> -----------------------------------------------------------------
>
>                 Key: SPARK-38969
>                 URL: https://issues.apache.org/jira/browse/SPARK-38969
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>         Environment: Running spark-thriftserver (3.2.0) on Kubernetes (GKE 1.20.15-gke.2500). 
>  
>            Reporter: Yeachan Park
>            Assignee: Holden Karau
>            Priority: Minor
>             Fix For: 3.4.0
>
>
> Hello, we are running into some issues while attempting graceful decommissioning of executors. We enabled:
>  * spark.decommission.enabled 
>  * spark.storage.decommission.rddBlocks.enabled
>  * spark.storage.decommission.shuffleBlocks.enabled
>  * spark.storage.decommission.enabled
> and set spark.storage.decommission.fallbackStorage.path to a path in our bucket.
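> Pulled together, that corresponds to a configuration roughly like the following (a sketch; the fallback path is a placeholder, not our real bucket):
>  
> ```
> spark-submit \
>   --conf spark.decommission.enabled=true \
>   --conf spark.storage.decommission.enabled=true \
>   --conf spark.storage.decommission.rddBlocks.enabled=true \
>   --conf spark.storage.decommission.shuffleBlocks.enabled=true \
>   --conf spark.storage.decommission.fallbackStorage.path=gs://<our-bucket>/spark-fallback/ \
>   ...
> ```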
>  
> The logs from the driver seem to suggest that the decommissioning process started, but then the executor unexpectedly exited and failed:
>  
> ```
> 22/04/20 15:09:09 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 3 decommissioned message
> 22/04/20 15:09:09 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3
> 22/04/20 15:09:09 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.1.130, 44789, None)) as being decommissioning.
> 22/04/20 15:09:10 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.1.130: Executor decommission.
> 22/04/20 15:09:10 INFO DAGScheduler: Executor lost: 3 (epoch 2)
> 22/04/20 15:09:10 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 3).
> 22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
> 22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.1.130, 44789, None)
> 22/04/20 15:09:10 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
> 22/04/20 15:09:10 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 2)
> ```
>  
> However, the executor logs seem to suggest that decommissioning was successful:
>  
> ```
> 22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Decommission executor 3.
> 22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
> 22/04/20 15:09:09 INFO BlockManager: Starting block manager decommissioning process...
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting block migration
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all RDD blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all shuffle blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Start refreshing migratable shuffle blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained.
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all cached RDD blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(4, 100.96.1.131, 35607, None)
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(fallback, remote, 7337, None)
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round refreshing migratable shuffle blocks, waiting for 30000ms before the next round refreshing.
> 22/04/20 15:09:10 WARN BlockManagerDecommissioner: Asked to decommission RDD cache blocks, but no blocks to migrate
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round RDD blocks migration, waiting for 30000ms before the next round migration.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, all blocks migrated, stopping.
> 22/04/20 15:09:10 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Finished decommissioning
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop refreshing migratable shuffle blocks.
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopping migrating shuffle blocks.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopped block migration
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block migration().
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block migration().
> 22/04/20 15:09:10 INFO MemoryStore: MemoryStore cleared
> 22/04/20 15:09:10 INFO BlockManager: BlockManager stopped
> 22/04/20 15:09:10 INFO ShutdownHookManager: Shutdown hook called
> ```
>  
> The decommissioning script `/opt/decom.sh` also always terminates with exit code 137; we're not really sure why that is.
> 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org