Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/03/29 10:05:03 UTC

[GitHub] [spark] prakharjain09 edited a comment on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned

URL: https://github.com/apache/spark/pull/27864#issuecomment-605599508
 
 
   > Thank you so much for working on this, I'm really glad you picked it up. I have some questions about the design and some small places for improvement, but really excited.
   
   @holdenk Thanks for the review. My bad, I should have included the overall design description up front.
   
   Current overall design:
   
   1) CoarseGrainedSchedulerBackend receives a signal to decommissionExecutor. On receiving the signal, it does two things: it stops assigning new tasks to that executor (SPARK-20628), and it sends a message to BlockManagerMasterEndpoint (via BlockManagerMaster) to decommission the corresponding BlockManager.
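Step 1 could be sketched roughly as below. This is a minimal, self-contained Scala sketch, not Spark's actual API; names like `SchedulerBackendSketch` are illustrative only:

```scala
import scala.collection.mutable

// Illustrative sketch of step 1: on a decommission signal, the scheduler
// backend stops offering new tasks to the executor and records that the
// matching BlockManager must be told to decommission.
class SchedulerBackendSketch {
  private val decommissioningExecutors = mutable.Set.empty[String]
  private val blockManagerDecommissions = mutable.Buffer.empty[String]

  // Executors in the decommissioning set get no new task offers (SPARK-20628).
  def canLaunchTasksOn(executorId: String): Boolean =
    !decommissioningExecutors.contains(executorId)

  def decommissionExecutor(executorId: String): Unit = {
    decommissioningExecutors += executorId   // 1. stop assigning new tasks
    blockManagerDecommissions += executorId  // 2. notify the BM master endpoint
  }

  def pendingBlockManagerDecommissions: Seq[String] =
    blockManagerDecommissions.toSeq
}
```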
   
   2) BlockManagerMasterEndpoint receives the "DecommissionBlockManagers" message. On receiving this,
    - it updates the "decommissioningBlockManagerSet", which contains all BMs that are undergoing decommissioning. This set is maintained so that the correct peer list can be given to the active BMs: active BMs keep asking BlockManagerMasterEndpoint for their peer list, and only active BMs should be returned as peers.
   
    - it sends a message to the corresponding BlockManager to start the decommissioning process.
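Step 2 could be sketched as follows. Again a hedged, self-contained sketch with illustrative names (`BlockManagerMasterSketch`), not Spark's real endpoint:

```scala
import scala.collection.mutable

// Illustrative sketch of step 2: the master endpoint tracks decommissioning
// block managers and filters them out of the peer list it returns to
// active BMs.
class BlockManagerMasterSketch(allBlockManagers: Set[String]) {
  private val decommissioningBlockManagerSet = mutable.Set.empty[String]

  def decommissionBlockManagers(ids: Seq[String]): Unit = {
    decommissioningBlockManagerSet ++= ids
    // ...then a DecommissionBlockManager message would be sent to each of them.
  }

  // Peers returned to an asking BM exclude the asker itself and any
  // decommissioning BMs.
  def getPeers(askingBm: String): Set[String] =
    allBlockManagers - askingBm -- decommissioningBlockManagerSet
}
```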
   
   
   3) BlockManager on worker (say BM-x) receives the "DecommissionBlockManager" message. It takes the following two actions on receiving the message:
   - BM-x stops accepting new RDD cache blocks. Since this block manager is in a decommissioning state, it shouldn't store any new cache data (similar to DataNode decommissioning in HDFS); any newly accepted blocks would also have to be offloaded.
   - BM-x starts a BlockManagerDecommissionManager thread. This thread tries to offload any cache blocks on this BM every 30 seconds (or a configured interval). In the first attempt it might not be possible to offload all cache blocks because of limited space on other BMs, so it retries after 30 seconds, and so on. The available cache space in the application can keep changing, because of dynamic allocation or because the user explicitly uncaches some other RDD, so multiple attempts can be made to offload the RDD cache blocks from BM-x. Steps performed in a single attempt:
   a) Ask BlockManagerMasterEndpoint for the replication info of all the cache blocks on BM-x. This lets BM-x know where each block is already replicated, so it can avoid replicating to those places again.
   b) For each block, try to replicate it to peers. Drop the block if it is successfully replicated to one of the peers; otherwise keep it. When a block is replicated and dropped from BM-x, this is automatically communicated to the driver, and dynamic allocation (the ExecutorMonitor class) updates its bookkeeping and removes the executor from the system.
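The retry loop in step 3 could be sketched like this. A self-contained sketch under stated assumptions: `DecommissioningBlockManagerSketch` and the `replicateToPeer` callback are hypothetical stand-ins, not the real BlockManagerDecommissionManager:

```scala
// Illustrative sketch of step 3: a decommissioning BM rejects new cache
// blocks and repeatedly tries to offload existing ones to peers; blocks
// that fail to replicate (e.g. peers are full) stay for the next attempt.
class DecommissioningBlockManagerSketch(
    initialBlocks: Set[String],
    replicateToPeer: String => Boolean) { // true if some peer accepted the block

  // Cache blocks currently held on BM-x.
  var cachedBlocks: Set[String] = initialBlocks

  // New RDD cache blocks are rejected while decommissioning; otherwise
  // they would also have to be offloaded later.
  def putCacheBlock(blockId: String): Boolean = false

  // One offload attempt: replicate each block to a peer and drop it locally
  // on success; failed blocks are retried on the next (e.g. 30s later) pass.
  def offloadOnce(): Unit =
    cachedBlocks = cachedBlocks.filterNot(replicateToPeer)
}
```

In a usage sketch, an attempt while peers are full leaves some blocks cached, and a later attempt (after capacity frees up via dynamic allocation or uncaching) drains the rest.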

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org