You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/13 20:51:07 UTC

[GitHub] [spark] sumeetgajjar commented on pull request #32114: [SPARK-35011][CORE] Avoid Block Manager registrations when StopExecutor msg is in-flight

sumeetgajjar commented on pull request #32114:
URL: https://github.com/apache/spark/pull/32114#issuecomment-819046189


   > Hey guys, I'd like to propose a simpler (but might be a little bit ticky) fix if I understand the issue correctly. The idea is,
   > 
   > Instead of removing the executor directly, we set `executorLastSeen(executorId) = -1L` when we receives `ExecutorRemoved` in `HeartbeatReceiver`. And then,
   > 
   >     1. if `Heartbeat` comes first before `ExpireDeadHosts`, we remove the executor from `executorLastSeen` by checking the value "< 0" and avoid the re-register.
   > 
   >     2. if `ExpireDeadHosts` comes first before `Heartbeat`,  we set `executorLastSeen(executorId) = -2L`. We can't remove it this time in `ExpireDeadHosts` because if `Heartbeat` comes later we'd have the same issue again.
   > 
   > 
   > 2.1 if `Heartbeat` comes later, we remove the executor from `executorLastSeen` by checking the value "< 0" too and also avoid the re-register.
   > 
   > 2.2 if `Heartbeat` doesn't come (that means the executor stopped before sending the heartbeat), we remove the executor from `executorLastSeen` by checking the value = -2L in next `ExpireDeadHosts`.
   > 
   > In this way, we can avoid the extra cache and all changes should be limited to `HeartbeatReceiver`.
   > 
   > Any thoughts?
   
   Thanks for the comment @Ngone51 .
   There are two places from where the re-registration can be triggered
   - From `HeartbeatReceiver` - by responding `HeartbeatResponse(reregisterBlockManager = true)`. 
   Your solution will take care of this.
   - From `BlockManager` - e.g. [reportBlockStatus](https://github.com/apache/spark/blob/ee7d838aaf46f9d786e0388915b422fb78952893/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L766).
   Just modifying `HeartbeatReceiver` won't solve the re-registration issue here. We will also have to implement a similar kind of tracking inside `BlockManagerMasterEndpoint`. And now since both tracking are independent of each other, it might introduce some race condition (please correct me if I am wrong).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org