You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/13 02:09:02 UTC

[GitHub] [spark] agrawaldevesh edited a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

agrawaldevesh edited a comment on pull request #28708:
URL: https://github.com/apache/spark/pull/28708#issuecomment-643552975

@holdenk, what are your thoughts on reducing the number of fetch failed exceptions (and thus potential job failures) due to decommissioning:
- This PR tries to offload the shuffle blocks to another peer but that is somewhat best effort.
- When the executor is eventually lost (detected much later by a heartbeat timeout), its shuffle files will be cleared. Until that happens the unmigrated blocks (for whatever reason) will result in fetch failures.

Block migration may not be possible in all cases since we may not have enough time to do the migration or the peer's disk might be full or it might soon after be decommissioned itself. This sort of "bulk decommissioning" can happen for example during a cluster scale down where many nodes are brought down in quick succession.

While block migration is great in that it avoids wasting work, we cannot guarantee it and I think we need a second line of defense by proactively calling `MapOutputTracker.removeOutputsOnExecutor` (and clearing `DAGScheduler.cacheLocs`) perhaps after some timeout. (I think FetchFailures may still happen if a reducer has already started.)

Another way to solve this problem is for the executor to cleanly exit and tell the driver that it is going away, so that the driver can forget about its (unmigrated) shuffle blocks. But even this is somewhat best effort since the cloud provider may not give it time to do a clean shutdown.

It would be nice to prevent/reduce job failures caused by fetch failures: Particularly because we already know that the executor would be yanked soon, so we should be able to do better.

All these three approaches to preventing fetch failures are orthogonal and we probably should allow users to enable each one of them independently:
* Offloading shuffle blocks
* Clearing shuffle statuses
* Eager and clean termination of the executor.

Thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org