Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/02 13:59:56 UTC

[GitHub] [spark] Ngone51 commented on pull request #31876: [SPARK-34942][API][CORE] Abstract Location in MapStatus to enable support for custom storage

Ngone51 commented on pull request #31876:
URL: https://github.com/apache/spark/pull/31876#issuecomment-812542883


   (Sorry for the delay, I was busy with internal stuff.)
   
   So I have removed all of the methods from the `Location` interface. And now, the cast to `BlockManagerId` happens in these 4 places (a rough sketch of the cast follows the list below):
   
   a) ShuffleBlockFetcherIterator
   
   The casts here should be extracted into a Spark-native shuffle reader, so this case should be fine.
   
   b) DAGScheduler/MapOutputTracker
   
   * use the `host` or `executorId` from `BlockManagerId` to manage shuffle map outputs, e.g.,
   
   `removeOutputsOnHost(...)`
   `removeOutputsOnExecutor(...)`
   
   * use the `host` from `BlockManagerId` as the preferred location, e.g.,
   `getPreferredLocationsForShuffle`
   `getMapLocation`
   
   
   c) TaskSetManager
   Uses both the `host` and `executorId` to update the `HealthTracker`.
   
   d) JsonProtocol
   Converts the `BlockManagerId` into JSON.
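
   For concreteness, here is a minimal sketch of what the marker interface and the downcast look like; the names, signatures, and bodies are illustrative assumptions, not the PR's actual code:

   ```scala
   // Minimal sketch, assuming `Location` is the new marker interface and
   // `BlockManagerId` is its only Spark-native implementation.
   // Names and signatures are illustrative, not the exact code in the PR.
   trait Location extends Serializable

   case class BlockManagerId(executorId: String, host: String, port: Int) extends Location

   // Call sites like b), c), and d) currently downcast before reading host/executorId:
   def hostOf(loc: Location): String =
     loc.asInstanceOf[BlockManagerId].host  // throws ClassCastException for custom locations
   ```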
   
   For cases b, c, and d, I'll try to get rid of the casting in later commits. One feasible way is to use pattern matching to skip other `Location` implementations (see the sketch below). At the same time, I'm still thinking about whether there is a better way to unify the behavior of locations. For example, for storage like HDFS, which doesn't have a specific host, we could probably use "*" to represent it. And for `executorId`, although some storage doesn't have a meaningful value for it, each map task actually does have a corresponding executorId (though I kind of agree that exposing `executorId` there would be confusing).
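
   As a rough illustration of that pattern-matching idea, here is a hypothetical helper building on the sketch above (not the code I plan to commit):

   ```scala
   // Hedged sketch of the pattern-matching approach, reusing the illustrative
   // `Location` and `BlockManagerId` types from the earlier sketch.
   def preferredHosts(locations: Seq[Location]): Seq[String] =
     locations.collect {
       case bm: BlockManagerId => bm.host  // Spark-native location: host is meaningful
       // any other Location implementation falls through and is skipped here;
       // alternatively it could be mapped to a wildcard host such as "*"
     }
   ```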
   

