You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/03 01:03:41 UTC

[GitHub] [spark] hiboyang opened a new pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

hiboyang opened a new pull request #31715:
URL: https://github.com/apache/spark/pull/31715


   
   ### What changes were proposed in this pull request?
   
   Add Spark config "spark.shuffle.markFileLostOnExecutorLost" to not delete shuffle file on executor lost event
   
   ### Why are the changes needed?
   
   There are multiple work going on with disaggregated/remote shuffle service (e.g. LinkedIn shuffle, Facebook shuffle service, Uber shuffle service). In those systems, shuffle data will be stored on different server other than executor. Spark should not mark shuffle data lost when the executor is lost. We could add a Spark configuration to control this behavior. By default, Spark still mark shuffle file lost. For disaggregated/remote shuffle service, people could set the configure to not mark shuffle file lost.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Run Spark unit test
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790126760


   Thank you for pinging me, @dbtsai .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r586852577



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,13 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")
+      .doc("Mark shuffle files as lost when an executor is lost. If you are using a remote " +
+       "shuffle service such that shuffle files are not stored on the executor, consider " +

Review comment:
       Well. In this context, `remote shuffle service` is not `external shuffle service` in Apache Spark because we are ignoring `env.blockManager.externalShuffleServiceEnabled` here. We have better be clear on this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-789362764


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790130701


   **[Test build #135723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135723/testReport)** for PR 31715 at commit [`de2d2b5`](https://github.com/apache/spark/commit/de2d2b5738347994f6ed764425dab5be23caf4c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791236944


   Thank you for your decision, @hiboyang .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dbtsai commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
dbtsai commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-789982631


   cc @dongjoon-hyun @holdenk @viirya for reviews.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791810374


   > Just checked Uber RSS (as I know that is open source and once I read its source at a level) where the location is a dummy block manager and the topology info holds the RSS servers list:
   > 
   > https://github.com/uber/RemoteShuffleService/blob/master/src/main/scala/org/apache/spark/shuffle/rss/RssUtils.scala#L56
   > 
   > But I know you know this already ;)
   
   Yes, I added there :) That was kind of a work around to make it work with open source Spark. Still waiting for another PR "[Adding metadata to MapStatus](https://github.com/apache/spark/pull/30004)) gets reviewed. That could provide a better solution for embedding RSS servers list inside MapStatus.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790177304


   I'd tend to more agree with @attilapiros on the point. It sounds not very proper and error-prone to let users control the behavior. I think more ideal approach is to let Spark decide to if the shuffle output should be unregistered when the executor is going to be removed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r586824068



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,12 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")
+      .doc("Mark shuffle file lost when executor is lost. People could set this to false when " +
+       "using remote shuffle services because the shuffle file is not stored on the executor.")

Review comment:
       This is good suggestion, will do the rewording, thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r586814240



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,12 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")
+      .doc("Mark shuffle file lost when executor is lost. People could set this to false when " +
+       "using remote shuffle services because the shuffle file is not stored on the executor.")

Review comment:
       nit: What about if we reworded this to something like:
   "Mark shuffle files as lost when an executor is lost. If you are using a remote shuffle service such that shuffle files are not stored on the executor consider setting this to false."




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790247354


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40308/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791513101


   Sorry I was away from the computer but I agree we can close this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791592427


   Thank you all! I'm closing this now.
   We can reopen this if there is a new decision.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790222222


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135723/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791596244


   Thanks @hiboyang @dongjoon-hyun. Yea, we can reopen this if we find it is necessary.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
attilapiros commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r588475220



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,15 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")

Review comment:
       @c21 May I ask what is the status of open sourcing Cosco? 
   I just think it might help us to make better decisions about the Shuffle plugin API of Spark if we could look into that one too.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r586981565



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,15 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")

Review comment:
       I am having similar concern with @viirya and @attilapiros . I think we should not make it as a user-facing config. If we would like introducing a config for this anyway, it'd better to start with `internal()` config, but not user-facing.
   
   Though `spark.shuffle.manager` can be arbitrary class, but in practice, they are just a handful of implementation in one company's environment, and ideally the infra developers should control which implementation to use, instead of user to control this. Similarly for whether to mark shuffle file lost should be controlled by developers team but not users.
   
   Just share some context, in FB, we (spark developer) control these kinds of behavior transparently and make this invisible to spark user. This also helps us to migrate to newer implementation easier without worrying about users setting wrong config. The mixed case (query uses customized shuffle service and default shuffle service) can happen quite a bit in production, as we have rate limit for traffic on customized service, and need fallback.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790283250


   **[Test build #135726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135726/testReport)** for PR 31715 at commit [`1418e8f`](https://github.com/apache/spark/commit/1418e8f9507375f4fec025d8df357a86cb99ee45).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #31715:
URL: https://github.com/apache/spark/pull/31715


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790222763


   **[Test build #135726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135726/testReport)** for PR 31715 at commit [`1418e8f`](https://github.com/apache/spark/commit/1418e8f9507375f4fec025d8df357a86cb99ee45).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790046943


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40302/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r587189709



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,15 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")

Review comment:
       Hi Cheng, thanks for the comment, and known you from FB! We previously discussed with people from FB about Cosco, but not for this specific issue. Just curious, do you guys modify your internal Spark distribution to make executor lost event not trigger driver to delete map output?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790130914


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135720/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790013265


   > I do not think this flag (indicating whether to unregister all the shuffle blocks created by the failed executor) must be set at the application level (the way this PR does).
   > 
   > I think this must be a property kept for the replicated shuffle block so somewhere close to `MapStatus`.
   > 
   > Imagine a mixed solution where the disaggregated storage just a fallback and still locally stored blocks can be used.
   > Then it make sense to unregister the shuffle blocks for the failed executors and leave the disaggregated storage access for the blocks intact.
   
   Yeah, if user uses a mixed solution where the disaggregated storage just a fallback, it is better to have MapStatus to track availability of shuffle blocks. The PR does not block this scenario.
   
   There is non-mixed solution where people totally rely on remote shuffle service (like Facebook/Uber's one). In that case, people only need to set this flag in the application level.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dbtsai commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
dbtsai commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-789983205


   cc @attilapiros @HyukjinKwon as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang edited a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang edited a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790184675


   > I'd tend to more agree with @attilapiros on the point. It sounds not very proper and error-prone to let users control the behavior. I think more ideal approach is to let Spark decide to if the shuffle output should be unregistered when the executor is going to be removed.
   
   The usage here is user decides to use Remote Shuffle Service (e.g. Facebook/Uber's) and sets spark.shuffle.manager with a customized class which supports that Remote Shuffle Service. Then they will set spark.shuffle.markFileLostOnExecutorLost to avoid marking shuffle file lost when executor is lost. In other scenarios, user will not set spark.shuffle.markFileLostOnExecutorLost. Would you clarify the "error-prone" concern here?
   
   I am going to update the comment for this config to something like that to help users understand more:
   Mark shuffle files as lost when an executor is lost. If you set spark.shuffle.manager with a customized class to use a different shuffle implementation like storing shuffle files on remote storage/server or using third party remote shuffle service, and you are sure the shuffle files are not stored on the  executor, consider setting this to false.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790184675


   > I'd tend to more agree with @attilapiros on the point. It sounds not very proper and error-prone to let users control the behavior. I think more ideal approach is to let Spark decide to if the shuffle output should be unregistered when the executor is going to be removed.
   
   The usage here is user decides to use Remote Shuffle Service (e.g. Facebook/Uber's) and sets spark.shuffle.manager with a customized class which supports that Remote Shuffle Service. Then they will set spark.shuffle.markFileLostOnExecutorLost to avoid marking shuffle file lost when executor is lost. In other scenarios, user will not set spark.shuffle.markFileLostOnExecutorLost. Would you clarify the "error-prone" concern here?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790247354


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40308/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791874983


   > Still waiting for another PR "Adding metadata to MapStatus" gets reviewed. 
   
   Ok, I can help on that :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r586988087



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -2034,7 +2034,8 @@ private[spark] class DAGScheduler(
     // if the cluster manager explicitly tells us that the entire worker was lost, then
     // we know to unregister shuffle output.  (Note that "worker" specifically refers to the process
     // from a Standalone cluster, where the shuffle service lives in the Worker.)
-    val fileLost = workerHost.isDefined || !env.blockManager.externalShuffleServiceEnabled
+    val fileLost = (workerHost.isDefined || !env.blockManager.externalShuffleServiceEnabled) &&
+      env.blockManager.markFileLostOnExecutorLost

Review comment:
       shouldn't we check this config to not take effect, if `spark.shuffle.manager` is `sort` (the default one)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790196688


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40305/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790031470


   **[Test build #135720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135720/testReport)** for PR 31715 at commit [`ddb6d8f`](https://github.com/apache/spark/commit/ddb6d8fd7c1a86f6c68a0ef11f6de8a06dd4ecb2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790222222


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135723/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang edited a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang edited a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790013265


   > I do not think this flag (indicating whether to unregister all the shuffle blocks created by the failed executor) must be set at the application level (the way this PR does).
   > 
   > I think this must be a property kept for the replicated shuffle block so somewhere close to `MapStatus`.
   > 
   > Imagine a mixed solution where the disaggregated storage just a fallback and still locally stored blocks can be used.
   > Then it make sense to unregister the shuffle blocks for the failed executors and leave the disaggregated storage access for the blocks intact.
   
   Yeah, if user uses a mixed solution where the disaggregated storage just a fallback, it is better to have MapStatus to track availability of shuffle blocks. This PR does not block this scenario.
   
   There is non-mixed solution where people totally rely on remote shuffle service (like Facebook/Uber's one). In that case, people only need to set this flag in the application level. This PR will help this scenario.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790275748


   Then how about a property somewhere close to `MapStatus`?  I guess this config will become sort of mysterious config. As Spark, users also might not have knowledge about how to set this? In Spark, it should know where the shuffle output is kept. So ideally Spark should know if the shuffle output should be unregistered or not. I just don't know if currently Spark provides necessary stuffs for it.
   
   Under mixed solution, this config is also hard to set properly. Should it be set to true or false if the shuffle output could be kept either in fallback storage and executor?
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790204909


   > Thank you for your contribution, @hiboyang .
   > 
   > 1. Could you update the documentation together in this PR because this configuration can be misleading to the other config sets?
   > 2. Could you provide a unit test for this new feature please? Without test coverage, your new feature is not protected at all in the future release.
   
   Hi @dongjoon-hyun, thanks for the suggestion! I updated the documentation and added a unit test :)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790041248


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40302/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790000190


   I do not think this flag (indicating whether to unregister all the shuffle blocks created by the failed executor) must be set at the application level (the way this PR does). 
   
   I think this must be a property kept for the replicated shuffle block so close to  `MapStatus`.
    
   Imagine a mixed solution where the disaggregated storage just a fallback and still locally stored blocks can be used. 
   Then it make sense to unregister the shuffle blocks for the failed executors and leave the disaggregated storage access for the blocks intact.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790175955


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40305/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790036444


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40302/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r586829867



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,12 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")
+      .doc("Mark shuffle file lost when executor is lost. People could set this to false when " +
+       "using remote shuffle services because the shuffle file is not stored on the executor.")

Review comment:
       @holdenk I updated the wording just now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang edited a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang edited a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791810374


   > Just checked Uber RSS (as I know that is open source and once I read its source at a level) where the location is a dummy block manager and the topology info holds the RSS servers list:
   > 
   > https://github.com/uber/RemoteShuffleService/blob/master/src/main/scala/org/apache/spark/shuffle/rss/RssUtils.scala#L56
   > 
   > But I know you know this already ;)
   
   Yes, I added there :) That was kind of a work around to make it work with open source Spark. Still waiting for another PR "[Adding metadata to MapStatus](https://github.com/apache/spark/pull/30004)" gets reviewed. That could provide a better solution for embedding RSS servers list inside MapStatus.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790181705


   > Thank you for your contribution, @hiboyang .
   > 
   > 1. Could you update the documentation together in this PR because this configuration can be misleading to the other config sets?
   > 2. Could you provide a unit test for this new feature please? Without test coverage, your new feature is not protected at all in the future release.
   
   Good suggestions, will do!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790199017


   > Users may need to set up this application config differently across different solutions, e.g. external shuffle service, built-in shuffle service, remote shuffle service, mixed solution, etc. This is somehow low-level Spark behavior, and I'm suspicious it is good to expose it to end users and let them decide the config. It sounds easy to set a improper value.
   > 
   > Can Spark decide to unregister shuffle output automatically? Like based on which shuffle manager is used for shuffle output? or like @attilapiros's idea, to have a property somewhere close to MapStatus?
   
   If Spark decides to unregister shuffle output based on which shuffle manager is used, that requires Spark has knowledge about different shuffle manager implementation. It is hard to implement because user could set any shuffle manager implementation by spark.shuffle.manager.
   
   In terms of "Users may need to set up this application config differently across different solutions", yes, this is the purpose. There are many shuffle solutions as you listed. Current Spark design is pretty good, allowing user to set spark.shuffle.manager with customized class to choose different solution. However, it assumes shuffle file always lost when executor is lost. This assumption conflicts with the customizable shuffle manager design. The new config spark.shuffle.markFileLostOnExecutorLost is to keep that assumption by default, but gives user the option to choose different solution when needed.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790331585


   > Then how about a property somewhere close to `MapStatus`? I guess this config will become sort of mysterious config. As Spark, users also might not have knowledge about how to set this? In Spark, it should know where the shuffle output is kept. So ideally Spark should know if the shuffle output should be unregistered or not. I just don't know if currently Spark provides necessary stuffs for it.
   > 
   > Under mixed solution, this config is also hard to set properly. Should it be set to true or false if the shuffle output could be kept either in fallback storage and executor?
   
   Yeah, I did some further looking following your suggestion. It looks Spark already checks and matches the executor id when it tries to remove map output, like following code. And it could already work well with customized shuffle manager and thrid party remote shuffle service.
   
   ```
     def removeOutputsOnExecutor(execId: String): Unit = withWriteLock {
       logDebug(s"Removing outputs for execId ${execId}")
       removeOutputsByFilter(x => x.executorId == execId)
     }
   ```
   
   Turns out I do not need to add this markFileLostOnExecutorLost configure any more :) Thanks again for your comments!
   
   I will close this PR (after checking other comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790046943


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40302/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dbtsai commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
dbtsai commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-789982170


   Jenkins, add to whitelist. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r588645879



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,15 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")

Review comment:
       @attilapiros - unfortunately there's no specific ETA now, but the team is working on it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790243847


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40308/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790222763


   **[Test build #135726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135726/testReport)** for PR 31715 at commit [`1418e8f`](https://github.com/apache/spark/commit/1418e8f9507375f4fec025d8df357a86cb99ee45).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790209825


   **[Test build #135723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135723/testReport)** for PR 31715 at commit [`de2d2b5`](https://github.com/apache/spark/commit/de2d2b5738347994f6ed764425dab5be23caf4c1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r587717627



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,15 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")

Review comment:
       Cool, that will be interested to know, thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang edited a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang edited a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791810374


   > Just checked Uber RSS (as I know that is open source and once I read its source at a level) where the location is a dummy block manager and the topology info holds the RSS servers list:
   > 
   > https://github.com/uber/RemoteShuffleService/blob/master/src/main/scala/org/apache/spark/shuffle/rss/RssUtils.scala#L56
   > 
   > But I know you know this already ;)
   
   Yes, I added there :) That was kind of a work around to make it work with open source Spark. Still waiting for another PR [Adding metadata to MapStatus](https://github.com/apache/spark/pull/30004)) gets reviewed. That could provide a better solution for embedding RSS servers list inside MapStatus.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r587194708



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,15 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")

Review comment:
       > do you guys modify your internal Spark distribution to make executor lost event not trigger driver to delete map output?
   
   Yeah I think so, I can do a double check with cosco folks as well tomorrow. btw nice to know you as well!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790289938


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135726/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-789362764


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790031470


   **[Test build #135720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135720/testReport)** for PR 31715 at commit [`ddb6d8f`](https://github.com/apache/spark/commit/ddb6d8fd7c1a86f6c68a0ef11f6de8a06dd4ecb2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790129925


   **[Test build #135720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135720/testReport)** for PR 31715 at commit [`ddb6d8f`](https://github.com/apache/spark/commit/ddb6d8fd7c1a86f6c68a0ef11f6de8a06dd4ecb2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790130914


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135720/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790289938


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135726/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790130701


   **[Test build #135723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135723/testReport)** for PR 31715 at commit [`de2d2b5`](https://github.com/apache/spark/commit/de2d2b5738347994f6ed764425dab5be23caf4c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790192646


   Users may need to set up this application config differently across different solutions, e.g. external shuffle service, built-in shuffle service, remote shuffle service, mixed solution, etc. This is somehow low-level Spark behavior, and I'm suspicious it is good to expose it to end users and let them decide the config. It sounds easy to set a improper value.
   
   Can Spark decide to unregister shuffle output automatically? Like based on which shuffle manager is used for shuffle output? or like @attilapiros's idea, to have a property somewhere close to MapStatus?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790247070


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40308/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hiboyang commented on a change in pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
hiboyang commented on a change in pull request #31715:
URL: https://github.com/apache/spark/pull/31715#discussion_r587189709



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -2117,4 +2117,15 @@ package object config {
       // batch of block will be loaded in memory with memory mapping, which has higher overhead
       // with small MB sized chunk of data.
       .createWithDefaultString("3m")
+
+  private[spark] val MARK_FILE_LOST_ON_EXECUTOR_LOST =
+    ConfigBuilder("spark.shuffle.markFileLostOnExecutorLost")

Review comment:
       Hi @c21 , thanks for the comment, and known you from FB! We previously discussed with people from FB about Cosco, but not for this specific issue. Just curious, do you guys modify your internal Spark distribution to make executor lost event not trigger driver to delete map output?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
attilapiros commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-791545570


   Just checked Uber RSS (as I know that is open source and once I read its source at a level) where the location is a dummy block manager and the topology info holds the RSS servers list:
   
   https://github.com/uber/RemoteShuffleService/blob/master/src/main/scala/org/apache/spark/shuffle/rss/RssUtils.scala#L56
   
   But I know you know this already ;)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790196688


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40305/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] attilapiros edited a comment on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
attilapiros edited a comment on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790000190


   I do not think this flag (indicating whether to unregister all the shuffle blocks created by the failed executor) must be set at the application level (the way this PR does). 
   
   I think this must be a property kept for the replicated shuffle block so somewhere close to  `MapStatus`.
    
   Imagine a mixed solution where the disaggregated storage just a fallback and still locally stored blocks can be used. 
   Then it make sense to unregister the shuffle blocks for the failed executors and leave the disaggregated storage access for the blocks intact.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31715: [SPARK-34601][SHUFFLE] Add spark.shuffle.markFileLostOnExecutorLost to not delete shuffle file on executor lost event

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31715:
URL: https://github.com/apache/spark/pull/31715#issuecomment-790191128


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40305/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org