You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "warrenzhu25 (via GitHub)" <gi...@apache.org> on 2023/05/07 19:29:12 UTC

[GitHub] [spark] warrenzhu25 opened a new pull request, #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

warrenzhu25 opened a new pull request, #41083:
URL: https://github.com/apache/spark/pull/41083

   ### What changes were proposed in this pull request?
   Add the config `spark.files.fetchFailure.unRegisterOutputThreshold` to control the number of fetch failed failures needed for one specific executor to unregister map output on this executor and allow to disable unregister.
   
   ### Why are the changes needed?
   Spark will unregister map output on the executor when fetch failed from this executor. This might be too aggressive when fetch failed is temporary and recoverable, especially when re-computation is more expensive than retry failed fetch.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Added the config `spark.files.fetchFailure.unRegisterOutputThreshold`
   
   ### How was this patch tested?
   Added test in `DAGSchedulerSuite`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #41083:
URL: https://github.com/apache/spark/pull/41083#discussion_r1186909561


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2516,4 +2516,13 @@ package object config {
       .version("3.5.0")
       .booleanConf
       .createWithDefault(false)
+
+  private[spark] val UNREGISTER_OUTPUT_THRESHOLD_ON_FETCH_FAILURE =
+    ConfigBuilder("spark.files.fetchFailure.unRegisterOutputThreshold")

Review Comment:
   If you want to use `spark.files.fetchFailure.` namespace, could you move this after the existing config?
   
   https://github.com/apache/spark/blob/1e090a57f0c6aa342c032130bbf86f8a152b9fee/core/src/main/scala/org/apache/spark/internal/config/package.scala#L942-L949



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1545904179

   @dongjoon-hyun any comments on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1546807210

   How are you observing recoverable fetch failures ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1537523100

   @dongjoon-hyun @mridulm Help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on a diff in pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on code in PR #41083:
URL: https://github.com/apache/spark/pull/41083#discussion_r1186926609


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2516,4 +2516,13 @@ package object config {
       .version("3.5.0")
       .booleanConf
       .createWithDefault(false)
+
+  private[spark] val UNREGISTER_OUTPUT_THRESHOLD_ON_FETCH_FAILURE =
+    ConfigBuilder("spark.files.fetchFailure.unRegisterOutputThreshold")

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1694529843

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1548009051

   > How are you observing recoverable fetch failures ?
   
   I have seen 2 cases when target executor has busy shuffle fetch and upload due to shuffle migration:
   1. All Netty request handler threads are exhausted, it'll throw `TimeoutException` when creating connection to target executor took longer than `connectTimeout`.
   2. `IdleStateHandler` will close connection when fetch request took longer than `requestTimeout` to receive response.
   
   Both cases are recoverable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1550571606

   These looks like things which can be handled by appropriate configuration tuning ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "warrenzhu25 (via GitHub)" <gi...@apache.org>.
warrenzhu25 commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1552141121

   > These looks like things which can be handled by appropriate configuration tuning ? The PR itself requires a bit more work if that is not a feasible direction (efficient cleanup, handling corner cases, etc).
   
   Case 2 can't be handled by existing config, there'll be other similar recoverable cases. Generally speaking, I think unregister all map output when fetch failed is too aggressive. So it's better to have config to control or disable such behavior. If the executor is really dead, the map output will be unregistered 
   when removing executor, if executor is just experiencing temporary and recoverable hiccup, then unregister is too expensive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #41083:
URL: https://github.com/apache/spark/pull/41083#issuecomment-1553434747

   > Case 2 can't be handled by existing config, there'll be other similar recoverable cases
   
   Did you try increasing idle timeout ?
   The behavior is specific to the environment application is running in - where executors are unable to respond to shuffle requests for more than 2 minutes: this is a tuning or deployment issue.
   
   > Generally speaking, I think unregister all map output when fetch failed is too aggressive.
   
   As described, this is a case of not appropriately configuring spark for the load/cluster characteristics.
   For example, in our internal env, the network timeout is set to a significantly higher value than the default 120s due to a variety of factors - the default 2mins would result in failures (including this specific shuffle issue mentioned).
   
   This proposed change would complicate the way we reason about when shuffle data is lost - and I am hesitant about this if it is something that can be mitigated with appropriate tuning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #41083: [SPARK-43399][CORE] Add config to control threshold of unregister map ouput when fetch failed
URL: https://github.com/apache/spark/pull/41083


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org