You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/09 10:04:21 UTC

[GitHub] [spark] WangGuangxin opened a new pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

WangGuangxin opened a new pull request #35460:
URL: https://github.com/apache/spark/pull/35460


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   When we do shuffle on indeterminate expressions such as `rand`, and ShuffleFetchFailed happend, we may get incorrent result since it only retries failed map tasks.
   
   To illustrate this, suppose we have a dataset with two columns `(range(1, 5) as a, rand() as b)`, we shuffle by `b` using two map tasks and two reduce tasks.
   <img width="630" alt="image" src="https://user-images.githubusercontent.com/1312321/153171110-13887a78-565f-4cdf-afda-bb9053ea7ef0.png">
   
   **When there is a fetch failed and we need to rerun map task 2,  the generated partitions maybe different compared with last attempt, and finally we get a duplicate record with a = 4**
   <img width="610" alt="image" src="https://user-images.githubusercontent.com/1312321/153172262-bf5e2c81-8db6-4b01-8b70-8f284b54b09d.png">
    
   This is very similary to the bug in Repartition+Shuffle, which is fixed by https://github.com/apache/spark/pull/22112. 
   This PR try to fix this by reuse current machenism.
   
   
   ### Why are the changes needed?
   Fix data inconsistent issue.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WangGuangxin commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1041421374


   > If there is a fetch failure and the parent stage is `INDETERMINATE`, both the parent and child stage are recomputed. Custom RDD's can extend `getOutputDeterministicLevel` and return the right `DeterministicLevel`. See #22112 for more details
   
   Thanks for your reference. That's what I do in this MR for SparkSQL. 
   When we shuffle by rand with sqls like `distributed by rand`, the RDD's DeterministicLevel generated by SparkSQL is `INDETERMINATE` after this MR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WangGuangxin commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1040409122


   > Isn't this why you shouldn't partition, shuffle, etc on some random value? use a hash?
   
   The data analyst may always have various needs such as `distributed by rand()` to redistribute data evenly
   or `select * from
       (select concat(key1, rand()) as key1 from tbl1) a
      join
       (select key2 from tbl2) b
     on a.key1 = b.key2` to work around skew data,  which is a valid SQL in Spark. 
   
   Both of these sqls will generate a `HashPartitioning` with non-deterministic expressions.
   
   If we don't support shuffle by random value, we should disable this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm edited a comment on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
mridulm edited a comment on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1041221563


   If there is a fetch failure and the parent stage is INDETERMINATE, both the parent and child stage are recomputed.
   Custom RDD's can extend `getOutputDeterministicLevel` and return the right `DeterministicLevel`.
   See #22112 for more details


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WangGuangxin commented on a change in pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on a change in pull request #35460:
URL: https://github.com/apache/spark/pull/35460#discussion_r813638411



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
##########
@@ -401,6 +405,29 @@ object ShuffleExchangeExec {
     dependency
   }
 
+  /**
+   * Checks if the shuffle partitioning contains indeterminate expression/reference.
+   */
+  private def isPartitioningIndeterminate(partitioning: Partitioning, plan: SparkPlan): Boolean = {

Review comment:
       ok, I'll first try to add a column level nondeterministic information before this pr




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1040413396


   Right, shouldn't we reject it? distributing by "hash(ID)" or similar makes more sense, not least of which because it is reproducible and deterministic across runs and environments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WangGuangxin commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1034446963


   @cloud-fan @maropu Can you please help review this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1041992156


   I was not referring to the SQL changes per se @WangGuangxin, will let @srowen or @cloud-fan, etc review that.
   Specifically for changes in core, there is already a means to provide deterministic level - we dont need `isPartitionKeyIndeterminate`, related changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35460:
URL: https://github.com/apache/spark/pull/35460#discussion_r811635873



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
##########
@@ -401,6 +405,29 @@ object ShuffleExchangeExec {
     dependency
   }
 
+  /**
+   * Checks if the shuffle partitioning contains indeterminate expression/reference.
+   */
+  private def isPartitioningIndeterminate(partitioning: Partitioning, plan: SparkPlan): Boolean = {

Review comment:
       I think we need to build a framework to properly propagate the column-level nondeterministic information. This function looks quite fragile




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan removed a comment on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
cloud-fan removed a comment on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1047487230


   This is really a hard problem and rerunning the entire stage is more of a compromise. In a large enough cluster, we may always see task failures when running a stage, and rerunning the entire stage may never succeed. That's why in Spark SQL, we don't really rely on the `DeterministicLevel` framework, but by default we sort before repartition to fix the correctness issue.
   
   I think we should either have reliable shuffle storage (AFAIK there are several third-party remote shuffle services) so that fetch failure never happens, or we reject such queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WangGuangxin commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1035899786


   also cc @srowen @viirya @yaooqinn I found there is a similar report before https://issues.apache.org/jira/browse/SPARK-24607


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1040377882


   Isn't this why you shouldn't partition, shuffle, etc on some random value? use a hash?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
mridulm commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1041221563


   If there is a fetch failure and the parent stage is INDETERMINATE, both the parent and child stage are entirely recomputed .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35460:
URL: https://github.com/apache/spark/pull/35460#discussion_r811638706



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
##########
@@ -401,6 +405,29 @@ object ShuffleExchangeExec {
     dependency
   }
 
+  /**
+   * Checks if the shuffle partitioning contains indeterminate expression/reference.
+   */
+  private def isPartitioningIndeterminate(partitioning: Partitioning, plan: SparkPlan): Boolean = {

Review comment:
       For example, `Filter(rand_cond, Project(a, b, c, ...))`. I think all the columns are nondeterministic after `Filter`, even though attributes `a`, `b` and `c` are deterministic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WangGuangxin commented on a change in pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on a change in pull request #35460:
URL: https://github.com/apache/spark/pull/35460#discussion_r812533440



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
##########
@@ -401,6 +405,29 @@ object ShuffleExchangeExec {
     dependency
   }
 
+  /**
+   * Checks if the shuffle partitioning contains indeterminate expression/reference.
+   */
+  private def isPartitioningIndeterminate(partitioning: Partitioning, plan: SparkPlan): Boolean = {

Review comment:
       You mean the QueryPlan's `deterministic`?  https://github.com/apache/spark/pull/34470




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1034554440


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm edited a comment on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
mridulm edited a comment on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1041221563


   If there is a fetch failure and the parent stage is INDETERMINATE, both the parent and child stage are entirely recomputed.
   Custom RDD's can extend `getOutputDeterministicLevel` and return the right `DeterministicLevel`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WangGuangxin commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1040409122


   > Isn't this why you shouldn't partition, shuffle, etc on some random value? use a hash?
   
   The data analyst may always have various needs such as `distributed by rand()` to redistribute data evenly
   or `select * from
       (select concat(key1, rand()) as key1 from tbl1) a
      join
       (select key2 from tbl2) b
     on a.key1 = b.key2` to work around skew data,  which is a valid SQL in Spark. 
   
   Both of these sqls will generate a `HashPartitioning` with non-deterministic expressions.
   
   If we don't support shuffle by random value, we should disable this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1040377882






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1047487230


   This is really a hard problem and rerunning the entire stage is more of a compromise. In a large enough cluster, we may always see task failures when running a stage, and rerunning the entire stage may never succeed. That's why in Spark SQL, we don't really rely on the `DeterministicLevel` framework, but by default we sort before repartition to fix the correctness issue.
   
   I think we should either have reliable shuffle storage (AFAIK there are several third-party remote shuffle services) so that fetch failure never happens, or we reject such queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] WangGuangxin commented on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
WangGuangxin commented on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1041113248


   > Right, shouldn't we reject it? distributing by "hash(ID)" or similar makes more sense, not least of which because it is reproducible and deterministic across runs and environments
   
   Reject it maybe not a good idea.
   1. Both Hive/Presto support patterns like `distributed by rand` or `Join/GroupBy by rand`. And seems Spark is intent to support `groupby by rand`, refer https://github.com/apache/spark/pull/16404.  And also some udfs or java_methods is marked as indeterminated, if we reject it, it means users cannot join/groupby a column generated by udf/java_methods.
   2. The root cause of data inconsist when shuffle by rand expression is Spark only retry partially map tasks when shuffle fetch failed. If we retry the whole stage, there is no problem. We can utilize current logic in DAGScheduler to achieve this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm edited a comment on pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
mridulm edited a comment on pull request #35460:
URL: https://github.com/apache/spark/pull/35460#issuecomment-1041221563


   If there is a fetch failure and the parent stage is `INDETERMINATE`, both the parent and child stage are recomputed.
   Custom RDD's can extend `getOutputDeterministicLevel` and return the right `DeterministicLevel`.
   See #22112 for more details


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #35460: [SPARK-38160][SQL] Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #35460:
URL: https://github.com/apache/spark/pull/35460#discussion_r812620485



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
##########
@@ -401,6 +405,29 @@ object ShuffleExchangeExec {
     dependency
   }
 
+  /**
+   * Checks if the shuffle partitioning contains indeterminate expression/reference.
+   */
+  private def isPartitioningIndeterminate(partitioning: Partitioning, plan: SparkPlan): Boolean = {

Review comment:
       that's plan level, not column level. We need something more fine-grained




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org