You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/01 14:09:49 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #28695: [SPARK-28344][SQL] Check the ambiguous self-join only if there is a join in the plan

HyukjinKwon opened a new pull request #28695:
URL: https://github.com/apache/spark/pull/28695


   ### What changes were proposed in this pull request?
   
   This PR proposes to check `DetectAmbiguousSelfJoin` only if there is `Join` in the plan. Currently, the checking is too strict even to non-join queries.
   
   For example, the codes below don't have join at all but it fails as the ambiguous self-join:
   
   ```scala
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.sum
   val df = Seq(1, 1, 2, 2).toDF("A")
   val w = Window.partitionBy(df("A"))
   df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true)
   ```
   
   It is because `ExtractWindowExpressions` can create a `AttributeReference` with the same metadata but a different expression ID, see:
   
   https://github.com/apache/spark/blob/0fd98abd859049dc3b200492487041eeeaa8f737/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2679
   https://github.com/apache/spark/blob/71c73d58f6e88d2558ed2e696897767d93bac60f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L63
   https://github.com/apache/spark/blob/5945d46c11a86fd85f9e65f24c2e88f368eee01f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L180
   
   Before:
   
   ```
   'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
   +- Relation[A#19] parquet
   ```
   
   After:
   
   ```
   Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
   +- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
      +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19]
         +- Project [A#19 AS X#21, A#19]
            +- Relation[A#19] parquet
   ```
   
   `X#21` holds the same metadata of DataFrame ID and column position with `A#19` but it has a different expression ID which ends up with the checking fails.
   
   ### Why are the changes needed?
   
   To loose the checking and make users not surprised.
   
   ### Does this PR introduce _any_ user-facing change?
   
   It's the changes in unreleased branches only.
   
   ### How was this patch tested?
   
   Manually tested and unittest was added.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28695: [SPARK-28344][SQL] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636882902


   I believe this fix is conservative enough to land `branch-3.0` ...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #28695:
URL: https://github.com/apache/spark/pull/28695


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-637104105






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-637032203


   **[Test build #123377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123377/testReport)** for PR 28695 at commit [`00bbe9d`](https://github.com/apache/spark/commit/00bbe9d3a52ff0bac210de1144ba2121c5a52232).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-637104105






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636884945






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-637033065


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123377/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-637102930


   **[Test build #123381 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123381/testReport)** for PR 28695 at commit [`21e4241`](https://github.com/apache/spark/commit/21e4241a95b7bd7aa01c0257accbf2a017e1afa3).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636925471






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28695:
URL: https://github.com/apache/spark/pull/28695#discussion_r433282226



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala
##########
@@ -137,6 +137,9 @@ class DetectAmbiguousSelfJoin(conf: SQLConf) extends Rule[LogicalPlan] {
           }
           condition.toSeq.flatMap(getAmbiguousAttrs)
 
+        case _ if plan.find(_.isInstanceOf[Join]).isEmpty =>
+          Nil  // If there's no join, there's no self-join.

Review comment:
       Yes, we can. I wanted to make a minimised fix with running the metadata cleanup logic but maybe I was worried too much.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636924994


   **[Test build #123381 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123381/testReport)** for PR 28695 at commit [`21e4241`](https://github.com/apache/spark/commit/21e4241a95b7bd7aa01c0257accbf2a017e1afa3).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-637197790


   Thanks @cloud-fan and @dongjoon-hyun.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636884287


   **[Test build #123377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123377/testReport)** for PR 28695 at commit [`00bbe9d`](https://github.com/apache/spark/commit/00bbe9d3a52ff0bac210de1144ba2121c5a52232).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636884287


   **[Test build #123377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123377/testReport)** for PR 28695 at commit [`00bbe9d`](https://github.com/apache/spark/commit/00bbe9d3a52ff0bac210de1144ba2121c5a52232).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636884945






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636924994


   **[Test build #123381 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123381/testReport)** for PR 28695 at commit [`21e4241`](https://github.com/apache/spark/commit/21e4241a95b7bd7aa01c0257accbf2a017e1afa3).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-637033052


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-637033052






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28695:
URL: https://github.com/apache/spark/pull/28695#discussion_r433273969



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala
##########
@@ -137,6 +137,9 @@ class DetectAmbiguousSelfJoin(conf: SQLConf) extends Rule[LogicalPlan] {
           }
           condition.toSeq.flatMap(getAmbiguousAttrs)
 
+        case _ if plan.find(_.isInstanceOf[Join]).isEmpty =>
+          Nil  // If there's no join, there's no self-join.

Review comment:
       can we do it at the beginning to make it clear?
   ```
   if (!conf.getConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED)) return plan
   // ...
   if (plan.find(_.isInstanceOf[Join]).isEmpty) return plan
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28695: [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28695:
URL: https://github.com/apache/spark/pull/28695#issuecomment-636925471






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org