Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/16 15:35:06 UTC

[GitHub] [spark] EnricoMi opened a new pull request, #38676: [SPARK-41162][SQL] Do not push down join predicate that are ambiguous to both sides

EnricoMi opened a new pull request, #38676:
URL: https://github.com/apache/spark/pull/38676

   ### What changes were proposed in this pull request?
   Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an `Aggregate` when the translated join conditions (cf. `aliasMap`) become ambiguous w.r.t. both join sides.
   
   ### Why are the changes needed?
   The following example fails with `distinct()` and succeeds without it, even though both queries are semantically equivalent.
   
   ```scala
   val ids = Seq(1, 2, 3).toDF("id").distinct()
   val result = ids.withColumn("id", $"id" + 1).join(ids, "id", "left_anti").collect()
   assert(result.length == 1)
   ```
   
   With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition `(id#750 + 1) = id#750`, which can never be true. This effectively removes the anti-join.
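   
   A minimal, Spark-free sketch of that rewrite (hypothetical toy types, for illustration only; the real rule substitutes Catalyst expressions through `aliasMap`):
   
   ```scala
   // Toy stand-ins for Catalyst attributes and expressions (hypothetical).
   case class Attr(name: String, exprId: Int)
   
   sealed trait Expr
   case class Ref(attr: Attr) extends Expr
   case class Add(child: Expr, increment: Int) extends Expr
   case class EqualTo(left: Expr, right: Expr) extends Expr
   
   // Substitute aliases into an expression, analogous to rewriting the join
   // condition through the Aggregate's alias map.
   def replaceAlias(e: Expr, aliasMap: Map[Attr, Expr]): Expr = e match {
     case Ref(a) if aliasMap.contains(a) => aliasMap(a)
     case Add(c, i)                      => Add(replaceAlias(c, aliasMap), i)
     case EqualTo(l, r)                  => EqualTo(replaceAlias(l, aliasMap), replaceAlias(r, aliasMap))
     case other                          => other
   }
   
   val idBelowAgg = Attr("id", 750)   // id#750 produced below the Aggregate
   val aliasedId  = Attr("id", 6)     // the id column introduced by withColumn("id", $"id" + 1)
   val aliasMap   = Map[Attr, Expr](aliasedId -> Add(Ref(idBelowAgg), 1))
   
   // The join condition aliasedId = id#750, rewritten through the alias map:
   val pushed = replaceAlias(EqualTo(Ref(aliasedId), Ref(idBelowAgg)), aliasMap)
   // => EqualTo(Add(Ref(id#750), 1), Ref(id#750)), i.e. (id#750 + 1) = id#750, which is never true.
   ```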
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it fixes a correctness issue.
   
   ### How was this patch tested?
   Unit test.


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down join predicate that are ambiguous to both sides

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1317224629

   Note: the `Window`, `Union` and `UnaryNode` cases in `PushDownLeftSemiAntiJoin` might be affected as well and should also be tested in `LeftSemiAntiJoinPushDownSuite`.


[GitHub] [spark] shardulm94 commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1356030772

   I tried looking into this a bit.
   > @EnricoMi @cloud-fan Could we fix `DeduplicateRelations`? It did not generate different expression IDs for all conflicting attributes:
   
   As @EnricoMi said, `DeduplicateRelations` only considers the output attributes of the left and right sides, which do not conflict here. Also, the `Project` case in `PushDownLeftSemiAntiJoin` calls [this method](https://github.com/apache/spark/blob/45d9daa2ecf6081ef1d031065a9c0e9a3a7f7a58/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala#L119), which appears to check for the self-join case based on conflicting expression IDs. This makes me believe the duplicate expression IDs are expected here, and hence `DeduplicateRelations` may not be at fault.
   
   Similar to the `Project` case, should we add a check like `canPushThroughCondition(Seq(agg.child), joinCond, rightOp)` to ensure that it is safe to push the join below an `Aggregate` node too?
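   
   A rough, hedged sketch of what such a guard checks (the name `safeToPushBelow` is hypothetical; `canPushThroughCondition` in `PushDownLeftSemiAntiJoin` is the real reference): the join must not be pushed when the condition references an attribute that both the plan being pushed through and the right-hand side expose, i.e. a hidden self-join on the same expression IDs.
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Expression}
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   
   // Hypothetical paraphrase of the guard, not the actual Spark source.
   def safeToPushBelow(
       plans: Seq[LogicalPlan],
       condition: Option[Expression],
       rightOp: LogicalPlan): Boolean = {
     // Attributes produced by the side(s) we want to push the join into.
     val leftAttrs = AttributeSet(plans.flatMap(_.output))
     // Unsafe if the condition touches an attribute exposed by both sides.
     condition.forall(_.references.intersect(rightOp.outputSet).intersect(leftAttrs).isEmpty)
   }
   ```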


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1318255174

   @wangyum @cloud-fan I would appreciate your suggestions on how to test this bug in `LeftSemiAntiJoinPushDownSuite` (see https://github.com/apache/spark/pull/38676#issuecomment-1317220559).


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1320300231

   The problem is that `DeduplicateRelations` only considers duplicates between the left `output` and the right `output`, not duplicates between the left `references` and the right `output`. I have sketched a fix for `Join` and `LateralJoin`, including a proper test.
   
   This could potentially be done for all operators specifically handled in `DeduplicateRelations.apply()`, i.e. `AsOfJoin`, `Intersect`, `Except`, `Union` and `MergeIntoTable`.
   
   Deduplicating attributes that are already referenced elsewhere would break the plan, because those references would no longer resolve.
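   
   A hedged illustration of the distinction (assumes Catalyst's `LogicalPlan` API; these helpers are hypothetical, not the actual `DeduplicateRelations` code):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.AttributeSet
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   
   // What the rule compares today: output attributes of both join children.
   def outputConflicts(left: LogicalPlan, right: LogicalPlan): AttributeSet =
     left.outputSet.intersect(right.outputSet)
   
   // What this bug additionally needs: attributes the left side merely
   // references (e.g. the id below the Project) colliding with the right output.
   def referenceConflicts(left: LogicalPlan, right: LogicalPlan): AttributeSet =
     left.references.intersect(right.outputSet)
   ```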


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1371192971

   Closed in favour of #39131.


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1358011350

   @shardulm94 you are right, `canPushThroughCondition` already guards `Project` and `Union` against this situation, so extending it to `Aggregate` is the natural way to fix this. It is the smallest possible change that addresses the issue.
   
   I have created #39131 to keep that fix separate from this approach, which tries to solve the issue through `DeduplicateRelations`. I will close this PR if the other one gets merged.
   
   Thanks for the pointer!


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1323236232

   @wangyum @cloud-fan I am not sure if this is the right approach to fix `DeduplicateRelations`. Please advise.
   
   The problem is that `DeduplicateRelations` only considers duplicates between the left `output` and the right `output`, but this situation is caused by duplicates between the left `references` and the right `output`.


[GitHub] [spark] AmplabJenkins commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1320929902

   Can one of the admins verify this patch?


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1346167823

   @wangyum @cloud-fan do you consider this issue a correctness bug?


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1319839395

   > Could we fix the `DeduplicateRelations`?
   
   Interesting, that sounds like a better solution. I'll look into it.


[GitHub] [spark] EnricoMi commented on a diff in pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on code in PR #38676:
URL: https://github.com/apache/spark/pull/38676#discussion_r1026695873


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##########
@@ -1938,7 +1940,10 @@ case class LateralJoin(
     joinType: JoinType,
     condition: Option[Expression]) extends UnaryNode {
 
-  require(Seq(Inner, LeftOuter, Cross).contains(joinType),
+  require(Seq(Inner, LeftOuter, Cross).contains(joinType match {

Review Comment:
   This is only needed by `sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/DeduplicateRelationsSuite.scala`:
   
       val originalQuery = left.lateralJoin(right, UsingJoin(Inner, Seq("a")))



[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down join predicate that are ambiguous to both sides

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1317220559

   I did not manage to test this in `LeftSemiAntiJoinPushDownSuite`, which would be preferable.
   
   My approach is:
   ```scala
     test("Aggregate: LeftAnti join no pushdown on ambiguity") {
       val relation = testRelation
         .groupBy($"b")($"b", sum($"c").as("sum"))
       val relationPlusOne = relation.select(($"b" + 1).as("b"))
   
       val originalQuery = relationPlusOne
         .join(relation, joinType = LeftAnti, usingColumns = Seq("b"))
   
       val optimized = Optimize.execute(originalQuery.analyze)
       comparePlans(optimized, originalQuery.analyze)
     }
   ```
   
   This creates the plan:
   ```
   Project [b#7]
   +- Join LeftAnti, (b#7 = b#12)
      :- Project [(b#1 + 1) AS b#7]
      :  +- Aggregate [b#1], [b#1, sum(c#2) AS sum#6L]
      :     +- LocalRelation <empty>, [a#0, b#1, c#2]
      +- Aggregate [b#12], [b#12, sum(c#13) AS sum#6L]
         +- LocalRelation <empty>, [a#11, b#12, c#13]
   ```
   while this plan would be required to expose the bug (both `Aggregate` nodes have to share identical attribute references):
   ```
   Project [b#7]
   +- Join LeftAnti, (b#7 = b#1)
      :- Project [(b#1 + 1) AS b#7]
      :  +- Aggregate [b#1], [b#1, sum(c#2) AS sum#6L]
      :     +- LocalRelation <empty>, [a#0, b#1, c#2]
      +- Aggregate [b#1], [b#1, sum(c#2) AS sum#6L]
         +- LocalRelation <empty>, [a#0, b#1, c#2]
   ```
   


[GitHub] [spark] EnricoMi closed pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi closed pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous
URL: https://github.com/apache/spark/pull/38676


[GitHub] [spark] EnricoMi commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
EnricoMi commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1335354876

   @wangyum @cloud-fan what do you think about my approach? Do you have a suggestion for a better strategy?


[GitHub] [spark] wangyum commented on pull request #38676: [SPARK-41162][SQL] Do not push down anti-join predicates that become ambiguous

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #38676:
URL: https://github.com/apache/spark/pull/38676#issuecomment-1319657140

   @EnricoMi @cloud-fan Could we fix `DeduplicateRelations`? It did not generate different expression IDs for all conflicting attributes:
   ```
   === Applying Rule org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
    Join LeftSemi                         Join LeftSemi
    :- Project [(id#4 + 1) AS id#6]       :- Project [(id#4 + 1) AS id#6]
    :  +- Deduplicate [id#4]              :  +- Deduplicate [id#4]
    :     +- Project [value#1 AS id#4]    :     +- Project [value#1 AS id#4]
    :        +- LocalRelation [value#1]   :        +- LocalRelation [value#1]
    +- Deduplicate [id#4]                 +- Deduplicate [id#4]
   !   +- Project [value#1 AS id#4]          +- Project [value#8 AS id#4]
   !      +- LocalRelation [value#1]            +- LocalRelation [value#8]
   ```

