Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2019/11/25 08:51:00 UTC

[jira] [Commented] (SPARK-29176) Optimization should change join type to CROSS

    [ https://issues.apache.org/jira/browse/SPARK-29176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981388#comment-16981388 ] 

Enrico Minack commented on SPARK-29176:
---------------------------------------

This has been discussed on the dev mailing list: [http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-29176-DISCUSS-Optimization-should-change-join-type-to-CROSS-td28263.html#none]

> Optimization should change join type to CROSS
> ---------------------------------------------
>
>                 Key: SPARK-29176
>                 URL: https://issues.apache.org/jira/browse/SPARK-29176
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Enrico Minack
>            Priority: Major
>
> The following query is a valid join but gets optimized into a Cartesian join of {{INNER}} type:
> {code:java}
> import org.apache.spark.sql.functions.lit
> import spark.implicits._
>
> case class Value(id: Int, lower: String, upper: String)
> val values = Seq(Value(1, "one", "ONE")).toDS
> val join = values.join(values.withColumn("id", lit(1)), "id")
> {code}
> Catalyst optimizes this query into an inner join without any join condition, and the {{CheckCartesianProducts}} rule then throws an exception.
> The following rules are involved:
> The {{FoldablePropagation}} rule pushes the constant {{lit(1)}} into the join condition, making it trivial:
> {code:java}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
>  Project [id#3, lower#4, upper#5, lower#12, upper#13]   Project [id#3, lower#4, upper#5, lower#12, upper#13]
> !+- Join Inner, (id#3 = id#7)                           +- Join Inner, (id#3 = 1)
>     :- LocalRelation [id#3, lower#4, upper#5]              :- LocalRelation [id#3, lower#4, upper#5]
>     +- Project [1 AS id#7, lower#12, upper#13]             +- Project [1 AS id#7, lower#12, upper#13]
>        +- LocalRelation [id#11, lower#12, upper#13]           +- LocalRelation [id#11, lower#12, upper#13]
> {code}
> The {{PushPredicateThroughJoin}} rule then pushes this trivial condition down into the other branch of the query tree as a filter, leaving the join with no condition at all:
> {code:java}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  Project [id#3, lower#4, upper#5, lower#12, upper#13]   Project [id#3, lower#4, upper#5, lower#12, upper#13]
> !+- Join Inner, (id#3 = 1)                              +- Join Inner
> !   :- LocalRelation [id#3, lower#4, upper#5]              :- Filter (id#3 = 1)
> !   +- Project [1 AS id#7, lower#12, upper#13]             :  +- LocalRelation [id#3, lower#4, upper#5]
> !      +- LocalRelation [id#11, lower#12, upper#13]        +- Project [1 AS id#7, lower#12, upper#13]
> !                                                             +- LocalRelation [id#11, lower#12, upper#13]
> {code}
> The first rule reduces the join condition to a trivial one, and the second rule removes it altogether. For both plans the optimizer throws an exception: {{Join condition is missing or trivial.}}
> A valid join that becomes trivial or Cartesian during optimization should have its type changed to {{CROSS}}, not keep the join type as it is. For a user it is hard to see from the original query that it represents a Cartesian join, even though a join condition is given. And setting {{spark.sql.crossJoin.enabled=true}} may not be desirable in a complex production environment.
> Do you agree that those optimization rules (and potentially others) need to alter the join type if the condition becomes trivial or removed completely?
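
As a sketch of the workarounds available today (not part of the original report; assumes the same {{values}} Dataset and an active {{SparkSession}} named {{spark}}): one can declare the Cartesian product explicitly via {{crossJoin}}, or relax the check globally via configuration:
{code:java}
// Workaround 1 (sketch): since the right-hand "id" is the constant 1, the
// equi-join is equivalent to filtering the left side and taking an explicit
// Cartesian product, which CheckCartesianProducts accepts.
val right = values.withColumn("id", lit(1))
val explicitCross = values.filter($"id" === 1).crossJoin(right)

// Workaround 2 (sketch): allow Cartesian products globally; note this may
// also mask genuinely unintended cross joins elsewhere in the application.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
val join = values.join(right, "id")
{code}
Neither workaround is satisfying: the first forces the user to hand-optimize the query, and the second disables a useful safety check session-wide, which is why changing the join type during optimization seems preferable.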



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org