Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2019/11/25 08:51:00 UTC
[jira] [Commented] (SPARK-29176) Optimization should change join type to CROSS
[ https://issues.apache.org/jira/browse/SPARK-29176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981388#comment-16981388 ]
Enrico Minack commented on SPARK-29176:
---------------------------------------
This has been discussed on the dev mailing list: [http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-29176-DISCUSS-Optimization-should-change-join-type-to-CROSS-td28263.html#none]
> Optimization should change join type to CROSS
> ---------------------------------------------
>
> Key: SPARK-29176
> URL: https://issues.apache.org/jira/browse/SPARK-29176
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2
> Reporter: Enrico Minack
> Priority: Major
>
> The following query is a valid join but gets optimized into a Cartesian join of {{INNER}} type:
> {code:java}
> import org.apache.spark.sql.functions.lit
> import spark.implicits._
>
> case class Value(id: Int, lower: String, upper: String)
> val values = Seq(Value(1, "one", "ONE")).toDS
> val join = values.join(values.withColumn("id", lit(1)), "id")
> {code}
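Independent of Spark, the shape of this join can be sketched in plain Python (a toy model with tuples standing in for Dataset rows, not Spark API): because every right-hand row carries the constant key 1, the equi-join on {{id}} only constrains the left side, so it is equivalent to a filter on the left followed by a Cartesian product.

```python
# Toy model of the query above (assumption: plain tuples stand in for rows;
# this is not Spark code).
values = [(1, "one", "ONE")]

# values.withColumn("id", lit(1)): overwrite every id with the constant 1.
right = [(1, lower, upper) for (_id, lower, upper) in values]

# Equi-join on id. Since right rows all have id == 1, the join condition only
# constrains the left side: filter the left, then take the cross product.
joined = [(l, r) for l in values for r in right if l[0] == r[0]]
cross_of_filtered = [(l, r) for l in values if l[0] == 1 for r in right]

assert joined == cross_of_filtered
```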
> Catalyst optimizes this query into an inner join without any join condition; the {{CheckCartesianProducts}} rule then throws an exception.
> The following rules are involved:
> The {{FoldablePropagation}} rule propagates the constant {{lit(1)}} into the join condition, making it trivial:
> {code:java}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
> Before:
> Project [id#3, lower#4, upper#5, lower#12, upper#13]
> +- Join Inner, (id#3 = id#7)
>    :- LocalRelation [id#3, lower#4, upper#5]
>    +- Project [1 AS id#7, lower#12, upper#13]
>       +- LocalRelation [id#11, lower#12, upper#13]
>
> After:
> Project [id#3, lower#4, upper#5, lower#12, upper#13]
> +- Join Inner, (id#3 = 1)
>    :- LocalRelation [id#3, lower#4, upper#5]
>    +- Project [1 AS id#7, lower#12, upper#13]
>       +- LocalRelation [id#11, lower#12, upper#13]
> {code}
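The effect of {{FoldablePropagation}} can be sketched abstractly (a hypothetical toy representation, not Catalyst's classes): attribute references whose values are known to be foldable constants are substituted into the join condition.

```python
# Toy model of foldable propagation (hypothetical names, not Catalyst code).
# The right branch projects `1 AS id#7`, so id#7 is a known constant.
foldables = {"id#7": 1}

def propagate_foldables(condition, foldables):
    """Replace attribute references by their constant values where known.

    A condition is a pair (left, right) meaning `left = right`.
    """
    left, right = condition
    return (foldables.get(left, left), foldables.get(right, right))

# (id#3 = id#7) becomes (id#3 = 1): the condition no longer relates both sides.
assert propagate_foldables(("id#3", "id#7"), foldables) == ("id#3", 1)
```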
> The {{PushPredicateThroughJoin}} rule pushes this condition down into the left branch of the query tree as a filter, leaving the join without any condition:
> {code:java}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> Before:
> Project [id#3, lower#4, upper#5, lower#12, upper#13]
> +- Join Inner, (id#3 = 1)
>    :- LocalRelation [id#3, lower#4, upper#5]
>    +- Project [1 AS id#7, lower#12, upper#13]
>       +- LocalRelation [id#11, lower#12, upper#13]
>
> After:
> Project [id#3, lower#4, upper#5, lower#12, upper#13]
> +- Join Inner
>    :- Filter (id#3 = 1)
>    :  +- LocalRelation [id#3, lower#4, upper#5]
>    +- Project [1 AS id#7, lower#12, upper#13]
>       +- LocalRelation [id#11, lower#12, upper#13]
> {code}
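The pushdown step can likewise be sketched as a toy model (hypothetical representation, not Catalyst code): conjuncts that reference only attributes of one child are moved below the join as a filter on that child, and whatever remains becomes the new join condition. Here nothing remains.

```python
# Toy model of predicate pushdown through a join (hypothetical names).
left_attrs = {"id#3", "lower#4", "upper#5"}

def push_through_join(conjuncts, left_attrs):
    """Split conjuncts into filters pushable to the left child and the rest.

    Each conjunct is a pair (left, right) meaning `left = right`; strings are
    attribute references, anything else is a literal.
    """
    def refs(pred):
        return {t for t in pred if isinstance(t, str)}
    pushed = [p for p in conjuncts if refs(p) <= left_attrs]
    remaining = [p for p in conjuncts if refs(p) - left_attrs]
    return pushed, remaining

# After FoldablePropagation the only conjunct is (id#3 = 1): left-side only.
pushed, remaining = push_through_join([("id#3", 1)], left_attrs)
assert pushed == [("id#3", 1)]
assert remaining == []  # the join is left without any condition
```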
> The first rule reduces the join condition to a trivial one, the second removes it altogether. In both cases the {{CheckCartesianProducts}} rule rejects the optimized plan with the exception: {{Join condition is missing or trivial.}}
> A valid join that becomes trivial or Cartesian during optimization should have its join type changed to {{CROSS}} rather than keeping the original type. From the original query it is hard for a user to see that it represents a Cartesian join, since a join condition is given. And setting {{spark.sql.crossJoin.enabled=true}} globally may not be desirable in a complex production environment.
> Do you agree that those optimization rules (and potentially others) need to alter the join type if the condition becomes trivial or removed completely?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)