Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2019/09/19 08:30:00 UTC
[jira] [Created] (SPARK-29176) Optimization should change join type to CROSS
Enrico Minack created SPARK-29176:
-------------------------------------
Summary: Optimization should change join type to CROSS
Key: SPARK-29176
URL: https://issues.apache.org/jira/browse/SPARK-29176
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.2
Reporter: Enrico Minack
The following query is a valid join but gets optimized into a Cartesian join of {{INNER}} type:
{code:java}
import org.apache.spark.sql.functions.lit
import spark.implicits._

case class Value(id: Int, lower: String, upper: String)

val values = Seq(Value(1, "one", "ONE")).toDS
val join = values.join(values.withColumn("id", lit(1)), "id")
{code}
Catalyst optimizes this query into an inner join without any join condition, and the {{CheckCartesianProducts}} rule then throws an exception.
The following rules are involved:
The {{FoldablePropagation}} rule pushes the constant {{lit(1)}} into the join condition, making it trivial:
{code:java}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
 Project [id#3, lower#4, upper#5, lower#12, upper#13]    Project [id#3, lower#4, upper#5, lower#12, upper#13]
!+- Join Inner, (id#3 = id#7)                            +- Join Inner, (id#3 = 1)
    :- LocalRelation [id#3, lower#4, upper#5]               :- LocalRelation [id#3, lower#4, upper#5]
    +- Project [1 AS id#7, lower#12, upper#13]              +- Project [1 AS id#7, lower#12, upper#13]
       +- LocalRelation [id#11, lower#12, upper#13]            +- LocalRelation [id#11, lower#12, upper#13]
{code}
The {{PushPredicateThroughJoin}} rule pushes this join condition into the other branch of the query tree as a filter, removing the only join condition:
{code:java}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
 Project [id#3, lower#4, upper#5, lower#12, upper#13]    Project [id#3, lower#4, upper#5, lower#12, upper#13]
!+- Join Inner, (id#3 = 1)                               +- Join Inner
!   :- LocalRelation [id#3, lower#4, upper#5]               :- Filter (id#3 = 1)
!   +- Project [1 AS id#7, lower#12, upper#13]              :  +- LocalRelation [id#3, lower#4, upper#5]
!      +- LocalRelation [id#11, lower#12, upper#13]         +- Project [1 AS id#7, lower#12, upper#13]
!                                                              +- LocalRelation [id#11, lower#12, upper#13]
{code}
The first rule reduces the join condition to a trivial one; the second rule removes the condition altogether. For both plans, {{CheckCartesianProducts}} throws an exception: {{Join condition is missing or trivial.}}
A valid join that becomes trivial or Cartesian during optimization should have its join type changed to {{CROSS}} rather than kept as is. From the original query it is hard for a user to see that it represents a Cartesian join, even though a join condition is given. And setting {{spark.sql.crossJoin.enabled=true}} may not be desirable in a complex production environment.
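As a sketch of a per-query workaround (untested here; it assumes the {{values}} Dataset from the example above, {{import spark.implicits._}}, and {{Dataset.crossJoin}}, which is available since Spark 2.1), the Cartesian intent can be stated explicitly instead of enabling the flag globally:

```scala
// Hedged sketch: express the Cartesian product explicitly so that
// CheckCartesianProducts accepts the plan without setting
// spark.sql.crossJoin.enabled. Roughly equivalent to the optimized
// plan: rows of `values` with id == 1, crossed with all right rows.
val left  = values.filter($"id" === 1)          // the folded join condition
val right = values.withColumn("id", lit(1))
val join  = left.crossJoin(right.drop("id"))
```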
Do you agree that those optimization rules (and potentially others) need to alter the join type if the condition becomes trivial or is removed completely?
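To make the proposal concrete, here is a minimal, Spark-free sketch of the intended behaviour. The types below ({{JoinType}}, {{Expr}}, {{Join}}, {{fixJoinType}}) are simplified stand-ins invented for illustration, not Catalyst's actual classes: after a rule fires, an inner join whose condition has become missing or trivial would be rewritten to an explicit cross join instead of staying {{INNER}}.

```scala
// Illustrative model only: simplified stand-ins for Catalyst's join
// types and expressions, not Spark's real API.
sealed trait JoinType
case object Inner extends JoinType
case object Cross extends JoinType

sealed trait Expr
case class Lit(value: Any) extends Expr            // a foldable constant
case class Attr(name: String) extends Expr         // a column reference
case class EqualTo(l: Expr, r: Expr) extends Expr

case class Join(joinType: JoinType, condition: Option[Expr])

// A condition counts as trivial once one side has folded to a
// literal, e.g. (id#3 = 1) after FoldablePropagation.
def isTrivial(cond: Expr): Boolean = cond match {
  case EqualTo(_: Lit, _) => true
  case EqualTo(_, _: Lit) => true
  case _: Lit             => true
  case _                  => false
}

// Proposed fix-up after each rule: an INNER join whose condition
// became missing or trivial is rewritten to an explicit CROSS join.
def fixJoinType(j: Join): Join = j match {
  case Join(Inner, None)                    => j.copy(joinType = Cross)
  case Join(Inner, Some(c)) if isTrivial(c) => j.copy(joinType = Cross)
  case _                                    => j
}
```

With such a fix-up, both the trivial join produced by {{FoldablePropagation}} and the condition-less join produced by {{PushPredicateThroughJoin}} would reach {{CheckCartesianProducts}} as {{CROSS}} joins and pass the check.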
--
This message was sent by Atlassian Jira
(v8.3.4#803005)