Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2019/09/19 08:30:00 UTC

[jira] [Created] (SPARK-29176) Optimization should change join type to CROSS

Enrico Minack created SPARK-29176:
-------------------------------------

             Summary: Optimization should change join type to CROSS
                 Key: SPARK-29176
                 URL: https://issues.apache.org/jira/browse/SPARK-29176
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.2
            Reporter: Enrico Minack


The following query is a valid join but gets optimized into a Cartesian join of {{INNER}} type:
{code:java}
import org.apache.spark.sql.functions.lit
import spark.implicits._

case class Value(id: Int, lower: String, upper: String)

val values = Seq(Value(1, "one", "ONE")).toDS
val join = values.join(values.withColumn("id", lit(1)), "id")
{code}
Catalyst optimizes this query into an inner join without any join condition, and the {{CheckCartesianProducts}} rule then throws an exception.

The following rules are involved:

The {{FoldablePropagation}} rule pushes the constant {{lit(1)}} into the join condition, making it trivial:
{code:java}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
 Project [id#3, lower#4, upper#5, lower#12, upper#13]   Project [id#3, lower#4, upper#5, lower#12, upper#13]
!+- Join Inner, (id#3 = id#7)                           +- Join Inner, (id#3 = 1)
    :- LocalRelation [id#3, lower#4, upper#5]              :- LocalRelation [id#3, lower#4, upper#5]
    +- Project [1 AS id#7, lower#12, upper#13]             +- Project [1 AS id#7, lower#12, upper#13]
       +- LocalRelation [id#11, lower#12, upper#13]           +- LocalRelation [id#11, lower#12, upper#13]
{code}
The {{PushPredicateThroughJoin}} rule then pushes this trivial condition into one branch of the query tree as a filter, leaving the join without any condition:
{code:java}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
 Project [id#3, lower#4, upper#5, lower#12, upper#13]   Project [id#3, lower#4, upper#5, lower#12, upper#13]
!+- Join Inner, (id#3 = 1)                              +- Join Inner
!   :- LocalRelation [id#3, lower#4, upper#5]              :- Filter (id#3 = 1)
!   +- Project [1 AS id#7, lower#12, upper#13]             :  +- LocalRelation [id#3, lower#4, upper#5]
!      +- LocalRelation [id#11, lower#12, upper#13]        +- Project [1 AS id#7, lower#12, upper#13]
!                                                             +- LocalRelation [id#11, lower#12, upper#13]
{code}
The first rule reduces the join condition to a trivial one; the second rule removes it altogether. For either plan, the optimizer throws the exception {{Join condition is missing or trivial.}}

A valid join whose condition becomes trivial or disappears during optimization should have its join type changed to {{CROSS}} rather than keeping the original type. From the original query it is hard for a user to see that it represents a Cartesian join, even though a join condition is given. And setting {{spark.sql.crossJoin.enabled=true}} globally may not be desirable in a complex production environment.
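Until such a change lands, one per-query workaround (a sketch, assuming the query can be rewritten to mirror the optimized plan above) is to state the Cartesian product explicitly via {{Dataset.crossJoin}}, which {{CheckCartesianProducts}} accepts without the global flag:
{code:java}
// Workaround sketch, not a fix: express the optimized plan directly.
// Dataset.crossJoin marks the join as CROSS, so CheckCartesianProducts
// does not require spark.sql.crossJoin.enabled=true.
import spark.implicits._

val join = values.filter($"id" === 1).crossJoin(values.drop("id"))
{code}
This produces the same duplicate {{lower}}/{{upper}} column names as the optimized inner join above, since the right-hand side's {{id}} column is dropped either way.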

Do you agree that these optimization rules (and potentially others) need to alter the join type when the condition becomes trivial or is removed completely?
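In spirit, the change could be as small as the following sketch (simplified types, illustrative names, not actual Catalyst code): when a rule is about to emit an {{Inner}} join with no remaining condition, it switches the type to {{Cross}} instead:
{code:java}
// Hypothetical sketch of the proposed behaviour, NOT actual Catalyst code.
// A simplified join node with only the parts relevant here.
sealed trait JoinType
case object Inner extends JoinType
case object Cross extends JoinType

case class Join(joinType: JoinType, condition: Option[String])

// After a rule has rewritten the condition, switch INNER to CROSS
// when no condition is left, so CheckCartesianProducts stays happy.
def adjustJoinType(join: Join): Join = join match {
  case Join(Inner, None) => join.copy(joinType = Cross)
  case other             => other
}
{code}
The real fix would live inside each rule that can strip or trivialize a condition ({{FoldablePropagation}}, {{PushPredicateThroughJoin}}, and potentially others), applied to Catalyst's actual {{Join}} logical plan node.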



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
