You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nattavut Sutyanyong (JIRA)" <ji...@apache.org> on 2016/08/31 14:42:20 UTC

[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst paser may lead to incorrect result

    [ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452415#comment-15452415 ] 

Nattavut Sutyanyong commented on SPARK-17337:
---------------------------------------------

This problem originated from the Case 1 documented in SPARK-16951. The script illustrates the problem

{noformat}
Seq(1,2).toDF("c1").createOrReplaceTempView("t1")
Seq(1).toDF("c2").createOrReplaceTempView("t2")

scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").show
+----+
|  c3|
+----+
|   2|
|null|
+----+
{noformat}

The correct answer is 1 row of (2). From the plan below, the incorrect portion of the plan is the LeftAnti, rewritten from the NOT IN subquery, is pushed down below the (T1 LOJ T2) operation. Because LeftAnti predicate is evaluated to unknown (or null) if any argument of a comparison operator in the predicate is null, e.g. NULL = <value> is evaluated to unknown (which is equivalent to false in the context of a predicate), the LeftAnti predicate cannot be pushed down into a LOJ operation.

{noformat}
scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT 'c3 IN (list#124)
   :  +- 'SubqueryAlias list#124
   :     +- 'Project ['c2]
   :        +- 'UnresolvedRelation `t2`
   +- 'SubqueryAlias t3
      +- 'Project [('t2.c2 + 1) AS c3#123]
         +- 'Join LeftOuter, ('t1.c1 = 't2.c2)
            :- 'UnresolvedRelation `t1`
            +- 'UnresolvedRelation `t2`

== Analyzed Logical Plan ==
c3: int
Project [c3#123]
+- Filter NOT predicate-subquery#124 [(c3#123 = c2#77)]
   :  +- SubqueryAlias predicate-subquery#124 [(c3#123 = c2#77)]
   :     +- Project [c2#77]
   :        +- SubqueryAlias t2
   :           +- Project [value#75 AS c2#77]
   :              +- LocalRelation [value#75]
   +- SubqueryAlias t3
      +- Project [(c2#77 + 1) AS c3#123]
         +- Join LeftOuter, (c1#3 = c2#77)
            :- SubqueryAlias t1
            :  +- Project [value#1 AS c1#3]
            :     +- LocalRelation [value#1]
            +- SubqueryAlias t2
               +- Project [value#75 AS c2#77]
                  +- LocalRelation [value#75]

== Optimized Logical Plan ==
Project [(c2#77 + 1) AS c3#123]
+- Join LeftOuter, (c1#3 = c2#77)
   :- Project [value#1 AS c1#3]
   :  +- Join LeftAnti, (isnull(((c2#77 + 1) = c2#77)) || ((c2#77 + 1) = c2#77))
   :     :- LocalRelation [value#1]
   :     +- LocalRelation [c2#77]
   +- LocalRelation [c2#77]
{noformat}

> Incomplete algorithm for name resolution in Catalyst paser may lead to incorrect result
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-17337
>                 URL: https://issues.apache.org/jira/browse/SPARK-17337
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>
> While investigating SPARK-16951, I found an incorrect results case from a NOT IN subquery. I thought originally it is an edge case. Further investigation found this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org