You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by nsyca <gi...@git.apache.org> on 2017/03/14 17:13:33 UTC

[GitHub] spark pull request #17294: [SPARK-18966][SQL] NOT IN subquery with correlate...

GitHub user nsyca opened a pull request:

    https://github.com/apache/spark/pull/17294

    [SPARK-18966][SQL] NOT IN subquery with correlated expressions may return incorrect result

    ## What changes were proposed in this pull request?
    
    This PR fixes the following problem:
    ````
    Seq((1, 2)).toDF("a1", "a2").createOrReplaceTempView("a")
    Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("b1", "b2").createOrReplaceTempView("b")
    
    // The expected result is 1 row of (1,2) as shown in the next statement.
    sql("select * from a where a1 not in (select b1 from b where b2 = a2)").show
    +---+---+
    | a1| a2|
    +---+---+
    +---+---+
    
    sql("select * from a where a1 not in (select b1 from b where b2 = 2)").show
    +---+---+
    | a1| a2|
    +---+---+
    |  1|  2|
    +---+---+
    ````
    There are a number of scenarios to consider:
    
    1. When the correlated predicate yields a match (i.e., B.B2 = A.A2)
    1.1. When the NOT IN expression yields a match (i.e., A.A1 = B.B1)
    1.2. When the NOT IN expression yields no match (i.e., A.A1 = B.B1 returns false)
    1.3. When A.A1 is null
    1.4. When B.B1 is null
    1.4.1. When A.A1 is not null
    1.4.2. When A.A1 is null
    
    2. When the correlated predicate yields no match (i.e.,B.B2 = A.A2 is false or unknown)
    2.1. When B.B2 is null and A.A2 is null
    2.2. When B.B2 is null and A.A2 is not null
    2.3. When the value of A.A2 does not match any of B.B2
    
    ````
     A.A1   A.A2      B.B1   B.B2    
    -----  -----     -----  -----
        1      1         1      1    (1.1)
        2      1                     (1.2)
     null      1                     (1.3)
    
        1      3      null      3    (1.4.1)
     null      3                     (1.4.2)
    
        1   null         1   null    (2.1)
     null      2                     (2.2 & 2.3)
    ````
    
    We can divide the evaluation of the above correlated NOT IN subquery into 2 groups:-
    
    Group 1: The rows in A when there is a match from the correlated predicate (A.A1 = B.B1)
    
    In this case, the result of the subquery is not empty and the semantics of the NOT IN depends solely on the evaluation of the equality comparison of the columns of NOT IN, i.e., A1 = B1, which says
    
    - If A.A1 is null, the row is filtered (1.3 and 1.4.2)
    - If A.A1 = B.B1, the row is filtered (1.1)
    - If B.B1 is null, any rows of A in the same group (A.A2 = B.B2) is filtered (1.4.1 & 1.4.2)
    - Otherwise, the row is qualified.
    
    Hence, in this group, the result is the row from (1.2).
    
    Group 2: The rows in A when there is no match from the correlated predicate (A.A2 = B.B2)
    
    In this case, all the rows in A, including the rows where A.A1, are qualified because the subquery returns an empty set and by the semantics of the NOT IN, all rows from the parent side qualifies as the result set, that is, the rows from (2.1, 2.2 and 2.3).
    
    In conclusion, the correct result set of the above query is
    ````
     A.A1   A.A2
    -----  -----
        2      1    (1.2)
        1   null    (2.1)
     null      2    (2.2 & 2.3)
    ````
    ## How was this patch tested?
    unit tests, regression tests, and new test cases focusing on the problem being fixed.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nsyca/spark 18966

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17294.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17294
    
----
commit b98865127a39bde885f9b1680cfe608629d59d51
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-07-29T21:43:56Z

    [SPARK-16804][SQL] Correlated subqueries containing LIMIT return incorrect results
    
    ## What changes were proposed in this pull request?
    
    This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.
    
    ## How was this patch tested?
    ./dev/run-tests
    a new unit test on the problematic pattern.

commit 069ed8f8e5f14dca7a15701945d42fc27fe82f3c
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-07-29T21:50:02Z

    [SPARK-16804][SQL] Correlated subqueries containing LIMIT return incorrect results
    
    ## What changes were proposed in this pull request?
    
    This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.
    
    ## How was this patch tested?
    ./dev/run-tests
    a new unit test on the problematic pattern.

commit edca333c081e6d4e53a91b496fba4a3ef4ee89ac
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-07-30T00:28:15Z

    New positive test cases

commit 64184fdb77c1a305bb2932e82582da28bb4c0e53
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-08-01T13:20:09Z

    Fix unit test case failure

commit 29f82b05c9e40e7934397257c674b260a8e8a996
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-08-05T17:42:01Z

    blocking TABLESAMPLE

commit ac43ab47907a1ccd6d22f920415fbb4de93d4720
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-08-05T21:10:19Z

    Fixing code styling

commit 631d396031e8bf627eb1f4872a4d3a17c144536c
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-08-07T18:39:44Z

    Correcting Scala test style

commit 7eb9b2dbba3633a1958e38e0019e3ce816300514
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-08-08T02:31:09Z

    One (last) attempt to correct the Scala style tests

commit 1387cf51541408ac20048064fa5e559836af932c
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2016-08-12T20:11:50Z

    Merge remote-tracking branch 'upstream/master'

commit 648afac8d35f557ca48d19b93956a9e0fbc6ea6e
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2017-03-14T14:12:59Z

    Merge remote-tracking branch 'upstream/master'

commit 507b2fa3743ae0e6a70547056b0600700d6f602e
Author: Nattavut Sutyanyong <ns...@gmail.com>
Date:   2017-03-14T16:09:43Z

    initial code

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17294: [SPARK-18966][SQL] NOT IN subquery with correlate...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17294


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    **[Test build #74545 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74545/testReport)** for PR 17294 at commit [`507b2fa`](https://github.com/apache/spark/commit/507b2fa3743ae0e6a70547056b0600700d6f602e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    @nsyca that makes sense. Lets not backport for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by nsyca <gi...@git.apache.org>.
Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    @hvanhovell It would not be trivial to backport to 2.1/2.0. The correlated predicates in the NOT IN subquery have been pulled up in Analyzer phase and combined with the the columns in the LHS and RHS of NOT IN. We will need a different technique from this fix such as marking the correlated predicates differently from the columns in NOT IN.
    
    NOT IN is not commonly used with null values. I think the effort may not be worth it.
    
    Thank you for your quick turnaround time in the review and merging the PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    LGTM - pending jenkins.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    Merging to master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    @nsyca should we backport this to 2.1/2.0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74545/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17294: [SPARK-18966][SQL] NOT IN subquery with correlated expre...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17294
  
    **[Test build #74545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74545/testReport)** for PR 17294 at commit [`507b2fa`](https://github.com/apache/spark/commit/507b2fa3743ae0e6a70547056b0600700d6f602e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org