You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Ruslan Dautkhanov (JIRA)" <ji...@apache.org> on 2017/07/28 15:47:00 UTC
[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

    [ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105127#comment-16105127 ] 

Ruslan Dautkhanov edited comment on SPARK-21274 at 7/28/17 3:46 PM:
--------------------------------------------------------------------

[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the queries. 

I don't think {noformat}[1, 2]{noformat} is the correct behavior for the first query.

EXCEPT ALL which returns all records from the *first* table which are not present in the second table, leaving the duplicates as is.

If you believe it should be "1,2", then it's easy to fix by just changing tab1 to tab2 in the second query.

Or other way around, original queries would return 
{noformat}
[1, 2]
for [1, 2] intersect_all [1, 2, 2]
{noformat}


was (Author: tagar):
[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the queries. 

I don't think [1, 2] is the correct behavior for the first query.
EXCEPT ALL which returns all records from the *first* table which are not present in the second table, leaving the duplicates as is.



> Implement EXCEPT ALL and INTERSECT ALL
> --------------------------------------
>
>                 Key: SPARK-21274
>                 URL: https://issues.apache.org/jira/browse/SPARK-21274
>             Project: Spark
>          Issue Type: New Feature
>          Components: Optimizer, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: Ruslan Dautkhanov
>              Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
>      LEFT OUTER JOIN 
>         tab2 t2
>      ON (
>         (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>      )
> WHERE
>     COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined above:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
> WHERE 
>    NOT EXISTS
>    (    SELECT 1
>         FROM    t1_except_t2_df e
>         WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>    )
> {code}
> So the suggestion is just to use above query rewrites to implement both EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org