Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/03/29 02:38:00 UTC

[jira] [Updated] (SPARK-23752) [Performance] Existential Subquery to Inner Join

     [ https://issues.apache.org/jira/browse/SPARK-23752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-23752:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> [Performance] Existential Subquery to Inner Join
> ------------------------------------------------
>
>                 Key: SPARK-23752
>                 URL: https://issues.apache.org/jira/browse/SPARK-23752
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Ioana Delaney
>            Priority: Major
>
> *+Existential Subquery to Inner Join+*
> Another enhancement that uses Informational Constraints is existential subquery to inner join. This rewrite converts an existential subquery to an inner join, and thus provides alternative join choices for the Optimizer based on the selectivity of the tables. 
> An example using TPC-DS schema is shown below.
> {code}
> select c_first_name, c_last_name, c_email_address
> from customer c
> where EXISTS (select * 
>               from store_sales, date_dim
>               where c.c_customer_sk = ss_customer_sk and
>                     ss_sold_date_sk = d_date_sk and
>                     d_year = 2002 and
>                     d_moy between 4 and 4+3)
> {code}
> Spark uses a left semi-join to evaluate existential subqueries. A left semi-join returns a row from the outer table if there is at least one match in the inner table. Semi-join is a commonly used technique for rewriting existential subqueries, but it has limitations because it imposes a fixed order on the joined tables. In this case the large fact table _store_sales_ has to be on the inner side of the join. A more efficient execution can be obtained if the subquery is converted to a regular inner join, which allows the Optimizer to choose better join orders.
> Converting a subquery to an inner join is possible if either the subquery produces at most one row, or a _Distinct_ is introduced on the outer table's row key to remove the duplicate rows produced by the inner join and thus preserve the semantics of the subquery. As the key for the outer table, we can use the primary key of the _customer_ table.
> *Internal query after rewrite:*
> {code}
> select distinct c_customer_sk /*PK */, c_first_name, c_last_name, c_email_address
> from customer c, store_sales, date_dim
> where c.c_customer_sk = ss_customer_sk and
>        ss_sold_date_sk = d_date_sk and
>        d_year = 2002 and
>        d_moy between 4 and 4+3
> {code}
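> A minimal sketch of the equivalence claim above, using Python's sqlite3 on made-up data (this is not the Spark optimizer rule itself; table and column names follow the TPC-DS excerpt, the rows are invented for illustration): the EXISTS form and the DISTINCT inner-join rewrite return the same customers even when the join produces duplicates.

```python
import sqlite3

# Toy check of the rewrite: EXISTS subquery vs. DISTINCT inner join on the
# outer table's primary key. Data is invented; schema mirrors the excerpt.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (c_customer_sk INTEGER PRIMARY KEY,
                       c_first_name TEXT, c_last_name TEXT, c_email_address TEXT);
CREATE TABLE store_sales (ss_customer_sk INTEGER, ss_sold_date_sk INTEGER);
CREATE TABLE date_dim (d_date_sk INTEGER PRIMARY KEY, d_year INTEGER, d_moy INTEGER);

INSERT INTO customer VALUES (1,'Ann','Lee','ann@x.com'), (2,'Bob','Kim','bob@x.com'),
                            (3,'Cal','Roy','cal@x.com');
INSERT INTO date_dim VALUES (100, 2002, 5), (101, 2002, 9), (102, 2001, 5);
-- customer 1 matches twice (duplicate-producing join), customer 2 fails the
-- date filter, customer 3 has no sales at all.
INSERT INTO store_sales VALUES (1,100),(1,100),(2,101),(2,102);
""")

exists_q = """
SELECT c_first_name, c_last_name, c_email_address
FROM customer c
WHERE EXISTS (SELECT * FROM store_sales, date_dim
              WHERE c.c_customer_sk = ss_customer_sk
                AND ss_sold_date_sk = d_date_sk
                AND d_year = 2002 AND d_moy BETWEEN 4 AND 4+3)"""

rewrite_q = """
SELECT DISTINCT c.c_customer_sk /* PK */, c_first_name, c_last_name, c_email_address
FROM customer c, store_sales, date_dim
WHERE c.c_customer_sk = ss_customer_sk
  AND ss_sold_date_sk = d_date_sk
  AND d_year = 2002 AND d_moy BETWEEN 4 AND 4+3"""

a = sorted(cur.execute(exists_q).fetchall())
b = sorted(r[1:] for r in cur.execute(rewrite_q).fetchall())  # drop the PK column
print(a == b, a)
```

> The DISTINCT on c_customer_sk (the primary key of _customer_) is what collapses the two store_sales matches for customer 1 back to a single output row, which is the point of the rewrite.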
> \\
> *Example performance results using 1TB TPC-DS benchmark:*
> \\
> ||TPC-DS Query||spark-2.2 (secs)||spark-2.2 w/ sub2join (secs)||Query speedup||
> |Q10|355|190|2x|
> |Q16|1394|706|2x|
> |Q35|462|285|1.5x|
> |Q69|327|173|1.5x|
> |Q94|603|307|2x|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org