You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "linna shuang (Jira)" <ji...@apache.org> on 2020/05/08 07:43:00 UTC

[jira] [Commented] (SPARK-16951) Alternative implementation of NOT IN to Anti-join

    [ https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102336#comment-17102336 ] 

linna shuang commented on SPARK-16951:
--------------------------------------

In TPC-H test, we met performance issue of Q16, which used NOT IN subquery and being translated into broadcast nested loop join. This query uses almost half time of total 22 queries. For example, 512GB data set, totally execution time is 1400 seconds, while Q16’s execution time is 630 seconds.

TPC-H is a common spark sql performance benchmark, this performance issue will be met usually. Do you have plan to reopen and fix this issue?

> Alternative implementation of NOT IN to Anti-join
> -------------------------------------------------
>
>                 Key: SPARK-16951
>                 URL: https://issues.apache.org/jira/browse/SPARK-16951
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Nattavut Sutyanyong
>            Priority: Major
>
> A transformation currently used to process {{NOT IN}} subquery is to rewrite to a form of Anti-join with null-aware property in the Logical Plan and then translate to a form of {{OR}} predicate joining the parent side and the subquery side of the {{NOT IN}}. As a result, the presence of {{OR}} predicate is limited to the nested-loop join execution plan, which will have a major performance implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to the technique used in the implementation of the Existence join that addresses the problem of {{EXISTS (..) OR ..}} type of queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org