You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Leanken.Lin (Jira)" <ji...@apache.org> on 2020/08/05 10:16:00 UTC

[jira] [Resolved] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

     [ https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin resolved SPARK-32494.
---------------------------------
    Resolution: Later

> Null Aware Anti Join Optimize Support Multi-Column
> --------------------------------------------------
>
>                 Key: SPARK-32494
>                 URL: https://issues.apache.org/jira/browse/SPARK-32494
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash lookup instead of loop join. 
> It's simple to just fulfill a "NOT IN" logical when it's a single key, but multi-column not in is much more complicated with all these null aware comparison.
> FYI, code logical for single and multi column is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 
> For NullAwareHashedRelation
>  # it will not skip anyNullColumn key like LongHashedRelation and UnsafeHashedRelation do.
>  # while building NullAwareHashedRelation, will put extra keys into the relation, just to make null aware columns comparison in hash lookup style.
> the duplication would be 2^numKeys - 1 times, for example, if we are to support NAAJ with 3 column join key, the buildSide would be expanded into (2^3 - 1) times, 7X.
> For example, if there is a UnsafeRow key (1,2,3)
> In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), C(3,2) combinations, within the combinations, we duplicated these record with null padding as following.
> Original record
> (1,2,3)
> Extra record to be appended into NullAwareHashedRelation
> (null, 2, 3) (1, null, 3) (1, 2, null)
>  (null, null, 3) (null, 2, null) (1, null, null))
> with the expanded data we can extract a common pattern for both single and multi column. allNull refer to a unsafeRow which has all null columns.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org