Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/07/30 07:49:00 UTC

[jira] [Assigned] (SPARK-32474) NullAwareAntiJoin multi-column support

     [ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32474:
------------------------------------

    Assignee:     (was: Apache Spark)

> NullAwareAntiJoin multi-column support
> --------------------------------------
>
>                 Key: SPARK-32474
>                 URL: https://issues.apache.org/jira/browse/SPARK-32474
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Minor
>
> This is a follow up improvement of Issue SPARK-32290.
> In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improves the total computation from O(M*N) to O(M), but it only targets the single-column case, because multi-column support is much more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>  
> FYI, the test logic for the single- and multi-column cases is defined at:
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> To support multiple columns, I propose the following idea to see whether multi-column support is worth the trade-off. We would need to do some data expansion in HashedRelation; I would call this new type of HashedRelation a NullAwareHashedRelation.
>  
> In NullAwareHashedRelation, keys with null columns are allowed, unlike in LongHashedRelation and UnsafeHashedRelation, and a single key may be expanded into 2^N - 1 records (N refers to the number of columns in the key). For example, if a record
> (1, 2, 3) is about to be inserted into a NullAwareHashedRelation, we take every way of choosing 1 or 2 of the 3 column positions, copy the original key row, set null at the chosen positions, and then insert the copy. Including the original key row, 7 key rows will be inserted, as follows:
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
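The expansion above can be sketched in a few lines. This is a hedged, language-neutral illustration (the real implementation would live in Spark's Scala HashedRelation code); `expand_key` is a hypothetical helper name, not an existing Spark API:

```python
from itertools import combinations

def expand_key(key):
    """Expand one build-side key into its 2^N - 1 null-masked variants,
    including the original row but excluding the all-null row."""
    n = len(key)
    rows = []
    # Null out every non-empty proper subset of the column positions
    # (choosing 1 .. N-1 of the N columns).
    for k in range(1, n):
        for positions in combinations(range(n), k):
            row = list(key)
            for p in positions:
                row[p] = None
            rows.append(tuple(row))
    rows.append(tuple(key))  # keep the original key row as well
    return rows

# (1, 2, 3) expands to the 7 rows listed above.
rows = expand_key((1, 2, 3))
```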
>  
> With the expanded data we can extract a common pattern for both the single- and multi-column cases. "allNull" refers to an UnsafeRow in which every column is null.
>  * buildSide input is empty => return all rows
>  * an allNull key exists in the buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false and a match is found in the NullAwareHashedRelation => drop the row
>  * if streamedSideRow.allNull is false and no match is found in the NullAwareHashedRelation => return the row
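The five rules above can be collapsed into a single filter predicate. Again a hedged, language-neutral sketch: `naaj_keep_row` is a hypothetical name, and a plain set of expanded keys stands in for the proposed NullAwareHashedRelation:

```python
def naaj_keep_row(streamed_key, build_keys):
    """Return True if the streamed-side row survives the null-aware anti join.

    `build_keys` is assumed to hold the expanded (2^N - 1 per row) build-side
    keys, plus any all-null rows that appeared directly in the build input.
    """
    n = len(streamed_key)
    all_null = (None,) * n
    if not build_keys:
        return True        # buildSide input is empty => return all rows
    if all_null in build_keys:
        return False       # allNull key on the build side => reject all rows
    if streamed_key == all_null:
        return False       # allNull streamed row => drop the row
    # Not all null: drop the row on a match, return it on no match.
    return streamed_key not in build_keys
```

Because the build side is pre-expanded, a streamed row with partial nulls such as (None, 2, 3) finds its match with a plain equality lookup, which is what makes the O(M) hash-join formulation possible.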
>  
> This solution will certainly expand the buildSide data by up to 2^N - 1 times, but since an NAAJ normally involves only 2~3 columns in production queries, I suppose it's acceptable to expand the buildSide data to around 7X. I would also impose a limit on the maximum number of columns supported for NAAJ, basically no more than 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org