Posted to issues@spark.apache.org by "Leanken.Lin (Jira)" <ji...@apache.org> on 2020/07/29 04:00:00 UTC

[jira] [Created] (SPARK-32474) NullAwareAntiJoin multi-column support

Leanken.Lin created SPARK-32474:
-----------------------------------

             Summary: NullAwareAntiJoin multi-column support
                 Key: SPARK-32474
                 URL: https://issues.apache.org/jira/browse/SPARK-32474
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Leanken.Lin
             Fix For: 3.1.0


This is a follow up improvement of Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290].

In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin into BroadcastHashJoin, which improves the total computation from O(M*N) to O(M). However, that change only targets the single-column case, because multi-column support is much more complicated.
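The null-aware part of the join is what rules out a plain hash lookup: under SQL's three-valued logic, NOT IN must also account for UNKNOWN comparisons. A minimal single-column sketch of the semantics the BroadcastHashJoin rewrite has to preserve (plain Scala with illustrative names, not Spark internals; Option[Int] stands in for a nullable key):

```scala
// Illustrative sketch (not Spark source) of single-column null-aware
// anti join semantics, with Option[Int] modeling a nullable key.
object NaajSingleColumn {
  def antiJoin(streamed: Seq[Option[Int]], build: Seq[Option[Int]]): Seq[Option[Int]] = {
    if (build.isEmpty) streamed               // empty build side: every streamed row survives
    else if (build.contains(None)) Seq.empty  // a null on the build side makes NOT IN never true
    else streamed.filter {
      case None => false                      // null streamed key: comparison is UNKNOWN, drop
      case key  => !build.contains(key)       // keep only keys with no build-side match
    }
  }
}
```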

See Section 6 of http://www.vldb.org/pvldb/vol2/vldb09-423.pdf

 

NAAJ multi-column logic:

!image-2020-07-29-11-41-11-939.png!

NAAJ single-column logic:

!image-2020-07-29-11-41-03-757.png!

To support multi-column keys, I propose the following idea; let's see whether multi-column support is worth the trade-off. It requires some data expansion in HashedRelation, and I would call this new type of HashedRelation a NullAwareHashedRelation.

 

In NullAwareHashedRelation, keys with null columns are allowed, unlike in LongHashedRelation and UnsafeHashedRelation. A single key might be expanded into 2^N - 1 records (N is the number of key columns). For example, when the record

(1, 2, 3) is about to be inserted into NullAwareHashedRelation, we take every combination of 1 or 2 of the 3 column positions (C(3,1) + C(3,2) combinations), copy the original key row, set nulls at the chosen positions, and insert the result into NullAwareHashedRelation. Including the original key row, 7 key rows are inserted, as follows:

(null, 2, 3)

(1, null, 3)

(1, 2, null)

(null, null, 3)

(null, 2, null)

(1, null, null)

(1, 2, 3)
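The expansion above can be sketched in a few lines of plain Scala (names are hypothetical; Option[Int] stands in for a nullable column):

```scala
// Hypothetical sketch of the 2^N - 1 key expansion: null out every
// proper subset of column positions. The empty subset reproduces the
// original key; the all-null combination is excluded.
object KeyExpansion {
  type Key = Seq[Option[Int]]

  def expand(key: Key): Seq[Key] =
    key.indices.toSet.subsets()
      .filter(_.size < key.length)    // skip the all-null combination
      .map { nulls =>
        key.zipWithIndex.map { case (v, i) => if (nulls(i)) None else v }
      }
      .toSeq
}
```

For a 3-column key this yields exactly the 7 rows listed above.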

 

With the expanded data we can extract a common pattern for both the single- and multi-column cases. allNull refers to an UnsafeRow whose columns are all null.
 * buildSide input is empty => return all rows
 * an allNull key exists in the buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false and a match is found in NullAwareHashedRelation => drop the row
 * if streamedSideRow.allNull is false and no match is found in NullAwareHashedRelation => return the row
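Putting the expansion and the rules above together, a self-contained sketch might look like the following (plain Scala, illustrative names, not Spark internals; build-side keys are assumed fully non-null except for the all-null reject case):

```scala
// Hypothetical sketch combining the bullets above: expand build-side keys
// into a set standing in for NullAwareHashedRelation, then decide per
// streamed row.
object NaajMultiColumn {
  type Key = Seq[Option[Int]]

  // 2^N - 1 expansion: null out every proper subset of column positions.
  def expand(key: Key): Iterator[Key] =
    key.indices.toSet.subsets().filter(_.size < key.length).map { nulls =>
      key.zipWithIndex.map { case (v, i) => if (nulls(i)) None else v }
    }

  def antiJoin(streamed: Seq[Key], build: Seq[Key]): Seq[Key] = {
    if (build.isEmpty) return streamed                // buildSide empty => return all rows
    val allNull: Key = Seq.fill(build.head.length)(None)
    if (build.contains(allNull)) return Seq.empty     // allNull build key => reject all rows
    val relation: Set[Key] = build.iterator.flatMap(expand).toSet
    streamed.filter { row =>
      if (row.forall(_.isEmpty)) false                // allNull streamed row => drop
      else !relation.contains(row)                    // drop on match, return otherwise
    }
  }
}
```

The expansion is what lets a streamed row containing some nulls (e.g. (1, null)) find its UNKNOWN match against a fully populated build key (e.g. (1, 2)) with a single O(1) probe.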

 

This solution will certainly expand the buildSide data by up to 2^N - 1 times. But since NAAJ in production queries normally involves only 2~3 columns, I suppose it's acceptable to expand the buildSide data by around 7X. I would also add a limit on the maximum number of columns supported for NAAJ, basically no more than 3.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org