You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/21 16:24:50 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new pull request #24: Micro optimize hash join functions

Dandandan opened a new pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24


   ~5% speed up by speeding up some hot functions
   
   ```
   Query 5 avg time: 88.59 ms
   Query 10 avg time: 256.62 ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620640590



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
         let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
 
         match (left_array.is_null($left), left_array.is_null($right)) {
-            (true, true) => true,

Review comment:
       Yep! To be fair I also added this line myself 😆
   
   I think we also should start testing DataFusion more rigorously wrt null handling, but one step at a time.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621304356



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -65,7 +65,7 @@ use log::debug;
 // Maps a `u64` hash value based on the left ["on" values] to a list of indices with this key's value.
 // E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8 for hash value 1
 // As the key is a hash value, we need to check possible hash collisions in the probe stage
-type JoinHashMap = HashMap<u64, SmallVec<[u64; 1]>, IdHashBuilder>;
+type JoinHashMap = HashMap<(), SmallVec<[u64; 1]>, IdHashBuilder>;

Review comment:
       Added some extra docs now :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620649530



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
         let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
 
         match (left_array.is_null($left), left_array.is_null($right)) {
-            (true, true) => true,

Review comment:
       Even if it is, I noticed the values method doesn't take offset into account right now, so this is maybe an optimization worth doing once hashing is included as arrow kernel.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620649530



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
         let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
 
         match (left_array.is_null($left), left_array.is_null($right)) {
-            (true, true) => true,

Review comment:
       Even if it is, I noticed the values method doesn't take offset into account right now, so this is maybe an optimization worth doing once it is in the arrow kernel.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2dacc01) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **decrease** coverage by `0.00%`.
   > The diff coverage is `78.57%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master      #24      +/-   ##
   ==========================================
   - Coverage   76.24%   76.23%   -0.01%     
   ==========================================
     Files         134      134              
     Lines       23051    23055       +4     
   ==========================================
   + Hits        17576    17577       +1     
   - Misses       5475     5478       +3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.94% <78.57%> (-0.49%)` | :arrow_down: |
   | [benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL3RwY2gucnM=) | `35.15% <0.00%> (ø)` | |
   | [benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL255Y3RheGkucnM=) | `0.00% <0.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...2dacc01](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-827170266


   Sorry for the delay @Dandandan -- I am about out of time today, but I will take a careful look tomorrow morning


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9d8985b) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **decrease** coverage by `0.01%`.
   > The diff coverage is `78.46%`.
   
   > :exclamation: Current head 9d8985b differs from pull request most recent head e1ac6a9. Consider uploading reports for the commit e1ac6a9 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master      #24      +/-   ##
   ==========================================
   - Coverage   76.24%   76.23%   -0.02%     
   ==========================================
     Files         134      134              
     Lines       23051    23052       +1     
   ==========================================
   - Hits        17576    17574       -2     
   - Misses       5475     5478       +3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.85% <78.46%> (-0.58%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...e1ac6a9](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620643551



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -781,58 +846,140 @@ pub fn create_hashes<'a>(
     random_state: &RandomState,
     hashes_buffer: &'a mut Vec<u64>,
 ) -> Result<&'a mut Vec<u64>> {
+    // combine hashes with `combine_hashes` if we have more than 1 column
+    let multi_col = arrays.len() > 1;
+
     for col in arrays {
         match col.data_type() {
             DataType::UInt8 => {
-                hash_array!(UInt8Array, col, u8, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt8Array,
+                    col,
+                    u8,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt16 => {
-                hash_array!(UInt16Array, col, u16, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt16Array,
+                    col,
+                    u16,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt32 => {
-                hash_array!(UInt32Array, col, u32, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt32Array,
+                    col,
+                    u32,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt64 => {
-                hash_array!(UInt64Array, col, u64, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt64Array,
+                    col,
+                    u64,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int8 => {
-                hash_array!(Int8Array, col, i8, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int8Array,
+                    col,
+                    i8,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int16 => {
-                hash_array!(Int16Array, col, i16, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int16Array,
+                    col,
+                    i16,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int32 => {
-                hash_array!(Int32Array, col, i32, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int32Array,
+                    col,
+                    i32,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int64 => {
-                hash_array!(Int64Array, col, i64, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int64Array,
+                    col,
+                    i64,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Timestamp(TimeUnit::Microsecond, None) => {
-                hash_array!(
+                hash_array_primitive!(
                     TimestampMicrosecondArray,
                     col,
                     i64,
                     hashes_buffer,
-                    random_state
+                    random_state,
+                    multi_col
                 );
             }
             DataType::Timestamp(TimeUnit::Nanosecond, None) => {
-                hash_array!(
+                hash_array_primitive!(
                     TimestampNanosecondArray,
                     col,
                     i64,
                     hashes_buffer,
-                    random_state
+                    random_state,
+                    multi_col
                 );
             }
             DataType::Boolean => {
-                hash_array!(BooleanArray, col, u8, hashes_buffer, random_state);
+                hash_array_primitive!(

Review comment:
       I am wondering, are the bytee not guaranteed to be the same for the same bitmaps?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #24: [DataFusion] Optimize hash join inner workings

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7672164) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **decrease** coverage by `0.01%`.
   > The diff coverage is `78.46%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master      #24      +/-   ##
   ==========================================
   - Coverage   76.24%   76.23%   -0.02%     
   ==========================================
     Files         134      134              
     Lines       23051    23052       +1     
   ==========================================
   - Hits        17576    17574       -2     
   - Misses       5475     5478       +3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.85% <78.46%> (-0.58%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...7672164](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #24: [DataFusion] Optimize hash join inner workings

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7672164) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **decrease** coverage by `0.01%`.
   > The diff coverage is `78.46%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master      #24      +/-   ##
   ==========================================
   - Coverage   76.24%   76.23%   -0.02%     
   ==========================================
     Files         134      134              
     Lines       23051    23052       +1     
   ==========================================
   - Hits        17576    17574       -2     
   - Misses       5475     5478       +3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.85% <78.46%> (-0.58%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...7672164](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-827125937


   @alamb @jorgecarleitao FYI if you have time, I am not planning any changes to this PR for now.
   
   I think the more important part of the PR is the fix wrt null handling which turned out to be a one line change.
   
   The other part is some small performance tweaks and prepares it a bit for further improvements down the road.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620634002



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -781,58 +846,140 @@ pub fn create_hashes<'a>(
     random_state: &RandomState,
     hashes_buffer: &'a mut Vec<u64>,
 ) -> Result<&'a mut Vec<u64>> {
+    // combine hashes with `combine_hashes` if we have more than 1 column
+    let multi_col = arrays.len() > 1;
+
     for col in arrays {
         match col.data_type() {
             DataType::UInt8 => {
-                hash_array!(UInt8Array, col, u8, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt8Array,
+                    col,
+                    u8,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt16 => {
-                hash_array!(UInt16Array, col, u16, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt16Array,
+                    col,
+                    u16,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt32 => {
-                hash_array!(UInt32Array, col, u32, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt32Array,
+                    col,
+                    u32,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt64 => {
-                hash_array!(UInt64Array, col, u64, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt64Array,
+                    col,
+                    u64,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int8 => {
-                hash_array!(Int8Array, col, i8, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int8Array,
+                    col,
+                    i8,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int16 => {
-                hash_array!(Int16Array, col, i16, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int16Array,
+                    col,
+                    i16,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int32 => {
-                hash_array!(Int32Array, col, i32, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int32Array,
+                    col,
+                    i32,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int64 => {
-                hash_array!(Int64Array, col, i64, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int64Array,
+                    col,
+                    i64,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Timestamp(TimeUnit::Microsecond, None) => {
-                hash_array!(
+                hash_array_primitive!(
                     TimestampMicrosecondArray,
                     col,
                     i64,
                     hashes_buffer,
-                    random_state
+                    random_state,
+                    multi_col
                 );
             }
             DataType::Timestamp(TimeUnit::Nanosecond, None) => {
-                hash_array!(
+                hash_array_primitive!(
                     TimestampNanosecondArray,
                     col,
                     i64,
                     hashes_buffer,
-                    random_state
+                    random_state,
+                    multi_col
                 );
             }
             DataType::Boolean => {
-                hash_array!(BooleanArray, col, u8, hashes_buffer, random_state);
+                hash_array_primitive!(

Review comment:
       ```suggestion
                   hash_array!(
   ```
   
   otherwise this will be an iterator over the bytes of the bitmap, not over the individual bits.
   
   In general, I recommend using generics, as they make it more difficult for these to happen, but I agree that the API is a bit funny on the arrow side. ^_^
   

##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
         let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
 
         match (left_array.is_null($left), left_array.is_null($right)) {
-            (true, true) => true,

Review comment:
       this was the line, right? Damm, so many wrong things for this one.
   
   Really great finding, @Dandandan !




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621193159



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -399,15 +395,23 @@ fn update_hash(
         .map(|name| Ok(col(name).evaluate(batch)?.into_array(batch.num_rows())))
         .collect::<Result<Vec<_>>>()?;
 
-    // update the hash map
+    // calculate the hash values
     let hash_values = create_hashes(&keys_values, &random_state, hashes_buffer)?;
 
     // insert hashes to key of the hashmap
     for (row, hash_value) in hash_values.iter().enumerate() {
-        hash.raw_entry_mut()
-            .from_key_hashed_nocheck(*hash_value, hash_value)
-            .and_modify(|_, v| v.push((row + offset) as u64))
-            .or_insert_with(|| (*hash_value, smallvec![(row + offset) as u64]));
+        match hash.raw_entry_mut().from_hash(*hash_value, |_| true) {
+            hashbrown::hash_map::RawEntryMut::Occupied(mut entry) => {
+                entry.get_mut().push((row + offset) as u64);
+            }
+            hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
+                entry.insert_hashed_nocheck(

Review comment:
       The overall algorithm does do the checking for collisions (values with the same hash) in the later stage. There is a unit checking this case too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621191677



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -65,7 +65,7 @@ use log::debug;
 // Maps a `u64` hash value based on the left ["on" values] to a list of indices with this key's value.
 // E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8 for hash value 1
 // As the key is a hash value, we need to check possible hash collisions in the probe stage
-type JoinHashMap = HashMap<u64, SmallVec<[u64; 1]>, IdHashBuilder>;
+type JoinHashMap = HashMap<(), SmallVec<[u64; 1]>, IdHashBuilder>;

Review comment:
       I will add some comments about it for sure!
   
   Eventually we can get rid of the hashmap entirely, as we check for collisions in a later stage.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621193159



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -399,15 +395,23 @@ fn update_hash(
         .map(|name| Ok(col(name).evaluate(batch)?.into_array(batch.num_rows())))
         .collect::<Result<Vec<_>>>()?;
 
-    // update the hash map
+    // calculate the hash values
     let hash_values = create_hashes(&keys_values, &random_state, hashes_buffer)?;
 
     // insert hashes to key of the hashmap
     for (row, hash_value) in hash_values.iter().enumerate() {
-        hash.raw_entry_mut()
-            .from_key_hashed_nocheck(*hash_value, hash_value)
-            .and_modify(|_, v| v.push((row + offset) as u64))
-            .or_insert_with(|| (*hash_value, smallvec![(row + offset) as u64]));
+        match hash.raw_entry_mut().from_hash(*hash_value, |_| true) {
+            hashbrown::hash_map::RawEntryMut::Occupied(mut entry) => {
+                entry.get_mut().push((row + offset) as u64);
+            }
+            hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
+                entry.insert_hashed_nocheck(

Review comment:
       The overall algorithm does do the checking for collisions (values with the same hash) in the later stage. There is a unit test checking this case too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621186963



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -399,15 +395,23 @@ fn update_hash(
         .map(|name| Ok(col(name).evaluate(batch)?.into_array(batch.num_rows())))
         .collect::<Result<Vec<_>>>()?;
 
-    // update the hash map
+    // calculate the hash values
     let hash_values = create_hashes(&keys_values, &random_state, hashes_buffer)?;
 
     // insert hashes to key of the hashmap
     for (row, hash_value) in hash_values.iter().enumerate() {
-        hash.raw_entry_mut()
-            .from_key_hashed_nocheck(*hash_value, hash_value)
-            .and_modify(|_, v| v.push((row + offset) as u64))
-            .or_insert_with(|| (*hash_value, smallvec![(row + offset) as u64]));
+        match hash.raw_entry_mut().from_hash(*hash_value, |_| true) {
+            hashbrown::hash_map::RawEntryMut::Occupied(mut entry) => {
+                entry.get_mut().push((row + offset) as u64);
+            }
+            hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
+                entry.insert_hashed_nocheck(

Review comment:
       Does this code do the right thing if there are collisions in the hash table (aka the same key column values hash to the same hash value?) I am not enough of an expert on HashBrown to really know, but something feels off to me here. 
   

##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -65,7 +65,7 @@ use log::debug;
 // Maps a `u64` hash value based on the left ["on" values] to a list of indices with this key's value.
 // E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8 for hash value 1
 // As the key is a hash value, we need to check possible hash collisions in the probe stage
-type JoinHashMap = HashMap<u64, SmallVec<[u64; 1]>, IdHashBuilder>;
+type JoinHashMap = HashMap<(), SmallVec<[u64; 1]>, IdHashBuilder>;

Review comment:
       I am a little confused about using `()` as the key type for the `HashMap` -- maybe updating the comments would help

##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -755,21 +766,75 @@ fn equal_rows(
 }
 
 macro_rules! hash_array {
-    ($array_type:ident, $column: ident, $ty: ident, $hashes: ident, $random_state: ident) => {
+    ($array_type:ident, $column: ident, $ty: ident, $hashes: ident, $random_state: ident, $multi_col: ident) => {
         let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
         if array.null_count() == 0 {
-            for (i, hash) in $hashes.iter_mut().enumerate() {
-                *hash =
-                    combine_hashes($ty::get_hash(&array.value(i), $random_state), *hash);
-            }
-        } else {
-            for (i, hash) in $hashes.iter_mut().enumerate() {
-                if !array.is_null(i) {
+            if $multi_col {
+                for (i, hash) in $hashes.iter_mut().enumerate() {
                     *hash = combine_hashes(
                         $ty::get_hash(&array.value(i), $random_state),
                         *hash,
                     );
                 }
+            } else {
+                for (i, hash) in $hashes.iter_mut().enumerate() {
+                    *hash = $ty::get_hash(&array.value(i), $random_state);
+                }
+            }
+        } else {
+            if $multi_col {
+                for (i, hash) in $hashes.iter_mut().enumerate() {
+                    if !array.is_null(i) {
+                        *hash = combine_hashes(
+                            $ty::get_hash(&array.value(i), $random_state),
+                            *hash,
+                        );
+                    }
+                }
+            } else {
+                for (i, hash) in $hashes.iter_mut().enumerate() {
+                    if !array.is_null(i) {
+                        *hash = $ty::get_hash(&array.value(i), $random_state);
+                    }
+                }
+            }
+        }
+    };
+}
+
+macro_rules! hash_array_primitive {
+    ($array_type:ident, $column: ident, $ty: ident, $hashes: ident, $random_state: ident, $multi_col: ident) => {
+        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
+        let values = array.values();
+
+        if array.null_count() == 0 {
+            if $multi_col {
+                for (hash, value) in $hashes.iter_mut().zip(values.iter()) {
+                    *hash = combine_hashes($ty::get_hash(value, $random_state), *hash);
+                }
+            } else {
+                for (hash, value) in $hashes.iter_mut().zip(values.iter()) {
+                    *hash = $ty::get_hash(value, $random_state)
+                }
+            }
+        } else {
+            if $multi_col {

Review comment:
       special casing multi-value join columns is a good one

##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
         let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
 
         match (left_array.is_null($left), left_array.is_null($right)) {
-            (true, true) => true,

Review comment:
       The fact we are now worrying about and fixing null handling is a sign, in my mind, of DataFusion's maturation 🤣 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings, null hanling fix

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (3ebd266) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **increase** coverage by `0.10%`.
   > The diff coverage is `78.87%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master      #24      +/-   ##
   ==========================================
   + Coverage   76.24%   76.35%   +0.10%     
   ==========================================
     Files         134      134              
     Lines       23051    23062      +11     
   ==========================================
   + Hits        17576    17609      +33     
   + Misses       5475     5453      -22     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.94% <78.57%> (-0.49%)` | :arrow_down: |
   | [datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi90ZXN0cy9zcWwucnM=) | `99.88% <100.00%> (+1.42%)` | :arrow_up: |
   | [benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL3RwY2gucnM=) | `35.15% <0.00%> (ø)` | |
   | [benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL255Y3RheGkucnM=) | `0.00% <0.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...3ebd266](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620643551



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -781,58 +846,140 @@ pub fn create_hashes<'a>(
     random_state: &RandomState,
     hashes_buffer: &'a mut Vec<u64>,
 ) -> Result<&'a mut Vec<u64>> {
+    // combine hashes with `combine_hashes` if we have more than 1 column
+    let multi_col = arrays.len() > 1;
+
     for col in arrays {
         match col.data_type() {
             DataType::UInt8 => {
-                hash_array!(UInt8Array, col, u8, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt8Array,
+                    col,
+                    u8,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt16 => {
-                hash_array!(UInt16Array, col, u16, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt16Array,
+                    col,
+                    u16,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt32 => {
-                hash_array!(UInt32Array, col, u32, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt32Array,
+                    col,
+                    u32,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::UInt64 => {
-                hash_array!(UInt64Array, col, u64, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    UInt64Array,
+                    col,
+                    u64,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int8 => {
-                hash_array!(Int8Array, col, i8, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int8Array,
+                    col,
+                    i8,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int16 => {
-                hash_array!(Int16Array, col, i16, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int16Array,
+                    col,
+                    i16,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int32 => {
-                hash_array!(Int32Array, col, i32, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int32Array,
+                    col,
+                    i32,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Int64 => {
-                hash_array!(Int64Array, col, i64, hashes_buffer, random_state);
+                hash_array_primitive!(
+                    Int64Array,
+                    col,
+                    i64,
+                    hashes_buffer,
+                    random_state,
+                    multi_col
+                );
             }
             DataType::Timestamp(TimeUnit::Microsecond, None) => {
-                hash_array!(
+                hash_array_primitive!(
                     TimestampMicrosecondArray,
                     col,
                     i64,
                     hashes_buffer,
-                    random_state
+                    random_state,
+                    multi_col
                 );
             }
             DataType::Timestamp(TimeUnit::Nanosecond, None) => {
-                hash_array!(
+                hash_array_primitive!(
                     TimestampNanosecondArray,
                     col,
                     i64,
                     hashes_buffer,
-                    random_state
+                    random_state,
+                    multi_col
                 );
             }
             DataType::Boolean => {
-                hash_array!(BooleanArray, col, u8, hashes_buffer, random_state);
+                hash_array_primitive!(

Review comment:
       I am wondering, are the bytes not guaranteed to be the same for the same bitmaps?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (90e539b) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **increase** coverage by `0.21%`.
   > The diff coverage is `48.73%`.
   
   > :exclamation: Current head 90e539b differs from pull request most recent head 2f30453. Consider uploading reports for the commit 2f30453 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master      #24      +/-   ##
   ==========================================
   + Coverage   76.24%   76.46%   +0.21%     
   ==========================================
     Files         134      134              
     Lines       23051    23029      -22     
   ==========================================
   + Hits        17576    17609      +33     
   + Misses       5475     5420      -55     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [ballista/rust/client/src/context.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jbGllbnQvc3JjL2NvbnRleHQucnM=) | `0.00% <0.00%> (ø)` | |
   | [benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL255Y3RheGkucnM=) | `0.00% <ø> (ø)` | |
   | [benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL3RwY2gucnM=) | `35.07% <0.00%> (-0.08%)` | :arrow_down: |
   | [datafusion-examples/examples/flight\_server.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1leGFtcGxlcy9leGFtcGxlcy9mbGlnaHRfc2VydmVyLnJz) | `0.00% <0.00%> (ø)` | |
   | [datafusion-examples/examples/simple\_udaf.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1leGFtcGxlcy9leGFtcGxlcy9zaW1wbGVfdWRhZi5ycw==) | `0.00% <ø> (ø)` | |
   | [datafusion/src/physical\_plan/expressions/case.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9jYXNlLnJz) | `72.91% <16.66%> (-0.39%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.94% <78.57%> (-0.49%)` | :arrow_down: |
   | [datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi90ZXN0cy9zcWwucnM=) | `99.88% <100.00%> (+1.42%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...2f30453](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e1c0a4e) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **increase** coverage by `0.16%`.
   > The diff coverage is `63.00%`.
   
   > :exclamation: Current head e1c0a4e differs from pull request most recent head 0e1bdb4. Consider uploading reports for the commit 0e1bdb4 to get more accurate results
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master      #24      +/-   ##
   ==========================================
   + Coverage   76.24%   76.41%   +0.16%     
   ==========================================
     Files         134      134              
     Lines       23051    23181     +130     
   ==========================================
   + Hits        17576    17714     +138     
   + Misses       5475     5467       -8     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [ballista/rust/client/src/context.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jbGllbnQvc3JjL2NvbnRleHQucnM=) | `0.00% <0.00%> (ø)` | |
   | [ballista/rust/executor/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9leGVjdXRvci9zcmMvbWFpbi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL255Y3RheGkucnM=) | `0.00% <ø> (ø)` | |
   | [datafusion-examples/examples/flight\_server.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1leGFtcGxlcy9leGFtcGxlcy9mbGlnaHRfc2VydmVyLnJz) | `0.00% <0.00%> (ø)` | |
   | [datafusion-examples/examples/simple\_udaf.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1leGFtcGxlcy9leGFtcGxlcy9zaW1wbGVfdWRhZi5ycw==) | `0.00% <ø> (ø)` | |
   | [datafusion/src/physical\_plan/expressions/case.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9jYXNlLnJz) | `72.91% <16.66%> (-0.39%)` | :arrow_down: |
   | [benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL3RwY2gucnM=) | `35.35% <33.33%> (+0.20%)` | :arrow_up: |
   | [datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvZGF0YXNvdXJjZS9jc3YucnM=) | `73.33% <69.86%> (-9.28%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/csv.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9jc3YucnM=) | `79.09% <75.72%> (-4.13%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.94% <78.57%> (-0.49%)` | :arrow_down: |
   | ... and [3 more](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...0e1bdb4](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org