You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/06 20:22:07 UTC

[GitHub] [arrow] Dandandan opened a new pull request #9116: ARROW-11156: [Rust][DataFusion] Vectorized hashing [WIP

Dandandan opened a new pull request #9116:
URL: https://github.com/apache/arrow/pull/9116


   Create hashes vectorized in hash join
   
   This is based on the open PR https://github.com/apache/arrow/pull/9070
   
   The idea is as follows:
   
   * We still use the `HashMap` but rather than using the data as key we use the hash value ( `u64`) as key. this allows the hashmap still to grow etc, but it does a bit more on insert (an extra hash) than with a custom vectorized hashmap
   * Collision check (for probe side) needs to be implemented, also hash concat/merging needs to be improved, now it is just h1 + h2 + h3 etc. which makes `(1, 2)` map to the same hash value as `(2, 1)`.
   * Only the hash value creation is in this PR vectorized, the rest is still on a row basis.
   
   TCPH is without the remaining part ~10% faster than the PR: ~180 vs ~200ms.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554612488



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -80,6 +80,8 @@ pub struct HashJoinExec {
     schema: SchemaRef,
     /// Build-side
     build_side: Arc<Mutex<Option<JoinLeftData>>>,
+    /// Shares the `RandomState` for the hashing algorithm
+    random_state: RandomState,

Review comment:
       This is needed to keep the random state the same between hashing and probing stage.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (28cd876) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.08%`.
   > The diff coverage is `67.39%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.60%   82.52%   -0.09%     
   ==========================================
     Files         204      204              
     Lines       50496    50618     +122     
   ==========================================
   + Hits        41713    41772      +59     
   - Misses       8783     8846      +63     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.90% <ø> (ø)` | |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `77.85% <65.11%> (-8.44%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <68.62%> (+0.14%)` | :arrow_up: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `78.94% <0.00%> (-5.27%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [0050795...28cd876](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-763040341


   This one is next in line for merging @jorgecarleitao  and I have our eyes on it... Once a few more tests have completed on https://github.com/apache/arrow/commits/master we'll get it in


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554613451



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -907,4 +1068,53 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]
+    fn join_with_hash_collision() -> Result<()> {
+        let mut hashmap_left = HashMap::with_hasher(IdHashBuilder {});
+        let left = build_table_i32(
+            ("a", &vec![10, 20]),
+            ("x", &vec![100, 200]),
+            ("y", &vec![200, 300]),
+        );
+
+        let random_state = RandomState::new();
+
+        let hashes = create_hashes(&[left.columns()[0].clone()], &random_state)?;
+
+        // Create hash collisions
+        hashmap_left.insert(hashes[0], vec![0, 1]);

Review comment:
       Maps to both indices, the test makes sure the extra "collisions" are removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-757006724






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554612673



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -193,6 +198,8 @@ impl ExecutionPlan for HashJoinExec {
     }
 
     async fn execute(&self, partition: usize) -> Result<SendableRecordBatchStream> {
+        let on_left = self.on.iter().map(|on| on.0.clone()).collect::<Vec<_>>();

Review comment:
       Shared between build / probe stage.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (dcc9045) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.14%`.
   > The diff coverage is `63.26%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.60%   82.46%   -0.15%     
   ==========================================
     Files         204      204              
     Lines       50496    51020     +524     
   ==========================================
   + Hits        41713    42073     +360     
   - Misses       8783     8947     +164     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <0.00%> (+0.02%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <ø> (+<0.01%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `74.08% <60.33%> (-12.21%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <68.62%> (+0.14%)` | :arrow_up: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `83.77% <0.00%> (-0.71%)` | :arrow_down: |
   | ... and [8 more](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [0050795...dcc9045](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755945052


   So what remains is the collision check.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554612380



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -329,6 +329,76 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }
 
+/// Create a key `Vec<u8>` that is used as key for the hashmap
+fn create_key(group_by_keys: &[ArrayRef], row: usize, vec: &mut Vec<u8>) -> Result<()> {

Review comment:
       This moved to `hash_aggregate`. It can be removed once hash aggregate does stops using `Vec<u8>` as a key




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-763072450


   I am calling the day, so feel free to continue the big flush. Thanks a lot for taking this, @alamb !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paddyhoran commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
paddyhoran commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-757131336


   @Dandandan I was trying to set up the process for running the integration tests this morning and I cancelled your travis job by accident.  I'm not familiar with Travis, I looked to see if I could easily re-trigger it but I don't want to mess anything else up.
   
   @jorgecarleitao @nevi-me @alamb @andygrove Is there an easy way to re-start just the travis job?
   
   Sorry for the disruption.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755761637


   https://issues.apache.org/jira/browse/ARROW-11156


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554584104



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -55,13 +56,12 @@ use crate::error::{DataFusionError, Result};
 
 use super::{ExecutionPlan, Partitioning, RecordBatchStream, SendableRecordBatchStream};
 use crate::physical_plan::coalesce_batches::concat_batches;
-use ahash::RandomState;
 use log::debug;
 
-// Maps ["on" value] -> [list of indices with this key's value]
-// E.g. [1, 2] -> [(0, 3), (1, 6), (0, 8)] indicates that (column1, column2) = [1, 2] is true
-// for rows 3 and 8 from batch 0 and row 6 from batch 1.
-type JoinHashMap = HashMap<Vec<u8>, Vec<u64>, RandomState>;
+// Maps a `u64` hash value based on the left ["on" values] to a list of indices with this key's value.
+// E.g. [1, 2] -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8
+// As the key is a hash value, we need to check possible hash collisions in the probe stage

Review comment:
       👍 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755945236


   @jorgecarleitao yes, definitely!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-763087051


   Master is looking pretty good: https://github.com/apache/arrow/runs/1729927062 and merged this branch locally into master and it compiles and passes tests. Merging it in


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-757247320


   @paddyhoran no problem, I will make some more changes anyway. Thanks for notifying!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755765653


   👍  I think that this could be also beneficial for the hash aggregate!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (d6102c9) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.85%`.
   > The diff coverage is `72.10%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.60%   81.75%   -0.86%     
   ==========================================
     Files         204      214      +10     
     Lines       50496    51510    +1014     
   ==========================================
   + Hits        41713    42110     +397     
   - Misses       8783     9400     +617     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <0.00%> (+0.02%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <ø> (+<0.01%)` | :arrow_up: |
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `83.16% <47.05%> (-3.32%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `82.02% <74.05%> (-4.27%)` | :arrow_down: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <86.11%> (+0.14%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | ... and [32 more](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [0050795...1823c50](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (413660e) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.90%`.
   > The diff coverage is `64.93%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.60%   81.70%   -0.91%     
   ==========================================
     Files         204      214      +10     
     Lines       50496    51493     +997     
   ==========================================
   + Hits        41713    42072     +359     
   - Misses       8783     9421     +638     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <0.00%> (+0.02%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <ø> (+<0.01%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `74.08% <60.33%> (-12.21%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <86.11%> (+0.14%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `83.77% <0.00%> (-0.71%)` | :arrow_down: |
   | ... and [24 more](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [0050795...413660e](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-758274127


   Hi @Dandandan  thanks ! I will try and review this over the next few days. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (b167b3b) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.08%`.
   > The diff coverage is `66.91%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.60%   82.52%   -0.09%     
   ==========================================
     Files         204      204              
     Lines       50496    50615     +119     
   ==========================================
   + Hits        41713    41769      +56     
   - Misses       8783     8846      +63     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.90% <ø> (ø)` | |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `77.70% <64.28%> (-8.59%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <68.62%> (+0.14%)` | :arrow_up: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `78.94% <0.00%> (-5.27%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [0050795...b167b3b](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554613147



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -575,6 +542,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {

Review comment:
       Just a simple hashing combination function. Source: http://myeyesareblind.com/2017/02/06/Combine-hash-values/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (d14cff3) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.90%`.
   > The diff coverage is `64.93%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.60%   81.70%   -0.91%     
   ==========================================
     Files         204      214      +10     
     Lines       50496    51493     +997     
   ==========================================
   + Hits        41713    42072     +359     
   - Misses       8783     9421     +638     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <0.00%> (+0.02%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <ø> (+<0.01%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `74.08% <60.33%> (-12.21%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <86.11%> (+0.14%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `83.77% <0.00%> (-0.71%)` | :arrow_down: |
   | ... and [24 more](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [0050795...d14cff3](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9116:
URL: https://github.com/apache/arrow/pull/9116


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paddyhoran commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
paddyhoran commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-757131336


   @Dandandan I was trying to set up the process for running the integration tests this morning and I cancelled your travis job by accident.  I'm not familiar with Travis, I looked to see if I could easily re-trigger it but I don't want to mess anything else up.
   
   @jorgecarleitao @nevi-me @alamb @andygrove Is there an easy way to re-start just the travis job?
   
   Sorry for the disruption.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r557994080



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -604,6 +562,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {
+    let hash = (17 * 37u64).overflowing_add(l).0;
+    hash.overflowing_mul(37).0.overflowing_add(r).0
+}
+
+macro_rules! equal_rows_elem {
+    ($array_type:ident, $l: ident, $r: ident, $left: ident, $right: ident) => {
+        $l.as_any()
+            .downcast_ref::<$array_type>()
+            .unwrap()
+            .value($left)
+            == $r
+                .as_any()
+                .downcast_ref::<$array_type>()
+                .unwrap()
+                .value($right)
+    };
+}
+
+/// Left and right row have equal values
+fn equal_rows(
+    left: usize,
+    right: usize,
+    left_arrays: &[ArrayRef],
+    right_arrays: &[ArrayRef],
+) -> Result<bool> {
+    let mut err = None;
+    let res = left_arrays
+        .iter()
+        .zip(right_arrays)
+        .all(|(l, r)| match l.data_type() {
+            DataType::Null => true,
+            DataType::Boolean => {
+                equal_rows_elem!(BooleanArray, l, r, left, right)
+            }
+            DataType::Int8 => {
+                equal_rows_elem!(Int8Array, l, r, left, right)
+            }
+            DataType::Int16 => {
+                equal_rows_elem!(Int16Array, l, r, left, right)
+            }
+            DataType::Int32 => {
+                equal_rows_elem!(Int32Array, l, r, left, right)
+            }
+            DataType::Int64 => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::UInt8 => {
+                equal_rows_elem!(UInt8Array, l, r, left, right)
+            }
+            DataType::UInt16 => {
+                equal_rows_elem!(UInt16Array, l, r, left, right)
+            }
+            DataType::UInt32 => {
+                equal_rows_elem!(UInt32Array, l, r, left, right)
+            }
+            DataType::UInt64 => {
+                equal_rows_elem!(UInt64Array, l, r, left, right)
+            }
+            DataType::Timestamp(_, None) => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::Utf8 => {
+                equal_rows_elem!(StringArray, l, r, left, right)
+            }
+            DataType::LargeUtf8 => {
+                equal_rows_elem!(LargeStringArray, l, r, left, right)
+            }
+            _ => {
+                // This is internal because we should have caught this before.
+                err = Some(Err(DataFusionError::Internal(
+                    "Unsupported data type in hasher".to_string(),
+                )));
+                false
+            }
+        });
+
+    err.unwrap_or(Ok(res))
+}
+
+macro_rules! hash_array {
+    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {

Review comment:
       Yes, agreed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554612380



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -329,6 +329,76 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }
 
+/// Create a key `Vec<u8>` that is used as key for the hashmap
+fn create_key(group_by_keys: &[ArrayRef], row: usize, vec: &mut Vec<u8>) -> Result<()> {

Review comment:
       This moved to `hash_aggregate`. It can move once hash aggregate does stops using `Vec<u8>` as a key




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-761168290


   Thanks @alamb submodule change reverted, not sure what happened there.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (ac0de44) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.08%`.
   > The diff coverage is `67.15%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.60%   82.52%   -0.09%     
   ==========================================
     Files         204      204              
     Lines       50496    50618     +122     
   ==========================================
   + Hits        41713    41772      +59     
   - Misses       8783     8846      +63     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.90% <ø> (ø)` | |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `77.85% <64.70%> (-8.44%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <68.62%> (+0.14%)` | :arrow_up: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `78.94% <0.00%> (-5.27%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [0050795...28cd876](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554979135



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -269,34 +276,35 @@ impl ExecutionPlan for HashJoinExec {
             num_output_batches: 0,
             num_output_rows: 0,
             join_time: 0,
+            random_state: self.random_state.clone(),
         }))
     }
 }
 
 /// Updates `hash` with new entries from [RecordBatch] evaluated against the expressions `on`,
 /// assuming that the [RecordBatch] corresponds to the `index`th
 fn update_hash(
-    on: &HashSet<String>,
+    on: &[String],
     batch: &RecordBatch,
     hash: &mut JoinHashMap,
     offset: usize,
+    random_state: &RandomState,
 ) -> Result<()> {
     // evaluate the keys
     let keys_values = on
         .iter()
         .map(|name| Ok(col(name).evaluate(batch)?.into_array(batch.num_rows())))
         .collect::<Result<Vec<_>>>()?;
 
-    let mut key = Vec::with_capacity(keys_values.len());
-
     // update the hash map
-    for row in 0..batch.num_rows() {
-        create_key(&keys_values, row, &mut key)?;
+    let hash_values = create_hashes(&keys_values, &random_state)?;
 
+    // insert hashes to key of the hashmap
+    for (row, hash_value) in hash_values.iter().enumerate() {
         hash.raw_entry_mut()
-            .from_key(&key)
+            .from_key_hashed_nocheck(*hash_value, hash_value)

Review comment:
       hash = key now




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-761773287


   I apologize for the delay in merging Rust PRs -- the 3.0 release is being finalized now and are planning to minimize entropy by postponing merging  changes not critical for the release until the process was complete. I hope the process is complete in the next few days. There is more [discussion](https://lists.apache.org/list.html?dev@arrow.apache.org) in the mailing list 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (f87daf4) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.02%`.
   > The diff coverage is `72.83%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   81.61%   81.58%   -0.03%     
   ==========================================
     Files         215      215              
     Lines       51867    51965      +98     
   ==========================================
   + Hits        42329    42398      +69     
   - Misses       9538     9567      +29     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `81.92% <52.63%> (-2.87%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `84.31% <78.51%> (-0.40%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [4e1c3bb...f87daf4](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-757530518


   This is ready for review now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (24537da) into [master](https://codecov.io/gh/apache/arrow/commit/5228ede9abf8ecf9b4bb68a06075cd16af3523e9?el=desc) (5228ede) will **decrease** coverage by `0.08%`.
   > The diff coverage is `64.90%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.59%   82.50%   -0.09%     
   ==========================================
     Files         204      204              
     Lines       50169    50411     +242     
   ==========================================
   + Hits        41436    41594     +158     
   - Misses       8733     8817      +84     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `77.85% <64.66%> (-8.48%)` | :arrow_down: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/utils.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdXRpbHMucnM=) | `73.78% <0.00%> (-11.22%)` | :arrow_down: |
   | [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `71.69% <0.00%> (-3.66%)` | :arrow_down: |
   | [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `88.88% <0.00%> (-1.28%)` | :arrow_down: |
   | [rust/parquet/src/schema/types.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9zY2hlbWEvdHlwZXMucnM=) | `89.52% <0.00%> (-0.42%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `84.61% <0.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/array/array\_struct.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RydWN0LnJz) | `88.43% <0.00%> (-0.18%)` | :arrow_down: |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <0.00%> (-0.01%)` | :arrow_down: |
   | [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `97.31% <0.00%> (ø)` | |
   | ... and [18 more](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [5228ede...24537da](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554612765



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -269,34 +276,35 @@ impl ExecutionPlan for HashJoinExec {
             num_output_batches: 0,
             num_output_rows: 0,
             join_time: 0,
+            random_state: self.random_state.clone(),
         }))
     }
 }
 
 /// Updates `hash` with new entries from [RecordBatch] evaluated against the expressions `on`,
 /// assuming that the [RecordBatch] corresponds to the `index`th
 fn update_hash(
-    on: &HashSet<String>,
+    on: &[String],

Review comment:
       `HashSet` is not needed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (1823c50) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.82%`.
   > The diff coverage is `76.29%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   82.60%   81.78%   -0.83%     
   ==========================================
     Files         204      214      +10     
     Lines       50496    51481     +985     
   ==========================================
   + Hits        41713    42103     +390     
   - Misses       8783     9378     +595     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <0.00%> (+0.02%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <ø> (+<0.01%)` | :arrow_up: |
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `83.16% <47.05%> (-3.32%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `85.58% <81.42%> (-0.71%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <86.11%> (+0.14%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `88.23% <100.00%> (ø)` | |
   | ... and [31 more](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [0050795...1823c50](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (5deec7b) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.02%`.
   > The diff coverage is `72.83%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   81.61%   81.58%   -0.03%     
   ==========================================
     Files         215      215              
     Lines       51867    51965      +98     
   ==========================================
   + Hits        42329    42397      +68     
   - Misses       9538     9568      +30     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `81.92% <52.63%> (-2.87%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `84.31% <78.51%> (-0.40%)` | :arrow_down: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [4e1c3bb...f87daf4](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554613035



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -575,6 +542,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,

Review comment:
       We just return the key value here, so we don't "rehash" when growing or indexing the `HashMap`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-758136540


   FYI @jorgecarleitao @alamb @andygrove 
   This PR is ready for review. I added some comments to the PR. My next step would be reusing the code/strategy for the hash aggregates. I think a similar speed up could be realized there as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755944532


   The unit tests are succeeding now, had to do with the `RandomState` being reset on every `execute` 🎉
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r557861327



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -604,6 +562,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {
+    let hash = (17 * 37u64).overflowing_add(l).0;
+    hash.overflowing_mul(37).0.overflowing_add(r).0
+}
+
+macro_rules! equal_rows_elem {
+    ($array_type:ident, $l: ident, $r: ident, $left: ident, $right: ident) => {
+        $l.as_any()
+            .downcast_ref::<$array_type>()
+            .unwrap()
+            .value($left)
+            == $r
+                .as_any()
+                .downcast_ref::<$array_type>()
+                .unwrap()
+                .value($right)
+    };
+}
+
+/// Left and right row have equal values
+fn equal_rows(
+    left: usize,
+    right: usize,
+    left_arrays: &[ArrayRef],
+    right_arrays: &[ArrayRef],
+) -> Result<bool> {
+    let mut err = None;
+    let res = left_arrays
+        .iter()
+        .zip(right_arrays)
+        .all(|(l, r)| match l.data_type() {
+            DataType::Null => true,
+            DataType::Boolean => {
+                equal_rows_elem!(BooleanArray, l, r, left, right)
+            }
+            DataType::Int8 => {
+                equal_rows_elem!(Int8Array, l, r, left, right)
+            }
+            DataType::Int16 => {
+                equal_rows_elem!(Int16Array, l, r, left, right)
+            }
+            DataType::Int32 => {
+                equal_rows_elem!(Int32Array, l, r, left, right)
+            }
+            DataType::Int64 => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::UInt8 => {
+                equal_rows_elem!(UInt8Array, l, r, left, right)
+            }
+            DataType::UInt16 => {
+                equal_rows_elem!(UInt16Array, l, r, left, right)
+            }
+            DataType::UInt32 => {
+                equal_rows_elem!(UInt32Array, l, r, left, right)
+            }
+            DataType::UInt64 => {
+                equal_rows_elem!(UInt64Array, l, r, left, right)
+            }
+            DataType::Timestamp(_, None) => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::Utf8 => {
+                equal_rows_elem!(StringArray, l, r, left, right)
+            }
+            DataType::LargeUtf8 => {
+                equal_rows_elem!(LargeStringArray, l, r, left, right)
+            }
+            _ => {
+                // This is internal because we should have caught this before.
+                err = Some(Err(DataFusionError::Internal(
+                    "Unsupported data type in hasher".to_string(),
+                )));
+                false
+            }
+        });
+
+    err.unwrap_or(Ok(res))
+}
+
+macro_rules! hash_array {
+    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {
+        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
+
+        for (i, hash) in $hashes.iter_mut().enumerate() {
+            let mut hasher = $random_state.build_hasher();
+            hasher.$f(array.value(i));

Review comment:
       A potential simplification here is to use `let values = array.values();` and zip them with `$hashes`.

##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -604,6 +562,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {
+    let hash = (17 * 37u64).overflowing_add(l).0;
+    hash.overflowing_mul(37).0.overflowing_add(r).0
+}
+
+macro_rules! equal_rows_elem {
+    ($array_type:ident, $l: ident, $r: ident, $left: ident, $right: ident) => {
+        $l.as_any()
+            .downcast_ref::<$array_type>()
+            .unwrap()
+            .value($left)
+            == $r
+                .as_any()
+                .downcast_ref::<$array_type>()
+                .unwrap()
+                .value($right)
+    };
+}
+
+/// Left and right row have equal values
+fn equal_rows(
+    left: usize,
+    right: usize,
+    left_arrays: &[ArrayRef],
+    right_arrays: &[ArrayRef],
+) -> Result<bool> {
+    let mut err = None;
+    let res = left_arrays
+        .iter()
+        .zip(right_arrays)
+        .all(|(l, r)| match l.data_type() {
+            DataType::Null => true,
+            DataType::Boolean => {
+                equal_rows_elem!(BooleanArray, l, r, left, right)
+            }
+            DataType::Int8 => {
+                equal_rows_elem!(Int8Array, l, r, left, right)
+            }
+            DataType::Int16 => {
+                equal_rows_elem!(Int16Array, l, r, left, right)
+            }
+            DataType::Int32 => {
+                equal_rows_elem!(Int32Array, l, r, left, right)
+            }
+            DataType::Int64 => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::UInt8 => {
+                equal_rows_elem!(UInt8Array, l, r, left, right)
+            }
+            DataType::UInt16 => {
+                equal_rows_elem!(UInt16Array, l, r, left, right)
+            }
+            DataType::UInt32 => {
+                equal_rows_elem!(UInt32Array, l, r, left, right)
+            }
+            DataType::UInt64 => {
+                equal_rows_elem!(UInt64Array, l, r, left, right)
+            }
+            DataType::Timestamp(_, None) => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::Utf8 => {
+                equal_rows_elem!(StringArray, l, r, left, right)
+            }
+            DataType::LargeUtf8 => {
+                equal_rows_elem!(LargeStringArray, l, r, left, right)
+            }
+            _ => {
+                // This is internal because we should have caught this before.
+                err = Some(Err(DataFusionError::Internal(
+                    "Unsupported data type in hasher".to_string(),
+                )));
+                false
+            }
+        });
+
+    err.unwrap_or(Ok(res))
+}
+
+macro_rules! hash_array {
+    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {

Review comment:
       IMO we should prob move this to the arrow crate at some point: hashing and equality are operations that need to be implemented with some invariants for consistency. I see that you had to implement an `equal_rows_elem`, which already indicates that.

##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -604,6 +562,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {
+    let hash = (17 * 37u64).overflowing_add(l).0;
+    hash.overflowing_mul(37).0.overflowing_add(r).0
+}
+
+macro_rules! equal_rows_elem {
+    ($array_type:ident, $l: ident, $r: ident, $left: ident, $right: ident) => {
+        $l.as_any()
+            .downcast_ref::<$array_type>()
+            .unwrap()
+            .value($left)
+            == $r
+                .as_any()
+                .downcast_ref::<$array_type>()
+                .unwrap()
+                .value($right)
+    };
+}
+
+/// Left and right row have equal values
+fn equal_rows(
+    left: usize,
+    right: usize,
+    left_arrays: &[ArrayRef],
+    right_arrays: &[ArrayRef],
+) -> Result<bool> {
+    let mut err = None;
+    let res = left_arrays
+        .iter()
+        .zip(right_arrays)
+        .all(|(l, r)| match l.data_type() {
+            DataType::Null => true,
+            DataType::Boolean => {
+                equal_rows_elem!(BooleanArray, l, r, left, right)
+            }
+            DataType::Int8 => {
+                equal_rows_elem!(Int8Array, l, r, left, right)
+            }
+            DataType::Int16 => {
+                equal_rows_elem!(Int16Array, l, r, left, right)
+            }
+            DataType::Int32 => {
+                equal_rows_elem!(Int32Array, l, r, left, right)
+            }
+            DataType::Int64 => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::UInt8 => {
+                equal_rows_elem!(UInt8Array, l, r, left, right)
+            }
+            DataType::UInt16 => {
+                equal_rows_elem!(UInt16Array, l, r, left, right)
+            }
+            DataType::UInt32 => {
+                equal_rows_elem!(UInt32Array, l, r, left, right)
+            }
+            DataType::UInt64 => {
+                equal_rows_elem!(UInt64Array, l, r, left, right)
+            }
+            DataType::Timestamp(_, None) => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::Utf8 => {
+                equal_rows_elem!(StringArray, l, r, left, right)
+            }
+            DataType::LargeUtf8 => {
+                equal_rows_elem!(LargeStringArray, l, r, left, right)
+            }
+            _ => {
+                // This is internal because we should have caught this before.
+                err = Some(Err(DataFusionError::Internal(
+                    "Unsupported data type in hasher".to_string(),
+                )));
+                false
+            }
+        });
+
+    err.unwrap_or(Ok(res))
+}
+
+macro_rules! hash_array {
+    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {
+        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
+
+        for (i, hash) in $hashes.iter_mut().enumerate() {
+            let mut hasher = $random_state.build_hasher();
+            hasher.$f(array.value(i));
+            *hash = combine_hashes(hasher.finish(), *hash);
+        }
+    };
+}
+
+/// Creates hash values for every element in the row based on the values in the columns
+fn create_hashes(arrays: &[ArrayRef], random_state: &RandomState) -> Result<Vec<u64>> {

Review comment:
       This could be an iterator, which would allow to be plugged in in the iteration where it is used: avoids allocating a vector for later iterating over.

##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -604,6 +562,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {
+    let hash = (17 * 37u64).overflowing_add(l).0;
+    hash.overflowing_mul(37).0.overflowing_add(r).0
+}
+
+macro_rules! equal_rows_elem {
+    ($array_type:ident, $l: ident, $r: ident, $left: ident, $right: ident) => {
+        $l.as_any()
+            .downcast_ref::<$array_type>()
+            .unwrap()
+            .value($left)
+            == $r
+                .as_any()
+                .downcast_ref::<$array_type>()
+                .unwrap()
+                .value($right)
+    };
+}
+
+/// Left and right row have equal values
+fn equal_rows(
+    left: usize,
+    right: usize,
+    left_arrays: &[ArrayRef],
+    right_arrays: &[ArrayRef],
+) -> Result<bool> {
+    let mut err = None;
+    let res = left_arrays
+        .iter()
+        .zip(right_arrays)
+        .all(|(l, r)| match l.data_type() {
+            DataType::Null => true,
+            DataType::Boolean => {
+                equal_rows_elem!(BooleanArray, l, r, left, right)
+            }
+            DataType::Int8 => {
+                equal_rows_elem!(Int8Array, l, r, left, right)
+            }
+            DataType::Int16 => {
+                equal_rows_elem!(Int16Array, l, r, left, right)
+            }
+            DataType::Int32 => {
+                equal_rows_elem!(Int32Array, l, r, left, right)
+            }
+            DataType::Int64 => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::UInt8 => {
+                equal_rows_elem!(UInt8Array, l, r, left, right)
+            }
+            DataType::UInt16 => {
+                equal_rows_elem!(UInt16Array, l, r, left, right)
+            }
+            DataType::UInt32 => {
+                equal_rows_elem!(UInt32Array, l, r, left, right)
+            }
+            DataType::UInt64 => {
+                equal_rows_elem!(UInt64Array, l, r, left, right)
+            }
+            DataType::Timestamp(_, None) => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::Utf8 => {
+                equal_rows_elem!(StringArray, l, r, left, right)
+            }
+            DataType::LargeUtf8 => {
+                equal_rows_elem!(LargeStringArray, l, r, left, right)
+            }
+            _ => {
+                // This is internal because we should have caught this before.
+                err = Some(Err(DataFusionError::Internal(
+                    "Unsupported data type in hasher".to_string(),
+                )));
+                false
+            }
+        });
+
+    err.unwrap_or(Ok(res))
+}
+
+macro_rules! hash_array {
+    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {
+        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
+
+        for (i, hash) in $hashes.iter_mut().enumerate() {
+            let mut hasher = $random_state.build_hasher();
+            hasher.$f(array.value(i));

Review comment:
       Isn't this ignoring nulls? Maybe because they are discarded in joins? If so, isn't this a problem, as we are reading the value of a null slot?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r557992737



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -604,6 +562,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {
+    let hash = (17 * 37u64).overflowing_add(l).0;
+    hash.overflowing_mul(37).0.overflowing_add(r).0
+}
+
+macro_rules! equal_rows_elem {
+    ($array_type:ident, $l: ident, $r: ident, $left: ident, $right: ident) => {
+        $l.as_any()
+            .downcast_ref::<$array_type>()
+            .unwrap()
+            .value($left)
+            == $r
+                .as_any()
+                .downcast_ref::<$array_type>()
+                .unwrap()
+                .value($right)
+    };
+}
+
+/// Left and right row have equal values
+fn equal_rows(
+    left: usize,
+    right: usize,
+    left_arrays: &[ArrayRef],
+    right_arrays: &[ArrayRef],
+) -> Result<bool> {
+    let mut err = None;
+    let res = left_arrays
+        .iter()
+        .zip(right_arrays)
+        .all(|(l, r)| match l.data_type() {
+            DataType::Null => true,
+            DataType::Boolean => {
+                equal_rows_elem!(BooleanArray, l, r, left, right)
+            }
+            DataType::Int8 => {
+                equal_rows_elem!(Int8Array, l, r, left, right)
+            }
+            DataType::Int16 => {
+                equal_rows_elem!(Int16Array, l, r, left, right)
+            }
+            DataType::Int32 => {
+                equal_rows_elem!(Int32Array, l, r, left, right)
+            }
+            DataType::Int64 => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::UInt8 => {
+                equal_rows_elem!(UInt8Array, l, r, left, right)
+            }
+            DataType::UInt16 => {
+                equal_rows_elem!(UInt16Array, l, r, left, right)
+            }
+            DataType::UInt32 => {
+                equal_rows_elem!(UInt32Array, l, r, left, right)
+            }
+            DataType::UInt64 => {
+                equal_rows_elem!(UInt64Array, l, r, left, right)
+            }
+            DataType::Timestamp(_, None) => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::Utf8 => {
+                equal_rows_elem!(StringArray, l, r, left, right)
+            }
+            DataType::LargeUtf8 => {
+                equal_rows_elem!(LargeStringArray, l, r, left, right)
+            }
+            _ => {
+                // This is internal because we should have caught this before.
+                err = Some(Err(DataFusionError::Internal(
+                    "Unsupported data type in hasher".to_string(),
+                )));
+                false
+            }
+        });
+
+    err.unwrap_or(Ok(res))
+}
+
+macro_rules! hash_array {
+    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {
+        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
+
+        for (i, hash) in $hashes.iter_mut().enumerate() {
+            let mut hasher = $random_state.build_hasher();
+            hasher.$f(array.value(i));

Review comment:
       I think so, I added some null handling in `hash_array` and `equal_rows_elem`. I think it makes sense that we have some more testing for nulls as well as this can be very tricky.
   I am not sure this is handled 100% correctly with the previous code either, I think it is actually not correct, for example, the inner join doesn't check on null values. 
   
   I added an issue here to add (more) tests on null values in keys:
   
   https://issues.apache.org/jira/browse/ARROW-11258




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (567bc67) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.02%`.
   > The diff coverage is `72.83%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   81.61%   81.58%   -0.03%     
   ==========================================
     Files         215      215              
     Lines       51867    51965      +98     
   ==========================================
   + Hits        42329    42397      +68     
   - Misses       9538     9568      +30     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `81.92% <52.63%> (-2.87%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `84.31% <78.51%> (-0.40%)` | :arrow_down: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [4e1c3bb...567bc67](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r558531603



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -936,4 +1096,53 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]
+    fn join_with_hash_collision() -> Result<()> {
+        let mut hashmap_left = HashMap::with_hasher(IdHashBuilder {});
+        let left = build_table_i32(
+            ("a", &vec![10, 20]),
+            ("x", &vec![100, 200]),
+            ("y", &vec![200, 300]),
+        );
+
+        let random_state = RandomState::new();
+
+        let hashes = create_hashes(&[left.columns()[0].clone()], &random_state)?;
+
+        // Create hash collisions
+        hashmap_left.insert(hashes[0], vec![0, 1]);
+        hashmap_left.insert(hashes[1], vec![0, 1]);
+
+        let right = build_table_i32(
+            ("a", &vec![10, 20]),
+            ("b", &vec![0, 0]),
+            ("c", &vec![30, 40]),
+        );
+
+        let left_data = JoinLeftData::new((hashmap_left, left));
+        let (l, r) = build_join_indexes(
+            &left_data,
+            &right,
+            JoinType::Inner,

Review comment:
       I recommend adding a test that covers collisions for all the `JoinType` variants -- the path for collision handling is different for those different join type variants. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (2893556) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.02%`.
   > The diff coverage is `72.83%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   81.61%   81.58%   -0.03%     
   ==========================================
     Files         215      215              
     Lines       51867    51965      +98     
   ==========================================
   + Hits        42329    42397      +68     
   - Misses       9538     9568      +30     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `81.92% <52.63%> (-2.87%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `84.31% <78.51%> (-0.40%)` | :arrow_down: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [4e1c3bb...2893556](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554613355



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -575,6 +542,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {
+    let hash = (17 * 37u64).overflowing_add(l).0;
+    hash.overflowing_mul(37).0.overflowing_add(r).0
+}
+
+macro_rules! equal_rows_elem {
+    ($array_type:ident, $l: ident, $r: ident, $left: ident, $right: ident) => {
+        $l.as_any()
+            .downcast_ref::<$array_type>()
+            .unwrap()
+            .value($left)
+            == $r
+                .as_any()
+                .downcast_ref::<$array_type>()
+                .unwrap()
+                .value($right)
+    };
+}
+
+/// Left and right row have equal values
+fn equal_rows(

Review comment:
       This is not vectorized (one row at a time). Could be vectorized in the future, checking an array of "matched candidates" instead of 1 by 1.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r557968302



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -604,6 +562,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {
+    let hash = (17 * 37u64).overflowing_add(l).0;
+    hash.overflowing_mul(37).0.overflowing_add(r).0
+}
+
+macro_rules! equal_rows_elem {
+    ($array_type:ident, $l: ident, $r: ident, $left: ident, $right: ident) => {
+        $l.as_any()
+            .downcast_ref::<$array_type>()
+            .unwrap()
+            .value($left)
+            == $r
+                .as_any()
+                .downcast_ref::<$array_type>()
+                .unwrap()
+                .value($right)
+    };
+}
+
+/// Left and right row have equal values
+fn equal_rows(
+    left: usize,
+    right: usize,
+    left_arrays: &[ArrayRef],
+    right_arrays: &[ArrayRef],
+) -> Result<bool> {
+    let mut err = None;
+    let res = left_arrays
+        .iter()
+        .zip(right_arrays)
+        .all(|(l, r)| match l.data_type() {
+            DataType::Null => true,
+            DataType::Boolean => {
+                equal_rows_elem!(BooleanArray, l, r, left, right)
+            }
+            DataType::Int8 => {
+                equal_rows_elem!(Int8Array, l, r, left, right)
+            }
+            DataType::Int16 => {
+                equal_rows_elem!(Int16Array, l, r, left, right)
+            }
+            DataType::Int32 => {
+                equal_rows_elem!(Int32Array, l, r, left, right)
+            }
+            DataType::Int64 => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::UInt8 => {
+                equal_rows_elem!(UInt8Array, l, r, left, right)
+            }
+            DataType::UInt16 => {
+                equal_rows_elem!(UInt16Array, l, r, left, right)
+            }
+            DataType::UInt32 => {
+                equal_rows_elem!(UInt32Array, l, r, left, right)
+            }
+            DataType::UInt64 => {
+                equal_rows_elem!(UInt64Array, l, r, left, right)
+            }
+            DataType::Timestamp(_, None) => {
+                equal_rows_elem!(Int64Array, l, r, left, right)
+            }
+            DataType::Utf8 => {
+                equal_rows_elem!(StringArray, l, r, left, right)
+            }
+            DataType::LargeUtf8 => {
+                equal_rows_elem!(LargeStringArray, l, r, left, right)
+            }
+            _ => {
+                // This is internal because we should have caught this before.
+                err = Some(Err(DataFusionError::Internal(
+                    "Unsupported data type in hasher".to_string(),
+                )));
+                false
+            }
+        });
+
+    err.unwrap_or(Ok(res))
+}
+
+macro_rules! hash_array {
+    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {
+        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
+
+        for (i, hash) in $hashes.iter_mut().enumerate() {
+            let mut hasher = $random_state.build_hasher();
+            hasher.$f(array.value(i));
+            *hash = combine_hashes(hasher.finish(), *hash);
+        }
+    };
+}
+
+/// Creates hash values for every element in the row based on the values in the columns
+fn create_hashes(arrays: &[ArrayRef], random_state: &RandomState) -> Result<Vec<u64>> {

Review comment:
       Not 100% sure how that would work for multiple columns? An earlier version of the PR also tried to reuse a bit of the allocation, but didn't seem to have a large impact on performance.
   I think this can be looked at in a following PR. I am also wondering if there are some cool SIMD hash algorithms on multiple elements of arrays (rather than the common `Vec<u8>`) if we want to optimize`create_hashes`. Or maybe we can write one ourselves.  I am not sure how that would work with an iterator?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-755963787


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=h1) Report
   > Merging [#9116](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=desc) (749ea4c) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.02%`.
   > The diff coverage is `72.83%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9116/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9116      +/-   ##
   ==========================================
   - Coverage   81.61%   81.58%   -0.03%     
   ==========================================
     Files         215      215              
     Lines       51867    51965      +98     
   ==========================================
   + Hits        42329    42398      +69     
   - Misses       9538     9567      +29     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `81.92% <52.63%> (-2.87%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9116/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `84.31% <78.51%> (-0.40%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=footer). Last update [4e1c3bb...086d642](https://codecov.io/gh/apache/arrow/pull/9116?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#discussion_r554613178



##########
File path: rust/datafusion/src/physical_plan/hash_join.rs
##########
@@ -575,6 +542,199 @@ fn build_join_indexes(
         }
     }
 }
+use core::hash::BuildHasher;
+
+/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
+/// it when inserting/indexing or regrowing the `HashMap`
+struct IdHasher {
+    hash: u64,
+}
+
+impl Hasher for IdHasher {
+    fn finish(&self) -> u64 {
+        self.hash
+    }
+
+    fn write_u64(&mut self, i: u64) {
+        self.hash = i;
+    }
+
+    fn write(&mut self, _bytes: &[u8]) {
+        unreachable!("IdHasher should only be used for u64 keys")
+    }
+}
+
+#[derive(Debug)]
+struct IdHashBuilder {}
+
+impl BuildHasher for IdHashBuilder {
+    type Hasher = IdHasher;
+
+    fn build_hasher(&self) -> Self::Hasher {
+        IdHasher { hash: 0 }
+    }
+}
+
+// Combines two hashes into one hash
+fn combine_hashes(l: u64, r: u64) -> u64 {

Review comment:
       Could be improved upon later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9116: ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join [WIP]

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9116:
URL: https://github.com/apache/arrow/pull/9116#issuecomment-757006724


   This is nearing readiness. Need to do some final cleanup & add a test for the collision detection.
   
   Benchmarks results are still roughly the same. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org