You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/21 16:24:50 UTC
[GitHub] [arrow-datafusion] Dandandan opened a new pull request #24: Micro optimize hash join functions
Dandandan opened a new pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24
~5% speed up by speeding up some hot functions
```
Query 5 avg time: 88.59 ms
Query 10 avg time: 256.62 ms
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620640590
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
match (left_array.is_null($left), left_array.is_null($right)) {
- (true, true) => true,
Review comment:
Yep! To be fair I also added this line myself 😆
I think we also should start testing DataFusion more rigorously wrt null handling, but one step at a time.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621304356
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -65,7 +65,7 @@ use log::debug;
// Maps a `u64` hash value based on the left ["on" values] to a list of indices with this key's value.
// E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8 for hash value 1
// As the key is a hash value, we need to check possible hash collisions in the probe stage
-type JoinHashMap = HashMap<u64, SmallVec<[u64; 1]>, IdHashBuilder>;
+type JoinHashMap = HashMap<(), SmallVec<[u64; 1]>, IdHashBuilder>;
Review comment:
Added some extra docs now :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620649530
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
match (left_array.is_null($left), left_array.is_null($right)) {
- (true, true) => true,
Review comment:
Even if it is, I noticed the values method doesn't take offset into account right now, so this is maybe an optimization worth doing once hashing is included as arrow kernel.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620649530
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
match (left_array.is_null($left), left_array.is_null($right)) {
- (true, true) => true,
Review comment:
Even if it is, I noticed the values method doesn't take offset into account right now, so this is maybe an optimization worth doing once it is in the arrow kernel.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb merged pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
alamb merged pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892
# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2dacc01) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **decrease** coverage by `0.00%`.
> The diff coverage is `78.57%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #24 +/- ##
==========================================
- Coverage 76.24% 76.23% -0.01%
==========================================
Files 134 134
Lines 23051 23055 +4
==========================================
+ Hits 17576 17577 +1
- Misses 5475 5478 +3
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.94% <78.57%> (-0.49%)` | :arrow_down: |
| [benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL3RwY2gucnM=) | `35.15% <0.00%> (ø)` | |
| [benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL255Y3RheGkucnM=) | `0.00% <0.00%> (ø)` | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...2dacc01](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-827170266
Sorry for the delay @Dandandan -- I am about out of time today, but I will take a careful look tomorrow morning
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892
# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9d8985b) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **decrease** coverage by `0.01%`.
> The diff coverage is `78.46%`.
> :exclamation: Current head 9d8985b differs from pull request most recent head e1ac6a9. Consider uploading reports for the commit e1ac6a9 to get more accurate results
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #24 +/- ##
==========================================
- Coverage 76.24% 76.23% -0.02%
==========================================
Files 134 134
Lines 23051 23052 +1
==========================================
- Hits 17576 17574 -2
- Misses 5475 5478 +3
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.85% <78.46%> (-0.58%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...e1ac6a9](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620643551
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -781,58 +846,140 @@ pub fn create_hashes<'a>(
random_state: &RandomState,
hashes_buffer: &'a mut Vec<u64>,
) -> Result<&'a mut Vec<u64>> {
+ // combine hashes with `combine_hashes` if we have more than 1 column
+ let multi_col = arrays.len() > 1;
+
for col in arrays {
match col.data_type() {
DataType::UInt8 => {
- hash_array!(UInt8Array, col, u8, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt8Array,
+ col,
+ u8,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt16 => {
- hash_array!(UInt16Array, col, u16, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt16Array,
+ col,
+ u16,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt32 => {
- hash_array!(UInt32Array, col, u32, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt32Array,
+ col,
+ u32,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt64 => {
- hash_array!(UInt64Array, col, u64, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt64Array,
+ col,
+ u64,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int8 => {
- hash_array!(Int8Array, col, i8, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int8Array,
+ col,
+ i8,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int16 => {
- hash_array!(Int16Array, col, i16, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int16Array,
+ col,
+ i16,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int32 => {
- hash_array!(Int32Array, col, i32, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int32Array,
+ col,
+ i32,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int64 => {
- hash_array!(Int64Array, col, i64, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int64Array,
+ col,
+ i64,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Timestamp(TimeUnit::Microsecond, None) => {
- hash_array!(
+ hash_array_primitive!(
TimestampMicrosecondArray,
col,
i64,
hashes_buffer,
- random_state
+ random_state,
+ multi_col
);
}
DataType::Timestamp(TimeUnit::Nanosecond, None) => {
- hash_array!(
+ hash_array_primitive!(
TimestampNanosecondArray,
col,
i64,
hashes_buffer,
- random_state
+ random_state,
+ multi_col
);
}
DataType::Boolean => {
- hash_array!(BooleanArray, col, u8, hashes_buffer, random_state);
+ hash_array_primitive!(
Review comment:
I am wondering, are the bytee not guaranteed to be the same for the same bitmaps?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #24: [DataFusion] Optimize hash join inner workings
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892
# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7672164) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **decrease** coverage by `0.01%`.
> The diff coverage is `78.46%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #24 +/- ##
==========================================
- Coverage 76.24% 76.23% -0.02%
==========================================
Files 134 134
Lines 23051 23052 +1
==========================================
- Hits 17576 17574 -2
- Misses 5475 5478 +3
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.85% <78.46%> (-0.58%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...7672164](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #24: [DataFusion] Optimize hash join inner workings
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892
# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7672164) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **decrease** coverage by `0.01%`.
> The diff coverage is `78.46%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #24 +/- ##
==========================================
- Coverage 76.24% 76.23% -0.02%
==========================================
Files 134 134
Lines 23051 23052 +1
==========================================
- Hits 17576 17574 -2
- Misses 5475 5478 +3
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.85% <78.46%> (-0.58%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...7672164](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-827125937
@alamb @jorgecarleitao FYI if you have time, I am not planning any changes to this PR for now.
I think the more important part of the PR is the fix wrt null handling which turned out to be a one line change.
The other part is some small performance tweaks and prepares it a bit for further improvements down the road.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620634002
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -781,58 +846,140 @@ pub fn create_hashes<'a>(
random_state: &RandomState,
hashes_buffer: &'a mut Vec<u64>,
) -> Result<&'a mut Vec<u64>> {
+ // combine hashes with `combine_hashes` if we have more than 1 column
+ let multi_col = arrays.len() > 1;
+
for col in arrays {
match col.data_type() {
DataType::UInt8 => {
- hash_array!(UInt8Array, col, u8, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt8Array,
+ col,
+ u8,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt16 => {
- hash_array!(UInt16Array, col, u16, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt16Array,
+ col,
+ u16,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt32 => {
- hash_array!(UInt32Array, col, u32, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt32Array,
+ col,
+ u32,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt64 => {
- hash_array!(UInt64Array, col, u64, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt64Array,
+ col,
+ u64,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int8 => {
- hash_array!(Int8Array, col, i8, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int8Array,
+ col,
+ i8,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int16 => {
- hash_array!(Int16Array, col, i16, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int16Array,
+ col,
+ i16,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int32 => {
- hash_array!(Int32Array, col, i32, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int32Array,
+ col,
+ i32,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int64 => {
- hash_array!(Int64Array, col, i64, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int64Array,
+ col,
+ i64,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Timestamp(TimeUnit::Microsecond, None) => {
- hash_array!(
+ hash_array_primitive!(
TimestampMicrosecondArray,
col,
i64,
hashes_buffer,
- random_state
+ random_state,
+ multi_col
);
}
DataType::Timestamp(TimeUnit::Nanosecond, None) => {
- hash_array!(
+ hash_array_primitive!(
TimestampNanosecondArray,
col,
i64,
hashes_buffer,
- random_state
+ random_state,
+ multi_col
);
}
DataType::Boolean => {
- hash_array!(BooleanArray, col, u8, hashes_buffer, random_state);
+ hash_array_primitive!(
Review comment:
```suggestion
hash_array!(
```
otherwise this will be an iterator over the bytes of the bitmap, not over the individual bits.
In general, I recommend using generics, as they make it more difficult for these to happen, but I agree that the API is a bit funny on the arrow side. ^_^
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
match (left_array.is_null($left), left_array.is_null($right)) {
- (true, true) => true,
Review comment:
this was the line, right? Damm, so many wrong things for this one.
Really great finding, @Dandandan !
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621193159
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -399,15 +395,23 @@ fn update_hash(
.map(|name| Ok(col(name).evaluate(batch)?.into_array(batch.num_rows())))
.collect::<Result<Vec<_>>>()?;
- // update the hash map
+ // calculate the hash values
let hash_values = create_hashes(&keys_values, &random_state, hashes_buffer)?;
// insert hashes to key of the hashmap
for (row, hash_value) in hash_values.iter().enumerate() {
- hash.raw_entry_mut()
- .from_key_hashed_nocheck(*hash_value, hash_value)
- .and_modify(|_, v| v.push((row + offset) as u64))
- .or_insert_with(|| (*hash_value, smallvec![(row + offset) as u64]));
+ match hash.raw_entry_mut().from_hash(*hash_value, |_| true) {
+ hashbrown::hash_map::RawEntryMut::Occupied(mut entry) => {
+ entry.get_mut().push((row + offset) as u64);
+ }
+ hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
+ entry.insert_hashed_nocheck(
Review comment:
The overall algorithm does do the checking for collisions (values with the same hash) in the later stage. There is a unit checking this case too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621191677
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -65,7 +65,7 @@ use log::debug;
// Maps a `u64` hash value based on the left ["on" values] to a list of indices with this key's value.
// E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8 for hash value 1
// As the key is a hash value, we need to check possible hash collisions in the probe stage
-type JoinHashMap = HashMap<u64, SmallVec<[u64; 1]>, IdHashBuilder>;
+type JoinHashMap = HashMap<(), SmallVec<[u64; 1]>, IdHashBuilder>;
Review comment:
I will add some comments about it for sure!
Eventually we can get rid of the hashmap entirely, as we check for collisions in a later stage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621193159
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -399,15 +395,23 @@ fn update_hash(
.map(|name| Ok(col(name).evaluate(batch)?.into_array(batch.num_rows())))
.collect::<Result<Vec<_>>>()?;
- // update the hash map
+ // calculate the hash values
let hash_values = create_hashes(&keys_values, &random_state, hashes_buffer)?;
// insert hashes to key of the hashmap
for (row, hash_value) in hash_values.iter().enumerate() {
- hash.raw_entry_mut()
- .from_key_hashed_nocheck(*hash_value, hash_value)
- .and_modify(|_, v| v.push((row + offset) as u64))
- .or_insert_with(|| (*hash_value, smallvec![(row + offset) as u64]));
+ match hash.raw_entry_mut().from_hash(*hash_value, |_| true) {
+ hashbrown::hash_map::RawEntryMut::Occupied(mut entry) => {
+ entry.get_mut().push((row + offset) as u64);
+ }
+ hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
+ entry.insert_hashed_nocheck(
Review comment:
The overall algorithm does do the checking for collisions (values with the same hash) in the later stage. There is a unit test checking this case too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r621186963
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -399,15 +395,23 @@ fn update_hash(
.map(|name| Ok(col(name).evaluate(batch)?.into_array(batch.num_rows())))
.collect::<Result<Vec<_>>>()?;
- // update the hash map
+ // calculate the hash values
let hash_values = create_hashes(&keys_values, &random_state, hashes_buffer)?;
// insert hashes to key of the hashmap
for (row, hash_value) in hash_values.iter().enumerate() {
- hash.raw_entry_mut()
- .from_key_hashed_nocheck(*hash_value, hash_value)
- .and_modify(|_, v| v.push((row + offset) as u64))
- .or_insert_with(|| (*hash_value, smallvec![(row + offset) as u64]));
+ match hash.raw_entry_mut().from_hash(*hash_value, |_| true) {
+ hashbrown::hash_map::RawEntryMut::Occupied(mut entry) => {
+ entry.get_mut().push((row + offset) as u64);
+ }
+ hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
+ entry.insert_hashed_nocheck(
Review comment:
Does this code do the right thing if there are collisions in the hash table (aka the same key column values hash to the same hash value?) I am not enough of an expert on HashBrown to really know, but something feels off to me here.
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -65,7 +65,7 @@ use log::debug;
// Maps a `u64` hash value based on the left ["on" values] to a list of indices with this key's value.
// E.g. 1 -> [3, 6, 8] indicates that the column values map to rows 3, 6 and 8 for hash value 1
// As the key is a hash value, we need to check possible hash collisions in the probe stage
-type JoinHashMap = HashMap<u64, SmallVec<[u64; 1]>, IdHashBuilder>;
+type JoinHashMap = HashMap<(), SmallVec<[u64; 1]>, IdHashBuilder>;
Review comment:
I am a little confused about using `()` as the key type for the `HashMap` -- maybe updating the comments would help
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -755,21 +766,75 @@ fn equal_rows(
}
macro_rules! hash_array {
- ($array_type:ident, $column: ident, $ty: ident, $hashes: ident, $random_state: ident) => {
+ ($array_type:ident, $column: ident, $ty: ident, $hashes: ident, $random_state: ident, $multi_col: ident) => {
let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
if array.null_count() == 0 {
- for (i, hash) in $hashes.iter_mut().enumerate() {
- *hash =
- combine_hashes($ty::get_hash(&array.value(i), $random_state), *hash);
- }
- } else {
- for (i, hash) in $hashes.iter_mut().enumerate() {
- if !array.is_null(i) {
+ if $multi_col {
+ for (i, hash) in $hashes.iter_mut().enumerate() {
*hash = combine_hashes(
$ty::get_hash(&array.value(i), $random_state),
*hash,
);
}
+ } else {
+ for (i, hash) in $hashes.iter_mut().enumerate() {
+ *hash = $ty::get_hash(&array.value(i), $random_state);
+ }
+ }
+ } else {
+ if $multi_col {
+ for (i, hash) in $hashes.iter_mut().enumerate() {
+ if !array.is_null(i) {
+ *hash = combine_hashes(
+ $ty::get_hash(&array.value(i), $random_state),
+ *hash,
+ );
+ }
+ }
+ } else {
+ for (i, hash) in $hashes.iter_mut().enumerate() {
+ if !array.is_null(i) {
+ *hash = $ty::get_hash(&array.value(i), $random_state);
+ }
+ }
+ }
+ }
+ };
+}
+
+macro_rules! hash_array_primitive {
+ ($array_type:ident, $column: ident, $ty: ident, $hashes: ident, $random_state: ident, $multi_col: ident) => {
+ let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
+ let values = array.values();
+
+ if array.null_count() == 0 {
+ if $multi_col {
+ for (hash, value) in $hashes.iter_mut().zip(values.iter()) {
+ *hash = combine_hashes($ty::get_hash(value, $random_state), *hash);
+ }
+ } else {
+ for (hash, value) in $hashes.iter_mut().zip(values.iter()) {
+ *hash = $ty::get_hash(value, $random_state)
+ }
+ }
+ } else {
+ if $multi_col {
Review comment:
special casing multi-value join columns is a good one
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -708,7 +720,6 @@ macro_rules! equal_rows_elem {
let right_array = $r.as_any().downcast_ref::<$array_type>().unwrap();
match (left_array.is_null($left), left_array.is_null($right)) {
- (true, true) => true,
Review comment:
The fact we are now worrying about and fixing null handling is a sign, in my mind, of DataFusion's maturation 🤣
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings, null hanling fix
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892
# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (3ebd266) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **increase** coverage by `0.10%`.
> The diff coverage is `78.87%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #24 +/- ##
==========================================
+ Coverage 76.24% 76.35% +0.10%
==========================================
Files 134 134
Lines 23051 23062 +11
==========================================
+ Hits 17576 17609 +33
+ Misses 5475 5453 -22
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.94% <78.57%> (-0.49%)` | :arrow_down: |
| [datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi90ZXN0cy9zcWwucnM=) | `99.88% <100.00%> (+1.42%)` | :arrow_up: |
| [benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL3RwY2gucnM=) | `35.15% <0.00%> (ø)` | |
| [benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL255Y3RheGkucnM=) | `0.00% <0.00%> (ø)` | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...3ebd266](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#discussion_r620643551
##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -781,58 +846,140 @@ pub fn create_hashes<'a>(
random_state: &RandomState,
hashes_buffer: &'a mut Vec<u64>,
) -> Result<&'a mut Vec<u64>> {
+ // combine hashes with `combine_hashes` if we have more than 1 column
+ let multi_col = arrays.len() > 1;
+
for col in arrays {
match col.data_type() {
DataType::UInt8 => {
- hash_array!(UInt8Array, col, u8, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt8Array,
+ col,
+ u8,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt16 => {
- hash_array!(UInt16Array, col, u16, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt16Array,
+ col,
+ u16,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt32 => {
- hash_array!(UInt32Array, col, u32, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt32Array,
+ col,
+ u32,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::UInt64 => {
- hash_array!(UInt64Array, col, u64, hashes_buffer, random_state);
+ hash_array_primitive!(
+ UInt64Array,
+ col,
+ u64,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int8 => {
- hash_array!(Int8Array, col, i8, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int8Array,
+ col,
+ i8,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int16 => {
- hash_array!(Int16Array, col, i16, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int16Array,
+ col,
+ i16,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int32 => {
- hash_array!(Int32Array, col, i32, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int32Array,
+ col,
+ i32,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Int64 => {
- hash_array!(Int64Array, col, i64, hashes_buffer, random_state);
+ hash_array_primitive!(
+ Int64Array,
+ col,
+ i64,
+ hashes_buffer,
+ random_state,
+ multi_col
+ );
}
DataType::Timestamp(TimeUnit::Microsecond, None) => {
- hash_array!(
+ hash_array_primitive!(
TimestampMicrosecondArray,
col,
i64,
hashes_buffer,
- random_state
+ random_state,
+ multi_col
);
}
DataType::Timestamp(TimeUnit::Nanosecond, None) => {
- hash_array!(
+ hash_array_primitive!(
TimestampNanosecondArray,
col,
i64,
hashes_buffer,
- random_state
+ random_state,
+ multi_col
);
}
DataType::Boolean => {
- hash_array!(BooleanArray, col, u8, hashes_buffer, random_state);
+ hash_array_primitive!(
Review comment:
I am wondering, are the bytes not guaranteed to be the same for the same bitmaps?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892
# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (90e539b) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **increase** coverage by `0.21%`.
> The diff coverage is `48.73%`.
> :exclamation: Current head 90e539b differs from pull request most recent head 2f30453. Consider uploading reports for the commit 2f30453 to get more accurate results
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #24 +/- ##
==========================================
+ Coverage 76.24% 76.46% +0.21%
==========================================
Files 134 134
Lines 23051 23029 -22
==========================================
+ Hits 17576 17609 +33
+ Misses 5475 5420 -55
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [ballista/rust/client/src/context.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jbGllbnQvc3JjL2NvbnRleHQucnM=) | `0.00% <0.00%> (ø)` | |
| [benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL255Y3RheGkucnM=) | `0.00% <ø> (ø)` | |
| [benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL3RwY2gucnM=) | `35.07% <0.00%> (-0.08%)` | :arrow_down: |
| [datafusion-examples/examples/flight\_server.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1leGFtcGxlcy9leGFtcGxlcy9mbGlnaHRfc2VydmVyLnJz) | `0.00% <0.00%> (ø)` | |
| [datafusion-examples/examples/simple\_udaf.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1leGFtcGxlcy9leGFtcGxlcy9zaW1wbGVfdWRhZi5ycw==) | `0.00% <ø> (ø)` | |
| [datafusion/src/physical\_plan/expressions/case.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9jYXNlLnJz) | `72.91% <16.66%> (-0.39%)` | :arrow_down: |
| [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.94% <78.57%> (-0.49%)` | :arrow_down: |
| [datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi90ZXN0cy9zcWwucnM=) | `99.88% <100.00%> (+1.42%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...2f30453](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #24: [DataFusion] Optimize hash join inner workings, null handling fix
Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #24:
URL: https://github.com/apache/arrow-datafusion/pull/24#issuecomment-826102892
# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#24](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e1c0a4e) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/245f0b8a68c5763a236aef3e727f0502188d0bfa?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (245f0b8) will **increase** coverage by `0.16%`.
> The diff coverage is `63.00%`.
> :exclamation: Current head e1c0a4e differs from pull request most recent head 0e1bdb4. Consider uploading reports for the commit 0e1bdb4 to get more accurate results
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/24/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #24 +/- ##
==========================================
+ Coverage 76.24% 76.41% +0.16%
==========================================
Files 134 134
Lines 23051 23181 +130
==========================================
+ Hits 17576 17714 +138
+ Misses 5475 5467 -8
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [ballista/rust/client/src/context.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jbGllbnQvc3JjL2NvbnRleHQucnM=) | `0.00% <0.00%> (ø)` | |
| [ballista/rust/executor/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9leGVjdXRvci9zcmMvbWFpbi5ycw==) | `0.00% <0.00%> (ø)` | |
| [benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL255Y3RheGkucnM=) | `0.00% <ø> (ø)` | |
| [datafusion-examples/examples/flight\_server.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1leGFtcGxlcy9leGFtcGxlcy9mbGlnaHRfc2VydmVyLnJz) | `0.00% <0.00%> (ø)` | |
| [datafusion-examples/examples/simple\_udaf.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1leGFtcGxlcy9leGFtcGxlcy9zaW1wbGVfdWRhZi5ycw==) | `0.00% <ø> (ø)` | |
| [datafusion/src/physical\_plan/expressions/case.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9jYXNlLnJz) | `72.91% <16.66%> (-0.39%)` | :arrow_down: |
| [benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmVuY2htYXJrcy9zcmMvYmluL3RwY2gucnM=) | `35.35% <33.33%> (+0.20%)` | :arrow_up: |
| [datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvZGF0YXNvdXJjZS9jc3YucnM=) | `73.33% <69.86%> (-9.28%)` | :arrow_down: |
| [datafusion/src/physical\_plan/csv.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9jc3YucnM=) | `79.09% <75.72%> (-4.13%)` | :arrow_down: |
| [datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9oYXNoX2pvaW4ucnM=) | `85.94% <78.57%> (-0.49%)` | :arrow_down: |
| ... and [3 more](https://codecov.io/gh/apache/arrow-datafusion/pull/24/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [245f0b8...0e1bdb4](https://codecov.io/gh/apache/arrow-datafusion/pull/24?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org