You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/07 18:34:41 UTC

[GitHub] [arrow] Dandandan opened a new pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash join key values

Dandandan opened a new pull request #8863:
URL: https://github.com/apache/arrow/pull/8863


   This PR is a follow up of https://github.com/apache/arrow/pull/8765/ . Here, the hasmap keys are converted to `Vec<u8>` for hash joins instead.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash keys

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740808850


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=h1) Report
   > Merging [#8863](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=desc) (3559d97) into [master](https://codecov.io/gh/apache/arrow/commit/bae1ccae4395f2fd4e4accb91bff15857139f122?el=desc) (bae1cca) will **decrease** coverage by `7.27%`.
   > The diff coverage is `5.88%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8863/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8863      +/-   ##
   ==========================================
   - Coverage   84.32%   77.04%   -7.28%     
   ==========================================
     Files         192      172      -20     
     Lines       47319    40046    -7273     
   ==========================================
   - Hits        39900    30852    -9048     
   - Misses       7419     9194    +1775     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/dfschema.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZGZzY2hlbWEucnM=) | `0.00% <0.00%> (-75.94%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `0.00% <0.00%> (-69.24%)` | :arrow_down: |
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `0.00% <ø> (-87.33%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `0.00% <ø> (-96.12%)` | :arrow_down: |
   | [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.93% <50.00%> (-0.08%)` | :arrow_down: |
   | [rust/datafusion/src/test/variable.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy90ZXN0L3ZhcmlhYmxlLnJz) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [rust/datafusion/src/optimizer/filter\_push\_down.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvZmlsdGVyX3B1c2hfZG93bi5ycw==) | `0.00% <0.00%> (-99.33%)` | :arrow_down: |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `0.00% <0.00%> (-97.00%)` | :arrow_down: |
   | [...tafusion/src/physical\_plan/datetime\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2RhdGV0aW1lX2V4cHJlc3Npb25zLnJz) | `0.00% <0.00%> (-95.09%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/display.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZGlzcGxheS5ycw==) | `0.00% <0.00%> (-94.81%)` | :arrow_down: |
   | ... and [84 more](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=footer). Last update [b290e23...3559d97](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash keys

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740809969


   @jorgecarleitao 
   
   Just extended the change for hash aggregates as well.
   
   Turns out, a good speedup as well for hash aggregate queries!
   
   [This version]
   ```
   Query 1 iteration 0 took 871 ms
   Query 1 iteration 1 took 866 ms
   Query 1 iteration 2 took 869 ms
   Query 1 iteration 3 took 869 ms
   Query 1 iteration 4 took 867 ms
   Query 1 iteration 5 took 874 ms
   Query 1 iteration 6 took 870 ms
   Query 1 iteration 7 took 875 ms
   Query 1 iteration 8 took 871 ms
   Query 1 iteration 9 took 869 ms
   ```
   
   [Master]
   ```
   Query 1 iteration 0 took 1189 ms
   Query 1 iteration 1 took 1192 ms
   Query 1 iteration 2 took 1189 ms
   Query 1 iteration 3 took 1185 ms
   Query 1 iteration 4 took 1193 ms
   Query 1 iteration 5 took 1202 ms
   Query 1 iteration 6 took 1547 ms
   Query 1 iteration 7 took 1242 ms
   Query 1 iteration 8 took 1202 ms
   Query 1 iteration 9 took 1197 ms
   ```
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash keys

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8863:
URL: https://github.com/apache/arrow/pull/8863


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash keys

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740808850


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=h1) Report
   > Merging [#8863](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=desc) (008d3ea) into [master](https://codecov.io/gh/apache/arrow/commit/db94f24f15ad783efe676fef8883b7cab9a82f88?el=desc) (db94f24) will **decrease** coverage by `0.02%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8863/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8863      +/-   ##
   ==========================================
   - Coverage   77.03%   77.01%   -0.03%     
   ==========================================
     Files         173      173              
     Lines       40090    40101      +11     
   ==========================================
     Hits        30885    30885              
   - Misses       9205     9216      +11     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/tests/dataframe.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL2RhdGFmcmFtZS5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2Nzdi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `0.00% <0.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=footer). Last update [db94f24...008d3ea](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash join keys

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740183272


   As a next step after this, I think it would be interesting if we can have a look at calculating the hashes on the columns instead to benefit from the columnar data layout.
   
   Some material I found on this:
   
   https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ (simple explanation)
   https://pure.uva.nl/ws/files/4321270/68049_09.pdf 
   
   Please add if you know of more/newer material!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash join key values

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740116321


   https://issues.apache.org/jira/browse/ARROW-10837


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash join keys

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740183272


   As a next step after this, I think it would be interesting if we can have a look at calculating the hashes on the columns instead to benefit from the columnar data.
   
   Some material I found on this:
   
   https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ (simple explanation)
   https://pure.uva.nl/ws/files/4321270/68049_09.pdf 
   
   Please add if you know of more/newer material!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org