You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/07 18:34:41 UTC
[GitHub] [arrow] Dandandan opened a new pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash join key values
Dandandan opened a new pull request #8863:
URL: https://github.com/apache/arrow/pull/8863
This PR is a follow up of https://github.com/apache/arrow/pull/8765/ . Here, the hasmap keys are converted to `Vec<u8>` for hash joins instead.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io commented on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash keys
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740808850
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=h1) Report
> Merging [#8863](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=desc) (3559d97) into [master](https://codecov.io/gh/apache/arrow/commit/bae1ccae4395f2fd4e4accb91bff15857139f122?el=desc) (bae1cca) will **decrease** coverage by `7.27%`.
> The diff coverage is `5.88%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8863/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8863 +/- ##
==========================================
- Coverage 84.32% 77.04% -7.28%
==========================================
Files 192 172 -20
Lines 47319 40046 -7273
==========================================
- Hits 39900 30852 -9048
- Misses 7419 9194 +1775
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/datafusion/src/logical\_plan/dfschema.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZGZzY2hlbWEucnM=) | `0.00% <0.00%> (-75.94%)` | :arrow_down: |
| [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `0.00% <0.00%> (-69.24%)` | :arrow_down: |
| [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `0.00% <ø> (-87.33%)` | :arrow_down: |
| [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `0.00% <ø> (-96.12%)` | :arrow_down: |
| [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.93% <50.00%> (-0.08%)` | :arrow_down: |
| [rust/datafusion/src/test/variable.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy90ZXN0L3ZhcmlhYmxlLnJz) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
| [rust/datafusion/src/optimizer/filter\_push\_down.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvZmlsdGVyX3B1c2hfZG93bi5ycw==) | `0.00% <0.00%> (-99.33%)` | :arrow_down: |
| [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `0.00% <0.00%> (-97.00%)` | :arrow_down: |
| [...tafusion/src/physical\_plan/datetime\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2RhdGV0aW1lX2V4cHJlc3Npb25zLnJz) | `0.00% <0.00%> (-95.09%)` | :arrow_down: |
| [rust/datafusion/src/logical\_plan/display.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZGlzcGxheS5ycw==) | `0.00% <0.00%> (-94.81%)` | :arrow_down: |
| ... and [84 more](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=footer). Last update [b290e23...3559d97](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash keys
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740809969
@jorgecarleitao
Just extended the change for hash aggregates as well.
Turns out, a good speedup as well for hash aggregate queries!
[This version]
```
Query 1 iteration 0 took 871 ms
Query 1 iteration 1 took 866 ms
Query 1 iteration 2 took 869 ms
Query 1 iteration 3 took 869 ms
Query 1 iteration 4 took 867 ms
Query 1 iteration 5 took 874 ms
Query 1 iteration 6 took 870 ms
Query 1 iteration 7 took 875 ms
Query 1 iteration 8 took 871 ms
Query 1 iteration 9 took 869 ms
```
[Master]
```
Query 1 iteration 0 took 1189 ms
Query 1 iteration 1 took 1192 ms
Query 1 iteration 2 took 1189 ms
Query 1 iteration 3 took 1185 ms
Query 1 iteration 4 took 1193 ms
Query 1 iteration 5 took 1202 ms
Query 1 iteration 6 took 1547 ms
Query 1 iteration 7 took 1242 ms
Query 1 iteration 8 took 1202 ms
Query 1 iteration 9 took 1197 ms
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao closed pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash keys
Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8863:
URL: https://github.com/apache/arrow/pull/8863
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash keys
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740808850
# [Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=h1) Report
> Merging [#8863](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=desc) (008d3ea) into [master](https://codecov.io/gh/apache/arrow/commit/db94f24f15ad783efe676fef8883b7cab9a82f88?el=desc) (db94f24) will **decrease** coverage by `0.02%`.
> The diff coverage is `n/a`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8863/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #8863 +/- ##
==========================================
- Coverage 77.03% 77.01% -0.03%
==========================================
Files 173 173
Lines 40090 40101 +11
==========================================
Hits 30885 30885
- Misses 9205 9216 +11
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `0.00% <ø> (ø)` | |
| [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `0.00% <ø> (ø)` | |
| [rust/datafusion/tests/dataframe.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL2RhdGFmcmFtZS5ycw==) | `0.00% <0.00%> (ø)` | |
| [rust/datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2Nzdi5ycw==) | `0.00% <0.00%> (ø)` | |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `0.00% <0.00%> (ø)` | |
| [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8863/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `0.00% <0.00%> (ø)` | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=footer). Last update [db94f24...008d3ea](https://codecov.io/gh/apache/arrow/pull/8863?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan edited a comment on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash join keys
Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740183272
As a next step after this, I think it would be interesting if we can have a look at calculating the hashes on the columns instead to benefit from the columnar data layout.
Some material I found on this:
https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ (simple explanation)
https://pure.uva.nl/ws/files/4321270/68049_09.pdf
Please add if you know of more/newer material!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash join key values
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740116321
https://issues.apache.org/jira/browse/ARROW-10837
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #8863: ARROW-10837: [Rust][DataFusion] Use `Vec` for hash join keys
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8863:
URL: https://github.com/apache/arrow/pull/8863#issuecomment-740183272
As a next step after this, I think it would be interesting if we can have a look at calculating the hashes on the columns instead to benefit from the columnar data.
Some material I found on this:
https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ (simple explanation)
https://pure.uva.nl/ws/files/4321270/68049_09.pdf
Please add if you know of more/newer material!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org