You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/19 22:24:48 UTC

[GitHub] [arrow] Dandandan opened a new pull request #9271: ARROW-11300: [Rust][DataFusion] Hash agg few rows

Dandandan opened a new pull request #9271:
URL: https://github.com/apache/arrow/pull/9271


   Still WIP, but this shows the concept for improving the situation described in https://issues.apache.org/jira/browse/ARROW-11300.
   
   The current situation is that we 
   
   The PR changes the algorithm to:
   
   * Create indices offsets of all keys
   * `take` the arrays based on indices in one go (so it only requires one big allocation
   * Use `slice` to look into the arrays
   
   There still 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-768014037


   Thanks @Dandandan . fmt missing :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765865323

Thanks a lot for your points. I am learning a lot! :)

Note that for small arrays, we are basically in the metadata problem on which the "payload size" of transmitting 1 element is driven by its metadata, not the data itself. This will always be a problem, as the arrow format was designed to be performant for large arrays.

For example, all our buffers are shared via an `Arc`. There is a tradeoff between this indirection and mem-copying the memory region. The tradeoff works in `Arc`'s favor for large memory regions and vice-versa.

With that said, we could consider replacing `Arc<ArrayData>` by `ArrayData` on all our arrays, to avoid the extra `Arc`: cloning an `ArrayData` is actually cheap. I am not sure if that would work for FFI, but we could certainly try.

Another idea is to use `buffer1: Buffer`, `buffer2: Buffer` instead of `buffers: Vec<Buffer>` to avoid the `Vec`. This is possible arrow arrays support at most 2 buffers. For types of a single buffer, we are already incurring the cost of the `Vec` and thus adding a `Buffer` instead should not be a big issue (memory-wise). The advantage of this is that we avoid cloning the `Vec` on every operation as well as the extra bound check. The disadvantage is that we have to be more verbose when we want to apply an operation to every buffer.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups [WIP]

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763529613


   Yes, slicing is suboptimal atm. Also, IMO it should not be the `Array` to implement that method, but each implementation individually. I haven't touch that part yet, though.
   
   That analysis is really cool, though. How did you came up with that????? 😮 Any blog post or readme??? :P


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups [WIP]

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763553182


   Thanks @jorgecarleitao makes sense.
   I'm planning a blog post soon again, might put some details there.
   In short its `cargo profiler` installed from github https://github.com/svenstaro/cargo-profiler with `callgrind` and `kcachegrind`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765564334


   This PR itself is ready for review.
   
   I think the performance for slicing for small slices would be something to look at later.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] codecov-io commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

codecov-io commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763660623


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=h1) Report
   > Merging [#9271](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=desc) (9933c7b) into [master](https://codecov.io/gh/apache/arrow/commit/691286975f277f00586cabc6d834ff1efd8caf8c?el=desc) (6912869) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9271/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9271   +/-   ##
   =======================================
     Coverage   81.68%   81.68%           
   =======================================
     Files         215      215           
     Lines       52561    52576   +15     
   =======================================
   + Hits        42935    42949   +14     
   - Misses       9626     9627    +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `82.77% <100.00%> (+0.59%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=footer). Last update [6912869...9933c7b](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765211936


   Isn't the data contained on a buffer `Arc`ed? I.e. `Vec<Buffer>::clone()` should be cheap, no?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

nevi-me commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765375980


   > I found the "offending" code is this function in `array/data.rs` which does a `self.clone()`.
   
   This relates to the other discussion that we had on how slicing an array does a clone without propagating offset information to child_data and buffers.
   
   I find it interesting though that `self.clone()` would cost us so much. If you were to clone an array, and inspect the original and the cloned, the buffers point to the same location in memory. I thought that's what zero-copying would give us.
   
   Is that not the case?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups [WIP]

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763553182


   Thanks @jorgecarleitao makes sense.
   I'm planning a blog post soon again, might put some details there.
   In short its `cargo profiler` installed from github https://github.com/svenstaro/cargo-profiler with callgrind and `kcachegrind`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao closed pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

jorgecarleitao closed pull request #9271:
URL: https://github.com/apache/arrow/pull/9271


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

jorgecarleitao edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765865323

Thanks a lot for your points. I am learning a lot! :)

Another idea is to use `buffer1: Buffer`, `buffer2: Buffer` instead of `buffers: Vec<Buffer>` to avoid the `Vec`. This is possible because arrow arrays support at most 2 buffers (3 with the null). For types of a single buffer, we are already incurring the cost of the `Vec` and thus adding a `Buffer` instead should not be a big issue (memory-wise). The advantage of this is that we avoid cloning the `Vec` on every operation as well as the extra bound check. The disadvantage is that we have to be more verbose when we want to apply an operation to every buffer.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765564334


   This PR itself is ready for review.
   
   I think the performance for slicing small slices would be something to look at later.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups [WIP]

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763517408


   ![image](https://user-images.githubusercontent.com/163737/105164296-42515780-5b15-11eb-87f0-a042c4287514.png)
   
   @jorgecarleitao while the code is quite a bit faster already, it seems `Array.slice` doesn't really do a zero-copy as promised in the doc (and accounts for about 10% of the total instructions of the program). Maybe you have an idea what could be going on here?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765223781


   @jorgecarleitao 
   Maybe, reasonably, yes (as long the underlying vecs have a few items).
   
   I think the clone on the  `ArrayData` structure itself is expensive when slicing for a few elements as it does some allocations for the struct, the vecs, execute some code, etc.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765392170


   @nevi-me 
   
   I don't think indeed it is very expensive on large Arrays compared to the size / operations on the array, but it turns out to be expensive on very small arrays. For this PR I am using `slice` to make the hash aggregate code in DataFusion more efficient for small output groups with a small amount of rows (only 1 row / `Array.slice(i, 1)`) in extreme cases), in which case the slicing function becomes a bottleneck, because of the cloning here + `make_array` function and because it will be called many times, for example (I believe) in total 20M times for a table of 10M rows (it is one example of the db-benchmark benchmark). 
   It still is faster than `taking` for each group individually though as the benchmark results show.
   
   I am wondering if instead of trying to make a new array when doing `.slice()`, we could create a data-structure for slicing instead that implements the Array interface and is supported in kernels, so creating the slice would be cheap?
   
   Something like this :
   ```
   struct ArraySlice {
       offset: usize,
       length: usize,
       array: ArrayRef
   }
   ```
    
   
   
   
   
   
    
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jhorstmann commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

jhorstmann commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765371442


   Cloning the vector of Buffers and child ArrayData has some overhead. Incrementing the reference counts should be relatively cheap unless there are concurrent threads accessing the same Arc.
   
   I tried replacing the vectors inside ArrayData with SmallVec some time ago. That made the slice benchmarks faster, but several other benchmarks slowed down because of it. Might be worth to revisit that.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765952965


   @jorgecarleitao 
   
   Thanks, that it a great summary of the situation! I think removing the `Vec` sounds like it might improve the "metadata overhead", maybe there are some more opportunities like that as well.
   Looking at Arraydata, that might be reasonable cheap as well to clone, not sure if it would improve in all situations as it also involves some extra copying.
   
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups [WIP]

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763553182


   Thanks @jorgecarleitao makes sense.
   I'm planning a blog post soon again, might put some details there.
   In short it is `cargo profiler` installed from github https://github.com/svenstaro/cargo-profiler with `callgrind` and `kcachegrind`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763646185


   This PR in itself is ready for review now. The change in approach will help us in the future, even more with `Array.slice` being improved (will create a ticket later), execution times should be further reduced by something like 15-20%.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763553182


   Thanks @jorgecarleitao makes sense.
   I'm planning a blog post soon again, might put some details there.
   In short it is `cargo profiler` installed from the repo https://github.com/svenstaro/cargo-profiler with `callgrind` and `kcachegrind`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] codecov-io edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

codecov-io edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763660623


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=h1) Report
   > Merging [#9271](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=desc) (f578e4c) into [master](https://codecov.io/gh/apache/arrow/commit/3e47777441795c1970a22f6bad103da3e867dc98?el=desc) (3e47777) will **increase** coverage by `0.00%`.
   > The diff coverage is `85.03%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9271/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9271   +/-   ##
   =======================================
     Coverage   81.89%   81.89%           
   =======================================
     Files         215      215           
     Lines       52988    53003   +15     
   =======================================
   + Hits        43392    43407   +15     
     Misses       9596     9596           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/array\_struct.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RydWN0LnJz) | `88.43% <ø> (ø)` | |
   | [rust/arrow/src/array/equal/utils.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdXRpbHMucnM=) | `75.49% <0.00%> (ø)` | |
   | [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `53.12% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2FnZ3JlZ2F0ZS5ycw==) | `74.93% <ø> (ø)` | |
   | [rust/arrow/src/ipc/gen/SparseTensor.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL2dlbi9TcGFyc2VUZW5zb3IucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/ipc/gen/Tensor.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL2dlbi9UZW5zb3IucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/benchmarks/src/bin/nyctaxi.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vbnljdGF4aS5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `6.97% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `79.75% <0.00%> (ø)` | |
   | [rust/datafusion/src/logical\_plan/operators.rs](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vb3BlcmF0b3JzLnJz) | `75.00% <ø> (ø)` | |
   | ... and [134 more](https://codecov.io/gh/apache/arrow/pull/9271/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=footer). Last update [9199801...f578e4c](https://codecov.io/gh/apache/arrow/pull/9271?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765952965


   @jorgecarleitao 
   
   Thanks, that it a great summary of the situation! I think removing the `Vec` sounds like it might improve the "metadata overhead", maybe there are some more opportunities like that as well. Maybe the second buffer could be `Option<Buffer>` as well to avoid creating a buffer (if that would add extra overhead)?
   Looking at Arraydata, that might be reasonable cheap as well to clone, not sure if it would improve in all situations as it also involves some extra copying.
   
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups [WIP]

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763517408


   ![image](https://user-images.githubusercontent.com/163737/105164296-42515780-5b15-11eb-87f0-a042c4287514.png)
   
   @jorgecarleitao while the change makes hash aggregates a bit faster already, it seems `Array.slice` doesn't really do a zero-copy as promised in the doc (and accounts for about 11% of the total instructions of the program). Maybe you have an idea what could be going on here?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups [WIP]

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763517408


   ![image](https://user-images.githubusercontent.com/163737/105164296-42515780-5b15-11eb-87f0-a042c4287514.png)
   
   @jorgecarleitao while the code is quite a bit faster already, it seems `Array.slice` doesn't really do a zero-copy as promised in the doc (and accounts for about 11% of the total instructions of the program). Maybe you have an idea what could be going on here?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#discussion_r565673085



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -322,48 +325,76 @@ fn group_aggregate_batch(
             });
     }
 
+    // Collect all indices + offsets based on keys in this vec
+    let mut batch_indices: UInt32Builder = UInt32Builder::new(0);

Review comment:
       I wonder if you could get any additional performance by using the knowledge of the size of `batch_keys`
   
   ```suggestion
       let mut batch_indices: UInt32Builder = UInt32Builder::new(batch_keys.len());
   ```

##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -322,48 +325,76 @@ fn group_aggregate_batch(
             });
     }
 
+    // Collect all indices + offsets based on keys in this vec
+    let mut batch_indices: UInt32Builder = UInt32Builder::new(0);

Review comment:
       Or maybe you have to scale it by the number of accumulators too

##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -322,48 +325,76 @@ fn group_aggregate_batch(
             });
     }
 
+    // Collect all indices + offsets based on keys in this vec
+    let mut batch_indices: UInt32Builder = UInt32Builder::new(0);
+    let mut offsets = vec![0];

Review comment:
       ```suggestion
       let mut offsets = Vec::with_capacity(batch_keys.len());
       offsets.push(0);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765208218


   @jorgecarleitao  I found the "offending" code is this function in `array/data.rs` which does a `self.clone()`. Any idea how we could write a non copying version?
   I think it was copied/adapted based on a function in the array module, but I think it was also cloning in previous versions.
   
   ```rust
       /// Creates a zero-copy slice of itself. This creates a new [ArrayData]
       /// with a different offset, len and a shifted null bitmap.
       ///
       /// # Panics
       ///
       /// Panics if `offset + length > self.len()`.
       pub fn slice(&self, offset: usize, length: usize) -> ArrayData {
           assert!((offset + length) <= self.len());
   
           let mut new_data = self.clone();
   
           new_data.len = length;
           new_data.offset = offset + self.offset;
   
           new_data.null_count =
               count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);
   
           new_data
       }
       
    ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-763180781


   https://issues.apache.org/jira/browse/ARROW-11300


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-767771210


   I rebased this PR @jorgecarleitao @alamb 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org