Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/23 04:37:10 UTC

[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9271: ARROW-11300: [Rust][DataFusion] Further performance improvements on hash aggregation with small groups

jorgecarleitao edited a comment on pull request #9271:
URL: https://github.com/apache/arrow/pull/9271#issuecomment-765865323

Thanks a lot for your points. I am learning a lot! :)

Note that for small arrays we are basically facing the metadata problem, in which the "payload size" of transmitting 1 element is driven by its metadata, not by the data itself. This will always be a problem, as the Arrow format was designed to be performant for large arrays.

For example, all our buffers are shared via an `Arc`. There is a tradeoff between this indirection and mem-copying the memory region. The tradeoff works in `Arc`'s favor for large memory regions and against it for small ones.
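
To make the tradeoff concrete, here is a minimal sketch using only the standard library. `SharedBuffer` and `OwnedBuffer` are hypothetical stand-ins, not the actual arrow types: the first clones by bumping a reference count, the second by copying the bytes.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a buffer shared via `Arc`.
/// Cloning costs one atomic increment, regardless of size,
/// but every access goes through the `Arc` indirection.
#[derive(Clone)]
pub struct SharedBuffer {
    data: Arc<Vec<u8>>,
}

impl SharedBuffer {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { data: Arc::new(bytes) }
    }

    pub fn as_slice(&self) -> &[u8] {
        self.data.as_slice()
    }

    /// True when both handles point at the same allocation.
    pub fn shares_memory_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.data, &other.data)
    }
}

/// Hypothetical alternative that copies the memory region on clone.
/// Cloning costs an allocation plus a copy proportional to the size.
#[derive(Clone)]
pub struct OwnedBuffer {
    data: Vec<u8>,
}

impl OwnedBuffer {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { data: bytes }
    }

    pub fn as_slice(&self) -> &[u8] {
        self.data.as_slice()
    }
}
```

For a buffer holding a handful of bytes, the copy is likely comparable to or cheaper than the atomic increment plus indirection; for megabytes, the `Arc` wins clearly. Where the crossover sits would have to be measured.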

With that said, we could consider replacing `Arc<ArrayData>` with `ArrayData` on all our arrays, to avoid the extra `Arc`: cloning an `ArrayData` is actually cheap. I am not sure if that would work for FFI, but we could certainly try.
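
A rough sketch of the two options, with simplified hypothetical types (the real `ArrayData` has more fields, and `ArrayViaArc` / `ArrayInline` are names made up for illustration):

```rust
use std::sync::Arc;

/// Simplified stand-in for a buffer: a handle to shared bytes.
#[derive(Clone)]
pub struct Buffer(Arc<Vec<u8>>);

/// Simplified stand-in for `ArrayData`: a length plus buffer handles.
#[derive(Clone)]
pub struct ArrayData {
    pub len: usize,
    pub buffers: Vec<Buffer>,
}

impl ArrayData {
    pub fn new(values: Vec<u8>) -> Self {
        let len = values.len();
        Self { len, buffers: vec![Buffer(Arc::new(values))] }
    }

    /// Bytes of the first buffer.
    pub fn values(&self) -> &[u8] {
        self.buffers[0].0.as_slice()
    }
}

/// Array holding `Arc<ArrayData>`: reaching the bytes takes two `Arc` hops
/// (array -> ArrayData -> buffer).
#[derive(Clone)]
pub struct ArrayViaArc {
    pub data: Arc<ArrayData>,
}

/// Array holding `ArrayData` inline: one `Arc` hop fewer.
/// Cloning copies `len` and the `Vec` of handles, never the buffer bytes.
#[derive(Clone)]
pub struct ArrayInline {
    pub data: ArrayData,
}
```

In this sketch, cloning `ArrayInline` is cheap in the sense that no buffer bytes are copied, although the `Vec` of handles is still reallocated on each clone.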

Another idea is to use `buffer1: Buffer`, `buffer2: Buffer` instead of `buffers: Vec<Buffer>` to avoid the `Vec`. This is possible because arrow arrays support at most 2 buffers (3 with the null bitmap). For types with a single buffer, we are already incurring the cost of the `Vec`, so adding a second `Buffer` field instead should not be a big issue (memory-wise). The advantage is that we avoid cloning the `Vec` on every operation, as well as the extra bounds check. The disadvantage is that we have to be more verbose when we want to apply an operation to every buffer.
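
Something along these lines, again with hypothetical simplified types (`VecLayout`, `FixedLayout` and `Buffer::empty` are made up for the sketch, and the null bitmap is left out):

```rust
use std::sync::Arc;

/// Simplified stand-in for a buffer: a handle to shared bytes.
#[derive(Clone)]
pub struct Buffer(Arc<Vec<u8>>);

impl Buffer {
    pub fn from_bytes(bytes: Vec<u8>) -> Self {
        Buffer(Arc::new(bytes))
    }

    /// Placeholder for the unused slot of single-buffer types (an assumption).
    pub fn empty() -> Self {
        Buffer(Arc::new(Vec::new()))
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }
}

/// Current shape: cloning allocates a new `Vec`, indexing is bounds-checked.
#[derive(Clone)]
pub struct VecLayout {
    pub buffers: Vec<Buffer>,
}

impl VecLayout {
    pub fn total_bytes(&self) -> usize {
        self.buffers.iter().map(Buffer::len).sum()
    }
}

/// Proposed shape: two fixed fields, no `Vec` to clone or index into.
#[derive(Clone)]
pub struct FixedLayout {
    pub buffer1: Buffer,
    pub buffer2: Buffer,
}

impl FixedLayout {
    /// The verbosity cost: every "for each buffer" operation has to
    /// name the fields explicitly.
    pub fn total_bytes(&self) -> usize {
        [&self.buffer1, &self.buffer2].iter().map(|b| b.len()).sum()
    }
}
```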
