You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/07 04:56:30 UTC

[GitHub] [arrow] jorgecarleitao commented on pull request #8116: ARROW-9919: [Rust][DataFusion] Speedup math operations by 15%+

jorgecarleitao commented on pull request #8116:
URL: https://github.com/apache/arrow/pull/8116#issuecomment-688034629


   @paddyhoran , thanks a lot for the feedback, it really helps to understand a bit better the context, as I am not aware of those practices.
   
   I do think that both builders and `from` use `memory.rs`. For primitive types, 
   
   * the builder uses a `BufferBuilder` and `BooleanBufferBuilder`, which in turn uses `memory` [here](https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L463)
   * the `from` uses `Buffer`, which uses `memory` [here](https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L737)
   
   There are two main differences between the two:
   
   * A) `from` requires an intermediary allocation (to vec), while the builder does not (uses a MutableBuffer).
   * B) `builder` assess whether it needs to change capacity on every item, while `from` does not.
   
   Specifically, `from` uses `Buffer::from(data.to_byte_slice())`, while the `builder` uses `for datum in data builder.append(datum); builder.finish()`. Therefore, `builders` perform checks on its capacity to ensure that each `append` does not require a `resize` (see impl of `append`, `BufferBuilderTrait<T> for BufferBuilder`, that has a `reserve(1)?` on it, and `advance`, used in nulls, that has a `resize` on it).
   
   My hypothesis for this performance change is that B) is more CPU intensive than A).
   
   One the solution is to offer an API to the `Builder`s to not perform those checks when the user already set the capacity. This would address `B)` and would avoid an intermediary allocation.
   
   In this direction, one idea is to support appending from an `ExactSizeIterator` (e.g. `append_iter`) and use `size_hint` to not perform bound checks on every item.
   
   -----------------------
   
   A common operation we have in the readers and compute is
   
   ```
   rows: &[Value];
   
   for row in rows {
       match some_operation(row) {
            Some(item) => builder.append(item)
            None => builder.append_null()
       }
   }
   ```
   
   which is a basically an `ExactSizeIterator`. However, in the example above, the `builder` has no way of knowing that the new append won't go beyond the reserved size (and thus it checks).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org