You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/11 15:52:14 UTC

[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8630: ARROW-10540 [Rust] Improve filtering

jorgecarleitao edited a comment on pull request #8630:
URL: https://github.com/apache/arrow/pull/8630#issuecomment-725500367


   Thanks a lot, @alamb and @vertexclique . I agree with the naming issues here, and great insight into those crates. I do not have strong feelings about naming; I tried to be close to what I am familiar with from Rust (e.g. `Vec::extend`).
   
   A bit of background: @yordan-pavlov did an impressive job on #7798 to make filtering performant; It is _really_ hard to get that performance with a generalization. I had to re-write this code at least 3 times to get it to a stage with comparable performance and even then, you can see that it depends on the filter conditions. But there is a (IMO good) reason for generalizing the code in `filter`.
   
   Specifically, the use-case I am looking at is to remove most code from the `take` kernel, since it is doing what this struct is doing (and what `filter` was doing). There are implementation details that are different (`take`'s indices can be null and `filter` does not support `ArrayStruct` atm), but they are fundamentally doing the same thing: constructing an `ArrayData` by memcopying chunks of another `ArrayData`.
   
   Also, note that this goes beyond masking: it supports taking values repeatedly. I see it as a "builder" bound to a specific `ArrayData` (`ArrayDataBuilder` is already taken, though). It is not a builder like the builders in `builder.rs` because those memcopy data from rust native types, not Arrow types.
   
   The reason this performance remains comparable(-ish) to master is that when it binds to an `ArrayData`, it inspects its `DataType` to initialize functions (the `type Extend`) bound to that `ArrayData` that are performant.
   
   For example,
   
   ```rust
   let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
   Box::new(move |mutable: &mut _MutableArrayData, start: usize, len: usize| {
   ```
   
   instead of
   
   ```rust
   Box::new(move |mutable: &mut _MutableArrayData, array: &ArrayData, start: usize, len: usize| {
       let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
   ```
   
   has a meaningful performance impact because it avoids checks on both `buffers()[0]` and `[something..]` on every call of `extend`. This check may seem small, but it significantly impacts performance because `extend` can be as small as: `copy 2 bits` (in the case of a boolean array). When there are many calls of this operation (e.g. filter every other slot), these checks are as or more expensive than the `memcopy` themselves.
   
   There is a further boost possible that I explored but did not finish, but it requires a bit of `unsafe` code: we know that these functions never outlive `MutableDataArray`. Therefore, we could actually unsafely pass `&self.data` to their initialization and avoid all costs associated with accessing `mutable.buffers()` and the like inside `extend`. In this case, we would use
   
   ```rust
   type Extend<'a> = Box<Fn(usize, usize) -> () + 'a>;
   ```
   
   and bind the `_MutableArrayData` on `build_primitive_extend<'a, T>(array, data)`. This would be the closest to the current filter implementation that remains generic for other use-cases.
   
   My point is that there are really non-trivial optimizations done in this proposal that were borrowed from the excellent work from @yordan-pavlov and that are required to keep up with very high bar set by #7798  This draft is trying to leverage that work on two of our kernels, `take` and `filter`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org