Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/19 08:30:27 UTC

[GitHub] [arrow] jorgecarleitao commented on pull request #8960: ARROW-10540: [Rust] Extended filter kernel to all types and improved performance

jorgecarleitao commented on pull request #8960:
URL: https://github.com/apache/arrow/pull/8960#issuecomment-748441499


   @yordan-pavlov 
   
   Thanks for the feedback. All great points.
   
   > The performance degradation in the filter u8 is interesting - do you have a hypothesis for what's causing this?
   
   I am sorry, I was not clear in the PR description:
   
   * `filter u8` is the case of "keep 50%" of the array
   * `filter u8 high selectivity` is the case "keep 90%" of the array.
   
`filter u8` shows a 30% degradation, while `filter u8 high selectivity` is 2x faster. Both are single-array filters.
   
   > I wonder if this could be explained again by this new implementation doing more work in advance, which works very well when filtering multiple columns but is a bit slower when filtering a single column.
   
The single-array filter in this PR performs a single pass via an iterator, while master built the filter context inside `filter` itself. So there is less up-front work and half the memory consumption for the operation, since no vector is allocated in `filter`; one is allocated only for the multi-filter case.
   
This PR's implementation does perform more work per slot, in order to minimize the number of copies. That is, while the total number of copied bytes is the same in both implementations, this one is guaranteed to call memcpy the minimum necessary number of times, by "packing" calls that are contiguous in memory into a single call. The implementation is thus optimized for filters that select contiguous regions, which tends to happen when there are a lot of 1s, or when the data is clustered in the array in that way. AFAIK, in real data this happens more often than by chance, so our benchmarks are being conservative here.
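To make the "packing" idea concrete, here is a minimal standalone sketch (not the PR's actual code, which operates on Arrow bitmaps) of coalescing selected slots into contiguous `(start, len)` runs, so each run needs only one copy call:

```rust
// Hypothetical sketch: turn a boolean filter into contiguous runs of
// selected indices. Each run can then be copied with a single
// memcpy-like call instead of one call per slot.
fn selected_runs(filter: &[bool]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start = None;
    for (i, &keep) in filter.iter().enumerate() {
        match (keep, start) {
            (true, None) => start = Some(i),       // a run begins
            (false, Some(s)) => {
                runs.push((s, i - s));             // a run ends
                start = None;
            }
            _ => {}                                // run continues, or still outside a run
        }
    }
    if let Some(s) = start {
        runs.push((s, filter.len() - s));          // close a run at the end
    }
    runs
}

fn main() {
    // Slots 2 and 3 are contiguous, so they form one run -> one copy call.
    let filter = [false, false, true, true, false, true];
    assert_eq!(selected_runs(&filter), vec![(2, 2), (5, 1)]);
}
```

With many 1s, runs are long and few, which is exactly the case this implementation favors.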
   
This behavior (of minimizing the number of calls) is crucial for non-primitive types because it minimizes the number of reallocations. The prime example here is `StringArray`. There, we can't tell how much to reserve for the "data" buffer from the number of 1s in the filter (it depends on the actual number of bytes in each slot). Therefore, when building the new array, issuing two calls ("copy slot [2]", "copy slot [3]") is slower than one call ("copy slots [2,3]"): the former can cause up to 2 reallocations, while the latter causes at most 1. The larger and more complex the data structure, the worse this gets.
   
By grouping these "extend" calls together, we reduce the number of reallocations when building the new array. This is less relevant for primitive types because there the buffer size is known from the number of 1s and the data type.
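For contrast, here is a minimal sketch (hypothetical code, not the Arrow kernel) of the primitive case, where the output size is known exactly from the count of 1s, so the buffer can be allocated once with no reallocations:

```rust
// Hedged sketch: for a fixed-width type like u8, the output buffer size
// is known up front from the filter's count of 1s, so one allocation
// suffices and no reallocation can occur during the copy.
fn filter_primitive(values: &[u8], filter: &[bool]) -> Vec<u8> {
    let selected = filter.iter().filter(|&&keep| keep).count();
    let mut out = Vec::with_capacity(selected); // exact size known up front
    for (&v, &keep) in values.iter().zip(filter) {
        if keep {
            out.push(v); // never exceeds the reserved capacity
        }
    }
    out
}

fn main() {
    let values = [10u8, 20, 30, 40];
    let filter = [true, false, true, true];
    assert_eq!(filter_primitive(&values, &filter), vec![10, 30, 40]);
}
```

A variable-width type like `StringArray` has no equivalent of the `with_capacity` line for its data buffer, which is why batching copies matters so much more there.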
   
The implementation just performs this computation on the fly (via an `Iterator`), so that in single-filter ops it happens while building the array (and is cached for multi-filter ops).
   
   > highly selective filters (mostly 0s in the filter array)
   
This PR uses "highly selective" for filters that are mostly `1s`, since they "select" many items, but maybe the naming is incorrect? If it is, could we come up with something that avoids `selective` altogether? IMO using "highly selective" for a filter that selects few items is confusing.
   
   > I also wonder how repeatable the benchmarks are now that they use randomly generated arrays. What are your observations; are the benchmarks results fairly stable across multiple runs?
   
Good point. The randomness is there to avoid assumptions about the distribution of the data (e.g. `i % 2 == 0` is highly predictable); we do this in most benches. The results are reproducible because the seed is set to a constant (independent of the thread, machine, or time).
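The seed-fixing idea can be illustrated with a self-contained toy generator (a tiny LCG here, purely for illustration; the actual benches use a proper RNG crate): a constant seed yields identical data on every run, so benchmark inputs are stable across runs and machines.

```rust
// Toy linear congruential generator, used only to illustrate
// seed-fixed, reproducible data generation (not the RNG the benches use).
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        // Constants from a common 64-bit LCG parameterization.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

// Generate a random-looking boolean filter from a fixed seed.
fn random_bools(seed: u64, len: usize) -> Vec<bool> {
    let mut rng = Lcg(seed);
    (0..len).map(|_| rng.next() & 1 == 1).collect()
}

fn main() {
    // Same constant seed -> byte-for-byte identical benchmark input.
    assert_eq!(random_bools(42, 16), random_bools(42, 16));
    // Different seeds -> (almost certainly) different data.
    assert_ne!(random_bools(42, 64), random_bools(43, 64));
}
```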


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org