Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/18 21:23:18 UTC

[GitHub] [arrow] yordan-pavlov commented on pull request #8960: ARROW-10540: [Rust] Extended filter kernel to all types and improved performance

yordan-pavlov commented on pull request #8960:
URL: https://github.com/apache/arrow/pull/8960#issuecomment-748325991


   @jorgecarleitao these are great performance improvements when multiple arrays are filtered, so this should perform very well when filtering a record batch containing many columns. I imagine this is explained by doing more work in advance, when building the filter, and less work when applying the filter to each array (compared to the previous implementation with the filter context).
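   
   To illustrate what I mean by doing the work in advance, here is a minimal sketch. It assumes the mask is a plain `&[bool]` and the columns are plain slices; `PreparedFilter`, `new` and `apply` are hypothetical names for illustration, not the types or functions in this PR.

```rust
/// Hypothetical pre-computed filter (not the type used in the PR):
/// the half-open (start, end) ranges of consecutive `true` values in the mask.
struct PreparedFilter {
    ranges: Vec<(usize, usize)>,
    selected: usize, // total number of selected rows
    mask_len: usize,
}

impl PreparedFilter {
    /// Scan the boolean mask once, up front.
    fn new(mask: &[bool]) -> Self {
        let mut ranges = Vec::new();
        let mut selected = 0;
        let mut start: Option<usize> = None;
        for (i, &keep) in mask.iter().enumerate() {
            match (keep, start) {
                // A run of selected rows begins here.
                (true, None) => start = Some(i),
                // The current run ends just before row `i`.
                (false, Some(s)) => {
                    ranges.push((s, i));
                    selected += i - s;
                    start = None;
                }
                _ => {}
            }
        }
        // Close a run that extends to the end of the mask.
        if let Some(s) = start {
            ranges.push((s, mask.len()));
            selected += mask.len() - s;
        }
        PreparedFilter { ranges, selected, mask_len: mask.len() }
    }

    /// Apply the filter to one column: only slice copies, the mask is not re-read.
    fn apply<T: Copy>(&self, column: &[T]) -> Vec<T> {
        assert!(column.len() >= self.mask_len, "column shorter than mask");
        let mut out = Vec::with_capacity(self.selected);
        for &(s, e) in &self.ranges {
            out.extend_from_slice(&column[s..e]);
        }
        out
    }
}
```

   With this shape, the cost of `new` is paid once per mask and is amortized across every column that `apply` is called on, which would match the multi-column results; for a single column the up-front scan is pure overhead.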
   
   The performance degradation in the `filter u8` benchmark is interesting. Do you have a hypothesis for what is causing it? I wonder whether it could also be explained by the new implementation doing more work in advance, which works very well when filtering multiple columns but is a bit slower when filtering a single column.
   
   Also, I would expect the benchmarks with highly selective filters (mostly 0s in the filter array) to be faster than the benchmarks with low selectivity filters (mostly 1s in the filter array), because the former involve more skipping and less copying while the latter involve more copying and less skipping. However, this relationship appears to be reversed in the results above.
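   
   The intuition behind that expectation can be sketched as follows. This assumes the mask is bit-packed into plain `u64` words (least significant bit first); `filter_by_chunks` is a hypothetical helper for illustration and does not use the actual `BitChunkIterator` API.

```rust
/// Filter `values` with a bit-packed mask: bit `i % 64` of word `i / 64`
/// (least significant bit first) selects `values[i]`.
fn filter_by_chunks<T: Copy>(values: &[T], mask: &[u64]) -> Vec<T> {
    let mut out = Vec::new();
    for (chunk_idx, &word) in mask.iter().enumerate() {
        let base = chunk_idx * 64;
        if word == 0 {
            // Highly selective case: 64 rows are skipped with one comparison.
            continue;
        }
        if word == u64::MAX && base + 64 <= values.len() {
            // Low selectivity case: 64 rows are copied in one bulk copy.
            out.extend_from_slice(&values[base..base + 64]);
            continue;
        }
        // Mixed chunk: visit only the set bits.
        let mut bits = word;
        while bits != 0 {
            let offset = bits.trailing_zeros() as usize;
            if let Some(&v) = values.get(base + offset) {
                out.push(v);
            }
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    out
}
```

   In this sketch an all-zero chunk costs a single comparison, while an all-one chunk costs a comparison plus a 64-element copy, so a mostly-0s mask should do strictly less work than a mostly-1s mask.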
   
   I also wonder how repeatable the benchmarks are now that they use randomly generated arrays. What are your observations: are the benchmark results fairly stable across multiple runs?
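   
   One way to keep randomly generated inputs repeatable is to derive them from a fixed seed. Below is a minimal, dependency-free sketch; `XorShift64` and `make_mask` are hypothetical names, the hand-rolled generator is only a stand-in for a seeded RNG from a proper crate, and I am not claiming this is how the benchmarks in this PR generate their data.

```rust
/// Minimal xorshift64 generator, used here only so the sketch has no dependencies.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Build a boolean mask in which roughly `percent_true` percent of entries are true.
/// The same `seed` always yields the same mask, so every run filters identical data.
fn make_mask(len: usize, percent_true: u64, seed: u64) -> Vec<bool> {
    let mut rng = XorShift64::new(seed);
    (0..len).map(|_| rng.next_u64() % 100 < percent_true).collect()
}
```

   With a fixed seed, any remaining run-to-run variation would come from the machine rather than from differences in the generated arrays.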
   
   I also like how the filter kernel is now implemented using the `BitChunkIterator`; overall great work!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org