Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/21 21:05:09 UTC

[GitHub] [arrow-rs] tustvold edited a comment on pull request #1180: Preserve dictionary encoding when decoding parquet into Arrow arrays, 60x perf improvement (#171)

tustvold edited a comment on pull request #1180:
URL: https://github.com/apache/arrow-rs/pull/1180#issuecomment-1018838480


   > next logical question is what's the impact to end-to-end performance of queries in Data Fusion
   
   This is highly context dependent. Queries that were previously bottlenecked by parquet decoding will of course see improvements. A simple table scan with no predicates, for example, should see most of the raw 60x performance uplift. This sort of "query" shows up in IOx when compacting or loading data from object storage into an in-memory cache.
   
   The story for more complex queries is more of a work in progress. Currently DataFusion's handling of dictionary-encoded arrays isn't brilliant: it often fully materializes dictionaries when it shouldn't need to. https://github.com/apache/arrow-datafusion/pull/1475 and https://github.com/apache/arrow-datafusion/issues/1610 track the process of switching DataFusion to delegate comparator selection to arrow-rs, which should help to alleviate this.
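
To make "fully materializing dictionaries" concrete, here is a rough, stdlib-only sketch of the difference. It is not the arrow-rs `DictionaryArray` API and not how DataFusion actually evaluates predicates; `DictColumn`, its methods and the sample data are all hypothetical names made up for illustration.

```rust
/// Hypothetical stand-in for a dictionary-encoded string column:
/// each row stores a small integer key into a table of distinct values.
/// (Illustrative only; this is not an arrow-rs type.)
struct DictColumn {
    keys: Vec<u32>,
    values: Vec<String>,
}

impl DictColumn {
    /// Expand to one owned String per row, i.e. "materialize" the dictionary.
    fn materialize(&self) -> Vec<String> {
        self.keys
            .iter()
            .map(|&k| self.values[k as usize].clone())
            .collect()
    }

    /// Equality filter that materializes first:
    /// one string clone and one string comparison per row.
    fn eq_materialized(&self, needle: &str) -> Vec<bool> {
        self.materialize()
            .iter()
            .map(|v| v.as_str() == needle)
            .collect()
    }

    /// Equality filter that stays on the encoded form:
    /// one string comparison per distinct value, then a cheap lookup per row.
    fn eq_encoded(&self, needle: &str) -> Vec<bool> {
        let matches: Vec<bool> = self
            .values
            .iter()
            .map(|v| v.as_str() == needle)
            .collect();
        self.keys.iter().map(|&k| matches[k as usize]).collect()
    }
}

/// Build a small example column (made-up data).
fn example_column() -> DictColumn {
    DictColumn {
        keys: vec![0, 1, 0, 2, 1, 0],
        values: vec!["eu".to_string(), "us".to_string(), "ap".to_string()],
    }
}
```

Both `eq_materialized` and `eq_encoded` return the same mask for `example_column()`, but the encoded path does string work proportional to the number of distinct values rather than the number of rows. That is the kind of saving that is lost whenever a dictionary gets expanded unnecessarily.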
   
   TL;DR: at this point in time I'm focused on getting arrow-rs to a good place, with the necessary base primitives, and then I'll turn my attention to what DataFusion is doing with them. Who knows, someone else may even get there first :grin:
   

