Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/18 12:10:06 UTC

[GitHub] [arrow-rs] tustvold commented on issue #1191: Parquet Scan Filter

tustvold commented on issue #1191:
URL: https://github.com/apache/arrow-rs/issues/1191#issuecomment-1015350353


   > I think it would be best to implement in DataFusion if at all possible 
   
   Agreed, I was somewhat hedging here :laughing: 
   
   > In the Impala implementation, there was negligible impact on unsorted/random data https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/.
   
   That's a great link :+1:, and yeah the ability to prune on aggregate statistics is very dependent on the sort order.
   
   > that would help with the evaluation decision of whether to use a page index or not.
   > I think the predicate evaluation would best live in parquet as it can get complex for some pages
   
   I was somewhat hoping to avoid using the page index, partly for this reason (it would require pushing predicate evaluation into the parquet crate), and partly because of the small matter that we don't currently read or write it :sweat_smile:
   
   I believe that in most cases a selection mask will perform similarly or significantly better: it allows skipping whole pages and even runs of rows within pages, without requiring predicate evaluation logic to leak into the parquet crate.
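
   To make the selection mask idea concrete, here is a minimal sketch in plain Rust. It is not the parquet crate's API: `Run`, `mask_to_runs` and `skippable_pages` are hypothetical names, and the per-page row counts are assumed to be known to the reader. The caller evaluates the predicate elsewhere and hands over only a boolean mask, so the reader needs no predicate logic of its own.

```rust
/// A run of consecutive rows that are either all selected or all skipped.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    len: usize,
    selected: bool,
}

/// Collapse a per-row boolean mask into alternating select/skip runs.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &selected in mask {
        // Extend the previous run if it has the same selection state.
        if let Some(last) = runs.last_mut() {
            if last.selected == selected {
                last.len += 1;
                continue;
            }
        }
        runs.push(Run { len: 1, selected });
    }
    runs
}

/// Given the row count of each page, report which pages contain no selected
/// row at all and could therefore be skipped without being decoded.
fn skippable_pages(runs: &[Run], page_rows: &[usize]) -> Vec<bool> {
    // Convert the runs into half-open row ranges [start, end) of selected rows.
    let mut selected_ranges: Vec<(usize, usize)> = Vec::new();
    let mut offset = 0usize;
    for run in runs {
        if run.selected {
            selected_ranges.push((offset, offset + run.len));
        }
        offset += run.len;
    }

    let mut page_start = 0usize;
    page_rows
        .iter()
        .map(|&rows| {
            let page_end = page_start + rows;
            // A page must be read if any selected range overlaps it.
            let overlaps = selected_ranges
                .iter()
                .any(|&(start, end)| start < page_end && end > page_start);
            page_start = page_end;
            !overlaps
        })
        .collect()
}
```

   For example, with a 12-row mask that selects only rows 4 and 5 and three pages of 4 rows each, the first and last pages are reported as skippable, and within the middle page the runs say to read 2 rows and skip the remaining 2.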

