Posted to github@arrow.apache.org by "crepererum (via GitHub)" <gi...@apache.org> on 2023/03/15 16:34:13 UTC

[GitHub] [arrow-datafusion] crepererum opened a new issue, #5614: `ParquetExec::statistics::is_exact` likely wrong/misunderstood

crepererum opened a new issue, #5614:
URL: https://github.com/apache/arrow-datafusion/issues/5614

   A `ParquetExec` is created from a `FileScanConfig` and an optional filter predicate[^size_hint]. These two are separate, independent parameters; at least, the documentation does not imply that the predicate should be considered when constructing the `FileScanConfig`. The statistics for the `ParquetExec` are then calculated by `FileScanConfig::project`:
   
   https://github.com/apache/arrow-datafusion/blob/0f6931caa6f8b48e116a8e77e989c404f31f3f8d/datafusion/core/src/physical_plan/file_format/mod.rs#L213-L219
   
   This forwards `is_exact` from the input, which might have been set to `true`. However, if there is a predicate, `is_exact` should likely be `false`, because the predicate may remove some data and thereby invalidate the exact statistics. So either the forwarding is wrong (at least when a predicate is given) or the docs are imprecise.
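
To make the concern concrete, here is a minimal sketch of what a more conservative forwarding could look like. It does not use the real DataFusion types: the `Statistics` struct is a simplified, hypothetical stand-in (only three fields), and `scan_statistics` and its `has_predicate` parameter are names invented for illustration, not existing API.

```rust
/// Simplified, hypothetical stand-in for DataFusion's `Statistics`.
/// The real struct has more fields (e.g. per-column statistics).
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Statistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
    /// `true` means the values above are exact rather than estimates.
    pub is_exact: bool,
}

/// Hypothetical helper: derive the statistics reported by a scan from the
/// statistics of its input files.
///
/// Assumption: a predicate may drop rows during the scan, so the input
/// counts are only upper bounds and can no longer be called exact.
pub fn scan_statistics(input: &Statistics, has_predicate: bool) -> Statistics {
    Statistics {
        // The values are still forwarded as estimates.
        num_rows: input.num_rows,
        total_byte_size: input.total_byte_size,
        // Exactness survives only if the input was exact AND nothing filters rows.
        is_exact: input.is_exact && !has_predicate,
    }
}
```

With this sketch, an input marked `is_exact = true` stays exact for a scan without a predicate and is downgraded to an estimate as soon as a predicate is present. Whether that is the intended semantics, or whether the docs should instead state that the `FileScanConfig` statistics must already account for the predicate, is exactly the open question here.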
   
   Note that this is unrelated to #5613 because this issue here is about the `is_exact=true` case.
   
   [^size_hint]: And a metadata size hint, but this is irrelevant here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org