Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/25 13:29:55 UTC

[GitHub] [arrow-datafusion] liukun4515 opened a new issue, #2962: Can't filter rowgroup for parquet prune for some data type

liukun4515 opened a new issue, #2962:
URL: https://github.com/apache/arrow-datafusion/issues/2962

   **Describe the bug**
   `RowGroupPruningStatistics` uses the statistics stored in a parquet file to prune row groups.
   
   The min and max values for a column are obtained by the logic at:
   https://github.com/apache/arrow-datafusion/blob/f386f7a7344d54455fe04d92248e373fac990e6d/datafusion/core/src/physical_plan/file_format/parquet.rs#L392
   
   But this logic has a bug in how it handles the data type.
   
   Parquet can store a decimal value as `INT32`, `INT64`, or `BINARY`, but the logic linked above does not produce an `ArrayRef` of the right type in that case.
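
To illustrate the type mismatch, here is a minimal sketch of how min/max statistics for a decimal column could be normalized to one unscaled `i128` regardless of the physical type. It does not use the DataFusion or parquet crate APIs: `StatValue` and `stat_to_unscaled_decimal` are hypothetical names introduced only for this example. It relies on the Parquet convention that a decimal stored as binary is a big-endian two's-complement unscaled integer.

```rust
/// Hypothetical stand-in for the physical value carried by parquet
/// column statistics (not a type from the parquet crate).
#[derive(Debug, Clone, PartialEq)]
enum StatValue {
    Int32(i32),
    Int64(i64),
    /// BINARY payload: big-endian two's-complement unscaled value.
    Bytes(Vec<u8>),
}

/// Convert a statistics value into the unscaled `i128` of a decimal.
/// Returns `None` for an empty payload or one wider than 16 bytes.
fn stat_to_unscaled_decimal(value: &StatValue) -> Option<i128> {
    match value {
        StatValue::Int32(v) => Some(i128::from(*v)),
        StatValue::Int64(v) => Some(i128::from(*v)),
        StatValue::Bytes(bytes) => {
            if bytes.is_empty() || bytes.len() > 16 {
                return None;
            }
            // Sign-extend: pad with 0xFF for negative values, 0x00 otherwise.
            let fill: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
            let mut buf: [u8; 16] = [fill; 16];
            buf[16 - bytes.len()..].copy_from_slice(bytes.as_slice());
            Some(i128::from_be_bytes(buf))
        }
    }
}
```

With a conversion like this, the min and max for all three physical types end up in the same decimal representation before the pruning predicate is evaluated. The precision and scale would still have to come from the column's logical type.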


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2962: Can't filter rowgroup for parquet prune for some data type

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2962:
URL: https://github.com/apache/arrow-datafusion/issues/2962#issuecomment-1197978724

   >  Maybe those two types of pruning should be part of the parquet arrow project. 
   
   I suspect additional filter pushdown will require changes in both the parquet reader and datafusion
   
   I think there is work underway by @Ted-Jiang @liukun4515  @thinkharderdev and @tustvold  to implement "Page Pruning" which I think may be what you are referring to here (it allows the parquet reader to skip materializing/decoding positions based on evaluating the predicates) -- the work is partially described in https://github.com/apache/arrow-rs/issues/1191
   
   In terms of using parquet bloom filters, I suspect that would also need work in parquet and datafusion, and I don't know of any efforts underway to do so. @shanisolomon added initial support to expose the bloom filter metadata in https://github.com/apache/arrow-rs/pull/1309 and [follow on](https://github.com/apache/arrow-rs/pulls?q=is%3Apr+bloom+is%3Aclosed) PRs, but I believe they then implemented the Bloom Filtering in a closed source project (cc @zeevm who might know more)




[GitHub] [arrow-datafusion] alamb closed issue #2962: Can't filter rowgroup for parquet prune for some data type

Posted by GitBox <gi...@apache.org>.
alamb closed issue #2962: Can't filter rowgroup for parquet prune for some data type
URL: https://github.com/apache/arrow-datafusion/issues/2962




[GitHub] [arrow-datafusion] mingmwang commented on issue #2962: Can't filter rowgroup for parquet prune for some data type

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #2962:
URL: https://github.com/apache/arrow-datafusion/issues/2962#issuecomment-1197770582

   @alamb 
   Regarding parquet row group pruning: the current pruning logic covers statistics-based pruning, which is common to any columnar storage format that provides statistics and can be reused. But the parquet format also has its own specific kinds of pruning, such as dictionary pruning and bloom filter pruning, and those two are not implemented yet. Maybe those two types of pruning should be part of the parquet arrow project. Also, in the current parquet reader implementation, I could not find a method for reading the dictionary page out and using it to construct a set for filtering.
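
A minimal sketch of the two ideas above, for an equality predicate `col = target`: the min/max check stands for statistics-based pruning, and the optional set stands for dictionary pruning. Everything here is hypothetical and for illustration only. `RowGroupSummary`, `may_contain`, and `keep_row_groups` are not DataFusion or parquet crate APIs, and the sketch assumes the dictionary values have already been read into a `HashSet` by some means.

```rust
use std::collections::HashSet;

/// Hypothetical per-row-group summary for a single i64 column.
struct RowGroupSummary {
    min: i64,
    max: i64,
    /// Distinct values taken from the dictionary page, present only when
    /// every page of the column chunk is dictionary-encoded (assumption).
    dictionary: Option<HashSet<i64>>,
}

/// Returns false only when the row group provably has no row with
/// `col == target`; true means the row group must still be read.
fn may_contain(rg: &RowGroupSummary, target: i64) -> bool {
    // Statistics pruning: the target lies outside [min, max].
    if target < rg.min || target > rg.max {
        return false;
    }
    // Dictionary pruning: the target is in range but not in the dictionary.
    match &rg.dictionary {
        Some(dict) => dict.contains(&target),
        None => true,
    }
}

/// Indices of the row groups that survive pruning.
fn keep_row_groups(groups: &[RowGroupSummary], target: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| may_contain(rg, target))
        .map(|(i, _)| i)
        .collect()
}
```

The dictionary check only helps when the statistics range is wide but the set of distinct values is small, which is the case the min/max check cannot prune.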

