Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/05 13:52:42 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue #825: Add documentation for support for skipping Parquet row groups

andygrove opened a new issue #825:
URL: https://github.com/apache/arrow-datafusion/issues/825

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

We sometimes get questions about whether DataFusion can skip Parquet row groups based on statistics. We do not have good documentation for this really cool feature, so we should write something up. We can base it on this response, copied from the Slack channel:

```
DataFusion has support for skipping entire row groups using predicates and min and max statistics.

It does not (yet) push the predicates down into the actual scan (e.g. to avoid materializing data that wouldn’t pass the predicate). Instead, any row groups that are not pruned are decompressed into RecordBatches and then filtered.
DataFusion also does “projection pushdown”: it reads only those columns needed to answer the query.
```
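
To make the pruning idea concrete, here is a minimal, self-contained sketch of skipping row groups using min and max statistics. It is not DataFusion's implementation or API: `RowGroupStats`, `Predicate`, `may_contain_matches`, and `prune_row_groups` are hypothetical names invented for illustration, and the sketch assumes a single integer column with statistics present for every row group.

```rust
/// Min/max statistics for one column of one row group (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// A simple comparison against a single integer column (hypothetical type).
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns true if the row group *might* contain a matching row.
/// A false result proves that no row can match, so the group can be skipped.
fn may_contain_matches(stats: RowGroupStats, pred: Predicate) -> bool {
    match pred {
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        Predicate::Gt(v) => stats.max > v,
        Predicate::Lt(v) => stats.min < v,
    }
}

/// Returns the indices of the row groups that survive pruning.
/// The surviving groups still have to be decoded and filtered row by row.
fn prune_row_groups(groups: &[RowGroupStats], pred: Predicate) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_contain_matches(**stats, pred))
        .map(|(index, _)| index)
        .collect()
}

fn main() {
    let groups = [
        RowGroupStats { min: 0, max: 99 },
        RowGroupStats { min: 100, max: 199 },
        RowGroupStats { min: 200, max: 299 },
    ];
    for pred in [Predicate::Gt(150), Predicate::Eq(100), Predicate::Lt(0)] {
        println!("{:?} keeps row groups {:?}", pred, prune_row_groups(&groups, pred));
    }
}
```

Note that the check is conservative: a row group that is kept may still contain no matching rows, which is why the remaining groups are filtered after they are read.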

**Describe the solution you'd like**
Promote this cool feature in the documentation somewhere (user guide? README?)

**Describe alternatives you've considered**
None

**Additional context**
None

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on issue #825: Add documentation for support for skipping Parquet row groups

Posted by GitBox <gi...@apache.org>.

houqp commented on issue #825:
URL: https://github.com/apache/arrow-datafusion/issues/825#issuecomment-894029329


   This seems like a good fit for a design doc or the user guide.



[GitHub] [arrow-datafusion] matthewmturner commented on issue #825: Add documentation for support for skipping Parquet row groups

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #825:
URL: https://github.com/apache/arrow-datafusion/issues/825#issuecomment-912892861


   Hi there, I can work on this. Just to make sure I understand: would doing this at the scan level mean extracting the min and max statistics from the compressed data in order to determine whether the row group even needs to be materialized?
   
   With regard to the actual docs: does it make sense to add a general section to the main docs page that lists the optimizations that are currently implemented or planned? For example, what's here https://docs.rs/datafusion/5.0.0/datafusion/optimizer/optimizer/trait.OptimizerRule.html could be used for what's implemented.
   
   

