Posted to issues@arrow.apache.org by "Robert Gruener (JIRA)" <ji...@apache.org> on 2018/09/05 18:21:00 UTC

[jira] [Commented] (ARROW-1796) [Python] RowGroup filtering on file level

    [ https://issues.apache.org/jira/browse/ARROW-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604756#comment-16604756 ] 

Robert Gruener commented on ARROW-1796:
---------------------------------------

That sounds good to me. It would also be nice to be able to apply this at the ParquetDataset level, extending the existing filter parameter so that it handles both Hive partition filtering and row-group-level filtering: [https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L777]. This could be done either by using the summary _metadata file or by reading the footers of all files.
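
To illustrate the idea, here is a rough sketch of how row groups could be pruned from their min/max statistics. It is not pyarrow or fastparquet code: the function names (may_match, select_row_groups) and the shape of the statistics (one dict per row group mapping a column name to a (min, max) pair) are assumptions made up for this example. The filters are written as (column, op, value) tuples, in the style of the fastparquet filters mentioned in the issue description.

```python
def may_match(stats, column, op, value):
    """Return False only when the row group certainly holds no matching rows.

    `stats` is a hypothetical {column: (min, max)} dict for one row group,
    e.g. collected from the file footers or from a summary _metadata file.
    """
    if column not in stats:
        # No statistics for this column: the row group cannot be excluded.
        return True
    lo, hi = stats[column]
    if op in ("=", "=="):
        return lo <= value <= hi
    if op == "!=":
        # Only excludable if every value in the row group equals `value`.
        return not (lo == hi == value)
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    raise ValueError("unsupported operator: %r" % (op,))


def select_row_groups(row_group_stats, filters):
    """Return the indices of row groups that may satisfy all filters (AND)."""
    return [
        index
        for index, stats in enumerate(row_group_stats)
        if all(may_match(stats, col, op, val) for col, op, val in filters)
    ]
```

With statistics such as [{"x": (0, 9)}, {"x": (10, 19)}, {"x": (20, 29)}], the filter [("x", ">", 12)] would keep only the second and third row groups. Rows inside the kept row groups would still need to be filtered afterwards, since the statistics can only rule row groups out.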

> [Python] RowGroup filtering on file level
> -----------------------------------------
>
>                 Key: ARROW-1796
>                 URL: https://issues.apache.org/jira/browse/ARROW-1796
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Uwe L. Korn
>            Assignee: Uwe L. Korn
>            Priority: Major
>             Fix For: 0.11.0
>
>
> We can build upon the API defined in {{fastparquet}} for defining RowGroup filters: https://github.com/dask/fastparquet/blob/master/fastparquet/api.py#L296-L300 and translate them into the C++ enums we will define in https://issues.apache.org/jira/browse/PARQUET-1158 . This should enable us to provide the user with a simple predicate pushdown API that we can extend in the background from RowGroup to Page level later on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)