Posted to jira@arrow.apache.org by "Weiyang Zhao (Jira)" <ji...@apache.org> on 2021/02/09 03:39:00 UTC

[jira] [Assigned] (ARROW-11566) [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way

     [ https://issues.apache.org/jira/browse/ARROW-11566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiyang Zhao reassigned ARROW-11566:
------------------------------------

    Assignee: Weiyang Zhao

> [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-11566
>                 URL: https://issues.apache.org/jira/browse/ARROW-11566
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Weiyang Zhao
>            Assignee: Weiyang Zhao
>            Priority: Major
>
> I created the condition package (available on PyPI) to let users express conditions in a friendly way. For example, a condition can be:
> (A <= 3 or B != 'b1') and C == ['c1', 'c2']
> For usage details, please see its documentation at:
> https://condition.readthedocs.io/en/latest/usage.html
>  
> Any condition object can be converted to a pyarrow filter by calling its
> to_pyarrow_filter() method:
> https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
> This method normalizes the condition so that it conforms to the pyarrow filter specification.
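A rough illustration of what that normalization involves: pyarrow's `filters` argument takes disjunctive normal form, that is, a list of OR-ed groups, each a list of AND-ed `(column, op, value)` tuples. The example condition above, reading `C == ['c1', 'c2']` as a membership test (an assumption), distributes into two groups. The sketch below writes that filter by hand and checks it against the original expression using a small hypothetical evaluator, `matches`. It uses neither the condition package nor pyarrow, so it says nothing about how `to_pyarrow_filter()` is actually implemented.

```python
import operator

# Comparison operators keyed by the op strings used in pyarrow-style filter tuples.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    "in": lambda value, options: value in options,
}

# (A <= 3 or B != 'b1') and C in ['c1', 'c2'], distributed by hand into DNF:
# the outer list is OR-ed, each inner list is AND-ed.
DNF_FILTER = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]


def matches(row, dnf):
    """Hypothetical helper: True if the row (a dict) satisfies any AND-group."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in dnf
    )


def original(row):
    """The condition as written in the issue, for comparison."""
    return (row["A"] <= 3 or row["B"] != "b1") and row["C"] in ["c1", "c2"]
```

A list shaped like `DNF_FILTER` is what one would pass as `filters=` to `pyarrow.parquet.read_table`; the evaluator here exists only to show that the two forms agree.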
>  
> Furthermore, the condition object can be used directly to evaluate partition paths. This could replace the current complex filtering code (both native and Python).
> For maximum efficiency, filtering with the condition object can be done as follows:
>  # read the paths in chunks to keep the memory footprint small;
>  # parse the paths into a pandas dataframe;
>  # use condition.query(dataframe) to get the filtered dataframe of paths;
>  # use the numexpr backend for the dataframe query for efficiency.
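The four steps can be sketched with plain pandas, as a minimal illustration rather than the proposed implementation. Several things here are assumptions: paths are hive-style (`A=1/B=b1/C=c1`), the helper names `parse_paths` and `filter_paths` and the `dtypes` parameter are hypothetical, and a pandas query string stands in for `condition.query(dataframe)`, whose exact behaviour is not shown in this thread.

```python
from itertools import islice

import pandas as pd


def parse_paths(paths):
    """Parse hive-style paths such as 'A=1/B=b1/C=c1' into a DataFrame."""
    rows = []
    for path in paths:
        # Keep only 'key=value' segments; values are strings at this point.
        row = dict(part.split("=", 1) for part in path.split("/") if "=" in part)
        row["path"] = path
        rows.append(row)
    return pd.DataFrame(rows)


def filter_paths(paths, expr, dtypes=None, chunk_size=1000, engine="python"):
    """Return the paths whose partition values satisfy the query expression."""
    it = iter(paths)
    kept = []
    while True:
        # Step 1: read the paths in chunks to bound memory use.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # Step 2: parse the chunk into a DataFrame, casting columns if asked.
        df = parse_paths(chunk)
        if dtypes:
            df = df.astype(dtypes)
        # Steps 3 and 4: filter with DataFrame.query using the chosen engine.
        kept.extend(df.query(expr, engine=engine)["path"])
    return kept
```

For example, `filter_paths(paths, "(A <= 3 or B != 'b1') and C in ['c1', 'c2']", dtypes={"A": int})` keeps only the matching paths. Passing `engine="numexpr"` requires numexpr to be installed, and how much it helps depends on the expression and the column types.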
> Please discuss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)