Posted to jira@arrow.apache.org by "Weiyang Zhao (Jira)" <ji...@apache.org> on 2021/02/09 03:39:00 UTC
[jira] [Assigned] (ARROW-11566) [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way
[ https://issues.apache.org/jira/browse/ARROW-11566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weiyang Zhao reassigned ARROW-11566:
------------------------------------
Assignee: Weiyang Zhao
> [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-11566
> URL: https://issues.apache.org/jira/browse/ARROW-11566
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Weiyang Zhao
> Assignee: Weiyang Zhao
> Priority: Major
>
> I created the condition package on PyPI to allow user-friendly expression of conditions. For example, a condition can be:
> (A <= 3 or B != 'b1') and C == ['c1', 'c2']
> For usage details, please see its documentation at:
> https://condition.readthedocs.io/en/latest/usage.html
>
> Any condition object can be converted to a pyarrow filter by calling its
> to_pyarrow_filter() method:
> https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
> This method normalizes the condition so that it conforms to the pyarrow filter specification.
>
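To make the normalization step concrete, here is a minimal sketch of what a pyarrow-style filter for the example condition could look like. It is hand-written, not output captured from to_pyarrow_filter(), and it assumes that `C == ['c1', 'c2']` denotes membership. pyarrow filters are in disjunctive normal form: the outer list is an OR of groups, and each inner list is an AND of (column, op, value) tuples. The helpers `matches` and `original` are hypothetical names used only to check that the two forms agree; neither pyarrow nor the condition package is needed to run this.

```python
import operator

# Hand-written DNF equivalent of:
#   (A <= 3 or B != 'b1') and C in ['c1', 'c2']
# Distributing the AND over the OR gives two AND-groups.
dnf_filter = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

# Operators used by pyarrow-style filter tuples, mapped to Python callables.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def matches(row, dnf):
    """Return True if the row (a dict) satisfies any AND-group of the DNF filter."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in dnf
    )


def original(row):
    """The example condition written directly as a Python expression."""
    return (row["A"] <= 3 or row["B"] != "b1") and row["C"] in ["c1", "c2"]
```

With pyarrow installed, a list shaped like `dnf_filter` is the kind of value accepted by the `filters` argument of `pyarrow.parquet.read_table`.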
> Furthermore, the condition object can be used directly to evaluate partition paths. This can replace the current complex filtering code (both native and Python).
> For maximum efficiency, filtering with the condition object can be done as follows:
> # read the paths in chunks to keep the memory footprint small;
> # parse the paths into a pandas dataframe;
> # use condition.query(dataframe) to get the filtered dataframe of paths;
> # use the numexpr backend for the dataframe query for efficiency.
> Please discuss.
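
The chunked filtering steps proposed above can be sketched roughly as follows. This is a dependency-free stand-in, not the proposed implementation: plain dicts replace the pandas dataframe, an ordinary Python predicate replaces the condition object, and the numexpr step is not covered. The hive-style `key=value` path layout, the helper names (`iter_chunks`, `parse_path`, `filter_paths`) and the `casts` parameter are all assumptions made for illustration.

```python
from itertools import islice


def iter_chunks(paths, size):
    """Yield lists of at most `size` paths so only one chunk is held at a time."""
    it = iter(paths)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def parse_path(path, casts):
    """Parse an assumed hive-style path like 'A=1/B=b1' into a dict of values."""
    row = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # skip segments that are not key=value, such as file names
            row[key] = casts.get(key, str)(value)
    return row


def filter_paths(paths, predicate, casts, chunk_size=2):
    """Yield the paths whose parsed partition values satisfy the predicate."""
    for chunk in iter_chunks(paths, chunk_size):
        for path in chunk:
            if predicate(parse_path(path, casts)):
                yield path


# Made-up partition paths for the example condition from the issue.
paths = [
    "A=1/B=b1/C=c1",
    "A=5/B=b1/C=c1",
    "A=5/B=b2/C=c2",
    "A=2/B=b2/C=c3",
]


def predicate(row):
    return (row["A"] <= 3 or row["B"] != "b1") and row["C"] in ["c1", "c2"]


kept = list(filter_paths(paths, predicate, casts={"A": int}))
```

In the proposal itself, each chunk would instead become a pandas dataframe and the per-row predicate would be replaced by a single condition.query(dataframe) call, which is where the numexpr backend could help.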
--
This message was sent by Atlassian Jira
(v8.3.4#803005)