Posted to issues@arrow.apache.org by "Weiyang Zhao (Jira)" <ji...@apache.org> on 2021/02/09 03:38:00 UTC
[jira] [Created] (ARROW-11566) [Python][Parquet] Use pypi condition
package to filter partitions in a user friendly way
Weiyang Zhao created ARROW-11566:
------------------------------------
Summary: [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way
Key: ARROW-11566
URL: https://issues.apache.org/jira/browse/ARROW-11566
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Weiyang Zhao
I created the condition package (available on PyPI) to allow user-friendly expression of conditions. For example, a condition can be:
(A <= 3 or B != 'b1') and C == ['c1', 'c2']
For usage details, please see its documentation at:
https://condition.readthedocs.io/en/latest/usage.html
Any condition object can be converted to a pyarrow filter by calling its
to_pyarrow_filter() method:
https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
This method normalizes the condition so that it conforms to the pyarrow filter specification.
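For reference, pyarrow's filters argument accepts a list of lists of (column, op, value) tuples, where each inner list is an AND group and the outer list ORs the groups together (disjunctive normal form). The sketch below writes the example condition by hand in that shape and checks it with a small pure-Python helper. The DNF literal is derived manually and is not output from the condition package; the helper name matches() is hypothetical, and reading C == ['c1', 'c2'] as a membership test is an assumption.

```python
import operator

# Hand-derived DNF for: (A <= 3 or B != 'b1') and C == ['c1', 'c2']
# Assumption: the list comparison on C means membership ("in").
DNF_FILTER = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

# Comparison operators used by pyarrow-style filter tuples.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def matches(partition, dnf):
    """Return True if the partition dict satisfies at least one AND group.

    Hypothetical helper for illustration; not part of pyarrow or condition.
    """
    return any(
        all(OPS[op](partition[col], val) for col, op, val in group)
        for group in dnf
    )
```

A partition such as {"A": 1, "B": "b1", "C": "c1"} passes through the first group, while {"A": 5, "B": "b1", "C": "c1"} fails both groups and would be pruned.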
Furthermore, the condition object can be used directly to evaluate partition paths. This could replace the current complex filtering code (both native and Python).
For maximum efficiency, filtering with the condition object can be done as follows:
1. Read the paths in chunks to keep the memory footprint small.
2. Parse the paths into a pandas DataFrame.
3. Use condition.query(dataframe) to get the filtered DataFrame of paths.
4. Use the numexpr backend for the DataFrame query for efficiency.
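The chunk-then-filter flow above can be sketched in plain Python. This is only an illustration of the control flow: it substitutes plain dicts and a hand-written predicate for the pandas DataFrame, the numexpr backend, and condition.query(), and it assumes hive-style key=value directory names. All function names here are hypothetical.

```python
from itertools import islice


def parse_partition_path(path):
    """Parse 'A=1/B=b1/C=c1' into {'A': '1', 'B': 'b1', 'C': 'c1'}.

    Segments without '=' (base directories, file names) are ignored.
    Values are kept as strings.
    """
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def iter_chunks(paths, chunk_size):
    """Yield lists of at most chunk_size paths so memory use stays bounded."""
    iterator = iter(paths)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def filter_paths(paths, predicate, chunk_size=1000):
    """Return the paths whose parsed partition values satisfy predicate."""
    kept = []
    for chunk in iter_chunks(paths, chunk_size):
        for path in chunk:
            if predicate(parse_partition_path(path)):
                kept.append(path)
    return kept


def example_predicate(p):
    """Hand-written stand-in for (A <= 3 or B != 'b1') and C == ['c1', 'c2']."""
    return (int(p["A"]) <= 3 or p["B"] != "b1") and p["C"] in ("c1", "c2")
```

In the proposal itself, each chunk would instead become a DataFrame and the predicate would be replaced by a single vectorized query per chunk.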
Please discuss.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)