Posted to jira@arrow.apache.org by "Weiyang Zhao (Jira)" <ji...@apache.org> on 2021/02/09 03:38:00 UTC

[jira] [Created] (ARROW-11566) [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way

Weiyang Zhao created ARROW-11566:
------------------------------------

             Summary: [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way
                 Key: ARROW-11566
                 URL: https://issues.apache.org/jira/browse/ARROW-11566
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Weiyang Zhao


I created the PyPI package "condition" to allow conditions to be expressed in a user-friendly way. For example, a condition can be:

(A <= 3 or B != 'b1') and C == ['c1', 'c2'] 

For usage details, please see its documentation at:

[https://condition.readthedocs.io/en/latest/usage.html]

 

Any condition object can be converted to a pyarrow filter by calling its to_pyarrow_filter() method:

[https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering]

This method normalizes the condition so that it conforms to the pyarrow filter specification.
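
To illustrate what normalization means here: pyarrow filters are written in disjunctive normal form (DNF), as a list of AND-groups that are OR-ed together, each group being a list of (column, op, value) tuples. The sketch below expands the example condition by hand and checks the expansion against the original expression. It does not call the condition package, it assumes that C == ['c1', 'c2'] denotes membership, and the helper names (matches, original) are made up for this illustration. The exact output of to_pyarrow_filter() may differ.

```python
import itertools
import operator

# Hand-expanded DNF of: (A <= 3 or B != 'b1') and C == ['c1', 'c2']
# Outer list = OR of groups, inner list = AND of (column, op, value) tuples.
DNF_FILTER = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

# Only the operators needed for this example.
OPS = {
    "<=": operator.le,
    "!=": operator.ne,
    "in": lambda value, options: value in options,
}


def matches(dnf, row):
    """Evaluate a DNF filter against one row given as a dict."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in dnf
    )


def original(row):
    """The example condition written as a plain Python expression."""
    return (row["A"] <= 3 or row["B"] != "b1") and row["C"] in ["c1", "c2"]


# Check that both forms agree on a small grid of partition values.
ROWS = [
    {"A": a, "B": b, "C": c}
    for a, b, c in itertools.product([1, 3, 5], ["b1", "b2"], ["c1", "c2", "c3"])
]
assert all(matches(DNF_FILTER, row) == original(row) for row in ROWS)
```

A list shaped like DNF_FILTER is what the filters argument of pyarrow's Parquet readers accepts.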

 

Furthermore, the condition object can be used directly to evaluate partition paths. This can replace the current, complex filtering code (both native and Python).

For maximum efficiency, filtering with the condition object can be done as follows:
 # read the paths in chunks to keep the memory footprint small;
 # parse each chunk of paths into a pandas dataframe;
 # use condition.query(dataframe) to get the filtered dataframe of paths;
 # use the numexpr backend for the dataframe query.
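
A minimal sketch of these steps, assuming Hive-style key=value partition paths. A plain pandas query string stands in for condition.query(dataframe) here, since this sketch does not depend on the condition package. The function names, the numeric_columns parameter, and the sample paths are all hypothetical.

```python
import itertools

import pandas as pd


def parse_partition_path(path):
    """Turn 'A=1/B=b1/C=c1' into {'path': ..., 'A': '1', 'B': 'b1', 'C': 'c1'}."""
    record = {"path": path}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            record[key] = value
    return record


def filter_paths(paths, query, numeric_columns=(), chunk_size=2):
    """Filter partition paths chunk by chunk so only one chunk is in memory."""
    paths = iter(paths)
    kept = []
    while True:
        chunk = list(itertools.islice(paths, chunk_size))
        if not chunk:
            break
        frame = pd.DataFrame([parse_partition_path(p) for p in chunk])
        # Values parsed from paths are strings; convert the numeric keys.
        for column in numeric_columns:
            frame[column] = pd.to_numeric(frame[column])
        # With engine left unset, pandas uses numexpr when it is installed
        # and falls back to the pure-Python engine otherwise.
        kept.extend(frame.query(query)["path"].tolist())
    return kept


PATHS = [
    "A=1/B=b1/C=c1",
    "A=5/B=b1/C=c1",
    "A=5/B=b2/C=c2",
    "A=2/B=b2/C=c3",
    "A=9/B=b3/C=c1",
]
QUERY = "(A <= 3 or B != 'b1') and C in ['c1', 'c2']"

kept = filter_paths(PATHS, QUERY, numeric_columns=("A",))
print(kept)
```

The chunk size is deliberately tiny to exercise the chunking. A real listing would use a much larger value and would likely stream the paths from a filesystem listing rather than from an in-memory list.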

Please discuss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)