Posted to jira@arrow.apache.org by "Weiyang Zhao (Jira)" <ji...@apache.org> on 2021/02/12 23:20:00 UTC

[jira] [Updated] (ARROW-11566) [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way

     [ https://issues.apache.org/jira/browse/ARROW-11566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiyang Zhao updated ARROW-11566:
---------------------------------
    Description: 
I created the condition package on PyPI to allow conditions to be expressed in a user-friendly way. For example, a condition can be written as:

(f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2'] 

where A, B, C are partition keys.

For usage details, please see its documentation at:

[https://condition.readthedocs.io/en/latest/usage.html|https://condition.readthedocs.io/en/latest/usage.html#]

 

An arbitrary condition object can be converted to a pyarrow filter by calling its to_pyarrow_filter() method:

[https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering]

This method normalizes the condition so that it conforms to the pyarrow filter specification.
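
To make the target format concrete, here is a hand-written sketch of what a normalized filter for the example condition might look like in pyarrow's disjunctive normal form (an outer list of OR-ed groups, each an inner list of AND-ed tuples). It is not the output of to_pyarrow_filter(); it assumes that f.C == ['c1', 'c2'] means membership, and the names OPS, DNF_FILTER and matches are hypothetical helpers that only illustrate the semantics.

```python
import operator

# Comparison operators that may appear in a pyarrow filter tuple.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}

# (A <= 3 or B != 'b1') and C in ['c1', 'c2'], with the AND distributed
# over the OR so that every inner list is a pure conjunction.
DNF_FILTER = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]


def matches(partition, dnf_filter):
    """Return True if a partition (dict of key -> value) satisfies the filter."""
    return any(
        all(OPS[op](partition[key], value) for key, op, value in group)
        for group in dnf_filter
    )
```

A partition such as {"A": 5, "B": "b2", "C": "c2"} satisfies the second group, while {"A": 5, "B": "b1", "C": "c1"} satisfies neither.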

 

Furthermore, the condition object can be used directly to evaluate partition paths. This could replace the current complex filtering code (both native and Python).

For maximum efficiency, filtering with the condition object can be done as follows:
 # Read the paths in chunks to keep the memory footprint small.
 # Parse the paths into a pandas DataFrame.
 # Use condition.query(dataframe) to get the filtered DataFrame of paths.
 # Use the numexpr backend for the DataFrame query.
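
The first two steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: paths are assumed to be hive-style (key=value directory segments), the helper names are hypothetical, and a plain Python predicate stands in for condition.query(). In the proposed approach, each chunk of parsed rows would instead be loaded into a pandas DataFrame and queried there.

```python
from itertools import islice


def iter_chunks(paths, chunk_size):
    """Yield lists of at most chunk_size paths from any iterable."""
    iterator = iter(paths)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def parse_partition_path(path):
    """Parse 'A=1/B=b1/file.parquet' into {'path': ..., 'A': 1, 'B': 'b1'}."""
    row = {"path": path}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:
            # Integer-looking values become ints so numeric comparisons work.
            row[key] = int(value) if value.lstrip("-").isdigit() else value
    return row


def filter_paths(paths, predicate, chunk_size=1000):
    """Return the paths whose parsed partition values satisfy the predicate."""
    selected = []
    for chunk in iter_chunks(paths, chunk_size):
        # Only one chunk of parsed rows is held in memory at a time.
        rows = [parse_partition_path(p) for p in chunk]
        selected.extend(row["path"] for row in rows if predicate(row))
    return selected


def example_predicate(row):
    """Stand-in for the example condition on partition keys A, B and C."""
    return (row["A"] <= 3 or row["B"] != "b1") and row["C"] in ("c1", "c2")


PATHS = [
    "A=1/B=b1/C=c1/part-0.parquet",
    "A=5/B=b1/C=c1/part-0.parquet",
    "A=5/B=b2/C=c2/part-0.parquet",
    "A=1/B=b2/C=c3/part-0.parquet",
]
```

Calling filter_paths(PATHS, example_predicate, chunk_size=2) processes the four paths in two chunks and keeps only those whose partition values satisfy the condition.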

Please discuss.

  was:
I created the condition package on PyPI to allow conditions to be expressed in a user-friendly way. For example, a condition can be:

(A <= 3 or B != 'b1') and C == ['c1', 'c2'] 

For usage details, please see its documentation at:

[https://condition.readthedocs.io/en/latest/usage.html|https://condition.readthedocs.io/en/latest/usage.html#]

 

An arbitrary condition object can be converted to a pyarrow filter by calling its to_pyarrow_filter() method:

[https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering]

This method normalizes the condition so that it conforms to the pyarrow filter specification.

 

Furthermore, the condition object can be used directly to evaluate partition paths. This could replace the current complex filtering code (both native and Python).

For maximum efficiency, filtering with the condition object can be done as follows:
 # Read the paths in chunks to keep the memory footprint small.
 # Parse the paths into a pandas DataFrame.
 # Use condition.query(dataframe) to get the filtered DataFrame of paths.
 # Use the numexpr backend for the DataFrame query.

Please discuss.


> [Python][Parquet] Use pypi condition package to filter partitions in a user friendly way
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-11566
>                 URL: https://issues.apache.org/jira/browse/ARROW-11566
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Weiyang Zhao
>            Assignee: Weiyang Zhao
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)