Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/11/13 13:03:00 UTC

[jira] [Commented] (ARROW-10574) [Python][Parquet] Enhance hive partition filtering

    [ https://issues.apache.org/jira/browse/ARROW-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231438#comment-17231438 ] 

Joris Van den Bossche commented on ARROW-10574:
-----------------------------------------------

Could you push a branch with your changes to GitHub? Seeing the code would help in understanding what you are proposing.

bq. for operator "in", "not in", the value currently must be a set. 

I would think we already support other iterables that support Python's "in" operator. Do you have an example where it fails?
But I agree that converting the value to a set might be good anyway.
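
To make the set conversion concrete, here is a minimal sketch in plain Python. It is not the actual pyarrow code: the helper names {{normalize_in_values}} and {{matches}} are hypothetical, and rejecting strings is an assumption taken from the issue description.

```python
from collections.abc import Iterable


def normalize_in_values(value):
    """Hypothetical helper: coerce the value of an "in"/"not in" filter to a set."""
    # A string is iterable, but treating it as a collection of characters
    # is almost never what the caller meant, so reject it explicitly.
    if isinstance(value, (str, bytes)):
        raise TypeError("'in'/'not in' expects a collection of values, not a string")
    if not isinstance(value, Iterable):
        raise TypeError("'in'/'not in' expects an iterable of values")
    # Building a set drops duplicates and gives constant-time membership tests.
    return set(value)


def matches(op, partition_value, filter_value):
    """Hypothetical evaluation of a single "in"/"not in" filter."""
    values = normalize_in_values(filter_value)
    if op == "in":
        return partition_value in values
    if op == "not in":
        return partition_value not in values
    raise ValueError("unsupported operator: {!r}".format(op))
```

With this approach a list, tuple or set behaves identically, and a bare string raises an error instead of silently matching nothing.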

bq. I would like to add a 'like' operator which has a semantics of a sql "like". Alternatively, a regular expression can be used. I prefer sql like semantics for reasons to achieve sql consistency. 

The ParquetDataset code is being replaced with a {{pyarrow.dataset}}-based implementation, so any significant enhancement or new feature should probably target this new implementation. Currently, we do not yet support general filter expressions, but work is ongoing to allow this (I can't directly find the correct JIRA, but see e.g. ARROW-10305 for a similar discussion).
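
For reference, the SQL "like" semantics quoted above can be expressed through a regular expression, so the two options are closely related. The following is only an illustrative sketch in plain Python, not part of pyarrow: the names {{like_to_regex}} and {{sql_like}} are hypothetical, and it assumes only the basic wildcards ({{%}} for any run of characters, {{_}} for exactly one character), case-sensitive matching, and no ESCAPE clause.

```python
import re


def like_to_regex(pattern):
    """Hypothetical helper: translate a SQL LIKE pattern to a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")   # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    # DOTALL so that the wildcards also match newline characters.
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value, pattern):
    """Return True if the whole string matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

For example, {{sql_like("2020-11-13", "2020-11-%")}} would select every partition value for November 2020, while regex metacharacters such as {{.}} in the pattern are treated as literal characters.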


> [Python][Parquet] Enhance hive partition filtering
> --------------------------------------------------
>
>                 Key: ARROW-10574
>                 URL: https://issues.apache.org/jira/browse/ARROW-10574
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Weiyang Zhao
>            Assignee: Weiyang Zhao
>            Priority: Major
>
> I would like to enhance partition filters in methods such as:
> {{pyarrow.parquet.ParquetDataset(path, filters)}}
> I am proposing two enhancements:
>  # for operator "in", "not in", the value currently must be a set. In my experience, passing a list silently returns no values, with no useful warning. I would like to change it to accept any Iterable, which includes set, list, tuple, etc., but not strings. Internally I will construct a set from the Iterable to avoid duplicate elements.
>  # I would like to add a 'like' operator which has a semantics of a sql "like". Alternatively, a regular expression can be used. I prefer sql like semantics for reasons to achieve sql consistency. 
> I have already made the changes and test cases locally. Once this is approved, I can submit it.
>  
> Thank you.


