Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/09 13:53:38 UTC
[GitHub] [arrow] jorisvandenbossche opened a new pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to deconstruct a partition expression
jorisvandenbossche opened a new pull request #7691:
URL: https://github.com/apache/arrow/pull/7691
Not a proper fix for ARROW-8655, but it provides a workaround for now to retrieve the partition fields' names and values from the `partition_expression`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to deconstruct a partition expression
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#issuecomment-656147394
https://issues.apache.org/jira/browse/ARROW-8655
[GitHub] [arrow] jorisvandenbossche commented on pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression
Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#issuecomment-656312308
Thanks! I had indeed basically reimplemented `VisitKeys` in Cython.
[GitHub] [arrow] bkietz commented on a change in pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression
Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#discussion_r452456433
##########
File path: python/pyarrow/includes/libarrow_dataset.pxd
##########
@@ -314,6 +314,10 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
const CExpression& partition_expression,
CRecordBatchProjector* projector)
+ cdef CResult[unordered_map[c_string, shared_ptr[CScalar]]] \
+ CGetPartitionKeys "arrow::dataset::KeyValuePartitioning::GetKeys"(
Review comment:
Yes, as far as Cython is concerned, a static method can be treated as a free function. `CSetPartitionKeysInProjector` (declared above this one) is another example. I think it works the other way, too: a C++ free function can be exposed as a `@staticmethod` of a cppclass (but I don't have a standing example, and I'm not sure why we'd ever need that).
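As a loose analogy in Python (not the Cython declaration itself), a static method can be bound to a module-level name and then called like a free function, which is roughly the effect of aliasing `arrow::dataset::KeyValuePartitioning::GetKeys` as `CGetPartitionKeys`. The class and method body below are hypothetical and only illustrate the calling convention; they are not Arrow's implementation.

```python
class KeyValuePartitioning:
    """Hypothetical stand-in; not the Arrow class."""

    @staticmethod
    def get_keys(segment):
        # Uses no instance state, so no object is needed to call it.
        name, _, value = segment.partition("=")
        return {name: value}


# Bind the static method to a free-function-style name.
get_partition_keys = KeyValuePartitioning.get_keys

print(get_partition_keys("year=2016"))
```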
[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression
Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#issuecomment-656312308
@bkietz Thanks! I had indeed basically reimplemented `VisitKeys` in Cython.
[GitHub] [arrow] bkietz closed pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression
Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #7691:
URL: https://github.com/apache/arrow/pull/7691
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression
Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#discussion_r452446416
##########
File path: python/pyarrow/includes/libarrow_dataset.pxd
##########
@@ -314,6 +314,10 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
const CExpression& partition_expression,
CRecordBatchProjector* projector)
+ cdef CResult[unordered_map[c_string, shared_ptr[CScalar]]] \
+ CGetPartitionKeys "arrow::dataset::KeyValuePartitioning::GetKeys"(
Review comment:
One question for my education: `GetKeys` is a method on the `KeyValuePartitioning` class, but since that method doesn't use any of the object's initialized variables, can you just use it as a free function like this, independent of a `KeyValuePartitioning` object in C++? (Like a static/class method in Python, but without needing to mark it as such.)
[GitHub] [arrow] jorisvandenbossche commented on pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to deconstruct a partition expression
Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#issuecomment-656143156
Not the cleanest solution, but I could do this relatively quickly because it's based on what I did earlier in https://github.com/apache/arrow/pull/7523. I think a more proper solution won't be possible before 1.0, and this at least gives a way to get the information needed.
A few examples:
```python
In [1]: import pyarrow.dataset as ds
In [2]: dataset = ds.dataset("test_filter_fragments_pandas/", format="parquet", partitioning="hive")
In [4]: expr = list(dataset.get_fragments())[0].partition_expression
# single partition level with a string
In [5]: expr
Out[5]: <pyarrow.dataset.Expression (part == A:string)>
In [6]: ds._unwrap_partition_expression(expr)
Out[6]: [('part', 'A')]
In [7]: dataset = ds.dataset("test_parquet_dask/", format="parquet", partitioning="hive")
In [8]: expr = list(dataset.get_fragments())[0].partition_expression
# two partition levels with integers
In [9]: expr
Out[9]: <pyarrow.dataset.Expression ((year == 2016:int32) and (month == 1:int32))>
In [10]: ds._unwrap_partition_expression(expr)
Out[10]: [('year', 2016), ('month', 1)]
In [11]: dataset = ds.dataset("test.parquet", format="parquet")
In [12]: expr = list(dataset.get_fragments())[0].partition_expression
# non-partitioned dataset
In [13]: expr
Out[13]: <pyarrow.dataset.Expression true:bool>
In [14]: ds._unwrap_partition_expression(expr)
Out[14]: []
```
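For readers without the PR branch at hand, the behaviour shown above can be mimicked with a small pure-Python sketch. The `Equal`, `And`, and `unwrap_partition_expression` names are hypothetical stand-ins, not pyarrow's classes or the helper added in this PR. The sketch only assumes that a partition expression is either a literal `true`, a single `field == value` comparison, or a conjunction of those.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple, Union


@dataclass(frozen=True)
class Equal:
    """Hypothetical stand-in for a `field == value` comparison."""
    field: str
    value: Any


@dataclass(frozen=True)
class And:
    """Hypothetical stand-in for a conjunction of two expressions."""
    left: "Expr"
    right: "Expr"


# A literal True plays the role of the `true:bool` expression of an
# unpartitioned dataset.
Expr = Union[Equal, And, bool]


def unwrap_partition_expression(expr: Expr) -> List[Tuple[str, Any]]:
    """Collect (field, value) pairs from nested equality conjunctions."""
    if expr is True:
        return []
    if isinstance(expr, Equal):
        return [(expr.field, expr.value)]
    if isinstance(expr, And):
        # Visit left before right so the partition levels keep their order.
        return (unwrap_partition_expression(expr.left)
                + unwrap_partition_expression(expr.right))
    raise TypeError(f"unsupported partition expression: {expr!r}")


two_levels = And(Equal("year", 2016), Equal("month", 1))
print(unwrap_partition_expression(two_levels))
```

Anything other than an equality or a conjunction raises `TypeError` in this sketch, rather than being silently dropped.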
cc @rjzamora