Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/09 13:53:38 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to deconstruct a partition expression

jorisvandenbossche opened a new pull request #7691:
URL: https://github.com/apache/arrow/pull/7691


   Not a proper fix for ARROW-8655, but it provides a workaround for now to retrieve the partition field names and values from the `partition_expression`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to deconstruct a partition expression

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#issuecomment-656147394


   https://issues.apache.org/jira/browse/ARROW-8655





[GitHub] [arrow] jorisvandenbossche commented on pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#issuecomment-656312308


   Thanks! I indeed basically reimplemented `VisitKeys` in Cython.





[GitHub] [arrow] bkietz commented on a change in pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#discussion_r452456433



##########
File path: python/pyarrow/includes/libarrow_dataset.pxd
##########
@@ -314,6 +314,10 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
             const CExpression& partition_expression,
             CRecordBatchProjector* projector)
 
+    cdef CResult[unordered_map[c_string, shared_ptr[CScalar]]] \
+        CGetPartitionKeys "arrow::dataset::KeyValuePartitioning::GetKeys"(

Review comment:
       Yes, as far as Cython is concerned, a static method can be treated as a free function. `CSetPartitionKeysInProjector` (declared just above this one) is another example. I think it works the other way, too: a C++ free function can be exposed as a `@staticmethod` of a cppclass (but I haven't got a standing example, and I'm not sure why we'd ever need that).
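
The Python analogy raised in this thread can be made concrete with a small sketch. It is not Arrow code: `KeyValuePartitioningSketch`, `get_keys`, and `get_partition_keys` are hypothetical names chosen only to mirror the shape of the `.pxd` declaration, where a static member function is bound to a free-function name.

```python
class KeyValuePartitioningSketch:
    """Hypothetical stand-in for a class with a static helper (not the Arrow class)."""

    def __init__(self, field_names):
        # Instance state that the static helper below never touches.
        self.field_names = field_names

    @staticmethod
    def get_keys(pairs):
        # Uses only its argument, no instance or class state,
        # so it behaves exactly like a free function.
        return dict(pairs)


# Bind the static method to a module-level name, loosely analogous to how the
# .pxd declaration exposes a static C++ member under a free-function name.
get_partition_keys = KeyValuePartitioningSketch.get_keys

# No KeyValuePartitioningSketch instance is ever constructed.
result = get_partition_keys([("year", 2016), ("month", 1)])
```

In this sketch the helper is callable without creating an instance, which is the property the review comment relies on.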







[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#issuecomment-656312308


   @bkietz Thanks! I indeed basically reimplemented `VisitKeys` in Cython.





[GitHub] [arrow] bkietz closed pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #7691:
URL: https://github.com/apache/arrow/pull/7691


   





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to get keys from a partition expression

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#discussion_r452446416



##########
File path: python/pyarrow/includes/libarrow_dataset.pxd
##########
@@ -314,6 +314,10 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
             const CExpression& partition_expression,
             CRecordBatchProjector* projector)
 
+    cdef CResult[unordered_map[c_string, shared_ptr[CScalar]]] \
+        CGetPartitionKeys "arrow::dataset::KeyValuePartitioning::GetKeys"(

Review comment:
       One question for my education: `GetKeys` is a method on the `KeyValuePartitioning` class, but since that method doesn't use any of the initialized variables of the object, can you just use it as a free function like this, independent of a `KeyValuePartitioning` object in C++? (Like a static/class method in Python, but without needing to mark it as such.)







[GitHub] [arrow] jorisvandenbossche commented on pull request #7691: ARROW-8655: [Python][Dataset] Provide helper method to deconstruct a partition expression

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7691:
URL: https://github.com/apache/arrow/pull/7691#issuecomment-656143156


   Not the cleanest solution, but I could do this relatively quickly because it's based on what I did earlier in https://github.com/apache/arrow/pull/7523. I think a more proper solution won't be possible before 1.0, and this at least gives a way to get the information needed.
   
   A few examples:
   
   ```python
   In [1]: import pyarrow.dataset as ds                                                                                                                                                                               
   
   In [2]: dataset = ds.dataset("test_filter_fragments_pandas/", format="parquet", partitioning="hive")                                                                                                               
   In [4]: expr = list(dataset.get_fragments())[0].partition_expression                                                                                                                                               
   
   # single partition level with a string
   In [5]: expr                                                                                                                                                                                                       
   Out[5]: <pyarrow.dataset.Expression (part == A:string)>
   
   In [6]: ds._unwrap_partition_expression(expr)                                                                                                                                                                      
   Out[6]: [('part', 'A')]
   
   
   In [7]: dataset = ds.dataset("test_parquet_dask/", format="parquet", partitioning="hive")                                                                                                                          
   In [8]: expr = list(dataset.get_fragments())[0].partition_expression                                                                                                                                               
   
   # two partition levels with integers
   In [9]: expr                                                                                                                                                                                                       
   Out[9]: <pyarrow.dataset.Expression ((year == 2016:int32) and (month == 1:int32))>
   
   In [10]: ds._unwrap_partition_expression(expr)                                                                                                                                                                     
   Out[10]: [('year', 2016), ('month', 1)]
   
   
   In [11]: dataset = ds.dataset("test.parquet", format="parquet")                                                                                                                                                    
   In [12]: expr = list(dataset.get_fragments())[0].partition_expression                                                                                                                                              
   
   # non-partitioned dataset
   In [13]: expr                                                                                                                                                                                                      
   Out[13]: <pyarrow.dataset.Expression true:bool>
   
   In [14]: ds._unwrap_partition_expression(expr)                                                                                                                                                                     
   Out[14]: []
   ```
   
   cc @rjzamora 
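
To illustrate the idea behind the examples above, here is a minimal pure-Python sketch of collecting keys from a partition expression. It does not use pyarrow and is not the helper's actual implementation: the `Equal`, `And`, and `Literal` classes are hypothetical stand-ins for an expression tree, and the sketch assumes a partition expression is either a literal `true`, a single equality, or a conjunction of those.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple, Union


@dataclass
class Equal:
    """Hypothetical node for `field == value`."""
    field: str
    value: Any


@dataclass
class And:
    """Hypothetical node for `left and right`."""
    left: "Expr"
    right: "Expr"


@dataclass
class Literal:
    """Hypothetical node for a scalar such as `true`."""
    value: Any


Expr = Union[Equal, And, Literal]


def unwrap_partition_expression(expr: Expr) -> List[Tuple[str, Any]]:
    """Collect (field, value) pairs from a conjunction of equalities."""
    if isinstance(expr, Equal):
        return [(expr.field, expr.value)]
    if isinstance(expr, And):
        # Visit the left side first so keys keep their partition order.
        return (unwrap_partition_expression(expr.left)
                + unwrap_partition_expression(expr.right))
    if isinstance(expr, Literal) and expr.value is True:
        # A literal `true` means the fragment carries no partition keys.
        return []
    raise ValueError(f"unsupported partition expression: {expr!r}")


single = unwrap_partition_expression(Equal("part", "A"))
nested = unwrap_partition_expression(And(Equal("year", 2016), Equal("month", 1)))
unpartitioned = unwrap_partition_expression(Literal(True))
```

With these stand-in nodes, the three calls mirror the three cases shown in the session above: one string partition level, two integer levels, and a dataset without partitioning.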

