Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/01 15:39:51 UTC

[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-652493993


   @bkietz thanks for the update that ensures all unique values are included as dictionary values!
   
   Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / #7608
   
   Further, a usability issue: this now creates partition expressions that use a dictionary type. That means filtering on the partition field with a plain string expression, e.g. `dataset.to_table(filter=ds.field("part") == "A")`, doesn't work, which limits the usability of this option (and even with the new Python scalar stuff, it would not be easy to construct the correct expression):
   
   ```
   In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)  
   
   In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)
   
   In [11]: fragment = list(dataset.get_fragments())[0]   
   
   In [12]: fragment.partition_expression  
   Out[12]: 
   <pyarrow.dataset.Expression (part == [
     "A",
     "B"
   ][0]:dictionary<values=string, indices=int32, ordered=0>)>
   
   In [13]: dataset.to_table(filter=ds.field("part") == "A") 
   ...
   ArrowNotImplementedError: cast from string
   ```
   
   It might also be an option to have the `partition_expression` keep using the dictionary *value type* instead of the dictionary type?
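
To make the distinction concrete, here is a minimal, library-free sketch of the two comparison behaviours. It does not use pyarrow, and every name in it (`DictionaryScalar`, `matches_strict`, `matches_by_value`) is hypothetical and only illustrates the idea; the real expression machinery lives in the C++ dataset code and may behave differently in its details.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DictionaryScalar:
    """Hypothetical stand-in for a dictionary-typed scalar:
    an index into a sequence of unique values."""
    values: Sequence[str]
    index: int

    def decode(self) -> str:
        # The underlying value, expressed in the dictionary's value type
        # (a plain string in this sketch).
        return self.values[self.index]


def matches_strict(partition_value: DictionaryScalar, literal: object) -> bool:
    # Mimics the behaviour reported above: a plain string literal cannot be
    # compared with a dictionary-typed partition value.
    if not isinstance(literal, DictionaryScalar):
        raise NotImplementedError(f"cast from {type(literal).__name__}")
    return partition_value == literal


def matches_by_value(partition_value: DictionaryScalar, literal: str) -> bool:
    # The suggested alternative: compare using the dictionary *value type*,
    # so a plain string filter works.
    return partition_value.decode() == literal


# Partition value "A" stored as index 0 into the unique values ["A", "B"].
part = DictionaryScalar(values=("A", "B"), index=0)
```

With this sketch, `matches_by_value(part, "A")` succeeds with a plain string, whereas `matches_strict(part, "A")` raises, which is the usability gap described above.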


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org