You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/24 22:45:23 UTC

[GitHub] [arrow] bkietz opened a new pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

bkietz opened a new pull request #7536:
URL: https://github.com/apache/arrow/pull/7536


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] fsaintjacques closed pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

fsaintjacques closed pull request #7536:
URL: https://github.com/apache/arrow/pull/7536


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

bkietz commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-650885378


   Int32 indices are now used whatever the dictionary size


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-652493993


   @bkietz thanks for the update ensuring all uniques as dictionary values!
   
   Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / #7608
   
   Further, a usability issue: this now creates partition expressions that use a dictionary type. Which means that doing something like `dataset.to_table(filter=ds.field("part") == "A")` to filter on the partition field with a plain string expression doesn't work, limiting the usability of this option (and even with the new Python scalar stuff, it would not be easy to construct the correct expression):
   
   ```
   In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)  
   
   In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)
   
   In [11]: fragment = list(dataset.get_fragments())[0]   
   
   In [12]: fragment.partition_expression  
   Out[12]: 
   <pyarrow.dataset.Expression (part == [
     "A",
     "B"
   ][0]:dictionary<values=string, indices=int32, ordered=0>)>
   
   In [13]: dataset.to_table(filter=ds.field("part") == "A") 
   ...
   ArrowNotImplementedError: cast from string
   ```
   
   It might also be an option to keep the `partition_expression` use the dictionary *value type* instead of dictionary type?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] fsaintjacques commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

fsaintjacques commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-653124260


   I think that any comparison involving the dict type should also work with the "effective" logical type (the value type of the dict).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

bkietz commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649154064


   Actually, on reflection: I'm not sure it's worthwhile to check the count of unique values at all. In any given batch a virtual column would be materialized with a single-item dictionary so `int8` should always suffice. (Unless we want to always support concatenation of a materialized table's chunks, though even in that case we could promote the index type on concatenation...). 
   
   Currently it seems preferable to remove `max_partition_dictionary_size` in favor of a boolean flag and always infer `dictionary<indices=int8, values=utf8>`.
   
   @jorisvandenbossche ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649500017


   Currently for the ParquetDataset, it also simply uses int32 for the indices. 
   
   Now, there is a more fundamental issue I had not thought of: the actual dictionary of the DictionaryArray. Right now, you create a DictionaryArray with only the (single) value of the partition field for that specific fragment (because we don't keep track of all unique values of a certain partition level?). 
   In the python ParquetDataset, however, we create a DictionaryArray with all occurring values of that partition field (also for the other fragments/pieces).
   
   To illustrate with a small dataset with `part=A` and `part=B` directories:
   
   ```python
   In [1]: import pyarrow.dataset as ds
   
   In [6]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)   
   
   In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part) 
   
   In [10]: fragment = list(dataset.get_fragments())[0] 
   
   In [11]: fragment.to_table(schema=dataset.schema) 
   Out[11]: 
   pyarrow.Table
   dummy: int64
   part: dictionary<values=string, indices=int8, ordered=0>
   
   # only A included
   In [13]: fragment.to_table(schema=dataset.schema).column("part")  
   Out[13]: 
   <pyarrow.lib.ChunkedArray object at 0x7fb4a0b6c5e8>
   [
   
     -- dictionary:
       [
         "A"
       ]
     -- indices:
       [
         0,
         0
       ]
   ]
   
   In [15]: import pyarrow.parquet as pq  
   
   In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")  
   
   In [19]: piece = dataset2.pieces[0] 
   
   In [25]: piece.read(partitions=dataset2.partitions) 
   Out[25]: 
   pyarrow.Table
   dummy: int64
   part: dictionary<values=string, indices=int32, ordered=0>
   
   # both A and B included
   In [26]: piece.read(partitions=dataset2.partitions).column("part")   
   Out[26]: 
   <pyarrow.lib.ChunkedArray object at 0x7fb4a08b26d8>
   [
   
     -- dictionary:
       [
         "A",
         "B"
       ]
     -- indices:
       [
         0,
         0
       ]
   ]
   ```
   
   I think for this being valuable (eg in the context of dask, or for pandas where reading in only a part of the parquet dataset), it's important to get all values of the partition field. But I am not sure to what extent that fits in the Dataset design (although I think that during the discovery in the Factory, we could keep track of all unique values of a partition field?)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

bkietz commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649148829


   I think there's value in finding the smallest index type possible; we expect partition fields to have few unique values in most cases.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

bkietz commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649514981


   @jorisvandenbossche okay, I'll extend the key value `Partitioning`s to maintain dictionaries of all unique values of a field.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-654704786


   > I think that any comparison involving the dict type should also work with the "effective" logical type (the value type of the dict).
   
   Opened https://issues.apache.org/jira/browse/ARROW-9345 to track this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wesm commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

wesm commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649145717


   We could use just int32() dictionary indices and call it a day? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] fsaintjacques commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

fsaintjacques commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649607214


   I'm also of the opinion that we should stick with int32_t. That's what parquet uses for dict column, that's what R uses for factor columns, that's what we use by default in DictType, etc... I suspect the short-time net effect of this is uncovering index type issues we have around.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649112307


   https://issues.apache.org/jira/browse/ARROW-8647


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org