Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/25 12:05:25 UTC

[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-649500017


   Currently the Python ParquetDataset also simply uses int32 for the indices. 
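   
   For reference, the two index widths seen in the outputs below can be expressed directly with pyarrow's dictionary type constructor (a small illustration, not part of the original comment):
   
   ```python
   import pyarrow as pa
   
   # Type produced by the Dataset partitioning inference (int8 indices):
   dict_int8 = pa.dictionary(pa.int8(), pa.string())
   # Type produced by the Python ParquetDataset (int32 indices):
   dict_int32 = pa.dictionary(pa.int32(), pa.string())
   
   dict_int8.index_type   # int8
   dict_int32.index_type  # int32
   ```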
   
   Now, there is a more fundamental issue I had not thought of: the actual dictionary of the DictionaryArray. Right now, you create a DictionaryArray with only the (single) value of the partition field for that specific fragment (because we don't keep track of all unique values of a certain partition level?). 
   In the Python ParquetDataset, however, we create a DictionaryArray with all occurring values of that partition field (including those from the other fragments/pieces).
   
   To illustrate with a small dataset with `part=A` and `part=B` directories:
   
   ```python
   In [1]: import pyarrow.dataset as ds
   
   In [6]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)   
   
   In [9]: dataset = ds.dataset("test_partitioned/", format="parquet", partitioning=part) 
   
   In [10]: fragment = list(dataset.get_fragments())[0] 
   
   In [11]: fragment.to_table(schema=dataset.schema) 
   Out[11]: 
   pyarrow.Table
   dummy: int64
   part: dictionary<values=string, indices=int8, ordered=0>
   
   # only A included
   In [13]: fragment.to_table(schema=dataset.schema).column("part")  
   Out[13]: 
   <pyarrow.lib.ChunkedArray object at 0x7fb4a0b6c5e8>
   [
   
     -- dictionary:
       [
         "A"
       ]
     -- indices:
       [
         0,
         0
       ]
   ]
   
   In [15]: import pyarrow.parquet as pq  
   
   In [16]: dataset2 = pq.ParquetDataset("test_partitioned/")  
   
   In [19]: piece = dataset2.pieces[0] 
   
   In [25]: piece.read(partitions=dataset2.partitions) 
   Out[25]: 
   pyarrow.Table
   dummy: int64
   part: dictionary<values=string, indices=int32, ordered=0>
   
   # both A and B included
   In [26]: piece.read(partitions=dataset2.partitions).column("part")   
   Out[26]: 
   <pyarrow.lib.ChunkedArray object at 0x7fb4a08b26d8>
   [
   
     -- dictionary:
       [
         "A",
         "B"
       ]
     -- indices:
       [
         0,
         0
       ]
   ]
   ```
   
   I think for this to be valuable (e.g. in the context of dask, or for pandas when reading in only a part of the parquet dataset), it's important to get all values of the partition field. But I am not sure to what extent that fits in the Dataset design (although I think that, during discovery in the Factory, we could keep track of all unique values of a partition field?)
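   
   To sketch what that could look like: if discovery kept the full set of values per partition field, each fragment could still encode its rows against the shared, complete dictionary. A minimal illustration with pyarrow (the discovered values and variable names here are hypothetical, mirroring the `part=A`/`part=B` example above):
   
   ```python
   import pyarrow as pa
   
   # Hypothetical: all unique partition values collected during discovery
   all_values = pa.array(["A", "B"])
   
   # The fragment for part=A still maps both of its rows to "A" (index 0),
   # but against the full dictionary, so all fragments share one dictionary:
   indices = pa.array([0, 0], type=pa.int32())
   part_column = pa.DictionaryArray.from_arrays(indices, all_values)
   
   part_column.dictionary.to_pylist()  # ['A', 'B']
   part_column.to_pylist()             # ['A', 'A']
   ```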


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org