
Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/30 13:36:00 UTC

[jira] [Created] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type

Joris Van den Bossche created ARROW-8647:
--------------------------------------------

             Summary: [C++][Dataset] Optionally encode partition field values as dictionary type
                 Key: ARROW-8647
                 URL: https://issues.apache.org/jira/browse/ARROW-8647
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 1.0.0


In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns. 

In the new Dataset API, we now use a plain type (integer or string when inferred). However, you can already request dictionary type for the partition keys manually by specifying the partitioning schema (in the {{Partitioning}} passed to the dataset factory).

Because partition keys are typically repeated values in the resulting table, using dictionary type can be more efficient. It might therefore be good to still have an option in the DatasetFactory to use dictionary types for the partition fields.
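To make the efficiency argument concrete, here is a small pure-Python sketch of what dictionary encoding does to a repeated partition key. It is not Arrow's implementation, and the function name and sample values are invented for the illustration.

```python
def dictionary_encode(values):
    """Split a sequence into (unique values, integer indices into them)."""
    dictionary = []   # each distinct value is stored once, in first-seen order
    index_of = {}     # value -> position in `dictionary`
    indices = []      # one small integer per row
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


# A partition key column as it would appear after reading two directories:
# every row from the same directory carries the same key.
keys = ["2019"] * 4 + ["2020"] * 3
dictionary, indices = dictionary_encode(keys)
print(dictionary)
print(indices)
```

With only a handful of distinct keys, the column shrinks to a short list of values plus one integer per row, rather than one full string per row.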



--
This message was sent by Atlassian Jira
(v8.3.4#803005)