Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/30 16:10:00 UTC

[jira] [Created] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset

Joris Van den Bossche created ARROW-8655:
--------------------------------------------

             Summary: [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset
                 Key: ARROW-8655
                 URL: https://issues.apache.org/jira/browse/ARROW-8655
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 1.0.0


Currently, the {{HivePartitioning}} and {{DirectoryPartitioning}} classes describe the partitioning used in the discovery phase. But once a dataset object is created, it no longer knows about this partitioning; it only has partition expressions for its fragments. The partition keys are added to the schema, but you can't directly tell which columns of the schema originated from the partitions.

However, there are use cases where it would be useful for a dataset to still "know" what kind of partitioning it was created from:

- The "read CSV, write back Parquet" use case, where the CSV data was already partitioned and you want to automatically preserve that partitioning for Parquet (a kind of round trip of the partitioning on read/write).
- Converting the dataset to another representation, e.g. pandas, where it can be useful to know which columns were partition columns (e.g. for pandas, those columns might be good candidates to set as the index of the pandas/dask DataFrame). I can imagine conversions to other systems could use similar information.
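
To make the two use cases above concrete, here is a minimal pure-Python sketch of a dataset that remembers its partition columns after discovery. It is only an illustration under stated assumptions: {{DiscoveredDataset}}, {{parse_hive_path}}, {{output_path}} and the example paths are hypothetical names, not part of the Arrow API, and the sketch assumes Hive-style {{key=value}} directories with the same keys on every path.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def parse_hive_path(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a Hive-style path."""
    keys = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, value = segment.split("=", 1)
            keys[name] = value
    return keys


@dataclass
class DiscoveredDataset:
    """Hypothetical stand-in for a dataset object; NOT the Arrow API."""

    data_columns: List[str]
    paths: List[str]
    # The information this issue asks to preserve after discovery.
    partition_columns: List[str] = field(default_factory=list)

    @classmethod
    def discover(cls, paths, data_columns):
        # Collect partition key names in first-seen order.
        partition_columns = []
        for path in paths:
            for name in parse_hive_path(path):
                if name not in partition_columns:
                    partition_columns.append(name)
        return cls(list(data_columns), list(paths), partition_columns)

    @property
    def schema(self) -> List[str]:
        # Partition keys are appended to the schema, as described above.
        return self.data_columns + self.partition_columns

    def output_path(self, source_path: str, extension: str) -> str:
        """Rebuild the same partition directories for a new file format."""
        keys = parse_hive_path(source_path)
        parts = [f"{name}={keys[name]}" for name in self.partition_columns]
        stem = source_path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        parts.append(f"{stem}.{extension}")
        return "/".join(parts)


ds = DiscoveredDataset.discover(
    ["year=2019/month=12/part-0.csv", "year=2020/month=1/part-0.csv"],
    data_columns=["id", "value"],
)
print(ds.schema)             # all columns, partition keys included
print(ds.partition_columns)  # candidates for a pandas index
print(ds.output_path(ds.paths[1], "parquet"))
```

With the partition columns kept on the object, the CSV-to-Parquet case can reuse the original directory layout, and a pandas conversion could pick {{ds.partition_columns}} as index candidates instead of guessing from the schema.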




--
This message was sent by Atlassian Jira
(v8.3.4#803005)