Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/30 16:10:00 UTC
[jira] [Created] (ARROW-8655) [C++][Dataset][Python][R] Preserve
partitioning information for a discovered Dataset
Joris Van den Bossche created ARROW-8655:
--------------------------------------------
Summary: [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset
Key: ARROW-8655
URL: https://issues.apache.org/jira/browse/ARROW-8655
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Fix For: 1.0.0
Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} classes that describe the partitioning used in the discovery phase. But once a dataset object is created, it no longer knows about this partitioning; it only has partition expressions for the fragments. The partition keys are added to the schema, but you can't directly tell which columns of the schema originated from the partitioning.
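To make the loss of information concrete, here is a minimal pure-Python sketch of how the two schemes map a file path to partition keys. It is not Arrow's implementation: the helpers {{parse_hive_path}} and {{parse_directory_path}} and the example paths are hypothetical, and values are kept as strings for simplicity.

```python
from pathlib import PurePosixPath


def parse_hive_path(path):
    """Hive-style layout: each directory segment is 'key=value'."""
    keys = {}
    for segment in PurePosixPath(path).parent.parts:
        if "=" in segment:
            key, value = segment.split("=", 1)
            keys[key] = value
    return keys


def parse_directory_path(path, field_names):
    """Directory-style layout: segments are bare values, so the field
    names have to be supplied separately (here as a plain list)."""
    segments = PurePosixPath(path).parent.parts
    return dict(zip(field_names, segments))


# Hypothetical example paths for the same logical partition.
hive_keys = parse_hive_path("year=2020/month=04/part-0.csv")
dir_keys = parse_directory_path("2020/04/part-0.csv", ["year", "month"])
```

Both calls produce the same key/value pairs, so the per-fragment keys alone cannot tell you which scheme the files were discovered with.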
However, there are use cases where it would be useful for a dataset to still "know" what kind of partitioning it was created from:
- The "read CSV, write back Parquet" use case, where the CSV was already partitioned and you want to automatically preserve that partitioning for Parquet (roundtripping the partitioning on read/write).
- Converting the dataset to another representation, e.g. to pandas, where it can be useful to know which columns were partition columns (for pandas, those columns might be good candidates for the index of the pandas/dask DataFrame). I can imagine conversions to other systems could use similar information.
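As a rough illustration of both use cases, the sketch below shows what a dataset could do if it carried a small record of its partitioning. This is pure Python under stated assumptions, not a proposed Arrow API: {{PartitioningInfo}}, {{output_path}}, and {{index_candidates}} are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PartitioningInfo:
    """Hypothetical record a dataset could keep after discovery."""
    flavor: str                    # "hive" or "directory"
    field_names: Tuple[str, ...]   # partition columns, in path order


def output_path(info: PartitioningInfo, keys: Dict[str, str], filename: str) -> str:
    """Rebuild a relative path using the same scheme the input used."""
    if info.flavor == "hive":
        segments = [f"{name}={keys[name]}" for name in info.field_names]
    elif info.flavor == "directory":
        segments = [str(keys[name]) for name in info.field_names]
    else:
        raise ValueError(f"unknown partitioning flavor: {info.flavor!r}")
    return "/".join(segments + [filename])


def index_candidates(schema_names: List[str], info: PartitioningInfo) -> List[str]:
    """Columns of the schema that originated from the partitioning."""
    return [name for name in schema_names if name in info.field_names]


info = PartitioningInfo(flavor="hive", field_names=("year", "month"))
parquet_path = output_path(info, {"year": "2020", "month": "04"}, "part-0.parquet")
candidates = index_candidates(["year", "month", "value"], info)
```

With the flavor and field names preserved, the writer can reproduce the input layout for the Parquet output, and a pandas conversion can pick {{candidates}} as the index columns without guessing.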
--
This message was sent by Atlassian Jira
(v8.3.4#803005)