Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/11 11:16:00 UTC

[jira] [Updated] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset

     [ https://issues.apache.org/jira/browse/ARROW-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-8655:
-----------------------------------------
    Fix Version/s:     (was: 1.0.0)
                   3.0.0

> [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-8655
>                 URL: https://issues.apache.org/jira/browse/ARROW-8655
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, the {{HivePartitioning}} and {{DirectoryPartitioning}} classes describe the partitioning used in the discovery phase. But once a dataset object is created, it no longer knows about this partitioning; it only has partition expressions for the fragments. The partition keys are added to the schema, but you can't directly tell which columns of the schema originated from the partitions.
> However, there are use cases where it would be useful for a dataset to still "know" what kind of partitioning it was created from:
> - The "read CSV, write back Parquet" use case, where the CSV was already partitioned and you want to automatically preserve that partitioning for Parquet (a kind of round trip of the partitioning on read/write).
> - Converting the dataset to another representation, e.g. conversion to pandas, where it can be useful to know which columns were partition columns (e.g. for pandas, those columns might be good candidates to be set as the index of the pandas/dask DataFrame). I can imagine that conversions to other systems could use similar information.
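
To illustrate the information that is lost after discovery, here is a minimal sketch of how partition column names could be recovered from hive-style fragment paths and separated from the data columns. It uses only the Python standard library; the helper names (hive_partition_columns, split_columns), the example paths, and the column names are hypothetical and are not part of the Arrow API.

```python
from pathlib import PurePosixPath


def hive_partition_columns(fragment_paths, base_dir):
    """Return partition column names, in directory order, implied by hive-style paths.

    Hypothetical helper for illustration only; not an Arrow function.
    """
    columns = []
    for path in fragment_paths:
        relative = PurePosixPath(path).relative_to(base_dir)
        # Each directory segment of the form key=value names one partition column.
        for segment in relative.parent.parts:
            key, sep, _value = segment.partition("=")
            if sep and key not in columns:
                columns.append(key)
    return columns


def split_columns(schema_names, partition_columns):
    """Split schema field names into (data columns, partition columns)."""
    partition_set = set(partition_columns)
    data = [name for name in schema_names if name not in partition_set]
    parts = [name for name in schema_names if name in partition_set]
    return data, parts


if __name__ == "__main__":
    # Assumed example layout: <base_dir>/year=<y>/month=<m>/<file>
    paths = [
        "data/year=2019/month=1/part-0.csv",
        "data/year=2020/month=12/part-0.csv",
    ]
    partition_cols = hive_partition_columns(paths, "data")
    print(partition_cols)
    print(split_columns(["value", "year", "month"], partition_cols))
```

If a dataset kept this information itself, a consumer would not need to re-parse paths like this; the recovered column list is what a conversion to pandas could use as index candidates, or what a writer could reuse to preserve the partitioning.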



--
This message was sent by Atlassian Jira
(v8.3.4#803005)