
Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/10/20 12:21:00 UTC

[jira] [Closed] (ARROW-17200) [Python][Parquet] support partitioning by Pandas DataFrame index

     [ https://issues.apache.org/jira/browse/ARROW-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alenka Frim closed ARROW-17200.
-------------------------------
    Resolution: Invalid

> [Python][Parquet] support partitioning by Pandas DataFrame index
> ----------------------------------------------------------------
>
>                 Key: ARROW-17200
>                 URL: https://issues.apache.org/jira/browse/ARROW-17200
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Parquet, Python
>            Reporter: Gregory Werbin
>            Priority: Minor
>
> In a pandas {{DataFrame}} with a multi-index whose "outer" index level varies slowly, one might want to partition by that index level when saving the data frame to Parquet format. This is currently not possible: you have to reset the index manually before writing and restore it after reading. It would be very useful to be able to pass the name of an index level to {{partition_cols}} instead of (or ideally in addition to) a data column name.
> I originally posted this on the pandas issue tracker ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke looked at the code and found that the partitioning functionality is implemented entirely in PyArrow, so the change would need to happen within PyArrow itself.


