Posted to jira@arrow.apache.org by "Norbert (Jira)" <ji...@apache.org> on 2020/09/10 10:18:00 UTC

[jira] [Created] (ARROW-9959) [Python][C++][Parquet] Ability to delete row groups from metadata

Norbert created ARROW-9959:
------------------------------

             Summary: [Python][C++][Parquet] Ability to delete row groups from metadata
                 Key: ARROW-9959
                 URL: https://issues.apache.org/jira/browse/ARROW-9959
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Norbert


Hi,

We currently use PyArrow to maintain a partitioned dataset of Parquet files on disk. We also manage our own `_metadata` file: when new rows are written to the dataset, we use the `metadata_collector` argument of `write_to_dataset` to collect the metadata of each individual file that was written. We then load the existing `_metadata` file, merge in each newly written metadata object using `metadata.append_row_groups` (as in the docs), and write the result back to `_metadata` on disk.
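For reference, this is roughly what our current write path looks like (a minimal sketch; `table` and `dataset_root` are placeholders):

```python
import pyarrow.parquet as pq

# Collect the metadata of every file written during this append.
metadata_collector = []
pq.write_to_dataset(table, root_path="dataset_root",
                    metadata_collector=metadata_collector)

# Merge the new row groups into the existing _metadata file.
merged = pq.read_metadata("dataset_root/_metadata")
for md in metadata_collector:
    merged.append_row_groups(md)

# Write the combined metadata back to disk.
merged.write_metadata_file("dataset_root/_metadata")
```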

However, we would also like to occasionally amend this dataset by deleting individual files. To keep the `_metadata` file in sync, we would need to load the metadata of every file we want to delete, find the corresponding row groups inside `_metadata`, and remove them. We would therefore need a method such as `delete_row_groups` to exist on the `FileMetaData` object. Would it be possible for PyArrow to support this?

Another way of accomplishing the same thing would be to initialise an empty `FileMetaData` object and simply use `append_row_groups` to add back all the row groups that should be kept. However, I've been unable to do this programmatically, as the constructor for `FileMetaData` expects a C structure that I don't know how to construct.
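In the meantime, the closest workaround we've found is to rebuild `_metadata` from the footers of the surviving files rather than deleting row groups in place. A minimal sketch (the helper name `rebuild_metadata` and its arguments are hypothetical):

```python
import pyarrow.parquet as pq

# Hypothetical helper: regenerate _metadata from the files that remain
# after a deletion, since row groups cannot be removed in place and an
# empty FileMetaData cannot be constructed from Python.
def rebuild_metadata(root, remaining_files):
    merged = None
    for path in remaining_files:
        md = pq.read_metadata(f"{root}/{path}")
        md.set_file_path(path)  # paths stored relative to the dataset root
        if merged is None:
            # Seed the merge with the first file's metadata, since an
            # empty FileMetaData cannot be created directly.
            merged = md
        else:
            merged.append_row_groups(md)
    merged.write_metadata_file(f"{root}/_metadata")
```

This scales with the total number of files in the dataset rather than the number deleted, which is why a `delete_row_groups` method would still be preferable.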

Many thanks.


