Posted to issues@arrow.apache.org by "Pearu Peterson (JIRA)" <ji...@apache.org> on 2019/04/15 20:07:00 UTC

[jira] [Comment Edited] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

    [ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818343#comment-16818343 ] 

Pearu Peterson edited comment on ARROW-1983 at 4/15/19 8:06 PM:
----------------------------------------------------------------

Note that the Parquet format has three different metadata structures, see [https://github.com/apache/parquet-format#metadata].

The "_metadata" corresponds to `FileMetaData.key_value_metadata` (in parquet-format specification) + schema while the "statistics" (that is of interest of Dask, if I understand it correctly) corresponds to `ColumnMetadata.key_value_metadata`.
 Yes, Arrow can read all this information and more. My basic questions are:
 # What information needs to be collected? Note that some information is internal to parquet files that one would never need, hence it would just a waste of space to collect it, especially when the Datasets become huge (as would be expected in Dask applications).
 # Where this information should be gathered for easy and efficient access?
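A minimal sketch of reading these structures with pyarrow (the file name is a placeholder, and this assumes a pyarrow version where {{ParquetFile.metadata}} returns a {{FileMetaData}} object):

{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile('part-0.parquet')  # placeholder file name
meta = pf.metadata                     # FileMetaData
print(meta.num_row_groups, meta.num_rows)
print(meta.metadata)                   # file-level key/value metadata (includes the serialized Arrow schema)
col = meta.row_group(0).column(0)      # ColumnChunkMetaData of the first column chunk
print(col.statistics)                  # min/max/null_count statistics usable for row-group filtering
{code}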

 



> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet
>             Fix For: 0.14.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file (mostly just schema information). It would be useful to add the ability to write a {{_metadata}} file as well. This should include information about each row group in the dataset, including summary statistics. Having this summary file would allow filtering of row groups without needing to access each file beforehand.
> This would require that the user is able to get the written RowGroups out of a {{pyarrow.parquet.write_table}} call and then pass these objects as a list to a new function, which would hand them over as C++ objects to {{parquet-cpp}} to generate the respective {{_metadata}} file.
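For reference, a hedged sketch of how this workflow could look, using the {{metadata_collector}} / {{write_metadata}} API that later landed in pyarrow (the exact names, their availability, and the paths below are assumptions, not part of the original proposal):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'x': [1, 2, 3]})

# Collect the FileMetaData object of every file written into the dataset.
# 'dataset_root' is a placeholder path.
metadata_collector = []
pq.write_to_dataset(table, 'dataset_root', metadata_collector=metadata_collector)

# _common_metadata: schema information only, no row group statistics.
pq.write_metadata(table.schema, 'dataset_root/_common_metadata')

# _metadata: schema plus the row group metadata (including statistics) of all files.
pq.write_metadata(table.schema, 'dataset_root/_metadata',
                  metadata_collector=metadata_collector)
{code}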



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)