Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2015/04/04 19:37:33 UTC

[jira] [Commented] (PARQUET-194) Provide callback to allow user defined key-value metadata merging strategy

    [ https://issues.apache.org/jira/browse/PARQUET-194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395835#comment-14395835 ] 

Ryan Blue commented on PARQUET-194:
-----------------------------------

True, we decided we needed to keep the code that produces {{_common_metadata}} around for a while longer. That isn't to say I think it should be used, except in some specific cases. The problem is that the information in that file can easily become incorrect through normal operations on a folder of Parquet data, like adding new files. A better option is to track the metadata elsewhere, in the Hive metastore or with another dataset management library like Kite. Parquet doesn't manage entire datasets of Parquet files, and I'm not sure it should be expected to.
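For anyone who wants to check what their files actually declare, a minimal sketch along those lines (assuming parquet-mr 1.6 package names and a flat directory of {{.parquet}} data files; the {{FooterMetadataScan}} class name is just for illustration) reads the key-value metadata from each file's own footer rather than relying on the summary file:

{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.ParquetMetadata;

/** Illustrative only: read the key-value metadata recorded in each data file's
 *  own footer instead of trusting a possibly stale _common_metadata summary. */
public class FooterMetadataScan {
  public static void printKeyValueMetadata(Configuration conf, Path dir) throws IOException {
    for (FileStatus status : dir.getFileSystem(conf).listStatus(dir)) {
      // Skip summary/side files; only the data files carry the footers we care about.
      if (!status.getPath().getName().endsWith(".parquet")) {
        continue;
      }
      ParquetMetadata footer = ParquetFileReader.readFooter(conf, status.getPath());
      Map<String, String> keyValues = footer.getFileMetaData().getKeyValueMetaData();
      System.out.println(status.getPath() + " -> " + keyValues);
    }
  }
}
{code}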

> Provide callback to allow user defined key-value metadata merging strategy
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-194
>                 URL: https://issues.apache.org/jira/browse/PARQUET-194
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Cheng Lian
>
> When merging footers, Parquet doesn't know how to merge conflicting user-defined key-value metadata entries, and simply throws. It would be better to provide callbacks that let users define metadata merging strategies.
> For example, in Spark SQL we store our own schema information in Parquet files as key-value metadata (similar to parquet-avro). While trying to add schema merging support for reading Parquet files with different but compatible schemas, {{InitContext.getMergedKeyValueMetaData}} throws because different Spark SQL schemas are stored in different Parquet data files. Thus, we have to override {{ParquetInputFormat}} and merge the schemas within {{getSplits}}, which is hacky and inconvenient.
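
No such callback exists in parquet-mr 1.6, where conflicting entries simply cause {{InitContext.getMergedKeyValueMetaData}} to throw. As a rough illustration of what the request amounts to, here is a hypothetical sketch (the {{KeyValueMetadataMerger}} and {{MergeStrategy}} names are invented for illustration, not a proposed API) that reduces per-file key-value maps with a user-supplied conflict resolver instead of throwing:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the requested behavior: merge per-file key-value
 *  metadata, delegating conflicts to a user-defined strategy instead of throwing. */
public final class KeyValueMetadataMerger {

  /** User-defined resolution of two different values stored under the same key. */
  public interface MergeStrategy {
    String merge(String key, String left, String right);
  }

  public static Map<String, String> merge(List<Map<String, String>> perFileMetadata,
                                          MergeStrategy strategy) {
    Map<String, String> merged = new HashMap<String, String>();
    for (Map<String, String> fileMetadata : perFileMetadata) {
      for (Map.Entry<String, String> entry : fileMetadata.entrySet()) {
        String existing = merged.get(entry.getKey());
        if (existing == null || existing.equals(entry.getValue())) {
          merged.put(entry.getKey(), entry.getValue());
        } else {
          // Conflict: let the caller decide (e.g. Spark SQL could merge two
          // compatible schema JSON strings here) rather than failing the job.
          merged.put(entry.getKey(), strategy.merge(entry.getKey(), existing, entry.getValue()));
        }
      }
    }
    return merged;
  }
}
{code}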


