Posted to dev@parquet.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2018/08/13 12:39:00 UTC

[jira] [Resolved] (PARQUET-899) Add metadata field describing the application that wrote the file

     [ https://issues.apache.org/jira/browse/PARQUET-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi resolved PARQUET-899.
-----------------------------------
    Resolution: Duplicate

Quoting from the commit for PARQUET-352:

WriteSupport now has a getName getter method that is added to the footer
if it returns a non-null string as writer.model.name. This is intended
to help identify files written by object models incorrectly.

So writer.model.name is already there for this purpose, albeit undocumented.
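
As a rough illustration of the behavior described in that commit message, here is a minimal, self-contained sketch. It does not use the real parquet-mr classes: SketchWriteSupport, WriterModelNameSketch, footerMetadata, writtenBy and the value "example-model" are hypothetical names made up for this example. Only the getName getter and the writer.model.name key come from the quoted text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for WriteSupport; only the getName() getter is modeled.
abstract class SketchWriteSupport {
    // Returns the object model name, or null if the writer does not identify itself.
    String getName() {
        return null;
    }
}

class WriterModelNameSketch {
    // Footer key named in the PARQUET-352 commit message.
    static final String WRITER_MODEL_KEY = "writer.model.name";

    // Writer side (assumed shape): copy the existing key-value metadata and
    // add writer.model.name only when getName() returns a non-null string.
    static Map<String, String> footerMetadata(SketchWriteSupport support,
                                              Map<String, String> extra) {
        Map<String, String> footer = new LinkedHashMap<>(extra);
        String name = support.getName();
        if (name != null) {
            footer.put(WRITER_MODEL_KEY, name);
        }
        return footer;
    }

    // Reader side (hypothetical helper): was this file written by the given object model?
    static boolean writtenBy(Map<String, String> footer, String modelName) {
        return modelName.equals(footer.get(WRITER_MODEL_KEY));
    }

    public static void main(String[] args) {
        SketchWriteSupport named = new SketchWriteSupport() {
            @Override
            String getName() {
                return "example-model"; // illustrative value, not a real model name
            }
        };
        SketchWriteSupport unnamed = new SketchWriteSupport() { };

        Map<String, String> withName = footerMetadata(named, new LinkedHashMap<>());
        Map<String, String> withoutName = footerMetadata(unnamed, new LinkedHashMap<>());

        System.out.println(writtenBy(withName, "example-model"));
        System.out.println(withoutName.containsKey(WRITER_MODEL_KEY));
    }
}
```

A reader that needs a workaround for a known object-model bug could then branch on a check like writtenBy before interpreting the affected data. Whether that is enough to identify a specific application, rather than just the object model, depends on what each writer returns from getName.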

> Add metadata field describing the application that wrote the file
> -----------------------------------------------------------------
>
>                 Key: PARQUET-899
>                 URL: https://issues.apache.org/jira/browse/PARQUET-899
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Zoltan Ivanfi
>            Priority: Major
>
> Although the Parquet library should behave the same regardless of what application uses it, occasionally serious interoperability bugs are introduced in specific applications. For example, data written by a specific application may be unnecessarily adjusted or the calculated statistics may be invalid (both actual problems).
> Unfortunately, it is currently not possible to recognize Parquet files affected by application problems, because the metadata does not contain any information about the application using the Parquet library. (The name and version number of the Parquet library are recorded, but that is of limited use because, apart from Impala, the most widespread Parquet writers all use the same Java library.)
> To allow creating workarounds for future known issues, we should introduce new metadata fields that applications can populate. The simplest approach is to have one field for the application name and another for its version number. A more sophisticated approach suggested by [~julienledem] could also reference a list of earlier issues that are known to be fixed in the application that wrote the Parquet file.


