Posted to dev@parquet.apache.org by "Joshua Howard (Jira)" <ji...@apache.org> on 2021/09/14 14:04:00 UTC

[jira] [Comment Edited] (PARQUET-2088) Different created_by field values for application and library

    [ https://issues.apache.org/jira/browse/PARQUET-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414923#comment-17414923 ] 

Joshua Howard edited comment on PARQUET-2088 at 9/14/21, 2:03 PM:
------------------------------------------------------------------

That is exactly the pain point.

It seems to be the convention that an engine depending on the parquet-mr library should populate the `created_by` field with FULL_VERSION. This seems to be the case based on Hive and Spark, but I would like to confirm.

In this case, you couldn't handle bugs in the wrapper code around parquet-mr based on the version of the application, because it wouldn't be present in the file. Do you think that there would be a benefit in creating an additional `created_by_application` field? 
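
To make the gap concrete, here is a minimal sketch of a reader-side check keyed on `created_by`. It assumes the string follows the layout "<application> version <version> (build <hash>)". The class name, the method names, the sample strings, and the version threshold are all hypothetical and chosen for illustration only; this is not the parquet-mr implementation.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not parquet-mr code.
final class CreatedBySketch {
    // Assumed layout: "<application> version <version> (build <hash>)".
    private static final Pattern CREATED_BY =
        Pattern.compile("^(.*?)\\s+version\\s+([^\\s(]+).*$");

    /** Returns {application, version}, or empty if the string does not fit the assumed layout. */
    static Optional<String[]> parse(String createdBy) {
        if (createdBy == null) {
            return Optional.empty();
        }
        Matcher m = CREATED_BY.matcher(createdBy.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2)});
    }

    /** True only if the file claims to be written by parquet-mr older than major.minor. */
    static boolean isParquetMrOlderThan(String createdBy, int major, int minor) {
        Optional<String[]> parsed = parse(createdBy);
        if (!parsed.isPresent() || !"parquet-mr".equals(parsed.get()[0])) {
            return false;
        }
        String[] parts = parsed.get()[1].split("[.\\-]");
        if (parts.length < 2) {
            return false;
        }
        try {
            int fileMajor = Integer.parseInt(parts[0]);
            int fileMinor = Integer.parseInt(parts[1]);
            return fileMajor < major || (fileMajor == major && fileMinor < minor);
        } catch (NumberFormatException e) {
            return false; // unparseable version: do not apply the workaround
        }
    }

    public static void main(String[] args) {
        // Illustrative strings only; the build hash and engine name are made up.
        String fromLibrary = "parquet-mr version 1.6.0 (build abc123)";
        String fromEngine = "my-engine version 3.1.2";

        // The library version is recoverable, so a library-level workaround can be gated.
        System.out.println(isParquetMrOlderThan(fromLibrary, 1, 8));
        // If the engine writes its own name, the library version is lost instead.
        System.out.println(isParquetMrOlderThan(fromEngine, 1, 8));
    }
}
```

With a single field, a reader sees either the library version or the application version, never both, which is why a second field would be needed to gate workarounds on each independently.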


was (Author: joshthoward@gmail.com):
That is exactly the pain point. Is it the convention that if an engine depends on the parquet-mr library then the `created_by` field should be the FULL_VERSION attribute (seems to be the case, but I want to make sure that it is intended this way)? If so, then one could have the issue that you couldn't handle bugs in the wrapper code around parquet-mr that couldn't be handled based on the version of the engine because it wouldn't be present in the file. Do you think that there would be a benefit in creating an additional `created_by_application` field? 

> Different created_by field values for application and library
> -------------------------------------------------------------
>
>                 Key: PARQUET-2088
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2088
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: format-2.9.0
>            Reporter: Joshua Howard
>            Priority: Minor
>
> There seems to be a discrepancy in the Parquet format created_by field regarding how it should be filled out. The parquet-mr library uses this value to enable/disable features based on the parquet-mr version [here|https://github.com/apache/parquet-mr/blob/5f403501e9de05b6aa48f028191b4e78bb97fb12/parquet-column/src/main/java/org/apache/parquet/CorruptDeltaByteArrays.java#L64-L68]. Meanwhile, users are encouraged to make use of the application version [here|https://www.javadoc.io/doc/org.apache.parquet/parquet-format/latest/org/apache/parquet/format/FileMetaData.html]. It seems like there are multiple fields needed for an application and library version. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)