Posted to dev@parquet.apache.org by "Joshua Howard (Jira)" <ji...@apache.org> on 2021/09/14 14:04:00 UTC
[jira] [Comment Edited] (PARQUET-2088) Different created_by field values for application and library
[ https://issues.apache.org/jira/browse/PARQUET-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414923#comment-17414923 ]
Joshua Howard edited comment on PARQUET-2088 at 9/14/21, 2:03 PM:
------------------------------------------------------------------
That is exactly the pain point.
The convention seems to be that an engine depending on the parquet-mr library populates the `created_by` field with FULL_VERSION. Hive and Spark both appear to do this, but I would like to confirm.
In that case, you couldn't handle bugs in the wrapper code around parquet-mr based on the version of the application, because that version wouldn't be present in the file. Do you think there would be a benefit in creating an additional `created_by_application` field?
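To make the pain point concrete, here is a minimal Python sketch (not parquet-mr code) of a reader that gates a workaround on `created_by`. It assumes the string has the shape "parquet-mr version 1.7.0 (build abc123)"; the function names, the engine names in the comments, and the `fixed_in` threshold are illustrative assumptions only.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumed shape: "<application> version <x.y.z> (build <hash>)"; the build part is optional.
CREATED_BY = re.compile(
    r"^(?P<app>.*?)\s+version\s+(?P<version>[^\s(]+)"
    r"(?:\s*\(build\s*(?P<build>[^)]*)\))?\s*$"
)


class CreatedBy(NamedTuple):
    application: str
    version: Tuple[int, int, int]
    build: Optional[str]


def parse_created_by(value: str) -> Optional[CreatedBy]:
    """Split a created_by string into application, numeric version, and build."""
    m = CREATED_BY.match(value)
    if m is None:
        return None
    nums = re.match(r"(\d+)\.(\d+)\.(\d+)", m.group("version"))
    if nums is None:
        return None
    version = tuple(int(n) for n in nums.groups())
    return CreatedBy(m.group("app"), version, m.group("build"))


def needs_library_workaround(created_by: str, fixed_in=(1, 8, 0)) -> bool:
    """Hypothetical gate: apply a workaround for files written by an older library.

    The fixed_in default is an illustrative threshold, not an authoritative one.
    """
    parsed = parse_created_by(created_by)
    return (
        parsed is not None
        and parsed.application == "parquet-mr"
        and parsed.version < fixed_in
    )


# Two hypothetical engines that both embed the same parquet-mr build write
# the same created_by value, so a reader cannot tell them apart.
written_by_engine_a = "parquet-mr version 1.7.0 (build abc123)"
written_by_engine_b = "parquet-mr version 1.7.0 (build abc123)"
```

The gate works for library bugs, but nothing in the string identifies the engine or its version, so a bug in the wrapper code cannot be keyed on anything. A separate field such as the proposed `created_by_application` would give readers a second value to parse the same way.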
was (Author: joshthoward@gmail.com):
That is exactly the pain point. Is it the convention that if an engine depends on the parquet-mr library then the `created_by` field should be the FULL_VERSION attribute (seems to be the case, but I want to make sure that it is intended this way)? If so, then one could have the issue that you couldn't handle bugs in the wrapper code around parquet-mr that couldn't be handled based on the version of the engine because it wouldn't be present in the file. Do you think that there would be a benefit in creating an additional `created_by_application` field?
> Different created_by field values for application and library
> -------------------------------------------------------------
>
> Key: PARQUET-2088
> URL: https://issues.apache.org/jira/browse/PARQUET-2088
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: format-2.9.0
> Reporter: Joshua Howard
> Priority: Minor
>
> There seems to be a discrepancy in how the Parquet format's created_by field should be filled out. The parquet-mr library uses this value to enable/disable features based on the parquet-mr version [here|https://github.com/apache/parquet-mr/blob/5f403501e9de05b6aa48f028191b4e78bb97fb12/parquet-column/src/main/java/org/apache/parquet/CorruptDeltaByteArrays.java#L64-L68]. Meanwhile, users are encouraged to make use of the application version [here|https://www.javadoc.io/doc/org.apache.parquet/parquet-format/latest/org/apache/parquet/format/FileMetaData.html]. It seems that separate fields are needed for the application version and the library version.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)