Posted to dev@parquet.apache.org by Zoltan Ivanfi <zi...@cloudera.com> on 2017/01/30 14:42:22 UTC

Recording the Parquet client info in the extra metadata

Hi,

Although Parquet should behave the same across its client components,
serious interoperability bugs are occasionally found in released versions
of those Parquet clients. For example, an adjustment is applied to the data
even though it should not be, or the statistics are written incorrectly and
end up being unusable (both are actual problems). Unfortunately, it is
currently not possible to recognize Parquet files written by buggy client
components, because their names and version numbers are not recorded in the
metadata. (The name and version number of the Parquet library are recorded,
but that is of limited use, because apart from Impala, the most widespread
Parquet writers all use the same Java library.)

To prevent similar problems in the future, we would like to add the name
and version number of Hive, SparkSQL and Impala to the extra metadata
section. I would like to ask for your opinions on what this metadata entry
should be called. Two ideas that come to mind: data-producer or
data-writer.
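
To make the idea concrete, here is a minimal sketch in plain Python of how
such an entry could be written and later checked by a reader. It does not
use any Parquet library; the extra metadata is modelled as a simple dict of
strings. The key name "data-writer" is just one of the two candidates, and
the "<name> version <version>" value format, the helper names and the
version numbers are all assumptions made for illustration only.

```python
# Hypothetical key name: one of the two candidates proposed in this mail.
WRITER_KEY = "data-writer"


def writer_metadata(name, version):
    """Build the extra metadata entry identifying the writing component.

    The "<name> version <version>" value format is an assumption.
    """
    return {WRITER_KEY: "{} version {}".format(name, version)}


def parse_writer(extra_metadata):
    """Return (name, version), or None if the writer was not recorded."""
    value = extra_metadata.get(WRITER_KEY)
    if value is None:
        return None
    name, sep, version = value.partition(" version ")
    if not sep:
        # Value does not follow the assumed format; keep it as the name.
        return (value, None)
    return (name, version)


def written_by_known_buggy_client(extra_metadata, buggy_writers):
    """Check the recorded writer against a set of (name, version) pairs."""
    writer = parse_writer(extra_metadata)
    return writer is not None and writer in buggy_writers


# Illustrative values only; these are not real bug reports.
extra = writer_metadata("Hive", "1.2.1")
known_buggy = {("Hive", "1.2.1")}
affected = written_by_known_buggy_client(extra, known_buggy)
```

A reader that finds no such entry cannot tell which component wrote the
file, so in this sketch files without the entry are treated as not known to
be affected.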

Would you share your suggestions?

Thanks,

Zoltan