Posted to dev@parquet.apache.org by Damian Guy <da...@gmail.com> on 2015/08/11 11:48:29 UTC

Make the parquet schema consistent across all input formats

Hi,

I've hit an issue recently with the parquet schema produced from
parquet-protobuf. As it is not consistent with what Avro and Thrift
produce, a downstream app, Spark SQL, couldn't read files containing
repeated types. We've since had this fixed here:
https://issues.apache.org/jira/browse/SPARK-9340
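To illustrate the kind of mismatch I mean (these schemas are sketched from memory, not copied from the ticket): for a protobuf field such as `repeated int32 ids = 1;`, parquet-protobuf wrote a bare repeated primitive, while the Avro/Thrift paths emit a group annotated as a LIST. A reader that only expects one of the two forms fails on the other:

```
message ProtobufStyle {
  repeated int32 ids;
}

message AvroStyle {
  required group ids (LIST) {
    repeated int32 array;
  }
}
```

Both encode "a list of ints", but downstream consumers like Spark SQL have to special-case each shape.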

While it is possible for downstream users of parquet to handle the
incompatibilities between the various input formats, it seems a shame
that parquet doesn't produce a consistent schema across all of them.
This would make working with parquet much simpler, as the rules for
converting to/from a list etc. would always be the same, and it would
make for simpler, less error-prone code. Besides, I thought this was
one of the reasons for using parquet...

I submitted a pull request that addresses the issue we have been facing:
https://github.com/apache/parquet-mr/pull/253

Is there any reason why you wouldn't want to have a consistent parquet
representation?

Thanks,
Damian

Re: Make the parquet schema consistent across all input formats

Posted by Nathan Howell <nh...@godaddy.com>.
On 8/11/15, 2:48 AM, "Damian Guy" <da...@gmail.com> wrote:

>While it is possible for downstream users of parquet to handle the
>incompatibilities between the various input formats, it seems a shame
>that parquet doesn't produce a consistent schema across all of them.
>This would make working with parquet much simpler, as the rules for
>converting to/from a list etc. would always be the same, and it would
>make for simpler, less error-prone code. Besides, I thought this was
>one of the reasons for using parquet...

After struggling with similar fixes for Pig and Hive last year, I tried a different approach: make parquet-column handle all of the type coercions and compatibility rules, instead of placing this burden on the parquet-&lt;encoding&gt; libraries and various applications. It already does a bit of this to handle different repetition levels (reading optional fields as a list, etc.) and primitive widening (e.g. reading an int as a long). Similarly, a list of non-nullable integers should be readable as a list of nullable integers, so this should also be supported as a first-class concept in parquet-column.
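The compatibility rules described above can be sketched as a simple "can the requested type be read from the file's type" check. This is a hypothetical illustration, not the parquet-column API; the class and method names are invented:

```java
// Sketch of two of the coercion rules mentioned above: primitive
// widening (int32 -> int64, float -> double) and nullability
// relaxation (a required column may be read as optional, but not
// the reverse). Names here are illustrative, not parquet-mr APIs.
public class SchemaCoercion {
    enum Primitive { INT32, INT64, FLOAT, DOUBLE }

    // Widening: a value of type `from` in the file can be promoted
    // losslessly to type `to` requested by the reader.
    static boolean widens(Primitive from, Primitive to) {
        if (from == to) return true;
        if (from == Primitive.INT32 && to == Primitive.INT64) return true;
        if (from == Primitive.FLOAT && to == Primitive.DOUBLE) return true;
        return false;
    }

    // Nullability: a required (non-nullable) file column satisfies
    // any request; an optional column cannot satisfy a required read.
    static boolean repetitionCompatible(boolean fileRequired, boolean readRequired) {
        return fileRequired || !readRequired;
    }

    static boolean canRead(Primitive fileType, boolean fileRequired,
                           Primitive readType, boolean readRequired) {
        return widens(fileType, readType)
            && repetitionCompatible(fileRequired, readRequired);
    }

    public static void main(String[] args) {
        // required int32 in the file, read as optional int64: allowed
        System.out.println(canRead(Primitive.INT32, true, Primitive.INT64, false));
        // optional int64 in the file, read as required int64: rejected
        System.out.println(canRead(Primitive.INT64, false, Primitive.INT64, true));
    }
}
```

Pushing checks like these into parquet-column means each object model and each downstream engine no longer has to re-implement them.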

It’s an old and incomplete patch, but it demonstrates that incorporating such a change into Pig added about 4 new lines of code and deleted 100. The Hive fix was similar.

Entire patch: https://github.com/NathanHowell/incubator-parquet-mr/commit/3e74b52a7cdade3d38b5829e47fd55315bbb02f9

Pig modification: https://github.com/NathanHowell/incubator-parquet-mr/commit/3e74b52a7cdade3d38b5829e47fd55315bbb02f9#diff-357f986dd06a50315198f8fd08bb81b6R139


-n