Posted to dev@parquet.apache.org by "Qinghui Xu (JIRA)" <ji...@apache.org> on 2018/11/05 15:25:00 UTC

[jira] [Created] (PARQUET-1455) Handle "unknown" enum values for parquet-protobuf

Qinghui Xu created PARQUET-1455:
-----------------------------------

             Summary: Handle "unknown" enum values for parquet-protobuf
                 Key: PARQUET-1455
                 URL: https://issues.apache.org/jira/browse/PARQUET-1455
             Project: Parquet
          Issue Type: Bug
            Reporter: Qinghui Xu


Background - 

In protobuf, an enum behaves more like an integer than a string, and it is encoded as an integer on the wire.
Each enum value is associated with a number (integer), and an enum field can be set with a number directly, regardless of whether that number is associated with a defined enum value. When an enum field is set with a number that does not match any enum value defined in the schema, reading the field through the protobuf reflection API (as parquet-protobuf does) returns a label "UNKNOWN_ENUM_<enumName>_<number>" generated by protobuf reflection. As a result, parquet-protobuf writes the string "UNKNOWN_ENUM_<enumName>_<number>" into the enum column whenever its protobuf schema does not recognize the number.
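
To make the write-side behavior concrete, here is a minimal sketch in plain Java. It does not use the protobuf library; the class EnumLabelSketch, the method labelFor, and the Color enum are hypothetical names for illustration only. The label format follows the description in this issue, and the exact string produced by a given protobuf version may differ.

```java
import java.util.Map;

// Hypothetical model of how a number resolves to an enum label.
// This imitates the behavior described above; it is NOT the protobuf API.
final class EnumLabelSketch {

    // Returns the declared label for `number`, or a synthetic
    // "UNKNOWN_ENUM_<enumName>_<number>" label when the schema
    // (modeled here as a number-to-label map) does not declare it.
    static String labelFor(String enumName, Map<Integer, String> declared, int number) {
        String label = declared.get(number);
        return label != null ? label : "UNKNOWN_ENUM_" + enumName + "_" + number;
    }

    public static void main(String[] args) {
        // Assumed schema: enum Color { RED = 0; GREEN = 1; }
        Map<Integer, String> color = Map.of(0, "RED", 1, "GREEN");
        System.out.println(labelFor("Color", color, 1)); // declared value
        System.out.println(labelFor("Color", color, 7)); // number not in the schema
    }
}
```

In this model, the string returned for the undeclared number is what would end up in the parquet enum column.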

 

Problem -

There are two cases in which an unknown enum value can appear when using parquet-protobuf:
 1. The protobuf message already contains an unknown enum value when it is written to parquet (people sometimes manipulate enums using numbers), so the label "UNKNOWN_ENUM_*" is written as a string in parquet. When the data is read back into protobuf, the reader encounters this "true" unknown value.
 2. The protobuf message contains a valid value when it is written to parquet, but the reader uses an outdated proto schema that is missing some enum values. Enum values absent from the old schema are therefore "unknown" to the reader.

The current behavior of the parquet-protobuf reader is to reject both cases with a runtime exception. This does not make sense in case 1: the write path respects protobuf enum behavior while the read path does not. Case 2 should also be handled, since a protobuf user may be interested in the number rather than the label.
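
One possible way to handle case 1 on the read side is to recover the number from the synthetic label instead of throwing. The following is only a sketch under assumptions: UnknownEnumLabelParser and numberOf are hypothetical names, the label format is the one described in this issue, and nothing here is the actual parquet-protobuf implementation. Case 2 is not covered, because the label alone does not carry the number.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts <number> from "UNKNOWN_ENUM_<enumName>_<number>".
final class UnknownEnumLabelParser {

    // Returns the enum number encoded in `label`, or an empty result when the
    // label is not a synthetic unknown-enum label for `enumName`.
    static OptionalInt numberOf(String enumName, String label) {
        // Pattern.quote guards against regex metacharacters in the enum name;
        // the optional minus sign allows negative enum numbers.
        Pattern p = Pattern.compile("UNKNOWN_ENUM_" + Pattern.quote(enumName) + "_(-?\\d+)");
        Matcher m = p.matcher(label);
        if (!m.matches()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // Digits that overflow a 32-bit int are treated as not parseable.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(numberOf("Color", "UNKNOWN_ENUM_Color_7"));
        System.out.println(numberOf("Color", "GREEN"));
    }
}
```

A reader built along these lines could set the enum field by number when the result is present, and fall back to its existing label lookup otherwise.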

 


