Posted to dev@parquet.apache.org by ying <yi...@gmail.com> on 2019/06/14 00:42:42 UTC

Parquet proto writer de-nest Protobuf wrapper classes

Dear Parquet community:

We are working on a data pipeline that takes in protobuf data and writes it
to Parquet. Currently we rely on the Parquet proto writer
<https://github.com/apache/parquet-mr/tree/master/parquet-protobuf>.

While the existing Parquet protobuf writer preserves the full message
structure of a Protobuf definition, our users often prefer de-nesting the
protobuf wrapper classes, so that the field itself simply holds the
wrapper's "value" data.  We have implemented some basic functionality to
achieve this on top of the existing Parquet proto writer. For details,
please refer to PARQUET-1595
<https://issues.apache.org/jira/browse/PARQUET-1595>.

We would like to solicit comments, and would be happy to contribute if the
community thinks it is a sound idea to pursue.  Any comments or pointers to
related prior discussions are welcome.

Thanks!

-
Ying

Re: Parquet proto writer de-nest Protobuf wrapper classes

Posted by ying <yi...@gmail.com>.
Hi Qinghui:

Thanks for the detailed extra explanations.  Yes, we found (in our use case)
that de-nesting these wrapper messages has great benefits in terms of
user-friendliness. For example, with the current Parquet writer, a protobuf
field UUID defined as StringValue is written as a UUID GROUP plus
UUID.value as binary (string). When querying from Hive/Presto, it is
much easier for data engineers/scientists to refer to such a field
directly, say "UUID=xxxxx" rather than "UUID.value=xxxxx".  A similar
rationale applies to a few other protobuf well-known types (WKT), such as
Timestamp and Duration.
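
To make the UUID example concrete, here is a minimal sketch of the schema
change. It uses a toy schema model, not the parquet-mr API: the Primitive and
Group classes, the WRAPPER_TYPES set and both helper functions are
hypothetical names introduced only for illustration. It handles only
single-field wrappers such as StringValue.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical list of wrapper messages to flatten (illustrative only).
WRAPPER_TYPES = {"StringValue", "Int64Value", "BoolValue"}


@dataclass
class Primitive:
    name: str
    ptype: str  # e.g. "binary", "int64"


@dataclass
class Group:
    name: str
    proto_type: str  # protobuf message type this group came from
    fields: List[Union["Group", Primitive]] = field(default_factory=list)


def denest(node):
    """Replace single-field wrapper groups with their inner primitive."""
    if isinstance(node, Primitive):
        return node
    is_wrapper = (
        node.proto_type in WRAPPER_TYPES
        and len(node.fields) == 1
        and isinstance(node.fields[0], Primitive)
    )
    if is_wrapper:
        # Keep the outer field name, take the inner primitive type.
        return Primitive(node.name, node.fields[0].ptype)
    return Group(node.name, node.proto_type, [denest(f) for f in node.fields])


def leaf_paths(root):
    """Dotted column paths under the root message, as a query would name them."""
    def walk(node, prefix):
        path = prefix + (node.name,)
        if isinstance(node, Primitive):
            return [".".join(path)]
        return [p for child in node.fields for p in walk(child, path)]

    return [p for child in root.fields for p in walk(child, ())]


event = Group("Event", "Event", [
    Group("UUID", "StringValue", [Primitive("value", "binary")]),
    Primitive("count", "int64"),
])
print(leaf_paths(event))          # nested layout
print(leaf_paths(denest(event)))  # de-nested layout
```

With the nested layout the column is addressed as UUID.value; after
de-nesting it is addressed as UUID.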

Good observation on maintaining compatibility during the reading process.
Although not a must in our use case, I can see there is value in having an
end-to-end solution which allows de-nested fields to be read back
consistently according to their protobuf definitions.  For now, I would
assume a similar configuration could be applied on the reader side, which
allows de-nested fields to be mapped back to their original types. The
reader is given the original protobuf definition, and so should be able to
detect the discrepancy between the de-nested Parquet data and the nested
protobuf schema.
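
A rough sketch of what such reader-side mapping could look like, under the
assumption that records and schemas are modelled as plain dicts. This is not
the parquet-protobuf reader; renest, WRAPPER_TYPES and the dict layout are
hypothetical. The reader walks the nested protobuf schema and re-wraps any
wrapper-typed field that arrives as a bare value.

```python
# Hypothetical list of wrapper messages that may have been flattened.
WRAPPER_TYPES = {"StringValue", "Int64Value", "BoolValue"}


def renest(record, schema):
    """Map a (possibly de-nested) record back to the nested protobuf shape.

    schema maps field name -> type name (str) or a nested schema (dict).
    """
    out = {}
    for name, ftype in schema.items():
        value = record.get(name)
        if isinstance(ftype, dict):
            # Ordinary nested message: recurse unless it is unset.
            out[name] = None if value is None else renest(value, ftype)
        elif ftype in WRAPPER_TYPES:
            if value is None or isinstance(value, dict):
                # Unset, or written by the old nested layout: keep as-is.
                out[name] = value
            else:
                # De-nested layout: restore the wrapper struct.
                out[name] = {"value": value}
        else:
            out[name] = value
    return out


schema = {"UUID": "StringValue", "count": "int64",
          "owner": {"name": "StringValue"}}

denested_row = {"UUID": "abc", "count": 3, "owner": {"name": None}}
nested_row = {"UUID": {"value": "abc"}, "count": 3, "owner": {"name": None}}

print(renest(denested_row, schema))
print(renest(nested_row, schema))
```

Both the de-nested row and a row written in the old nested layout come out
in the same nested shape, which is the kind of consistency the reader would
need.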

Other thoughts and comments are welcome as well.

-
Ying


On Mon, Jun 17, 2019 at 5:10 AM XU Qinghui <qi...@gmail.com> wrote:

> Hello, Ying
>
> From my own experience, the proposal seems interesting. To give some
> more context about the "protobuf wrapper" for people who are not familiar
> with it: protobuf3 drops support for "null" semantics for primitives, both
> in its wire format and in its API. For people who wish to have nullable
> fields, it provides "wrapper" messages that nest the primitive field in a
> struct. The current parquet-protobuf implementation converts the protobuf
> schema to a Parquet schema faithfully, so every wrapper becomes an
> intermediate struct in the Parquet field path. De-nesting those
> wrappers should make the Parquet file (schema) easier to use.
> At the same time, it seems to me the proposal is more focused on
> writing. It may be worth thinking about how to make reading
> backward/forward compatible.
>
> cc @lukasnalezenec @zivanfi @rdblue
>
> Best regards,

Re: Parquet proto writer de-nest Protobuf wrapper classes

Posted by XU Qinghui <qi...@gmail.com>.
Hello, Ying

From my own experience, the proposal seems interesting. To give some
more context about the "protobuf wrapper" for people who are not familiar
with it: protobuf3 drops support for "null" semantics for primitives, both
in its wire format and in its API. For people who wish to have nullable
fields, it provides "wrapper" messages that nest the primitive field in a
struct. The current parquet-protobuf implementation converts the protobuf
schema to a Parquet schema faithfully, so every wrapper becomes an
intermediate struct in the Parquet field path. De-nesting those
wrappers should make the Parquet file (schema) easier to use.
At the same time, it seems to me the proposal is more focused on
writing. It may be worth thinking about how to make reading
backward/forward compatible.
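
As a small illustration of the nullability point, the sketch below flattens
one record on the write side. It assumes messages are modelled as plain
dicts, with an unset wrapper represented as None; flatten_record and
WRAPPER_TYPES are hypothetical names, not part of parquet-protobuf. The idea
is that an unset wrapper becomes a null in the de-nested column, while a
wrapper holding a default value such as "" stays distinguishable from it.

```python
# Hypothetical list of wrapper messages to flatten.
WRAPPER_TYPES = {"StringValue", "Int64Value", "BoolValue"}


def flatten_record(record, schema):
    """De-nest wrapper fields of one record, keeping unset wrappers as None.

    schema maps field name -> type name (str) or a nested schema (dict).
    """
    out = {}
    for name, ftype in schema.items():
        value = record.get(name)
        if isinstance(ftype, dict):
            # Ordinary nested message: recurse unless it is unset.
            out[name] = None if value is None else flatten_record(value, ftype)
        elif ftype in WRAPPER_TYPES:
            # Unset wrapper -> null; set wrapper -> its inner value.
            out[name] = None if value is None else value["value"]
        else:
            # Plain proto3 primitive: always has a value, never null.
            out[name] = value
    return out


schema = {"UUID": "StringValue", "retries": "Int64Value", "count": "int64"}
record = {"UUID": {"value": ""}, "retries": None, "count": 0}

print(flatten_record(record, schema))
```

Here UUID is set to the empty string and retries is unset, so the two remain
different in the flattened row, provided the target column is optional.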

cc @lukasnalezenec @zivanfi @rdblue

Best regards,

