Posted to dev@parquet.apache.org by Aaron Niskode-Dossett <an...@etsy.com.INVALID> on 2020/09/25 13:58:25 UTC

protobuf3 and oneof fields

Hello,

I am experimenting with serializing protobuf3 messages to Parquet and have a
question about how "oneof" fields should be treated. Here is an example. I'm
running Parquet 1.11.1 with the patch from PARQUET-1684 applied. That JIRA is
about how default values are written out, and seems related to my question.

SCHEMA
--------
message Person {
  int32 foo = 1;
  oneof optional_bar {
    int32 bar_int = 200;
    int32 bar_int2 = 201;
    string bar_string = 300;
  }
}

CODE
--------
I set values for foo and bar_string:

for (int i = 0; i < 3; i += 1) {
    // Setting bar_string makes it the active member of the optional_bar oneof.
    com.etsy.grpcparquet.Person message = Person.newBuilder()
            .setFoo(i)
            .setBarString("hello world")
            .build();
    message.writeDelimitedTo(out);
}

And then I write the protobuf file out to Parquet.

RESULT
-----------
$ parquet-tools show example.parquet


+-------+-----------+------------+--------------+
|   foo |   bar_int |   bar_int2 | bar_string   |
|-------+-----------+------------+--------------|
|     0 |         0 |          0 | hello world  |
|     1 |         0 |          0 | hello world  |
|     2 |         0 |          0 | hello world  |
+-------+-----------+------------+--------------+

I would expect that bar_int and bar_int2 are EMPTY for all three rows since
only bar_string is set in the oneof.

Is this the right expectation for me to have?

Thank you!

-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: protobuf3 and oneof fields

Posted by Aaron Niskode-Dossett <an...@etsy.com.INVALID>.
I played around with the code and found a simple, maybe too simple,
solution and opened a PR.  Fingers crossed.

On Tue, Sep 29, 2020 at 10:55 AM Aaron Niskode-Dossett <
aniskodedossett@etsy.com> wrote:

> Thank you, David, I agree with your conclusions.  I opened PARQUET-1917.

-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: protobuf3 and oneof fields

Posted by Aaron Niskode-Dossett <an...@etsy.com.INVALID>.
Thank you, David, I agree with your conclusions.  I opened PARQUET-1917.

On Tue, Sep 29, 2020 at 10:18 AM David <da...@gmail.com> wrote:

> Hello,
>
> Perhaps a bit more nuance here.  I believe that the values are technically
> correct (they should be the default value of 0), but we should not be
> storing them as 0 values.  We need to check the hasBar*() to determine if
> the value should be stored or omitted.
>
> Thanks.

-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy

Re: protobuf3 and oneof fields

Posted by David <da...@gmail.com>.
Hello,

Perhaps a bit more nuance here. I believe that the values are technically
correct (the getters return the default value of 0 for unset fields), but we
should not be storing them as 0 values. We need to check the hasBar*() methods
to determine whether the value should be stored or omitted.
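
To illustrate the presence check, here is a minimal plain-Java sketch. It does
not use protobuf or parquet-protobuf: the nested Person class, the BarCase enum,
and both row methods are hypothetical stand-ins for the generated code and the
writer. The getters mimic protobuf's behavior of returning the type default for
a oneof member that is not set, which is why the getter alone cannot tell
"unset" apart from "set to 0".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of the Person message; hypothetical, not generated protobuf code.
class OneofRowSketch {

    // Which member of the optional_bar oneof is currently active.
    enum BarCase { BAR_INT, BAR_INT2, BAR_STRING, NOT_SET }

    static final class Person {
        final int foo;
        final BarCase barCase;
        private final Object barValue; // single slot shared by all oneof members

        Person(int foo, BarCase barCase, Object barValue) {
            this.foo = foo;
            this.barCase = barCase;
            this.barValue = barValue;
        }

        // Like protobuf getters, these return the type default when the member is unset.
        int getBarInt() { return barCase == BarCase.BAR_INT ? (Integer) barValue : 0; }
        int getBarInt2() { return barCase == BarCase.BAR_INT2 ? (Integer) barValue : 0; }
        String getBarString() { return barCase == BarCase.BAR_STRING ? (String) barValue : ""; }
    }

    // Naive conversion: copies every getter, so unset members are stored as 0 or "".
    static Map<String, Object> naiveRow(Person p) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("foo", p.foo);
        row.put("bar_int", p.getBarInt());
        row.put("bar_int2", p.getBarInt2());
        row.put("bar_string", p.getBarString());
        return row;
    }

    // Presence-aware conversion: only the active oneof member is stored; the rest are null.
    static Map<String, Object> presenceAwareRow(Person p) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("foo", p.foo);
        row.put("bar_int", p.barCase == BarCase.BAR_INT ? p.getBarInt() : null);
        row.put("bar_int2", p.barCase == BarCase.BAR_INT2 ? p.getBarInt2() : null);
        row.put("bar_string", p.barCase == BarCase.BAR_STRING ? p.getBarString() : null);
        return row;
    }

    public static void main(String[] args) {
        Person p = new Person(0, BarCase.BAR_STRING, "hello world");
        System.out.println(naiveRow(p));         // {foo=0, bar_int=0, bar_int2=0, bar_string=hello world}
        System.out.println(presenceAwareRow(p)); // {foo=0, bar_int=null, bar_int2=null, bar_string=hello world}
    }
}
```

With the real generated class, the same information should be available from
the oneof case accessor (getOptionalBarCase() for this schema) or from
hasField() with the field's descriptor.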

Thanks.

On Tue, Sep 29, 2020 at 10:39 AM David <da...@gmail.com> wrote:

> Hello,
>
> I too have been poking around the Parquet-Proto package as well.
>
> I would expect "bar_int" and "bar_int2" to be 'null' here.
>
> Have you filed a JIRA with this reproduction?
>
> Thanks.

Re: protobuf3 and oneof fields

Posted by David <da...@gmail.com>.
Hello,

I have been poking around the Parquet-Proto package as well.

I would expect "bar_int" and "bar_int2" to be 'null' here.

Have you filed a JIRA with this reproduction?

Thanks.
