Posted to dev@parquet.apache.org by Aaron Niskode-Dossett <an...@etsy.com.INVALID> on 2020/09/25 13:58:25 UTC
protobuf3 and oneof fields
Hello,
I am experimenting with serializing protobuf3 messages to Parquet and have a
question about how "oneof" fields should be treated. I will describe an
example. I'm running parquet 1.11.1 with PARQUET-1684 applied. That JIRA
is about how default values are written out, and seems related to my
question.
SCHEMA
--------
message Person {
  int32 foo = 1;
  oneof optional_bar {
    int32 bar_int = 200;
    int32 bar_int2 = 201;
    string bar_string = 300;
  }
}
CODE
--------
I set values for foo and bar_string
for (int i = 0; i < 3; i += 1) {
    com.etsy.grpcparquet.Person message = Person.newBuilder()
        .setFoo(i)
        .setBarString("hello world")
        .build();
    message.writeDelimitedTo(out);
}
And then I write the protobuf file out to parquet.
RESULT
-----------
$ parquet-tools show example.parquet
+-------+-----------+------------+--------------+
| foo | bar_int | bar_int2 | bar_string |
|-------+-----------+------------+--------------|
| 0 | 0 | 0 | hello world |
| 1 | 0 | 0 | hello world |
| 2 | 0 | 0 | hello world |
+-------+-----------+------------+--------------+
I would expect that bar_int and bar_int2 are EMPTY for all three rows since
only bar_string is set in the oneof.
Is this the right expectation for me to have?
Thank you!
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
Re: protobuf3 and oneof fields
Posted by Aaron Niskode-Dossett <an...@etsy.com.INVALID>.
I played around with the code and found a simple, maybe too simple,
solution and opened a PR. Fingers crossed.
On Tue, Sep 29, 2020 at 10:55 AM Aaron Niskode-Dossett <
aniskodedossett@etsy.com> wrote:
> Thank you, David, I agree with your conclusions. I opened PARQUET-1917.
>
> On Tue, Sep 29, 2020 at 10:18 AM David <da...@gmail.com> wrote:
>
>> Hello,
>>
>> Perhaps a bit more nuance here. I believe that the values are technically
>> correct (they should be the default value of 0), but we should not be
>> storing them as 0 values. We need to check the hasBar*() to determine if
>> the value should be stored or omitted.
>>
>> Thanks.
>>
>> On Tue, Sep 29, 2020 at 10:39 AM David <da...@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > I too have been poking around the Parquet-Proto package as well.
>> >
>> > I would expect "bar_int" and "bar_int2" to be 'null' here.
>> >
>> > Have you filed a JIRA with this reproduction?
>> >
>> > Thanks.
>> >
>> > On Fri, Sep 25, 2020 at 9:58 AM Aaron Niskode-Dossett
>> > <an...@etsy.com.invalid> wrote:
>> >
>> >> Hello,
>> >>
>> >> I am experimenting with serializing protobuf3 to parquet and have a
>> >> question about how "oneOf" fields should be treated. I will describe an
>> >> example. I'm running parquet 1.11.1 with PARQUET-1684 applied. That JIRA
>> >> is about how default values are written out, and seems related to my
>> >> question.
>> >>
>> >> SCHEMA
>> >> --------
>> >> message Person {
>> >>   int32 foo = 1;
>> >>   oneof optional_bar {
>> >>     int32 bar_int = 200;
>> >>     int32 bar_int2 = 201;
>> >>     string bar_string = 300;
>> >>   }
>> >> }
>> >>
>> >> CODE
>> >> --------
>> >> I set values for foo and bar_string
>> >>
>> >> for (int i = 0; i < 3; i += 1) {
>> >>     com.etsy.grpcparquet.Person message = Person.newBuilder()
>> >>         .setFoo(i)
>> >>         .setBarString("hello world")
>> >>         .build();
>> >>     message.writeDelimitedTo(out);
>> >> }
>> >> And then I write the protobuf file out to parquet.
>> >>
>> >> RESULT
>> >> -----------
>> >> $ parquet-tools show example.parquet
>> >>
>> >>
>> >> +-------+-----------+------------+--------------+
>> >> | foo | bar_int | bar_int2 | bar_string |
>> >> |-------+-----------+------------+--------------|
>> >> | 0 | 0 | 0 | hello world |
>> >> | 1 | 0 | 0 | hello world |
>> >> | 2 | 0 | 0 | hello world |
>> >> +-------+-----------+------------+--------------+
>> >>
>> >> I would expect that bar_int and bar_int2 are EMPTY for all three rows since
>> >> only bar_string is set in the oneof.
>> >>
>> >> Is this the right expectation for me to have?
>> >>
>> >> Thank you!
>> >>
>> >> --
>> >> Aaron Niskode-Dossett, Data Engineering -- Etsy
>> >>
>> >
>>
>
>
> --
> Aaron Niskode-Dossett, Data Engineering -- Etsy
>
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
Re: protobuf3 and oneof fields
Posted by Aaron Niskode-Dossett <an...@etsy.com.INVALID>.
Thank you, David, I agree with your conclusions. I opened PARQUET-1917.
--
Aaron Niskode-Dossett, Data Engineering -- Etsy
Re: protobuf3 and oneof fields
Posted by David <da...@gmail.com>.
Hello,
Perhaps a bit more nuance here. I believe that the values are technically
correct (the getters return the default value of 0 for unset fields), but we
should not be storing them as 0 values. We need to check the hasBar*()
methods to determine whether each value should be stored or omitted.
Thanks.
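The presence check described above can be sketched in plain Java. This is only an illustration under stated assumptions: the Person class below is a hypothetical hand-written stand-in for the generated message (it does not use the protobuf runtime), and toRow is a made-up helper, not parquet-protobuf code. It shows the getters returning the type default for unset oneof members while the row stores null for them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch: models oneof presence by hand.
 * Not the protobuf runtime and not the parquet-protobuf writer.
 */
public class OneofPresenceSketch {

    /** Which member of the oneof optional_bar is set, if any. */
    enum BarCase { BAR_INT, BAR_INT2, BAR_STRING, NOT_SET }

    /** Hand-written stand-in for the generated Person message. */
    static final class Person {
        final int foo;
        final BarCase barCase;
        final Object barValue; // value of the one member that is set, else null

        Person(int foo, BarCase barCase, Object barValue) {
            this.foo = foo;
            this.barCase = barCase;
            this.barValue = barValue;
        }

        // Getters return the type default when the member is not set.
        int getBarInt() { return barCase == BarCase.BAR_INT ? (Integer) barValue : 0; }
        int getBarInt2() { return barCase == BarCase.BAR_INT2 ? (Integer) barValue : 0; }
        String getBarString() { return barCase == BarCase.BAR_STRING ? (String) barValue : ""; }

        // Presence checks: true only for the member that is actually set.
        boolean hasBarInt() { return barCase == BarCase.BAR_INT; }
        boolean hasBarInt2() { return barCase == BarCase.BAR_INT2; }
        boolean hasBarString() { return barCase == BarCase.BAR_STRING; }
    }

    /** Builds one output row; unset oneof members become null, not a default. */
    static Map<String, Object> toRow(Person p) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("foo", p.foo);
        row.put("bar_int", p.hasBarInt() ? p.getBarInt() : null);
        row.put("bar_int2", p.hasBarInt2() ? p.getBarInt2() : null);
        row.put("bar_string", p.hasBarString() ? p.getBarString() : null);
        return row;
    }

    public static void main(String[] args) {
        // Same data as the reproduction: foo and bar_string are set.
        for (int i = 0; i < 3; i += 1) {
            Person message = new Person(i, BarCase.BAR_STRING, "hello world");
            System.out.println(toRow(message));
        }
    }
}
```

With the reproduction's data, bar_int and bar_int2 come out as null in every row even though getBarInt() and getBarInt2() still return 0.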
Re: protobuf3 and oneof fields
Posted by David <da...@gmail.com>.
Hello,
I have been poking around the Parquet-Proto package as well.
I would expect "bar_int" and "bar_int2" to be 'null' here.
Have you filed a JIRA with this reproduction?
Thanks.