You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@avro.apache.org by Anh Le <an...@gmail.com> on 2020/03/20 06:58:26 UTC

Avro Spec: encoding unions

Hi guys,

I'm reading the current Avro Spec. It states that:

> A union is encoded by first writing a long value indicating the zero-based position within the union of the schema of its value. The value is then encoded per the indicated schema within the union.

But as I dive through the code base, for example: https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java#L123-L125, I see there's no long value here. We've got an Int instead.

Would you please tell me if there's any misunderstanding here.

Thank you (and be strong)!

Re: Avro Spec: encoding unions

Posted by Nandor Kollar <nk...@cloudera.com>.
I agree that it is unlikely to cause any issues, since it is a pretty
unlikely that someone uses unions like this, but I think we should at
least track in a separate ticket that modification are required in C/C++
too.

On Mon, Mar 30, 2020 at 3:28 PM roger peppe <ro...@gmail.com> wrote:

> AIUI, longs are encoded exactly the same as ints, so there should only be
> a problem if your union has more than 217483647 members, which seems
> unlikely to me in practice :)
>
> On Mon, 30 Mar 2020 at 09:00, Andy Le <an...@gmail.com> wrote:
>
>> Hey Nandor,
>>
>> Here what I see:
>> - Java/Perl/Python use int values to encode position indices
>> - C/C++ use long ones instead
>>
>> So is there any incompatibility when a C/C++ program talk to a Java one?
>> If yes, so we have to modify the spec, right?
>>
>>
>> On 2020/03/30 07:55:10, Nandor Kollar <nk...@cloudera.com> wrote:
>> > I think we should be cautious when changing specification, other
>> language
>> > bindings might already use longs as position index. For example, it
>> appears
>> > that C++ implementation does what the spec says now:
>> >
>> https://github.com/apache/avro/blob/master/lang/c%2B%2B/impl/BinaryDecoder.cc#L230
>> ,
>> > and if we restrict this to int in the spec, then we make a breaking
>> change
>> > for sure, in the unlikely situation when one writes a huge union where
>> the
>> > position fits only into a long, then that won't be a valid Avro file any
>> > more - according to the new spec.
>> >
>> > On Sun, Mar 29, 2020 at 12:27 PM Driesprong, Fokko <fokko@driesprong.frl
>> >
>> > wrote:
>> >
>> > > Hi Anh,
>> > >
>> > > It looks like that you've found an inconsistency in the docs there. I
>> > > think we need to update the docs, and state that an int is being
>> written.
>> > >
>> > > Stay strong!
>> > >
>> > > Cheers, Fokko
>> > >
>> > > Op vr 20 mrt. 2020 om 07:58 schreef Anh Le <an...@gmail.com>:
>> > >
>> > >> Hi guys,
>> > >>
>> > >> I'm reading the current Avro Spec. It states that:
>> > >>
>> > >> > A union is encoded by first writing a long value indicating the
>> > >> zero-based position within the union of the schema of its value. The
>> value
>> > >> is then encoded per the indicated schema within the union.
>> > >>
>> > >> But as I dive through the code base, for example:
>> > >>
>> https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java#L123-L125
>> ,
>> > >> I see there's no long value here. We've got an Int instead.
>> > >>
>> > >> Would you please tell me if there's any misunderstanding here.
>> > >>
>> > >> Thank you (and be strong)!
>> > >>
>> > >
>> >
>>
>

Re: Avro Spec: encoding unions

Posted by roger peppe <ro...@gmail.com>.
AIUI, longs are encoded exactly the same as ints, so there should only be a
problem if your union has more than 217483647 members, which seems unlikely
to me in practice :)

On Mon, 30 Mar 2020 at 09:00, Andy Le <an...@gmail.com> wrote:

> Hey Nandor,
>
> Here what I see:
> - Java/Perl/Python use int values to encode position indices
> - C/C++ use long ones instead
>
> So is there any incompatibility when a C/C++ program talk to a Java one?
> If yes, so we have to modify the spec, right?
>
>
> On 2020/03/30 07:55:10, Nandor Kollar <nk...@cloudera.com> wrote:
> > I think we should be cautious when changing specification, other language
> > bindings might already use longs as position index. For example, it
> appears
> > that C++ implementation does what the spec says now:
> >
> https://github.com/apache/avro/blob/master/lang/c%2B%2B/impl/BinaryDecoder.cc#L230
> ,
> > and if we restrict this to int in the spec, then we make a breaking
> change
> > for sure, in the unlikely situation when one writes a huge union where
> the
> > position fits only into a long, then that won't be a valid Avro file any
> > more - according to the new spec.
> >
> > On Sun, Mar 29, 2020 at 12:27 PM Driesprong, Fokko <fokko@driesprong.frl
> >
> > wrote:
> >
> > > Hi Anh,
> > >
> > > It looks like that you've found an inconsistency in the docs there. I
> > > think we need to update the docs, and state that an int is being
> written.
> > >
> > > Stay strong!
> > >
> > > Cheers, Fokko
> > >
> > > Op vr 20 mrt. 2020 om 07:58 schreef Anh Le <an...@gmail.com>:
> > >
> > >> Hi guys,
> > >>
> > >> I'm reading the current Avro Spec. It states that:
> > >>
> > >> > A union is encoded by first writing a long value indicating the
> > >> zero-based position within the union of the schema of its value. The
> value
> > >> is then encoded per the indicated schema within the union.
> > >>
> > >> But as I dive through the code base, for example:
> > >>
> https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java#L123-L125
> ,
> > >> I see there's no long value here. We've got an Int instead.
> > >>
> > >> Would you please tell me if there's any misunderstanding here.
> > >>
> > >> Thank you (and be strong)!
> > >>
> > >
> >
>

Re: Avro Spec: encoding unions

Posted by Andy Le <an...@gmail.com>.
Hey Nandor,

Here what I see:
- Java/Perl/Python use int values to encode position indices
- C/C++ use long ones instead

So is there any incompatibility when a C/C++ program talk to a Java one? If yes, so we have to modify the spec, right?


On 2020/03/30 07:55:10, Nandor Kollar <nk...@cloudera.com> wrote: 
> I think we should be cautious when changing specification, other language
> bindings might already use longs as position index. For example, it appears
> that C++ implementation does what the spec says now:
> https://github.com/apache/avro/blob/master/lang/c%2B%2B/impl/BinaryDecoder.cc#L230,
> and if we restrict this to int in the spec, then we make a breaking change
> for sure, in the unlikely situation when one writes a huge union where the
> position fits only into a long, then that won't be a valid Avro file any
> more - according to the new spec.
> 
> On Sun, Mar 29, 2020 at 12:27 PM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
> 
> > Hi Anh,
> >
> > It looks like that you've found an inconsistency in the docs there. I
> > think we need to update the docs, and state that an int is being written.
> >
> > Stay strong!
> >
> > Cheers, Fokko
> >
> > Op vr 20 mrt. 2020 om 07:58 schreef Anh Le <an...@gmail.com>:
> >
> >> Hi guys,
> >>
> >> I'm reading the current Avro Spec. It states that:
> >>
> >> > A union is encoded by first writing a long value indicating the
> >> zero-based position within the union of the schema of its value. The value
> >> is then encoded per the indicated schema within the union.
> >>
> >> But as I dive through the code base, for example:
> >> https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java#L123-L125,
> >> I see there's no long value here. We've got an Int instead.
> >>
> >> Would you please tell me if there's any misunderstanding here.
> >>
> >> Thank you (and be strong)!
> >>
> >
> 

Re: Avro Spec: encoding unions

Posted by Nandor Kollar <nk...@cloudera.com>.
I think we should be cautious when changing specification, other language
bindings might already use longs as position index. For example, it appears
that C++ implementation does what the spec says now:
https://github.com/apache/avro/blob/master/lang/c%2B%2B/impl/BinaryDecoder.cc#L230,
and if we restrict this to int in the spec, then we make a breaking change
for sure, in the unlikely situation when one writes a huge union where the
position fits only into a long, then that won't be a valid Avro file any
more - according to the new spec.

On Sun, Mar 29, 2020 at 12:27 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Hi Anh,
>
> It looks like that you've found an inconsistency in the docs there. I
> think we need to update the docs, and state that an int is being written.
>
> Stay strong!
>
> Cheers, Fokko
>
> Op vr 20 mrt. 2020 om 07:58 schreef Anh Le <an...@gmail.com>:
>
>> Hi guys,
>>
>> I'm reading the current Avro Spec. It states that:
>>
>> > A union is encoded by first writing a long value indicating the
>> zero-based position within the union of the schema of its value. The value
>> is then encoded per the indicated schema within the union.
>>
>> But as I dive through the code base, for example:
>> https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java#L123-L125,
>> I see there's no long value here. We've got an Int instead.
>>
>> Would you please tell me if there's any misunderstanding here.
>>
>> Thank you (and be strong)!
>>
>

Re: Avro Spec: encoding unions

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Hi Anh,

It looks like that you've found an inconsistency in the docs there. I think
we need to update the docs, and state that an int is being written.

Stay strong!

Cheers, Fokko

Op vr 20 mrt. 2020 om 07:58 schreef Anh Le <an...@gmail.com>:

> Hi guys,
>
> I'm reading the current Avro Spec. It states that:
>
> > A union is encoded by first writing a long value indicating the
> zero-based position within the union of the schema of its value. The value
> is then encoded per the indicated schema within the union.
>
> But as I dive through the code base, for example:
> https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/generic/GenericDatumWriter.java#L123-L125,
> I see there's no long value here. We've got an Int instead.
>
> Would you please tell me if there's any misunderstanding here.
>
> Thank you (and be strong)!
>