Posted to user@avro.apache.org by Jonathan Coveney <jc...@gmail.com> on 2013/04/05 18:11:16 UTC

How is a union of multiple primitives handled?

The following gist illustrates my question:

https://gist.github.com/jcoveney/5320422

It seems pretty surprising to me that all of these cases return 1.0, at
least in Python (I will now try this in Java; it's just more verbose). Is
this an issue with Python? Is this an issue, period? Is this unexpected?

At the very least, if you write 1 to ["int", "double"] you'd expect that
it'd get serialized as an int? Or is there a set of rules governing which
primitive type to choose? Is it implementation dependent?

Also, the case where it throws an error, then returns 0 seems completely
wrong. Why would it do that at all? Is it that once it throws an error, it
gets into an inconsistent state and nothing is guaranteed?

Thanks for helping me understand this!

Re: How is a union of multiple primitives handled?

Posted by Pankaj Shroff <sh...@gmail.com>.
From reading threads about issues with parsing Avro union types, and from
struggling to solve my own problems with them, it appears that this is a
common "problem" when using SpecificRecord and SpecificDatumReader. More
specifically, if you pass just a JSON string to the deserializer (such as a
SpecificDatumReader using a JsonDecoder), the parser expects the types to be
explicitly identified in the serialized string. I presume this is done to
preserve type safety when assigning the values read during deserialization
to the fields of the object.

The most popular response on these threads is to suggest simply using
GenericDatumReader. However, it appears to me that if we did that, the most
important and useful functionality of Avro deserialization (and, conversely,
serialization) would be lost, because it would mean that Avro can only be
used if both sender (writer) and receiver (reader) are using Avro, even for
what appear to be format-agnostic encodings (such as ASCII-text JSON,
Protobuf, etc.).

Is there some kind of workaround so that we can continue to benefit from
immensely useful Avro library components like schema builders, encoders,
decoders and such, while having some flexibility in the formatting of the
inputs provided to them?

Pankaj



On Sat, Apr 6, 2013 at 6:30 AM, Jonathan Coveney <jc...@gmail.com> wrote:

> Java has its own issues in this regard, which is that when deserializing a
> JSON String, if there is a union in the Schema then you have to give it
> {"<type>": <data>} which seems wrong to me (see the other email thread I
> started). I asked this question to understand how it should work in python,
> but also to get a sense of what the fix should be. I have made a patch that
> works according to my understanding, but I still am unsure if that
> understanding is correct, as well as if the Java treatment of unions in
> this case is correct (to me it seems needlessly cumbersome).
>
> Thanks for your help
>
>
> 2013/4/5 Curt Hagenlocher <cu...@hagenlocher.org>
>
>> This is a Python-specific issue, and results from the interplay of two
>> implementation-specific features:
>> 1) Python ints, longs and floats can all legally be serialized as an Avro
>> double (or float). See io.py, line 118.
>> 2) The union serializer picks the first type that allows legal
>> serialization.
>>
>> I would be surprised if you got the same thing in Java; it's not the kind
>> of behavior I would expect from a statically-typed language.
>>
>>
>> On Fri, Apr 5, 2013 at 9:11 AM, Jonathan Coveney <jc...@gmail.com> wrote:
>>
>>> The following gist illustrates my question:
>>>
>>> https://gist.github.com/jcoveney/5320422
>>>
>>> It seems pretty surprising to me that all of these cases all return 1.0,
>>> at least in python (I will now do this in Java, it's just more verbose). Is
>>> this an issue with python? Is this an issue period? Is this unexpected?
>>>
>>> At the very least, if you write 1 to ["int", "double"] you'd expect that
>>> it'd get serialized as an int? Or is there a set of rules governing which
>>> primitive type to choose? Is it implementation dependent?
>>>
>>> Also, the case where it throws an error, then returns 0 seems completely
>>> wrong. Why would it do that at all? Is it that once it throws an error, it
>>> gets into an inconsistent state and nothing is guaranteed?
>>>
>>> Thanks for helping me understand this!
>>>
>>
>>
>


-- 
Pankaj Shroff
shroffG@Gmail.com

Re: How is a union of multiple primitives handled?

Posted by Jonathan Coveney <jc...@gmail.com>.
Java has its own issues in this regard: when deserializing a JSON string,
if there is a union in the schema then you have to give it
{"<type>": <data>}, which seems wrong to me (see the other email thread I
started). I asked this question to understand how it should work in Python,
but also to get a sense of what the fix should be. I have made a patch that
works according to my understanding, but I am still unsure whether that
understanding is correct, and whether the Java treatment of unions in
this case is correct (to me it seems needlessly cumbersome).
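
For reference, the {"<type>": <data>} form mentioned above matches the JSON
encoding that the Avro specification defines for union values: a null value
is written as a bare JSON null, and any other value is wrapped in a
single-key object named after its branch. A minimal sketch of that wrapping
for primitive branches follows. The helper name wrap_union_value is
hypothetical and is not part of any Avro library, and the sketch does not
handle types such as bytes that need further encoding.

```python
import json


def wrap_union_value(branch_name, value):
    """Wrap a value in the tagged form used for unions in Avro's JSON encoding.

    Hypothetical helper for illustration only; covers primitive branches
    whose values are already valid JSON (numbers, strings, booleans, null).
    """
    if branch_name == "null":
        # The null branch is written as a bare JSON null, without a wrapper.
        return None
    # Every other branch becomes a one-entry object keyed by the type name.
    return {branch_name: value}


# A value for the union ["null", "int", "double"], written via each branch.
print(json.dumps(wrap_union_value("int", 1)))       # {"int": 1}
print(json.dumps(wrap_union_value("double", 1.0)))  # {"double": 1.0}
print(json.dumps(wrap_union_value("null", None)))   # null
```

The tag is what lets a reader tell the int 1 from the double 1.0 in JSON,
where both would otherwise look like plain numbers.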

Thanks for your help



Re: How is a union of multiple primitives handled?

Posted by Curt Hagenlocher <cu...@hagenlocher.org>.
This is a Python-specific issue, and results from the interplay of two
implementation-specific features:
1) Python ints, longs and floats can all legally be serialized as an Avro
double (or float). See io.py, line 118.
2) The union serializer picks the first type that allows legal
serialization.

I would be surprised if you got the same thing in Java; it's not the kind
of behavior I would expect from a statically-typed language.
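
The interplay of the two features above can be modelled with a short,
self-contained sketch. This is a simplified illustration and not the avro
library's code: is_valid, pick_branch and round_trip are hypothetical names,
only a few primitive types are covered, and range checks are omitted. It
assumes, as described above, that any Python number is accepted for
float/double and that the first accepting branch wins.

```python
# Simplified model of first-match union resolution; NOT the avro library's code.


def is_valid(branch, datum):
    """Return True if datum may be written as the given primitive branch."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(datum, (int, float)) and not isinstance(datum, bool)
    if branch == "null":
        return datum is None
    if branch == "int":
        return isinstance(datum, int) and not isinstance(datum, bool)
    if branch in ("float", "double"):
        # The permissive rule: any Python number is accepted here.
        return is_number
    if branch == "string":
        return isinstance(datum, str)
    return False


def pick_branch(union, datum):
    """Return the first branch of the union that accepts datum."""
    for branch in union:
        if is_valid(branch, datum):
            return branch
    raise TypeError("no branch of %r accepts %r" % (union, datum))


def round_trip(union, datum):
    """Model the value a reader would hand back after datum is written."""
    branch = pick_branch(union, datum)
    if branch in ("float", "double"):
        return float(datum)
    return datum


print(pick_branch(["double", "int"], 1), round_trip(["double", "int"], 1))
print(pick_branch(["int", "double"], 1), round_trip(["int", "double"], 1))
```

Under this model the order of the branches decides the outcome: writing 1 to
["double", "int"] comes back as 1.0, while writing 1 to ["int", "double"]
comes back as 1. The sketch does not reproduce the cases in the gist.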

