Posted to dev@avro.apache.org by "Steven Willis (JIRA)" <ji...@apache.org> on 2014/05/22 20:25:02 UTC

[jira] [Commented] (AVRO-973) Union behavior not consistent

    [ https://issues.apache.org/jira/browse/AVRO-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006252#comment-14006252 ] 

Steven Willis commented on AVRO-973:
------------------------------------

I just discovered this bug. Having an unambiguous way to determine the correct type of a datum in a union (through wrapper classes and such) is certainly desirable. But until then, I really think we need the {{break}} in {{write_union}} after it finds an acceptable schema. I suppose using the first matching schema is just as arbitrary as using the last, but it makes more sense for the types to be in priority order, and it would also be more efficient to stop at the first match. The Ruby code uses {{writers_schema.schemas.find}}, which I believe returns the first matching schema.

All I know is that I was very surprised when I tried to serialize {{True}} to a union of {{['boolean', 'float']}} and got a float back:

{noformat}
>>> from StringIO import StringIO
>>> import avro.schema
>>> from avro.datafile import DataFileReader, DataFileWriter
>>> from avro.io import DatumReader, DatumWriter
>>> avr = StringIO()
>>> writer = DataFileWriter(avr, DatumWriter(), avro.schema.parse('{"name": "foo", "type": "record", "fields": [{"name": "bar", "type": ["boolean", "float"]}]}'))
>>> writer.append({"bar": True})
>>> writer.flush()
>>> avr.seek(0)
>>> reader = DataFileReader(avr, DatumReader())
>>> reader.next()
{u'bar': 1.0}
{noformat}

This is very surprising.
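
For anyone wondering why {{True}} matches the float branch at all: in Python, {{bool}} is a subclass of {{int}}, so any check that accepts ints for a float schema also accepts {{True}}. Below is a minimal, self-contained sketch of the two resolution strategies. The {{validate}} function here is a simplified stand-in that I am assuming behaves roughly like the one in {{avro.io}} for primitive types; it is not the library's code, and the two {{resolve_union_*}} helpers are hypothetical names used only for illustration.

```python
# Simplified stand-in for avro.io.validate (an assumption, NOT the library's
# code): only the primitive cases needed to illustrate union resolution.
def validate(schema_type, datum):
    if schema_type == 'boolean':
        return isinstance(datum, bool)
    if schema_type in ('int', 'long'):
        return isinstance(datum, int)
    if schema_type in ('float', 'double'):
        # bool is a subclass of int in Python, so True passes this check too
        return isinstance(datum, (int, float))
    if schema_type == 'string':
        return isinstance(datum, str)
    return False


def resolve_union_last_match(branches, datum):
    """Loop without a break: the last matching branch wins."""
    index_of_schema = -1
    for i, candidate in enumerate(branches):
        if validate(candidate, datum):
            index_of_schema = i
    return index_of_schema


def resolve_union_first_match(branches, datum):
    """Loop with a break: the first matching branch wins."""
    index_of_schema = -1
    for i, candidate in enumerate(branches):
        if validate(candidate, datum):
            index_of_schema = i
            break
    return index_of_schema


branches = ['boolean', 'float']
print(resolve_union_last_match(branches, True))   # index 1, the float branch
print(resolve_union_first_match(branches, True))  # index 0, the boolean branch
```

Under the same assumptions, the schema from the original report, {{["int","long","float","double","string","boolean"]}}, resolves the integer {{5}} to index 3 (double) without the break and to index 0 (int) with it.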

> Union behavior not consistent
> -----------------------------
>
>                 Key: AVRO-973
>                 URL: https://issues.apache.org/jira/browse/AVRO-973
>             Project: Avro
>          Issue Type: Bug
>          Components: python
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Gaurav Nanda
>              Labels: patch
>         Attachments: AVRO-973-patch-1.patch, AVRO-973-patch-2.patch, AVRO-973-patch-3.patch, AVRO-973-wrapper.patch, AVRO-973-wrapper.patch, test_unions.py
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> Python's union does not respect the order in which the types are specified.
> For the following schema: {"type":"map","values":["int","long","float","double","string","boolean"]}, an integer value is written as a double, but the writer should respect the order in which the types have been specified.
> Fixed Code (io.py):
> def write_union(self, writers_schema, datum, encoder):
>    """
>    A union is encoded by first writing a long value indicating
>    the zero-based position within the union of the schema of its value.
>    The value is then encoded per the indicated schema within the union.
>    """
>    # resolve union
>    index_of_schema = -1
>    for i, candidate_schema in enumerate(writers_schema.schemas):
>      if validate(candidate_schema, datum):
>        index_of_schema = i
>        break  # XXX Add break statement here XXX
>    if index_of_schema < 0: raise AvroTypeException(writers_schema, datum)


