Posted to dev@avro.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2010/09/03 21:33:33 UTC

[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906036#action_12906036 ] 

Doug Cutting commented on AVRO-656:
-----------------------------------

Since this has never been implemented correctly, we might safely change the specification without introducing incompatibilities.  There are many unions in use that contain multiple records with different names, so we must retain at least that ability.  Unions with multiple fixed or enum schemas will have behaved erratically in all implementations and are probably not used much.

We might change the specification to only permit a single fixed or enum in a union.  This would permit Java to conform to the spec for fixed and enums without changing how these types are represented at runtime.  For Python, Ruby and PHP things are more difficult, since, e.g., string, bytes, enum and fixed can all be represented with identical primitive types in these languages, making runtime type determination difficult without introducing more runtime datatypes.

Alternatively, we might change the spec so that unions are only permitted to contain, e.g., a single numeric type (int, long, float or double), a single symbolic type (string, bytes, enum or fixed), a single sequence type (map or array), and any number of records, distinguished by name.  I think this would support most uses that exist today and permit fast writing in most languages.  For compatibility, we might accept schemas at read time that do not conform to this, but generate errors at write time, forcing applications to conform to the new schema requirements when they upgrade to a new version of Avro.
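
To make the second proposal concrete, here is a rough Python sketch of a write-time check.  It is not taken from any Avro implementation: check_union, type_name and full_name are hypothetical names, schemas are assumed to be already-parsed JSON with named types given inline, and types the proposal does not mention (null, boolean) are simply left unconstrained.

```python
# Hypothetical helpers; none of these names come from an Avro library.
NUMERIC = {"int", "long", "float", "double"}
SYMBOLIC = {"string", "bytes", "enum", "fixed"}
SEQUENCE = {"map", "array"}
CATEGORIES = (("numeric", NUMERIC), ("symbolic", SYMBOLIC), ("sequence", SEQUENCE))


def type_name(schema):
    """Return the Avro type name of a schema given as parsed JSON."""
    return schema if isinstance(schema, str) else schema["type"]


def full_name(schema):
    """Combine namespace and name; a dotted name is already a full name."""
    name = schema["name"]
    namespace = schema.get("namespace")
    if "." in name or not namespace:
        return name
    return namespace + "." + name


def check_union(branches):
    """Return violations of the proposed rule; an empty list means conforming."""
    errors = []
    seen_categories = set()
    seen_records = set()
    for branch in branches:
        kind = type_name(branch)
        if kind == "record":
            # Any number of records is allowed, as long as full names differ.
            name = full_name(branch)
            if name in seen_records:
                errors.append("duplicate record " + name)
            seen_records.add(name)
            continue
        for label, members in CATEGORIES:
            if kind in members:
                # At most one branch per category.
                if label in seen_categories:
                    errors.append("more than one %s type: %s" % (label, kind))
                seen_categories.add(label)
    return errors
```

Under this sketch, a writer would reject any union for which check_union returns a non-empty list, while a reader would skip the check in order to keep accepting existing data.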

Ruby, Python and PHP can currently write the wrong branch if two records have the same fields, but this probably occurs rarely.  Also, full recursive validation is expensive (N^2 for nested structures).  So Ruby, Python and PHP should fix their writing of unions to check only the name of records and not to recursively descend any types.  My second proposal above would make this much simpler.
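
A name-only branch lookup for records might look roughly like the following Python sketch.  It assumes the datum carries its record's full name at runtime, which is exactly the kind of extra runtime datatype mentioned above; NamedRecord and resolve_record_branch are hypothetical and do not exist in the current Python implementation.

```python
class NamedRecord(dict):
    """Hypothetical runtime type: a dict of fields tagged with its record's full name."""

    def __init__(self, full_name, fields):
        super().__init__(fields)
        self.full_name = full_name


def full_name(schema):
    """Combine namespace and name; a dotted name is already a full name."""
    name = schema["name"]
    namespace = schema.get("namespace")
    if "." in name or not namespace:
        return name
    return namespace + "." + name


def resolve_record_branch(branches, datum):
    """Return the index of the record branch whose full name matches the datum.

    Only names are compared; the fields are never descended into, so the
    cost is linear in the number of branches regardless of nesting depth.
    """
    for index, branch in enumerate(branches):
        if isinstance(branch, dict) and branch.get("type") == "record":
            if full_name(branch) == datum.full_name:
                return index
    raise ValueError("no record branch named " + datum.full_name)
```

Because the full name is compared rather than the short name, two records with identical fields, or with the same short name in different namespaces, still resolve to distinct branches.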

Thoughts?




> writing unions with multiple records, fixed or enums can choose wrong branch 
> -----------------------------------------------------------------------------
>
>                 Key: AVRO-656
>                 URL: https://issues.apache.org/jira/browse/AVRO-656
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.4.0
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>
> According to the specification, a union may contain multiple instances of a named type, provided they have different names.  There are several bugs in the Java implementation of this when writing data:
>  - for record, only the short-name of the record is checked, so the branch for a record of the same name in a different namespace may be used by mistake
>  - for enum and fixed, the name of the record is not checked, so the first enum or fixed in the union will always be assumed when writing.  in many cases this may cause the wrong data to be written, potentially corrupting output.
> This is not a regression.  This has never been implemented correctly by Java.  Python and Ruby never check names, but rather perform a full, recursive validation of content.
