Posted to dev@avro.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2010/09/03 20:49:32 UTC

[jira] Created: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

writing unions with multiple records, fixed or enums can choose wrong branch 
-----------------------------------------------------------------------------

                 Key: AVRO-656
                 URL: https://issues.apache.org/jira/browse/AVRO-656
             Project: Avro
          Issue Type: Bug
          Components: java
    Affects Versions: 1.4.0
            Reporter: Doug Cutting
            Assignee: Doug Cutting


According to the specification, a union may contain multiple instances of a named type, provided they have different names.  There are several bugs in the Java implementation of this when writing data:
 - for record, only the short name of the record is checked, so the branch for a record of the same name in a different namespace may be used by mistake
 - for enum and fixed, the name of the type is not checked at all, so the first enum or fixed in the union will always be assumed when writing.  In many cases this may cause the wrong data to be written, potentially corrupting output.

This is not a regression.  This has never been implemented correctly by Java.  Python and Ruby never check names, but rather perform a full, recursive validation of content.
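
To make the first bullet concrete, here is a minimal sketch in plain Java. It is not the Avro code: the class BranchLookup, its methods and the names org.a.Event and org.b.Event are hypothetical, and a union is modelled simply as a list of full record names. It only illustrates how matching on the short name can pick the branch of a same-named record from another namespace.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not the Avro API: a union is a list of full record names.
final class BranchLookup {

    // Returns the part of a full name after the last dot.
    static String shortName(String fullName) {
        return fullName.substring(fullName.lastIndexOf('.') + 1);
    }

    // Buggy strategy: compares short names only, so the first record whose
    // short name matches wins, whatever its namespace.
    static int byShortName(List<String> union, String datumFullName) {
        String wanted = shortName(datumFullName);
        for (int i = 0; i < union.size(); i++) {
            if (shortName(union.get(i)).equals(wanted)) {
                return i;
            }
        }
        return -1;
    }

    // Strategy the specification implies: compare namespace-qualified names.
    static int byFullName(List<String> union, String datumFullName) {
        return union.indexOf(datumFullName);
    }

    public static void main(String[] args) {
        List<String> union = Arrays.asList("org.a.Event", "org.b.Event");
        // The datum is an org.b.Event, which is branch 1 of the union.
        System.out.println(byShortName(union, "org.b.Event")); // 0, the wrong branch
        System.out.println(byFullName(union, "org.b.Event"));  // 1
    }
}
```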

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-656:
------------------------------

    Fix Version/s: 1.5.0

Marking this for 1.5.0 so we don't forget about it.



[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906934#action_12906934 ] 

Doug Cutting commented on AVRO-656:
-----------------------------------

I guess we could distinguish multiple fixed schemas in a union by their size, instead of by their name.  That's what the Ruby, Python and PHP implementations already do, more or less.
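
A rough sketch of what size-based selection could look like, in plain Java. The class FixedBySize and its method are hypothetical and do not come from any Avro implementation; each fixed branch is reduced to its declared size.

```java
// Hypothetical sketch, not the Avro API: every fixed branch of the union is
// described only by its declared size in bytes.
final class FixedBySize {

    // Returns the index of the first fixed branch whose declared size equals
    // the length of the datum, or -1 if no branch fits.
    static int resolve(int[] branchSizes, byte[] datum) {
        for (int i = 0; i < branchSizes.length; i++) {
            if (branchSizes[i] == datum.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] union = {4, 16}; // e.g. a 4-byte fixed and a 16-byte fixed
        System.out.println(resolve(union, new byte[16])); // 1
        System.out.println(resolve(union, new byte[5]));  // -1, no branch fits
    }
}
```

Note that this cannot tell apart two fixed schemas of the same size; those would still need to be distinguished by name.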

> would new code be able to read old data written with the above schema?

Yes, I think so.  I don't see why it would not.




[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906075#action_12906075 ] 

Scott Carey commented on AVRO-656:
----------------------------------

OK, I'm going to review all my in use schemas and see what the above options would break.

First, there is the schema used to represent an arbitrary Pig field, which the second alternative would break:

{code}
    List<Schema> pigTypes = new ArrayList<Schema>();
    pigTypes.add(Schema.create(Type.NULL));
    pigTypes.add(Schema.create(Type.BOOLEAN));
    pigTypes.add(Schema.create(Type.INT));
    pigTypes.add(Schema.create(Type.LONG));
    pigTypes.add(Schema.create(Type.FLOAT));
    pigTypes.add(Schema.create(Type.DOUBLE));
    pigTypes.add(Schema.create(Type.STRING));
    pigTypes.add(Schema.create(Type.BYTES));
    pigTypes.add(Schema.createArray(GENERIC_TUPLE));
    pigTypes.add(GENERIC_TUPLE);  // Tuple is a record containing a list of fields of type GENERIC_FIELD_UNION
    pigTypes.add(GENERIC_ELEMENT_MAP); // Map is a map from String to GENERIC_FIELD_UNION
    GENERIC_FIELD_UNION = Schema.createUnion(pigTypes);
{code}

I tried to create a union with multiple fixed types long ago and ran into issues.  I actually thought I was doing something wrong.
I have long since wrapped these in a record, which is how I have avoided this bug:
{code}
[
{"name": "com.rr.avro.Fixed16", "type": "fixed", "size":16},
{"name": "com.rr.avro.Fixed4", "type": "fixed", "size":4},
{"name": "com.rr.avro.MyRecord", "type": "record", "fields": [
  {"name": "hostIp", "type": ["Fixed4", "Fixed16"], "doc": "should always be 4 bytes (IPv4) or 16 bytes (IPv6)"},
   ... (more fields)
  ]}
]
{code}

I have some other unions like this that are important:
["Fixed16", "string", "null"]


So in short, I think the first option makes sense from my use cases and the second one is very restrictive.  
It might make sense to simplify it and say that enum and/or fixed are not allowed in a union at all -- they must be wrapped in a named record.  Limiting it to only one of each might be somewhat useful, but would be more complicated.

Alternatively, making some or all of the unnamed types named might help too.

Making only one symbolic type allowed in a union is restrictive, especially since I already have use cases for combining fixed, string, and bytes in a union. 

What about something like:
["BrowserTypeEnum", "string"] as a union.  BrowserTypeEnum is a canonicalized set of known browsers.  If a user-agent string can't be bucketed into one of the known types, its full string is stored instead.  Sure, we could have a record with an enum and a nullable string in it instead, but then you have a case where it could be both types at once.  The purpose of the union is to guarantee it's only one of the branches.



[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Bruce Martin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914931#action_12914931 ] 

Bruce Martin commented on AVRO-656:
-----------------------------------

In Java (Avro version 1.4), if you use any enum other than the first one in a union, you can get an exception when writing to a file:


java.lang.NullPointerException: null of SaleType of union in field f02 of fields
	at org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:90)
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:85)
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:56)
	at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245)

Caused by: java.lang.NullPointerException
	at org.apache.avro.Schema$EnumSchema.getEnumOrdinal(Schema.java:651)
	at org.apache.avro.generic.GenericDatumWriter.writeEnum(GenericDatumWriter.java:120)
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:65)
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:71)
	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:102)
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:64)


One other issue is that you can use the same symbol in multiple enums, but the code cannot decide which enum you mean when they are used in a union.

e.g. I have used RETURN in both SaleType and PoType, then used SaleType and PoType in the same union:

  enum SaleType {
      RETURN,
      OTHER,
      SALE
  }
  
  enum PoType {
    PURCHASE_ORDER,
    DIRECT_DELIVERY,
    RETURN,
    CONSIGNMENT
  }
  
 

  record fields {
    union {null, int, float, double, SaleType, PoType, letters, string} f02;



[jira] Updated: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-656:
------------------------------

    Attachment: AVRO-656.patch

Here are unit tests for Java that currently fail, illustrating spec non-conformance.



[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906036#action_12906036 ] 

Doug Cutting commented on AVRO-656:
-----------------------------------

Since this has never been implemented correctly, we might safely change the specification without introducing incompatibilities.  There are many unions in use that contain multiple records with different names, so we must retain at least that ability.  Unions with multiple fixed or enum schemas will have behaved erratically in all implementations and are probably not used much.

We might change the specification to only permit a single fixed or enum in a union.  This would permit Java to conform to the spec for fixed and enums without changing how these types are represented at runtime.  For Python, Ruby and PHP things are more difficult, since, e.g., string, bytes, enum and fixed can be represented with identical primitive types in these, making runtime type determination difficult without introducing more runtime datatypes.

Alternatively, we might change the spec so that unions are only permitted to contain, e.g., a single numeric type (int, float, long or double), a single symbolic type (string, bytes, enum or fixed), a single sequence type (map or array), and a number of records, distinguished by name.  I think this would support most uses that exist today and permit fast writing in most languages.  For compatibility, we might accept schemas at read time that do not conform to this, but generate errors at write time, forcing applications to conform to the new schema requirements when they upgrade to a new version of Avro.
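
As a rough illustration of the second proposal, the check below rejects a union that contains more than one type from the same category. This is a sketch in plain Java, not part of Avro: the class UnionCategoryCheck is hypothetical, types are represented by their type-name strings, and leaving null, boolean and record unrestricted is an assumption, since the proposal does not mention null or boolean.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Avro API: union branches are type-name strings.
final class UnionCategoryCheck {

    // Maps a type name to its category, or null if this sketch leaves the
    // type unrestricted (record, null, boolean: an assumption, see lead-in).
    static String category(String type) {
        switch (type) {
            case "int": case "long": case "float": case "double":
                return "numeric";
            case "string": case "bytes": case "enum": case "fixed":
                return "symbolic";
            case "map": case "array":
                return "sequence";
            default:
                return null;
        }
    }

    // True if the union has at most one branch per restricted category.
    static boolean conforms(List<String> unionTypes) {
        Set<String> seen = new HashSet<>();
        for (String type : unionTypes) {
            String c = category(type);
            if (c != null && !seen.add(c)) {
                return false; // second branch of an already used category
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(conforms(Arrays.asList("null", "int", "string", "record", "record"))); // true
        System.out.println(conforms(Arrays.asList("null", "int", "long")));                       // false
    }
}
```

Under this reading, a union such as ["fixed", "string", "null"] would be rejected, because fixed and string are both symbolic.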

Ruby, Python and PHP can currently write the wrong branch if two records have the same fields, but this probably occurs rarely.  Also, full recursive validation is expensive (N^2 for nested structures).  So Ruby, Python and PHP should fix their writing of unions to check only the name of records and not to recursively descend any types.  My second proposal above would make this much simpler.
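
The difference between choosing a branch by content and choosing it by name can be sketched as follows. This is plain Java used only for illustration, not the Ruby, Python or PHP code: the class StructuralMatch, the nested RecordSchema type and the record names Celsius and Fahrenheit are hypothetical, and the content check looks at field names only, one level deep.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a record schema is a name plus a set of field names.
final class StructuralMatch {

    static final class RecordSchema {
        final String fullName;
        final Set<String> fields;

        RecordSchema(String fullName, String... fields) {
            this.fullName = fullName;
            this.fields = new HashSet<>(Arrays.asList(fields));
        }
    }

    // Content-based choice: the first branch whose field names equal the
    // keys of the datum. Two records with the same fields are ambiguous.
    static int byContent(List<RecordSchema> union, Map<String, Object> datum) {
        for (int i = 0; i < union.size(); i++) {
            if (union.get(i).fields.equals(datum.keySet())) {
                return i;
            }
        }
        return -1;
    }

    // Name-based choice: the datum carries its record name, so one string
    // comparison per branch is enough and nothing is descended into.
    static int byName(List<RecordSchema> union, String datumRecordName) {
        for (int i = 0; i < union.size(); i++) {
            if (union.get(i).fullName.equals(datumRecordName)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<RecordSchema> union = Arrays.asList(
                new RecordSchema("Celsius", "value"),
                new RecordSchema("Fahrenheit", "value"));
        Map<String, Object> datum = new HashMap<>();
        datum.put("value", 70.0); // meant as a Fahrenheit record
        System.out.println(byContent(union, datum));     // 0, the Celsius branch
        System.out.println(byName(union, "Fahrenheit")); // 1
    }
}
```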

Thoughts?






[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906924#action_12906924 ] 

Scott Carey commented on AVRO-656:
----------------------------------

bq. A high-fidelity implementation can read and write data without alteration, but an implementation that cannot write data exactly as read might still be both useful and correctly implement the Avro specification.

I agree, an implementation doesn't need to have that ability.  I am wary of restricting what is possible in unions to what is 'easy' in languages with weaker type systems.

bq. A primary question of this issue is whether to continue to permit multiple enums and fixed in a union, distinguished by name. No implementation takes advantage of this today, and it might make implementations simpler to drop this, permitting only a single enum and fixed per union. So far, no one has presented a use case for this feature.

To be clear, would that break this:

{code}
[
{"name": "com.rr.avro.Fixed16", "type": "fixed", "size":16},
{"name": "com.rr.avro.Fixed4", "type": "fixed", "size":4},
{"name": "com.rr.avro.MyRecord", "type": "record", "fields": [
  {"name": "hostIp", "type": ["Fixed4", "Fixed16"], "doc": "should always be 4 bytes (IPv4) or 16 bytes (IPv6)"},
   ... (more fields)
  ]}
]
{code}

I have this in use in production right now.  I could switch to bytes and enforce the size restrictions client side, however.  But schema migration might be a bit annoying in that case -- in particular, would new code be able to read old data written with the above schema?

I have a hard time thinking of a use case for multiple enums.  A union of two different enums is too much like a single, larger enum.
A union of multiple fixed has some uses, but can always be replaced with bytes.  The main motivation for the union of two fixed instead of bytes is that if there is a third member of the union, it saves space.  ["null", "Fixed4", "Fixed16"] takes up 1 less byte than ["null", "bytes"] when not null.
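
The one-byte difference follows from the binary encoding: a union value is written as a branch index (a variable-length zig-zag long) followed by the value, bytes carries a length prefix (also a variable-length long), and fixed is written as its raw bytes with no prefix. The sketch below does that arithmetic in plain Java; the class UnionSize and its method names are only illustrative.

```java
// Illustrative arithmetic only; class and method names are hypothetical.
final class UnionSize {

    // Number of bytes the variable-length zig-zag encoding needs for n.
    static int varLongSize(long n) {
        long z = (n << 1) ^ (n >> 63); // zig-zag: small magnitudes stay small
        int size = 1;
        while ((z & ~0x7FL) != 0) {    // more than 7 bits left
            z >>>= 7;
            size++;
        }
        return size;
    }

    // A fixed inside a union: branch index plus the raw bytes.
    static int fixedInUnion(int branchIndex, int fixedSize) {
        return varLongSize(branchIndex) + fixedSize;
    }

    // A bytes value inside a union: branch index, length prefix, payload.
    static int bytesInUnion(int branchIndex, int length) {
        return varLongSize(branchIndex) + varLongSize(length) + length;
    }

    public static void main(String[] args) {
        // ["null", "Fixed4", "Fixed16"] versus ["null", "bytes"]
        System.out.println(fixedInUnion(1, 4));  // 5
        System.out.println(bytesInUnion(1, 4));  // 6
        System.out.println(fixedInUnion(2, 16)); // 17
        System.out.println(bytesInUnion(1, 16)); // 18
    }
}
```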


On a different note about unions: while doing some research and experimentation with Scala recently, I found it interesting that Avro unions map almost 1:1 to Scala 'case classes'.  It is a bit annoying to map unions to Java polymorphically (perhaps with AVRO-648), but it would be simple in Scala.



[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906867#action_12906867 ] 

Doug Cutting commented on AVRO-656:
-----------------------------------

> That would be a major change in what the Union is and what you can do with it.

The specification is primarily concerned with (a) schema & protocol syntax; (b) format of corresponding data.  So, as long as an implementation produces and consumes valid schemas and data, it's a conforming implementation.  A high-fidelity implementation can read and write data without alteration, but an implementation that cannot write data exactly as read might still be both useful and correctly implement the Avro specification.

> If this means that an implementation can't use a string directly for an enum, but instead uses sentinel objects or a container with a value string and name string, Isn't that OK?

Sure, that's okay.  But currently Ruby, PHP and Python don't distinguish bytes, enum and fixed at runtime.  This is fine except in the case of a union that contains these types.  In that case, an application may end up treating a value intended to be one type as a different type.  That may be a problem for some applications, and may not be for others.  Hopefully someone will fix these implementations, e.g., to wrap such union values.  But I don't think in the meantime we need to declare that these implementations are non-conforming or change the spec.  Rather we should document the limitation and file bugs to improve the implementations.

A primary question of this issue is whether to continue to permit multiple enums and fixed in a union, distinguished by name.  No implementation takes advantage of this today, and it might make implementations simpler to drop this, permitting only a single enum and fixed per union.  So far, no one has presented a use case for this feature.

I'd also like to see Ruby, Python and PHP improve their union handling by avoiding recursive validation.  If they add a name to each record instance this is easy, and better implements the spirit of the specification.  Adding wrappers for enum, fixed and bytes would also be good, but is a bigger change.




[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906209#action_12906209 ] 

Scott Carey commented on AVRO-656:
----------------------------------

bq. Arguably we shouldn't worry so much. If an implementation can't distinguish between string and bytes then it should not be expected to preserve that distinction.

That would be a major change in what the Union is and what you can do with it.

For example, you might want a union of string and bytes, where the string is a hex representation of some data and the bytes are raw data.  If the distinction can't be preserved, you can't use unions to store different representations of the same data.  What if one language does not differentiate between string and bytes because it implicitly assumes that strings are just UTF-8 byte arrays, while another language also cannot differentiate the two but assumes strings are little-endian UTF-16 byte arrays?  If Avro can't guarantee that a user can find out which branch of the union a piece of data came from, and doesn't allow specifying which branch to use when writing, then I think we've just blown away a lot of cross-language compatibility.

What if an implementation only has strings, and can't differentiate between strings and numerics without parsing the string?  I think it should be required to tag/flag the union field with what type it is and expose that to the user.  In fact, I think all implementations should be expected to expose what Avro type the branch of a union field is, one way or another.  We can't really be 'magic' here and expect to achieve cross-language capabilities.


A user needs to be able to ask the implementation:  "what branch of the union is this union field" and specify "store this union field using branch X" when there is ambiguity present in the language.  An implementation might not require that a user specify what type it is setting and default to the first matching type, but that should be up to the user.

bq. Implementations will read data into the highest fidelity representation they can, but an implementation that represents floats as doubles will not be able to always write exactly the data it reads when processing a [float,double] union.   

I think if a user wants to write exactly what was read, it should be possible.
So a language that uses doubles internally for both float and double would need to tag the union field it reads with what type it was when it was read and make that available, so that a user could make an informed decision on whether to serialize as a float or double.

bq.  Folks could be advised to order their unions to guard against this.

I think doing too much implicitly here will lead to trouble, especially since the number of possible combinations of things various languages might do when presented with ambiguity is large and may not be understood at the time a schema is defined.


Back to the original problem, I'm not sure I get it.   Records, Enums, and Fixed are named types.  If the type is named, why is it so hard to figure out what branch it belongs to?  If this means that an implementation can't use a string directly for an enum, but instead uses sentinel objects or a container with a value string and name string, Isn't that OK?   
If an implementation can't distinguish strings and bytes by type, shouldn't it track what branch it is some other way than the type?  
If an implementation can't distinguish between bytes and fixed (like Java), it can wrap the fixed in a container and keep the name somewhere.

All implementations have at their disposal the ability to keep an additional internal value that tracks the union branch if it is ambiguous due to the language or otherwise.
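
One way to picture such an internal value is a small wrapper that stores the branch index next to the datum. This is only a sketch in plain Java; TaggedUnionValue is a hypothetical class, not something any Avro implementation provides, and the symbol FIREFOX is made up for the example.

```java
// Hypothetical wrapper: keeps the union branch index alongside the value so
// the writer never has to guess the branch from the runtime type.
final class TaggedUnionValue {
    private final int branch;   // index of the branch within the union schema
    private final Object value; // the datum in its runtime representation

    private TaggedUnionValue(int branch, Object value) {
        this.branch = branch;
        this.value = value;
    }

    static TaggedUnionValue of(int branch, Object value) {
        return new TaggedUnionValue(branch, value);
    }

    // "What branch of the union is this union field?"
    int branch() {
        return branch;
    }

    Object value() {
        return value;
    }

    public static void main(String[] args) {
        // Union ["BrowserTypeEnum", "string"], with both branches held as a
        // String at runtime. The tag is the only thing that tells them apart.
        TaggedUnionValue known = of(0, "FIREFOX"); // enum symbol (made up)
        TaggedUnionValue raw = of(1, "FIREFOX");   // free-form string
        System.out.println(known.branch() + " " + raw.branch()); // 0 1
    }
}
```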

Am I missing something?



[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906127#action_12906127 ] 

Doug Cutting commented on AVRO-656:
-----------------------------------

Arguably we shouldn't worry so much.  If an implementation can't distinguish between string and bytes then it should not be expected to preserve that distinction.  All that's really required is that it write valid data.

If we accept this, then we can go with my first proposal above: records are the only type that can occur multiply in a union.  Implementations will read data into the highest fidelity representation they can, but an implementation that represents floats as doubles will not be able to always write exactly the data it reads when processing a [float,double] union.  Similarly, an implementation that represents enum symbols with strings might sometimes write one in place of the other.

Folks could be advised to order their unions to guard against this.  Higher-precision numeric types should usually occur before lower-precision types.  Enum and fixed should usually occur before string and bytes.
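
The effect of branch order on a first-match writer can be sketched as follows. This is plain Java for illustration only: FirstMatchOrder is a hypothetical class that models an implementation holding every number as a double and writing it through the first numeric branch it finds.

```java
// Hypothetical model of an implementation that keeps all numbers as double
// and picks the first numeric branch of the union when writing.
final class FirstMatchOrder {

    // Returns the first "float" or "double" branch, or null if there is none.
    static String firstNumericBranch(String[] union) {
        for (String branch : union) {
            if (branch.equals("float") || branch.equals("double")) {
                return branch;
            }
        }
        return null;
    }

    // Simulates writing v through the chosen branch and reading it back.
    static double roundTrip(String[] union, double v) {
        if ("float".equals(firstNumericBranch(union))) {
            return (double) (float) v; // narrowed to 32 bits on the way out
        }
        return v;
    }

    public static void main(String[] args) {
        double v = 0.1;
        // float listed first: precision is lost.
        System.out.println(roundTrip(new String[] {"float", "double"}, v) == v); // false
        // double listed first: the value survives.
        System.out.println(roundTrip(new String[] {"double", "float"}, v) == v); // true
    }
}
```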

For performance, it is reasonable to continue to prohibit multiple arrays and maps, since otherwise recursive validation would be required.  Similarly, we should update all implementations to use record names, rather than recursive validation.
