You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by "Raymie Stata (JIRA)" <ji...@apache.org> on 2009/07/06 14:34:15 UTC

[jira] Created: (AVRO-75) Clarify resolution for enums (and fix code)

Clarify resolution for enums (and fix code)
-------------------------------------------

                 Key: AVRO-75
                 URL: https://issues.apache.org/jira/browse/AVRO-75
             Project: Avro
          Issue Type: Bug
          Components: spec
            Reporter: Raymie Stata
            Assignee: Doug Cutting


The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.

I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.

If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting resolved AVRO-75.
------------------------------

    Resolution: Fixed

I added some tests and committed this.

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-75.patch
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797926#action_12797926 ] 

Philip Zeyliger commented on AVRO-75:
-------------------------------------

bq. if writer's is a union, but reader's is not If the reader's schema matches the selected writer's schema, it is recursively resolved against it. If they do not match, an error is signalled.

The quote above is from the union resolution, and I'm unsure what it means too.  Does it mean that if you write with [int, string] and read with string, the error occurs only when you encounter an int?

Anyway, I think "unset" should be "an error is signalled", though I could be convinced.  I'd just like some guarantees that data is not null.  (Or, if everything is optional, why bother having unions with nulls?)

-- Philip

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797863#action_12797863 ] 

Jeff Hammerbacher commented on AVRO-75:
---------------------------------------

Two uses of "unset" that I see:
 * Reader's schema has a field with no default value and the writer's schema does not have that field.
 * The Reader's schema has an enum which matches the name of an enum in the Writer's schema, but the writer's schema has used a symbol not present in the Reader's schema.
In both cases, in the Python implementation, I'm raising a SchemaResolutionException.

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797873#action_12797873 ] 

Jeff Hammerbacher commented on AVRO-75:
---------------------------------------

bq. I was under the impression that you were not allowed to read the schema record(a=string) with the schema record(b=string), i.e., that this would throw an exception when the schemas were being reconciled.

That's what we're discussing here. Currently, two record schemas are assumed to "match" if they have the same name, regardless of fields. For the two record schemas you specify, if they had the same name, we'd be in the first case I mention above. According to the spec, the current behavior is that "the field's value is unset". We're trying to clarify what "unset" means.

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798124#action_12798124 ] 

Philip Zeyliger commented on AVRO-75:
-------------------------------------

Patch looks fine, but there's not a test which shows the exception being triggered.

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-75.patch
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-75:
-----------------------------

    Attachment: AVRO-75.patch

Here's a patch that changes "unset" to "an error is signalled" and updates the Java implementation to implement this.

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>         Attachments: AVRO-75.patch
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797869#action_12797869 ] 

Philip Zeyliger commented on AVRO-75:
-------------------------------------

I was under the impression that you were not allowed to read the schema record(a=string) with the schema record(b=string), i.e., that this would throw an exception when the schemas were being reconciled.  If the reader wanted to be open to unset fields, he could explicitly read with record(b=union(null,string)) or record(b=string(default="foo")).

I think failing fast here is good behavior.  If you're using a schema record(b=string), then you expect to be able to use record.b without checking, first, that b is not null.  In the RPC case especially, if some rogue-ish client (one that was too old, or had a typo) came by, my server would simply throw NPE, instead of usefully telling me where the schema went wrong.

For types where null is not an option (int), you'd see even more specific behavior (because int would default, presumably, to 0).  If we have so much type information, I'd rather not ask the user to check isSet(field_name) before every usage.

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728834#action_12728834 ] 

Doug Cutting commented on AVRO-75:
----------------------------------

> This is the only place the word "unset" is used in the doc [ ... ]

That may be a bug.  We don't currently say what happens when a reader has a field that a writer didn't write and where no default value is specified in the reader's schema.  In this case we might provide some implementation flexibility.  An optimized implementation might have an int field that's just set to zero in this case, with no means to differentiate this from an actual zero value for the field, or it might provide a means to tell whether this field was set or not.  I am not yet convinced that the spec should mandate this behavior, but might rather define such fields as "unset", which may or may not be detectable, depending on the implementation.

This gets back to the issue of optional/required fields.  I think the current intent is to treat all fields as optional, that, if a field must have a valid value then one can specify a default value in the schema rather than require that all writers already have that field.

> I propose that we declare this case an error [ ... ]

"Unset" may make sense in this case too.  In Java's reflect and specific API's, an Enum instance could be null.  This is somewhat analogous to a field that's been added to the writer but not yet to the reader.  A reader that requires this might provide a default value that would be used when either the writer does not provide a value or the writer provides an unknown value.

That said, I don't have a strong feeling and would be willing to make this an error if others can explain why that should be preferred.

> GenericDatumReader should be updated to throw an error in this case.

Or, if we decide that "unset" is useful here, we could have it use null or the default in this case.  We'd then need to update ReflectDatumReader too, as it currently throws an exception in this case.  In either case, the code does not conform to the spec.


> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797936#action_12797936 ] 

Philip Zeyliger commented on AVRO-75:
-------------------------------------

The code answered my own question:

{noformat}
    if (actual.getType() == Type.UNION)           // resolve unions
      actual = actual.getTypes().get((int)in.readIndex());
    if (expected.getType() == Type.UNION)
      expected = resolveExpected(actual, expected);
{noformat}

A record written with a schema of [string, int] can be read by a schema of [int] as long as the record only follows the int branch.

It still feels weird to me that schema resolution depends on the data being resolved.  I was going to write an avroj tool tlike "valid_schema_resolution writer_schema reader_schema", but such a thing would have to depend on the datum.

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (AVRO-75) Clarify resolution for enums (and fix code)

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-75?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated AVRO-75:
----------------------------------

    Fix Version/s: 1.3.0

> Clarify resolution for enums (and fix code)
> -------------------------------------------
>
>                 Key: AVRO-75
>                 URL: https://issues.apache.org/jira/browse/AVRO-75
>             Project: Avro
>          Issue Type: Bug
>          Components: spec
>            Reporter: Raymie Stata
>            Assignee: Doug Cutting
>             Fix For: 1.3.0
>
>
> The current resolution rule for enum's says: "if the writer's symbol is not present in the reader's enum, then the enum value is unset."  This is the only place the word "unset" is used in the doc, it's not clear what you mean.  The code seems to be inconsistent: GenericDatumReader will happily return a symbol the reader doesn't understand; ReflectDatumReader will probably throw a class-not-found exception; ResolvingDecoder throws an error.
> I propose that we declare this case an error, i.e., rewrite the spec to "if the writer's symbol is not listed in the reader's enum, an error is signaled."  GenericDatumReader should be updated to throw an error in this case.
> If we decide to stick with the "unset" language, we need to define what "unset" means (and, if necessary, update ReflectDatumReader and ResolvingDecoder).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.