Posted to dev@avro.apache.org by "Raymie Stata (JIRA)" <ji...@apache.org> on 2010/02/18 19:34:27 UTC

[jira] Created: (AVRO-419) Consistent laziness when resolving partially-compatible changes

Consistent laziness when resolving partially-compatible changes
---------------------------------------------------------------

                 Key: AVRO-419
                 URL: https://issues.apache.org/jira/browse/AVRO-419
             Project: Avro
          Issue Type: Bug
          Components: spec
            Reporter: Raymie Stata


Avro schema resolution is generally "lazy" when it comes to dealing with incompatible changes.  If the writer writes a union of "int" and "null", and the reader expects just an "int", Avro doesn't raise an exception unless the writer _actually_ writes a "null" (and the reader attempts to read it).
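The union example can be sketched as a toy model in plain Python (this is not the Avro library; the function name and error message are illustrative):

```python
# Toy model of "lazy" schema resolution: the writer's schema is the
# union ["null", "int"], the reader's schema is plain "int".
# Resolution succeeds up front; an error is raised only when a record
# that actually carries a null reaches the reader.

def resolve_union_to_int(written_branch, value):
    """Reader expecting "int" consumes one value written under ["null", "int"]."""
    if written_branch == "int":
        return value
    # The writer really wrote the "null" branch -- only now do we fail.
    raise ValueError('reader schema "int" cannot accept a written "null"')

# Records written under the union schema, as (branch, value) pairs.
records = [("int", 1), ("int", 2)]
print([resolve_union_to_int(b, v) for b, v in records])  # [1, 2]
```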

This laziness is a powerful feature for supporting "forward compatibility" (old readers reading data written by new writers).  In the example just given, we might decide at some point that a column needs to be "nullable", but there's a lot of old code that assumes it's not.  When using old code, we can often arrange to avoid sending it any new records that have null values in that column.  It's powerful to allow new writers to write against the nullable schema and old readers to read those records.  (For this to be safe, it's also important that this be _checked,_ i.e., that a run-time error is thrown if a bad value is passed to the reader.)

Avro is lazy in many places (e.g., in the union example just given, and for enumerations).  But it's not _consistently_ lazy.  I propose we comb through the spec and make it lazy in all places we can, unless there's a compelling reason not to.

Numeric types are one area where Avro is not consistently lazy.  I propose that we fairly liberally allow any change from one numeric type to another, raising errors at runtime only when bad values are found.  An "int" can be changed to a "long", for example, and an error is raised when a reader gets an out-of-bounds value.  A "double" can be changed to an "int", and an error is raised if the reader gets a non-integer or out-of-bounds value.  (I'm not sure if there are types beyond numerics where we could be more consistently lazy, but I decided to write this issue generically just in case.)
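The checked numeric resolutions proposed here can be sketched as follows.  This is a hypothetical model, not an Avro API: the reader expects "int" while the writer's schema was widened to "long" or changed to "double".

```python
# Lazy, checked numeric resolution: conversion only fails when an
# individual value cannot be represented in the reader's type.

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def long_to_int(v):
    """Resolve a written "long" to a reader's "int"; only
    out-of-bounds values raise."""
    if not INT_MIN <= v <= INT_MAX:
        raise ValueError(f"long value {v} out of range for int")
    return v

def double_to_int(v):
    """Resolve a written "double" to a reader's "int"; only
    non-integer or out-of-bounds values raise."""
    if v != int(v):
        raise ValueError(f"non-integer value {v} cannot resolve to int")
    return long_to_int(int(v))

print(double_to_int(42.0))  # 42
```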

(One might object that these checks are expensive, but note that they are only needed when the reader and writer schemas disagree.  Thus, if these checks are triggered, the system designer _wanted_ them: we're only adding value here, not imposing costs.)

I'm not sure if there are other a


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-419) Consistent laziness when resolving partially-compatible changes

Posted by "Raymie Stata (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835448#action_12835448 ] 

Raymie Stata commented on AVRO-419:
-----------------------------------

That's a left-over editing crumb I forgot to brush off, sorry about that.  Can I clarify the issue for you in some way?




[jira] Commented: (AVRO-419) Consistent laziness when resolving partially-compatible changes

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835465#action_12835465 ] 

Scott Carey commented on AVRO-419:
----------------------------------

I think that Avro should have good, consistent default behavior, and agree that this default should probably be lazy.  But these defaults also need to be safe and consistent across all target languages.

However, the details should really be up to the client, fully controlled by some sort of configuration or annotation.  Sometimes a client will want to fail eagerly, long before a specific tuple is encountered that can't be promoted.  Sometimes a client will want an exception thrown lazily, at read time.  Sometimes a client will want _something else_ to happen -- perhaps a callback or an override for a default value.  Maybe one client thinks it's completely fine to take a double that is larger than MAX_INT and cast it to an int with truncation to MAX_INT, while another wants an exception in that case, and a third never wants to down-cast.
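The three hypothetical clients above could be served by choosing the down-cast behavior up front.  A minimal sketch (policy names and function are invented for illustration):

```python
# Client-configurable behavior for resolving an oversized double to int:
# saturate at INT_MAX, raise an exception, or refuse the down-cast.

INT_MAX = 2**31 - 1

def make_double_to_int(policy):
    """Return a converter whose overflow behavior is fixed at config time."""
    def saturate(x):
        return min(int(x), INT_MAX)        # truncate to MAX_INT
    def strict(x):
        if x > INT_MAX:
            raise OverflowError(f"{x} exceeds INT_MAX")
        return int(x)
    def refuse(x):
        raise TypeError("down-casting double to int is disabled")
    return {"saturate": saturate, "strict": strict, "refuse": refuse}[policy]

print(make_double_to_int("saturate")(1e12))  # 2147483647
```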

There are lots of possibilities, and in the long run I think all those decisions can be configurable -- both at the schema level and via client specific overrides.  

This can all be achieved with fantastic performance in Java if these 'rules' are configured up front and compiled into a parser (fast), a static state machine (faster), or a class generated by ASM and compiled 'to the metal' by the JIT (fastest -- zero overhead for resolving decoders beyond the initial resolution/compilation cost, paid once per schema-resolution pair).
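The "compile once per schema pair" idea can be sketched like this (a toy in Python rather than Java/ASM, with invented names and only two primitive-type rules, to show the shape of the approach):

```python
# One-time resolution of a (writer, reader) schema pair: build the
# resolving function up front, then apply it per record with no further
# resolution cost.

def compile_resolver(writer, reader):
    """Resolve a (writer, reader) primitive-type pair into a converter."""
    if writer == reader:
        return lambda v: v
    if (writer, reader) == ("int", "long"):
        return lambda v: v                       # always-safe widening
    if (writer, reader) == ("long", "int"):
        def narrow(v):                           # lazy per-value check
            if not -2**31 <= v <= 2**31 - 1:
                raise ValueError(f"{v} out of int range")
            return v
        return narrow
    raise TypeError(f"cannot resolve {writer} -> {reader}")

read_long_as_int = compile_resolver("long", "int")   # paid once per pair
print([read_long_as_int(v) for v in [1, 2, 3]])      # [1, 2, 3]
```

In the Java implementation sketched above, the same idea would cache the compiled decoder keyed by the (writer, reader) schema pair.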




[jira] Commented: (AVRO-419) Consistent laziness when resolving partially-compatible changes

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835383#action_12835383 ] 

Jeff Hammerbacher commented on AVRO-419:
----------------------------------------

Hey Raymie, your description appears to have been truncated ("I'm not sure if there are other a" is the end). I'd love to see the rest of the description, if you care to post it here.

