Posted to user@avro.apache.org by Christophe Taton <ch...@gmail.com> on 2013/12/02 22:42:54 UTC

Effort towards Avro 2.0?

Hi all,

Avro, in its current form, exhibits a number of limitations that are hard
to work with or around, and hard to fix within the scope of Avro 1.x:
fixing these issues would introduce incompatible changes that warrant a
major version bump, i.e. Avro 2.0. An Avro 2.0 branch would be an
opportunity to address most of the issues that have so far been held back
for compatibility reasons.

I would like to initiate an effort in this direction and I am willing to do
the necessary work to gather and organize requirements, and draft a design
document for what Avro 2.0 would look like. For this reason, if you have
opinions regarding an Avro 2.0 branch or regarding issues and features that
could fit in Avro 2.0, please reply to this thread.

To bootstrap, below is a list I gathered over the last couple of years from
several discussions:

   - Specification
      - Improved support for unions (incompatible change with named unions
        and union properties).
      - New extension data type, similar to ProtocolBuffer extensions
        (incompatible change).
      - Clear separation between Avro schema (data format) and specific API
        client concerns: for example, the way Avro strings are exposed through
        the Java API should not pollute the schema definition. Each Java
        client should configure its own decoders with the string
        representation it wants (see the sketch after this list).
      - Clarification of compatibility and type promotion (safe lossless
        conversions vs. best-effort lossy conversions): promoting int to
        float potentially loses precision, which is not necessarily
        acceptable for all clients. Avro decoders should let clients
        configure which mode they need.
   - IDL
      - Generalized IDL for Avro schemas.
      - Support for recursive records.
      - Meta-schema: IDL definition for a schema.
   - Java API
      - Truly immutable schema objects (no properties / hashcode mutation
        after construction).
      - Immutable records.
      - Complete record builder API (current record builders do not play
        well with nested records).
      - Complete generic API (there is currently no GenericUnion or
        GenericMap).
      - Improved union support: union values as java.lang.Object are less
        than ideal; union values could expose the union branch through an
        enum (nulls could be handled specifically).
   - Python 3 support
   - RPC
      - SASL support
      - Full Python/Java parity and interoperability.
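
To make the string-representation item above concrete, here is a minimal
sketch (the record schema is made up for illustration): today the Java
string class is selected through the "avro.java.string" schema property,
so a Java API binding concern ends up stored in the data format instead of
being a decoder-side setting.

  import org.apache.avro.Schema;

  public class StringPropDemo {
    public static void main(String[] args) {
      // A record schema carrying the Java-specific string hint inline.
      Schema schema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":{\"type\":\"string\","
          + "\"avro.java.string\":\"String\"}}]}");
      // Prints "String": an API concern embedded in the schema itself.
      System.out.println(
          schema.getField("name").schema().getProp("avro.java.string"));
    }
  }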

Please comment on or extend this list. Given enough interest, I'll happily
digest the feedback and organize it into a document (most likely a wiki page?).

Thanks,
Christophe

Re: Effort towards Avro 2.0?

Posted by Dan Burkert <da...@gmail.com>.
I was mistaken in the previous email: specific records do implement the
generic record API.  However, generated specific record builders are not
generic, so building a specific record with fields not known at compile
time requires reflection.

- Dan


On Tue, Dec 3, 2013 at 5:28 PM, Dan Burkert <da...@gmail.com> wrote:

> It would be nice if specific records implemented the generic record API.
>  This is useful when you don't know field names of a specific record at
> compile time.  Reflection is the only way around this currently.  Also, it
> would be useful if there was built in support for determining if a given
> schema is reader-compatible with a writer-schema.  Neither of these should
> require changing the data serialization format.
>
> - Dan
>
> On Tue, Dec 3, 2013 at 7:20 AM, Doug Cutting <cu...@apache.org> wrote:
>
>> On Mon, Dec 2, 2013 at 1:56 PM, Philip Zeyliger <ph...@cloudera.com>
>> wrote:
>> > It sounds like you're proposing to break language API compatibility.
>>  Are
>> > you also proposing to break wire compatibility for Avro HTTP RPC, Avro
>> Netty
>> > RPC, and/or Avro datafiles?
>>
>> We should be able to provide back-compatibility.  When current APIs
>> cannot be back-compatibly extended, new APIs can be added.  Old APIs
>> may be deprecated but should be retained for a while.  Data files
>> written by 1.x should be readable by 2.x.
>>
>> Forward compatibility may not be possible when new schema features are
>> used.  Data files written in 2.x may not be readable by 1.x.  Perhaps
>> we could add a mode that forces 2.x to write a 1.x format file.
>>
>> RPC interoperability requires that 2.x be able to both read and write
>> 1.x format.  So long as a 1.x protocol is used, then 1.x and 2.x
>> clients and servers might be able to interoperate using 1.x wire
>> formats.  But when 2.x schema features are used this may not be
>> possible.
>>
>> Perhaps we should proceed by making back-compatibility (ability to
>> read 1.x) a requirement, then adding interoperabilty features (ability
>> to write 1.x) as needed?
>>
>> Should we require that all new schema features (named unions,
>> extensions, date primitive, etc,) have a lossless translation to a 1.x
>> schema?
>>
>> Doug
>>
>
>

Re: Effort towards Avro 2.0?

Posted by Dan Burkert <da...@gmail.com>.
It would be nice if specific records implemented the generic record API.
This is useful when you don't know the field names of a specific record at
compile time; reflection is the only way around this currently.  Also, it
would be useful if there were built-in support for determining whether a
given schema is reader-compatible with a writer schema.  Neither of these
should require changing the data serialization format.
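
As a rough illustration of what such a built-in check could look like, below
is a hand-rolled, hypothetical sketch; it covers only a few of the resolution
rules from the spec (identical types, int promotion, reader record fields
with defaults), and the class name is made up.

  import org.apache.avro.Schema;

  public class CompatSketch {
    // Hypothetical, partial reader/writer compatibility check.
    public static boolean canRead(Schema reader, Schema writer) {
      if (reader.getType() == writer.getType()) {
        if (reader.getType() == Schema.Type.RECORD) {
          for (Schema.Field rf : reader.getFields()) {
            Schema.Field wf = writer.getField(rf.name());
            if (wf == null) {
              // A reader-only field is fine only if it has a default value.
              if (rf.defaultValue() == null) return false;
            } else if (!canRead(rf.schema(), wf.schema())) {
              return false;
            }
          }
        }
        return true;
      }
      // One of the spec's promotions: int is readable as long, float or double.
      if (writer.getType() == Schema.Type.INT) {
        switch (reader.getType()) {
          case LONG: case FLOAT: case DOUBLE: return true;
          default: return false;
        }
      }
      return false;
    }
  }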

- Dan

On Tue, Dec 3, 2013 at 7:20 AM, Doug Cutting <cu...@apache.org> wrote:

> On Mon, Dec 2, 2013 at 1:56 PM, Philip Zeyliger <ph...@cloudera.com>
> wrote:
> > It sounds like you're proposing to break language API compatibility.  Are
> > you also proposing to break wire compatibility for Avro HTTP RPC, Avro
> Netty
> > RPC, and/or Avro datafiles?
>
> We should be able to provide back-compatibility.  When current APIs
> cannot be back-compatibly extended, new APIs can be added.  Old APIs
> may be deprecated but should be retained for a while.  Data files
> written by 1.x should be readable by 2.x.
>
> Forward compatibility may not be possible when new schema features are
> used.  Data files written in 2.x may not be readable by 1.x.  Perhaps
> we could add a mode that forces 2.x to write a 1.x format file.
>
> RPC interoperability requires that 2.x be able to both read and write
> 1.x format.  So long as a 1.x protocol is used, then 1.x and 2.x
> clients and servers might be able to interoperate using 1.x wire
> formats.  But when 2.x schema features are used this may not be
> possible.
>
> Perhaps we should proceed by making back-compatibility (ability to
> read 1.x) a requirement, then adding interoperabilty features (ability
> to write 1.x) as needed?
>
> Should we require that all new schema features (named unions,
> extensions, date primitive, etc,) have a lossless translation to a 1.x
> schema?
>
> Doug
>

Re: Effort towards Avro 2.0?

Posted by Doug Cutting <cu...@apache.org>.
On Mon, Dec 2, 2013 at 1:56 PM, Philip Zeyliger <ph...@cloudera.com> wrote:
> It sounds like you're proposing to break language API compatibility.  Are
> you also proposing to break wire compatibility for Avro HTTP RPC, Avro Netty
> RPC, and/or Avro datafiles?

We should be able to provide back-compatibility.  When current APIs
cannot be back-compatibly extended, new APIs can be added.  Old APIs
may be deprecated but should be retained for a while.  Data files
written by 1.x should be readable by 2.x.

Forward compatibility may not be possible when new schema features are
used.  Data files written in 2.x may not be readable by 1.x.  Perhaps
we could add a mode that forces 2.x to write a 1.x format file.

RPC interoperability requires that 2.x be able to both read and write
1.x format.  So long as a 1.x protocol is used, then 1.x and 2.x
clients and servers might be able to interoperate using 1.x wire
formats.  But when 2.x schema features are used this may not be
possible.

Perhaps we should proceed by making back-compatibility (the ability to
read 1.x) a requirement, then adding interoperability features (the
ability to write 1.x) as needed?

Should we require that all new schema features (named unions,
extensions, date primitive, etc.) have a lossless translation to a 1.x
schema?

Doug

Re: Effort towards Avro 2.0?

Posted by Tibor Benke <bt...@balabit.hu>.
In the case of RPC, I think not just Python and Java but all supported
languages should be compatible with each other.

Tibor

On 12/02/2013 10:56 PM, Philip Zeyliger wrote:
> It sounds like you're proposing to break language API compatibility. 
>  Are you also proposing to break wire compatibility for Avro HTTP RPC, 
> Avro Netty RPC, and/or Avro datafiles?
>
> I'd be appreciative of a mechanism by which systems that happen to use 
> Avro currently need not be forced to choose one version or another. 
>  (One approach to this is to use a different package name.)
>
> As for adding to your list, I'd like to see a code-generated API for 
> Python.  (We like to call these APIs "specific" but I find that 
> terminology confusing.)
>
> -- Philip
>
>
> On Mon, Dec 2, 2013 at 1:42 PM, Christophe Taton 
> <christophe.taton@gmail.com <ma...@gmail.com>> wrote:
>
>     Hi all,
>
>     Avro, in its current form, exhibits a number of limitations that
>     are hard to work with or around, and hard to fix within the scope
>     of Avro 1.x : fixing these issues would introduce incompatible
>     changes that warrant a major version bump, ie. Avro 2.0. An Avro
>     2.0 branch would be an opportunity to address most issues that
>     appeared held back for compatibility purposes so far.
>
>     I would like to initiate an effort in this direction and I am
>     willing to do the necessary work to gather and organize
>     requirements, and draft a design document for what Avro 2.0 would
>     look like. For this reason, if you have opinions regarding an Avro
>     2.0 branch or regarding issues and features that could fit in Avro
>     2.0, please reply to this thread.
>
>     To bootstrap, below is a list I gathered over the last couple of
>     years from several discussions:
>
>       * Specification
>           o Improved support for unions (incompatible change with
>             named unions and union properties).
>           o New extension data type, similar to ProtocolBuffer
>             extensions (incompatible change).
>           o Clear separation between Avro schema (data format) and
>             specific API client concerns: for example, the way Avro
>             strings are exposed through the Java API should not
>             pollute the schema definition. Each particular Java client
>             should configure their own decoders with the way they want
>             Avro strings to be represented.
>           o Clarification of compatibility and type promotion (safe
>             lossless conversions vs. best-effort lossy conversions):
>             promoting int to float potentially loses precision, which
>             is not necessarily acceptable for all clients. Avro
>             decoders should let clients configure which mode they need.
>       * IDL
>           o Generalized IDL for Avro schemas.
>           o Support for recursive records.
>           o Meta-schema : IDL definition for a schema.
>       * Java API
>           o Truly immutable schema objects (no properties / hashcode
>             mutation after construction).
>           o Immutable records.
>           o Complete record builder API (current record builders do
>             not play well with nested records).
>           o Complete generic API (there currently is no GenericUnion
>             or GenericMap).
>           o Improved unions support : union values as java.lang.Object
>             are less than ideal; union values could expose the union
>             branch through an enum (nulls could be handled specifically).
>       * Python 3 support
>       * RPC
>           o SASL support
>           o Full Python/Java parity and interoperability.
>
>     Please, comment or extend this list. Provided enough interest,
>     I'll happily digest feedback and organize it into a document (most
>     likely a wiki page?).
>
>     Thanks,
>     Christophe
>
>


Re: Effort towards Avro 2.0?

Posted by Philip Zeyliger <ph...@cloudera.com>.
It sounds like you're proposing to break language API compatibility.  Are
you also proposing to break wire compatibility for Avro HTTP RPC, Avro
Netty RPC, and/or Avro datafiles?

I'd appreciate a mechanism by which systems that currently happen to use
Avro are not forced to choose one version or another.  (One approach to
this is to use a different package name.)

As for adding to your list, I'd like to see a code-generated API for
Python.  (We like to call these APIs "specific" but I find that terminology
confusing.)

-- Philip


On Mon, Dec 2, 2013 at 1:42 PM, Christophe Taton <christophe.taton@gmail.com
> wrote:

> Hi all,
>
> Avro, in its current form, exhibits a number of limitations that are hard
> to work with or around, and hard to fix within the scope of Avro 1.x :
> fixing these issues would introduce incompatible changes that warrant a
> major version bump, ie. Avro 2.0. An Avro 2.0 branch would be an
> opportunity to address most issues that appeared held back for
> compatibility purposes so far.
>
> I would like to initiate an effort in this direction and I am willing to
> do the necessary work to gather and organize requirements, and draft a
> design document for what Avro 2.0 would look like. For this reason, if you
> have opinions regarding an Avro 2.0 branch or regarding issues and features
> that could fit in Avro 2.0, please reply to this thread.
>
> To bootstrap, below is a list I gathered over the last couple of years
> from several discussions:
>
>    - Specification
>    - Improved support for unions (incompatible change with named unions
>       and union properties).
>       - New extension data type, similar to ProtocolBuffer extensions
>       (incompatible change).
>       - Clear separation between Avro schema (data format) and specific
>       API client concerns: for example, the way Avro strings are exposed through
>       the Java API should not pollute the schema definition. Each particular Java
>       client should configure their own decoders with the way they want Avro
>       strings to be represented.
>       - Clarification of compatibility and type promotion (safe lossless
>       conversions vs. best-effort lossy conversions): promoting int to float
>       potentially loses precision, which is not necessarily acceptable for all
>       clients. Avro decoders should let clients configure which mode they need.
>    - IDL
>    - Generalized IDL for Avro schemas.
>       - Support for recursive records.
>       - Meta-schema : IDL definition for a schema.
>       - Java API
>    - Truly immutable schema objects (no properties / hashcode mutation
>       after construction).
>       - Immutable records.
>       - Complete record builder API (current record builders do not play
>       well with nested records).
>       - Complete generic API (there currently is no GenericUnion or
>       GenericMap).
>       - Improved unions support : union values as java.lang.Object are
>       less than ideal; union values could expose the union branch through an enum
>       (nulls could be handled specifically).
>        - Python 3 support
>    - RPC
>       - SASL support
>       - Full Python/Java parity and interoperability.
>
> Please, comment or extend this list. Provided enough interest, I'll
> happily digest feedback and organize it into a document (most likely a wiki
> page?).
>
> Thanks,
> Christophe
>

Re: Effort towards Avro 2.0?

Posted by Doug Cutting <cu...@apache.org>.
On Wed, Dec 4, 2013 at 11:40 PM, Christophe Taton <
christophe.taton@gmail.com> wrote:

> Well, I guess one can always handle such things externally to Avro.
>

This needn't be done externally.

When an extension schema is encountered, the schema compiler can generate
Object references, the DatumWriter can write the schema signature and
encode the object, and the DatumReader can locate the referenced schema and
create and deserialize the appropriate object.  This behavior could be
compatibly added to the existing generic, specific and reflect
representations, since the extension schema is in Avro's namespace and
should thus not conflict with any user schema.

So this could be implemented compatibly.  Until support is added in other
languages (Python, C, C++, etc.) these objects would be opaque, but no
matter how the feature is implemented it will require implementation work
in each language.
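
A rough sketch of that write path, assuming the 16-byte fingerprint is an
MD5 parsing fingerprint; the method name writeExtension is made up, and
none of this exists in Avro today.

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.security.NoSuchAlgorithmException;

  import org.apache.avro.Schema;
  import org.apache.avro.SchemaNormalization;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.Encoder;
  import org.apache.avro.io.EncoderFactory;

  public class ExtensionWriteSketch {
    // Encodes one extension value as (16-byte fingerprint, bytes payload).
    public static void writeExtension(Object datum, Schema datumSchema, Encoder out)
        throws IOException, NoSuchAlgorithmException {
      // Encode the datum with its own schema into a temporary buffer.
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      BinaryEncoder enc = EncoderFactory.get().binaryEncoder(buf, null);
      new GenericDatumWriter<Object>(datumSchema).write(datum, enc);
      enc.flush();

      // Emit the two fields of the proposed "extension" record:
      // the schema fingerprint, then the encoded payload.
      out.writeFixed(SchemaNormalization.parsingFingerprint("MD5", datumSchema));
      byte[] payload = buf.toByteArray();
      out.writeBytes(payload, 0, payload.length);
    }
  }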

Doug

Re: Effort towards Avro 2.0?

Posted by Christophe Taton <ch...@gmail.com>.
On Wed, Dec 4, 2013 at 12:18 PM, Douglas Creager <do...@creagertino.net> wrote:

> >    - unwieldy because as a user, I'll have to encode and decode the bytes
> >    field manually everytime I want to access this field from the original
> >    record, unless I keep track of the decoded extension externally to the
> >    Avro record.
>
> Can you handle this in the middleware?  I.e., have the middleware decode
> the bytes field before passing control to the user code.  That's better
> from a decoupling standpoint anyway, since the user code shouldn't care
> what middleware is wrapping it.
>

Well, I guess one can always handle such things externally to Avro.

I don't necessarily want to dissociate the generic part of a record from
its extensions.
In some cases, the extension in isolation doesn't make sense without the
context defined by the generic part of the record.

> > When you write a middleware that lets users define custom types,
> > extensions are pretty much required.
>
> I guess my main point is that we already have two mechanisms for dealing
> with user extensions (schema resolution and Doug's bytes field
> proposal), both of which work just fine at runtime without rebuilding or
> restarting your code.  In general, I think it's better if we can solve a
> problem at the library or application level, without having to update
> the spec.
>

One use case I have involves recursive records that can be extended by
users: think of a tree of operations (e.g. math/logic operations) that
users can extend with new operations.
Without support from Avro, I have to manually decode each node,
recursively, dealing with bytes at every step.
With extensions built in, Avro could decode the entire tree in a single
step.
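
For concreteness, such a user-extensible tree might be modelled today with a
schema along these lines (names made up), where the user-defined part of
each node rides in a plain bytes field and has to be decoded by hand at
every level:

  {"type":"record", "name":"OpNode", "fields":[
    {"name":"op", "type":"string"},
    {"name":"children", "type":{"type":"array", "items":"OpNode"}},
    {"name":"userPayload", "type":"bytes"}
    ]
  }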

Or maybe I misunderstand what you are suggesting I do instead?

C.

Re: Effort towards Avro 2.0?

Posted by Douglas Creager <do...@creagertino.net>.
>    - inefficient because you'll end up serializing your data twice, once
>    from the actual type into the bytes field, then a second type as a
>    bytes field;

I don't think it's as inefficient as you might think — the second
serialization just blits the raw bytes content into some destination
buffer/pipe/socket/etc.  The C binding already does this under the
covers to handle blocks when writing into a data file.  And it hasn't
been a performance bottleneck.

>    - unwieldy because as a user, I'll have to encode and decode the bytes
>    field manually everytime I want to access this field from the original
>    record, unless I keep track of the decoded extension externally to the
>    Avro record.

Can you handle this in the middleware?  I.e., have the middleware decode
the bytes field before passing control to the user code.  That's better
from a decoupling standpoint anyway, since the user code shouldn't care
what middleware is wrapping it.

> When you write a middleware that lets users define custom types,
> extensions are pretty much required.

I guess my main point is that we already have two mechanisms for dealing
with user extensions (schema resolution and Doug's bytes field
proposal), both of which work just fine at runtime without rebuilding or
restarting your code.  In general, I think it's better if we can solve a
problem at the library or application level, without having to update
the spec.
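
As a small illustration of the schema-resolution route (schemas and field
names made up): the writer declares a field the reader does not know about,
and a reader resolving against its own schema simply skips it at decode
time.

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.DecoderFactory;
  import org.apache.avro.io.EncoderFactory;

  public class ResolutionSketch {
    public static void main(String[] args) throws IOException {
      Schema writer = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Op\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"userExtra\",\"type\":\"int\"}]}");
      Schema reader = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Op\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

      GenericRecord rec = new GenericData.Record(writer);
      rec.put("name", "add");
      rec.put("userExtra", 42);

      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
      new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
      enc.flush();

      // The reader resolves writer vs. reader schema and skips "userExtra".
      GenericRecord decoded = new GenericDatumReader<GenericRecord>(writer, reader)
          .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
      System.out.println(decoded);  // prints something like {"name": "add"}
    }
  }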

–doug

Re: Effort towards Avro 2.0?

Posted by Christophe Taton <ch...@gmail.com>.
Hi Douglas,

When you write a middleware that lets users define custom types, extensions
are pretty much required.

Middleware doesn't need to, and shouldn't need to, know these user-defined
custom types ahead of time: you don't want to rebuild and restart your
middleware every time a user defines a new type they want handled by the
middleware.

An explicit bytes field always works, but is both inefficient and unwieldy:

   - inefficient because you'll end up serializing your data twice, once
   from the actual type into the bytes field, then a second time as a bytes
   field;
   - unwieldy because as a user, I'll have to encode and decode the bytes
   field manually every time I want to access this field from the original
   record, unless I keep track of the decoded extension externally to the
   Avro record (see the sketch below).
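
A minimal sketch of that manual decode step (the field name
"extensionPayload" and the caller-supplied extension schema are made up for
illustration):

  import java.io.IOException;
  import java.nio.ByteBuffer;

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.DecoderFactory;

  public class ManualExtensionDecode {
    // Pulls the bytes field out of the enclosing record and runs a second
    // decoder over it, using a schema the caller has to track externally.
    public static GenericRecord decodeExtension(GenericRecord record,
                                                Schema extensionSchema)
        throws IOException {
      ByteBuffer raw = (ByteBuffer) record.get("extensionPayload");
      byte[] bytes = new byte[raw.remaining()];
      raw.duplicate().get(bytes);
      return new GenericDatumReader<GenericRecord>(extensionSchema)
          .read(null, DecoderFactory.get().binaryDecoder(bytes, null));
    }
  }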

C.


On Wed, Dec 4, 2013 at 8:07 AM, Douglas Creager <do...@creagertino.net> wrote:

> On Tue, Dec 3, 2013, at 07:49 AM, Doug Cutting wrote:
> > On Mon, Dec 2, 2013 at 1:42 PM, Christophe Taton
> > <ch...@gmail.com> wrote:
> > > - New extension data type, similar to ProtocolBuffer extensions
> (incompatible change).
> >
> > Extensions might be implemented as something like:
> >
> >   {"type":"record", "name":"extension", "fields":[
> >     {"name":"fingerprint", "type": {"type":"fixed", "size":16}},
> >     {"name":"payload", "type":"bytes"}
> >     ]
> >   }
>
> I'd also want to know more about the kind of use cases that you'd need
> protobuf-style extensions for.  I like Doug's solution if each record
> can have a different set of extensions.  If all of the records will have
> the same set of extensions, my hunch is that you'd only need to use
> extra fields and schema resolution.  Either way, I can't think of a use
> case where a new data type in the spec is a noticeable improvement.
>
> –doug
>

Re: Effort towards Avro 2.0?

Posted by Douglas Creager <do...@creagertino.net>.
On Tue, Dec 3, 2013, at 07:49 AM, Doug Cutting wrote:
> On Mon, Dec 2, 2013 at 1:42 PM, Christophe Taton
> <ch...@gmail.com> wrote:
> > - New extension data type, similar to ProtocolBuffer extensions (incompatible change).
> 
> Extensions might be implemented as something like:
> 
>   {"type":"record", "name":"extension", "fields":[
>     {"name":"fingerprint", "type": {"type":"fixed", "size":16}},
>     {"name":"payload", "type":"bytes"}
>     ]
>   }

I'd also want to know more about the kind of use cases that you'd need
protobuf-style extensions for.  I like Doug's solution if each record
can have a different set of extensions.  If all of the records will have
the same set of extensions, my hunch is that you'd only need to use
extra fields and schema resolution.  Either way, I can't think of a use
case where a new data type in the spec is a noticeable improvement.

–doug

Re: Effort towards Avro 2.0?

Posted by Christophe Taton <ch...@gmail.com>.
On Tue, Dec 3, 2013 at 7:49 AM, Doug Cutting <cu...@apache.org> wrote:

> On Mon, Dec 2, 2013 at 1:42 PM, Christophe Taton
> <ch...@gmail.com> wrote:
> > - New extension data type, similar to ProtocolBuffer extensions
> (incompatible change).
>
> Extensions might be implemented as something like:
>
>   {"type":"record", "name":"extension", "fields":[
>     {"name":"fingerprint", "type": {"type":"fixed", "size":16}},
>     {"name":"payload", "type":"bytes"}
>     ]
>   }
>
> One could then use this with:
>
>   {"type":"record", "name":"Foo", "fields":[
>     {"name":"bar", "type":"extension"}
>     ]
>   }
>
> The implementation could then find the schema for the extension at
> runtime given its fingerprint.  The reader could have a table mapping
> fingerprints to schemas.
>
> In particular, the specific compiler, when it sees a schema like:
>
>
>   {"type":"record", "name":"Bar", "isExtension":true, "fields":[
>     {"name":"x", "type":"long"}
>     ]
>   }
>
> Might emit code to add entries to the extension mapping table used by
> SpecificDatumReader, e.g.:
>
>   static {
>     SpecificData.addExtension(getSchema());
>   }
>
> Might something like this work?
>

Yes, this is very much the idea.
In a prototype I made a few months ago, I found it useful to let the user
specify how the extension schema is identified: in some scenarios, an
extension could be prefixed by a string that contains the JSON schema; in
others, I may want to use fingerprints to identify the schema of the
extension; and in other cases still, I may want to use some external
mapping maintained by another system (e.g. the schema repository worked on
in AVRO-1124).

C.

Re: Effort towards Avro 2.0?

Posted by Doug Cutting <cu...@apache.org>.
On Mon, Dec 2, 2013 at 1:42 PM, Christophe Taton
<ch...@gmail.com> wrote:
> - New extension data type, similar to ProtocolBuffer extensions (incompatible change).

Extensions might be implemented as something like:

  {"type":"record", "name":"extension", "fields":[
    {"name":"fingerprint", "type": {"type":"fixed", "size":16}},
    {"name":"payload", "type":"bytes"}
    ]
  }

One could then use this with:

  {"type":"record", "name":"Foo", "fields":[
    {"name":"bar", "type":"extension"}
    ]
  }

The implementation could then find the schema for the extension at
runtime given its fingerprint.  The reader could have a table mapping
fingerprints to schemas.

In particular, the specific compiler, when it sees a schema like:


  {"type":"record", "name":"Bar", "isExtension":true, "fields":[
    {"name":"x", "type":"long"}
    ]
  }

Might emit code to add entries to the extension mapping table used by
SpecificDatumReader, e.g.:

  static {
    SpecificData.addExtension(getSchema());
  }

Might something like this work?

Doug
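
For reference, below is a minimal, hypothetical sketch of the
fingerprint-to-schema table described above, keyed on MD5 parsing
fingerprints. Nothing like SpecificData.addExtension exists in Avro today,
so the class and method names here are made up.

  import java.nio.ByteBuffer;
  import java.security.NoSuchAlgorithmException;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  import org.apache.avro.Schema;
  import org.apache.avro.SchemaNormalization;

  public class ExtensionRegistry {
    // Maps 16-byte MD5 parsing fingerprints to registered extension schemas.
    private final Map<ByteBuffer, Schema> byFingerprint =
        new ConcurrentHashMap<ByteBuffer, Schema>();

    public void addExtension(Schema schema) throws NoSuchAlgorithmException {
      byte[] fp = SchemaNormalization.parsingFingerprint("MD5", schema);
      byFingerprint.put(ByteBuffer.wrap(fp), schema);
    }

    // Returns null if the fingerprint is unknown to this reader.
    public Schema lookup(byte[] fingerprint) {
      return byFingerprint.get(ByteBuffer.wrap(fingerprint));
    }
  }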