Posted to dev@avro.apache.org by Niels Basjes <Ni...@basjes.nl> on 2017/01/31 10:04:33 UTC

[IDEA] Making schema evolution for enums slightly easier.

Hi,

I'm working on a project where we are putting Avro-serialized records into
Kafka. The schemas are made available via a schema registry of some sort.
Because Kafka stores the messages for a longer period (weeks), we have two
common scenarios that occur when a new version of the schema is introduced
(i.e. going from V1 to V2).

1) A V2 producer is released and a V1 consumer must be able to read the
records.
2) A 'new' V2 consumer is released a few days after the V2 producer started
creating records. The V2 consumer starts reading Kafka "from the beginning"
and as a consequence first has to go through a set of V1 records.

So in this use case we need schema evolution in both directions.

To make sure it all works as expected I did some experiments and found that
these requirements are all doable except when you need an enum.

This 'two directions' requirement turns out to be a problem when changing
the values of an enum.

You cannot write an enum { 'A', 'B', 'C' } and then read it with the schema
enum { 'A', 'B' }
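
To make the failure concrete, here is a small sketch of the two schema
versions as Avro JSON (the enum name "Letter" is just an illustration) and a
simplified check of the resolution rule:

```python
# The two schema versions from the example above, as Avro JSON.
v1 = {"type": "enum", "name": "Letter", "symbols": ["A", "B"]}
v2 = {"type": "enum", "name": "Letter", "symbols": ["A", "B", "C"]}

def readable(writer, reader):
    # Simplified view of Avro's rule: resolution errors out when the writer
    # used a symbol the reader's enum does not contain. (Real Avro signals
    # the error at decode time, when such a symbol is actually encountered.)
    return set(writer["symbols"]) <= set(reader["symbols"])

print(readable(v1, v2))  # True  -> a V2 reader handles all V1 data
print(readable(v2, v1))  # False -> a V1 reader can hit the unknown 'C'
```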


So I was thinking about a possible way to make this easier for the
developer.

The current idea that I want your opinion on:
1) In the IDL we add a way of indicating that we want the enum to be stored
in a different way in the schema. I was thinking of something like
either defining a new type like 'string enum' or perhaps using an annotation
of some sort.
2) The 'string enum' is mapped into the actual schema as a plain string
(which can contain ANY value). So anyone using the JSON schema can simply
read it because it is a string.
3) The generated code that is used to set/change the value enforces that
only the allowed values can be set.

This way a 'reader' can read any value, and the schema is compatible in all
directions.
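
A rough Python sketch of what the generated accessor in point 3 could look
like; the class name, field handling, and symbol set are all invented for
illustration:

```python
class StringEnumField:
    """Sketch of a generated accessor for the proposed 'string enum'.

    On the wire the field is a plain string, so any reader accepts any
    value; only the writer-side setter enforces the known symbols.
    """

    ALLOWED = frozenset({"A", "B", "C"})  # symbols known at codegen time

    def __init__(self, value=None):
        self._value = value

    def set(self, value):
        # Writer side: only the allowed values can be set.
        if value not in self.ALLOWED:
            raise ValueError(f"{value!r} is not one of {sorted(self.ALLOWED)}")
        self._value = value

    @classmethod
    def from_wire(cls, value):
        # Reader side: accept any string, even a symbol this version
        # of the generated code does not know about.
        return cls(value)

    def get(self):
        return self._value
```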

What do you guys think?
Is this an idea worth trying out?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: [IDEA] Making schema evolution for enums slightly easier.

Posted by Niels Basjes <Ni...@basjes.nl>.
Thanks for the idea.
I'm gonna play around with that to see if it could work.

Niels

On Tue, Jan 31, 2017 at 5:57 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: [IDEA] Making schema evolution for enums slightly easier.

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
If you want to solve this problem by using a String to encode the value,
then you can do that by defining a logical type that is an enum-as-string.
But I'm not sure you want to do that. The nice thing about an enum is that
you use what you know about the schema ahead of time to get a much more
compact representation -- usually a byte rather than encoding the entire
string. So I'd much rather find a way of handling this case that keeps the
compact representation, while allowing applications to handle these cases
gracefully.
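
To make the size argument concrete: Avro writes an enum value as the
zero-based index of its symbol, zigzag-encoded as a variable-length int. A
small self-contained sketch of that encoding:

```python
def zigzag_varint(n):
    # Zigzag-encode a (32-bit) int, then write it as an Avro varint:
    # 7 bits per byte, high bit set on all but the last byte.
    z = (n << 1) ^ (n >> 31)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

symbols = ["A", "B", "C"]
# The symbol 'C' (index 2) costs exactly one byte on the wire,
# versus a length prefix plus UTF-8 bytes for a string encoding.
print(zigzag_varint(symbols.index("C")))  # b'\x04'
```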

For the generic data model, enum symbols are translated to GenericEnumSymbol, which can
hold any symbol. Adding an option to return the symbol from the writer's
schema even if it isn't in the reader's schema is one way around the
problem. That wouldn't work for reflect or specific, though.

Another option that was suggested last year is to designate a catch-all
enum symbol. So your enum would be { 'A', 'B', 'UNKNOWN' } and { 'A', 'B',
'C', 'UNKNOWN' }. When a v1 consumer reads v2 records, C gets turned into
UNKNOWN.
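
The catch-all rule could be approximated on the reader side like this
(symbol lists invented for the example):

```python
# Symbols the v1 consumer was generated against.
V1_SYMBOLS = {"A", "B", "UNKNOWN"}

def resolve_symbol(wire_symbol):
    # A v1 consumer maps any symbol it does not know to the designated
    # catch-all instead of failing the read.
    return wire_symbol if wire_symbol in V1_SYMBOLS else "UNKNOWN"

print(resolve_symbol("A"))  # "A"
print(resolve_symbol("C"))  # "UNKNOWN"  (written by a v2 producer)
```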

I like the designated catch-all symbol because it is a reasonable way to
opt-in for forward-compatibility.

rb

On Tue, Jan 31, 2017 at 2:04 AM, Niels Basjes <Ni...@basjes.nl> wrote:




-- 
Ryan Blue
Software Engineer
Netflix