Posted to user@avro.apache.org by Burak Emre <em...@gmail.com> on 2015/02/03 12:57:47 UTC

Adding new field with default value to an Avro schema

> I added a field with a default value to an Avro schema which was previously used for writing data. Is it possible to read the previous data using only the new schema, which has that new field at the end?
> 
> I tried this scenario but unfortunately it throws an EOFException while reading the third field. Even though it has a default value and the previous fields are read successfully, I'm not able to deserialize the record without providing the writer schema I used previously.
> 
> Schema schema = Schema.createRecord("test", null, "avro.test", false);
> schema.setFields(Lists.newArrayList(
>     new Field("project", Schema.create(Type.STRING), null, null),
>     new Field("city", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance())
> ));
> 
> GenericData.Record record = new GenericRecordBuilder(schema)
>     .set("project", "ff").build();
> 
> GenericDatumWriter w = new GenericDatumWriter(schema);
> ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
> BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
> 
> w.write(record, encoder);
> encoder.flush();
> 
> schema = Schema.createRecord("test", null, "avro.test", false);
> schema.setFields(Lists.newArrayList(
>     new Field("project", Schema.create(Type.STRING), null, null),
>     new Field("city", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance()),
>     new Field("newField", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance())
> ));
> 
> DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
> Decoder decoder = DecoderFactory.get().binaryDecoder(outputStream.toByteArray(), null);
> GenericRecord result = reader.read(null, decoder);
> 



Re: Adding new field with default value to an Avro schema

Posted by Doug Cutting <cu...@apache.org>.
On Tue, Feb 3, 2015 at 9:34 AM, Lukas Steiblys <lu...@doubledutch.me> wrote:
> On a related note, is there a tool that can check the backwards
> compatibility of schemas?

https://avro.apache.org/docs/current/api/java/org/apache/avro/SchemaCompatibility.html

Doug
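For reference, a minimal sketch of using that API. The class and schema literals below are illustrative, and this assumes Avro 1.7.2 or later on the classpath; `checkReaderWriterCompatibility` answers whether data written with the writer schema can be read with the reader schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatCheck {
    // Returns COMPATIBLE when every datum written with writerJson
    // can be read using readerJson.
    static SchemaCompatibility.SchemaCompatibilityType check(String readerJson, String writerJson) {
        // Separate parsers: one parser would reject two records with the same name.
        Schema reader = new Schema.Parser().parse(readerJson);
        Schema writer = new Schema.Parser().parse(writerJson);
        return SchemaCompatibility
            .checkReaderWriterCompatibility(reader, writer)
            .getType();
    }

    public static void main(String[] args) {
        String writer = "{\"type\":\"record\",\"name\":\"test\",\"fields\":["
            + "{\"name\":\"project\",\"type\":\"string\"}]}";
        // Same record plus a new field that has a default: a compatible evolution.
        String reader = "{\"type\":\"record\",\"name\":\"test\",\"fields\":["
            + "{\"name\":\"project\",\"type\":\"string\"},"
            + "{\"name\":\"newField\",\"type\":[\"null\",\"string\"],\"default\":null}]}";
        System.out.println(check(reader, writer));
    }
}
```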

Re: Adding new field with default value to an Avro schema

Posted by Sean Busbey <bu...@cloudera.com>.
On Tue, Feb 3, 2015 at 11:34 AM, Lukas Steiblys <lu...@doubledutch.me>
wrote:

>   On a related note, is there a tool that can check the backwards
> compatibility of schemas? I found some old messages talking about it, but
> no actual tool. I guess I could hack it together using some functions in
> the Avro library.
>
> Lukas
>
>
I don't think so, but this would be a great addition to the avro-tools
utility. Would you mind filing a JIRA for it?

-- 
Sean

Re: Adding new field with default value to an Avro schema

Posted by Lukas Steiblys <lu...@doubledutch.me>.
On a related note, is there a tool that can check the backwards compatibility of schemas? I found some old messages talking about it, but no actual tool. I guess I could hack it together using some functions in the Avro library.

Lukas

From: Burak Emre 
Sent: Tuesday, February 3, 2015 9:01 AM
To: user@avro.apache.org 
Subject: Re: Adding new field with default value to an Avro schema

@Sean thanks for the explanation. 

I have multiple writers but only one reader, and the only schema migration operation is adding a new field, so I thought I might use the same schema for all datasets, since the ordering will be the same in all of them even though some may contain extra fields that are also defined in the schema definition.

Actually I wanted to avoid using an external database for sequential schema ids, since it would make the system more complex than it should be in my case, but it seems this is the only option for now.

-- 
Burak Emre
Koc University

On Tuesday 3 February 2015 at 18:22, Sean Busbey wrote:

  Schema evolution in Avro requires access to both the schema used when writing the data and the desired Schema for reading the data. 

  Normally, Avro data is stored in some container format (i.e. the one in the spec[1]) and the parsing library takes care of pulling the schema used when writing out of said container.

  If you are using Avro data in some other location, you must have the writer schema as well. One common use case is a shared messaging system focused on small messages (but that doesn't use Avro RPC). In such cases, Doug Cutting has some guidance he's previously given (quoted with permission, albeit very late):

  > A best practice for things like this is to prefix each Avro record
  > with a (small) numeric schema ID.  This is used as the key for a
  > shared database of schemas.  The schema corresponding to a key never
  > changes, so the database can be cached heavily.  It never gets very
  > big either.  It could be as simple as a .java file, with the
  > constraint that you'd need to upgrade things downstream before
  > upstream, or as complicated as an enterprise-wide REST schema service
  > (AVRO-1124).  A variation is to use schema fingerprints as keys.
  > 
  > Potentially relevant stuff:
  > 
  > https://issues.apache.org/jira/browse/AVRO-1124
  > http://avro.apache.org/docs/current/spec.html#Schema+Fingerprints


  If you take the integer schema ID approach, you can use Avro's built-in utilities for zig-zag encoding, which will ensure that most of the time your identifier only takes a small amount of space.

  [1]: http://avro.apache.org/docs/current/spec.html#Object+Container+Files
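As an aside, the zig-zag varint scheme Avro uses for ints is small enough to sketch standalone (class and method names below are illustrative); it shows why a small schema ID prefix usually costs just one byte on the wire:

```java
public class ZigZagDemo {
    // Zig-zag map: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... so small-magnitude
    // values (positive or negative) become small unsigned values.
    static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Variable-length encode: 7 payload bits per byte,
    // high bit set means "more bytes follow".
    static byte[] writeVarInt(int n) {
        int v = zigZag(n);
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        while ((v & ~0x7f) != 0) {
            out.write((v & 0x7f) | 0x80);
            v >>>= 7;
        }
        out.write(v);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A schema ID of 1 fits in one byte; even 1000 fits in two.
        System.out.println(writeVarInt(1).length);    // 1
        System.out.println(writeVarInt(1000).length); // 2
    }
}
```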


  On Tue, Feb 3, 2015 at 5:57 AM, Burak Emre <em...@gmail.com> wrote:

      I added a field with a default value to an Avro schema which was previously used for writing data. Is it possible to read the previous data using only the new schema, which has that new field at the end?
      I tried this scenario but unfortunately it throws an EOFException while reading the third field. Even though it has a default value and the previous fields are read successfully, I'm not able to deserialize the record without providing the writer schema I used previously.

Schema schema = Schema.createRecord("test", null, "avro.test", false);
schema.setFields(Lists.newArrayList(
    new Field("project", Schema.create(Type.STRING), null, null),
    new Field("city", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance())
));

GenericData.Record record = new GenericRecordBuilder(schema)
    .set("project", "ff").build();

GenericDatumWriter w = new GenericDatumWriter(schema);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);

w.write(record, encoder);
encoder.flush();

schema = Schema.createRecord("test", null, "avro.test", false);
schema.setFields(Lists.newArrayList(
        new Field("project", Schema.create(Type.STRING), null, null),
        new Field("city", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance()),
        new Field("newField", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance())
));

DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(outputStream.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
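For comparison, a sketch of the same round trip with the writer schema handed to the reader alongside the new schema, which is what lets Avro's schema resolution fill in the default. Class names are illustrative, the schema literals mirror the ones in the repro above, and Avro is assumed on the classpath:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionDemo {
    static final String WRITER_JSON =
        "{\"type\":\"record\",\"name\":\"test\",\"namespace\":\"avro.test\",\"fields\":["
      + "{\"name\":\"project\",\"type\":\"string\"},"
      + "{\"name\":\"city\",\"type\":[\"null\",\"string\"],\"default\":null}]}";
    static final String READER_JSON =
        "{\"type\":\"record\",\"name\":\"test\",\"namespace\":\"avro.test\",\"fields\":["
      + "{\"name\":\"project\",\"type\":\"string\"},"
      + "{\"name\":\"city\",\"type\":[\"null\",\"string\"],\"default\":null},"
      + "{\"name\":\"newField\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    static GenericRecord roundTrip() throws Exception {
        Schema writerSchema = new Schema.Parser().parse(WRITER_JSON);
        Schema readerSchema = new Schema.Parser().parse(READER_JSON);

        // Write with the old (writer) schema.
        GenericRecord record = new GenericRecordBuilder(writerSchema).set("project", "ff").build();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
        encoder.flush();

        // The crucial change vs. the repro: pass BOTH schemas to the reader.
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return reader.read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        GenericRecord result = roundTrip();
        System.out.println(result.get("project"));  // "ff"
        System.out.println(result.get("newField")); // null, taken from the default
    }
}
```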




  -- 

  Sean

Re: Adding new field with default value to an Avro schema

Posted by Sean Busbey <bu...@cloudera.com>.
On Tue, Feb 3, 2015 at 11:01 AM, Burak Emre <em...@gmail.com> wrote:

> @Sean thanks for the explanation.
>
> I have multiple writers but only one reader, and the only schema migration
> operation is adding a new field, so I thought I might use the same schema
> for all datasets, since the ordering will be the same in all of them even
> though some may contain extra fields that are also defined in the schema
> definition.
>
> Actually I wanted to avoid using an external database for sequential
> schema ids, since it would make the system more complex than it should be
> in my case, but it seems this is the only option for now.
>
>
>

An external database isn't strictly required. The only important bit is
that each schema have a unique immutable identifier. As Doug mentioned, you
could do this as an enum of schemas in your source code (so long as you
handled updates in reader-then-writer order). Similarly, you could do it by
relying on schema fingerprints and just loading avsc files out of shared
storage.
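A sketch of the fingerprint route (illustrative class name, assuming Avro 1.7+ on the classpath, which provides SchemaNormalization): the 64-bit fingerprint of a schema's parsing canonical form is stable across cosmetic changes, so it can serve as the immutable identifier for a stored .avsc file.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class FingerprintDemo {
    // 64-bit CRC fingerprint of the schema's "parsing canonical form".
    // Whitespace and doc attributes do not affect it.
    static long fingerprint(String schemaJson) {
        Schema schema = new Schema.Parser().parse(schemaJson);
        return SchemaNormalization.parsingFingerprint64(schema);
    }

    public static void main(String[] args) {
        String v1 = "{\"type\":\"record\",\"name\":\"test\",\"fields\":["
            + "{\"name\":\"project\",\"type\":\"string\"}]}";
        // The same schema with extra whitespace fingerprints identically.
        String v1Spaced = "{ \"type\": \"record\", \"name\": \"test\", \"fields\": "
            + "[ { \"name\": \"project\", \"type\": \"string\" } ] }";
        System.out.println(fingerprint(v1) == fingerprint(v1Spaced)); // true
    }
}
```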

-- 
Sean

Re: Adding new field with default value to an Avro schema

Posted by Burak Emre <em...@gmail.com>.
@Sean thanks for the explanation.

I have multiple writers but only one reader, and the only schema migration operation is adding a new field, so I thought I might use the same schema for all datasets, since the ordering will be the same in all of them even though some may contain extra fields that are also defined in the schema definition.

Actually I wanted to avoid using an external database for sequential schema ids, since it would make the system more complex than it should be in my case, but it seems this is the only option for now.

-- 
Burak Emre
Koc University


On Tuesday 3 February 2015 at 18:22, Sean Busbey wrote:

> Schema evolution in Avro requires access to both the schema used when writing the data and the desired Schema for reading the data.
> 
> Normally, Avro data is stored in some container format (i.e. the one in the spec[1]) and the parsing library takes care of pulling the schema used when writing out of said container.
> 
> If you are using Avro data in some other location, you must have the writer schema as well. One common use case is a shared messaging system focused on small messages (but that doesn't use Avro RPC). In such cases, Doug Cutting has some guidance he's previously given (quoted with permission, albeit very late):
> 
> > A best practice for things like this is to prefix each Avro record
> > with a (small) numeric schema ID.  This is used as the key for a
> > shared database of schemas.  The schema corresponding to a key never
> > changes, so the database can be cached heavily.  It never gets very
> > big either.  It could be as simple as a .java file, with the
> > constraint that you'd need to upgrade things downstream before
> > upstream, or as complicated as an enterprise-wide REST schema service
> > (AVRO-1124).  A variation is to use schema fingerprints as keys.
> > 
> > Potentially relevant stuff:
> > 
> > https://issues.apache.org/jira/browse/AVRO-1124
> > http://avro.apache.org/docs/current/spec.html#Schema+Fingerprints
> 
> If you take the integer schema ID approach, you can use Avro's built-in utilities for zig-zag encoding, which will ensure that most of the time your identifier only takes a small amount of space.
> 
> [1]: http://avro.apache.org/docs/current/spec.html#Object+Container+Files
> 
> 
> On Tue, Feb 3, 2015 at 5:57 AM, Burak Emre <emrekabakci@gmail.com (mailto:emrekabakci@gmail.com)> wrote:
> > > I added a field with a default value to an Avro schema which was previously used for writing data. Is it possible to read the previous data using only the new schema, which has that new field at the end?
> > > 
> > > I tried this scenario but unfortunately it throws an EOFException while reading the third field. Even though it has a default value and the previous fields are read successfully, I'm not able to deserialize the record without providing the writer schema I used previously.
> > > 
> > > Schema schema = Schema.createRecord("test", null, "avro.test", false);
> > > schema.setFields(Lists.newArrayList(
> > >     new Field("project", Schema.create(Type.STRING), null, null),
> > >     new Field("city", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance())
> > > ));
> > > 
> > > GenericData.Record record = new GenericRecordBuilder(schema)
> > >     .set("project", "ff").build();
> > > 
> > > GenericDatumWriter w = new GenericDatumWriter(schema);
> > > ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
> > > BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
> > > 
> > > w.write(record, encoder);
> > > encoder.flush();
> > > 
> > > schema = Schema.createRecord("test", null, "avro.test", false);
> > > schema.setFields(Lists.newArrayList(
> > >     new Field("project", Schema.create(Type.STRING), null, null),
> > >     new Field("city", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance()),
> > >     new Field("newField", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance())
> > > ));
> > > 
> > > DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
> > > Decoder decoder = DecoderFactory.get().binaryDecoder(outputStream.toByteArray(), null);
> > > GenericRecord result = reader.read(null, decoder);
> > > 
> > > 
> > 
> > 
> 
> 
> 
> -- 
> Sean 


Re: Adding new field with default value to an Avro schema

Posted by Sean Busbey <bu...@cloudera.com>.
Schema evolution in Avro requires access to both the schema used when
writing the data and the desired Schema for reading the data.

Normally, Avro data is stored in some container format (i.e. the one in the
spec[1]) and the parsing library takes care of pulling the schema used when
writing out of said container.

If you are using Avro data in some other location, you must have the writer
schema as well. One common use case is a shared messaging system focused on
small messages (but that doesn't use Avro RPC). In such cases, Doug Cutting
has some guidance he's previously given (quoted with permission, albeit
very late):

> A best practice for things like this is to prefix each Avro record
> with a (small) numeric schema ID.  This is used as the key for a
> shared database of schemas.  The schema corresponding to a key never
> changes, so the database can be cached heavily.  It never gets very
> big either.  It could be as simple as a .java file, with the
> constraint that you'd need to upgrade things downstream before
> upstream, or as complicated as an enterprise-wide REST schema service
> (AVRO-1124).  A variation is to use schema fingerprints as keys.
>
> Potentially relevant stuff:
>
> https://issues.apache.org/jira/browse/AVRO-1124
> http://avro.apache.org/docs/current/spec.html#Schema+Fingerprints

If you take the integer schema ID approach, you can use Avro's built-in
utilities for zig-zag encoding, which will ensure that most of the time
your identifier only takes a small amount of space.

[1]: http://avro.apache.org/docs/current/spec.html#Object+Container+Files


On Tue, Feb 3, 2015 at 5:57 AM, Burak Emre <em...@gmail.com> wrote:

> I added a field with a default value to an Avro schema which was previously
> used for writing data. Is it possible to read the previous data using *only
> the new schema*, which has that new field at the end?
>
> I tried this scenario but unfortunately it throws an EOFException while
> reading the third field. Even though it has a default value and the previous
> fields are read successfully, I'm not able to deserialize the record without
> providing the writer schema I used previously.
>
> Schema schema = Schema.createRecord("test", null, "avro.test", false);
> schema.setFields(Lists.newArrayList(
>     new Field("project", Schema.create(Type.STRING), null, null),
>     new Field("city", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance())));
> GenericData.Record record = new GenericRecordBuilder(schema)
>     .set("project", "ff").build();
> GenericDatumWriter w = new GenericDatumWriter(schema);
> ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
> BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
>
> w.write(record, encoder);
> encoder.flush();
>
> schema = Schema.createRecord("test", null, "avro.test", false);
> schema.setFields(Lists.newArrayList(
>         new Field("project", Schema.create(Type.STRING), null, null),
>         new Field("city", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance()),
>         new Field("newField", Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL), Schema.create(Type.STRING))), null, NullNode.getInstance())));
> DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
> Decoder decoder = DecoderFactory.get().binaryDecoder(outputStream.toByteArray(), null);
> GenericRecord result = reader.read(null, decoder);
>
>
>


-- 
Sean