Posted to dev@avro.apache.org by Jay Kreps <ja...@gmail.com> on 2012/07/10 19:53:48 UTC

schema repositories?

I noticed in AVRO-1006 there was a mention of standardizing on some kind of
schema repository that would maintain a central set of all versions of a
schema and allow a way to reference schemas by id.

At LinkedIn we have standardized (almost) all of our persistent data on
Avro, and we have a repository like this for managing schemas. Messages are
stored with the schema in Hadoop, but for systems that store rows
independently, like databases or messaging, we instead store a schema id
with each row/message. We would love for there to be an open source version
of this, which would make it possible to open up our other tools for
compatibility checking, ETL, and other things that depend on the service.

The service itself is basically a REST service that maintains schemas. Each
schema has a "source" that it is associated with (the table or messaging
topic or whatever) and a unique id. Schemas can be fetched by id, or you can
get the latest schema for a given source. Having the notion of sources
allows us to do two things: (1) enforce a compatibility model on schema
changes (no backwards-incompatible changes, for various definitions of
backwards compatibility), and (2) allow our Hadoop ETL to project all
messages forward to the latest schema (since AvroFile requires a single
schema, not a per-row schema).
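
To make the shape of such a service concrete, here is a minimal in-memory
sketch of the three operations described above (register a schema under a
source, fetch by id, fetch the latest for a source). It is only an
illustration: the class name, method names, the "page_views" source, and the
use of a single global counter for ids are all assumptions, not details of
the LinkedIn implementation, and a real service would sit behind REST with a
persistent store.

```python
class SchemaRepository:
    """Toy in-memory stand-in for the schema service (hypothetical API)."""

    def __init__(self):
        self._by_id = {}       # schema id -> schema text
        self._by_source = {}   # source name -> ordered list of schema ids
        self._next_id = 0      # assumption: ids come from one global counter

    def register(self, source, schema_text):
        """Append a schema version to a source and return its new id."""
        schema_id = self._next_id
        self._next_id += 1
        self._by_id[schema_id] = schema_text
        self._by_source.setdefault(source, []).append(schema_id)
        return schema_id

    def get(self, schema_id):
        """Fetch a schema by id; raises KeyError for an unknown id."""
        return self._by_id[schema_id]

    def latest(self, source):
        """Return (id, schema) for the newest version of a source."""
        schema_id = self._by_source[source][-1]
        return schema_id, self._by_id[schema_id]


repo = SchemaRepository()
v0 = repo.register(
    "page_views",
    '{"type": "record", "name": "PageView", "fields": []}',
)
v1 = repo.register(
    "page_views",
    '{"type": "record", "name": "PageView", '
    '"fields": [{"name": "url", "type": "string"}]}',
)
latest_id, latest_schema = repo.latest("page_views")
```

A writer would store only the returned id next to each row or message, and a
reader would call `get` with that id to recover the writer's schema.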

If the Avro project is interested in adopting an official repository, that
would be really nice. It is frankly a pretty trivial piece of code, but
standardization would allow interoperability between things. I would be
willing to either open source our repository implementation or do a
from-scratch one if we come up with more requirements.

-Jay

Re: schema repositories?

Posted by Jay Kreps <ja...@gmail.com>.
Cool, I will write up a more detailed proposal and include it in a JIRA.

WRT the "source": yes, we originally started versioning things by the record
name, and we generally still use the convention that the record name matches
the topic/table name. However, two things led us to generalize this. First
is the issue Scott points out: there are schemas that may be shared across
many tables/topics but may evolve at slightly different rates. Second, a
generic utility that needs to handle many such sources must know the source
name (or else how could it get data?) but may not know the type of records
in the source. At that point we realized "source" is just a generic key for
a sequence of schemas which may or may not have some compatibility
guarantees between them. The key can be the record name or any other string
so long as it is consistent. I will try to flesh out some of these details,
as well as some details around id generation, in the proposal.

-Jay



Re: schema repositories?

Posted by Scott Carey <sc...@apache.org>.

On 7/10/12 11:25 AM, "Doug Cutting" <cu...@apache.org> wrote:

>Jay,
>
>This sounds to me like something of general utility that would make a
>great addition to Avro.
>
>To be clear, I assume you mean contributing this as source code for a
>service that folks can deploy, right?  For example, it might be a Java
>project that builds a WAR file that, when deployed, presents a REST
>front end and talks to a backing persistence layer where the schemas
>are stored.  Is that right?
>
>Also note that Avro recently added a standard facility for defining
>Schema fingerprints that might be used as Schema IDs in such a
>service:
>
>http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
>
>This has currently been implemented in Java and C#:
>
>http://avro.apache.org/docs/current/api/java/org/apache/avro/SchemaNormalization.html
>
>I like the notion of a Schema source for the uses you describe.  For
>records might this simply be the fully-qualified record name?  For
>unions and other unnamed types it might take the same form as a record
>name.  We could use the "long form" for primitives and always supply a
>name so that a string schema might be {"type":"string",
>"name":"org.foo.Bar"}.  Would this work, or is there some other
>structure and use of sources for which schema names are not a good
>match?

If your source is a database table or pub-sub topic, then multiple
sources might have overlapping schemas at different points in time.  Two
topics might share a schema, and their schema evolution may or may not
diverge over time.  The fully-qualified name of a record might be
appropriate in some cases to capture this, but not all.
Capturing the sequence allows you to eagerly ensure that your current
schema is compatible with the entire history of the source's schema
evolution.
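
As a rough illustration of checking a candidate schema against the whole
history of a source, the sketch below uses one deliberately simplified rule:
a reader schema can read data written with an older schema if every reader
field either exists in the writer schema or carries a default. Real Avro
schema resolution also covers type promotion, aliases, unions, and nested
types, none of which is handled here. The function names and the example
schemas are hypothetical.

```python
def can_read(reader, writer):
    """Simplified check: each reader field is in the writer or has a default."""
    writer_fields = {field["name"] for field in writer["fields"]}
    return all(
        field["name"] in writer_fields or "default" in field
        for field in reader["fields"]
    )


def compatible_with_history(candidate, history):
    """True if the candidate can read data written with every past schema."""
    return all(can_read(candidate, old) for old in history)


# Two past versions of a hypothetical record schema for one source.
history = [
    {"type": "record", "name": "PageView",
     "fields": [{"name": "url", "type": "string"}]},
    {"type": "record", "name": "PageView",
     "fields": [{"name": "url", "type": "string"},
                {"name": "referrer", "type": "string", "default": ""}]},
]

# Adds a field with a default: old data can still be read.
good = {"type": "record", "name": "PageView",
        "fields": [{"name": "url", "type": "string"},
                   {"name": "referrer", "type": "string", "default": ""},
                   {"name": "user_id", "type": "long", "default": 0}]}

# Adds a field without a default: data written with older schemas lacks it.
bad = {"type": "record", "name": "PageView",
       "fields": [{"name": "url", "type": "string"},
                  {"name": "user_id", "type": "long"}]}

good_ok = compatible_with_history(good, history)
bad_ok = compatible_with_history(bad, history)
```

Running the check at registration time is what makes the guarantee eager:
an incompatible schema is rejected before any data is written with it.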

I suppose there could be support for fingerprints here, but it does not
seem to be required, and in some cases a generated sequential ID will take
up a lot less space if stored with each record.  A typical source might
only change its schema once a month, and each record needs to have the id
portion of a (source, id) pair stored with it, which will typically be 1
to 2 bytes.
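
The size argument can be checked with a short calculation. The sketch below
assumes the id is written using Avro's own variable-length zig-zag encoding
for longs; the message above does not say which encoding would be used, so
that is an assumption, as is the function name. Under that encoding, ids 0
through 63 take one byte and ids up to 8191 take two, compared with 8 bytes
for a 64-bit fingerprint, 16 for MD5, and 32 for SHA-256.

```python
def avro_long_size(n):
    """Bytes needed to encode n as an Avro long (zig-zag, 7 bits per byte)."""
    # Zig-zag maps small negative and positive values to small unsigned ones.
    zigzag = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    size = 1
    while zigzag >= 0x80:
        zigzag >>= 7
        size += 1
    return size


# A source that changes its schema once a month for ten years has 120 ids.
size_after_ten_years = avro_long_size(120)

# Fixed sizes of the fingerprint alternatives, in bytes.
FINGERPRINT_SIZES = {"CRC-64-AVRO": 8, "MD5": 16, "SHA-256": 32}
```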




Re: schema repositories?

Posted by Doug Cutting <cu...@apache.org>.
Jay,

This sounds to me like something of general utility that would make a
great addition to Avro.

To be clear, I assume you mean contributing this as source code for a
service that folks can deploy, right?  For example, it might be a Java
project that builds a WAR file that, when deployed, presents a REST
front end and talks to a backing persistence layer where the schemas
are stored.  Is that right?

Also note that Avro recently added a standard facility for defining
Schema fingerprints that might be used as Schema IDs in such a
service:

http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas

This has currently been implemented in Java and C#:

http://avro.apache.org/docs/current/api/java/org/apache/avro/SchemaNormalization.html
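
For readers without the Java or C# library at hand, the 64-bit fingerprint
described in that section of the spec can be sketched in a few lines of
Python. This follows the CRC-64-AVRO algorithm as given in the spec, but it
fingerprints whatever bytes it is handed: it does not perform the Parsing
Canonical Form transformation, so the caller is assumed to pass text that is
already canonical. The function and variable names are chosen for this
sketch.

```python
# Seed value the spec defines as the fingerprint of the empty input.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the 256-entry lookup table for the byte-wise CRC."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # XOR in the polynomial only when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
        table.append(fp)
    return table


FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """Return the 64-bit fingerprint of data as an unsigned integer."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


# The canonical form of a primitive schema is just its quoted name.
string_fp = fingerprint64('"string"'.encode("utf-8"))
```

A repository could use this value as the schema id directly, at the cost of
8 bytes per record, or use it only to detect that a submitted schema is
already registered.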

I like the notion of a Schema source for the uses you describe.  For
records might this simply be the fully-qualified record name?  For
unions and other unnamed types it might take the same form as a record
name.  We could use the "long form" for primitives and always supply a
name so that a string schema might be {"type":"string",
"name":"org.foo.Bar"}.  Would this work, or is there some other
structure and use of sources for which schema names are not a good
match?

Cheers,

Doug


Re: schema repositories?

Posted by Doug Cutting <cu...@apache.org>.
On Tue, Jul 10, 2012 at 1:37 PM, Scott Carey <sc...@apache.org> wrote:
> Please use AVRO-1006 or if it does not seem appropriate create
> another.

A new issue would be better.  AVRO-1006 is related but is fixed and closed.

Doug

Re: schema repositories?

Posted by Scott Carey <sc...@apache.org>.
Jay,

This would be fantastic.  I can sponsor getting this into Avro and help
out.  Please use AVRO-1006 or if it does not seem appropriate create
another.

I would like to start with what you have; since it is used in production,
it seems like the right starting place.  But the community will
need to form consensus on what to do; we will discuss that in the JIRA.


Thanks!

-Scott

