Posted to user@avro.apache.org by as...@strandls.com on 2011/03/23 19:56:25 UTC

Schema registry

Hi,

Is there a Java implementation of an Avro schema registry? The use case is
to have separate schema data files for a bunch of types and be able to
resolve nested types.

I tried Avro for the first time and could not get a schema parsed from one
file to include a nested record from a schema described in a second file.

I am using a modified version of the AvroUtil class from
http://www.infoq.com/articles/ApacheAvro . The modified file is attached.
It uses the SchemaParseException and loads schema files from the classpath.

Is there a better alternative? If this is a strong use case I could work
on creating such a schema registry with pluggable resolvers and loaders.

Thanks and regards,
 - Ashish

Re: Schema registry

Posted by Stu Hood <st...@gmail.com>.
> This full schema should either go with the data (data files) or in a
> registry (e.g. HAvroBase).

Isn't the latter what they want? A registry?

Presumably the RPC framework implements such a registry, since it can look
schemas up by their hashcode.

On Thu, Mar 24, 2011 at 2:04 PM, Scott Carey <sc...@richrelevance.com> wrote:

> There is danger in this.
>
> What is the schema used for in this case?  There are three common reasons
> for assembling a schema:
> 1.  Assembling the schema that represents the format of the data to be
> written.
> 2.  Assembling the schema that represents the way a reader wishes to view
> the data. (a.k.a. 'reader' or 'expected' schema).
> 3.  Assembling the schema that represents the way that some data was
> persisted.
>
> If you are persisting data, you should persist the _entire_ schema used to
> write that data as well.  This full schema should either go with the data
> (data files) or in a registry (e.g. HAvroBase).  A schema name reference
> is not sufficient -- you lose the ability to evolve the referenced schema.
>
> What if the version of the nested schema has changed?  Now you have a data
> file that refers to a nested schema by name "com.navteq.avro.FacebookUser"
> and finds a schema with that name through some resolution mechanism.  If
> that resolution mechanism is not version-aware, you're in trouble.
>
> So for #3, assembling schema fragments by reference is dangerous and
> complicated.
> Making the resolution mechanism version-aware is problematic but doable.
> You can manually version every schema with a number, and use that, but
> then you are manually versioning schemas and storing the version meta-data
> in the schemas.
>
> Avro by nature versions schemas by equivalence.  The natural way to encode
> a schema version is to write the schema itself.
>
> In short: Any such registry would have to be version-aware if it is used
> to assemble schemas for use case #3 above, and the schemas that refer to
> these versions would also have to be version-aware.  It is much simpler to
> just embed the schemas.
>
> Use cases #1 and #2 above are essentially the assembly of the 'current'
> schema version, and a registry could work.  Avro does not have many
> built-in tools for this.  Generally, avsc, avpr, or avdl files are used as
> schema source for 'schema first' design, and 'code first' design persists
> the current schema in the code.
> avdl files support includes; avsc and avpr are more primitive.
>

Re: Schema registry

Posted by Scott Carey <sc...@richrelevance.com>.
There is danger in this.

What is the schema used for in this case?  There are three common reasons
for assembling a schema:
1.  Assembling the schema that represents the format of the data to be
written.
2.  Assembling the schema that represents the way a reader wishes to view
the data. (a.k.a. 'reader' or 'expected' schema).
3.  Assembling the schema that represents the way that some data was
persisted.

If you are persisting data, you should persist the _entire_ schema used to
write that data as well.  This full schema should either go with the data
(data files) or in a registry (e.g. HAvroBase).  A schema name reference
is not sufficient -- you lose the ability to evolve the referenced schema.

What if the version of the nested schema has changed?  Now you have a data
file that refers to a nested schema by name "com.navteq.avro.FacebookUser"
and finds a schema with that name through some resolution mechanism.  If
that resolution mechanism is not version-aware, you're in trouble.

So for #3, assembling schema fragments by reference is dangerous and
complicated.
Making the resolution mechanism version-aware is problematic but doable.
You can manually version every schema with a number, and use that, but
then you are manually versioning schemas and storing the version meta-data
in the schemas.

Avro by nature versions schemas by equivalence.  The natural way to encode
a schema version is to write the schema itself.

In short: Any such registry would have to be version-aware if it is used
to assemble schemas for use case #3 above, and the schemas that refer to
these versions would also have to be version-aware.  It is much simpler to
just embed the schemas.

Use cases #1 and #2 above are essentially the assembly of the 'current'
schema version, and a registry could work.  Avro does not have many
built-in tools for this.  Generally, avsc, avpr, or avdl files are used as
schema source for 'schema first' design, and 'code first' design persists
the current schema in the code.
avdl files support includes; avsc and avpr are more primitive.
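
A minimal sketch of a registry that "versions by equivalence", as described
above: it stores the complete schema text and keys it by a hash of that
text, so two versions of the same named schema never collide. This is an
illustration only. FingerprintRegistry and its methods are hypothetical
names, not part of Avro, and hashing the raw JSON text is an assumption (a
real registry would probably normalize whitespace and field order first).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not an Avro API: full schemas keyed by content hash.
class FingerprintRegistry {
    private final Map<String, String> schemasByFingerprint = new HashMap<>();

    /** Stores the complete schema text and returns its fingerprint. */
    String register(String fullSchemaJson) {
        String fp = fingerprint(fullSchemaJson);
        schemasByFingerprint.put(fp, fullSchemaJson);
        return fp;
    }

    /** Returns the exact schema text registered under the fingerprint, or null. */
    String lookup(String fingerprint) {
        return schemasByFingerprint.get(fingerprint);
    }

    /** SHA-256 of the raw schema text, hex-encoded (assumption: no normalization). */
    static String fingerprint(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Persisted data would then carry the fingerprint returned by register()
rather than a bare name such as "com.navteq.avro.FacebookUser", and a reader
would call lookup() to get back the exact schema the data was written with.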


On 3/23/11 10:21 PM, "Ashish Shinde" <as...@strandls.com> wrote:

>Hi,
>
>My use case is very similar to the nested schema in
>the test case AvroUtilsTest on http://www.infoq.com/articles/ApacheAvro
>
>The only difference is I would like to automatically load schemas from
>resources on the classpath and also automatically load schemas
>for nested types.
>
>If you look at the test example mentioned above, if I ask the
>"AvroSchemaRegistry" for a schema named
>com.navteq.avro.FacebookSpecialUser it should also load the nested
>com.navteq.avro.FacebookUser schema using some resolving and loading
>mechanism.
>
>Thanks and regards,
>- Ashish


Re: Schema registry

Posted by Ashish Shinde <as...@strandls.com>.
Hi,

My use case is very similar to the nested schema in 
the test case AvroUtilsTest on http://www.infoq.com/articles/ApacheAvro

The only difference is I would like to automatically load schemas from
resources on the classpath and also automatically load schemas
for nested types.

If you look at the test example mentioned above, if I ask the
"AvroSchemaRegistry" for a schema named
com.navteq.avro.FacebookSpecialUser it should also load the nested
com.navteq.avro.FacebookUser schema using some resolving and loading
mechanism.
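
A rough sketch of the resolving step only, under stated assumptions: given a
map from each schema name to the names it references, compute an order in
which nested types come before the schemas that use them. SchemaLoadOrder
and the references map are hypothetical; extracting the references from the
schema files and loading the files from the classpath are not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not an Avro API: dependency-first ordering of schema names.
class SchemaLoadOrder {
    /**
     * Returns every schema that the given name depends on, followed by the
     * name itself, so each nested type appears before the schema using it.
     */
    static List<String> resolve(String name, Map<String, List<String>> references) {
        Set<String> ordered = new LinkedHashSet<>();
        visit(name, references, new LinkedHashSet<>(), ordered);
        return new ArrayList<>(ordered);
    }

    private static void visit(String name, Map<String, List<String>> references,
                              Set<String> inProgress, Set<String> ordered) {
        // Skip names already emitted, and recursive references back to a
        // schema that is still being visited.
        if (ordered.contains(name) || !inProgress.add(name)) {
            return;
        }
        for (String dep : references.getOrDefault(name, Collections.emptyList())) {
            visit(dep, references, inProgress, ordered);
        }
        inProgress.remove(name);
        ordered.add(name);
    }
}
```

For the example above, asking for com.navteq.avro.FacebookSpecialUser would
yield com.navteq.avro.FacebookUser first and FacebookSpecialUser second; the
schema files could then be parsed in that order.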

Thanks and regards,
- Ashish



On Thu, 24 Mar 2011 10:38:20 +0800
Felix Xu <yg...@gmail.com> wrote:

> Hi, I don't quite understand the question.
> Can you give an example of your schema?


Re: Schema registry

Posted by Felix Xu <yg...@gmail.com>.
Hi, I don't quite understand the question.
Can you give an example of your schema?

2011/3/24 <as...@strandls.com>

> Hi,
>
> Is there a Java implementation of an Avro schema registry? The use case is
> to have separate schema data files for a bunch of types and be able to
> resolve nested types.
>
> I tried Avro for the first time and could not get a schema parsed from one
> file to include a nested record from a schema described in a second file.
>
> I am using a modified version of the AvroUtil class from
> http://www.infoq.com/articles/ApacheAvro . The modified file is attached.
> It uses the SchemaParseException and loads schema files from the classpath.
>
> Is there a better alternative? If this is a strong use case I could work
> on creating such a schema registry with pluggable resolvers and loaders.
>
> Thanks and regards,
>  - Ashish
>

Re: Schema registry

Posted by Ashish Shinde <as...@strandls.com>.
Hi,

Thanks, I wasn't aware of Avro IDL and preprocessors. I will give it a
try.


Thanks and regards,
- Ashish

On Thu, 24 Mar 2011 09:29:57 -0700
Doug Cutting <cu...@apache.org> wrote:

> Avro in several places requires that schemas are self-contained.  For
> example, when reading a data file the schema that was used to write it
> must be available and should not be dynamically re-constructed from
> references to schemas in the reader's environment.  So, if such a
> registry were implemented, it should perhaps only be used when parsing
> schemas, not when printing them, and, even then, only in some
> contexts.
> 
> It's thus perhaps safer and simpler to handle this as a pre-processing
> step.  One can, e.g., use a preprocessor like cpp or m4 to generate
> schemas from input files.  For example, one could have a file named
> md5.avph that contains:
> 
> #define MD5 {"type": "record", ... }
> 
> And another file named Foo.avpp that contains:
> 
> #include "md5.avph"
> 
> {"type": "record", "name":"Foo", "fields": [
>   {"name":"checksum", "type": MD5 }
>  ]
> }
> 
> Then your build process can run cpp over .avpp files to generate the
> .avsc files that can be used by Avro.
> 
> Also note that Avro IDL supports imports:
> 
> http://avro.apache.org/docs/current/idl.html#imports
> 
> One can use an IDL file with no messages to define a set of types.
> 
> Doug


Re: Schema registry

Posted by Doug Cutting <cu...@apache.org>.
Avro in several places requires that schemas are self-contained.  For
example, when reading a data file the schema that was used to write it
must be available and should not be dynamically re-constructed from
references to schemas in the reader's environment.  So, if such a
registry were implemented, it should perhaps only be used when parsing
schemas, not when printing them, and, even then, only in some contexts.

It's thus perhaps safer and simpler to handle this as a pre-processing
step.  One can, e.g., use a preprocessor like cpp or m4 to generate
schemas from input files.  For example, one could have a file named
md5.avph that contains:

#define MD5 {"type": "record", ... }

And another file named Foo.avpp that contains:

#include "md5.avph"

{"type": "record", "name":"Foo", "fields": [
  {"name":"checksum", "type": MD5 }
 ]
}

Then your build process can run cpp over .avpp files to generate the
.avsc files that can be used by Avro.
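
For builds where cpp or m4 is not at hand, the same pre-processing idea can
be sketched in plain Java. This is a minimal illustration, not an Avro
feature: SchemaPreprocessor, readDefines and expand are hypothetical names.
Unlike cpp, the sketch does not handle #include and will also replace a
macro name that appears inside a JSON string literal in the template.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: expands "#define NAME body" macros into schema text.
class SchemaPreprocessor {
    // Matches lines of the form: #define NAME replacement-text
    private static final Pattern DEFINE =
            Pattern.compile("^#define\\s+(\\w+)\\s+(.*)$");

    /** Collects NAME -> body pairs from the "#define" lines of a header. */
    static Map<String, String> readDefines(String header) {
        Map<String, String> defines = new LinkedHashMap<>();
        for (String line : header.split("\\R")) {
            Matcher m = DEFINE.matcher(line.trim());
            if (m.matches()) {
                defines.put(m.group(1), m.group(2).trim());
            }
        }
        return defines;
    }

    /** Replaces each whole-word macro name in the template with its body. */
    static String expand(String template, Map<String, String> defines) {
        String out = template;
        for (Map.Entry<String, String> e : defines.entrySet()) {
            // quoteReplacement keeps '$' and '\' in the body from being
            // interpreted as regex replacement syntax.
            out = out.replaceAll("\\b" + Pattern.quote(e.getKey()) + "\\b",
                    Matcher.quoteReplacement(e.getValue()));
        }
        return out;
    }
}
```

A build step would read the header text, call readDefines on it, call
expand on the contents of each template file, and write the result out as
the .avsc file that Avro consumes.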

Also note that Avro IDL supports imports:

http://avro.apache.org/docs/current/idl.html#imports

One can use an IDL file with no messages to define a set of types.

Doug

On 03/23/2011 11:56 AM, ashish@strandls.com wrote:
> Hi,
> 
> Is there a Java implementation of an Avro schema registry? The use case is
> to have separate schema data files for a bunch of types and be able to
> resolve nested types.
>
> I tried Avro for the first time and could not get a schema parsed from one
> file to include a nested record from a schema described in a second file.
>
> I am using a modified version of the AvroUtil class from
> http://www.infoq.com/articles/ApacheAvro . The modified file is attached.
> It uses the SchemaParseException and loads schema files from the classpath.
>
> Is there a better alternative? If this is a strong use case I could work
> on creating such a schema registry with pluggable resolvers and loaders.
> 
> Thanks and regards,
>  - Ashish