Posted to user@avro.apache.org by Bill Graham <bi...@gmail.com> on 2011/08/09 20:15:29 UTC

Combining schemas

Hi,

I'm trying to create a schema that references a type defined in another
schema and I'm having some trouble. Is there an easy way to do this?

My test schemas look like this:

$ cat position.avsc
{"type":"enum", "name": "Position", "namespace": "avro.examples.baseball",
 "symbols": ["P", "C", "B1", "B2", "B3", "SS", "LF", "CF", "RF", "DH"]
}

$ cat player.avsc
{"type":"record", "name":"Player", "namespace": "avro.examples.baseball",
 "fields": [
  {"name": "number", "type": "int"},
  {"name": "first_name", "type": "string"},
  {"name": "last_name", "type": "string"},
  {"name": "position", "type": {"type": "array", "items":
"avro.examples.baseball.Position"} }
 ]
}

I've read this thread (
http://apache-avro.679487.n3.nabble.com/How-to-reference-previously-defined-enum-in-avsc-file-td2663512.html)
and tried using IDL like so with no luck:

$ cat baseball.avdl
@namespace("avro.examples.baseball")
protocol Baseball {
  import schema "position.avsc";
  import schema "player.avsc";
}

$ java -jar avro-tools-1.5.1.jar idl  baseball.avdl baseball.avpr
Exception in thread "main" org.apache.avro.SchemaParseException: Undefined name: "avro.examples.baseball.Position"
        at org.apache.avro.Schema.parse(Schema.java:979)
        at org.apache.avro.Schema.parse(Schema.java:1052)
        at org.apache.avro.Schema.parse(Schema.java:1021)
        at org.apache.avro.Schema.parse(Schema.java:884)
        at org.apache.avro.compiler.idl.Idl.ImportSchema(Idl.java:388)
        at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:320)
        at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:206)
        at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:84)
        ...


I also saw this blog post (
http://www.infoq.com/articles/ApacheAvro#_ftnref6_7758) where the author had
to write some nasty String.replace(..) code to combine schemas, but there's
got to be a better way than this.

Also FYI, it seems enum values can't start with numbers (e.g. '1B'). Is this
a known issue or a feature? I haven't seen it documented anywhere. You get an
error like this if the value starts with a number:

org.apache.avro.SchemaParseException: Illegal initial character

thanks,
Bill

Re: Combining schemas

Posted by Scott Carey <sc...@apache.org>.

On 8/9/11 5:53 PM, "Bill Graham" <bi...@gmail.com> wrote:

> I see that Doug's already got an IDL patch in place for AVRO-872, so I'll take
> a look at that (tonight hopefully) and comment on it. Maybe these two items
> are one and the same though...
> 
> Scott I like your approach, but I wonder if it could be simplified. Maybe we
> add two new methods like this instead:
> 
> public Schema Schema.parse(File[] files); // (1)
> public Schema Schema.parse(File[] files, Map<Name, Schema> context); // (2)
> 
> 1 - This allows the caller to not have to make multiple calls and handle state
> if they don't want to.
> 2 - This does the same, but also exposes the context, which would be handy to
> easily grab a specific schema. Now that I say it I wonder if (1) would ever be
> useful on its own without having the context?

I agree, the API can be made much simpler for common use cases.  At this
point, the right place for continued conversation and ideas is AVRO-872
https://issues.apache.org/jira/browse/AVRO-872

Re: Combining schemas

Posted by Bill Graham <bi...@gmail.com>.

I see that Doug's already got an IDL patch in place for AVRO-872, so I'll
take a look at that (tonight hopefully) and comment on it. Maybe these two
items are one and the same though...

Scott I like your approach, but I wonder if it could be simplified. Maybe we
add two new methods like this instead:

public Schema Schema.parse(File[] files); // (1)
public Schema Schema.parse(File[] files, Map<Name, Schema> context); // (2)

1 - This allows the caller to not have to make multiple calls and handle
state if they don't want to.
2 - This does the same, but also exposes the context, which would be handy
to easily grab a specific schema. Now that I say it I wonder if (1) would
ever be useful on its own without having the context?


On Tue, Aug 9, 2011 at 2:20 PM, Scott Carey <sc...@apache.org> wrote:

> [ ... ]

Re: Combining schemas

Posted by Scott Carey <sc...@apache.org>.

On 8/9/11 1:45 PM, "Bill Graham" <bi...@gmail.com> wrote:

> Thanks Scott and Doug, see follow up below.
> 
> On Tue, Aug 9, 2011 at 11:42 AM, Scott Carey <sc...@apache.org> wrote:
>> On 8/9/11 11:15 AM, "Bill Graham" <bi...@gmail.com> wrote:
>> 
>>> 
>>  Using the lower level Avro API you can parse the files yourself in an order
>> that will work.
>   
> How exactly would the approach where you parse files in
> reverse-dependency order work? This is something I'd like to explore and maybe
> contribute a helper for. I've tried a few combinations of this approach to no
> avail:
> 
>         Schema schema1 = Schema.parse(new
> File("examples/java/avro/position.avsc"));
>         Schema schema2 = schema1.parse(new
> File("examples/java/avro/player.avsc"));
> 

I was mistaken.  The methods to do this are not public.

The internals of the parser in Schema.java use a hidden inner class, Names.
Names is a LinkedHashMap<Name, Schema> .

It should be relatively easy to add a signature to Schema.java along the
lines:

parse(File schema, Map<Name, Schema> context);

This would: 
  for the given context:
     validate it (each Name should match a Named schema with the same name)
     place names into a Names instance and parse the schema with the Names
context
  on return from the internal parse, copy the contents of the Names object
into the provided context
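
The flow above can be modelled without touching Avro at all. What follows is
only a toy sketch under stated assumptions: Def stands in for a named schema (a
full name plus the names it refers to), a plain Map<String, Def> stands in for
Map<Name, Schema>, and none of these names exist in Avro.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SharedContextSketch {
    // Toy stand-in for a named schema: its full name and the full names it references.
    static final class Def {
        final String name;
        final List<String> refs;

        Def(String name, String... refs) {
            this.name = name;
            this.refs = Arrays.asList(refs);
        }
    }

    // Toy version of the proposed parse(File, Map<Name, Schema>) flow (hypothetical).
    static Def parse(Def def, Map<String, Def> context) {
        // Validate the context: every key must match the name of its value.
        for (Map.Entry<String, Def> e : context.entrySet()) {
            if (!e.getKey().equals(e.getValue().name)) {
                throw new IllegalArgumentException("Context key does not match name: " + e.getKey());
            }
        }
        // Seed a working copy, the role the Names instance plays in the description.
        Map<String, Def> names = new LinkedHashMap<>(context);
        // "Parse": every referenced name must already be known.
        for (String ref : def.refs) {
            if (!names.containsKey(ref)) {
                throw new IllegalStateException("Undefined name: " + ref);
            }
        }
        names.put(def.name, def);
        // Copy everything known after the parse back into the caller's context.
        context.putAll(names);
        return def;
    }

    public static void main(String[] args) {
        Map<String, Def> context = new LinkedHashMap<>();
        parse(new Def("avro.examples.baseball.Position"), context);
        parse(new Def("avro.examples.baseball.Player", "avro.examples.baseball.Position"), context);
        System.out.println(context.keySet());
    }
}
```

Parsing Player before Position in this model fails with "Undefined name",
which mirrors the error in the original report; the shared context is what
lets the second call see what the first one defined.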

Alternatively we could make the Names class public, clean it up and document
it, and expose parse methods that take Names objects.


Then you could do:

Map<Name, Schema> context = new HashMap<Name, Schema>();
Schema schema1 = Schema.parse(new File("examples/java/avro/position.avsc"),
    context);
Schema schema2 = Schema.parse(new File("examples/java/avro/player.avsc"),
    context);

Re: Combining schemas

Posted by Bill Graham <bi...@gmail.com>.

Thanks Scott and Doug, see follow up below.

On Tue, Aug 9, 2011 at 11:42 AM, Scott Carey <sc...@apache.org> wrote:

> On 8/9/11 11:15 AM, "Bill Graham" <bi...@gmail.com> wrote:
>
> Hi,
>
> I'm trying to create a schema that references a type defined in another
> schema and I'm having some trouble. Is there an easy way to do this?
>
> My test schemas look like this:
>
> $ cat position.avsc
> {"type":"enum", "name": "Position", "namespace": "avro.examples.baseball",
>  "symbols": ["P", "C", "B1", "B2", "B3", "SS", "LF", "CF", "RF", "DH"]
> }
>
> $ cat player.avsc
> {"type":"record", "name":"Player", "namespace": "avro.examples.baseball",
>  "fields": [
>   {"name": "number", "type": "int"},
>   {"name": "first_name", "type": "string"},
>   {"name": "last_name", "type": "string"},
>   {"name": "position", "type": {"type": "array", "items":
> "avro.examples.baseball.Position"} }
>  ]
> }
>
> I've read this thread (
> http://apache-avro.679487.n3.nabble.com/How-to-reference-previously-defined-enum-in-avsc-file-td2663512.html)
> and tried using IDL like so with no luck:
>
> $ cat baseball.avdl
> @namespace("avro.examples.baseball")
> protocol Baseball {
>   import schema "position.avsc";
>   import schema "player.avsc";
> }
>
> $ java -jar avro-tools-1.5.1.jar idl  baseball.avdl baseball.avpr
> Exception in thread "main" org.apache.avro.SchemaParseException: Undefined
> name: "avro.examples.baseball.Position"
>         at org.apache.avro.Schema.parse(Schema.java:979)
>         at org.apache.avro.Schema.parse(Schema.java:1052)
>         at org.apache.avro.Schema.parse(Schema.java:1021)
>         at org.apache.avro.Schema.parse(Schema.java:884)
>         at org.apache.avro.compiler.idl.Idl.ImportSchema(Idl.java:388)
>         at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:320)
>         at
> org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:206)
>         at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:84)
>         ...
>
>
> I agree that the documentation indicates that this should work.  I suspect
> that it may not be able to resolve dependencies among imports.  That is if
> Baseball depends on position, and on player, it works.  But since player
> depends on position, it does not.  The import statement pulls in each item
> individually for use in composite things in the AvroIDL, but does not allow
> for interdependencies in the imports.
> This seems worthy of a JIRA enhancement request.  I'm sure the project will
> accept a patch that adds this.
>
>
Done:  https://issues.apache.org/jira/browse/AVRO-872


>
> I also saw this blog post (
> http://www.infoq.com/articles/ApacheAvro#_ftnref6_7758) where the author
> had to write some nasty String.replace(..) code to combine schemas, but
> there's got to be a better way than this.
>
>
> We need to improve the ability to import multiple files when parsing.
>  Using the lower level Avro API you can parse the files yourself in an order
> that will work.
> I have simply put all my types in one file.  If you made one avsc file with
> both Position and Player in a JSON array it will compile.  It would look
> like:
> [
>   < position schema here>,
>   < player schema here>
> ]
>

Yes, I've used this approach in the past. Initially I was thinking that I
could write something to combine multiple files into a single InputStream
facade that generates a union like you describe, which could then be parsed.
I could then hold a handle to the union schema and provide a method to get a
given schema type (e.g. the Player) by name. This is better than the String
replace(..) approach, but still a bit hacky.
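
A minimal sketch of that facade, assuming the schema texts are already in
memory and listed dependencies-first. UnionSketch, asUnion and asUnionStream
are made-up names for illustration, not Avro APIs; handing the stream to a
parser and looking up Player by name in the resulting union are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class UnionSketch {
    // Wraps the given schema texts in a JSON array: [a,\nb,\n...].
    // The caller must list each schema after the schemas it refers to.
    static String asUnion(List<String> schemaJson) {
        return "[" + String.join(",\n", schemaJson) + "]";
    }

    // The same text exposed as a single InputStream.
    static InputStream asUnionStream(List<String> schemaJson) {
        return new ByteArrayInputStream(asUnion(schemaJson).getBytes(StandardCharsets.UTF_8));
    }

    // Usage: java UnionSketch position.avsc player.avsc
    public static void main(String[] args) throws IOException {
        List<String> schemas = new ArrayList<>();
        for (String path : args) {
            schemas.add(new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8).trim());
        }
        System.out.println(asUnion(schemas));
    }
}
```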

> Using the lower level Avro API you can parse the files yourself in an order
> that will work.


How exactly would the approach where you parse files in
reverse-dependency order work? This is something I'd like to explore and
maybe contribute a helper for. I've tried a few combinations of this
approach to no avail:

        Schema schema1 = Schema.parse(new File("examples/java/avro/position.avsc"));
        Schema schema2 = schema1.parse(new File("examples/java/avro/player.avsc"));
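
Working out the order itself is the easy part. A minimal sketch, assuming the
file-to-file dependencies are already known and supplied by hand
(ParseOrderSketch and parseOrder are hypothetical names, not Avro APIs). It
only computes an order in which every file comes after the files it depends
on; it does not parse anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ParseOrderSketch {
    // deps maps each schema file to the files whose types it references.
    // Returns the files ordered so that dependencies come first.
    static List<String> parseOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String file : deps.keySet()) {
            visit(file, deps, done, inProgress, order);
        }
        return order;
    }

    // Depth-first walk: emit a file only after everything it depends on.
    private static void visit(String file, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(file)) {
            return;
        }
        if (!inProgress.add(file)) {
            throw new IllegalStateException("Circular dependency involving " + file);
        }
        for (String dep : deps.getOrDefault(file, Collections.<String>emptyList())) {
            visit(dep, deps, done, inProgress, order);
        }
        inProgress.remove(file);
        done.add(file);
        order.add(file);
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("player.avsc", Arrays.asList("position.avsc"));
        deps.put("position.avsc", Collections.<String>emptyList());
        System.out.println(parseOrder(deps)); // [position.avsc, player.avsc]
    }
}
```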




>
>
> Also FYI, it seems enum values can't start with numbers (e.g. '1B'). Is
> this a known issue or a feature? I haven't seen it documented anywhere. You
> get an error like this if the value starts with a number:
>
> org.apache.avro.SchemaParseException: Illegal initial character
>
>
>
> Enums are a named type.  The enum names must start with [A-Za-z_]  and
> subsequently contain only [A-Za-z0-9_].
> http://avro.apache.org/docs/1.5.1/spec.html#Names
>

I hadn't noticed that before, thanks.


>
> However, the spec does not say that the values must have such restrictions.
>  This may be a bug, can you file a JIRA ticket?
>

Done: https://issues.apache.org/jira/browse/AVRO-871


>
> Thanks!
>
> -Scott
>
>
> thanks,
> Bill
>
>

Re: Combining schemas

Posted by Doug Cutting <cu...@apache.org>.

On 08/09/2011 11:42 AM, Scott Carey wrote:
> On 8/9/11 11:15 AM, "Bill Graham" <bi...@gmail.com> wrote:
[ ... ]
>     Also FYI, it seems enum values can't start with numbers (e.g. '1B').
>     Is this a known issue or a feature? I haven't seen it documented
>     anywhere. You get an error like this if the value starts with a number:
> 
>     org.apache.avro.SchemaParseException: Illegal initial character
> 
> Enums are a named type.  The enum names must start with [A-Za-z_]  and
> subsequently contain only [A-Za-z0-9_].
> http://avro.apache.org/docs/1.5.1/spec.html#Names
> 
> However, the spec does not say that the values must have such
> restrictions.  This may be a bug, can you file a JIRA ticket?

Enum values are identifiers in many programming languages (e.g., Java,
C, C++, & C#) and identifiers in these languages cannot begin with a
digit.  To simplify integration with these languages Avro should use the
same convention, but this is not stated clearly in the spec.  We should
probably clarify this, that enum symbols, record field names, and
protocol message names have the same restrictions as type names.  That's
what's currently implemented in Java.
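
A minimal sketch of that rule, assuming the pattern from the Names section of
the spec (first character [A-Za-z_], the rest [A-Za-z0-9_]). NameRuleSketch and
isValidName are made-up names for illustration and are not Avro APIs; the
actual check in Schema.java may be written differently.

```java
import java.util.regex.Pattern;

class NameRuleSketch {
    // First character [A-Za-z_], remaining characters [A-Za-z0-9_].
    private static final Pattern NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Hypothetical helper: true if s would be accepted under the naming rule.
    static boolean isValidName(String s) {
        return s != null && NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String symbol : new String[] {"B1", "1B", "SS", "_1B"}) {
            System.out.println(symbol + " -> " + isValidName(symbol));
        }
    }
}
```

Under this rule "B1" passes and "1B" does not, which matches the symbols used
in position.avsc earlier in the thread.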

Doug

Re: Combining schemas

Posted by Scott Carey <sc...@apache.org>.

On 8/9/11 11:15 AM, "Bill Graham" <bi...@gmail.com> wrote:

> Hi,
> 
> I'm trying to create a schema that references a type defined in another schema
> and I'm having some trouble. Is there an easy way to do this?
> 
> My test schemas look like this:
> 
> $ cat position.avsc
> {"type":"enum", "name": "Position", "namespace": "avro.examples.baseball",
>  "symbols": ["P", "C", "B1", "B2", "B3", "SS", "LF", "CF", "RF", "DH"]
> }
> 
> $ cat player.avsc
> {"type":"record", "name":"Player", "namespace": "avro.examples.baseball",
>  "fields": [
>   {"name": "number", "type": "int"},
>   {"name": "first_name", "type": "string"},
>   {"name": "last_name", "type": "string"},
>   {"name": "position", "type": {"type": "array", "items":
> "avro.examples.baseball.Position"} }
>  ]
> }
> 
> I've read this thread
> (http://apache-avro.679487.n3.nabble.com/How-to-reference-previously-defined-enum-in-avsc-file-td2663512.html)
> and tried using IDL like so with no luck:
> 
> $ cat baseball.avdl
> @namespace("avro.examples.baseball")
> protocol Baseball {
>   import schema "position.avsc";
>   import schema "player.avsc";
> }
> 
> $ java -jar avro-tools-1.5.1.jar idl  baseball.avdl baseball.avpr
> Exception in thread "main" org.apache.avro.SchemaParseException: Undefined
> name: "avro.examples.baseball.Position"
>         at org.apache.avro.Schema.parse(Schema.java:979)
>         at org.apache.avro.Schema.parse(Schema.java:1052)
>         at org.apache.avro.Schema.parse(Schema.java:1021)
>         at org.apache.avro.Schema.parse(Schema.java:884)
>         at org.apache.avro.compiler.idl.Idl.ImportSchema(Idl.java:388)
>         at org.apache.avro.compiler.idl.Idl.ProtocolBody(Idl.java:320)
>         at org.apache.avro.compiler.idl.Idl.ProtocolDeclaration(Idl.java:206)
>         at org.apache.avro.compiler.idl.Idl.CompilationUnit(Idl.java:84)
>         ...

I agree that the documentation indicates that this should work.  I suspect
that it may not be able to resolve dependencies among imports.  That is if
Baseball depends on position, and on player, it works.  But since player
depends on position, it does not.  The import statement pulls in each item
individually for use in composite things in the AvroIDL, but does not allow
for interdependencies in the imports.
This seems worthy of a JIRA enhancement request.  I'm sure the project will
accept a patch that adds this.

> 
> 
> I also saw this blog post
> (http://www.infoq.com/articles/ApacheAvro#_ftnref6_7758) where the author had
> to write some nasty String.replace(..) code to combine schemas, but there's
> got to be a better way than this.

We need to improve the ability to import multiple files when parsing.  Using
the lower level Avro API you can parse the files yourself in an order that
will work.  
I have simply put all my types in one file.  If you made one avsc file with
both Position and Player in a JSON array it will compile.  It would look
like:
[
  < position schema here>,
  < player schema here>
]

> 
> Also FYI, it seems enum values can't start with numbers (e.g. '1B'). Is this a
> known issue or a feature? I haven't seen it documented anywhere. You get an
> error like this if the value starts with a number:
> 
> org.apache.avro.SchemaParseException: Illegal initial character

Enums are a named type.  The enum names must start with [A-Za-z_]  and
subsequently contain only [A-Za-z0-9_].
http://avro.apache.org/docs/1.5.1/spec.html#Names

However, the spec does not say that the values must have such restrictions.
This may be a bug, can you file a JIRA ticket?

Thanks!

-Scott

> 
> thanks,
> Bill
>