Posted to user@avro.apache.org by Jay Kreps <ja...@gmail.com> on 2010/05/02 21:18:00 UTC

references to other schemas

I want to have a shared type schema which would be used by 50 or so
messages (say a type Header defined in a single place that all
messages would use), and I can't seem to find a way to do this (though
I may just have missed it).

This could be done with an "import" statement in the .avsc file, as
protocol buffers does, but I do not think that really makes sense in a
world of non-statically compiled schemas. Probably a better way is
just to make a type "Xyz" resolve to the schema of that type. Then it
is just a matter of opening up the methods below and making the
SpecificCompiler take lots of files, resolve all the inter-references,
and generate a bunch of classes instead of a single file. The
resulting schema would have no reference to Xyz, but rather would
directly include the schema for Xyz in its place.
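
To make the expansion concrete, here is a minimal sketch of the inlining
step. It is not Avro code: ReferenceExpander and its methods are
hypothetical names, and it assumes schemas have already been parsed into
plain Java values (String for a type name, Map for a JSON object, List
for a union) and that references use full names. A name is inlined only
on first use, because a schema may define a given named type only once;
later uses stay as names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Avro. A schema is modeled as a plain value:
// String = type name, Map = JSON object, List = JSON array (a union).
final class ReferenceExpander {
    private static final Set<String> NAMED_KINDS =
            new HashSet<>(Arrays.asList("record", "enum", "fixed"));

    /**
     * Returns a copy of schema in which each name found in definitions is
     * replaced by its definition the first time it appears. The defined set
     * collects the names that have already been written out in full.
     */
    static Object expand(Object schema, Map<String, Object> definitions, Set<String> defined) {
        if (schema instanceof String) {
            String name = (String) schema;
            // Primitives and unknown names have no entry in definitions and pass through.
            if (defined.contains(name) || !definitions.containsKey(name)) {
                return name;
            }
            return expand(definitions.get(name), definitions, defined);
        }
        if (schema instanceof List) {
            List<Object> branches = new ArrayList<>();
            for (Object branch : (List<?>) schema) {
                branches.add(expand(branch, definitions, defined));
            }
            return branches;
        }
        Map<String, Object> copy = copyOf(schema);
        Object kind = copy.get("type");
        if (NAMED_KINDS.contains(kind)) {
            // Mark the type as defined before descending so self-references stay names.
            defined.add((String) copy.get("name"));
        }
        if ("record".equals(kind)) {
            List<Object> fields = new ArrayList<>();
            for (Object f : (List<?>) copy.get("fields")) {
                Map<String, Object> field = copyOf(f);
                field.put("type", expand(field.get("type"), definitions, defined));
                fields.add(field);
            }
            copy.put("fields", fields);
        } else if ("array".equals(kind)) {
            copy.put("items", expand(copy.get("items"), definitions, defined));
        } else if ("map".equals(kind)) {
            copy.put("values", expand(copy.get("values"), definitions, defined));
        }
        return copy;
    }

    private static Map<String, Object> copyOf(Object jsonObject) {
        Map<String, Object> copy = new LinkedHashMap<>();
        for (Map.Entry<?, ?> entry : ((Map<?, ?>) jsonObject).entrySet()) {
            copy.put(String.valueOf(entry.getKey()), entry.getValue());
        }
        return copy;
    }

    /** Builds a {"name": ..., "type": ...} field object. */
    static Map<String, Object> field(String name, Object type) {
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("name", name);
        field.put("type", type);
        return field;
    }

    /** Builds a record schema object with the given full name and fields. */
    static Map<String, Object> record(String fullName, Object... fields) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("type", "record");
        record.put("name", fullName);
        record.put("fields", Arrays.asList(fields));
        return record;
    }
}
```

Expanding a record whose two fields both have type "org.example.Header"
would inline the Header definition in the first field and leave the
second as a plain name. The sketch covers records, arrays, maps and
unions only.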

This looks like it can *almost* be done using some internal private methods:

/* this package protected method parses wrt the given names. Header
could be given here if I understand correctly */
Schema.parse(JsonNode schema, Names names)

/* compile multiple schemas into multiple files*/
s = SpecificCompiler()
s.enqueue(header)
s.enqueue(schemaUsingHeader)
outputFiles = s.compile()

Is this kind of thing handled in some other way I have just missed? If
not, is there any objection to a patch that opens up these methods and
adds options to SpecificCompiler to jointly compile a bunch of files
all at once? Perhaps this is already in flight?

-Jay

Re: references to other schemas

Posted by Jay Kreps <ja...@gmail.com>.

Yes, agreed you need a full schema for the resulting message with no
external references. My proposal is just pre-processor support that
does the expansion based on unresolved names as you describe. I think
this is better than explicit includes or URLs directly in the schemas
(after all, the fully qualified name, not a URL, is the naming system
used to refer to types). I think you need this for both specific and
generic for it to be useful (it shouldn't just work for one--you
should be able to load the fragments, resolve them, and use them as
schemas normally without the compiler). Without actually thinking it
all through, the idea would be to introduce

class Schemas {
   public List<Schema> parse(String...jsons){...}
}

Plus of course parse methods taking File, InputStream, etc. This would
need to resolve inter-referencing schemas and expand them based on
type references. The inputs would be fragments and the schemas you get
back would be fully resolved. Once you have the full Schema it would be
no different than if you had manually expanded the whole thing.
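
Since the fragments could arrive in any order, one possible first step
for such a parse method is to sort them so that each fragment comes
after the fragments it refers to. The following is only a sketch under
that assumption: FragmentOrder is a hypothetical name, the set of names
each fragment references is assumed to have been extracted already, and
circular references between different fragments are simply rejected.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Avro.
final class FragmentOrder {
    /**
     * dependencies maps each fragment's full name to the names it references.
     * Returns the names ordered so that every fragment follows its dependencies.
     */
    static List<String> order(Map<String, Set<String>> dependencies) {
        List<String> ordered = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String name : dependencies.keySet()) {
            visit(name, dependencies, done, inProgress, ordered);
        }
        return ordered;
    }

    // Depth-first walk: a fragment is emitted only after everything it references.
    private static void visit(String name, Map<String, Set<String>> dependencies,
                              Set<String> done, Set<String> inProgress, List<String> ordered) {
        if (done.contains(name)) {
            return;
        }
        Set<String> references = dependencies.get(name);
        if (references == null) {
            throw new IllegalArgumentException("Unresolved type reference: " + name);
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("Circular reference involving " + name);
        }
        for (String reference : references) {
            // A type may refer to itself; that needs no ordering.
            if (!reference.equals(name)) {
                visit(reference, dependencies, done, inProgress, ordered);
            }
        }
        inProgress.remove(name);
        done.add(name);
        ordered.add(name);
    }
}
```

With Message referencing Header and Address, and Address referencing
Header, the fragments would be handled in the order Header, Address,
Message. A reference to a name that no fragment defines fails early.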

The specific compiler would then be changed to use this class to load
schemas and the arguments would be changed from
   input output_dir
to
   input1 input2 ... inputN output_dir

It would probably make sense to support directories as well as files for inputs.
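
As a rough illustration of that argument handling (a sketch only;
CompilerArgs is a hypothetical class, not the SpecificCompiler's actual
command line), the last argument is taken as the output directory and
every directory among the inputs is searched recursively for .avsc and
.avpr files:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical argument holder for "input1 input2 ... inputN output_dir".
final class CompilerArgs {
    final List<Path> inputs;
    final Path outputDir;

    private CompilerArgs(List<Path> inputs, Path outputDir) {
        this.inputs = inputs;
        this.outputDir = outputDir;
    }

    /** Splits the arguments into input files and output directory. */
    static CompilerArgs parse(String... args) throws IOException {
        if (args.length < 2) {
            throw new IllegalArgumentException("Usage: input1 [input2 ... inputN] output_dir");
        }
        List<Path> inputs = new ArrayList<>();
        for (int i = 0; i < args.length - 1; i++) {
            Path input = Paths.get(args[i]);
            if (Files.isDirectory(input)) {
                // Directory: collect every schema or protocol file beneath it.
                try (Stream<Path> tree = Files.walk(input)) {
                    inputs.addAll(tree.filter(Files::isRegularFile)
                            .filter(CompilerArgs::isSchemaFile)
                            .sorted()
                            .collect(Collectors.toList()));
                }
            } else {
                // Plain file argument: taken as given.
                inputs.add(input);
            }
        }
        return new CompilerArgs(inputs, Paths.get(args[args.length - 1]));
    }

    private static boolean isSchemaFile(Path path) {
        String fileName = path.getFileName().toString();
        return fileName.endsWith(".avsc") || fileName.endsWith(".avpr");
    }
}
```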

The use case this addresses is the common case of having shared
headers, fields, or other includes that get used in a standard way
across a large number of messages.

It is worth thinking this proposal through, since for an organization
that needs to maintain a large set of messages, how they interconnect
and what dependencies there are is quite critical.

-Jay

On Mon, May 3, 2010 at 10:48 AM, Scott Carey <sc...@richrelevance.com> wrote:
>
> On May 3, 2010, at 10:03 AM, Doug Cutting wrote:
>
>> Scott Carey wrote:
>>> There has been talk that AvroGen would handle features like this (as well as many others) in time.  However this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.
>>
>> Note that JSON schemas and protocols need to be standalone, containing
>> the full lexical closure of schemas referenced, when they are included
>> in data files and exchanged in RPC handshakes without reference to
>> external data.  Thus I am reluctant to add a JSON syntax for file
>> inclusion.  Rather, I think a pre-processor is appropriate.  The
>> pre-processor would not be run on schemas included in files or exchanged
>> in RPC handshakes, but would be run for schemas read from files.
>
> Exactly.  I don't think we should change the JSON syntax by adding references or includes.
>
> We should just make the SpecificCompiler capable of reading a collection of files and figuring out how to compile them when there is not full lexical closure in a .avsc file.
> File formats and RPCs have much stricter requirements than the SpecificCompiler.
>
>>
>> I have experimented with using the m4 pre-processor for this purpose,
>> and found it a bit awkward.  Perhaps someone can develop macros for m4
>> that make it palatable, or perhaps we can develop a custom pre-processor
>> for JSON.
>>
>> We might exploit otherwise-illegal JSON syntax, like backquotes, for
>> pre-processor directives.  An include might look something like:
>>
>> {"protocol": "org.foo.BarProtocol",
>>  "types": [
>>    `include org.foo.Bar`,
>>     ...
>>   ]
>> }
>>
>
> Rather than use a preprocessor, is it possible to have the SpecificCompiler search the other files in the set for types that can't be found in the current file?  The result would be SpecificRecord objects that have their $SCHEMA field populated with a schema that has full lexical closure.
>
> Essentially, if given two files:
> IpTypes.avsc --
>
> [{"name": "com.somewhere.avro.IPV4", "type": "fixed", "size":4},
> {"name": "com.somewhere.avro.IPV6", "type": "fixed", "size":16}]
>
> MyRecord.avsc --
>
> {"name": "com.somewhere.avro.MyRecord", "type": "record", "fields": [
>  {"name": "hostname", "type": "string"},
>  {"name": "IP", "type": [ "IPV4", "IPV6" ]}
> ]}
>
> The SpecificCompiler could compile MyRecord.avsc if concurrently given IpTypes.avsc to resolve the unknown references "IPV4" and "IPV6".  Perhaps it could also compile if it is aware of a SpecificRecord Java class that has an appropriate schema.  It would be tricky for a preprocessor to do this, especially in a namespace-appropriate way, and a preprocessor could not support integration with already compiled SpecificRecord classes.
>
> Perhaps IPV4 and IPV6 are already compiled SpecificRecord classes in jar "CommonTypes.jar" -- SpecificCompiler could run with those in its classpath and a directive to look for valid types in its classpath in addition to the files.
>
> The MyRecord.avsc file above does not contain a fully valid Avro schema, so perhaps we could denote this with a different file extension.
>
>> Also note that a protocol file (.avpr) need not actually define any
>> messages but can be used to define a set of types that reference one
>> another.  This is a stopgap, but a useful one.
>>
>> Doug
>
>

Re: references to other schemas

Posted by Scott Carey <sc...@richrelevance.com>.

On May 3, 2010, at 10:03 AM, Doug Cutting wrote:

> Scott Carey wrote:
>> There has been talk that AvroGen would handle features like this (as well as many others) in time.  However this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.
> 
> Note that JSON schemas and protocols need to be standalone, containing 
> the full lexical closure of schemas referenced, when they are included 
> in data files and exchanged in RPC handshakes without reference to 
> external data.  Thus I am reluctant to add a JSON syntax for file 
> inclusion.  Rather, I think a pre-processor is appropriate.  The 
> pre-processor would not be run on schemas included in files or exchanged 
> in RPC handshakes, but would be run for schemas read from files.

Exactly.  I don't think we should change the JSON syntax by adding references or includes.

We should just make the SpecificCompiler capable of reading a collection of files and figuring out how to compile them when there is not full lexical closure in a .avsc file.
File formats and RPCs have much stricter requirements than the SpecificCompiler.

> 
> I have experimented with using the m4 pre-processor for this purpose, 
> and found it a bit awkward.  Perhaps someone can develop macros for m4 
> that make it palatable, or perhaps we can develop a custom pre-processor 
> for JSON.
> 
> We might exploit otherwise-illegal JSON syntax, like backquotes, for 
> pre-processor directives.  An include might look something like:
> 
> {"protocol": "org.foo.BarProtocol",
>  "types": [
>    `include org.foo.Bar`,
>     ...
>   ]
> }
> 

Rather than use a preprocessor, is it possible to have the SpecificCompiler search the other files in the set for types that can't be found in the current file?  The result would be SpecificRecord objects that have their $SCHEMA field populated with a schema that has full lexical closure.

Essentially, if given two files:
IpTypes.avsc --

[{"name": "com.somewhere.avro.IPV4", "type": "fixed", "size":4},
{"name": "com.somewhere.avro.IPV6", "type": "fixed", "size":16}]

MyRecord.avsc --

{"name": "com.somewhere.avro.MyRecord", "type": "record", "fields": [
  {"name": "hostname", "type": "string"},
  {"name": "IP", "type": [ "IPV4", "IPV6" ]}
]}

The SpecificCompiler could compile MyRecord.avsc if concurrently given IpTypes.avsc to resolve the unknown references "IPV4" and "IPV6".  Perhaps it could also compile if it is aware of a SpecificRecord Java class that has an appropriate schema.  It would be tricky for a preprocessor to do this, especially in a namespace-appropriate way, and a preprocessor could not support integration with already compiled SpecificRecord classes.
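
The namespace-appropriate part can be illustrated with a small sketch.
TypeNames is a hypothetical helper, not an Avro class; it follows the
Avro naming rules as I understand them: primitive type names are never
qualified, a name containing a dot is already a full name, and any
other name is taken to be in the namespace of the enclosing named type.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Avro.
final class TypeNames {
    private static final Set<String> PRIMITIVES = new HashSet<>(Arrays.asList(
            "null", "boolean", "int", "long", "float", "double", "bytes", "string"));

    /** Returns the namespace part of a full name, or "" if there is none. */
    static String namespaceOf(String fullName) {
        int lastDot = fullName.lastIndexOf('.');
        return lastDot < 0 ? "" : fullName.substring(0, lastDot);
    }

    /** Qualifies a type reference relative to the namespace of the enclosing named type. */
    static String resolve(String reference, String enclosingNamespace) {
        if (PRIMITIVES.contains(reference)
                || reference.contains(".")
                || enclosingNamespace.isEmpty()) {
            return reference;
        }
        return enclosingNamespace + "." + reference;
    }
}
```

For the MyRecord.avsc file above, the enclosing namespace is
com.somewhere.avro, so the reference "IPV4" would be looked up as
com.somewhere.avro.IPV4 in the other files, while "string" is left
alone.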

Perhaps IPV4 and IPV6 are already compiled SpecificRecord classes in jar "CommonTypes.jar" -- SpecificCompiler could run with those in its classpath and a directive to look for valid types in its classpath in addition to the files.

The MyRecord.avsc file above does not contain a fully valid Avro schema, so perhaps we could denote this with a different file extension.

> Also note that a protocol file (.avpr) need not actually define any 
> messages but can be used to define a set of types that reference one 
> another.  This is a stopgap, but a useful one.
> 
> Doug

Re: references to other schemas

Posted by Jeff Hodges <jh...@twitter.com>.

Backticks are allowed inside of strings, though, so whatever
preprocessor was used would have to have some understanding of JSON.
This narrows the choice of preprocessors for that approach.
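
How much JSON understanding is needed can be sketched briefly. The
scanner below is hypothetical (IncludeScanner is not an existing tool,
and the `include name` directive syntax is only the one floated in this
thread); it tracks whether it is inside a JSON string literal, including
escaped quotes, and only treats backquotes outside strings as
directives.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner for backquoted directives in schema source text.
final class IncludeScanner {
    private static final String KEYWORD = "include ";

    /** Returns the names in `include X` directives that appear outside JSON string literals. */
    static List<String> findIncludes(String text) {
        List<String> names = new ArrayList<>();
        boolean inString = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character, e.g. \" or \\
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '`') {
                int close = text.indexOf('`', i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("Unterminated directive at offset " + i);
                }
                String directive = text.substring(i + 1, close).trim();
                if (!directive.startsWith(KEYWORD)) {
                    throw new IllegalArgumentException("Unknown directive: " + directive);
                }
                names.add(directive.substring(KEYWORD.length()).trim());
                i = close; // continue after the closing backquote
            }
        }
        return names;
    }
}
```

A doc string that happens to contain backquotes is skipped, while a
directive in the "types" array is reported by name.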

I'm fairly neutral on the idea of composite schemas, overall. The
biggest problem I have is that JSON has no standard way of referring
to URLs (in the HTML5 sense) and they seem to be the best way to do
this.

On schema read, the references could be loaded once and kept that way
in order to have a complete schema on RPC and datafile write.
Basically, we would say references will be used on read, but not on
write.
--
Jeff

On May 3, 2010 10:03 AM, "Doug Cutting" <cu...@apache.org> wrote:

Scott Carey wrote:
>
> There has been talk that AvroGen would handle features like this (as well as ...

Note that JSON schemas and protocols need to be standalone, containing
the full lexical closure of schemas referenced, when they are included
in data files and exchanged in RPC handshakes without reference to
external data.  Thus I am reluctant to add a JSON syntax for file
inclusion.  Rather, I think a pre-processor is appropriate.  The
pre-processor would not be run on schemas included in files or
exchanged in RPC handshakes, but would be run for schemas read from
files.

I have experimented with using the m4 pre-processor for this purpose,
and found it a bit awkward.  Perhaps someone can develop macros for m4
that make it palatable, or perhaps we can develop a custom
pre-processor for JSON.

We might exploit otherwise-illegal JSON syntax, like backquotes, for
pre-processor directives.  An include might look something like:

{"protocol": "org.foo.BarProtocol",
 "types": [
  `include org.foo.Bar`,
   ...
 ]
}

Also note that a protocol file (.avpr) need not actually define any
messages but can be used to define a set of types that reference one
another.  This is a stopgap, but a useful one.

Doug

Re: references to other schemas

Posted by Doug Cutting <cu...@apache.org>.

Scott Carey wrote:
> There has been talk that AvroGen would handle features like this (as well as many others) in time.  However this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.

Note that JSON schemas and protocols need to be standalone, containing 
the full lexical closure of schemas referenced, when they are included 
in data files and exchanged in RPC handshakes without reference to 
external data.  Thus I am reluctant to add a JSON syntax for file 
inclusion.  Rather, I think a pre-processor is appropriate.  The 
pre-processor would not be run on schemas included in files or exchanged 
in RPC handshakes, but would be run for schemas read from files.

I have experimented with using the m4 pre-processor for this purpose, 
and found it a bit awkward.  Perhaps someone can develop macros for m4 
that make it palatable, or perhaps we can develop a custom pre-processor 
for JSON.

We might exploit otherwise-illegal JSON syntax, like backquotes, for 
pre-processor directives.  An include might look something like:

{"protocol": "org.foo.BarProtocol",
  "types": [
    `include org.foo.Bar`,
     ...
   ]
}

Also note that a protocol file (.avpr) need not actually define any 
messages but can be used to define a set of types that reference one 
another.  This is a stopgap, but a useful one.

Doug

Re: references to other schemas

Posted by Scott Carey <sc...@richrelevance.com>.

On May 2, 2010, at 12:18 PM, Jay Kreps wrote:

> I want to have a shared type schema which would be used by 50 or so
> messages (say a type Header defined in a single place that all
> messages would use), and I can't seem to find a way to do this (though
> I may just have missed it).
> 
> This could be done with an "import" statement in the .avsc file, as
> protocol buffers does, but I do not think that really makes sense in a
> world of non-statically compiled schemas. Probably a better way is
> just to make a type "Xyz" resolve to the schema of that type. Then it
> is just a matter of opening up the methods below and making the
> SpecificCompiler take lots of files, resolve all the inter-references,
> and generate a bunch of classes instead of a single file. The
> resulting schema would have no reference to Xyz, but rather would
> directly include the schema for Xyz in its place.
> 
> This looks like it can *almost* be done using some internal private methods:
> 
> /* this package protected method parses wrt the given names. Header
> could be given here if I understand correctly */
> Schema.parse(JsonNode schema, Names names)
> 
> /* compile multiple schemas into multiple files*/
> s = SpecificCompiler()
> s.enqueue(header)
> s.enqueue(schemaUsingHeader)
> outputFiles = s.compile()
> 
> Is this kind of thing handled in some other way I have just missed? If
> not, is there any objection to a patch that opens up these methods and
> adds options to SpecificCompiler to jointly compile a bunch of files
> all at once? Perhaps this is already in flight?
> 

It is not in flight to my knowledge, and it would certainly make the SpecificCompiler easier to use.  I would welcome such a contribution.

Being able to compile a collection of *.avsc and *.avpr files and resolve types across them would be great.

There has been talk that AvroGen would handle features like this (as well as many others) in time.  However this is one that should probably be addressed at the JSON level regardless of the future direction of AvroGen.

-Scott 

> -Jay