Posted to user@avro.apache.org by Aaron Kimball <ak...@gmail.com> on 2013/02/01 07:17:37 UTC

Re: static schema validation

That sounds like what I'm looking for. I'll take a look!

Thanks,
- Aaron
On Jan 31, 2013 10:39 AM, "Doug Cutting" <cu...@apache.org> wrote:

> Aaron,
>
> You can use the SchemaNormalization class to test if two schemas are
> effectively identical:
>
>
> http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
>
> http://avro.apache.org/docs/current/api/java/org/apache/avro/SchemaNormalization.html
>
> AVRO-816 has code to tell whether one Schema subsumes another (i.e.,
> can, with resolution, read the other) and to combine multiple schemas
> into a single that subsumes them all.
>
> https://issues.apache.org/jira/browse/AVRO-816
>
> Bob Cotton recently suggested that we should commit some form of this.
>  I'd be happy to do this if others agree.
>
> Doug
>
> On Wed, Jan 30, 2013 at 3:17 PM, Aaron Kimball <ak...@gmail.com>
> wrote:
> > Does Avro have an API to allow you to tell whether two schemas are a
> > match, statically?
> >
> > i.e., schema1.canRead(schema2) /** return true iff schema1 can be used
> > as a reader schema for schema2 */
> >
> > From my (admittedly cursory) scan of the docs + source, it seems like
> > there isn't something quite that concise, though maybe this can be
> > accomplished using ResolvingGrammarGenerator?
> >
> > I'm pessimistic because of the following quote from the spec [1]:
> >
> > [matching] if both are unions:
> > The first schema in the reader's union that matches the selected writer's
> > union schema is recursively resolved against it. if none match, an error
> > is signalled.
> >
> > That sentence makes me think it's context dependent; I interpret "the
> > selected writer's union schema" as "the schema of the actual thing
> > written in a data buffer, which is one of the possible schemas the writer
> > declared in her union type". I.e., you can only tell if schema R can be a
> > reader for some other schema W in terms of a literal record written by W,
> > and it cannot be deduced statically for all possible records that can be
> > encoded with schema W.  Is this interpretation correct? If so, does anyone
> > have any ideas how to ensure the best bounds on statically-guaranteed
> > backward compatibility between a given reader and writer?
> >
> > Thanks,
> > - Aaron
> >
> > [1] http://avro.apache.org/docs/current/spec.html#Schema+Resolution
>
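
Editorial sketch of the union question quoted above. It is a simplified model, not Avro's API: a schema is assumed to be either a primitive type name (a string) or a union (a list of primitive type names), and the function names are hypothetical. It illustrates one way a static check could be phrased: because the writer may select any branch of its union at write time, a static guarantee has to hold for every writer branch. Only the numeric promotions from the spec's Schema Resolution section are modelled.

```python
# Simplified model, NOT Avro's API: a schema is a primitive type name (str)
# or a union (list of primitive type names). All names here are hypothetical.

# Writer type -> reader types it may be promoted to (numeric promotions only).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}


def primitive_matches(reader, writer):
    """True if a value written as `writer` can be read as `reader`."""
    return reader == writer or reader in PROMOTIONS.get(writer, set())


def can_read(reader, writer):
    """True iff `reader` can resolve every value `writer` could have written."""
    if isinstance(writer, list):
        # The writer may select any branch at write time, so a static
        # guarantee requires every writer branch to be readable.
        return all(can_read(reader, branch) for branch in writer)
    if isinstance(reader, list):
        # The reader resolves against its first matching branch, so the
        # existence of one matching branch is sufficient.
        return any(primitive_matches(branch, writer) for branch in reader)
    return primitive_matches(reader, writer)
```

Under this model, a reader union of null and string can statically read a writer union of null and string, but not a writer union that also contains int, since a datum written with the int branch would fail to resolve.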

Re: static schema validation

Posted by Doug Cutting <cu...@apache.org>.
I think AVRO-816 should help you.  Neither S1 nor S2 subsumes the
other, but S3 subsumes them both.

Doug

On Fri, Feb 1, 2013 at 1:42 PM, Aaron Kimball <ak...@gmail.com> wrote:
> Ok, I read the patch and JIRA issue a bit more thoroughly. Schema
> normalization just tells you if two schemas differ only in the unimportant
> bits.
>
> As I understand it, subsumes() will tell you if a schema is a strict
> superset of another.
> i.e.,
> if S1 is a record of { a:int, b:string }, and S2 is a record of { a:int,
> b:string, c:int }, then S2.subsumes(S1) would return true but not vice
> versa. Is that correct?
>
> The functionality I need is to guarantee that two writers who write to a
> common data store with possibly different schemas can still read one
> another's data without a deserialization error. They need to agree ahead of
> time that they're going to write data with schemas "close enough" that the
> other one can always deserialize the data into their preferred format.
>
> S1 and S2 above do not meet this criterion, because S2 cannot read a record
> written with S1. It doesn't know how to instantiate field 'c'.
>
> However, S1 and S3 = { a:int, b:string, c:int default 0 } would meet my
> criterion.
>
> Does AVRO-816 help me answer this question?
> Thanks,
> - Aaron

Re: static schema validation

Posted by Aaron Kimball <ak...@gmail.com>.
Ok, I read the patch and JIRA issue a bit more thoroughly. Schema
normalization just tells you if two schemas differ only in the unimportant
bits.

As I understand it, subsumes() will tell you if a schema is a strict
superset of another.
i.e.,
if S1 is a record of { a:int, b:string }, and S2 is a record of { a:int,
b:string, c:int }, then S2.subsumes(S1) would return true but not vice
versa. Is that correct?

The functionality I need is to guarantee that two writers who write to a
common data store with possibly different schemas can still read one
another's data without a deserialization error. They need to agree ahead of
time that they're going to write data with schemas "close enough" that the
other one can always deserialize the data into their preferred format.

S1 and S2 above do not meet this criterion, because S2 cannot read a record
written with S1. It doesn't know how to instantiate field 'c'.

However, S1 and S3 = { a:int, b:string, c:int default 0 } would meet my
criterion.
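
Editorial sketch of the criterion described above. It is a simplified model, not Avro's API and not the AVRO-816 patch: a record is assumed to be a dict mapping each field name to a (type, has_default) pair, all field types are primitive names, type promotion is ignored, and the function names are hypothetical. It shows the "each side can read the other's data" check for S1, S2, and S3.

```python
# Simplified model, NOT Avro's API: a record schema is a dict mapping each
# field name to a (type, has_default) pair. All names here are hypothetical.


def can_read_record(reader, writer):
    """True iff `reader` can resolve any record written with `writer`."""
    for name, (reader_type, has_default) in reader.items():
        if name in writer:
            writer_type, _ = writer[name]
            if writer_type != reader_type:  # promotion ignored in this sketch
                return False
        elif not has_default:
            # The writer never wrote this field and the reader has no
            # default value to fill it in with.
            return False
    # Fields that only the writer has are skipped by the reader.
    return True


def mutually_readable(a, b):
    """True iff each schema can be used as a reader for the other's data."""
    return can_read_record(a, b) and can_read_record(b, a)


S1 = {"a": ("int", False), "b": ("string", False)}
S2 = {"a": ("int", False), "b": ("string", False), "c": ("int", False)}
S3 = {"a": ("int", False), "b": ("string", False), "c": ("int", True)}
```

In this model, mutually_readable(S1, S2) is false because S2 has no default for 'c', while mutually_readable(S1, S3) is true, which matches the criterion stated above.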

Does AVRO-816 help me answer this question?
Thanks,
- Aaron


