Posted to user@avro.apache.org by roger peppe <ro...@gmail.com> on 2020/04/07 11:03:40 UTC

schema resolution vs logical types

Hi,

I'm just contemplating an implementation of the decimal logical type, and
I'm a bit confused by the specification around this.

On the one hand the specification says
<https://avro.apache.org/docs/1.9.1/spec.html#Parsing+Canonical+Form+for+Schemas>
:

If the Parsing Canonical Forms of two different schemas are textually
> equal, then those schemas are "the same" as far as any reader is concerned


but on the other, when discussing the decimal logical type, it says:

For the purposes of schema resolution, two schemas that are decimal logical
> types *match* if their scales and precisions match.


I'm not sure how to reconcile those two statements. If two schemas with
mismatched scales or precisions should be considered to be mismatched for
schema resolution, then I'm not sure how the first statement could be
considered true, as surely mismatched schemas are something that a reader
should be concerned about?
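(To make the tension concrete, here's a minimal sketch of just the attribute-stripping step of Parsing Canonical Form, not the full algorithm and not Avro library code: per the spec, only a handful of reserved attributes survive, so "logicalType", "precision" and "scale" are all dropped, and two decimal schemas with different scales end up with the same canonical form.)

```python
import json

# Sketch of the attribute-stripping step of Parsing Canonical Form.
# The real algorithm also normalises names, orders attributes and
# minifies the JSON; this only shows which attributes survive.
KEEP = {"type", "name", "fields", "symbols", "items", "values", "size"}

def strip(schema):
    if isinstance(schema, dict):
        return {k: strip(v) for k, v in schema.items() if k in KEEP}
    if isinstance(schema, list):
        return [strip(s) for s in schema]
    return schema

a = {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 2}
b = {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 4}
print(json.dumps(strip(a)) == json.dumps(strip(b)))  # True
```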

Given that the spec recommends using the canonical form for schema
fingerprints, ISTM there might be some possibility for attack (or at least
data corruption) there - if we unwittingly read a decimal value that was
written with a different scale, we could read numbers thinking they're a
different order of magnitude than they actually are.
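(A small sketch of the failure mode, using plain Python rather than any Avro library: an Avro decimal stores the unscaled value as big-endian two's-complement bytes, and the reader multiplies by 10^-scale taken from its own schema, so the same bytes read under the wrong scale are silently off by a power of ten.)

```python
from decimal import Decimal

def decode_decimal(raw: bytes, scale: int) -> Decimal:
    # Avro decimals: unscaled value as big-endian two's-complement
    # bytes; the numeric value is unscaled * 10**-scale, where scale
    # comes from the *reader's* schema.
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

raw = (12345).to_bytes(2, "big")       # writer intended 123.45 (scale=2)
print(decode_decimal(raw, 2))          # 123.45
print(decode_decimal(raw, 0))          # 12345 - wrong by a factor of 100
```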

  cheers,
    rog.

Re: schema resolution vs logical types

Posted by Doug Cutting <cu...@gmail.com>.
On Wed, Apr 8, 2020 at 5:03 AM roger peppe <ro...@gmail.com> wrote:

> On Tue, 7 Apr 2020 at 17:57, Doug Cutting <cu...@gmail.com> wrote:
>
>> On Tue, Apr 7, 2020 at 4:03 AM roger peppe <ro...@gmail.com> wrote:
>>
>>> On the one hand the specification says
>>> <https://avro.apache.org/docs/1.9.1/spec.html#Parsing+Canonical+Form+for+Schemas>
>>> :
>>>
>>> If the Parsing Canonical Forms of two different schemas are textually
>>>> equal, then those schemas are "the same" as far as any reader is concerned
>>>
>>>
>> This statement in the specification could perhaps be improved.  What it
>> means is that low-level parsing errors will not be encountered when using
>> two such schemas.  It does not mean they're equivalent for all purposes.
>>
>
> OK, that could definitely be improved then!
>

If you can suggest improved wording for this, please feel free to make a
pull request.


> For the purposes of schema resolution, two schemas that are decimal logical
>>>> types *match* if their scales and precisions match.
>>>
>>>
>>>
>> Schema resolution involves a different kind of equivalence for schemas.
>> Two compatible schemas here may have quite different binary formats, fields
>> might be reordered, removed, or added.  Scalar types may be promoted, etc.
>>
>
> Perhaps it might be worth mentioning in the schema resolution section that
> other fields can play a role here.
> That section mentions field reordering, scalar type promotion, etc, but
> doesn't talk about logical types at all. It reads like it's supposed to be
> definitive.
>

This is historic.  Logical types were added as a new, incremental feature.
Initially no languages implemented them.  Now a couple do, but some still
do not.  So they are described as a separate, optional feature.  Similarly,
implementations of aliases and schema resolution are optional.  Such is the
reality of adding features to something implemented in multiple programming
languages via volunteer efforts.

Perhaps this optionality and layering of features could be more explicit in
the specification.


> The logical type attributes are definitely not irrelevant to readers
> trying to parse incoming data. We could wrangle "low level" vs "high level"
> parsing errors but in the end, knowing the scale of a number is critically
> important to the reader of the data. If messages are encoded along with a
> schema (e.g. using Single Object Encoding) that strips out important
> attributes like that, then we have no decent assurance that our data won't
> be corrupted when we read it.
>
> We're using a schema registry to store schemas for our messages in Kafka,
> and we've got full compatibility enabled, but it ignores logical types for
> the purposes of schema comparison. I suspect that it's comparing
> canonicalised schemas. This means that if someone accidentally changes a
> decimal scale and pushes a new schema, we're at risk of not being able to
> read messages (if the reader takes logical types into account) or silent
> data corruption (if it doesn't).
>

This compatibility checking was implemented before logical types were
added to Avro and has not yet been updated to support logical types like
decimal.  So your decimal data is treated simply as bytes.  These are not
corrupted, FWIW.


> There is a proposal to add an alternate canonical form that incorporates
>> logical types:
>>
>> https://github.com/apache/avro/pull/805
>> https://issues.apache.org/jira/browse/AVRO-2299
>>
>> Does this look like what you'd like?  It seems that patch has been
>> ignored, but perhaps we can pick it up again and get it committed.
>>
>
> Something along those kinds of lines seems like it would be useful, yes.
> I'm not entirely convinced about the exact content of that particular PR,
> however:
>
>    - the naming could be considered confusing. Why "Standard" Canonical
>    Form? The PR doesn't make it very clear exactly why this kind of canonical
>    form is required, which might inform the naming better. Maybe "Resolving
>    Canonical Form" might work better.
>
>
This adds a collection of features into the new canonicalized form,
including aliases, defaults, and logical types.  I don't see an obvious,
precise name for that set of features.  It's more than just resolution, I
think.  I'd welcome a better suggestion than Standard, though!

>
>    - it doesn't seem to suggest an ordering for custom properties. I
>    wonder if it might be better just to say that all properties should be in
>    lexical order.
>
I think the order for custom properties is meant to be
application-specified in the current patch.  I agree that lexicographic
might be better.
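(Lexicographic ordering has the nice property that canonicalisation becomes deterministic regardless of the order in which an application declared the properties; a two-line illustration with plain JSON, not Avro code:)

```python
import json

# The same properties declared in different orders serialize
# identically once keys are sorted lexicographically.
a = {"scale": 2, "precision": 9, "logicalType": "decimal"}
b = {"logicalType": "decimal", "precision": 9, "scale": 2}
print(json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True))  # True
```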

>
>    - there's a lot of redundancy in the spec with respect to the Parsing
>    Canonical Form. I'd be tempted to try to fold them together into one, or at
>    least define one in terms of the other.
>
I agree.

>
>    - it would be good if it mentioned "scale" and "precision" amongst the
>    standard fields.
>
I agree.

>
>    - since this new canonical form now contains default values, the spec
>    also needs to define canonicalisation for those, including numbers and
>    arbitrary properties.
>
Good point.
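(The ambiguity is easy to demonstrate with plain JSON: numerically equal defaults can render differently, so a canonical form that includes defaults has to pick one rendering.)

```python
import json

# 1 and 1.0 are the same Avro default value but serialize differently,
# so fingerprints over a defaults-bearing canonical form would diverge
# unless number rendering is pinned down.
print(json.dumps({"default": 1.0}))  # {"default": 1.0}
print(json.dumps({"default": 1}))    # {"default": 1}
```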

Thanks,

Doug

Re: schema resolution vs logical types

Posted by roger peppe <ro...@gmail.com>.
On Tue, 7 Apr 2020 at 17:57, Doug Cutting <cu...@gmail.com> wrote:

> On Tue, Apr 7, 2020 at 4:03 AM roger peppe <ro...@gmail.com> wrote:
>
>> On the one hand the specification says
>> <https://avro.apache.org/docs/1.9.1/spec.html#Parsing+Canonical+Form+for+Schemas>
>> :
>>
>> If the Parsing Canonical Forms of two different schemas are textually
>>> equal, then those schemas are "the same" as far as any reader is concerned
>>
>>
> This statement in the specification could perhaps be improved.  What it
> means is that low-level parsing errors will not be encountered when using
> two such schemas.  It does not mean they're equivalent for all purposes.
>

OK, that could definitely be improved then!


> but on the other, when discussing the decimal logical type, it says:
>>
>> For the purposes of schema resolution, two schemas that are decimal logical
>>> types *match* if their scales and precisions match.
>>
>>
>>
> Schema resolution involves a different kind of equivalence for schemas.
> Two compatible schemas here may have quite different binary formats, fields
> might be reordered, removed, or added.  Scalar types may be promoted, etc.
>

Perhaps it might be worth mentioning in the schema resolution section that
other fields can play a role here.
That section mentions field reordering, scalar type promotion, etc, but
doesn't talk about logical types at all. It reads like it's supposed to be
definitive.


>
>
>> Given that the spec recommends using the canonical form for schema
>> fingerprints, ISTM there might be some possibility for attack (or at least
>> data corruption) there - if we unwittingly read a decimal value that was
>> written with a different scale, we could read numbers thinking they're a
>> different order of magnitude than they actually are.
>>
>
> Identical Parsing Canonical form only tells you whether you can parse the
> data, not whether you can resolve it.  Indeed, if you use a different
> logical type definition but only check parsing-level compatibility then you
> can get incorrect data.
>

It seems to me that this is somewhat problematic. The spec says:

It is called *Parsing* Canonical Form because the transformations strip
> away parts of the schema, like "doc" attributes, that are irrelevant to
> readers trying to parse incoming data.


The logical type attributes are definitely not irrelevant to readers trying
to parse incoming data. We could wrangle "low level" vs "high level"
parsing errors but in the end, knowing the scale of a number is critically
important to the reader of the data. If messages are encoded along with a
schema (e.g. using Single Object Encoding) that strips out important
attributes like that, then we have no decent assurance that our data won't
be corrupted when we read it.
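(For reference, the Single Object Encoding framing per the 1.9 spec is a 2-byte marker C3 01 followed by the 8-byte little-endian CRC-64-AVRO fingerprint of the writer schema, then the Avro binary body; a sketch of the framing, not library code. Since that fingerprint is computed over Parsing Canonical Form, two writer schemas differing only in decimal scale share the same fingerprint.)

```python
MAGIC = b"\xc3\x01"  # single-object encoding marker bytes

def frame(schema_fingerprint: int, body: bytes) -> bytes:
    # marker + 8-byte little-endian fingerprint + Avro binary body
    return MAGIC + schema_fingerprint.to_bytes(8, "little") + body

def unframe(msg: bytes):
    assert msg[:2] == MAGIC, "not single-object encoded"
    return int.from_bytes(msg[2:10], "little"), msg[10:]

fp, body = unframe(frame(0x1234, b"\x02"))
print(hex(fp), body)
```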

We're using a schema registry to store schemas for our messages in Kafka,
and we've got full compatibility enabled, but it ignores logical types for
the purposes of schema comparison. I suspect that it's comparing
canonicalised schemas. This means that if someone accidentally changes a
decimal scale and pushes a new schema, we're at risk of not being able to
read messages (if the reader takes logical types into account) or silent
data corruption (if it doesn't).

There is a proposal to add an alternate canonical form that incorporates
> logical types:
>
> https://github.com/apache/avro/pull/805
> https://issues.apache.org/jira/browse/AVRO-2299
>
> Does this look like what you'd like?  It seems that patch has been
> ignored, but perhaps we can pick it up again and get it committed.
>

Something along those kinds of lines seems like it would be useful, yes.
I'm not entirely convinced about the exact content of that particular PR,
however:

   - the naming could be considered confusing. Why "Standard" Canonical
   Form? The PR doesn't make it very clear exactly why this kind of canonical
   form is required, which might inform the naming better. Maybe "Resolving
   Canonical Form" might work better.
   - it doesn't seem to suggest an ordering for custom properties. I wonder
   if it might be better just to say that all properties should be in lexical
   order.
   - there's a lot of redundancy in the spec with respect to the Parsing
   Canonical Form. I'd be tempted to try to fold them together into one, or at
   least define one in terms of the other.
   - it would be good if it mentioned "scale" and "precision" amongst the
   standard fields.
   - since this new canonical form now contains default values, the spec
   also needs to define canonicalisation for those, including numbers and
   arbitrary properties.

  cheers,
    rog.

Re: schema resolution vs logical types

Posted by Doug Cutting <cu...@gmail.com>.
On Tue, Apr 7, 2020 at 4:03 AM roger peppe <ro...@gmail.com> wrote:

> On the one hand the specification says
> <https://avro.apache.org/docs/1.9.1/spec.html#Parsing+Canonical+Form+for+Schemas>
> :
>
> If the Parsing Canonical Forms of two different schemas are textually
>> equal, then those schemas are "the same" as far as any reader is concerned
>
>
This statement in the specification could perhaps be improved.  What it
means is that low-level parsing errors will not be encountered when using
two such schemas.  It does not mean they're equivalent for all purposes.


> but on the other, when discussing the decimal logical type, it says:
>
> For the purposes of schema resolution, two schemas that are decimal logical
>> types *match* if their scales and precisions match.
>
>
>
Schema resolution involves a different kind of equivalence for schemas.
Two compatible schemas here may have quite different binary formats, fields
might be reordered, removed, or added.  Scalar types may be promoted, etc.


> Given that the spec recommends using the canonical form for schema
> fingerprints, ISTM there might be some possibility for attack (or at least
> data corruption) there - if we unwittingly read a decimal value that was
> written with a different scale, we could read numbers thinking they're a
> different order of magnitude than they actually are.
>

Identical Parsing Canonical form only tells you whether you can parse the
data, not whether you can resolve it.  Indeed, if you use a different
logical type definition but only check parsing-level compatibility then you
can get incorrect data.

There is a proposal to add an alternate canonical form that incorporates
logical types:

https://github.com/apache/avro/pull/805
https://issues.apache.org/jira/browse/AVRO-2299

Does this look like what you'd like?  It seems that patch has been ignored,
but perhaps we can pick it up again and get it committed.

Thanks,

Doug
