Posted to dev@avro.apache.org by Oscar Westra van Holthe - Kind <os...@westravanholthe.nl> on 2023/08/03 06:36:08 UTC

Re: Enum default not in canonical form

Hi everyone,

Below is one opinion on the subject.

The most robust way to determine whether two schemata are the same is to
compare how they serialise datums to bytes. Schema evolution allows us to
match schemata that are not the same, but similar enough.

Default values, both for fields and enums, exist for cases where the
serialised bytes do not contain enough information to deserialise them.
They only matter when reading, and thus should NOT be part of the
canonical form.

After all, the fewer properties are part of the canonical form, the easier
it is to make two different JSON schemata the "same".
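
As a rough illustration of the point about defaults, here is a minimal
Python sketch of only the [STRIP] step of the parsing canonical form. It
is an assumption-laden simplification: the names strip_schema and
simplified_canonical_form are hypothetical, and the other steps from the
spec (full name resolution, primitive simplification, string and integer
normalisation) are left out.

```python
import json

# Attributes the spec's [STRIP] step keeps, listed in the canonical order.
KEPT = ("name", "type", "fields", "symbols", "items", "values", "size")


def strip_schema(schema):
    """Recursively drop every attribute that is not in KEPT."""
    if isinstance(schema, list):
        # A union, a list of fields, or a list of enum symbols.
        return [strip_schema(item) for item in schema]
    if isinstance(schema, dict):
        return {key: strip_schema(schema[key]) for key in KEPT if key in schema}
    # A primitive type name, a symbol, a name, or a fixed size.
    return schema


def simplified_canonical_form(schema):
    """Hypothetical helper: serialise the stripped schema without whitespace."""
    return json.dumps(strip_schema(schema), separators=(",", ":"))


# Two enum schemata that differ only in their default and doc.
suit_a = {"type": "enum", "name": "Suit", "doc": "first version",
          "symbols": ["SPADES", "HEARTS"], "default": "SPADES"}
suit_b = {"type": "enum", "name": "Suit",
          "symbols": ["SPADES", "HEARTS"], "default": "HEARTS"}

print(simplified_canonical_form(suit_a))
print(simplified_canonical_form(suit_a) == simplified_canonical_form(suit_b))
```

Both schemata write a symbol as the same bytes, so under this sketch they
end up with the same stripped form even though their defaults differ.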

Logical types and conversions, on the other hand, define how data is turned
into a raw form. This influences serialisation, and arguably should be part
of the canonical form. This would allow, for example, schema evolution
between a string based date and the current number based one.

The tricky bit here is that they sometimes need additional properties (e.g.
decimal), and not all combinations of properties are resolvable (to resolve
a decimal, the scale and precision in the read schema cannot be smaller
than in the write schema).

However, we currently have no way to let a logical type include extra
properties in the canonical form, or to determine if the properties allow
resolution.
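
To make the decimal condition above concrete, a check along these lines
could express it. This is only a sketch of the rule as stated in this
message, not of what the specification or any implementation does today;
the function name decimal_resolvable is hypothetical, and the schemata are
plain dicts as parsed from JSON.

```python
def decimal_resolvable(writer, reader):
    """Return True if neither precision nor scale shrinks from writer to reader.

    Encodes only the condition from this message. Scale defaults to 0 when
    absent, as in the Avro decimal logical type; precision is required.
    """
    writer_precision = writer["precision"]
    writer_scale = writer.get("scale", 0)
    reader_precision = reader["precision"]
    reader_scale = reader.get("scale", 0)
    return reader_precision >= writer_precision and reader_scale >= writer_scale


written = {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 2}
wider = {"type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 4}
narrower = {"type": "bytes", "logicalType": "decimal", "precision": 6, "scale": 2}

print(decimal_resolvable(written, wider))     # True
print(decimal_resolvable(written, narrower))  # False
```

A stricter variant might also compare the number of integer digits
(precision minus scale), since a larger scale with the same precision
leaves fewer digits before the decimal point.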

Kind regards,
Oscar

-- 
Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

On Mon 31 Jul 2023 at 19:42, Ryan Skraba <ry...@skraba.com> wrote:

> This is a bit weird, and it could be clarified.
>
> The "same" schema, of course, can be represented by different JSON
> text files: attribute order, ways to express the full name, inheriting
> full names, etc.  The canonical form is always the "same" for the
> "same" schema and can be used to generate a fingerprint.  Which leads
> to the confusing question: what does SAME mean, exactly?
>
> My understanding is that it defines the minimum you need to
> *serialize* the data (the writer, or actual schema).  If you have a
> SAME schema when reading that data, it is guaranteed to be able to
> deserialize it, even if you've changed some extra attributes in the
> reader (or expected) schema: how the namespace is set, added or
> removed a logical type on a field, or (you guessed it) if you've
> changed the default that you want to use to cover missing data or
> enums.
>
> With the rise of streaming data, and schema registries, there probably
> is a new need for a definition of SAME that includes schema evolution
> attributes.  I think there's a good JIRA that describes this, but the
> parsing canonical form does NOT meet that need.
>
> If I've made a mistake here, feel free to jump in with your clarification!
>
> All my best, Ryan
>
> On Sat, Jul 29, 2023 at 5:06 PM Michael A. Smith <mi...@smith-li.com>
> wrote:
> >
> > The spec says one of the steps to get parsing canonical form is
> >
> > > [STRIP] Keep only attributes that are relevant to parsing data, which
> > > are: type, name, fields, symbols, items, values, size. Strip all others
> > > (e.g., doc and aliases).
> >
> > and indeed, we strip the default from an EnumSchema. But is that
> > right? It seems to me that we'd want to keep that. Can someone help me
> > understand if (and how) it's correct to strip the enum default in
> > parsing canonical form?
> >
> > Thanks,
> > Michael
>