Posted to user@avro.apache.org by roger peppe <ro...@gmail.com> on 2019/12/05 13:06:46 UTC

defaults for complex types (was Re: recursive types)

On Wed, 4 Dec 2019 at 11:38, Lee Hambley <le...@gmail.com> wrote:

> Hi Rog,
>
> Good question. The answer lies in the docs, in the "Parsing Canonical Form
> for Schemas" section, where it states (amongst all the other transformation
> rules):
>
>> [ORDER] Order the appearance of fields of JSON objects as follows: *name*,
>> type, *fields*, symbols, items, values, size. For example, if an object
>> has type, name, and size fields, then the name field should appear
>> first, followed by the type and then the size fields.
>
>
> (emphasis mine)
>
> The canonical form for schemas becomes more relevant to Avro usage when
> working with a schema registry, for example, but that's a really common
> use case, and I consider the definition of a canonical form for schema
> comparisons to be a strength of Avro compared with other serialization
> formats.
>
> -
> https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas
>

Thanks very much - I'd missed that, very helpful!

Maybe you might be able to help with another part of the spec that I've
been puzzling over too: default values for complex types.
The spec doesn't seem to say how union values nested inside complex types
are represented in default values.

For example, consider the following schema:

{
    "type": "record",
    "name": "R",
    "fields": [
        {
            "name": "F",
            "type": {
                "type": "array",
                "items": [
                    {
                        "type": "enum",
                        "name": "E1",
                        "symbols": ["A", "B"]
                    },
                    {
                        "type": "enum",
                        "name": "E2",
                        "symbols": ["B", "A", "C"]
                    }
                ]
            },
            "default": ["A", "B", "C"]
        }
    ]
}

This seems like it should be valid according to the spec, because default
value encodings don't encode the type name in enums, unlike in the JSON
encoding, but in this case there seems to be no way to tell which enum types
end up in the array value of the field F, because the enum symbols themselves
are ambiguous.

How are schema validators meant to resolve this ambiguity?

 cheers,
    rog.
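
The ambiguity described above can be made concrete with a small sketch. It
uses only Python's standard json module, not any Avro library, and the helper
name candidate_branches is hypothetical. For each symbol in the default it
lists every enum branch of the union that could own that symbol.

```python
import json

# The schema from the message above, unchanged apart from whitespace.
SCHEMA = json.loads("""
{
  "type": "record", "name": "R",
  "fields": [{
    "name": "F",
    "type": {"type": "array", "items": [
      {"type": "enum", "name": "E1", "symbols": ["A", "B"]},
      {"type": "enum", "name": "E2", "symbols": ["B", "A", "C"]}
    ]},
    "default": ["A", "B", "C"]
  }]
}
""")


def candidate_branches(union, symbol):
    """Return the names of all enum branches in the union containing symbol."""
    return [
        branch["name"]
        for branch in union
        if isinstance(branch, dict)
        and branch.get("type") == "enum"
        and symbol in branch["symbols"]
    ]


field = SCHEMA["fields"][0]
union = field["type"]["items"]
candidates = {s: candidate_branches(union, s) for s in field["default"]}
for symbol, names in candidates.items():
    # A symbol with more than one candidate cannot be attributed to one enum.
    print(symbol, "->", names)
```

Symbols that map to more than one branch are the ones an untagged default
cannot attribute to a single enum type.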


> HTH,
>
> Lee Hambley
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>
>
> On Wed, 4 Dec 2019 at 12:17, roger peppe <ro...@gmail.com> wrote:
>
>> Hi,
>>
>> My apologies in advance if this topic has been well discussed before -
>> the mailing list search tool appears to be broken (the link points to the
>> expired domain name "search-hadoop.com").
>>
>> I'm trying to understand recursive types in Avro, given that the
>> specification says this about names
>> <http://avro.apache.org/docs/current/spec.html#names>:
>>
>>> a name must be defined before it is used ("before" in the depth-first,
>>> left-to-right traversal of the JSON parse tree, where the types attribute
>>> of a protocol is always deemed to come "before" the messages attribute.)
>>
>>
>> By my reading, this would make the following Avro schema invalid, because
>> the name "R" will not yet be defined when it's referenced inside the type
>> of the field F, because in depth-first order, the leaf is traversed before
>> the root.
>>
>> {
>>     "type": "record",
>>     "fields": [
>>         {"name": "F", "type": ["null", "R"]}
>>     ],
>>     "name": "R"
>> }
>>
>> It seems that types like this are valid in practice (I found the above
>> example in an Avro test suite), so could someone enlighten me as to how
>> this is allowed, please?
>>
>> Thanks for any info. If I'm asking in the wrong place, please advise me
>> of a better forum!
>>
>>     rog.
>>
>>
>>
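
One way to read the recursive-type question quoted above, together with the
canonical-form ordering (name before fields), is sketched below. This is an
illustration under stated assumptions, not how any Avro implementation is
known to work: check_names is a hypothetical helper, namespaces and aliases
are ignored, and a named type is registered as soon as its object is entered,
regardless of where the "name" key appears in the JSON text.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double",
              "bytes", "string"}
NAMED = {"record", "enum", "fixed"}


def check_names(schema, defined=None):
    """Walk a schema depth-first and fail on a name used before definition.

    Assumption: a named type counts as defined before its fields are
    visited, which matches the canonical key order rather than the textual
    key order of the JSON object.
    """
    defined = set() if defined is None else defined
    if isinstance(schema, str):
        if schema not in PRIMITIVES and schema not in defined:
            raise ValueError(f"undefined name: {schema}")
    elif isinstance(schema, list):  # union: check every branch
        for branch in schema:
            check_names(branch, defined)
    else:
        kind = schema["type"]
        if kind in NAMED:
            defined.add(schema["name"])
        if kind == "record":
            for field in schema["fields"]:
                check_names(field["type"], defined)
        elif kind == "array":
            check_names(schema["items"], defined)
        elif kind == "map":
            check_names(schema["values"], defined)
        elif kind not in NAMED:
            check_names(kind, defined)  # e.g. {"type": "string"}
    return defined


# The recursive schema from the quoted message, with "name" written last.
RECURSIVE = json.loads("""
{
    "type": "record",
    "fields": [
        {"name": "F", "type": ["null", "R"]}
    ],
    "name": "R"
}
""")

print(check_names(RECURSIVE))
```

Under that assumption the reference to "R" inside field F resolves, because
"R" is registered before the fields are walked.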

Re: defaults for complex types (was Re: recursive types)

Posted by roger peppe <ro...@gmail.com>.
Hi,

My immediate thought, if preserving backward-compatibility is a concern, is
to specify that the current rule applies recursively.
That is, where the spec says: "Default values for union fields correspond to
the first schema in the union", we'd say:
"Default values for union fields (and any union values within the field)
correspond to the first schema in the union".

Then you don't need any complex ambiguity rules.

  cheers,
    rog.
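
A minimal sketch of the recursive rule proposed above, assuming plain Python
dicts and lists as the parsed schema. The function valid_default is
hypothetical and deliberately incomplete: it does not resolve references to
named types, and it does not apply field defaults inside record values.

```python
def valid_default(schema, value):
    """Check a default value, using the first branch of every union met."""
    if isinstance(schema, list):
        # Proposed rule: at any depth, a union default uses branch 0 only.
        return valid_default(schema[0], value)
    if isinstance(schema, str):
        schema = {"type": schema}
    kind = schema["type"]
    if kind == "null":
        return value is None
    if kind == "boolean":
        return isinstance(value, bool)
    if kind in ("int", "long"):
        return isinstance(value, int) and not isinstance(value, bool)
    if kind in ("float", "double"):
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind in ("string", "bytes", "fixed"):
        return isinstance(value, str)
    if kind == "enum":
        return value in schema["symbols"]
    if kind == "array":
        return isinstance(value, list) and all(
            valid_default(schema["items"], item) for item in value)
    if kind == "map":
        return isinstance(value, dict) and all(
            valid_default(schema["values"], item) for item in value.values())
    if kind == "record":
        return isinstance(value, dict) and all(
            field["name"] in value
            and valid_default(field["type"], value[field["name"]])
            for field in schema["fields"])
    return False  # unresolved named reference or unknown type


# Type of field F from the schema earlier in the thread.
F_TYPE = {"type": "array", "items": [
    {"type": "enum", "name": "E1", "symbols": ["A", "B"]},
    {"type": "enum", "name": "E2", "symbols": ["B", "A", "C"]},
]}

print(valid_default(F_TYPE, ["A", "B"]))       # every item is in E1
print(valid_default(F_TYPE, ["A", "B", "C"]))  # "C" is not in E1
```

With this reading the default ["A", "B", "C"] is simply rejected, so no
separate ambiguity rules are needed.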

On Tue, 24 Mar 2020 at 09:29, Andy Le <an...@gmail.com> wrote:

> Hi Roger,
>
> I'm thinking of rereading the Avro spec and writing down some
> disambiguation rules. The rule suggested above for enums is one of them. It
> would be great if you could suggest others.
>
> To me, using rules is the most affordable way to keep compatibility.
>
> If you're interested, please check my fork: https://github.com/anhldbk/avro
>
> Thank you.
>
> On 2020/03/23 17:51:19, roger peppe <ro...@gmail.com> wrote:
> > On Mon, 23 Mar 2020 at 11:11, Andy Le <an...@gmail.com> wrote:
> >
> > > I may say:
> > >
> > > If enums are used in a Union, they must NOT use the same symbols
> > >
> > > Is that OK, Roger?
> > >
> >
> > I'm not sure that it is OK. The problem is wider than just enums -
> > AFAICS it applies to record and fixed types too, because they're named
> > types - more than one of a record or fixed type is allowed in a union, but
> > the default-value representation doesn't allow distinguishing between them.
> >
> > The ideal solution coming from a fresh start would be to use exactly the
> > same representation for default values as for the JSON encoding, but I
> > appreciate that backward-compatibility concerns would make that difficult
> > or impossible to do.
> >
> >
> >
> > > On 2020/03/23 09:44:45, roger peppe <ro...@gmail.com> wrote:
> > > > On Sun, 22 Mar 2020 at 09:09, Andy Le <an...@gmail.com> wrote:
> > > >
> > > > > Hi Roger,
> > > > >
> > > > > > Instead of trying to modify the spec, is it easier for us to discard
> > > > > > schemas with such ambiguity?
> > > > > >
> > > > > That certainly sounds like a reasonable approach to me. How would you word
> > > > > the definition of ambiguity for this purpose?
> > > >
> > >
> >
>

Re: defaults for complex types (was Re: recursive types)

Posted by Andy Le <an...@gmail.com>.
Hi Roger,

I'm thinking of rereading the Avro spec and writing down some disambiguation rules. The rule suggested above for enums is one of them. It would be great if you could suggest others.

To me, using rules is the most affordable way to keep compatibility.

If you're interested, please check my fork: https://github.com/anhldbk/avro

Thank you.

On 2020/03/23 17:51:19, roger peppe <ro...@gmail.com> wrote: 
> On Mon, 23 Mar 2020 at 11:11, Andy Le <an...@gmail.com> wrote:
> 
> > I may say:
> >
> > If enums are used in a Union, they must NOT use the same symbols
> >
> > Is that OK, Roger?
> >
> 
> I'm not sure that it is OK. The problem is wider than just enums -
> AFAICS it applies to record and fixed types too, because they're named
> types - more than one of a record or fixed type is allowed in a union, but
> the default-value representation doesn't allow distinguishing between them.
> 
> The ideal solution coming from a fresh start would be to use exactly the
> same representation for default values as for the JSON encoding, but I
> appreciate that backward-compatibility concerns would make that difficult
> or impossible to do.
> 
> 
> 
> > On 2020/03/23 09:44:45, roger peppe <ro...@gmail.com> wrote:
> > > On Sun, 22 Mar 2020 at 09:09, Andy Le <an...@gmail.com> wrote:
> > >
> > > > Hi Roger,
> > > >
> > > > Instead of trying to modify the spec, is it easier for us to discard
> > > > schemas with such ambiguity?
> > > >
> > > > That certainly sounds like a reasonable approach to me. How would you word
> > > > the definition of ambiguity for this purpose?
> > >
> >
> 

Re: defaults for complex types (was Re: recursive types)

Posted by roger peppe <ro...@gmail.com>.
On Mon, 23 Mar 2020 at 11:11, Andy Le <an...@gmail.com> wrote:

> I may say:
>
> If enums are used in a Union, they must NOT use the same symbols
>
> Is that OK, Roger?
>

I'm not sure that it is OK. The problem is wider than just enums -
AFAICS it applies to record and fixed types too, because they're named
types - more than one of a record or fixed type is allowed in a union, but
the default-value representation doesn't allow distinguishing between them.

The ideal solution coming from a fresh start would be to use exactly the
same representation for default values as for the JSON encoding, but I
appreciate that backward-compatibility concerns would make that difficult
or impossible to do.



> On 2020/03/23 09:44:45, roger peppe <ro...@gmail.com> wrote:
> > On Sun, 22 Mar 2020 at 09:09, Andy Le <an...@gmail.com> wrote:
> >
> > > Hi Roger,
> > >
> > > Instead of trying to modify the spec, is it easier for us to discard
> > > schemas with such ambiguity?
> > >
> > That certainly sounds like a reasonable approach to me. How would you word
> > the definition of ambiguity for this purpose?
> >
>

Re: defaults for complex types (was Re: recursive types)

Posted by Andy Le <an...@gmail.com>.
I may say:

If enums are used in a Union, they must NOT use the same symbols

Is that OK, Roger?
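
The rule suggested above could be checked along these lines. This is a sketch
only: shared_enum_symbols is a hypothetical helper working on a parsed union
(a Python list of schema dicts), and, as noted elsewhere in the thread, it
says nothing about record or fixed branches.

```python
from collections import defaultdict


def shared_enum_symbols(union):
    """Map each symbol used by more than one enum branch to those enums."""
    owners = defaultdict(list)
    for branch in union:
        if isinstance(branch, dict) and branch.get("type") == "enum":
            for symbol in branch["symbols"]:
                owners[symbol].append(branch["name"])
    return {s: names for s, names in owners.items() if len(names) > 1}


# The union from the schema earlier in the thread.
UNION = [
    {"type": "enum", "name": "E1", "symbols": ["A", "B"]},
    {"type": "enum", "name": "E2", "symbols": ["B", "A", "C"]},
]

clashes = shared_enum_symbols(UNION)
# A non-empty result means the union breaks the suggested rule.
print(clashes)
```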

On 2020/03/23 09:44:45, roger peppe <ro...@gmail.com> wrote: 
> On Sun, 22 Mar 2020 at 09:09, Andy Le <an...@gmail.com> wrote:
> 
> > Hi Roger,
> >
> > Instead of trying to modify the spec, is it easier for us to discard
> > schemas with such ambiguity?
> >
> That certainly sounds like a reasonable approach to me. How would you word
> the definition of ambiguity for this purpose?
> 

Re: defaults for complex types (was Re: recursive types)

Posted by roger peppe <ro...@gmail.com>.
On Sun, 22 Mar 2020 at 09:09, Andy Le <an...@gmail.com> wrote:

> Hi Roger,
>
> Instead of trying to modify the spec, is it easier for us to discard
> schemas with such ambiguity?
>

That certainly sounds like a reasonable approach to me. How would you word
the definition of ambiguity for this purpose?

Re: defaults for complex types (was Re: recursive types)

Posted by Andy Le <an...@gmail.com>.
Hi Roger,

Instead of trying to modify the spec, is it easier for us to discard schemas with such ambiguity?



On 2019/12/06 14:42:45, roger peppe <ro...@gmail.com> wrote: 
> On Fri, 6 Dec 2019 at 13:49, Lee Hambley <le...@gmail.com> wrote:
> 
> > Rog,
> >
> > I alluded to it previously, but I really think you should send a PR to
> > improve the docs. This knowledge was hard-won for you, and the authors of
> > Avro are quite responsive at the moment after a couple of years (the
> > black-winter of 1.8.2) of low-activity.
> >
> 
> Good suggestion! I did that: https://github.com/apache/avro/pull/738
> 

Re: defaults for complex types (was Re: recursive types)

Posted by roger peppe <ro...@gmail.com>.
On Fri, 6 Dec 2019 at 13:49, Lee Hambley <le...@gmail.com> wrote:

> Rog,
>
> I alluded to it previously, but I really think you should send a PR to
> improve the docs. This knowledge was hard-won for you, and the authors of
> Avro are quite responsive at the moment after a couple of years (the
> black-winter of 1.8.2) of low-activity.
>

Good suggestion! I did that: https://github.com/apache/avro/pull/738

Re: defaults for complex types (was Re: recursive types)

Posted by Lee Hambley <le...@gmail.com>.
Rog,

I alluded to it previously, but I really think you should send a PR to
improve the docs. This knowledge was hard-won for you, and the authors of
Avro are quite responsive at the moment after a couple of years (the
black winter of 1.8.2) of low activity.

Carpe diem!

Thanks, this thread has been fun!

Lee Hambley
http://lee.hambley.name/
+49 (0) 170 298 5667


On Fri, 6 Dec 2019 at 14:47, roger peppe <ro...@gmail.com> wrote:

> On Fri, 6 Dec 2019 at 10:38, Ryan Skraba <ry...@skraba.com> wrote:
>
>> Hello!   I had a Java unit test ready to go (looking at default values
>> for complex types for AVRO-2636), so just reporting back (the easy
>> work!):
>>
>
> Thanks for the responses!
>
>
>> 1. In Java, the schema above is parsed without error, but when
>> attempting to use the default value, it fails with a
>> NullPointerException (trying to find the symbol C in E1).
>>
>
> I tried it with gogen-avro <https://github.com/actgardner/gogen-avro> and
> I had a similar issue. It generates invalid Go code as output.
>
>> 2. If you were to disambiguate the symbols using the Avro JSON
>> encoding ("default": [{"E1":"B"},{"E2":"A"},{"E2":"C"}]), Java fails
>> while parsing the schema:
>>
>> org.apache.avro.AvroTypeException: Invalid default for field F:
>> [{"E1":"B"},{"E2":"A"},{"E2":"C"}] not a
>>
>> {"type":"array","items":[{"type":"enum","name":"E1","symbols":["A","B"]},{"type":"enum","name":"E2","symbols":["B","A","C"]}]}
>> at org.apache.avro.Schema.validateDefault(Schema.java:1542)
>> at org.apache.avro.Schema.access$500(Schema.java:87)
>> at org.apache.avro.Schema$Field.<init>(Schema.java:523)
>> at org.apache.avro.Schema.parse(Schema.java:1649)
>> at org.apache.avro.Schema$Parser.parse(Schema.java:1396)
>> at org.apache.avro.Schema$Parser.parse(Schema.java:1384)
>>
>> It seems that Java implements `Only the first schema in any union can
>> be used in a default value` as opposed to `Default values for union
>> fields correspond to the first schema in the union` (in the example,
>> it isn't a union field).
>>
>> Naively, I would expect any JSON encoded data to be a valid default
>> value (which is not what the spec says).  Does anyone know why the
>> "first schema only" rule was added to the spec?
>>
>
> Yes, I think the authors missed a trick here. I suspect things would be
> cleaner if the default value was encoded in exactly the same way as
> JSON-encoded Avro values. As things currently are, an implementer needs to
> implement two slightly different ways of translating from JSON to an Avro
> value.
>
>   cheers,
>     rog.
>

Re: defaults for complex types (was Re: recursive types)

Posted by roger peppe <ro...@gmail.com>.
On Fri, 6 Dec 2019 at 10:38, Ryan Skraba <ry...@skraba.com> wrote:

> Hello!   I had a Java unit test ready to go (looking at default values
> for complex types for AVRO-2636), so just reporting back (the easy
> work!):
>

Thanks for the responses!


> 1. In Java, the schema above is parsed without error, but when
> attempting to use the default value, it fails with a
> NullPointerException (trying to find the symbol C in E1).
>

I tried it with gogen-avro <https://github.com/actgardner/gogen-avro> and I
had a similar issue. It generates invalid Go code as output.

> 2. If you were to disambiguate the symbols using the Avro JSON
> encoding ("default": [{"E1":"B"},{"E2":"A"},{"E2":"C"}]), Java fails
> while parsing the schema:
>
> org.apache.avro.AvroTypeException: Invalid default for field F:
> [{"E1":"B"},{"E2":"A"},{"E2":"C"}] not a
>
> {"type":"array","items":[{"type":"enum","name":"E1","symbols":["A","B"]},{"type":"enum","name":"E2","symbols":["B","A","C"]}]}
> at org.apache.avro.Schema.validateDefault(Schema.java:1542)
> at org.apache.avro.Schema.access$500(Schema.java:87)
> at org.apache.avro.Schema$Field.<init>(Schema.java:523)
> at org.apache.avro.Schema.parse(Schema.java:1649)
> at org.apache.avro.Schema$Parser.parse(Schema.java:1396)
> at org.apache.avro.Schema$Parser.parse(Schema.java:1384)
>
> It seems that Java implements `Only the first schema in any union can
> be used in a default value` as opposed to `Default values for union
> fields correspond to the first schema in the union` (in the example,
> it isn't a union field).
>
> Naively, I would expect any JSON encoded data to be a valid default
> value (which is not what the spec says).  Does anyone know why the
> "first schema only" rule was added to the spec?
>

Yes, I think the authors missed a trick here. I suspect things would be
cleaner if the default value was encoded in exactly the same way as
JSON-encoded Avro values. As things currently are, an implementer needs to
implement two slightly different ways of translating from JSON to an Avro
value.

  cheers,
    rog.

Re: defaults for complex types (was Re: recursive types)

Posted by Doug Cutting <cu...@gmail.com>.
On Fri, Dec 6, 2019 at 2:38 AM Ryan Skraba <ry...@skraba.com> wrote:

> Naively, I would expect any JSON encoded data to be a valid default
> value (which is not what the spec says).  Does anyone know why the
> "first schema only" rule was added to the spec?
>

I think we felt this would make things simpler: specifying a mechanism
for resolving the ambiguities in the JSON representations of ints
and longs, floats and doubles, strings and bytes, records and maps, etc.
would have made implementation and comprehension more difficult.

In retrospect, it might have been better to use the type-tagged format
specified in the "JSON Encoding" section for default values.  This may be a
historical artifact.  If default values were added to Avro before the JSON
encoding, then the concept of the type tagging would not have been in the
spec when default values were defined.  However changing the format of
default values after they were defined would create a breaking
incompatibility.  That said, I don't recall anyone ever suggesting this
improvement before.

Doug
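
The ambiguities listed above, and the way a type tag removes them, can be
sketched as follows. Both helpers are hypothetical and use only the standard
json module. The candidate lists cover just the pairs named in the message
above, so they are not exhaustive.

```python
import json


def untagged_candidates(value):
    """Avro types that an untagged JSON value could plausibly stand for."""
    if value is None:
        return ["null"]
    if isinstance(value, bool):
        return ["boolean"]
    if isinstance(value, int):
        return ["int", "long"]
    if isinstance(value, float):
        return ["float", "double"]
    if isinstance(value, str):
        return ["string", "bytes"]
    if isinstance(value, list):
        return ["array"]
    return ["record", "map"]  # a JSON object


def tagged_branch(value):
    """Branch named by a type-tagged union value, e.g. {"long": 1}."""
    if value is None:
        return "null"
    if not isinstance(value, dict) or len(value) != 1:
        raise ValueError("expected null or a single-key object")
    return next(iter(value))


print(untagged_candidates(json.loads("1")))      # more than one candidate
print(tagged_branch(json.loads('{"long": 1}')))  # exactly one branch
```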

Re: defaults for complex types (was Re: recursive types)

Posted by Andy Le <an...@gmail.com>.
As Ryan said:

> It seems that Java implements `Only the first schema in any union can
> be used in a default value` as opposed to `Default values for union
> fields correspond to the first schema in the union` (in the example,
> it isn't a union field).

I think it's time for us to reconsider this requirement for unions. I've already customized the Avro code to make it happen.



On 2019/12/06 10:38:19, Ryan Skraba <ry...@skraba.com> wrote: 
> Hello!   I had a Java unit test ready to go (looking at default values
> for complex types for AVRO-2636), so just reporting back (the easy
> work!):
> 
> 1. In Java, the schema above is parsed without error, but when
> attempting to use the default value, it fails with a
> NullPointerException (trying to find the symbol C in E1).
> 
> 2. If you were to disambiguate the symbols using the Avro JSON
> encoding ("default": [{"E1":"B"},{"E2":"A"},{"E2":"C"}]), Java fails
> while parsing the schema:
> 
> org.apache.avro.AvroTypeException: Invalid default for field F:
> [{"E1":"B"},{"E2":"A"},{"E2":"C"}] not a
> {"type":"array","items":[{"type":"enum","name":"E1","symbols":["A","B"]},{"type":"enum","name":"E2","symbols":["B","A","C"]}]}
> at org.apache.avro.Schema.validateDefault(Schema.java:1542)
> at org.apache.avro.Schema.access$500(Schema.java:87)
> at org.apache.avro.Schema$Field.<init>(Schema.java:523)
> at org.apache.avro.Schema.parse(Schema.java:1649)
> at org.apache.avro.Schema$Parser.parse(Schema.java:1396)
> at org.apache.avro.Schema$Parser.parse(Schema.java:1384)
> 
> It seems that Java implements `Only the first schema in any union can
> be used in a default value` as opposed to `Default values for union
> fields correspond to the first schema in the union` (in the example,
> it isn't a union field).
> 
> Naively, I would expect any JSON encoded data to be a valid default
> value (which is not what the spec says).  Does anyone know why the
> "first schema only" rule was added to the spec?
> 
> Best regards, Ryan
> 
> 
> 

Re: defaults for complex types (was Re: recursive types)

Posted by Ryan Skraba <ry...@skraba.com>.
Hello!   I had a Java unit test ready to go (looking at default values
for complex types for AVRO-2636), so just reporting back (the easy
work!):

1. In Java, the schema above is parsed without error, but when
attempting to use the default value, it fails with a
NullPointerException (trying to find the symbol C in E1).

2. If you were to disambiguate the symbols using the Avro JSON
encoding ("default": [{"E1":"B"},{"E2":"A"},{"E2":"C"}]), Java fails
while parsing the schema:

org.apache.avro.AvroTypeException: Invalid default for field F:
[{"E1":"B"},{"E2":"A"},{"E2":"C"}] not a
{"type":"array","items":[{"type":"enum","name":"E1","symbols":["A","B"]},{"type":"enum","name":"E2","symbols":["B","A","C"]}]}
at org.apache.avro.Schema.validateDefault(Schema.java:1542)
at org.apache.avro.Schema.access$500(Schema.java:87)
at org.apache.avro.Schema$Field.<init>(Schema.java:523)
at org.apache.avro.Schema.parse(Schema.java:1649)
at org.apache.avro.Schema$Parser.parse(Schema.java:1396)
at org.apache.avro.Schema$Parser.parse(Schema.java:1384)

It seems that Java implements `Only the first schema in any union can
be used in a default value` as opposed to `Default values for union
fields correspond to the first schema in the union` (in the example,
it isn't a union field).

Naively, I would expect any JSON encoded data to be a valid default
value (which is not what the spec says).  Does anyone know why the
"first schema only" rule was added to the spec?

Best regards, Ryan
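
For comparison, here is a sketch of how the type-tagged default from point 2
could be resolved by a reader that accepted it. This is not what the Java
implementation does (it rejects this default, as the stack trace shows);
resolve_tagged is a hypothetical helper over plain Python data.

```python
# The union and the tagged default from point 2 above.
UNION = [
    {"type": "enum", "name": "E1", "symbols": ["A", "B"]},
    {"type": "enum", "name": "E2", "symbols": ["B", "A", "C"]},
]
TAGGED_DEFAULT = [{"E1": "B"}, {"E2": "A"}, {"E2": "C"}]


def resolve_tagged(union, item):
    """Return (enum name, symbol) for one tagged item, or raise ValueError."""
    if not isinstance(item, dict) or len(item) != 1:
        raise ValueError(f"not a single-key tagged value: {item!r}")
    (name, symbol), = item.items()
    for branch in union:
        if branch["name"] == name:
            if symbol not in branch["symbols"]:
                raise ValueError(f"{symbol!r} is not a symbol of {name}")
            return name, symbol
    raise ValueError(f"no branch named {name!r} in the union")


resolved = [resolve_tagged(UNION, item) for item in TAGGED_DEFAULT]
print(resolved)
```

Each item names its branch explicitly, so no guessing about E1 versus E2 is
needed.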



Re: defaults for complex types (was Re: recursive types)

Posted by Lee Hambley <le...@gmail.com>.
Hi Rog,

Glad my pointers were useful, the Avro spec really is a marvel.

Regarding your follow-up question, I'm honestly not sure. It's an interesting
contrived example, though, and it's interesting that no matter how well written
the spec is, it can still be ambiguous.

I found this snippet in the 1.9.x docs, where I know there were some changes
to defaults for complex types; the 1.8 docs may be incomplete in that
regard. ( https://avro.apache.org/docs/1.9.0/spec.html#schema_complex )

> Default values for union fields correspond to the first schema in the
> union. Default values for bytes and fixed fields are JSON strings, where
> Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255.
>

I take `Default values for union fields correspond to the first schema in
the union` to mean either that your default, which includes values from the
2nd schema in the union, is invalid, *or* that where the symbol exists in the
first schema it refers to the first schema, and when not, it refers to the
first schema in which it _does_ exist.
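
To make the two readings concrete, here is a minimal Python sketch. It does
not use any Avro library and is not how any particular implementation behaves;
the function names are hypothetical and exist only for illustration. It applies
each reading to the union and default from Rog's schema.

```python
# Illustration only: hypothetical helpers, not part of any Avro library.
# The union branches and default are copied from the schema in the question.
UNION = [
    {"type": "enum", "name": "E1", "symbols": ["A", "B"]},
    {"type": "enum", "name": "E2", "symbols": ["B", "A", "C"]},
]
DEFAULT = ["A", "B", "C"]


def resolve_first_branch_only(union, default):
    """Reading 1: every default item must belong to the first union branch."""
    first = union[0]
    for symbol in default:
        if symbol not in first["symbols"]:
            raise ValueError(f"{symbol!r} is not a symbol of {first['name']}")
    return [(first["name"], symbol) for symbol in default]


def resolve_first_matching_branch(union, default):
    """Reading 2: each item binds to the first branch that declares it."""
    resolved = []
    for symbol in default:
        for branch in union:
            if symbol in branch["symbols"]:
                resolved.append((branch["name"], symbol))
                break
        else:
            # No branch declares this symbol at all.
            raise ValueError(f"{symbol!r} matches no branch of the union")
    return resolved


if __name__ == "__main__":
    try:
        print(resolve_first_branch_only(UNION, DEFAULT))
    except ValueError as err:
        print("reading 1 rejects the default:", err)
    print("reading 2:", resolve_first_matching_branch(UNION, DEFAULT))
```

Under the first reading the default is rejected because "C" is not in E1;
under the second, "A" and "B" bind to E1 and "C" binds to E2. Which of these
(if either) a real implementation follows is exactly the open question.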

One way to find out would be to run some data through a couple of common
implementations, and see how they handle the resulting data, and, maybe
feed that back into Avro docs in the form of a PR if you come up with
something useful?

Either way, I'm curious now! Let me know when you have an answer?

Cheers,

Lee Hambley
http://lee.hambley.name/
+49 (0) 170 298 5667


On Thu, 5 Dec 2019 at 14:07, roger peppe <ro...@gmail.com> wrote:

> On Wed, 4 Dec 2019 at 11:38, Lee Hambley <le...@gmail.com> wrote:
>
>> Hi Rog,
>>
>> Good question, the answer lay in the docs in the "Parsing Canonical Form
>> for Schemas" where it states (amongst all the other transformation rules)
>>
>> [ORDER] Order the appearance of fields of JSON objects as follows: *name*,
>>> type, * fields*, symbols, items, values, size. For example, if an
>>> object has type, name, and size fields, then the name field should
>>> appear first, followed by the type and then the size fields.
>>
>>
>> (emphasis mine)
>>
>> The canonical form for schemas becomes more relevant to Avro usage when
>> working with a schema registry for e.g, but it's a really common use-case
>> and I consider definition of a canonical form for schema comparisons to be
>> a strength of Avro compared with other serialization formats.
>>
>> -
>> https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas
>>
>
> Thanks very much - I'd missed that, very helpful!
>
> Maybe you might be able to help with another part of the spec that I've
> been puzzling over too: default values for complex types.
> The spec doesn't seem to say how unions in complex types are specified
> when in default values.
>
> For example, consider the following schema:
>
> {
>     "type": "record",
>     "name": "R",
>     "fields": [
>         {
>             "name": "F",
>             "type": {
>                 "type": "array",
>                 "items": [
>                     {
>                         "type": "enum",
>                         "name": "E1",
>                         "symbols": ["A", "B"]
>                     },
>                     {
>                         "type": "enum",
>                         "name": "E2",
>                         "symbols": ["B", "A", "C"]
>                     }
>                 ]
>             },
>             "default": ["A", "B", "C"]
>         }
>     ]
> }
>
> This seems like it should be valid according to the spec, because default
> value encodings don't encode the type name in enums, unlike in the JSON
> encoding, but in this case there seems to be no way to tell which enum types end
> up in the array value of the field F, because the enum symbols themselves
> are ambiguous.
>
> How are schema validators meant to resolve this ambiguity?
>
>  cheers,
>     rog.
>
>
>> HTH,
>>
>> Lee Hambley
>> http://lee.hambley.name/
>> +49 (0) 170 298 5667
>>
>>
>> On Wed, 4 Dec 2019 at 12:17, roger peppe <ro...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My apologies in advance if this topic has been well discussed before -
>>> the mailing list search tool appears to be broken (the link points to the
>>> expired domain name "search-hadoop.com").
>>>
>>> I'm trying to understand about recursive types in Avro, given that the
>>> specification says about names
>>> <http://avro.apache.org/docs/current/spec.html#names>:
>>>
>>> a name must be defined before it is used ("before" in the depth-first,
>>>> left-to-right traversal of the JSON parse tree, where the types attribute
>>>> of a protocol is always deemed to come "before" the messages
>>>>  attribute.)
>>>
>>>
>>> By my reading, this would make the following Avro schema invalid,
>>> because the name "R" will not yet be defined when it's referenced inside
>>> the type of the field F, because in depth-first order, the leaf is
>>> traversed before the root.
>>>
>>> {
>>>     "type": "record",
>>>     "fields": [
>>>         {"name": "F", "type": ["null", "R"]}
>>>     ],
>>>     "name": "R"
>>> }
>>>
>>> It seems that types like this are valid in practice (I found the above
>>> example in an Avro test suite), so could someone enlighten me as to how
>>> this is allowed, please?
>>>
>>> Thanks for any info. If I'm asking in the wrong place, please advise me
>>> of a better forum!
>>>
>>>     rog.
>>>
>>>
>>>