Posted to user@avro.apache.org by roger peppe <ro...@gmail.com> on 2019/12/18 16:49:31 UTC

name-agnostic schema resolution (a.k.a. structural subtyping?)

Hi,

Background: I've been contemplating the proposed Avro format in the CloudEvent
specification
<https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
defines standard metadata for events. It defines a very generic format for
an event that allows storage of almost any data. It seems to me that, by
going in that direction, it loses almost all the advantages of using Avro
in the first place. It feels like it's trying to shoehorn a dynamic message
format like JSON into Avro, when using Avro itself could do so much better.

I'm hoping to propose something better. I had what I thought was a nice
idea, but it doesn't *quite* work, and I thought I'd bring up the subject
here and see if anyone had some better ideas.

The schema resolution
<https://avro.apache.org/docs/current/spec.html#Schema+Resolution> part of
the spec allows a reader to decode data that was written with a schema
containing extra fields: fields the reader's schema doesn't mention are
simply ignored. So, theoretically, we could define a CloudEvent something
like this:

{ "name": "CloudEvent", "type": "record", "fields": [{ "name": "Metadata", "
type": { "type": "record", "name": "CloudEvent", "namespace": "
avro.apache.org", "fields": [{ "name": "id", "type": "string" }, { "name":
"source", "type": "string" }, { "name": "time", "type": "long", "logicalType":
"timestamp-micros" }] } }] }

Theoretically, this could enable any event record that has *at least* a
Metadata field with the above fields to be read generically. The
CloudEvent type above could be seen as a structural supertype of all
possible more-specific CloudEvent-compatible records that have such a
compatible field.
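
For instance, a producer's own event schema might look like the following
hypothetical example (the OrderCreated name and its payload fields are made
up; note that it reuses the shared avro.apache.org.CloudEvent record for its
Metadata field, so only the top-level record name differs):

{
    "name": "OrderCreated",
    "type": "record",
    "fields": [{
        "name": "Metadata",
        "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                { "name": "id", "type": "string" },
                { "name": "source", "type": "string" },
                { "name": "time", "type": "long", "logicalType": "timestamp-micros" }
            ]
        }
    }, {
        "name": "orderId",
        "type": "string"
    }, {
        "name": "amount",
        "type": "double"
    }]
}

Reading this with the CloudEvent schema above would simply drop the orderId
and amount fields - if it weren't for the name-matching rule discussed below.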

This has a few nice advantages:
- there's no need for any wrapping of payload data.
- the CloudEvent type can evolve over time like any other Avro type.
- all the data message fields are immediately available alongside the
metadata.
- there's still exactly one schema for a topic, encapsulating both the
metadata and the payload.

However, this idea fails because of one problem - this schema resolution
rule: "both schemas are records with the same (unqualified) name". This
means that unless *everyone* names all their CloudEvent-compatible records
"CloudEvent", they can't be read like this.

I don't think people will be willing to name all their records
"CloudEvent", so we have a problem.

I can see a few possible workarounds:

   1. when reading the record as a CloudEvent, read it with a schema that's
   the same as CloudEvent, but with the top level record name changed to the
   top level name of the schema that was used to write the record.
   2. ignore record names when matching schema record types.
   3. allow aliases to be specified when writing data as well as reading
   it. When defining a CloudEvent-compatible event, you'd add a CloudEvent
   alias to your record.

None of the options are particularly nice. 1 is probably the easiest to do,
although it means you'd still need some custom logic when decoding records,
so you couldn't use stock decoders.
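
As a rough sketch, here's what option 1 might look like against the Java
implementation (the helper and its names are hypothetical, and assume the
writer's schema is available, e.g. from a schema registry):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class CloudEventReader {
    // Copy the CloudEvent reader schema, renaming its top-level record to
    // match the writer's so that stock schema resolution will accept it.
    static Schema renamedReader(Schema cloudEventSchema, Schema writerSchema) {
        List<Schema.Field> fields = new ArrayList<>();
        for (Schema.Field f : cloudEventSchema.getFields()) {
            // Field instances belong to a single schema, so make fresh copies.
            fields.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
        }
        return Schema.createRecord(writerSchema.getName(), cloudEventSchema.getDoc(),
                writerSchema.getNamespace(), false, fields);
    }

    // Decode a message as a CloudEvent, ignoring the writer's record name.
    static GenericRecord readAsCloudEvent(byte[] msg, Schema writerSchema,
            Schema cloudEventSchema) throws IOException {
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(
                writerSchema, renamedReader(cloudEventSchema, writerSchema));
        return reader.read(null, DecoderFactory.get().binaryDecoder(msg, null));
    }
}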

I like the idea of 2, although it gets a bit tricky when dealing with union
types, where the record name is what selects the matching branch. You could
define the matching such that it ignores names only when the match is
unambiguous (i.e. when each of the two matched unions contains only one
record). This could be implemented as an option ("use structural typing")
when decoding.

3 is probably cleanest but interacts significantly with the spec (for
example, the canonical schema transformation strips aliases out, but they'd
need to be retained).
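
For illustration, a writer-side alias under option 3 might look like this
(again using the hypothetical OrderCreated record, and assuming the
avro.apache.org.CloudEvent metadata record is defined elsewhere):

{
    "name": "OrderCreated",
    "type": "record",
    "aliases": ["CloudEvent"],
    "fields": [
        { "name": "Metadata", "type": "avro.apache.org.CloudEvent" },
        { "name": "orderId", "type": "string" }
    ]
}

As the spec stands, though, aliases are only consulted from the reader's
side, and canonicalisation would strip the attribute before a resolver ever
saw it.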

Any thoughts? Is this a silly thing to be contemplating? Is there a better
way?

  cheers,
    rog.

Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

Posted by roger peppe <ro...@gmail.com>.
On Sat, 21 Dec 2019, 17:09 Vance Duncan, <du...@gmail.com> wrote:

> I suggest naming the timestamp field "timestamp" rather than "time". You
> might also want to consider calling it "eventTimestamp", since there will
> possibly be the need to distinguish when the event occurred vs. when it was
> actually published, due to delays in batching, intermittent downtime, etc.
>
> Also, I suggest considering the addition of traceability metadata, which
> for any practical implementation is almost always required. An array of
> correlation IDs is great for that. It gives the publishers/subscribers a
> way of tracing the events to the external causes. Also possibly an array of
> "priorEventIds". This way a full tree of traceability can be established
> post facto.
>

Your suggestions sound good, but I'm unfortunately not in a position to
define those things at this time - the existing CloudEvent specification
defines names and semantics for those fields already (see
https://github.com/cloudevents/spec/blob/v1.0/spec.md)

I am just trying to define a reasonable way of idiomatically encapsulating
those existing CloudEvent semantics within the Avro format.

(You might notice that I omitted some fields which are arguably redundant
when one knows the writer's schema, e.g. data content type and data schema.)

  cheers,
    rog.


Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

Posted by Vance Duncan <du...@gmail.com>.
I suggest naming the timestamp field "timestamp" rather than "time". You
might also want to consider calling it "eventTimestamp", since there will
possibly be the need to distinguish when the event occurred vs. when it was
actually published, due to delays in batching, intermittent downtime, etc.

Also, I suggest considering the addition of traceability metadata, which
for any practical implementation is almost always required. An array of
correlation IDs is great for that. It gives the publishers/subscribers a
way of tracing the events to the external causes. Also possibly an array of
"priorEventIds". This way a full tree of traceability can be established
post facto.


-- 
Regards,

Vance Duncan
mailto:duncanjv@gmail.com
http://www.linkedin.com/in/VanceDuncan
(904) 553-5582

Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

Posted by roger peppe <ro...@gmail.com>.
On Thu, 19 Dec 2019 at 10:11, Ryan Skraba <ry...@skraba.com> wrote:

> Hello!  You might be interested in this short discussion on the dev@
> mailing list:
> https://lists.apache.org/x/thread.html/dd7a23c303ef045c124050d7eac13356b20551a6a663a79cb8807f41@%3Cdev.avro.apache.org%3E
>
> In short, it appears that the record name is already ignored in
> record-to-record matching (at least outside of unions) as an
> implementation detail in Java.  I never *did* get around to verifying
> the behaviour of the other language implementations, but if this is
> what is being done in practice, it's worth clarifying in the
> specification.
>

That's really interesting, thanks! I think this should indeed be clarified
in the specification, as I'm aware of at least two implementations that
follow the letter of the spec in this respect (one is actually too strict
and compares the fully-qualified name).

It's also really useful to know that the most-used Java implementation is
relaxed in this way, because it means that (presumably) most tooling around
Avro will be similarly relaxed. Maybe my cunning plan is a possibility
after all :)

> It does seem like a very pragmatic thing to do, and would help with
> the CloudEvents Avro use case.  It would be a nice recipe to share in
> the docs: the right way to read an envelope from a custom message when
> you don't need the payload.
>
> I'm not sure I understand the third strategy, however!  There aren't
> any names in binary data when writing - what would the alias do?
>

The names that I'm thinking of there are the names in the writer's schema.
When you're reading a message, you have access to both the writer schema
and the reader schema (including all the names). However, for comparison
purposes, writer schemas are often canonicalised, which strips out all
attributes that don't affect decoding, aliases included. If that's been
done, then although you'll still have access to the type names, you won't
have access to the alias information.
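
For example, the Parsing Canonical Form transformation keeps only the
attributes that affect decoding, so a hypothetical writer schema like

{
    "type": "record",
    "name": "OrderCreated",
    "aliases": ["CloudEvent"],
    "doc": "An example event record.",
    "fields": [{ "name": "id", "type": "string" }]
}

canonicalises to

{"name":"OrderCreated","type":"record","fields":[{"name":"id","type":"string"}]}

with the aliases (and doc) stripped away.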

tbh, I'm less keen on the above option anyway. I like the idea of a
structural subtype relationship without the need to explicitly predeclare a
type relationship. To me that's somewhat reminiscent of Go's approach to
interface types, which works well in practice.

> (Also, I largely prefer your Avro version with explicitly typed
> metadata fields and names as well!)
>

Thanks again for your feedback. I'll try making a proposal for a different
CloudEvent format, and try to get some implementations to relax their rules
a bit.

  cheers,
    rog.


Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

Posted by Ryan Skraba <ry...@skraba.com>.
Hello!  You might be interested in this short discussion on the dev@
mailing list: https://lists.apache.org/x/thread.html/dd7a23c303ef045c124050d7eac13356b20551a6a663a79cb8807f41@%3Cdev.avro.apache.org%3E

In short, it appears that the record name is already ignored in
record-to-record matching (at least outside of unions) as an
implementation detail in Java.  I never *did* get around to verifying
the behaviour of the other language implementations, but if this is
what is being done in practice, it's worth clarifying in the
specification.

It does seem like a very pragmatic thing to do, and would help with
the CloudEvents Avro use case.  It would be a nice recipe to share in
the docs: the right way to read an envelope from a custom message when
you don't need the payload.

I'm not sure I understand the third strategy, however!  There aren't
any names in binary data when writing - what would the alias do?

(Also, I largely prefer your Avro version with explicitly typed
metadata fields and names as well!)

All my best, Ryan


Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

Posted by roger peppe <ro...@gmail.com>.
Hi,

Excuse my ignorance, but I'm not at all familiar with IDL. Is there an easy
way to translate it to a JSON Avro schema, please? (preferably online :))

 cheers,
   rog.

On Fri, 20 Dec 2019 at 21:06, Zoltan Farkas <zo...@yahoo.com> wrote:

> Hi Roger,
>
> have you considered leveraging Avro logical types, and keeping the payload
> and event metadata "separate"?
>
> Here is an example (I'll use Avro IDL, since that is more readable to me
> :-) ):
>
> record MetaData {
> @logicalType("instant") string timeStamp;
> ….. all the meta data fields...
> }
>
> record CloudEvent {
>
> MetaData metaData;
>
> Any payload;
>
> }
>
> @logicalType("any")
> record Any {
>
> /** here you have the schema of the data, for efficiency, you can use a
> schema id + schema repo, or something like
> https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences */
> string schema;
>
> bytes data;
>
> }
>
> this way a system that is interested in the metadata does not even have to
> deserialize the payload….
>
> hope it helps.
>
> —Z

Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

Posted by roger peppe <ro...@gmail.com>.
Actually, having looked a bit closer, I think I get the gist of what you're
saying (though the IDL spec
<https://avro.apache.org/docs/current/idl.html> doesn't
seem to mention the @logicalType form, so I'm still guessing somewhat).

I'd certainly considered that approach. Essentially you seem to be
suggesting wrapping an arbitrary data payload inside the message.
Let's call my approach "unified" and yours "wrapper".

I think there are some advantages to the unified approach.

- with the unified approach you have a single schema for all the data in
the topic; with the wrapper approach, each topic essentially has two
associated schemas: the metadata (wrapper) schema and the underlying
schema. This makes everything a bit more complex. Perhaps standard tooling
won't be able to use the topic's registered schema to decode the full
messages from the topic. Deciding backward compatibility with the unified
schema is also more straightforward - it can be done with exactly the usual
single-schema compatibility checks (as implemented by the schema registry
for example).

- If you want to pull out both metadata and payload data, you can do so in
a single operation; it's simpler code and simpler conceptually I think.

> this way a system that is interested in the metadata does not even have to
> deserialize the payload….


I take this point; it could indeed be more efficient to use the wrapper
approach (although there might be extra data copying costs too). As always
with optimisation, it would be worth measuring. There's one interesting
possibility to get the best of both worlds, actually: if the messages are
written with a schema that has the Metadata field first in the record and
the reader is only extracting the Metadata field, a sufficiently clever
decoder could stop after the information for that field has been read -
there's no need to read any further. I think that could be just as
efficient and I don't think it would be *that* hard to do.
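
For the skipping (rather than stopping-early) case, a stock resolving reader
can already do this today: hand it a reader schema containing only the
Metadata field and it will resolve and skip the payload fields, subject to
the name-matching caveats discussed earlier. A minimal Java sketch, with all
names hypothetical:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class EnvelopeReader {
    // A reader schema containing only the Metadata field; the resolving
    // decoder skips over whatever payload fields follow it in the data.
    private final Schema metadataOnlySchema;

    EnvelopeReader(Schema metadataOnlySchema) {
        this.metadataOnlySchema = metadataOnlySchema;
    }

    GenericRecord readEnvelope(byte[] msg, Schema writerSchema) throws IOException {
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(writerSchema, metadataOnlySchema);
        return reader.read(null, DecoderFactory.get().binaryDecoder(msg, null));
    }
}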

Thanks very much for your feedback, BTW.

  cheers,
    rog.


Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

Posted by Zoltan Farkas <zo...@yahoo.com>.
Hi Roger,

have you considered leveraging Avro logical types, and keeping the payload and event metadata "separate"?

Here is an example (I'll use Avro IDL, since that is more readable to me :-) ):

record MetaData {
	@logicalType("instant") string timeStamp;
	….. all the meta data fields...
}

record CloudEvent {

	MetaData metaData;

	Any payload;

}

@logicalType("any")
record Any {

	/** here you have the schema of the data, for efficiency, you can use a schema id + schema repo, or something like https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences */
	string schema;

	bytes data;

}
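
A rough hand-translation of the above IDL into a JSON schema might look like
this (with the elided metadata fields omitted; note that "instant" and "any"
are custom logical types used by this example, not ones defined by the Avro
spec):

{
    "type": "record",
    "name": "CloudEvent",
    "fields": [{
        "name": "metaData",
        "type": {
            "type": "record",
            "name": "MetaData",
            "fields": [
                { "name": "timeStamp", "type": { "type": "string", "logicalType": "instant" } }
            ]
        }
    }, {
        "name": "payload",
        "type": {
            "type": "record",
            "name": "Any",
            "logicalType": "any",
            "fields": [
                { "name": "schema", "type": "string" },
                { "name": "data", "type": "bytes" }
            ]
        }
    }]
}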

this way a system that is interested in the metadata does not even have to deserialize the payload….

hope it helps.

—Z

