Posted to dev@pulsar.apache.org by Devin Bost <de...@gmail.com> on 2022/10/05 17:07:21 UTC

Re: CloudEvents binding requires standard for Pulsar

Hi Enrico,

Great questions. The primary objective of the CloudEvents specification was
to improve interoperability across technologies. The specification was
created to enable technologies to create bindings or adapters to support
interchange, and it seems to be gaining momentum. Azure and GCP have
invested heavily in it, and AWS and Red Hat have also made investments. A
number of CNCF technologies are adding support as well. If Pulsar supports
CloudEvents, then as new technologies support CloudEvents, we get
interoperability with them for free. The book *Nail It Then Scale It* mentions
that 3rd party integrations are key to getting adoption, and CloudEvents
could help Pulsar accomplish that. (I should mention that Kafka already has
support for CloudEvents.)

Another notable benefit to adopting the CloudEvents specification is
support for the JSON Schema specification. Pulsar internally standardized
on Avro to simplify the architecture for schemas. Since that decision was
made, the JSON Schema specification has matured considerably.
(Adoption of JSON Schema by OpenAPI 3.1 and AsyncAPI is evidence of that
maturity.) I've come across companies that have data quality needs that
require validating each message. For example, companies using Pulsar for
financial processing often need to verify that each message conforms to the
consumers' data contracts, or they risk financial impact to customers as
producers evolve. The Change Management burden without built-in message
validation can be substantial. I documented a number of business cases in
this comment:
https://github.com/cloudevents/spec/issues/1052#issuecomment-1249260590.
(For context, that thread was about adding validation to the CloudEvents
header, but validation of the message body is already supported in
CloudEvents, and the scenarios I listed are still applicable here.) I also
learned that adoption of JSON Schema hasn't been more widespread because
they haven't created a "release" version of the spec (see this comment
<https://github.com/json-schema-org/json-schema-spec/pull/1277#issuecomment-1223038171>);
but that situation is more of a technicality of their relationship with
IETF, and they're progressing towards a resolution
<https://github.com/json-schema-org/json-schema-spec/pull/1277#issuecomment-1261365741>.
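
To make the data-quality idea concrete, here's a rough sketch of what
per-message validation against a JSON Schema could look like. This is only
an illustration, not an existing Pulsar API; it assumes the networknt
json-schema-validator and Jackson libraries, and the class name and schema
resource path are made up:

    import java.io.InputStream;
    import java.util.Set;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.networknt.schema.JsonSchema;
    import com.networknt.schema.JsonSchemaFactory;
    import com.networknt.schema.SpecVersion;
    import com.networknt.schema.ValidationMessage;

    public class PaymentEventValidator {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Draft-07 is the version the CloudEvents "dataschema" attribute references.
        private static final JsonSchema SCHEMA = JsonSchemaFactory
                .getInstance(SpecVersion.VersionFlag.V7)
                .getSchema(loadSchema());

        private static InputStream loadSchema() {
            // Hypothetical location of the consumers' data contract.
            return PaymentEventValidator.class
                    .getResourceAsStream("/schemas/payment-event.json");
        }

        // Returns the set of violations; an empty set means the payload
        // satisfies the contract.
        public static Set<ValidationMessage> validate(byte[] payload) throws Exception {
            JsonNode node = MAPPER.readTree(payload);
            return SCHEMA.validate(node);
        }
    }

A producer (or, eventually, a broker-side feature) could call something
like this before publishing and reject or divert any message that fails.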

Jerry Peng raised the point that Avro has utilities for schema
compatibility checks whereas JSON Schema does not. I spoke with the
maintainers of JSON Schema, and they said a workaround is to validate a
message against a prior version of the schema. With that said, the current
implementation of Avro in Pulsar doesn't actually validate the messages, so
there's currently no way to guarantee in Pulsar that a message is
compatible with a new schema version anyway. So, I'm not sure how much
benefit we're getting from the compatibility checking in Avro. (I can see
how it could be useful when mapping a Pulsar schema to database tables, but
it doesn't guarantee that the messages themselves are compatible with the
table definition, which seems to be the bigger issue.) Content-sensitive
apps would benefit from a Pulsar feature that allows invalid messages to be
sent to an "invalid message" topic for alerting, inspection, and
re-processing. Jerry also raised the performance impact of validating every
message; however, not every topic needs to validate *every* message. Some
topics might benefit from statistical validation where only a percentage of
messages (say, 1%) is validated. Anyway, these are implementation details
that could be worked out. I think the business cases I linked above will
help explain the need.
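
To illustrate those two ideas together (purely a client-side sketch, not an
existing Pulsar feature), a producer could sample a fraction of messages
for validation and divert failures to an "invalid message" topic. The topic
names, the 1% sampling rate, and the PaymentEventValidator helper from the
sketch above are all made up for illustration:

    import java.util.Set;
    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.Schema;

    import com.networknt.schema.ValidationMessage;

    public class SamplingValidationExample {

        private static final double SAMPLE_RATE = 0.01; // validate ~1% of messages

        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();

            Producer<byte[]> mainProducer = client.newProducer(Schema.BYTES)
                    .topic("persistent://public/default/payments")
                    .create();
            Producer<byte[]> invalidProducer = client.newProducer(Schema.BYTES)
                    .topic("persistent://public/default/payments-invalid")
                    .create();

            byte[] payload = "{\"amount\": \"not-a-number\"}".getBytes();

            if (ThreadLocalRandom.current().nextDouble() < SAMPLE_RATE) {
                Set<ValidationMessage> errors = PaymentEventValidator.validate(payload);
                if (errors.isEmpty()) {
                    mainProducer.send(payload);
                } else {
                    // Divert to the "invalid message" topic for alerting,
                    // inspection, and re-processing.
                    invalidProducer.newMessage()
                            .value(payload)
                            .property("validation-errors", errors.toString())
                            .send();
                }
            } else {
                // Unsampled messages skip validation entirely, so most
                // traffic never pays the per-message validation cost.
                mainProducer.send(payload);
            }

            mainProducer.close();
            invalidProducer.close();
            client.close();
        }
    }

If this were built in, the broker (or a function) could do the routing
instead of every producer, but the flow would be roughly the same.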

I hope this helps.

Devin G. Bost


On Wed, Sep 7, 2022 at 4:58 AM Enrico Olivelli <eo...@gmail.com> wrote:

> Devin,
> thanks for bringing up this discussion.
>
> I have one high level question: what is the goal that we want to achieve?
> something like:
> 1) Use CloudEvents format natively in Pulsar Schema registry, so that
> Pulsar clients can register their schema using that format
> 2) Publish on some HTTP endpoint the Schemas saved in the Pulsar
> Schema Registry in a way that non-Pulsar clients (like WebServices)
> can consume Pulsar messages
> 3) other
>
> I agree that supporting CloudEvents would be great in Pulsar and we
> should do something.
>
> If you have a real world use case to share we can start by that use
> case, that will help a lot
>
> Enrico
>
>
> On Mon, Sep 5, 2022 at 18:07 Devin Bost
> <de...@gmail.com> wrote:
> >
> > Maybe this is something we could discuss as part of Pulsar 3.0?
> > Seems like there's a pretty big difference between SchemaInfo and
> > CloudEvents in terms of the fields.
> >
> > CloudEvents requires:
> >    id: String
> >    source: URI-reference
> >    specversion: String
> >    type: String
> >
> > and optionally:
> >    datacontenttype: String
> >    dataschema: URI (compliant with JSON Schema specification 07)
> >    subject: String
> >    time: Timestamp
> >
> > For JSON, CloudEvents uses the JSON Schema spec for validation.
> >
> > In contrast, Pulsar's SchemaInfo has:
> >    name: String
> >    schema: byte[]
> >    type: SchemaType
> >    properties: Map<String, String>
> >    propertiesSet: bool
> >    timestamp: long
> >
> >
> > --
> > Devin Bost
> > Sent from mobile
> > Cell: 801-400-4602
> >
> > On Fri, Sep 2, 2022, 5:33 PM Devin Bost <de...@gmail.com> wrote:
> >
> > > I recently discovered the discussion around creating a CloudEvents
> > > binding for Pulsar. https://github.com/cloudevents/spec/pull/237
> > >
> > > It appears that Pulsar doesn't meet their minimum requirements due to
> > > lack of a standard protocol. See
> > > https://github.com/cloudevents/spec/pull/254
> > >
> > > Their comment says:
> > >
> > > "For a protocol or encoding to qualify for a core CloudEvents event
> format
> > > or protocol binding, it must belong to either one of the following
> > > categories:
> > > - The protocol has a formal status as a standard with a
> widely-recognized
> > > multi-vendor protocol standardization body (e.g. W3C, IETF, OASIS, ISO)
> > > - The protocol has a "de-facto standard" status for its ecosystem
> category
> > > which means it is used so widely that it is considered a standard for a
> > > given application. Practically, we would like to see at least one open
> > > source implementation and at least a dozen independent vendors using
> it in
> > > their products/services. "
> > >
> > > As CloudEvents is gaining momentum within CNCF, this may become a
> > > problem.
> > >
> > > Has there been any discussion around standardization and how we might
> > > meet this requirement?
> > >
> > > --
> > > Devin Bost
> > > Sent from mobile
> > > Cell: 801-400-4602
> > >
>