You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pulsar.apache.org by Christophe Bornet <bo...@gmail.com> on 2022/09/19 17:09:02 UTC

[DISCUSS] PIP-208: HTTP Sink

Hi all,

I have drafted PIP-208: HTTP Sink

PIP link:
https://github.com/apache/pulsar/issues/17719

Here's a copy of the contents of the GH issue for your references:

### Motivation

Currently, when you want to consume from Pulsar topics in applications
written in languages that don't have a Pulsar driver supported, you need to
run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
production this needs additional effort to deploy, scale, load balance,
monitor, and so on...
Pulsar IO is a framework that deals with all these operational subjects and
can be leveraged to provide a way to push messages to external systems
using HTTP, a protocol supported by every existing language and OS.

### Goal

This proposal defines an HTTP Sink that sends the messages to a configured
URL.
It takes inspiration from [Pulsar Beam](
https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent HTTP Sink
connector](
https://docs.confluent.io/kafka-connectors/http/current/overview.html).


### Implementation

A `pulsar-io-http` module will be added to `pulsar-io`.
On building the project `pulsar-io-http-{version}.nar` will be built and
added to the `pulsar-all` distribution.
The name of the Sink will be `http`.

The HTTP Sink pushes records to any HTTP server with the record value in
the body of a POST method.
The body of the HTTP request is the JSON representation of the record value.

Some headers are added to the HTTP request:
* `PulsarTopic`: the topic of the record
* `PulsarKey`: the key of the record
* `PulsarEventTime`: the event time of the record
* `PulsarPublishTime`: the publish time of the record
* `PulsarMessageId`: the ID of the message contained in the record
* `PulsarProperties-*`: each record property is passed with the property
name prefixed by `PulsarProperties-`

### Alternatives

Creating a separated project for this Sink is rejected since:
* this Sink is very useful for developers to test the Pulsar IO framework,
transform functions, and to make demos.
* the code has a very small footprint with no external dependencies.
* it should be visible at the same level as other sinks

I'm looking forward the discussion.

Best regards,

Christophe Bornet

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Christophe Bornet <bo...@gmail.com>.
>
> I think we could shrink the connectors a lot by removing from the NAR
> archives dependencies that are already present in the
> pulsar-functions-instance.
>
I mean java-instance.jar

Le mer. 19 oct. 2022 à 12:29, Christophe Bornet <bo...@gmail.com> a
écrit :

> The pulsar-all docker image is pretty big. I assume we will continue
>> to build and package additional connectors. It would be great to
>> figure out how to make it smaller at some point.
>>
> I think we could shrink the connectors a lot by removing from the NAR
> archives dependencies that are already present in the
> pulsar-functions-instance.
>
> Le mar. 18 oct. 2022 à 16:51, Michael Marshall <mm...@apache.org> a
> écrit :
>
>> Great discussion. I have one minor comment that is tangentially related.
>>
>> > On building the project `pulsar-io-http-{version}.nar` will be built and
>> > added to the `pulsar-all` distribution.
>>
>> The pulsar-all docker image is pretty big. I assume we will continue
>> to build and package additional connectors. It would be great to
>> figure out how to make it smaller at some point.
>>
>> Thanks,
>> Michael
>>
>>
>> On Tue, Sep 27, 2022 at 9:27 AM Christophe Bornet
>> <bo...@gmail.com> wrote:
>> >
>> > Sure you can test with the Sink of my PR branch.
>> > Otherwise I'll do the test after ApacheCon.
>> >
>> > Le mar. 27 sept. 2022 à 12:57, tison <wa...@gmail.com> a écrit :
>> >
>> > > Yes. It's a potential use case for validating the implementation. If
>> you
>> > > don't have time to try it out, I can schedule some time to demo it
>> with a
>> > > prototype HTTP sink or after the patch gets merged :)
>> > >
>> > > Best,
>> > > tison.
>> > >
>> > >
>> > > Christophe Bornet <bo...@gmail.com> 于2022年9月27日周二 18:51写道:
>> > >
>> > > > Hi Tison,
>> > > >
>> > > > Very interesting and shows the value of such a HTTP Sink.
>> > > > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have
>> time
>> > > to
>> > > > do the test right now, so would someone want to do it ?
>> > > >
>> > > > Best regards.
>> > > >
>> > > > Christophe Bornet
>> > > >
>> > > > Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a
>> écrit :
>> > > >
>> > > > > Hi Christophe,
>> > > > >
>> > > > > Thanks for starting this proposal. It looks cool.
>> > > > >
>> > > > > I'd suggest one real-world integration test you can make use of:
>> > > > >
>> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
>> > > > > (replace source kafka with pulsar).
>> > > > >
>> > > > > Best,
>> > > > > tison.
>> > > > >
>> > > > >
>> > > > > Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
>> > > > >
>> > > > > > Thanks for your answers.
>> > > > > > I am fine with the current proposal.
>> > > > > > We can enhance it as follow up work
>> > > > > >
>> > > > > > Enrico
>> > > > > >
>> > > > > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
>> > > > > > <bo...@gmail.com> ha scritto:
>> > > > > > >
>> > > > > > > Thanks for your feedback Enrico.
>> > > > > > > My answers to your comments below
>> > > > > > >
>> > > > > > > BR
>> > > > > > >
>> > > > > > > Christophe
>> > > > > > >
>> > > > > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <
>> > > eolivelli@gmail.com>
>> > > > a
>> > > > > > > écrit :
>> > > > > > >
>> > > > > > > > Christophe,
>> > > > > > > > very good initiative!
>> > > > > > > >
>> > > > > > > > I support it
>> > > > > > > > Some comments inline below
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Enrico
>> > > > > > > >
>> > > > > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
>> > > > > > > > <bo...@gmail.com> ha scritto:
>> > > > > > > > >
>> > > > > > > > > Hi all,
>> > > > > > > > >
>> > > > > > > > > I have drafted PIP-208: HTTP Sink
>> > > > > > > > >
>> > > > > > > > > PIP link:
>> > > > > > > > > https://github.com/apache/pulsar/issues/17719
>> > > > > > > > >
>> > > > > > > > > Here's a copy of the contents of the GH issue for your
>> > > > references:
>> > > > > > > > >
>> > > > > > > > > ### Motivation
>> > > > > > > > >
>> > > > > > > > > Currently, when you want to consume from Pulsar topics in
>> > > > > > applications
>> > > > > > > > > written in languages that don't have a Pulsar driver
>> supported,
>> > > > you
>> > > > > > need
>> > > > > > > > to
>> > > > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar
>> Beam.
>> > > > In
>> > > > > > > > > production this needs additional effort to deploy, scale,
>> load
>> > > > > > balance,
>> > > > > > > > > monitor, and so on...
>> > > > > > > > > Pulsar IO is a framework that deals with all these
>> operational
>> > > > > > subjects
>> > > > > > > > and
>> > > > > > > > > can be leveraged to provide a way to push messages to
>> external
>> > > > > > systems
>> > > > > > > > > using HTTP, a protocol supported by every existing
>> language and
>> > > > OS.
>> > > > > > > > >
>> > > > > > > > > ### Goal
>> > > > > > > > >
>> > > > > > > > > This proposal defines an HTTP Sink that sends the
>> messages to a
>> > > > > > > > configured
>> > > > > > > > > URL.
>> > > > > > > > > It takes inspiration from [Pulsar Beam](
>> > > > > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the
>> > > [Confluent
>> > > > > > HTTP
>> > > > > > > > Sink
>> > > > > > > > > connector](
>> > > > > > > > >
>> > > > > >
>> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html
>> > > > ).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > ### Implementation
>> > > > > > > > >
>> > > > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
>> > > > > > > > > On building the project `pulsar-io-http-{version}.nar`
>> will be
>> > > > > built
>> > > > > > and
>> > > > > > > > > added to the `pulsar-all` distribution.
>> > > > > > > > > The name of the Sink will be `http`.
>> > > > > > > > >
>> > > > > > > > > The HTTP Sink pushes records to any HTTP server with the
>> record
>> > > > > > value in
>> > > > > > > > > the body of a POST method.
>> > > > > > > > > The body of the HTTP request is the JSON representation
>> of the
>> > > > > record
>> > > > > > > > value.
>> > > > > > > >
>> > > > > > > > What do you mean ?
>> > > > > > > > I think that this should depend on the Schema.
>> > > > > > > >
>> > > > > > > > BYTES SCHEMA -> I would push the raw message payload
>> > > > > > > > PRIMITIVE VALUES (long, integer, string) - > I would push
>> the
>> > > JSON
>> > > > > > > > represantation
>> > > > > > > > JSON SCHEMA ->  push the raw message payload
>> > > > > > > > AVRO -> ?  convert to JSON ?
>> > > > > > > > PROTOBUF -> ? convert to JSON ?
>> > > > > > > > KEY-VALUE ?
>> > > > > > > >
>> > > > > > > > Probably we need some flag to define the behaviour for the
>> non
>> > > > > trivial
>> > > > > > > > cases.
>> > > > > > > >
>> > > > > > > > The current impl chooses to serialize as JSON because it's
>> a well
>> > > > > > > supported content-type on the server frameworks.
>> > > > > > > It's also to be consistent with existing HTTP Sinks such as
>> Pulsar
>> > > > Bean
>> > > > > > and
>> > > > > > > Confluent HTTP Sink Connector.
>> > > > > > > The possibility to adapt the content-type to the schema is
>> elegant
>> > > > and
>> > > > > > will
>> > > > > > > probably result in shorter payloads (but less readable) and I
>> think
>> > > > it
>> > > > > > > could be done as a follow-up option.
>> > > > > > > It has indeed the problem of being difficult to do for KV
>> schema.
>> > > > > > > For the content-type mappings I would do:
>> > > > > > > BYTES SCHEMA -> application/octet-stream (raw bytes)
>> > > > > > > PRIMITIVE VALUES (long, integer, string) - > text/plain
>> > > > > > > JSON ->  application/json
>> > > > > > > AVRO -> avro/binary
>> > > > > > > PROTOBUF -> probably application/octet-stream ?
>> > > > > > > KEY-VALUE ?
>> > > > > > >
>> > > > > > > Would also need to indicate the Schema-Type in the HTTP
>> headers.
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Some headers are added to the HTTP request:
>> > > > > > > > > * `PulsarTopic`: the topic of the record
>> > > > > > > > > * `PulsarKey`: the key of the record
>> > > > > > > > > * `PulsarEventTime`: the event time of the record
>> > > > > > > > > * `PulsarPublishTime`: the publish time of the record
>> > > > > > > > > * `PulsarMessageId`: the ID of the message contained in
>> the
>> > > > record
>> > > > > > > > > * `PulsarProperties-*`: each record property is passed
>> with the
>> > > > > > property
>> > > > > > > > > name prefixed by `PulsarProperties-`
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Can we make the "Content-Type" configurable ?
>> > > > > > > >
>> > > > > > > Yes we can. But do we do it for the first iteration ?
>> > > > > > > If we do it, I would have an option to add some fix headers
>> and the
>> > > > > user
>> > > > > > > can override the content-type.
>> > > > > > > If we go for a variable content-type depending on the schema,
>> then
>> > > we
>> > > > > > could
>> > > > > > > have a map<SchemaType, content-type>
>> > > > > > >
>> > > > > > > > Can we make the HTTP METHOD configurable ?
>> > > > > > > >
>> > > > > > > Yes we can. But do we do it for the first iteration ?
>> > > > > > >
>> > > > > > > >
>> > > > > > > > > ### Alternatives
>> > > > > > > > >
>> > > > > > > > > Creating a separated project for this Sink is rejected
>> since:
>> > > > > > > > > * this Sink is very useful for developers to test the
>> Pulsar IO
>> > > > > > > > framework,
>> > > > > > > > > transform functions, and to make demos.
>> > > > > > > > > * the code has a very small footprint with no external
>> > > > > dependencies.
>> > > > > > > > > * it should be visible at the same level as other sinks
>> > > > > > > >
>> > > > > > > > 100% agreed !
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I'm looking forward the discussion.
>> > > > > > > > >
>> > > > > > > > > Best regards,
>> > > > > > > > >
>> > > > > > > > > Christophe Bornet
>> > > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>>
>
> Le mar. 18 oct. 2022 à 16:51, Michael Marshall <mm...@apache.org> a
> écrit :
>
>> Great discussion. I have one minor comment that is tangentially related.
>>
>> > On building the project `pulsar-io-http-{version}.nar` will be built and
>> > added to the `pulsar-all` distribution.
>>
>> The pulsar-all docker image is pretty big. I assume we will continue
>> to build and package additional connectors. It would be great to
>> figure out how to make it smaller at some point.
>>
>> Thanks,
>> Michael
>>
>>
>> On Tue, Sep 27, 2022 at 9:27 AM Christophe Bornet
>> <bo...@gmail.com> wrote:
>> >
>> > Sure you can test with the Sink of my PR branch.
>> > Otherwise I'll do the test after ApacheCon.
>> >
>> > Le mar. 27 sept. 2022 à 12:57, tison <wa...@gmail.com> a écrit :
>> >
>> > > Yes. It's a potential use case for validating the implementation. If
>> you
>> > > don't have time to try it out, I can schedule some time to demo it
>> with a
>> > > prototype HTTP sink or after the patch gets merged :)
>> > >
>> > > Best,
>> > > tison.
>> > >
>> > >
>> > > Christophe Bornet <bo...@gmail.com> 于2022年9月27日周二 18:51写道:
>> > >
>> > > > Hi Tison,
>> > > >
>> > > > Very interesting and shows the value of such a HTTP Sink.
>> > > > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have
>> time
>> > > to
>> > > > do the test right now, so would someone want to do it ?
>> > > >
>> > > > Best regards.
>> > > >
>> > > > Christophe Bornet
>> > > >
>> > > > Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a
>> écrit :
>> > > >
>> > > > > Hi Christophe,
>> > > > >
>> > > > > Thanks for starting this proposal. It looks cool.
>> > > > >
>> > > > > I'd suggest one real-world integration test you can make use of:
>> > > > >
>> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
>> > > > > (replace source kafka with pulsar).
>> > > > >
>> > > > > Best,
>> > > > > tison.
>> > > > >
>> > > > >
>> > > > > Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
>> > > > >
>> > > > > > Thanks for your answers.
>> > > > > > I am fine with the current proposal.
>> > > > > > We can enhance it as follow up work
>> > > > > >
>> > > > > > Enrico
>> > > > > >
>> > > > > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
>> > > > > > <bo...@gmail.com> ha scritto:
>> > > > > > >
>> > > > > > > Thanks for your feedback Enrico.
>> > > > > > > My answers to your comments below
>> > > > > > >
>> > > > > > > BR
>> > > > > > >
>> > > > > > > Christophe
>> > > > > > >
>> > > > > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <
>> > > eolivelli@gmail.com>
>> > > > a
>> > > > > > > écrit :
>> > > > > > >
>> > > > > > > > Christophe,
>> > > > > > > > very good initiative!
>> > > > > > > >
>> > > > > > > > I support it
>> > > > > > > > Some comments inline below
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Enrico
>> > > > > > > >
>> > > > > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
>> > > > > > > > <bo...@gmail.com> ha scritto:
>> > > > > > > > >
>> > > > > > > > > Hi all,
>> > > > > > > > >
>> > > > > > > > > I have drafted PIP-208: HTTP Sink
>> > > > > > > > >
>> > > > > > > > > PIP link:
>> > > > > > > > > https://github.com/apache/pulsar/issues/17719
>> > > > > > > > >
>> > > > > > > > > Here's a copy of the contents of the GH issue for your
>> > > > references:
>> > > > > > > > >
>> > > > > > > > > ### Motivation
>> > > > > > > > >
>> > > > > > > > > Currently, when you want to consume from Pulsar topics in
>> > > > > > applications
>> > > > > > > > > written in languages that don't have a Pulsar driver
>> supported,
>> > > > you
>> > > > > > need
>> > > > > > > > to
>> > > > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar
>> Beam.
>> > > > In
>> > > > > > > > > production this needs additional effort to deploy, scale,
>> load
>> > > > > > balance,
>> > > > > > > > > monitor, and so on...
>> > > > > > > > > Pulsar IO is a framework that deals with all these
>> operational
>> > > > > > subjects
>> > > > > > > > and
>> > > > > > > > > can be leveraged to provide a way to push messages to
>> external
>> > > > > > systems
>> > > > > > > > > using HTTP, a protocol supported by every existing
>> language and
>> > > > OS.
>> > > > > > > > >
>> > > > > > > > > ### Goal
>> > > > > > > > >
>> > > > > > > > > This proposal defines an HTTP Sink that sends the
>> messages to a
>> > > > > > > > configured
>> > > > > > > > > URL.
>> > > > > > > > > It takes inspiration from [Pulsar Beam](
>> > > > > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the
>> > > [Confluent
>> > > > > > HTTP
>> > > > > > > > Sink
>> > > > > > > > > connector](
>> > > > > > > > >
>> > > > > >
>> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html
>> > > > ).
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > ### Implementation
>> > > > > > > > >
>> > > > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
>> > > > > > > > > On building the project `pulsar-io-http-{version}.nar`
>> will be
>> > > > > built
>> > > > > > and
>> > > > > > > > > added to the `pulsar-all` distribution.
>> > > > > > > > > The name of the Sink will be `http`.
>> > > > > > > > >
>> > > > > > > > > The HTTP Sink pushes records to any HTTP server with the
>> record
>> > > > > > value in
>> > > > > > > > > the body of a POST method.
>> > > > > > > > > The body of the HTTP request is the JSON representation
>> of the
>> > > > > record
>> > > > > > > > value.
>> > > > > > > >
>> > > > > > > > What do you mean ?
>> > > > > > > > I think that this should depend on the Schema.
>> > > > > > > >
>> > > > > > > > BYTES SCHEMA -> I would push the raw message payload
>> > > > > > > > PRIMITIVE VALUES (long, integer, string) - > I would push
>> the
>> > > JSON
>> > > > > > > > represantation
>> > > > > > > > JSON SCHEMA ->  push the raw message payload
>> > > > > > > > AVRO -> ?  convert to JSON ?
>> > > > > > > > PROTOBUF -> ? convert to JSON ?
>> > > > > > > > KEY-VALUE ?
>> > > > > > > >
>> > > > > > > > Probably we need some flag to define the behaviour for the
>> non
>> > > > > trivial
>> > > > > > > > cases.
>> > > > > > > >
>> > > > > > > > The current impl chooses to serialize as JSON because it's
>> a well
>> > > > > > > supported content-type on the server frameworks.
>> > > > > > > It's also to be consistent with existing HTTP Sinks such as
>> Pulsar
>> > > > Bean
>> > > > > > and
>> > > > > > > Confluent HTTP Sink Connector.
>> > > > > > > The possibility to adapt the content-type to the schema is
>> elegant
>> > > > and
>> > > > > > will
>> > > > > > > probably result in shorter payloads (but less readable) and I
>> think
>> > > > it
>> > > > > > > could be done as a follow-up option.
>> > > > > > > It has indeed the problem of being difficult to do for KV
>> schema.
>> > > > > > > For the content-type mappings I would do:
>> > > > > > > BYTES SCHEMA -> application/octet-stream (raw bytes)
>> > > > > > > PRIMITIVE VALUES (long, integer, string) - > text/plain
>> > > > > > > JSON ->  application/json
>> > > > > > > AVRO -> avro/binary
>> > > > > > > PROTOBUF -> probably application/octet-stream ?
>> > > > > > > KEY-VALUE ?
>> > > > > > >
>> > > > > > > Would also need to indicate the Schema-Type in the HTTP
>> headers.
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Some headers are added to the HTTP request:
>> > > > > > > > > * `PulsarTopic`: the topic of the record
>> > > > > > > > > * `PulsarKey`: the key of the record
>> > > > > > > > > * `PulsarEventTime`: the event time of the record
>> > > > > > > > > * `PulsarPublishTime`: the publish time of the record
>> > > > > > > > > * `PulsarMessageId`: the ID of the message contained in
>> the
>> > > > record
>> > > > > > > > > * `PulsarProperties-*`: each record property is passed
>> with the
>> > > > > > property
>> > > > > > > > > name prefixed by `PulsarProperties-`
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Can we make the "Content-Type" configurable ?
>> > > > > > > >
>> > > > > > > Yes we can. But do we do it for the first iteration ?
>> > > > > > > If we do it, I would have an option to add some fix headers
>> and the
>> > > > > user
>> > > > > > > can override the content-type.
>> > > > > > > If we go for a variable content-type depending on the schema,
>> then
>> > > we
>> > > > > > could
>> > > > > > > have a map<SchemaType, content-type>
>> > > > > > >
>> > > > > > > > Can we make the HTTP METHOD configurable ?
>> > > > > > > >
>> > > > > > > Yes we can. But do we do it for the first iteration ?
>> > > > > > >
>> > > > > > > >
>> > > > > > > > > ### Alternatives
>> > > > > > > > >
>> > > > > > > > > Creating a separated project for this Sink is rejected
>> since:
>> > > > > > > > > * this Sink is very useful for developers to test the
>> Pulsar IO
>> > > > > > > > framework,
>> > > > > > > > > transform functions, and to make demos.
>> > > > > > > > > * the code has a very small footprint with no external
>> > > > > dependencies.
>> > > > > > > > > * it should be visible at the same level as other sinks
>> > > > > > > >
>> > > > > > > > 100% agreed !
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I'm looking forward the discussion.
>> > > > > > > > >
>> > > > > > > > > Best regards,
>> > > > > > > > >
>> > > > > > > > > Christophe Bornet
>> > > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>>
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Christophe Bornet <bo...@gmail.com>.
>
> The pulsar-all docker image is pretty big. I assume we will continue
> to build and package additional connectors. It would be great to
> figure out how to make it smaller at some point.
>
I think we could shrink the connectors a lot by removing from the NAR
archives dependencies that are already present in the
pulsar-functions-instance.

Le mar. 18 oct. 2022 à 16:51, Michael Marshall <mm...@apache.org> a
écrit :

> Great discussion. I have one minor comment that is tangentially related.
>
> > On building the project `pulsar-io-http-{version}.nar` will be built and
> > added to the `pulsar-all` distribution.
>
> The pulsar-all docker image is pretty big. I assume we will continue
> to build and package additional connectors. It would be great to
> figure out how to make it smaller at some point.
>
> Thanks,
> Michael
>
>
> On Tue, Sep 27, 2022 at 9:27 AM Christophe Bornet
> <bo...@gmail.com> wrote:
> >
> > Sure you can test with the Sink of my PR branch.
> > Otherwise I'll do the test after ApacheCon.
> >
> > Le mar. 27 sept. 2022 à 12:57, tison <wa...@gmail.com> a écrit :
> >
> > > Yes. It's a potential use case for validating the implementation. If
> you
> > > don't have time to try it out, I can schedule some time to demo it
> with a
> > > prototype HTTP sink or after the patch gets merged :)
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Christophe Bornet <bo...@gmail.com> 于2022年9月27日周二 18:51写道:
> > >
> > > > Hi Tison,
> > > >
> > > > Very interesting and shows the value of such a HTTP Sink.
> > > > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have
> time
> > > to
> > > > do the test right now, so would someone want to do it ?
> > > >
> > > > Best regards.
> > > >
> > > > Christophe Bornet
> > > >
> > > > Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a écrit
> :
> > > >
> > > > > Hi Christophe,
> > > > >
> > > > > Thanks for starting this proposal. It looks cool.
> > > > >
> > > > > I'd suggest one real-world integration test you can make use of:
> > > > >
> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
> > > > > (replace source kafka with pulsar).
> > > > >
> > > > > Best,
> > > > > tison.
> > > > >
> > > > >
> > > > > Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
> > > > >
> > > > > > Thanks for your answers.
> > > > > > I am fine with the current proposal.
> > > > > > We can enhance it as follow up work
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> > > > > > <bo...@gmail.com> ha scritto:
> > > > > > >
> > > > > > > Thanks for your feedback Enrico.
> > > > > > > My answers to your comments below
> > > > > > >
> > > > > > > BR
> > > > > > >
> > > > > > > Christophe
> > > > > > >
> > > > > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <
> > > eolivelli@gmail.com>
> > > > a
> > > > > > > écrit :
> > > > > > >
> > > > > > > > Christophe,
> > > > > > > > very good initiative!
> > > > > > > >
> > > > > > > > I support it
> > > > > > > > Some comments inline below
> > > > > > > >
> > > > > > > >
> > > > > > > > Enrico
> > > > > > > >
> > > > > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > > > > > > <bo...@gmail.com> ha scritto:
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I have drafted PIP-208: HTTP Sink
> > > > > > > > >
> > > > > > > > > PIP link:
> > > > > > > > > https://github.com/apache/pulsar/issues/17719
> > > > > > > > >
> > > > > > > > > Here's a copy of the contents of the GH issue for your
> > > > references:
> > > > > > > > >
> > > > > > > > > ### Motivation
> > > > > > > > >
> > > > > > > > > Currently, when you want to consume from Pulsar topics in
> > > > > > applications
> > > > > > > > > written in languages that don't have a Pulsar driver
> supported,
> > > > you
> > > > > > need
> > > > > > > > to
> > > > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar
> Beam.
> > > > In
> > > > > > > > > production this needs additional effort to deploy, scale,
> load
> > > > > > balance,
> > > > > > > > > monitor, and so on...
> > > > > > > > > Pulsar IO is a framework that deals with all these
> operational
> > > > > > subjects
> > > > > > > > and
> > > > > > > > > can be leveraged to provide a way to push messages to
> external
> > > > > > systems
> > > > > > > > > using HTTP, a protocol supported by every existing
> language and
> > > > OS.
> > > > > > > > >
> > > > > > > > > ### Goal
> > > > > > > > >
> > > > > > > > > This proposal defines an HTTP Sink that sends the messages
> to a
> > > > > > > > configured
> > > > > > > > > URL.
> > > > > > > > > It takes inspiration from [Pulsar Beam](
> > > > > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the
> > > [Confluent
> > > > > > HTTP
> > > > > > > > Sink
> > > > > > > > > connector](
> > > > > > > > >
> > > > > >
> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html
> > > > ).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ### Implementation
> > > > > > > > >
> > > > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > > > > > > On building the project `pulsar-io-http-{version}.nar`
> will be
> > > > > built
> > > > > > and
> > > > > > > > > added to the `pulsar-all` distribution.
> > > > > > > > > The name of the Sink will be `http`.
> > > > > > > > >
> > > > > > > > > The HTTP Sink pushes records to any HTTP server with the
> record
> > > > > > value in
> > > > > > > > > the body of a POST method.
> > > > > > > > > The body of the HTTP request is the JSON representation of
> the
> > > > > record
> > > > > > > > value.
> > > > > > > >
> > > > > > > > What do you mean ?
> > > > > > > > I think that this should depend on the Schema.
> > > > > > > >
> > > > > > > > BYTES SCHEMA -> I would push the raw message payload
> > > > > > > > PRIMITIVE VALUES (long, integer, string) - > I would push the
> > > JSON
> > > > > > > > represantation
> > > > > > > > JSON SCHEMA ->  push the raw message payload
> > > > > > > > AVRO -> ?  convert to JSON ?
> > > > > > > > PROTOBUF -> ? convert to JSON ?
> > > > > > > > KEY-VALUE ?
> > > > > > > >
> > > > > > > > Probably we need some flag to define the behaviour for the
> non
> > > > > trivial
> > > > > > > > cases.
> > > > > > > >
> > > > > > > > The current impl chooses to serialize as JSON because it's a
> well
> > > > > > > supported content-type on the server frameworks.
> > > > > > > It's also to be consistent with existing HTTP Sinks such as
> Pulsar
> > > > Bean
> > > > > > and
> > > > > > > Confluent HTTP Sink Connector.
> > > > > > > The possibility to adapt the content-type to the schema is
> elegant
> > > > and
> > > > > > will
> > > > > > > probably result in shorter payloads (but less readable) and I
> think
> > > > it
> > > > > > > could be done as a follow-up option.
> > > > > > > It has indeed the problem of being difficult to do for KV
> schema.
> > > > > > > For the content-type mappings I would do:
> > > > > > > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > > > > > > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > > > > > > JSON ->  application/json
> > > > > > > AVRO -> avro/binary
> > > > > > > PROTOBUF -> probably application/octet-stream ?
> > > > > > > KEY-VALUE ?
> > > > > > >
> > > > > > > Would also need to indicate the Schema-Type in the HTTP
> headers.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Some headers are added to the HTTP request:
> > > > > > > > > * `PulsarTopic`: the topic of the record
> > > > > > > > > * `PulsarKey`: the key of the record
> > > > > > > > > * `PulsarEventTime`: the event time of the record
> > > > > > > > > * `PulsarPublishTime`: the publish time of the record
> > > > > > > > > * `PulsarMessageId`: the ID of the message contained in the
> > > > record
> > > > > > > > > * `PulsarProperties-*`: each record property is passed
> with the
> > > > > > property
> > > > > > > > > name prefixed by `PulsarProperties-`
> > > > > > > > >
> > > > > > > >
> > > > > > > > Can we make the "Content-Type" configurable ?
> > > > > > > >
> > > > > > > Yes we can. But do we do it for the first iteration ?
> > > > > > > If we do it, I would have an option to add some fix headers
> and the
> > > > > user
> > > > > > > can override the content-type.
> > > > > > > If we go for a variable content-type depending on the schema,
> then
> > > we
> > > > > > could
> > > > > > > have a map<SchemaType, content-type>
> > > > > > >
> > > > > > > > Can we make the HTTP METHOD configurable ?
> > > > > > > >
> > > > > > > Yes we can. But do we do it for the first iteration ?
> > > > > > >
> > > > > > > >
> > > > > > > > > ### Alternatives
> > > > > > > > >
> > > > > > > > > Creating a separated project for this Sink is rejected
> since:
> > > > > > > > > * this Sink is very useful for developers to test the
> Pulsar IO
> > > > > > > > framework,
> > > > > > > > > transform functions, and to make demos.
> > > > > > > > > * the code has a very small footprint with no external
> > > > > dependencies.
> > > > > > > > > * it should be visible at the same level as other sinks
> > > > > > > >
> > > > > > > > 100% agreed !
> > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm looking forward the discussion.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Christophe Bornet
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Le mar. 18 oct. 2022 à 16:51, Michael Marshall <mm...@apache.org> a
écrit :

> Great discussion. I have one minor comment that is tangentially related.
>
> > On building the project `pulsar-io-http-{version}.nar` will be built and
> > added to the `pulsar-all` distribution.
>
> The pulsar-all docker image is pretty big. I assume we will continue
> to build and package additional connectors. It would be great to
> figure out how to make it smaller at some point.
>
> Thanks,
> Michael
>
>
> On Tue, Sep 27, 2022 at 9:27 AM Christophe Bornet
> <bo...@gmail.com> wrote:
> >
> > Sure you can test with the Sink of my PR branch.
> > Otherwise I'll do the test after ApacheCon.
> >
> > Le mar. 27 sept. 2022 à 12:57, tison <wa...@gmail.com> a écrit :
> >
> > > Yes. It's a potential use case for validating the implementation. If
> you
> > > don't have time to try it out, I can schedule some time to demo it
> with a
> > > prototype HTTP sink or after the patch gets merged :)
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Christophe Bornet <bo...@gmail.com> 于2022年9月27日周二 18:51写道:
> > >
> > > > Hi Tison,
> > > >
> > > > Very interesting and shows the value of such a HTTP Sink.
> > > > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have
> time
> > > to
> > > > do the test right now, so would someone want to do it ?
> > > >
> > > > Best regards.
> > > >
> > > > Christophe Bornet
> > > >
> > > > Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a écrit
> :
> > > >
> > > > > Hi Christophe,
> > > > >
> > > > > Thanks for starting this proposal. It looks cool.
> > > > >
> > > > > I'd suggest one real-world integration test you can make use of:
> > > > >
> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
> > > > > (replace source kafka with pulsar).
> > > > >
> > > > > Best,
> > > > > tison.
> > > > >
> > > > >
> > > > > Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
> > > > >
> > > > > > Thanks for your answers.
> > > > > > I am fine with the current proposal.
> > > > > > We can enhance it as follow up work
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> > > > > > <bo...@gmail.com> ha scritto:
> > > > > > >
> > > > > > > Thanks for your feedback Enrico.
> > > > > > > My answers to your comments below
> > > > > > >
> > > > > > > BR
> > > > > > >
> > > > > > > Christophe
> > > > > > >
> > > > > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <
> > > eolivelli@gmail.com>
> > > > a
> > > > > > > écrit :
> > > > > > >
> > > > > > > > Christophe,
> > > > > > > > very good initiative!
> > > > > > > >
> > > > > > > > I support it
> > > > > > > > Some comments inline below
> > > > > > > >
> > > > > > > >
> > > > > > > > Enrico
> > > > > > > >
> > > > > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > > > > > > <bo...@gmail.com> ha scritto:
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I have drafted PIP-208: HTTP Sink
> > > > > > > > >
> > > > > > > > > PIP link:
> > > > > > > > > https://github.com/apache/pulsar/issues/17719
> > > > > > > > >
> > > > > > > > > Here's a copy of the contents of the GH issue for your
> > > > references:
> > > > > > > > >
> > > > > > > > > ### Motivation
> > > > > > > > >
> > > > > > > > > Currently, when you want to consume from Pulsar topics in
> > > > > > applications
> > > > > > > > > written in languages that don't have a Pulsar driver
> supported,
> > > > you
> > > > > > need
> > > > > > > > to
> > > > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar
> Beam.
> > > > In
> > > > > > > > > production this needs additional effort to deploy, scale,
> load
> > > > > > balance,
> > > > > > > > > monitor, and so on...
> > > > > > > > > Pulsar IO is a framework that deals with all these
> operational
> > > > > > subjects
> > > > > > > > and
> > > > > > > > > can be leveraged to provide a way to push messages to
> external
> > > > > > systems
> > > > > > > > > using HTTP, a protocol supported by every existing
> language and
> > > > OS.
> > > > > > > > >
> > > > > > > > > ### Goal
> > > > > > > > >
> > > > > > > > > This proposal defines an HTTP Sink that sends the messages
> to a
> > > > > > > > configured
> > > > > > > > > URL.
> > > > > > > > > It takes inspiration from [Pulsar Beam](
> > > > > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the
> > > [Confluent
> > > > > > HTTP
> > > > > > > > Sink
> > > > > > > > > connector](
> > > > > > > > >
> > > > > >
> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html
> > > > ).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ### Implementation
> > > > > > > > >
> > > > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > > > > > > On building the project `pulsar-io-http-{version}.nar`
> will be
> > > > > built
> > > > > > and
> > > > > > > > > added to the `pulsar-all` distribution.
> > > > > > > > > The name of the Sink will be `http`.
> > > > > > > > >
> > > > > > > > > The HTTP Sink pushes records to any HTTP server with the
> record
> > > > > > value in
> > > > > > > > > the body of a POST method.
> > > > > > > > > The body of the HTTP request is the JSON representation of
> the
> > > > > record
> > > > > > > > value.
> > > > > > > >
> > > > > > > > What do you mean ?
> > > > > > > > I think that this should depend on the Schema.
> > > > > > > >
> > > > > > > > BYTES SCHEMA -> I would push the raw message payload
> > > > > > > > PRIMITIVE VALUES (long, integer, string) - > I would push the
> > > JSON
> > > > > > > > represantation
> > > > > > > > JSON SCHEMA ->  push the raw message payload
> > > > > > > > AVRO -> ?  convert to JSON ?
> > > > > > > > PROTOBUF -> ? convert to JSON ?
> > > > > > > > KEY-VALUE ?
> > > > > > > >
> > > > > > > > Probably we need some flag to define the behaviour for the
> non
> > > > > trivial
> > > > > > > > cases.
> > > > > > > >
> > > > > > > > The current impl chooses to serialize as JSON because it's a
> well
> > > > > > > supported content-type on the server frameworks.
> > > > > > > It's also to be consistent with existing HTTP Sinks such as
> Pulsar
> > > > Bean
> > > > > > and
> > > > > > > Confluent HTTP Sink Connector.
> > > > > > > The possibility to adapt the content-type to the schema is
> elegant
> > > > and
> > > > > > will
> > > > > > > probably result in shorter payloads (but less readable) and I
> think
> > > > it
> > > > > > > could be done as a follow-up option.
> > > > > > > It has indeed the problem of being difficult to do for KV
> schema.
> > > > > > > For the content-type mappings I would do:
> > > > > > > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > > > > > > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > > > > > > JSON ->  application/json
> > > > > > > AVRO -> avro/binary
> > > > > > > PROTOBUF -> probably application/octet-stream ?
> > > > > > > KEY-VALUE ?
> > > > > > >
> > > > > > > Would also need to indicate the Schema-Type in the HTTP
> headers.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Some headers are added to the HTTP request:
> > > > > > > > > * `PulsarTopic`: the topic of the record
> > > > > > > > > * `PulsarKey`: the key of the record
> > > > > > > > > * `PulsarEventTime`: the event time of the record
> > > > > > > > > * `PulsarPublishTime`: the publish time of the record
> > > > > > > > > * `PulsarMessageId`: the ID of the message contained in the
> > > > record
> > > > > > > > > * `PulsarProperties-*`: each record property is passed
> with the
> > > > > > property
> > > > > > > > > name prefixed by `PulsarProperties-`
> > > > > > > > >
> > > > > > > >
> > > > > > > > Can we make the "Content-Type" configurable ?
> > > > > > > >
> > > > > > > Yes we can. But do we do it for the first iteration ?
> > > > > > > If we do it, I would have an option to add some fix headers
> and the
> > > > > user
> > > > > > > can override the content-type.
> > > > > > > If we go for a variable content-type depending on the schema,
> then
> > > we
> > > > > > could
> > > > > > > have a map<SchemaType, content-type>
> > > > > > >
> > > > > > > > Can we make the HTTP METHOD configurable ?
> > > > > > > >
> > > > > > > Yes we can. But do we do it for the first iteration ?
> > > > > > >
> > > > > > > >
> > > > > > > > > ### Alternatives
> > > > > > > > >
> > > > > > > > > Creating a separated project for this Sink is rejected
> since:
> > > > > > > > > * this Sink is very useful for developers to test the
> Pulsar IO
> > > > > > > > framework,
> > > > > > > > > transform functions, and to make demos.
> > > > > > > > > * the code has a very small footprint with no external
> > > > > dependencies.
> > > > > > > > > * it should be visible at the same level as other sinks
> > > > > > > >
> > > > > > > > 100% agreed !
> > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm looking forward the discussion.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Christophe Bornet
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Michael Marshall <mm...@apache.org>.
Great discussion. I have one minor comment that is tangentially related.

> On building the project `pulsar-io-http-{version}.nar` will be built and
> added to the `pulsar-all` distribution.

The pulsar-all docker image is pretty big. I assume we will continue
to build and package additional connectors. It would be great to
figure out how to make it smaller at some point.

Thanks,
Michael


On Tue, Sep 27, 2022 at 9:27 AM Christophe Bornet
<bo...@gmail.com> wrote:
>
> Sure you can test with the Sink of my PR branch.
> Otherwise I'll do the test after ApacheCon.
>
> Le mar. 27 sept. 2022 à 12:57, tison <wa...@gmail.com> a écrit :
>
> > Yes. It's a potential use case for validating the implementation. If you
> > don't have time to try it out, I can schedule some time to demo it with a
> > prototype HTTP sink or after the patch gets merged :)
> >
> > Best,
> > tison.
> >
> >
> > Christophe Bornet <bo...@gmail.com> 于2022年9月27日周二 18:51写道:
> >
> > > Hi Tison,
> > >
> > > Very interesting and shows the value of such a HTTP Sink.
> > > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have time
> > to
> > > do the test right now, so would someone want to do it ?
> > >
> > > Best regards.
> > >
> > > Christophe Bornet
> > >
> > > Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a écrit :
> > >
> > > > Hi Christophe,
> > > >
> > > > Thanks for starting this proposal. It looks cool.
> > > >
> > > > I'd suggest one real-world integration test you can make use of:
> > > > https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
> > > > (replace source kafka with pulsar).
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > >
> > > > Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
> > > >
> > > > > Thanks for your answers.
> > > > > I am fine with the current proposal.
> > > > > We can enhance it as follow up work
> > > > >
> > > > > Enrico
> > > > >
> > > > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> > > > > <bo...@gmail.com> ha scritto:
> > > > > >
> > > > > > Thanks for your feedback Enrico.
> > > > > > My answers to your comments below
> > > > > >
> > > > > > BR
> > > > > >
> > > > > > Christophe
> > > > > >
> > > > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <
> > eolivelli@gmail.com>
> > > a
> > > > > > écrit :
> > > > > >
> > > > > > > Christophe,
> > > > > > > very good initiative!
> > > > > > >
> > > > > > > I support it
> > > > > > > Some comments inline below
> > > > > > >
> > > > > > >
> > > > > > > Enrico
> > > > > > >
> > > > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > > > > > <bo...@gmail.com> ha scritto:
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I have drafted PIP-208: HTTP Sink
> > > > > > > >
> > > > > > > > PIP link:
> > > > > > > > https://github.com/apache/pulsar/issues/17719
> > > > > > > >
> > > > > > > > Here's a copy of the contents of the GH issue for your
> > > references:
> > > > > > > >
> > > > > > > > ### Motivation
> > > > > > > >
> > > > > > > > Currently, when you want to consume from Pulsar topics in
> > > > > applications
> > > > > > > > written in languages that don't have a Pulsar driver supported,
> > > you
> > > > > need
> > > > > > > to
> > > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar Beam.
> > > In
> > > > > > > > production this needs additional effort to deploy, scale, load
> > > > > balance,
> > > > > > > > monitor, and so on...
> > > > > > > > Pulsar IO is a framework that deals with all these operational
> > > > > subjects
> > > > > > > and
> > > > > > > > can be leveraged to provide a way to push messages to external
> > > > > systems
> > > > > > > > using HTTP, a protocol supported by every existing language and
> > > OS.
> > > > > > > >
> > > > > > > > ### Goal
> > > > > > > >
> > > > > > > > This proposal defines an HTTP Sink that sends the messages to a
> > > > > > > configured
> > > > > > > > URL.
> > > > > > > > It takes inspiration from [Pulsar Beam](
> > > > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the
> > [Confluent
> > > > > HTTP
> > > > > > > Sink
> > > > > > > > connector](
> > > > > > > >
> > > > >
> > https://docs.confluent.io/kafka-connectors/http/current/overview.html
> > > ).
> > > > > > > >
> > > > > > > >
> > > > > > > > ### Implementation
> > > > > > > >
> > > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > > > > > On building the project `pulsar-io-http-{version}.nar` will be
> > > > built
> > > > > and
> > > > > > > > added to the `pulsar-all` distribution.
> > > > > > > > The name of the Sink will be `http`.
> > > > > > > >
> > > > > > > > The HTTP Sink pushes records to any HTTP server with the record
> > > > > value in
> > > > > > > > the body of a POST method.
> > > > > > > > The body of the HTTP request is the JSON representation of the
> > > > record
> > > > > > > value.
> > > > > > >
> > > > > > > What do you mean ?
> > > > > > > I think that this should depend on the Schema.
> > > > > > >
> > > > > > > BYTES SCHEMA -> I would push the raw message payload
> > > > > > > PRIMITIVE VALUES (long, integer, string) - > I would push the
> > JSON
> > > > > > > represantation
> > > > > > > JSON SCHEMA ->  push the raw message payload
> > > > > > > AVRO -> ?  convert to JSON ?
> > > > > > > PROTOBUF -> ? convert to JSON ?
> > > > > > > KEY-VALUE ?
> > > > > > >
> > > > > > > Probably we need some flag to define the behaviour for the non
> > > > trivial
> > > > > > > cases.
> > > > > > >
> > > > > > > The current impl chooses to serialize as JSON because it's a well
> > > > > > supported content-type on the server frameworks.
> > > > > > It's also to be consistent with existing HTTP Sinks such as Pulsar
> > > Bean
> > > > > and
> > > > > > Confluent HTTP Sink Connector.
> > > > > > The possibility to adapt the content-type to the schema is elegant
> > > and
> > > > > will
> > > > > > probably result in shorter payloads (but less readable) and I think
> > > it
> > > > > > could be done as a follow-up option.
> > > > > > It has indeed the problem of being difficult to do for KV schema.
> > > > > > For the content-type mappings I would do:
> > > > > > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > > > > > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > > > > > JSON ->  application/json
> > > > > > AVRO -> avro/binary
> > > > > > PROTOBUF -> probably application/octet-stream ?
> > > > > > KEY-VALUE ?
> > > > > >
> > > > > > Would also need to indicate the Schema-Type in the HTTP headers.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Some headers are added to the HTTP request:
> > > > > > > > * `PulsarTopic`: the topic of the record
> > > > > > > > * `PulsarKey`: the key of the record
> > > > > > > > * `PulsarEventTime`: the event time of the record
> > > > > > > > * `PulsarPublishTime`: the publish time of the record
> > > > > > > > * `PulsarMessageId`: the ID of the message contained in the
> > > record
> > > > > > > > * `PulsarProperties-*`: each record property is passed with the
> > > > > property
> > > > > > > > name prefixed by `PulsarProperties-`
> > > > > > > >
> > > > > > >
> > > > > > > Can we make the "Content-Type" configurable ?
> > > > > > >
> > > > > > Yes we can. But do we do it for the first iteration ?
> > > > > > If we do it, I would have an option to add some fix headers and the
> > > > user
> > > > > > can override the content-type.
> > > > > > If we go for a variable content-type depending on the schema, then
> > we
> > > > > could
> > > > > > have a map<SchemaType, content-type>
> > > > > >
> > > > > > > Can we make the HTTP METHOD configurable ?
> > > > > > >
> > > > > > Yes we can. But do we do it for the first iteration ?
> > > > > >
> > > > > > >
> > > > > > > > ### Alternatives
> > > > > > > >
> > > > > > > > Creating a separated project for this Sink is rejected since:
> > > > > > > > * this Sink is very useful for developers to test the Pulsar IO
> > > > > > > framework,
> > > > > > > > transform functions, and to make demos.
> > > > > > > > * the code has a very small footprint with no external
> > > > dependencies.
> > > > > > > > * it should be visible at the same level as other sinks
> > > > > > >
> > > > > > > 100% agreed !
> > > > > > >
> > > > > > > >
> > > > > > > > I'm looking forward the discussion.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Christophe Bornet
> > > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Christophe Bornet <bo...@gmail.com>.
Sure you can test with the Sink of my PR branch.
Otherwise I'll do the test after ApacheCon.

Le mar. 27 sept. 2022 à 12:57, tison <wa...@gmail.com> a écrit :

> Yes. It's a potential use case for validating the implementation. If you
> don't have time to try it out, I can schedule some time to demo it with a
> prototype HTTP sink or after the patch gets merged :)
>
> Best,
> tison.
>
>
> Christophe Bornet <bo...@gmail.com> 于2022年9月27日周二 18:51写道:
>
> > Hi Tison,
> >
> > Very interesting and shows the value of such a HTTP Sink.
> > The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have time
> to
> > do the test right now, so would someone want to do it ?
> >
> > Best regards.
> >
> > Christophe Bornet
> >
> > Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a écrit :
> >
> > > Hi Christophe,
> > >
> > > Thanks for starting this proposal. It looks cool.
> > >
> > > I'd suggest one real-world integration test you can make use of:
> > > https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
> > > (replace source kafka with pulsar).
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
> > >
> > > > Thanks for your answers.
> > > > I am fine with the current proposal.
> > > > We can enhance it as follow up work
> > > >
> > > > Enrico
> > > >
> > > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> > > > <bo...@gmail.com> ha scritto:
> > > > >
> > > > > Thanks for your feedback Enrico.
> > > > > My answers to your comments below
> > > > >
> > > > > BR
> > > > >
> > > > > Christophe
> > > > >
> > > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <
> eolivelli@gmail.com>
> > a
> > > > > écrit :
> > > > >
> > > > > > Christophe,
> > > > > > very good initiative!
> > > > > >
> > > > > > I support it
> > > > > > Some comments inline below
> > > > > >
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > > > > <bo...@gmail.com> ha scritto:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I have drafted PIP-208: HTTP Sink
> > > > > > >
> > > > > > > PIP link:
> > > > > > > https://github.com/apache/pulsar/issues/17719
> > > > > > >
> > > > > > > Here's a copy of the contents of the GH issue for your
> > references:
> > > > > > >
> > > > > > > ### Motivation
> > > > > > >
> > > > > > > Currently, when you want to consume from Pulsar topics in
> > > > applications
> > > > > > > written in languages that don't have a Pulsar driver supported,
> > you
> > > > need
> > > > > > to
> > > > > > > run some type of proxy like the WebSocket Proxy or Pulsar Beam.
> > In
> > > > > > > production this needs additional effort to deploy, scale, load
> > > > balance,
> > > > > > > monitor, and so on...
> > > > > > > Pulsar IO is a framework that deals with all these operational
> > > > subjects
> > > > > > and
> > > > > > > can be leveraged to provide a way to push messages to external
> > > > systems
> > > > > > > using HTTP, a protocol supported by every existing language and
> > OS.
> > > > > > >
> > > > > > > ### Goal
> > > > > > >
> > > > > > > This proposal defines an HTTP Sink that sends the messages to a
> > > > > > configured
> > > > > > > URL.
> > > > > > > It takes inspiration from [Pulsar Beam](
> > > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the
> [Confluent
> > > > HTTP
> > > > > > Sink
> > > > > > > connector](
> > > > > > >
> > > >
> https://docs.confluent.io/kafka-connectors/http/current/overview.html
> > ).
> > > > > > >
> > > > > > >
> > > > > > > ### Implementation
> > > > > > >
> > > > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > > > > On building the project `pulsar-io-http-{version}.nar` will be
> > > built
> > > > and
> > > > > > > added to the `pulsar-all` distribution.
> > > > > > > The name of the Sink will be `http`.
> > > > > > >
> > > > > > > The HTTP Sink pushes records to any HTTP server with the record
> > > > value in
> > > > > > > the body of a POST method.
> > > > > > > The body of the HTTP request is the JSON representation of the
> > > record
> > > > > > value.
> > > > > >
> > > > > > What do you mean ?
> > > > > > I think that this should depend on the Schema.
> > > > > >
> > > > > > BYTES SCHEMA -> I would push the raw message payload
> > > > > > PRIMITIVE VALUES (long, integer, string) - > I would push the
> JSON
> > > > > > represantation
> > > > > > JSON SCHEMA ->  push the raw message payload
> > > > > > AVRO -> ?  convert to JSON ?
> > > > > > PROTOBUF -> ? convert to JSON ?
> > > > > > KEY-VALUE ?
> > > > > >
> > > > > > Probably we need some flag to define the behaviour for the non
> > > trivial
> > > > > > cases.
> > > > > >
> > > > > > The current impl chooses to serialize as JSON because it's a well
> > > > > supported content-type on the server frameworks.
> > > > > It's also to be consistent with existing HTTP Sinks such as Pulsar
> > Bean
> > > > and
> > > > > Confluent HTTP Sink Connector.
> > > > > The possibility to adapt the content-type to the schema is elegant
> > and
> > > > will
> > > > > probably result in shorter payloads (but less readable) and I think
> > it
> > > > > could be done as a follow-up option.
> > > > > It has indeed the problem of being difficult to do for KV schema.
> > > > > For the content-type mappings I would do:
> > > > > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > > > > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > > > > JSON ->  application/json
> > > > > AVRO -> avro/binary
> > > > > PROTOBUF -> probably application/octet-stream ?
> > > > > KEY-VALUE ?
> > > > >
> > > > > Would also need to indicate the Schema-Type in the HTTP headers.
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Some headers are added to the HTTP request:
> > > > > > > * `PulsarTopic`: the topic of the record
> > > > > > > * `PulsarKey`: the key of the record
> > > > > > > * `PulsarEventTime`: the event time of the record
> > > > > > > * `PulsarPublishTime`: the publish time of the record
> > > > > > > * `PulsarMessageId`: the ID of the message contained in the
> > record
> > > > > > > * `PulsarProperties-*`: each record property is passed with the
> > > > property
> > > > > > > name prefixed by `PulsarProperties-`
> > > > > > >
> > > > > >
> > > > > > Can we make the "Content-Type" configurable ?
> > > > > >
> > > > > Yes we can. But do we do it for the first iteration ?
> > > > > If we do it, I would have an option to add some fix headers and the
> > > user
> > > > > can override the content-type.
> > > > > If we go for a variable content-type depending on the schema, then
> we
> > > > could
> > > > > have a map<SchemaType, content-type>
> > > > >
> > > > > > Can we make the HTTP METHOD configurable ?
> > > > > >
> > > > > Yes we can. But do we do it for the first iteration ?
> > > > >
> > > > > >
> > > > > > > ### Alternatives
> > > > > > >
> > > > > > > Creating a separated project for this Sink is rejected since:
> > > > > > > * this Sink is very useful for developers to test the Pulsar IO
> > > > > > framework,
> > > > > > > transform functions, and to make demos.
> > > > > > > * the code has a very small footprint with no external
> > > dependencies.
> > > > > > > * it should be visible at the same level as other sinks
> > > > > >
> > > > > > 100% agreed !
> > > > > >
> > > > > > >
> > > > > > > I'm looking forward the discussion.
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Christophe Bornet
> > > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by tison <wa...@gmail.com>.
Yes. It's a potential use case for validating the implementation. If you
don't have time to try it out, I can schedule some time to demo it with a
prototype HTTP sink or after the patch gets merged :)

Best,
tison.


Christophe Bornet <bo...@gmail.com> 于2022年9月27日周二 18:51写道:

> Hi Tison,
>
> Very interesting and shows the value of such a HTTP Sink.
> The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have time to
> do the test right now, so would someone want to do it ?
>
> Best regards.
>
> Christophe Bornet
>
> Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a écrit :
>
> > Hi Christophe,
> >
> > Thanks for starting this proposal. It looks cool.
> >
> > I'd suggest one real-world integration test you can make use of:
> > https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
> > (replace source kafka with pulsar).
> >
> > Best,
> > tison.
> >
> >
> > Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
> >
> > > Thanks for your answers.
> > > I am fine with the current proposal.
> > > We can enhance it as follow up work
> > >
> > > Enrico
> > >
> > > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> > > <bo...@gmail.com> ha scritto:
> > > >
> > > > Thanks for your feedback Enrico.
> > > > My answers to your comments below
> > > >
> > > > BR
> > > >
> > > > Christophe
> > > >
> > > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <eo...@gmail.com>
> a
> > > > écrit :
> > > >
> > > > > Christophe,
> > > > > very good initiative!
> > > > >
> > > > > I support it
> > > > > Some comments inline below
> > > > >
> > > > >
> > > > > Enrico
> > > > >
> > > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > > > <bo...@gmail.com> ha scritto:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I have drafted PIP-208: HTTP Sink
> > > > > >
> > > > > > PIP link:
> > > > > > https://github.com/apache/pulsar/issues/17719
> > > > > >
> > > > > > Here's a copy of the contents of the GH issue for your
> references:
> > > > > >
> > > > > > ### Motivation
> > > > > >
> > > > > > Currently, when you want to consume from Pulsar topics in
> > > applications
> > > > > > written in languages that don't have a Pulsar driver supported,
> you
> > > need
> > > > > to
> > > > > > run some type of proxy like the WebSocket Proxy or Pulsar Beam.
> In
> > > > > > production this needs additional effort to deploy, scale, load
> > > balance,
> > > > > > monitor, and so on...
> > > > > > Pulsar IO is a framework that deals with all these operational
> > > subjects
> > > > > and
> > > > > > can be leveraged to provide a way to push messages to external
> > > systems
> > > > > > using HTTP, a protocol supported by every existing language and
> OS.
> > > > > >
> > > > > > ### Goal
> > > > > >
> > > > > > This proposal defines an HTTP Sink that sends the messages to a
> > > > > configured
> > > > > > URL.
> > > > > > It takes inspiration from [Pulsar Beam](
> > > > > > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent
> > > HTTP
> > > > > Sink
> > > > > > connector](
> > > > > >
> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html
> ).
> > > > > >
> > > > > >
> > > > > > ### Implementation
> > > > > >
> > > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > > > On building the project `pulsar-io-http-{version}.nar` will be
> > built
> > > and
> > > > > > added to the `pulsar-all` distribution.
> > > > > > The name of the Sink will be `http`.
> > > > > >
> > > > > > The HTTP Sink pushes records to any HTTP server with the record
> > > value in
> > > > > > the body of a POST method.
> > > > > > The body of the HTTP request is the JSON representation of the
> > record
> > > > > value.
> > > > >
> > > > > What do you mean ?
> > > > > I think that this should depend on the Schema.
> > > > >
> > > > > BYTES SCHEMA -> I would push the raw message payload
> > > > > PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> > > > > represantation
> > > > > JSON SCHEMA ->  push the raw message payload
> > > > > AVRO -> ?  convert to JSON ?
> > > > > PROTOBUF -> ? convert to JSON ?
> > > > > KEY-VALUE ?
> > > > >
> > > > > Probably we need some flag to define the behaviour for the non
> > trivial
> > > > > cases.
> > > > >
> > > > > The current impl chooses to serialize as JSON because it's a well
> > > > supported content-type on the server frameworks.
> > > > It's also to be consistent with existing HTTP Sinks such as Pulsar
> Bean
> > > and
> > > > Confluent HTTP Sink Connector.
> > > > The possibility to adapt the content-type to the schema is elegant
> and
> > > will
> > > > probably result in shorter payloads (but less readable) and I think
> it
> > > > could be done as a follow-up option.
> > > > It has indeed the problem of being difficult to do for KV schema.
> > > > For the content-type mappings I would do:
> > > > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > > > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > > > JSON ->  application/json
> > > > AVRO -> avro/binary
> > > > PROTOBUF -> probably application/octet-stream ?
> > > > KEY-VALUE ?
> > > >
> > > > Would also need to indicate the Schema-Type in the HTTP headers.
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > Some headers are added to the HTTP request:
> > > > > > * `PulsarTopic`: the topic of the record
> > > > > > * `PulsarKey`: the key of the record
> > > > > > * `PulsarEventTime`: the event time of the record
> > > > > > * `PulsarPublishTime`: the publish time of the record
> > > > > > * `PulsarMessageId`: the ID of the message contained in the
> record
> > > > > > * `PulsarProperties-*`: each record property is passed with the
> > > property
> > > > > > name prefixed by `PulsarProperties-`
> > > > > >
> > > > >
> > > > > Can we make the "Content-Type" configurable ?
> > > > >
> > > > Yes we can. But do we do it for the first iteration ?
> > > > If we do it, I would have an option to add some fix headers and the
> > user
> > > > can override the content-type.
> > > > If we go for a variable content-type depending on the schema, then we
> > > could
> > > > have a map<SchemaType, content-type>
> > > >
> > > > > Can we make the HTTP METHOD configurable ?
> > > > >
> > > > Yes we can. But do we do it for the first iteration ?
> > > >
> > > > >
> > > > > > ### Alternatives
> > > > > >
> > > > > > Creating a separated project for this Sink is rejected since:
> > > > > > * this Sink is very useful for developers to test the Pulsar IO
> > > > > framework,
> > > > > > transform functions, and to make demos.
> > > > > > * the code has a very small footprint with no external
> > dependencies.
> > > > > > * it should be visible at the same level as other sinks
> > > > >
> > > > > 100% agreed !
> > > > >
> > > > > >
> > > > > > I'm looking forward the discussion.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Christophe Bornet
> > > > >
> > >
> >
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Christophe Bornet <bo...@gmail.com>.
Hi Tison,

Very interesting and shows the value of such a HTTP Sink.
The Pulsar HTTP Sink should work OOTB with ClickHouse. I don't have time to
do the test right now, so would someone want to do it ?

Best regards.

Christophe Bornet

Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a écrit :

> Hi Christophe,
>
> Thanks for starting this proposal. It looks cool.
>
> I'd suggest one real-world integration test you can make use of:
> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
> (replace source kafka with pulsar).
>
> Best,
> tison.
>
>
> Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
>
> > Thanks for your answers.
> > I am fine with the current proposal.
> > We can enhance it as follow up work
> >
> > Enrico
> >
> > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> > <bo...@gmail.com> ha scritto:
> > >
> > > Thanks for your feedback Enrico.
> > > My answers to your comments below
> > >
> > > BR
> > >
> > > Christophe
> > >
> > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <eo...@gmail.com> a
> > > écrit :
> > >
> > > > Christophe,
> > > > very good initiative!
> > > >
> > > > I support it
> > > > Some comments inline below
> > > >
> > > >
> > > > Enrico
> > > >
> > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > > <bo...@gmail.com> ha scritto:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I have drafted PIP-208: HTTP Sink
> > > > >
> > > > > PIP link:
> > > > > https://github.com/apache/pulsar/issues/17719
> > > > >
> > > > > Here's a copy of the contents of the GH issue for your references:
> > > > >
> > > > > ### Motivation
> > > > >
> > > > > Currently, when you want to consume from Pulsar topics in
> > applications
> > > > > written in languages that don't have a Pulsar driver supported, you
> > need
> > > > to
> > > > > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> > > > > production this needs additional effort to deploy, scale, load
> > balance,
> > > > > monitor, and so on...
> > > > > Pulsar IO is a framework that deals with all these operational
> > subjects
> > > > and
> > > > > can be leveraged to provide a way to push messages to external
> > systems
> > > > > using HTTP, a protocol supported by every existing language and OS.
> > > > >
> > > > > ### Goal
> > > > >
> > > > > This proposal defines an HTTP Sink that sends the messages to a
> > > > configured
> > > > > URL.
> > > > > It takes inspiration from [Pulsar Beam](
> > > > > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent
> > HTTP
> > > > Sink
> > > > > connector](
> > > > >
> > https://docs.confluent.io/kafka-connectors/http/current/overview.html).
> > > > >
> > > > >
> > > > > ### Implementation
> > > > >
> > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > > On building the project `pulsar-io-http-{version}.nar` will be
> built
> > and
> > > > > added to the `pulsar-all` distribution.
> > > > > The name of the Sink will be `http`.
> > > > >
> > > > > The HTTP Sink pushes records to any HTTP server with the record
> > value in
> > > > > the body of a POST method.
> > > > > The body of the HTTP request is the JSON representation of the
> record
> > > > value.
> > > >
> > > > What do you mean ?
> > > > I think that this should depend on the Schema.
> > > >
> > > > BYTES SCHEMA -> I would push the raw message payload
> > > > PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> > > > represantation
> > > > JSON SCHEMA ->  push the raw message payload
> > > > AVRO -> ?  convert to JSON ?
> > > > PROTOBUF -> ? convert to JSON ?
> > > > KEY-VALUE ?
> > > >
> > > > Probably we need some flag to define the behaviour for the non
> trivial
> > > > cases.
> > > >
> > > > The current impl chooses to serialize as JSON because it's a well
> > > supported content-type on the server frameworks.
> > > It's also to be consistent with existing HTTP Sinks such as Pulsar Bean
> > and
> > > Confluent HTTP Sink Connector.
> > > The possibility to adapt the content-type to the schema is elegant and
> > will
> > > probably result in shorter payloads (but less readable) and I think it
> > > could be done as a follow-up option.
> > > It has indeed the problem of being difficult to do for KV schema.
> > > For the content-type mappings I would do:
> > > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > > JSON ->  application/json
> > > AVRO -> avro/binary
> > > PROTOBUF -> probably application/octet-stream ?
> > > KEY-VALUE ?
> > >
> > > Would also need to indicate the Schema-Type in the HTTP headers.
> > >
> > >
> > > >
> > > > >
> > > > > Some headers are added to the HTTP request:
> > > > > * `PulsarTopic`: the topic of the record
> > > > > * `PulsarKey`: the key of the record
> > > > > * `PulsarEventTime`: the event time of the record
> > > > > * `PulsarPublishTime`: the publish time of the record
> > > > > * `PulsarMessageId`: the ID of the message contained in the record
> > > > > * `PulsarProperties-*`: each record property is passed with the
> > property
> > > > > name prefixed by `PulsarProperties-`
> > > > >
> > > >
> > > > Can we make the "Content-Type" configurable ?
> > > >
> > > Yes we can. But do we do it for the first iteration ?
> > > If we do it, I would have an option to add some fix headers and the
> user
> > > can override the content-type.
> > > If we go for a variable content-type depending on the schema, then we
> > could
> > > have a map<SchemaType, content-type>
> > >
> > > > Can we make the HTTP METHOD configurable ?
> > > >
> > > Yes we can. But do we do it for the first iteration ?
> > >
> > > >
> > > > > ### Alternatives
> > > > >
> > > > > Creating a separated project for this Sink is rejected since:
> > > > > * this Sink is very useful for developers to test the Pulsar IO
> > > > framework,
> > > > > transform functions, and to make demos.
> > > > > * the code has a very small footprint with no external
> dependencies.
> > > > > * it should be visible at the same level as other sinks
> > > >
> > > > 100% agreed !
> > > >
> > > > >
> > > > > I'm looking forward the discussion.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Christophe Bornet
> > > >
> >
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Christophe Bornet <bo...@gmail.com>.
I just tested and the HTTP Sink works fine with the ClickHouse HTTP ingest
API.

Create table in ClickHouse:
CREATE TABLE test
(
    col1 String
) ENGINE = MergeTree()
PRIMARY KEY (col1)

Create Sink:
bin/pulsar-admin sinks create -a
pulsar-io/http/target/pulsar-io-http-2.11.0-SNAPSHOT.nar -i my-topic --name
my-sink --sink-config '{"url": "
http://localhost:8123?query=INSERT%20INTO%20default.test%20FORMAT%20JSONEachRow
"}'

Produce message:
bin/pulsar-client produce -vs
'json:{"name":"test","type":"record","fields":[{"name":"col1","type":"string"}]}'
my-topic -m '{"col1":"some-value"}'

Best regards.

Christophe

Le mar. 27 sept. 2022 à 12:31, tison <wa...@gmail.com> a écrit :

> Hi Christophe,
>
> Thanks for starting this proposal. It looks cool.
>
> I'd suggest one real-world integration test you can make use of:
> https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
> (replace source kafka with pulsar).
>
> Best,
> tison.
>
>
> Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:
>
> > Thanks for your answers.
> > I am fine with the current proposal.
> > We can enhance it as follow up work
> >
> > Enrico
> >
> > Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> > <bo...@gmail.com> ha scritto:
> > >
> > > Thanks for your feedback Enrico.
> > > My answers to your comments below
> > >
> > > BR
> > >
> > > Christophe
> > >
> > > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <eo...@gmail.com> a
> > > écrit :
> > >
> > > > Christophe,
> > > > very good initiative!
> > > >
> > > > I support it
> > > > Some comments inline below
> > > >
> > > >
> > > > Enrico
> > > >
> > > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > > <bo...@gmail.com> ha scritto:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I have drafted PIP-208: HTTP Sink
> > > > >
> > > > > PIP link:
> > > > > https://github.com/apache/pulsar/issues/17719
> > > > >
> > > > > Here's a copy of the contents of the GH issue for your references:
> > > > >
> > > > > ### Motivation
> > > > >
> > > > > Currently, when you want to consume from Pulsar topics in
> > applications
> > > > > written in languages that don't have a Pulsar driver supported, you
> > need
> > > > to
> > > > > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> > > > > production this needs additional effort to deploy, scale, load
> > balance,
> > > > > monitor, and so on...
> > > > > Pulsar IO is a framework that deals with all these operational
> > subjects
> > > > and
> > > > > can be leveraged to provide a way to push messages to external
> > systems
> > > > > using HTTP, a protocol supported by every existing language and OS.
> > > > >
> > > > > ### Goal
> > > > >
> > > > > This proposal defines an HTTP Sink that sends the messages to a
> > > > configured
> > > > > URL.
> > > > > It takes inspiration from [Pulsar Beam](
> > > > > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent
> > HTTP
> > > > Sink
> > > > > connector](
> > > > >
> > https://docs.confluent.io/kafka-connectors/http/current/overview.html).
> > > > >
> > > > >
> > > > > ### Implementation
> > > > >
> > > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > > On building the project `pulsar-io-http-{version}.nar` will be
> built
> > and
> > > > > added to the `pulsar-all` distribution.
> > > > > The name of the Sink will be `http`.
> > > > >
> > > > > The HTTP Sink pushes records to any HTTP server with the record
> > value in
> > > > > the body of a POST method.
> > > > > The body of the HTTP request is the JSON representation of the
> record
> > > > value.
> > > >
> > > > What do you mean ?
> > > > I think that this should depend on the Schema.
> > > >
> > > > BYTES SCHEMA -> I would push the raw message payload
> > > > PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> > > > represantation
> > > > JSON SCHEMA ->  push the raw message payload
> > > > AVRO -> ?  convert to JSON ?
> > > > PROTOBUF -> ? convert to JSON ?
> > > > KEY-VALUE ?
> > > >
> > > > Probably we need some flag to define the behaviour for the non
> trivial
> > > > cases.
> > > >
> > > > The current impl chooses to serialize as JSON because it's a well
> > > supported content-type on the server frameworks.
> > > It's also to be consistent with existing HTTP Sinks such as Pulsar Bean
> > and
> > > Confluent HTTP Sink Connector.
> > > The possibility to adapt the content-type to the schema is elegant and
> > will
> > > probably result in shorter payloads (but less readable) and I think it
> > > could be done as a follow-up option.
> > > It has indeed the problem of being difficult to do for KV schema.
> > > For the content-type mappings I would do:
> > > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > > JSON ->  application/json
> > > AVRO -> avro/binary
> > > PROTOBUF -> probably application/octet-stream ?
> > > KEY-VALUE ?
> > >
> > > Would also need to indicate the Schema-Type in the HTTP headers.
> > >
> > >
> > > >
> > > > >
> > > > > Some headers are added to the HTTP request:
> > > > > * `PulsarTopic`: the topic of the record
> > > > > * `PulsarKey`: the key of the record
> > > > > * `PulsarEventTime`: the event time of the record
> > > > > * `PulsarPublishTime`: the publish time of the record
> > > > > * `PulsarMessageId`: the ID of the message contained in the record
> > > > > * `PulsarProperties-*`: each record property is passed with the
> > property
> > > > > name prefixed by `PulsarProperties-`
> > > > >
> > > >
> > > > Can we make the "Content-Type" configurable ?
> > > >
> > > Yes we can. But do we do it for the first iteration ?
> > > If we do it, I would have an option to add some fix headers and the
> user
> > > can override the content-type.
> > > If we go for a variable content-type depending on the schema, then we
> > could
> > > have a map<SchemaType, content-type>
> > >
> > > > Can we make the HTTP METHOD configurable ?
> > > >
> > > Yes we can. But do we do it for the first iteration ?
> > >
> > > >
> > > > > ### Alternatives
> > > > >
> > > > > Creating a separated project for this Sink is rejected since:
> > > > > * this Sink is very useful for developers to test the Pulsar IO
> > > > framework,
> > > > > transform functions, and to make demos.
> > > > > * the code has a very small footprint with no external
> dependencies.
> > > > > * it should be visible at the same level as other sinks
> > > >
> > > > 100% agreed !
> > > >
> > > > >
> > > > > I'm looking forward the discussion.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Christophe Bornet
> > > >
> >
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by tison <wa...@gmail.com>.
Hi Christophe,

Thanks for starting this proposal. It looks cool.

I'd suggest one real-world integration test you can make use of:
https://clickhouse.com/docs/en/integrations/kafka/kafka-connect-http
(replace source kafka with pulsar).

Best,
tison.


Enrico Olivelli <eo...@gmail.com> 于2022年9月27日周二 18:04写道:

> Thanks for your answers.
> I am fine with the current proposal.
> We can enhance it as follow up work
>
> Enrico
>
> Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
> <bo...@gmail.com> ha scritto:
> >
> > Thanks for your feedback Enrico.
> > My answers to your comments below
> >
> > BR
> >
> > Christophe
> >
> > Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <eo...@gmail.com> a
> > écrit :
> >
> > > Christophe,
> > > very good initiative!
> > >
> > > I support it
> > > Some comments inline below
> > >
> > >
> > > Enrico
> > >
> > > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > > <bo...@gmail.com> ha scritto:
> > > >
> > > > Hi all,
> > > >
> > > > I have drafted PIP-208: HTTP Sink
> > > >
> > > > PIP link:
> > > > https://github.com/apache/pulsar/issues/17719
> > > >
> > > > Here's a copy of the contents of the GH issue for your references:
> > > >
> > > > ### Motivation
> > > >
> > > > Currently, when you want to consume from Pulsar topics in
> applications
> > > > written in languages that don't have a Pulsar driver supported, you
> need
> > > to
> > > > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> > > > production this needs additional effort to deploy, scale, load
> balance,
> > > > monitor, and so on...
> > > > Pulsar IO is a framework that deals with all these operational
> subjects
> > > and
> > > > can be leveraged to provide a way to push messages to external
> systems
> > > > using HTTP, a protocol supported by every existing language and OS.
> > > >
> > > > ### Goal
> > > >
> > > > This proposal defines an HTTP Sink that sends the messages to a
> > > configured
> > > > URL.
> > > > It takes inspiration from [Pulsar Beam](
> > > > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent
> HTTP
> > > Sink
> > > > connector](
> > > >
> https://docs.confluent.io/kafka-connectors/http/current/overview.html).
> > > >
> > > >
> > > > ### Implementation
> > > >
> > > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > > On building the project `pulsar-io-http-{version}.nar` will be built
> and
> > > > added to the `pulsar-all` distribution.
> > > > The name of the Sink will be `http`.
> > > >
> > > > The HTTP Sink pushes records to any HTTP server with the record
> value in
> > > > the body of a POST method.
> > > > The body of the HTTP request is the JSON representation of the record
> > > value.
> > >
> > > What do you mean ?
> > > I think that this should depend on the Schema.
> > >
> > > BYTES SCHEMA -> I would push the raw message payload
> > > PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> > > represantation
> > > JSON SCHEMA ->  push the raw message payload
> > > AVRO -> ?  convert to JSON ?
> > > PROTOBUF -> ? convert to JSON ?
> > > KEY-VALUE ?
> > >
> > > Probably we need some flag to define the behaviour for the non trivial
> > > cases.
> > >
> > > The current impl chooses to serialize as JSON because it's a well
> > supported content-type on the server frameworks.
> > It's also to be consistent with existing HTTP Sinks such as Pulsar Bean
> and
> > Confluent HTTP Sink Connector.
> > The possibility to adapt the content-type to the schema is elegant and
> will
> > probably result in shorter payloads (but less readable) and I think it
> > could be done as a follow-up option.
> > It has indeed the problem of being difficult to do for KV schema.
> > For the content-type mappings I would do:
> > BYTES SCHEMA -> application/octet-stream (raw bytes)
> > PRIMITIVE VALUES (long, integer, string) - > text/plain
> > JSON ->  application/json
> > AVRO -> avro/binary
> > PROTOBUF -> probably application/octet-stream ?
> > KEY-VALUE ?
> >
> > Would also need to indicate the Schema-Type in the HTTP headers.
> >
> >
> > >
> > > >
> > > > Some headers are added to the HTTP request:
> > > > * `PulsarTopic`: the topic of the record
> > > > * `PulsarKey`: the key of the record
> > > > * `PulsarEventTime`: the event time of the record
> > > > * `PulsarPublishTime`: the publish time of the record
> > > > * `PulsarMessageId`: the ID of the message contained in the record
> > > > * `PulsarProperties-*`: each record property is passed with the
> property
> > > > name prefixed by `PulsarProperties-`
> > > >
> > >
> > > Can we make the "Content-Type" configurable ?
> > >
> > Yes we can. But do we do it for the first iteration ?
> > If we do it, I would have an option to add some fix headers and the user
> > can override the content-type.
> > If we go for a variable content-type depending on the schema, then we
> could
> > have a map<SchemaType, content-type>
> >
> > > Can we make the HTTP METHOD configurable ?
> > >
> > Yes we can. But do we do it for the first iteration ?
> >
> > >
> > > > ### Alternatives
> > > >
> > > > Creating a separated project for this Sink is rejected since:
> > > > * this Sink is very useful for developers to test the Pulsar IO
> > > framework,
> > > > transform functions, and to make demos.
> > > > * the code has a very small footprint with no external dependencies.
> > > > * it should be visible at the same level as other sinks
> > >
> > > 100% agreed !
> > >
> > > >
> > > > I'm looking forward the discussion.
> > > >
> > > > Best regards,
> > > >
> > > > Christophe Bornet
> > >
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Enrico Olivelli <eo...@gmail.com>.
Thanks for your answers.
I am fine with the current proposal.
We can enhance it as follow up work

Enrico

Il giorno ven 23 set 2022 alle ore 19:20 Christophe Bornet
<bo...@gmail.com> ha scritto:
>
> Thanks for your feedback Enrico.
> My answers to your comments below
>
> BR
>
> Christophe
>
> Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <eo...@gmail.com> a
> écrit :
>
> > Christophe,
> > very good initiative!
> >
> > I support it
> > Some comments inline below
> >
> >
> > Enrico
> >
> > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > <bo...@gmail.com> ha scritto:
> > >
> > > Hi all,
> > >
> > > I have drafted PIP-208: HTTP Sink
> > >
> > > PIP link:
> > > https://github.com/apache/pulsar/issues/17719
> > >
> > > Here's a copy of the contents of the GH issue for your references:
> > >
> > > ### Motivation
> > >
> > > Currently, when you want to consume from Pulsar topics in applications
> > > written in languages that don't have a Pulsar driver supported, you need
> > to
> > > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> > > production this needs additional effort to deploy, scale, load balance,
> > > monitor, and so on...
> > > Pulsar IO is a framework that deals with all these operational subjects
> > and
> > > can be leveraged to provide a way to push messages to external systems
> > > using HTTP, a protocol supported by every existing language and OS.
> > >
> > > ### Goal
> > >
> > > This proposal defines an HTTP Sink that sends the messages to a
> > configured
> > > URL.
> > > It takes inspiration from [Pulsar Beam](
> > > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent HTTP
> > Sink
> > > connector](
> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html).
> > >
> > >
> > > ### Implementation
> > >
> > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > On building the project `pulsar-io-http-{version}.nar` will be built and
> > > added to the `pulsar-all` distribution.
> > > The name of the Sink will be `http`.
> > >
> > > The HTTP Sink pushes records to any HTTP server with the record value in
> > > the body of a POST method.
> > > The body of the HTTP request is the JSON representation of the record
> > value.
> >
> > What do you mean ?
> > I think that this should depend on the Schema.
> >
> > BYTES SCHEMA -> I would push the raw message payload
> > PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> > represantation
> > JSON SCHEMA ->  push the raw message payload
> > AVRO -> ?  convert to JSON ?
> > PROTOBUF -> ? convert to JSON ?
> > KEY-VALUE ?
> >
> > Probably we need some flag to define the behaviour for the non trivial
> > cases.
> >
> > The current impl chooses to serialize as JSON because it's a well
> supported content-type on the server frameworks.
> It's also to be consistent with existing HTTP Sinks such as Pulsar Bean and
> Confluent HTTP Sink Connector.
> The possibility to adapt the content-type to the schema is elegant and will
> probably result in shorter payloads (but less readable) and I think it
> could be done as a follow-up option.
> It has indeed the problem of being difficult to do for KV schema.
> For the content-type mappings I would do:
> BYTES SCHEMA -> application/octet-stream (raw bytes)
> PRIMITIVE VALUES (long, integer, string) - > text/plain
> JSON ->  application/json
> AVRO -> avro/binary
> PROTOBUF -> probably application/octet-stream ?
> KEY-VALUE ?
>
> Would also need to indicate the Schema-Type in the HTTP headers.
>
>
> >
> > >
> > > Some headers are added to the HTTP request:
> > > * `PulsarTopic`: the topic of the record
> > > * `PulsarKey`: the key of the record
> > > * `PulsarEventTime`: the event time of the record
> > > * `PulsarPublishTime`: the publish time of the record
> > > * `PulsarMessageId`: the ID of the message contained in the record
> > > * `PulsarProperties-*`: each record property is passed with the property
> > > name prefixed by `PulsarProperties-`
> > >
> >
> > Can we make the "Content-Type" configurable ?
> >
> Yes we can. But do we do it for the first iteration ?
> If we do it, I would have an option to add some fix headers and the user
> can override the content-type.
> If we go for a variable content-type depending on the schema, then we could
> have a map<SchemaType, content-type>
>
> > Can we make the HTTP METHOD configurable ?
> >
> Yes we can. But do we do it for the first iteration ?
>
> >
> > > ### Alternatives
> > >
> > > Creating a separated project for this Sink is rejected since:
> > > * this Sink is very useful for developers to test the Pulsar IO
> > framework,
> > > transform functions, and to make demos.
> > > * the code has a very small footprint with no external dependencies.
> > > * it should be visible at the same level as other sinks
> >
> > 100% agreed !
> >
> > >
> > > I'm looking forward the discussion.
> > >
> > > Best regards,
> > >
> > > Christophe Bornet
> >

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Christophe Bornet <bo...@gmail.com>.
Thanks for your feedback Enrico.
My answers to your comments below

BR

Christophe

Le mar. 20 sept. 2022 à 14:16, Enrico Olivelli <eo...@gmail.com> a
écrit :

> Christophe,
> very good initiative!
>
> I support it
> Some comments inline below
>
>
> Enrico
>
> Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> <bo...@gmail.com> ha scritto:
> >
> > Hi all,
> >
> > I have drafted PIP-208: HTTP Sink
> >
> > PIP link:
> > https://github.com/apache/pulsar/issues/17719
> >
> > Here's a copy of the contents of the GH issue for your references:
> >
> > ### Motivation
> >
> > Currently, when you want to consume from Pulsar topics in applications
> > written in languages that don't have a Pulsar driver supported, you need
> to
> > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> > production this needs additional effort to deploy, scale, load balance,
> > monitor, and so on...
> > Pulsar IO is a framework that deals with all these operational subjects
> and
> > can be leveraged to provide a way to push messages to external systems
> > using HTTP, a protocol supported by every existing language and OS.
> >
> > ### Goal
> >
> > This proposal defines an HTTP Sink that sends the messages to a
> configured
> > URL.
> > It takes inspiration from [Pulsar Beam](
> > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent HTTP
> Sink
> > connector](
> > https://docs.confluent.io/kafka-connectors/http/current/overview.html).
> >
> >
> > ### Implementation
> >
> > A `pulsar-io-http` module will be added to `pulsar-io`.
> > On building the project `pulsar-io-http-{version}.nar` will be built and
> > added to the `pulsar-all` distribution.
> > The name of the Sink will be `http`.
> >
> > The HTTP Sink pushes records to any HTTP server with the record value in
> > the body of a POST method.
> > The body of the HTTP request is the JSON representation of the record
> value.
>
> What do you mean ?
> I think that this should depend on the Schema.
>
> BYTES SCHEMA -> I would push the raw message payload
> PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> represantation
> JSON SCHEMA ->  push the raw message payload
> AVRO -> ?  convert to JSON ?
> PROTOBUF -> ? convert to JSON ?
> KEY-VALUE ?
>
> Probably we need some flag to define the behaviour for the non trivial
> cases.
>
> The current impl chooses to serialize as JSON because it's a well
supported content-type on the server frameworks.
It's also to be consistent with existing HTTP Sinks such as Pulsar Bean and
Confluent HTTP Sink Connector.
The possibility to adapt the content-type to the schema is elegant and will
probably result in shorter payloads (but less readable) and I think it
could be done as a follow-up option.
It has indeed the problem of being difficult to do for KV schema.
For the content-type mappings I would do:
BYTES SCHEMA -> application/octet-stream (raw bytes)
PRIMITIVE VALUES (long, integer, string) - > text/plain
JSON ->  application/json
AVRO -> avro/binary
PROTOBUF -> probably application/octet-stream ?
KEY-VALUE ?

Would also need to indicate the Schema-Type in the HTTP headers.


>
> >
> > Some headers are added to the HTTP request:
> > * `PulsarTopic`: the topic of the record
> > * `PulsarKey`: the key of the record
> > * `PulsarEventTime`: the event time of the record
> > * `PulsarPublishTime`: the publish time of the record
> > * `PulsarMessageId`: the ID of the message contained in the record
> > * `PulsarProperties-*`: each record property is passed with the property
> > name prefixed by `PulsarProperties-`
> >
>
> Can we make the "Content-Type" configurable ?
>
Yes we can. But do we do it for the first iteration ?
If we do it, I would have an option to add some fix headers and the user
can override the content-type.
If we go for a variable content-type depending on the schema, then we could
have a map<SchemaType, content-type>

> Can we make the HTTP METHOD configurable ?
>
Yes we can. But do we do it for the first iteration ?

>
> > ### Alternatives
> >
> > Creating a separated project for this Sink is rejected since:
> > * this Sink is very useful for developers to test the Pulsar IO
> framework,
> > transform functions, and to make demos.
> > * the code has a very small footprint with no external dependencies.
> > * it should be visible at the same level as other sinks
>
> 100% agreed !
>
> >
> > I'm looking forward the discussion.
> >
> > Best regards,
> >
> > Christophe Bornet
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Christophe Bornet <bo...@gmail.com>.
Thanks for your feedback Alexander.
My answers to your comments below.

BR

Christophe

Le mer. 21 sept. 2022 à 15:58, Alexander Preuss
<al...@streamnative.io.invalid> a écrit :

> Hi Christophe,
>
> I think this is a very good idea!
>
> I agree with Enrico that the body should depend on the record schema, but
> it could also be done as a follow-up task.
>
I agree that it can be done as a follow-up task.

>
> Another thing to think about could be an optional batching mechanism that
> would take a batch of records and send them as a list of JSON objects in a
> single HTTP request.
>
Yes. Batching could be implemented as a follow-up task.

>
> Best,
> Alex
>
> On Tue, Sep 20, 2022 at 2:16 PM Enrico Olivelli <eo...@gmail.com>
> wrote:
>
> > Christophe,
> > very good initiative!
> >
> > I support it
> > Some comments inline below
> >
> >
> > Enrico
> >
> > Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> > <bo...@gmail.com> ha scritto:
> > >
> > > Hi all,
> > >
> > > I have drafted PIP-208: HTTP Sink
> > >
> > > PIP link:
> > > https://github.com/apache/pulsar/issues/17719
> > >
> > > Here's a copy of the contents of the GH issue for your references:
> > >
> > > ### Motivation
> > >
> > > Currently, when you want to consume from Pulsar topics in applications
> > > written in languages that don't have a Pulsar driver supported, you
> need
> > to
> > > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> > > production this needs additional effort to deploy, scale, load balance,
> > > monitor, and so on...
> > > Pulsar IO is a framework that deals with all these operational subjects
> > and
> > > can be leveraged to provide a way to push messages to external systems
> > > using HTTP, a protocol supported by every existing language and OS.
> > >
> > > ### Goal
> > >
> > > This proposal defines an HTTP Sink that sends the messages to a
> > configured
> > > URL.
> > > It takes inspiration from [Pulsar Beam](
> > > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent HTTP
> > Sink
> > > connector](
> > > https://docs.confluent.io/kafka-connectors/http/current/overview.html
> ).
> > >
> > >
> > > ### Implementation
> > >
> > > A `pulsar-io-http` module will be added to `pulsar-io`.
> > > On building the project `pulsar-io-http-{version}.nar` will be built
> and
> > > added to the `pulsar-all` distribution.
> > > The name of the Sink will be `http`.
> > >
> > > The HTTP Sink pushes records to any HTTP server with the record value
> in
> > > the body of a POST method.
> > > The body of the HTTP request is the JSON representation of the record
> > value.
> >
> > What do you mean ?
> > I think that this should depend on the Schema.
> >
> > BYTES SCHEMA -> I would push the raw message payload
> > PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> > represantation
> > JSON SCHEMA ->  push the raw message payload
> > AVRO -> ?  convert to JSON ?
> > PROTOBUF -> ? convert to JSON ?
> > KEY-VALUE ?
> >
> > Probably we need some flag to define the behaviour for the non trivial
> > cases.
> >
> >
> > >
> > > Some headers are added to the HTTP request:
> > > * `PulsarTopic`: the topic of the record
> > > * `PulsarKey`: the key of the record
> > > * `PulsarEventTime`: the event time of the record
> > > * `PulsarPublishTime`: the publish time of the record
> > > * `PulsarMessageId`: the ID of the message contained in the record
> > > * `PulsarProperties-*`: each record property is passed with the
> property
> > > name prefixed by `PulsarProperties-`
> > >
> >
> > Can we make the "Content-Type" configurable ?
> > Can we make the HTTP METHOD configurable ?
> >
> >
> > > ### Alternatives
> > >
> > > Creating a separated project for this Sink is rejected since:
> > > * this Sink is very useful for developers to test the Pulsar IO
> > framework,
> > > transform functions, and to make demos.
> > > * the code has a very small footprint with no external dependencies.
> > > * it should be visible at the same level as other sinks
> >
> > 100% agreed !
> >
> > >
> > > I'm looking forward the discussion.
> > >
> > > Best regards,
> > >
> > > Christophe Bornet
> >
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Alexander Preuss <al...@streamnative.io.INVALID>.
Hi Christophe,

I think this is a very good idea!

I agree with Enrico that the body should depend on the record schema, but
it could also be done as a follow-up task.

Another thing to think about could be an optional batching mechanism that
would take a batch of records and send them as a list of JSON objects in a
single HTTP request.

Best,
Alex

On Tue, Sep 20, 2022 at 2:16 PM Enrico Olivelli <eo...@gmail.com> wrote:

> Christophe,
> very good initiative!
>
> I support it
> Some comments inline below
>
>
> Enrico
>
> Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
> <bo...@gmail.com> ha scritto:
> >
> > Hi all,
> >
> > I have drafted PIP-208: HTTP Sink
> >
> > PIP link:
> > https://github.com/apache/pulsar/issues/17719
> >
> > Here's a copy of the contents of the GH issue for your references:
> >
> > ### Motivation
> >
> > Currently, when you want to consume from Pulsar topics in applications
> > written in languages that don't have a Pulsar driver supported, you need
> to
> > run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> > production this needs additional effort to deploy, scale, load balance,
> > monitor, and so on...
> > Pulsar IO is a framework that deals with all these operational subjects
> and
> > can be leveraged to provide a way to push messages to external systems
> > using HTTP, a protocol supported by every existing language and OS.
> >
> > ### Goal
> >
> > This proposal defines an HTTP Sink that sends the messages to a
> configured
> > URL.
> > It takes inspiration from [Pulsar Beam](
> > https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent HTTP
> Sink
> > connector](
> > https://docs.confluent.io/kafka-connectors/http/current/overview.html).
> >
> >
> > ### Implementation
> >
> > A `pulsar-io-http` module will be added to `pulsar-io`.
> > On building the project `pulsar-io-http-{version}.nar` will be built and
> > added to the `pulsar-all` distribution.
> > The name of the Sink will be `http`.
> >
> > The HTTP Sink pushes records to any HTTP server with the record value in
> > the body of a POST method.
> > The body of the HTTP request is the JSON representation of the record
> value.
>
> What do you mean ?
> I think that this should depend on the Schema.
>
> BYTES SCHEMA -> I would push the raw message payload
> PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
> represantation
> JSON SCHEMA ->  push the raw message payload
> AVRO -> ?  convert to JSON ?
> PROTOBUF -> ? convert to JSON ?
> KEY-VALUE ?
>
> Probably we need some flag to define the behaviour for the non trivial
> cases.
>
>
> >
> > Some headers are added to the HTTP request:
> > * `PulsarTopic`: the topic of the record
> > * `PulsarKey`: the key of the record
> > * `PulsarEventTime`: the event time of the record
> > * `PulsarPublishTime`: the publish time of the record
> > * `PulsarMessageId`: the ID of the message contained in the record
> > * `PulsarProperties-*`: each record property is passed with the property
> > name prefixed by `PulsarProperties-`
> >
>
> Can we make the "Content-Type" configurable ?
> Can we make the HTTP METHOD configurable ?
>
>
> > ### Alternatives
> >
> > Creating a separated project for this Sink is rejected since:
> > * this Sink is very useful for developers to test the Pulsar IO
> framework,
> > transform functions, and to make demos.
> > * the code has a very small footprint with no external dependencies.
> > * it should be visible at the same level as other sinks
>
> 100% agreed !
>
> >
> > I'm looking forward the discussion.
> >
> > Best regards,
> >
> > Christophe Bornet
>

Re: [DISCUSS] PIP-208: HTTP Sink

Posted by Enrico Olivelli <eo...@gmail.com>.
Christophe,
very good initiative!

I support it
Some comments inline below


Enrico

Il giorno lun 19 set 2022 alle ore 19:10 Christophe Bornet
<bo...@gmail.com> ha scritto:
>
> Hi all,
>
> I have drafted PIP-208: HTTP Sink
>
> PIP link:
> https://github.com/apache/pulsar/issues/17719
>
> Here's a copy of the contents of the GH issue for your references:
>
> ### Motivation
>
> Currently, when you want to consume from Pulsar topics in applications
> written in languages that don't have a Pulsar driver supported, you need to
> run some type of proxy like the WebSocket Proxy or Pulsar Beam. In
> production this needs additional effort to deploy, scale, load balance,
> monitor, and so on...
> Pulsar IO is a framework that deals with all these operational subjects and
> can be leveraged to provide a way to push messages to external systems
> using HTTP, a protocol supported by every existing language and OS.
>
> ### Goal
>
> This proposal defines an HTTP Sink that sends the messages to a configured
> URL.
> It takes inspiration from [Pulsar Beam](
> https://github.com/kafkaesque-io/pulsar-beam) and the [Confluent HTTP Sink
> connector](
> https://docs.confluent.io/kafka-connectors/http/current/overview.html).
>
>
> ### Implementation
>
> A `pulsar-io-http` module will be added to `pulsar-io`.
> On building the project `pulsar-io-http-{version}.nar` will be built and
> added to the `pulsar-all` distribution.
> The name of the Sink will be `http`.
>
> The HTTP Sink pushes records to any HTTP server with the record value in
> the body of a POST method.
> The body of the HTTP request is the JSON representation of the record value.

What do you mean ?
I think that this should depend on the Schema.

BYTES SCHEMA -> I would push the raw message payload
PRIMITIVE VALUES (long, integer, string) - > I would push the JSON
represantation
JSON SCHEMA ->  push the raw message payload
AVRO -> ?  convert to JSON ?
PROTOBUF -> ? convert to JSON ?
KEY-VALUE ?

Probably we need some flag to define the behaviour for the non trivial cases.


>
> Some headers are added to the HTTP request:
> * `PulsarTopic`: the topic of the record
> * `PulsarKey`: the key of the record
> * `PulsarEventTime`: the event time of the record
> * `PulsarPublishTime`: the publish time of the record
> * `PulsarMessageId`: the ID of the message contained in the record
> * `PulsarProperties-*`: each record property is passed with the property
> name prefixed by `PulsarProperties-`
>

Can we make the "Content-Type" configurable ?
Can we make the HTTP METHOD configurable ?


> ### Alternatives
>
> Creating a separated project for this Sink is rejected since:
> * this Sink is very useful for developers to test the Pulsar IO framework,
> transform functions, and to make demos.
> * the code has a very small footprint with no external dependencies.
> * it should be visible at the same level as other sinks

100% agreed !

>
> I'm looking forward the discussion.
>
> Best regards,
>
> Christophe Bornet