Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/02/04 08:57:14 UTC

[GitHub] [pulsar] eolivelli commented on pull request #9448: Pulsar IO - KafkaSource - allow to manage Avro Encoded messages

eolivelli commented on pull request #9448:
URL: https://github.com/apache/pulsar/pull/9448#issuecomment-773143619


   @sijie there is no problem in disagreeing;
   let's find together the right way to provide features to users, in the best interest of the project.
   
   I am going to split the patch into two parts, so that we can take one step at a time.
   
   > First of all, it basically creates a "connector" class per schema type. This is a very bad practice. I would discourage a connector implementation going down this route. It is impossible to maintain.
   
   We already have `KafkaBytesSource` and `KafkaStringSource`, so I am just adding a new flavour of the KafkaSource; in fact, the implementation boils down to adding a new subclass of `KafkaAbstractSource`.
   I am following the existing style.
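   
   To make the pattern concrete, here is a minimal sketch of what such a flavour could look like (the class name and the `extractValue` hook are assumptions based on the existing subclasses, not the actual patch):
   
   ```java
   import org.apache.avro.generic.GenericRecord;
   import org.apache.kafka.clients.consumer.ConsumerRecord;
   
   // Hypothetical sketch, following the existing one-subclass-per-flavour style.
   // Assumes KafkaAbstractSource exposes a hook (here called extractValue) that
   // maps the Kafka ConsumerRecord to the Pulsar message value.
   public class KafkaAvroSource extends KafkaAbstractSource<GenericRecord> {
       @Override
       public GenericRecord extractValue(ConsumerRecord<String, GenericRecord> record) {
           // The consumer's value.deserializer (Confluent's KafkaAvroDeserializer)
           // has already decoded the payload, so the subclass just forwards it.
           return record.value();
       }
   }
   ```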
   
   My plan is to keep working on this KafkaSource and on the KafkaSink, and to improve their structure.
   
   There is ongoing work that will allow packaging multiple sinks in the same NAR and provide a better user experience:
   https://github.com/apache/pulsar/pull/3678
   
   In the meantime, users of the Kafka source can select the implementation with `--classname` (or pick it from a Web UI in interactive Pulsar management consoles).
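   
   For example, a sketch of such an invocation (the archive path, source name, topic, and config file are placeholders, not from this PR):
   
   ```
   pulsar-admin sources create \
     --archive connectors/pulsar-io-kafka.nar \
     --classname org.apache.pulsar.io.kafka.KafkaBytesSource \
     --name my-kafka-source \
     --destination-topic-name kafka-ingest \
     --source-config-file kafka-source-config.yaml
   ```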
   
   > in which the key is a string and the value is AVRO
   
   We can work on this case as well (it is on my backlog); I just didn't want to introduce too many features at once.
   
   I have users who are accustomed to advanced data mapping mechanisms for both the key and the value, so mapping the key is very important to me.
   That said, the current `KafkaAbstractSource` works with a string key; that is preexisting code.
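   
   To ground the key-mapping point, this is roughly how a consumer wires Confluent's Avro deserializer for both the key and the value (a sketch; the class name is mine and the URLs are placeholders):
   
   ```java
   import java.util.Properties;
   import org.apache.kafka.clients.consumer.KafkaConsumer;
   
   // Sketch: decoding both key and value through the schema registry.
   // The current KafkaAbstractSource fixes the key to a String; an Avro key
   // would mean making key.deserializer configurable, roughly like this.
   public class AvroKeyValueConsumerSketch {
       public static KafkaConsumer<Object, Object> build() {
           Properties props = new Properties();
           props.put("bootstrap.servers", "localhost:9092");          // placeholder
           props.put("group.id", "pulsar-io-kafka");                  // placeholder
           props.put("key.deserializer",
                   "io.confluent.kafka.serializers.KafkaAvroDeserializer");
           props.put("value.deserializer",
                   "io.confluent.kafka.serializers.KafkaAvroDeserializer");
           props.put("schema.registry.url", "http://localhost:8081"); // placeholder
           // Generic data is returned as org.apache.avro.generic.GenericRecord.
           return new KafkaConsumer<>(props);
       }
   }
   ```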
   
   > The approach you proposed also has a bad performance because it will churn a lot of object allocations. Realizing the serialization details can save a lot of memory copy and serialization/deserialization.
   
   I am aware of this, and I know how the StreamNative connector works.
   
   Using the Java model with GenericRecord adds that extra cost (the byte-level alternative is sketched after this list), but the benefits are:
   - simpler code, reusing code provided by the same vendor that maintains the serialization protocol
   - we can follow protocol evolution simply by upgrading the Confluent library
   - we use pure Kafka/Pulsar APIs, fully integrated with the framework, which will let us leverage all future improvements
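   
   A sketch of that byte-level route, which relies on the documented Confluent wire format (one magic byte, a 4-byte schema id, then the raw Avro body); the class and method names are mine, not from any patch:
   
   ```java
   import java.nio.ByteBuffer;
   
   // Sketch of the byte-level alternative: forward the Avro body without
   // allocating a GenericRecord. Confluent frames each payload as
   // [magic byte 0x0][4-byte schema id][Avro binary data].
   public final class ConfluentWireFormat {
       public static byte[] avroBody(byte[] confluentPayload) {
           ByteBuffer buf = ByteBuffer.wrap(confluentPayload);
           if (buf.get() != 0x0) {
               throw new IllegalArgumentException("unknown magic byte");
           }
           int schemaId = buf.getInt(); // would drive a schema-registry lookup
           byte[] body = new byte[buf.remaining()];
           buf.get(body);
           return body; // no decode/re-encode round trip, minimal allocations
       }
   }
   ```
   
   It avoids the allocation churn, but at the price of re-implementing part of the vendor's protocol by hand, which is exactly the maintenance trade-off weighed above.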
   

