Posted to users@kafka.apache.org by Michael Sklyar <mi...@gmail.com> on 2016/08/22 17:32:23 UTC

Kafka Connect - how to deal with multiple formats in Kafka?

I am looking into Kafka Connect and Confluent HDFSSinkConnector.

The goal is to save data from various topics to HDFS.
We have at least two different formats of the data in Kafka - raw data
(JSON) - that we want to save as SequenceFile and normalized data
(Protobuf) that we want to save as Parquet.

(I understand that Confluent expects Avro to be used, but I succeeded in
writing custom converters and RecordWriters that work fine without Avro
and the Schema Registry.)
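
For anyone curious, the JSON side boils down to a converter that hands
the raw bytes to the sink as a schemaless string, so neither Avro nor
the Schema Registry is involved. A rough sketch (the class and package
names are mine, not from any library):

    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.data.SchemaAndValue;
    import org.apache.kafka.connect.storage.Converter;

    // Pass-through JSON converter: the value stays an opaque string,
    // so no schema and no Schema Registry are needed.
    public class MyCustomJsonConverter implements Converter {

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            // nothing to configure in this sketch
        }

        @Override
        public byte[] fromConnectData(String topic, Schema schema, Object value) {
            // turn the Connect value back into UTF-8 bytes for Kafka
            return value == null
                    ? null
                    : value.toString().getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public SchemaAndValue toConnectData(String topic, byte[] value) {
            // hand the raw JSON string to the sink with no schema attached
            if (value == null) {
                return SchemaAndValue.NULL;
            }
            return new SchemaAndValue(Schema.OPTIONAL_STRING_SCHEMA,
                    new String(value, StandardCharsets.UTF_8));
        }
    }

Passing the value through as a plain string leaves the connector's
RecordWriter in full control of the on-disk format (SequenceFile in our
case).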

Question: is there a specific reason that key.converter and
value.converter are defined per Kafka Connect cluster rather than per
connector?

It means that all the data in Kafka (in all the topics) has to be stored
in the same format, or I will need two different clusters: one with
value.converter = MyCustomJsonConverter and another with
value.converter = MyCustomProtobufConverter.
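
To illustrate, the converters are set in the worker properties, so each
format needs its own worker configuration and therefore its own cluster.
Roughly like this (most worker settings omitted; the com.example class
names are just placeholders for our converters):

    # connect-json-worker.properties
    bootstrap.servers=localhost:9092
    group.id=connect-json-cluster
    value.converter=com.example.MyCustomJsonConverter

    # connect-protobuf-worker.properties
    bootstrap.servers=localhost:9092
    group.id=connect-protobuf-cluster
    value.converter=com.example.MyCustomProtobufConverter

Every connector submitted to a given worker cluster inherits that
cluster's converters, which is exactly the limitation I am running into.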

It becomes even worse in the case of Protobuf: every topic has a
different Protobuf schema and therefore needs a different converter, and
running a dozen Kafka Connect clusters sounds like a very bad option.
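
For context, one way such a Protobuf converter could be structured is to
tell it which generated message class to expect, which only pushes the
per-topic problem into configuration (the protobuf.message.class
property is my own invention, not a standard setting):

    import java.lang.reflect.Method;
    import java.util.Map;

    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.data.SchemaAndValue;
    import org.apache.kafka.connect.errors.DataException;
    import org.apache.kafka.connect.storage.Converter;

    // Protobuf converter configured with a single generated message class,
    // so each schema needs its own converter configuration.
    public class MyCustomProtobufConverter implements Converter {

        private Method parseFrom;

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            try {
                // e.g. protobuf.message.class=com.example.NormalizedEvent
                Class<?> messageClass =
                        Class.forName((String) configs.get("protobuf.message.class"));
                parseFrom = messageClass.getMethod("parseFrom", byte[].class);
            } catch (Exception e) {
                throw new DataException("Cannot load Protobuf message class", e);
            }
        }

        @Override
        public byte[] fromConnectData(String topic, Schema schema, Object value) {
            // generated Protobuf messages know how to serialize themselves
            return value == null
                    ? null
                    : ((com.google.protobuf.Message) value).toByteArray();
        }

        @Override
        public SchemaAndValue toConnectData(String topic, byte[] value) {
            if (value == null) {
                return SchemaAndValue.NULL;
            }
            try {
                // parse with the configured class and hand the Message object
                // to the sink's RecordWriter, schemaless
                Object message = parseFrom.invoke(null, (Object) value);
                return new SchemaAndValue(null, message);
            } catch (Exception e) {
                throw new DataException("Failed to parse Protobuf value", e);
            }
        }
    }

Since a worker can only carry one such configuration, every schema (and
so every topic) would end up needing its own cluster.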

Wouldn't it make more sense to have key.converter and value.converter
defined at the level of an individual connector?

Any other suggestions?

Re: Kafka Connect - how to deal with multiple formats in Kafka?

Posted by Michael Sklyar <mi...@gmail.com>.
Thank you,
Glad to see it is addressed.

On Mon, Aug 22, 2016 at 8:36 PM, Dustin Cote <du...@confluent.io> wrote:

> Hi Michael,
>
> You'd probably be interested in the discussion for this KIP:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-75+-+Add+per-connector+Converters
>
> For now, you'd have to run different Connect instances, but KIP-75 plans to
> let you have control over converters at a connector level.
>
> Regards,

Re: Kafka Connect - how to deal with multiple formats in Kafka?

Posted by Dustin Cote <du...@confluent.io>.
Hi Michael,

You'd probably be interested in the discussion for this KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-75+-+Add+per-connector+Converters

For now, you'd have to run different Connect instances, but KIP-75 plans to
let you have control over converters at a connector level.
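
Once that work lands, the expectation is that each connector's own
configuration could name its converters, along these lines (the topics
and the com.example converter classes below are only placeholders):

    name=hdfs-sink-raw-json
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    topics=raw-events
    value.converter=com.example.MyCustomJsonConverter

    name=hdfs-sink-normalized-protobuf
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    topics=normalized-events
    value.converter=com.example.MyCustomProtobufConverter

Both connectors could then run side by side on a single Connect cluster.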

Regards,


-- 
*Dustin Cote*
Customer Operations Engineer | Confluent
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
<http://www.confluent.io/blog>