Posted to users@kafka.apache.org by noah <ia...@gmail.com> on 2016/06/24 18:16:45 UTC

[kafka-connect] multiple or single clusters?

I'm having some trouble figuring out the right way to run Kafka Connect in
production. We will have multiple sink connectors that need to stay running
indefinitely with at-least-once semantics (and as little duplication as
possible), so it seems clear that we should run in distributed mode, which
gives us durable offsets and lets us scale up by adding more Connect
instances.
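
For context, a bare-bones distributed-mode worker config looks roughly like
the sketch below (the broker addresses, group.id, and topic names are just
placeholders, not anything specific to our setup):

  # connect-distributed.properties (sketch)
  bootstrap.servers=kafka-1:9092,kafka-2:9092
  # Workers that share a group.id form one Connect cluster and share work.
  group.id=connect-cluster-a
  # Internal topics the cluster uses for connector configs, source offsets, and status.
  config.storage.topic=connect-configs
  offset.storage.topic=connect-offsets
  status.storage.topic=connect-status
  # One converter pair per worker, applied to every connector it runs.
  key.converter=org.apache.kafka.connect.json.JsonConverter
  value.converter=org.apache.kafka.connect.json.JsonConverter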

What isn't clear is the best way to run multiple, heterogeneous connectors in
distributed mode. It looks like every Connect instance will read the
config/status topics and take on some number of tasks (and tasks can't be
pinned to specific Connect instances). It also looks like only one key
converter and one value converter can be configured per Connect instance. So
if I need two different conversion strategies, I'd have to either write a
custom converter that can figure it out, or run multiple Connect clusters,
each with its own set of config+offset+status topics.
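
Concretely, that second option would presumably mean two disjoint sets of
worker configs along these lines -- separate group.ids, separate internal
topics, different converters (all names are placeholders; the AvroConverter
is Confluent's, if you happen to use it):

  # Workers for cluster A: JSON records
  group.id=connect-json
  config.storage.topic=connect-json-configs
  offset.storage.topic=connect-json-offsets
  status.storage.topic=connect-json-status
  key.converter=org.apache.kafka.connect.json.JsonConverter
  value.converter=org.apache.kafka.connect.json.JsonConverter

  # Workers for cluster B: Avro records
  group.id=connect-avro
  config.storage.topic=connect-avro-configs
  offset.storage.topic=connect-avro-offsets
  status.storage.topic=connect-avro-status
  key.converter=io.confluent.connect.avro.AvroConverter
  value.converter=io.confluent.connect.avro.AvroConverter
  key.converter.schema.registry.url=http://schema-registry:8081
  value.converter.schema.registry.url=http://schema-registry:8081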

Is that right? Worst case, I need another set of N distributed Connect
instances per sink/source, which ends up being a lot of topics to manage.
What does a real-world Connect topology look like?

Re: [kafka-connect] multiple or single clusters?

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.

Yeah, ideally you want to minimize the number of clusters just to keep
operational costs down -- it is easier to maintain one cluster than N.
However, you're right that at the moment we only support one converter type
per cluster. We want to make that configurable per connector (with a
cluster-wide default to keep configuration cheap when you know you want to
standardize on one type for most connectors), but we haven't gotten to
implementing it yet. Look for it in an upcoming release! For now, though,
that does mean that if you want to use different converters you'll need to
run multiple clusters.

-Ewen
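
For reference, a sketch of what such a per-connector converter override could
look like once it lands: a connector created through the Connect REST API
that overrides the cluster-wide converters just for itself. The connector
name, class, topic, and worker hostname below are illustrative only, other
connector-specific settings are omitted, and this is not a committed syntax:

  curl -X POST -H "Content-Type: application/json" \
    http://connect-worker:8083/connectors -d '{
      "name": "orders-avro-sink",
      "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "orders",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081"
      }
    }'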