Posted to users@kafka.apache.org by Jordan Wyatt <jw...@gmail.com> on 2022/03/30 12:27:51 UTC

Kafka Connect - offset.storage.topic reuse across clusters

Hi,

I've recently been experimenting with setting the names of the `offset`,
`config`, and `status` storage topics (`offset.storage.topic`,
`config.storage.topic`, `status.storage.topic`) within Kafka Connect.

I'm aware from various sources (Robin Moffatt's blogs, StackOverflow, the
Confluent Kafka Connect docs) that these topics should not be shared across
different Connect **clusters**, i.e. each unique set of workers with a given
`group.id` should use its own set of internal storage topics.
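
For reference, the relevant worker settings look roughly like this (the topic
names and group ID below are purely illustrative, not real values):

    group.id=connect-cluster-1
    offset.storage.topic=connect-offsets-cluster-1
    config.storage.topic=connect-configs-cluster-1
    status.storage.topic=connect-status-cluster-1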

These discussions and docs usually talk about sharing all three topics at
once; however, I am interested in reusing only the offset storage topic, and
I am struggling to find the specific risks of sharing that topic between
different Connect clusters.

I'm aware of issues with sharing the config and status topics from blogs and
my own testing (clusters can end up running connectors from other clusters,
for example), but I cannot find a concrete case against sharing the offset
topic, despite the guidance to avoid it.

The use cases I am interested in are:

1. Sharing an offset topic between clusters, but never in parallel.

*e.g. cluster 1 running connector A uses the offset topic; cluster 1 and
connector A are deleted; then cluster 2 running connector B is created and
uses the same offset topic.*

2. As above, but using the offset topic in parallel.

As the offset.storage topic is keyed by connector name (for the source
connectors I've tried), I do not see the risk in either of the above cases
**unless** more than one connector exists with the same name in separate
clusters, since there would then be a risk of key collisions, as `group.id`
is not referenced in the offset topic keys.
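
For illustration, records on the offset topic look roughly like this for a
hypothetical file source connector named connector-a (the exact partition and
offset fields depend entirely on the connector):

    key:   ["connector-a", {"filename": "/tmp/input.txt"}]
    value: {"position": 512}

i.e. the connector name is the only identifying part of the key; nothing in
it ties the record to a particular Connect cluster.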

Any insights into why sharing the offset topic between clusters in the cases
described above is discouraged would be greatly appreciated, thank you.

Re: Kafka Connect - offset.storage.topic reuse across clusters

Posted by Chris Egerton <fe...@gmail.com>.
Connectors overwriting each other's offsets is the primary concern. If you
have a guarantee that there will only ever be one connector with a given
name running at once on any of the Connect clusters that use the same
offsets topic, and you want offsets to be shared for all source connectors
on any of those clusters, then that concern is addressed. It does inflict an
operational burden on the administrators of your Connect clusters, and on the
people creating/managing connectors on those clusters. But if you're
willing to accept that burden and the footguns it comes with, this is an
option for you.

Also worth noting that this would actually cause cross-Connect-cluster
offset tracking logic to behave the same for source connectors and sink
connectors, which already commit consumer offsets to Kafka based solely on
connector name and with no distinction made between which Connect cluster
the connector is running on. (This can technically be addressed by manually
overriding the sink connector's group ID; I'm just outlining the default
behavior.)
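
If it helps, that override looks roughly like this (the group name below is
made up, and the worker has to permit client config overrides for the
per-connector setting to take effect):

    # worker config: allow per-connector client overrides
    connector.client.config.override.policy=All

    # sink connector config: pin the consumer group explicitly
    consumer.override.group.id=my-sink-cluster-2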

One other potential cause for concern is that Connect workers read to the end
of the offsets topic every time a source task reads offsets, so if you're
hammering the offsets topic with a ton of writes from several Connect
clusters, there may be a performance impact for source connectors that read
offsets frequently. But this shouldn't be any different from running a
monolithic cluster with the same number of workers as the sum of all workers
across your multi-cluster setup, and it's generally not a good idea for
source connectors to read offsets other than when they're starting up.
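
To be concrete, the usual pattern is for a source task to look up its offset
once in start() and then track its position in memory; a rough sketch (the
partition key, field names, and class here are purely illustrative):

    import java.util.Collections;
    import java.util.Map;
    import org.apache.kafka.connect.source.SourceTask;

    public abstract class FileSourceTaskSketch extends SourceTask {
        private long position;

        @Override
        public void start(Map<String, String> props) {
            // Single offset lookup at startup; this is what triggers the
            // worker's read-to-end of the offsets topic.
            Map<String, Object> offset = context.offsetStorageReader()
                    .offset(Collections.singletonMap("filename", props.get("filename")));
            position = (offset == null || offset.get("position") == null)
                    ? 0L : (Long) offset.get("position");
            // From here on, track `position` in memory rather than
            // re-reading offsets on every poll().
        }
    }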

I'm mostly curious about the motivation to use a different group ID,
though--if failover is the idea here, is there any specific scenario you
have in mind that makes this option less appealing?



Re: Kafka Connect - offset.storage.topic reuse across clusters

Posted by Jordan Wyatt <jw...@gmail.com>.
Hi Robin,

I'm interested in a use case in which a Connect cluster can fail and I then
bring up a new cluster with the same offset topic and connectors. By "new
cluster" I mean a cluster with a new `group.id`. I'm aware I could just reuse
the same group ID as before, but I would like to explore this route.

I'm keen to learn more about the reasons why the case described above, and
those in my original message, aren't recommended.

Thank you,
Jordan


Re: Kafka Connect - offset.storage.topic reuse across clusters

Posted by Robin Moffatt <ro...@confluent.io.INVALID>.
Hi Jordan,

Is there a good reason for wanting to do this? I can think of multiple
reasons why you shouldn't, even if it technically works in some cases.
Or is it just curiosity as to whether you can/should?

thanks, Robin.


-- 

Robin Moffatt | Principal Developer Advocate | robin@confluent.io | @rmoff

