Posted to dev@flink.apache.org by Ryan van Huuksloot <ry...@shopify.com.INVALID> on 2022/09/02 16:30:25 UTC

RE: [DISCUSS] Contribution of Multi Cluster Kafka Source

Hi Mason,

First off, thanks for putting this FLIP together! Sorry for the delay. Full
disclosure: Mason and I chatted briefly at Flink Forward 2022, but I have
tried to capture the questions I had for him then.

I'll start the conversation with a few questions:

1. The concept of streamIds is not clear to me in the proposal and could
use some more information. If I understand correctly, they will be used by
the MetadataService to link Kafka clusters to the streams you want to
consume? If you assign stream ids using `setStreamIds`, how can you
dynamically increase the number of clusters you consume from if the list of
stream ids is static? I am basing this off of your example
`.setStreamIds(List.of("my-stream-1", "my-stream-2"))`, so I could be off
base with my assumption. If you don't mind clearing up the intention, that
would be great!
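To make the question concrete, here is a rough sketch of what I imagine (all
names here are hypothetical, not from the FLIP): the stream ids stay static
on the source, while the metadata service behind them can resolve a given
stream id to a changing set of clusters between refreshes.

```java
import java.util.*;

// Hypothetical sketch: a "metadata service" modeled as a mutable map from
// stream id to the Kafka clusters currently serving that stream. The list
// of stream ids handed to the source never changes, but the clusters it
// resolves to can grow or shrink on each metadata refresh.
public class StreamMetadataSketch {
    static final Map<String, Set<String>> metadata = new HashMap<>();

    // Resolve a static list of stream ids to the current set of clusters.
    static Set<String> clustersFor(List<String> streamIds) {
        Set<String> clusters = new TreeSet<>();
        for (String id : streamIds) {
            clusters.addAll(metadata.getOrDefault(id, Set.of()));
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> streamIds = List.of("my-stream-1", "my-stream-2");
        metadata.put("my-stream-1", Set.of("cluster-a"));
        metadata.put("my-stream-2", Set.of("cluster-a"));
        System.out.println(clustersFor(streamIds)); // [cluster-a]

        // A later metadata refresh adds a cluster; streamIds is untouched.
        metadata.put("my-stream-2", Set.of("cluster-a", "cluster-b"));
        System.out.println(clustersFor(streamIds)); // [cluster-a, cluster-b]
    }
}
```

If that is roughly the intended model, then a static stream id list would
still allow the cluster set to change dynamically.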

2. How would offsets work if you wanted to use this MultiClusterKafkaSource
with a file-based backfill? In the case I am thinking of, you have a
bucket-backed archive of Kafka data per cluster, and you want to pick up
from the last offset in the archived system. Can you set OffsetInitializers
"per cluster", potentially as a function, or are you limited to a single
OffsetInitializer for the entire source?
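In other words, something along these lines (a purely hypothetical sketch;
the enum and method names are mine, not from the FLIP or KafkaSource):

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch of "OffsetInitializer per cluster": a function from
// cluster id to a starting-offset strategy, with per-cluster overrides and
// a fallback for everything else.
public class PerClusterOffsetsSketch {
    enum Strategy { EARLIEST, LATEST, FROM_ARCHIVE }

    static Function<String, Strategy> perCluster(
            Map<String, Strategy> overrides, Strategy fallback) {
        return cluster -> overrides.getOrDefault(cluster, fallback);
    }

    public static void main(String[] args) {
        // cluster-a resumes from the file-based archive's last offset;
        // every other cluster starts from EARLIEST.
        Function<String, Strategy> init = perCluster(
                Map.of("cluster-a", Strategy.FROM_ARCHIVE),
                Strategy.EARLIEST);
        System.out.println(init.apply("cluster-a")); // FROM_ARCHIVE
        System.out.println(init.apply("cluster-b")); // EARLIEST
    }
}
```

Whether the real builder exposes a function like this, or only one
initializer for the whole source, is exactly what I am asking about.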

3. Just to make sure: because this system will layer on top of FLIP-27
and use KafkaSource for some aspects under the hood, the watermark
alignment that was introduced in FLIP-182 / Flink 1.15 would be possible
across multiple clusters if you assign them to the same alignment group?
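My mental model of FLIP-182-style alignment across clusters is roughly the
following (a simplified, hypothetical sketch of the mechanism, not Flink's
actual implementation): readers in the same alignment group pause once they
run ahead of the group's minimum watermark by more than the allowed drift.

```java
import java.util.*;

// Hypothetical sketch of watermark alignment across clusters: given each
// cluster's current watermark, find the readers that should throttle
// because they exceed the group minimum by more than maxDriftMs.
public class AlignmentGroupSketch {
    static Set<String> pausedReaders(Map<String, Long> watermarkPerCluster,
                                     long maxDriftMs) {
        long min = Collections.min(watermarkPerCluster.values());
        Set<String> paused = new TreeSet<>();
        for (Map.Entry<String, Long> e : watermarkPerCluster.entrySet()) {
            if (e.getValue() - min > maxDriftMs) {
                paused.add(e.getKey()); // this cluster's reader throttles
            }
        }
        return paused;
    }

    public static void main(String[] args) {
        // cluster-b is 7s ahead of cluster-a with 5s of allowed drift.
        Map<String, Long> wm = Map.of("cluster-a", 1_000L,
                                      "cluster-b", 8_000L);
        System.out.println(pausedReaders(wm, 5_000L)); // [cluster-b]
    }
}
```

If the per-cluster readers all report into the same alignment group like
this, I would expect cross-cluster alignment to fall out naturally.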

Thanks!
Ryan

On 2022/06/24 01:43:38 Mason Chen wrote:
> Hi community,
>
> We have been working on a Multi Cluster Kafka Source and are looking to
> contribute it upstream. I've given a talk about the features and design at
> a Flink meetup: https://youtu.be/H1SYOuLcUTI.
>
> The main features it provides are:
> 1. Reading multiple Kafka clusters within a single source.
> 2. Adjusting the clusters and topics the source consumes from dynamically,
> without Flink job restart.
>
> Some of the challenging use cases that these features solve are:
> 1. Transparent Kafka cluster migration without Flink job restart.
> 2. Transparent Kafka topic migration without Flink job restart.
> 3. Direct integration with Hybrid Source.
>
> In addition, this is designed to wrap and manage the existing
> KafkaSource components to enable these features, so it can continue to
> benefit from KafkaSource improvements and bug fixes. It can be considered
> a form of composite source.
>
> I think the contribution of this source could benefit a lot of users who
> have asked in the mailing list about Flink handling Kafka migrations and
> removing topics in the past. I would love to hear and address your
> thoughts and feedback, and if possible drive a FLIP!
>
> Best,
> Mason
>