Posted to users@kafka.apache.org by Sachin Mittal <sj...@gmail.com> on 2020/04/15 14:22:45 UTC

How to add partitions to an existing Kafka topic

Hi,
We have a Kafka Streams application which runs multiple instances and
consumes from a source topic.
Producers produce keyed messages to this source topic.
The keyed messages are events from different sources, and each source has a
unique key.

So what essentially happens is that messages from a particular source
always get added to a particular partition.
Hence we can run multiple instances of the streams application, with a
particular instance processing messages for certain partitions.
We will never get into a case where messages for a source are processed by
different instances of the streams application simultaneously.
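
A minimal sketch of this produce path (the topic name, key, and broker
address are illustrative, not from our actual setup):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Illustrative sketch: every event carries its source's unique key, so
// the default partitioner (hash of key mod partition count) sends all
// of one source's events to the same partition.
public class SourceEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "source-42", "event-payload"));
        }
    }
}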

So far so good.

Now, over time, new sources are added. It may so happen that we reach a
saturation point and have no option but to increase the number of partitions.

So what is the best practice for increasing the number of partitions?
Is there a way to ensure that existing keys' messages continue to get
published to the same partition as before, and only new sources' keys get
their messages published to the new partitions we add?

If this is not possible, then does Kafka's re-partition mechanism ensure
that during a rebalance all the previous messages of a particular key get
moved to the same partition?
I guess under this approach we would have to stop our streams application
until the rebalance is over; otherwise messages for the same key may get
processed by different instances of the application.

Anyway, I just wanted to know how such a problem is tackled on live
systems in real time, or how some of you have approached the same.

Thanks
Sachin

Re: How to add partitions to an existing Kafka topic

Posted by John Roesler <vv...@apache.org>.
Hi Sachin,

I’m a bit hazy on the details of the broker partition expansion feature. It’s been a while since I looked at it. But you actually control the key-to-partition mapping at the producer side. The producer’s default partitioner just hashes the keys over the partitions, but you could plug in your own partitioner that remembers which keys it sent to which partitions and does what you want.
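
A minimal sketch of such a custom partitioner (the class name and in-memory
map are illustrative; only the Partitioner interface and the murmur2 hash
used by the default partitioner are standard Kafka):

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Illustrative sketch: pin each key to the partition it was first routed
// to, so adding partitions later only affects keys seen afterwards. The
// map here is in-memory only; a real version would need durable, shared
// state to survive restarts and to agree across producer instances.
public class StickyKeyPartitioner implements Partitioner {

    private final Map<String, Integer> keyToPartition = new ConcurrentHashMap<>();

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Assumes every record is keyed, as in this thread.
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return keyToPartition.computeIfAbsent(
                new String(keyBytes, StandardCharsets.UTF_8),
                k -> Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

It would be plugged in via the producer's partitioner.class config.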

I hope that helps,
John

Re: How to add partitions to an existing Kafka topic

Posted by Sachin Mittal <sj...@gmail.com>.
Hi,
I will look into the suggestions you folks mentioned.

I was just wondering something purely from the Kafka point of view.
Let's say we add new partitions to a Kafka topic. Is there any way to
configure it so that only new keys get their messages added to those
partitions, while existing keys continue to add their messages to the
previous partitions?

Or, the moment we add a new partition, does Kafka completely redistribute
all the older messages among all the partitions?
And if it does, does it ensure that in this redistribution it keeps
messages of the same key in the same partition?
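
For background on what the question hinges on: the default partitioner
derives the partition purely from the key bytes and the current partition
count, roughly as sketched below (the helper name is illustrative).
Records already written to the brokers are never moved; only records
produced after the change hash over the new count.

import org.apache.kafka.common.utils.Utils;

// Illustrative helper: how the default partitioner maps a keyed record.
// Adding partitions changes numPartitions, and therefore the result,
// for all records produced afterwards; nothing on the brokers moves.
public final class DefaultPartitioning {
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}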

Thanks
Sachin

Re: How to add partitions to an existing Kafka topic

Posted by John Roesler <vv...@apache.org>.
Hi Sachin,

Just to build on Boyang’s answer a little: when designing Kafka’s partition expansion operation, we did consider making it support dynamic repartitioning in a way that would work for Streams as well, but it added too much complexity, and the contributor had other use cases in mind.

In Streams, we have some ideas for improving the dynamic scalability, but for now, your best bet is to stop the app and clone the topic in question into a new topic with more partitions, then point the app to the new input topic. Depending on the application, you might also have changelog topics and repartition topics to worry about. The easiest thing is just to reset the app, if you can tolerate it. 
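
A minimal sketch of the clone step (topic names, broker address, and the
crude end-of-topic check are illustrative; the target topic would be
created beforehand, e.g. with kafka-topics.sh --create --partitions N):

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Illustrative sketch: copy every record from "events" into "events-v2"
// (pre-created with more partitions) while the streams app is stopped.
// Sending by key, without an explicit partition, re-hashes each key
// over the new partition count.
public class TopicCloner {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092"); // assumption
        cProps.put("group.id", "topic-cloner");
        cProps.put("auto.offset.reset", "earliest");
        cProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
        cProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092"); // assumption
        pProps.put("key.serializer", ByteArraySerializer.class.getName());
        pProps.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(5));
                if (records.isEmpty()) break; // crude: assumes no concurrent writers
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    producer.send(new ProducerRecord<>("events-v2", r.key(), r.value()));
                }
            }
            producer.flush();
        }
    }
}

The app's internal changelog and repartition topics would then typically be
handled with the kafka-streams-application-reset tool, or sidestepped by
switching to a new application.id.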

Iirc, Jan Filipiak has mentioned some techniques or tooling he developed to automate this process. You might search the archives to see what you can dig up. I think it was pretty much what I said above. 

Hope this helps,
John

Re: How to add partitions to an existing Kafka topic

Posted by Boyang Chen <re...@gmail.com>.
Hey Sachin,

Your observation is correct; unfortunately, Kafka Streams doesn't support
adding partitions online. The rebalance cannot guarantee that the same key
routes to the same partition when the input topic's partition count changes,
as it is the upstream producer's responsibility to consistently route the
same key's data, and that is not solved today.
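
For reference, the partition-expansion operation in question is a
metadata-only change, along the lines of the sketch below (broker address
and target count are illustrative); it never moves existing records, which
is why old keys lose their partition affinity.

import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

// Illustrative sketch: grow the "events" topic to 12 partitions in total.
// Records already written stay in their original partitions; only records
// produced afterwards hash over the new count.
public class ExpandTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            admin.createPartitions(Map.of("events", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}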

Boyang
