Posted to users@kafka.apache.org by Javier Arias Losada <ja...@gmail.com> on 2019/01/16 08:59:19 UTC

rebalancing latency spikes on high throughput kafka-streams services

Dear all,

We are starting to work with Kafka Streams; our service is a very simple
stateless consumer.

We have tight latency requirements, and we are seeing excessively high
latency while the consumer group is rebalancing. In our scenario,
rebalancing will happen relatively often: rolling updates of code, scaling
up/down the service, containers being shuffled by the cluster scheduler,
containers dying, hardware failing.

One of the first tests we ran used a small consumer group of 4 consumers
handling a low message rate (1K/sec). We killed one of the consumers, and
the cluster manager (currently AWS ECS, probably soon moving to K8s)
started a new one, so more than one rebalance took place.

Our most critical metric is latency, which we measure as the milliseconds
between message creation and message consumption. We saw the maximum
latency spike from a few milliseconds to almost 15 seconds.

[three attached images, not included in this plain-text archive]

We have also run tests with rolling updates of code, and the results
are worse, since our deployment is not prepared for Kafka services and we
trigger a lot of rebalances. We'll need to work on that, but we are
wondering what strategies other people follow for code deployment
/ autoscaling with the minimum possible delays.

It may or may not help, but our message-processing requirements are fairly
relaxed: we don't mind some messages occasionally being processed twice,
nor are we very strict about the ordering of messages.

We are using all default configurations, no tuning.

We need to reduce these latency spikes during rebalancing.
Can someone please give us some hints on how to work on it? Is changing
configurations enough? Do we need to use a specific partition assignor?
Implement our own?

What is the recommended approach to code deployment / autoscaling with the
minimum possible delays?

Our Kafka broker version is 1.1.0: we installed Confluent Platform 4.1.0,
and looking at the libs we found, for example,
kafka/kafka_2.11-1.1.0-cp1.jar.
On the consumer side, we are using Kafka Streams 2.1.0.

Thank you for reading my question and for your responses.
Best,
Javier Arias Losada

Re: rebalancing latency spikes on high throughput kafka-streams services

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Javier,

I saw your SO thread before I noticed the question here, so I've already
answered it on SO. For the reference of other readers interested in this
thread:

https://stackoverflow.com/questions/54218822/kafka-streams-rebalancing-latency-spikes-on-high-throughput-kafka-streams-servic

Guozhang


Re: rebalancing latency spikes on high throughput kafka-streams services

Posted by Raman Gupta <ro...@gmail.com>.
The first thing I'd take a look at is your `max.poll.records` setting. The
default for streams is 1000 (see
https://docs.confluent.io/current/streams/developer-guide/config-streams.html#default-values).
Depending on your workloads, this could definitely cause long rebalances --
it did for me, but my workload requires some quite long processing times.
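
For readers who want to try this, here is a minimal sketch of overriding
`max.poll.records` in the properties handed to a Kafka Streams application.
It only builds a `java.util.Properties` object with plain string keys, so it
runs without Kafka on the classpath. The class name, the application id, the
bootstrap address and the value 100 are all illustrative assumptions, not
recommendations from this thread; the right value depends on how long each
record takes to process.

```java
import java.util.Properties;

// Hypothetical helper class; the name is not from the thread.
public class StreamsRebalanceConfigSketch {

    // Builds the properties that would be passed to the KafkaStreams constructor.
    public static Properties withMaxPollRecords(int maxPollRecords) {
        Properties props = new Properties();
        // Placeholder values: replace with your own application id and brokers.
        props.put("application.id", "latency-sensitive-service");
        props.put("bootstrap.servers", "localhost:9092");
        // Fewer records per poll() means less work between polls, so a consumer
        // can notice and join a pending rebalance sooner.
        props.put("max.poll.records", Integer.toString(maxPollRecords));
        return props;
    }

    public static void main(String[] args) {
        Properties props = withMaxPollRecords(100); // 100 is an arbitrary example
        System.out.println("max.poll.records=" + props.getProperty("max.poll.records"));
    }
}
```

In a real service these properties would be passed, together with the
topology, to the `KafkaStreams` constructor. It is worth measuring latency
during a forced rebalance before and after changing the value, since a
smaller batch can also reduce throughput.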

Regards,
Raman
