Posted to users@kafka.apache.org by Margaret Figura <ma...@infovista.com.INVALID> on 2023/03/24 18:56:50 UTC

Sudden imbalance between partitions

Hi,

We have a 22-node Kafka 3.3.1 cluster on K8s. All data is sent with a null partitionId and a null key from 20 Java producers, so it should be distributed evenly across partitions. All was good for days, but a couple of hours ago broker 21 started receiving about 2x the data of the other brokers for a few topics (but not all). These topics all have a replication factor of 1, and their 96 partitions are distributed evenly across the brokers (each broker has 4 or 5 partitions). This was detected in Grafana, but I can also see the offsets increasing much faster for the partitions owned by broker 21 using GetOffsetShell. What could cause this? I didn't see anything unusual in the broker 21 logs or the controller logs.
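
(For concreteness, a minimal sketch of how each send looks; the topic name, payload, and broker address are placeholders, not our real code:)

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // (topic, value) constructor: key and partition are both left
            // null, so the producer's partitioner picks the partition.
            producer.send(new ProducerRecord<>("example-topic", "payload"));
        }
    }
}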

Looking back, I noticed that broker 11 also becomes a bit unbalanced each day at the time when we are processing the most data, but it is only 10-15% higher than the others. All other brokers are quite even, including broker 21 until today.

Any ideas on what I can check? Unfortunately we'll probably have to restart Kafka and/or the producers pretty soon.

Thanks a lot!
Meg

RE: Sudden imbalance between partitions

Posted by Margaret Figura <ma...@infovista.com.INVALID>.
Hi Greg,

Thank you very much for the quick and detailed response. Our clients are 2.5.0, so they do have the problematic version of the partitioner. From the metrics we have available, it's not 100% clear that this is the issue, but some large components were restarted at the same time the problem started, so it is certainly plausible that the restart temporarily affected the node that went bad. We'll plan to upgrade the clients and hope that solves it.
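
(In case it helps anyone else: one quick way to confirm which kafka-clients version is actually on the classpath is AppInfoParser, which ships inside the client jar:)

import org.apache.kafka.common.utils.AppInfoParser;

public class ClientVersionCheck {
    public static void main(String[] args) {
        // Prints the version of the kafka-clients jar the JVM actually loaded.
        System.out.println("kafka-clients: " + AppInfoParser.getVersion());
    }
}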

Thanks!
Meg


Re: Sudden imbalance between partitions

Posted by Greg Harris <gr...@aiven.io.INVALID>.
Meg,

What version are your clients, and what partitioner are you using for these
records?

If you're using the DefaultPartitioner from 2.4.0+, it has a known
imbalance flaw that is described and addressed by KIP-794:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
The fix was released in 3.3.0. In short, the old sticky logic only
switches partitions when a new batch is created, so batches bound for a
slow broker linger and keep accumulating records, and the slow broker
ends up receiving more data rather than less.
To make sure you're using the patched partitioner, the clients jar
should be on 3.3.0+ and your application should leave the
`partitioner.class` configuration unset, so the producer chooses the
new default behavior.
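
A rough sketch of what that looks like (the bootstrap address and
serializers are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class PatchedPartitionerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Deliberately no PARTITIONER_CLASS_CONFIG entry: with a 3.3.0+
        // clients jar, leaving it unset selects the KIP-794 behavior.
        return new KafkaProducer<>(props);
    }
}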

In the short term, pausing, throttling, or restarting producers may help
resolve the imbalance, since the poor balance is caused by the state of the
producer buffers.
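
If you want to observe that buffer state directly, the producer exposes
it through its metrics. A rough sketch, assuming an existing
KafkaProducer instance named `producer`:

import java.util.Map;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

// Print the producer's buffer-related metrics; sustained low
// buffer-available-bytes (or a high bufferpool-wait-ratio) suggests
// records are piling up behind a slow broker.
Map<MetricName, ? extends Metric> metrics = producer.metrics();
for (Map.Entry<MetricName, ? extends Metric> e : metrics.entrySet()) {
    if (e.getKey().name().contains("buffer")) {
        System.out.println(e.getKey().name() + " = " + e.getValue().metricValue());
    }
}
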
Adding nodes to the cluster and spreading partitions thinner may also help
increase the tolerance of each broker before it becomes unbalanced.
However, this will not solve the problem on its own, and may make it
temporarily worse while partitions are being replicated to the added nodes.
If you're already running the patched version of the partitioner, then a
more detailed investigation will be necessary.

I hope some of this helps!
Greg Harris
