Posted to users@kafka.apache.org by "Karnam, Sudheer" <su...@hpe.com> on 2020/06/17 12:16:13 UTC

Kafka partitions replication issue

Team,
We are using Kafka version 2.3.0 and we are facing an issue with broker replication
<https://support.d2iq.com/s/feed/0D53Z00007KdrfHSAR>
1. The Kafka cluster has 6 brokers.
2. The cluster mainly holds 7 topics, and each topic has 128 partitions.
3. Each partition has 3 replicas, and these are distributed among the 6 brokers.
4. All partitions have a preferred leader, and "auto.leader.rebalance.enable" is turned on.
Issue:
Broker-3 failed because of hardware issues, and the partitions that had broker-3 as their leader were disrupted.
As per the Kafka documentation, a partition should elect a new leader once its preferred leader fails.

[2020-06-01 14:02:25,029] ERROR [ReplicaManager broker=3] Error processing append operation on partition object-xxx-xxx-xx-na4-93 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition object-xxx-xxx-xx-na4-93 is [1], below required minimum [2]

The above error message was found in the Kafka logs.
The "object-xxx-xxx-xx-na4" topic has 128 partitions; partition 93 has 3 replicas, distributed among broker-3, broker-2 and broker-4.
Broker-3 is the preferred leader.
When broker-3 failed, leadership should have moved to one of broker-2 or broker-4, but it didn't happen.
As per the error message, whenever the leader fails, Kafka reports that only one in-sync replica is available.
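
The current leader/ISR assignment for that partition can be checked with something like the following (the ZooKeeper address is a placeholder for our environment):

  # prints Leader, Replicas and Isr for every partition of the topic
  bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic object-xxx-xxx-xx-na4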

Please help us find the root cause of why a new leader was not elected.


Thanks,
Sudheer


Re: Kafka partitions replication issue

Posted by Ricardo Ferreira <ri...@riferrei.com>.
Karnam,

I think the combination of preferred leaders and `auto.leader.rebalance.enable`,
along with the hardware issue in broker-3, might be giving you the
opposite of the effect you are expecting. If broker-3 happens to be the
preferred leader for a given partition (because it was the broker that
hosted the original leader when the partition was created), then Kafka
will keep trying to move leadership of that partition back to broker-3
-- but, as you say, the broker is having hardware failures, so that
attempt will fail.
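
For reference, these are the broker settings that drive that automatic
preferred-leader election (shown with their Kafka defaults; just a
sketch, your values may differ):

  # server.properties -- settings involved in automatic preferred-leader
  # rebalancing (values shown are the Kafka defaults)
  auto.leader.rebalance.enable=true
  leader.imbalance.check.interval.seconds=300
  leader.imbalance.per.broker.percentage=10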

Here are some things you can try (a sketch of the first one is below):

- Move preferred leadership to another broker: reorder the replica list
with `bin/kafka-reassign-partitions.sh` so that a healthy broker comes
first, then run the `bin/kafka-preferred-replica-election.sh` tool.

- Decrease `min.insync.replicas` from 2 to 1 to allow producers and
replication to keep going.

- Enable unclean leader election, which allows out-of-sync replicas to
become leaders (but opens the door to data loss).

- Solve the hardware issue in broker-3 =)
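
A rough sketch of the first option, for partition 93 (the ZooKeeper
address, file names, and broker order below are placeholders -- adjust
them for your cluster; the replica listed first becomes the preferred
leader):

  # reassign.json -- list a healthy broker (here broker 2) first so it
  # becomes the preferred leader; broker-3 stays in the replica set
  {"version":1,
   "partitions":[
     {"topic":"object-xxx-xxx-xx-na4","partition":93,"replicas":[2,4,3]}
   ]}

  # apply the reassignment
  bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
    --reassignment-json-file reassign.json --execute

  # then trigger a preferred-replica election for that partition, where
  # election.json is {"partitions":[{"topic":"object-xxx-xxx-xx-na4","partition":93}]}
  bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 \
    --path-to-json-file election.json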

Nevertheless, it is never a good idea to keep automatic preferred-leader
election enabled if the cluster's health is not constantly monitored and
you are not willing to move preferred leaders around the cluster from
time to time. Keeping the cluster well balanced adds to the Ops
workload. This is why Confluent created the Auto Data Balancing feature
<https://docs.confluent.io/current/kafka/rebalancer/index.html>, which
automatically and continuously keeps partition leaders spread over the
cluster for you.

Thanks,

-- Ricardo

On 6/17/20 8:16 AM, Karnam, Sudheer wrote:
> [...]

Re: Kafka partitions replication issue

Posted by Peter Bukowinski <pm...@gmail.com>.
> On Jun 17, 2020, at 5:16 AM, Karnam, Sudheer <su...@hpe.com> wrote:
> [...]

Hi Sudheer,

What do you have `replica.lag.time.max.ms` set to for your cluster? Also, are your producers using `acks=-1`/`acks=all`? If the replica lag time is too short, or you are using `acks=1`, then it’s likely that when broker 3 failed, both followers for the partition you mention had not yet caught up with the leader, so the cluster was unable to meet the `min.insync.replicas` count of 2.
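
A quick way to check both (the paths below are placeholders for wherever your configs live):

  # static broker setting; the Kafka 2.3 default is 10000 ms when not set explicitly
  grep replica.lag.time.max.ms /path/to/server.properties

  # client-side producer setting; check whichever properties file your producers load
  grep acks /path/to/producer.properties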

You have a few choices. If you value topic availability over complete data integrity, you can set `min.insync.replicas=1` or set `unclean.leader.election.enable=true`. The former will keep a partition online with only one in-sync replica. The latter will allow a replica that hadn’t fully caught up to the old leader to become the new leader.
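
For example, as per-topic overrides with the stock tooling (the ZooKeeper address and topic name below are placeholders):

  # keep the partition writable with a single in-sync replica
  bin/kafka-configs.sh --zookeeper zk1:2181 --alter \
    --entity-type topics --entity-name object-xxx-xxx-xx-na4 \
    --add-config min.insync.replicas=1

  # allow an out-of-sync replica to take over leadership for this topic
  bin/kafka-configs.sh --zookeeper zk1:2181 --alter \
    --entity-type topics --entity-name object-xxx-xxx-xx-na4 \
    --add-config unclean.leader.election.enable=true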

I have both of these set in my environment since I have the luxury of not dealing with transactional data, and “best effort” delivery is sufficient for my needs. In practice, the amount of loss we see is an extremely small fraction of the total data pushed through Kafka, and it only occurs around broker failures.

—
Peter