Posted to users@kafka.apache.org by Koushik Chitta <kc...@microsoft.com.INVALID> on 2019/09/01 18:42:00 UTC

RE: Curious case of under min ISR before offline partition

I did not see any abnormalities in other metrics (throughput, message size, etc.).
This is a one-off case I've encountered; I'm trying to reproduce it to better understand any race conditions in this flow.
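
For reference, this is roughly how I read the ISR metrics back over JMX.
It is a minimal sketch: it assumes JMX is enabled on the broker, the
host/port are placeholders, and the MBeans are the standard kafka.server
ReplicaManager ones.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class IsrMetricsCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; assumes the broker was started with JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Gauge: partitions whose ISR size is below min.insync.replicas.
            Object underMinIsr = mbsc.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount"),
                    "Value");

            // Meter: rate of ISR shrinks performed by this broker.
            Object isrShrinks = mbsc.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=IsrShrinksPerSec"),
                    "OneMinuteRate");

            System.out.println("UnderMinIsrPartitionCount = " + underMinIsr);
            System.out.println("IsrShrinksPerSec (1-min rate) = " + isrShrinks);
        } finally {
            connector.close();
        }
    }
}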

Thanks,
Koushik
-----Original Message-----
From: Lisheng Wang <wa...@gmail.com> 
Sent: Wednesday, August 28, 2019 7:20 PM
To: users@kafka.apache.org
Subject: Re: Curious case of under min ISR before offline partition

Hi Koushik

It seems something prevented the follower from replicating in time, so it was kicked out of the ISR.

May I know whether anything could have led to that issue, e.g. throughput too high for replication to complete in time, or records that could not be replicated to the follower because some max-size configs were not set properly?
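
If it helps, here is a minimal sketch for reading those size-related
configs back with the Java AdminClient. The bootstrap address is a
placeholder, and the broker id / topic name are just examples from your
mail; adjust as needed.

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class SizeConfigCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Broker id "17" and topic "Test.Request" are examples from the thread.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "17");
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "Test.Request");

            admin.describeConfigs(Arrays.asList(broker, topic)).all().get()
                 .forEach((resource, config) -> {
                     print(resource, config, "replica.fetch.max.bytes"); // broker-side fetch limit
                     print(resource, config, "message.max.bytes");       // broker-side message limit
                     print(resource, config, "max.message.bytes");       // topic-level override
                 });
        }
    }

    private static void print(ConfigResource resource, Config config, String name) {
        if (config.get(name) != null) {
            System.out.println(resource + " " + name + " = " + config.get(name).value());
        }
    }
}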

Best,
Lisheng


Koushik Chitta <kc...@microsoft.com.invalid> wrote on Thu, Aug 29, 2019, 2:03 AM:

> Hi All,
>
> We had a topic partition (replication factor 5) go offline when the leader
> of the partition went down. Below is some analysis.
>
> Kafka server - 1.1, relevant config (replica.fetch.wait.max.ms = 500,
> replica.fetch.min.bytes = 50000, replica.lag.time.max.ms = 10000)
> Topic partition (Test.Request-3) - replication 5, replica list [17,
> 425222741, 425222681, 423809494, 425222740], unclean leader election
> = false
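>
> (For reference, a minimal sketch of reading the replica/ISR assignment
> above back with the Java AdminClient; only the bootstrap address is a
> placeholder.)
>
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.TopicDescription;
>
> public class ReplicaStateCheck {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9092");
>         try (AdminClient admin = AdminClient.create(props)) {
>             TopicDescription desc = admin.describeTopics(Collections.singletonList("Test.Request"))
>                                          .all().get().get("Test.Request");
>             // Prints leader, replica list and current ISR for every partition of the topic.
>             desc.partitions().forEach(p ->
>                     System.out.println("partition " + p.partition()
>                             + " leader=" + p.leader()
>                             + " replicas=" + p.replicas()
>                             + " isr=" + p.isr()));
>         }
>     }
> }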
>
> Sequence of events.
>
>
>   1.  Leader (425222740) of the partition is down.
>
>
>
>   2.  Controller detects the offline broker.
>
> [2019-08-26 13:00:22,037] INFO [Controller id=423809469] Newly added
> brokers: , deleted brokers: 425222740, all live brokers:
> 15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,423809443,423809444,42380
> 9450,423809458,423809463,423809464,423809469,423809474,423809494,42521
> 8574,425222675,425222681,425222741,425222745
> (kafka.controller.KafkaController)
>
>
>
>   3.  Controller sends an UpdateMetadata request, but observes only the
> leader in the ISR. Please note that none of the ISR metrics
> ("underminisr", "isrshrink") were captured before or after the partition went offline.
> [2019-08-26 13:00:05,333] TRACE [Controller id=423809469 epoch=206] 
> Sending UpdateMetadata request PartitionState(controllerEpoch=204,
> leader=425222740, leaderEpoch=435, isr=[425222740], zkVersion=804, 
> replicas=[425222740, 423809494, 17, 425222741, 425222681],
> offlineReplicas=[]) to brokers Set(425222740, 423809458, 425222741, 
> 24, 425218574, 25, 423809474, 26, 27, 19, 20, 21, 22, 425222681, 23, 
> 423809450, 15, 16, 423809443, 423809494, 425222675, 17, 423809444, 18, 
> 423809469, 425222745, 423809463, 28, 423809464, 29) for Test.Request-3
> (state.change.logger)
>
>
>   4.  New leader election fails under the
> OfflinePartitionLeaderElectionStrategy since no replicas are in the ISR list.
>
>
>
>   5.  All replicas see replica fetch request errors since they cannot
> connect to the leader.
>
> Any pointers on why the ISR list shrank just before the leader
> went down, forcing the partition to go offline?
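>
> (For context, a simplified sketch of the check the leader's isr-expiration
> task applies in 1.1: a follower is dropped from the ISR once it has not been
> fully caught up for longer than replica.lag.time.max.ms, and the task runs
> every replica.lag.time.max.ms / 2. The class and names below are
> illustrative, not the actual broker code.)
>
> public class IsrShrinkCheck {
>     // replica.lag.time.max.ms from the config above.
>     static final long REPLICA_LAG_TIME_MAX_MS = 10_000L;
>
>     // lastCaughtUpTimeMs: the last time this follower's fetch reached the leader's log end offset.
>     static boolean isOutOfSync(long nowMs, long lastCaughtUpTimeMs) {
>         return nowMs - lastCaughtUpTimeMs > REPLICA_LAG_TIME_MAX_MS;
>     }
>
>     public static void main(String[] args) {
>         long now = System.currentTimeMillis();
>         // Last caught up 12s ago: shrunk out of the ISR on the next pass.
>         System.out.println(isOutOfSync(now, now - 12_000L)); // true
>         // Last caught up 3s ago: stays in the ISR.
>         System.out.println(isOutOfSync(now, now - 3_000L));  // false
>     }
> }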
>
> Thanks,
> Koushik
>