Posted to dev@kafka.apache.org by Flavien Raynaud <fl...@yelp.com.INVALID> on 2020/08/18 09:54:13 UTC
Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time
Hi there,
Just a small nudge on this, as this happened once more at Yelp 😄
Was there any progress on this? If not, is there anything we can do to help?
Thank you,
Flavien
On 2020/02/13 11:34:14, Satish Duggana <s....@gmail.com> wrote:
> Hi Lucas,
> Thanks for looking into the KIP and providing your comments.
>
> Adding to what Harsha mentioned, I do not think there is a foolproof
> solution here for cases like pending requests in the request queue. We
> also thought about the option of relinquishing the leadership, but the
> followers might already be out of the ISR, which would result in
> offline partitions. This was added as a rejected alternative in the
> KIP.
> The broker should try its best to keep the followers (those sending
> fetch requests) in sync.
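>
> To make the rejected alternative concrete, here is a rough,
> illustrative Scala sketch (not broker code) of why resigning after the
> ISR has already shrunk would result in an offline partition when
> unclean leader election is disabled:
>
> object OfflinePartitionSketch {
>   // Illustrative state: three replicas, ISR already shrunk to the
>   // failing leader b1, unclean leader election disabled.
>   val isr = Set("b1")
>   val uncleanLeaderElectionEnable = false
>
>   // With unclean election disabled, only ISR members are eligible.
>   def electLeader(liveBrokers: Seq[String]): Option[String] =
>     liveBrokers.find(isr.contains)
>       .orElse(if (uncleanLeaderElectionEnable) liveBrokers.headOption else None)
>
>   def main(args: Array[String]): Unit = {
>     // b1 relinquishes leadership (or dies); b2 and b3 are alive but
>     // out of the ISR, so no eligible leader remains.
>     println(electLeader(liveBrokers = Seq("b2", "b3"))) // None -> offline partition
>   }
> }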
>
> ~Satish.
>
> On Tue, Feb 11, 2020 at 11:45 PM Harsha Chintalapani <ka...@harsha.io> wrote:
> >
> > Hi Lucas,
> > Yes, the case you mentioned is true. I do understand KIP-501
> > might not fully solve this particular use case, where there might be
> > blocked fetch requests. But the issue we noticed multiple times, and
> > continue to notice, is the following (a rough sketch in code follows
> > the list):
> > 1. Fetch request comes from a follower
> > 2. Leader tries to fetch data from disk, which takes longer than
> > replica.lag.time.max.ms
> > 3. The async thread on the leader side which checks the ISR marks the
> > follower who sent the fetch request as not in the ISR
> > 4. Leader dies during this request due to disk errors, and now we
> > have offline partitions because the leader kicked healthy followers
> > out of the ISR
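> >
> > A rough Scala sketch of that sequence (names are illustrative, not
> > the actual broker code). The shrink check only looks at the time of
> > the last completed fetch, so a slow disk read on the leader counts
> > against the follower:
> >
> > object CurrentShrinkCheckSketch {
> >   val replicaLagTimeMaxMs = 10000L
> >
> >   // Today's check, roughly: a follower is lagging if it has not
> >   // caught up within replica.lag.time.max.ms.
> >   def isOutOfSync(nowMs: Long, lastCaughtUpTimeMs: Long): Boolean =
> >     nowMs - lastCaughtUpTimeMs > replicaLagTimeMaxMs
> >
> >   def main(args: Array[String]): Unit = {
> >     // Follower caught up at t=0, then the leader's disk read for the
> >     // next fetch stalls for 15s. The ISR check at t=15s blames the
> >     // follower, and if the leader then dies the partition is offline.
> >     println(isOutOfSync(nowMs = 15000L, lastCaughtUpTimeMs = 0L)) // true
> >   }
> > }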
> >
> > Instead of considering this as a disk issue, let's look at how we
> > maintain the ISR:
> >
> > 1. Currently we do not consider a follower healthy even when it is
> > able to send fetch requests
> > 2. The ISR is controlled by how healthy a broker is, i.e. if a fetch
> > takes longer than replica.lag.time.max.ms we mark followers out of
> > sync instead of relinquishing the leadership.
> >
> >
> > What we are proposing in this KIP is that we should look at the time
> > when a follower sends a fetch request, keep that as the basis for
> > marking a follower out of the ISR or keeping it in, and leave the
> > disk read time on the leader side out of this.
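> >
> > A rough sketch of the proposed check (field names are illustrative,
> > not from an actual patch):
> >
> > object ProposedShrinkCheckSketch {
> >   val replicaLagTimeMaxMs = 10000L
> >
> >   // Also record when the follower's fetch *request* arrived, so the
> >   // leader-side disk read time does not count against the follower.
> >   final case class FollowerState(lastCaughtUpTimeMs: Long,
> >                                  lastFetchRequestTimeMs: Long)
> >
> >   def isOutOfSync(nowMs: Long, state: FollowerState): Boolean = {
> >     val lastSignOfLife =
> >       math.max(state.lastCaughtUpTimeMs, state.lastFetchRequestTimeMs)
> >     nowMs - lastSignOfLife > replicaLagTimeMaxMs
> >   }
> >
> >   def main(args: Array[String]): Unit = {
> >     // Same 15s leader-side disk stall, but the fetch request arrived
> >     // at t=14s, so the follower stays in the ISR:
> >     val state = FollowerState(lastCaughtUpTimeMs = 0L,
> >                               lastFetchRequestTimeMs = 14000L)
> >     println(isOutOfSync(nowMs = 15000L, state)) // false
> >   }
> > }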
> >
> > Thanks,
> > Harsha
> >
> >
> >
> > On Mon, Feb 10, 2020 at 9:26 PM, Lucas Bradstreet <lu...@confluent.io> wrote:
> >
> > > Hi Harsha,
> > >
> > > Is the problem you'd like addressed the following?
> > >
> > > Assume 3 replicas: L, F1, and F2.
> > >
> > > 1. F1 and F2 are alive and sending fetch requests to L.
> > > 2. L starts encountering disk issues; any requests being processed by
> > > the request handler threads become blocked.
> > > 3. L's zookeeper connection is still alive, so it remains the leader
> > > for the partition.
> > > 4. Given that F1 and F2 have not successfully fetched, L shrinks the
> > > ISR to itself.
> > >
> > > While KIP-501 may help prevent a shrink in partitions where a replica
> > > fetch request has started processing, any fetch requests still sitting
> > > in the request queue will have no effect. Generally, when these
> > > slow/failing disk issues occur, all of the request handler threads end
> > > up blocked and requests pile up in the request queue. For example, all
> > > of the request handler threads may end up stuck in
> > > KafkaApis.handleProduceRequest handling produce requests, at which
> > > point all of the replica fetcher fetch requests remain queued in the
> > > request queue. If this happens, there will be no tracked fetch
> > > requests to prevent a shrink, as the sketch below illustrates.
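> > >
> > > A small sketch of the gap (hypothetical names, not broker code): if
> > > the follower's "sign of life" is only recorded when a handler thread
> > > dequeues the fetch, a fetch stuck behind blocked handlers records
> > > nothing:
> > >
> > > object QueuedFetchGapSketch {
> > >   import java.util.concurrent.LinkedBlockingQueue
> > >
> > >   final case class FetchRequest(follower: String)
> > >
> > >   val requestQueue = new LinkedBlockingQueue[FetchRequest]()
> > >   var lastFetchRequestTimeMs = Map.empty[String, Long]
> > >
> > >   // Only runs once a request handler thread picks the fetch up. If
> > >   // every handler thread is wedged in handleProduceRequest on a bad
> > >   // disk, it never runs for queued fetches.
> > >   def handleFetch(req: FetchRequest, nowMs: Long): Unit =
> > >     lastFetchRequestTimeMs += req.follower -> nowMs
> > >
> > >   def main(args: Array[String]): Unit = {
> > >     requestQueue.put(FetchRequest("follower-1"))
> > >     // No handler thread ever dequeues it, so no sign of life is
> > >     // tracked and the shrink proceeds anyway:
> > >     println(lastFetchRequestTimeMs.get("follower-1")) // None
> > >   }
> > > }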
> > >
> > > Solving this shrinking issue is tricky. It would be better if L
> > > resigned leadership when it enters a degraded state rather than
> > > merely avoiding a shrink. If L is no longer the leader in this
> > > situation, it will eventually become blocked fetching from the new
> > > leader, and the new leader will shrink the ISR, kicking out L.
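> > >
> > > One very rough, entirely hypothetical shape such a check could take,
> > > e.g. resigning when the oldest queued request has waited too long:
> > >
> > > object ResignWhenDegradedSketch {
> > >   val maxQueuedRequestAgeMs = 30000L
> > >
> > >   // Heuristic (ours, not Kafka's): the broker is degraded if the
> > >   // oldest queued request has waited longer than the threshold, a
> > >   // sign the request handler threads are wedged.
> > >   def isDegraded(nowMs: Long, oldestEnqueueMs: Option[Long]): Boolean =
> > >     oldestEnqueueMs.exists(nowMs - _ > maxQueuedRequestAgeMs)
> > >
> > >   def main(args: Array[String]): Unit = {
> > >     if (isDegraded(nowMs = 60000L, oldestEnqueueMs = Some(0L)))
> > >       println("resign leadership for all led partitions")
> > >   }
> > > }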
> > >
> > > Cheers,
> > >
> > > Lucas
> > >
>