Posted to dev@kafka.apache.org by Flavien Raynaud <fl...@yelp.com.INVALID> on 2020/08/18 09:54:13 UTC
Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time
Hi there,
Just a small nudge on this, as this happened once more at Yelp 😄
Was there any progress on this? If not, is there anything we can do to help?
Thank you,
Flavien
On 2020/02/13 11:34:14, Satish Duggana <s....@gmail.com> wrote:
> Hi Lucas,
> Thanks for looking into the KIP and providing your comments.
>
> Adding to what Harsha mentioned, I do not think there is a foolproof
> solution here for cases like pending requests in the request queue. We
> also thought about the option of relinquishing the leadership, but the
> followers might already be out of the ISR, which would result in
> offline partitions. This was added as a rejected alternative in the
> KIP.
> The broker should try its best to keep the followers (those sending
> fetch requests) in sync.
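>
> To make the rejected alternative concrete, here is a rough,
> illustrative Scala sketch (not broker code) of why resigning after the
> ISR has already shrunk would result in an offline partition when
> unclean leader election is disabled:
>
> object OfflinePartitionSketch {
>   // Illustrative state: three replicas, ISR already shrunk to the
>   // failing leader b1, unclean leader election disabled.
>   val isr = Set("b1")
>   val uncleanLeaderElectionEnable = false
>
>   // With unclean election disabled, only ISR members are eligible.
>   def electLeader(liveBrokers: Seq[String]): Option[String] =
>     liveBrokers.find(isr.contains)
>       .orElse(if (uncleanLeaderElectionEnable) liveBrokers.headOption else None)
>
>   def main(args: Array[String]): Unit = {
>     // b1 relinquishes leadership (or dies); b2 and b3 are alive but
>     // out of the ISR, so no eligible leader remains.
>     println(electLeader(liveBrokers = Seq("b2", "b3"))) // None -> offline partition
>   }
> }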
>
> ~Satish.
>
> On Tue, Feb 11, 2020 at 11:45 PM Harsha Chintalapani <ka...@harsha.io> wrote:
> >
> > Hi Lucas,
> > Yes, the case you mentioned is true. I do understand KIP-501
> > might not fully solve this particular use case, where there might be
> > blocked fetch requests. But the issue we noticed multiple times, and
> > continue to notice, is the following (a rough sketch in code follows
> > the list):
> > 1. Fetch request comes from a follower
> > 2. Leader tries to fetch data from disk, which takes longer than
> > replica.lag.time.max.ms
> > 3. The async thread on the leader side which checks the ISR marks the
> > follower who sent the fetch request as not in the ISR
> > 4. Leader dies during this request due to disk errors, and now we
> > have offline partitions because the leader kicked healthy followers
> > out of the ISR
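> >
> > A rough Scala sketch of that sequence (names are illustrative, not
> > the actual broker code). The shrink check only looks at the time of
> > the last completed fetch, so a slow disk read on the leader counts
> > against the follower:
> >
> > object CurrentShrinkCheckSketch {
> >   val replicaLagTimeMaxMs = 10000L
> >
> >   // Today's check, roughly: a follower is lagging if it has not
> >   // caught up within replica.lag.time.max.ms.
> >   def isOutOfSync(nowMs: Long, lastCaughtUpTimeMs: Long): Boolean =
> >     nowMs - lastCaughtUpTimeMs > replicaLagTimeMaxMs
> >
> >   def main(args: Array[String]): Unit = {
> >     // Follower caught up at t=0, then the leader's disk read for the
> >     // next fetch stalls for 15s. The ISR check at t=15s blames the
> >     // follower, and if the leader then dies the partition is offline.
> >     println(isOutOfSync(nowMs = 15000L, lastCaughtUpTimeMs = 0L)) // true
> >   }
> > }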
> >
> > Instead of considering this as a disk issue, let's look at how we
> > maintain the ISR:
> >
> > 1. Currently we do not consider a follower healthy even when it is
> > able to send fetch requests
> > 2. The ISR is controlled by how healthy a broker is, i.e. if a fetch
> > takes longer than replica.lag.time.max.ms we mark followers out of
> > sync instead of relinquishing the leadership.
> >
> >
> > What we are proposing in this KIP is that we should look at the time
> > when a follower sends a fetch request, keep that as the basis for
> > marking a follower out of the ISR or keeping it in, and leave the
> > disk read time on the leader side out of this.
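> >
> > A rough sketch of the proposed check (field names are illustrative,
> > not from an actual patch):
> >
> > object ProposedShrinkCheckSketch {
> >   val replicaLagTimeMaxMs = 10000L
> >
> >   // Also record when the follower's fetch *request* arrived, so the
> >   // leader-side disk read time does not count against the follower.
> >   final case class FollowerState(lastCaughtUpTimeMs: Long,
> >                                  lastFetchRequestTimeMs: Long)
> >
> >   def isOutOfSync(nowMs: Long, state: FollowerState): Boolean = {
> >     val lastSignOfLife =
> >       math.max(state.lastCaughtUpTimeMs, state.lastFetchRequestTimeMs)
> >     nowMs - lastSignOfLife > replicaLagTimeMaxMs
> >   }
> >
> >   def main(args: Array[String]): Unit = {
> >     // Same 15s leader-side disk stall, but the fetch request arrived
> >     // at t=14s, so the follower stays in the ISR:
> >     val state = FollowerState(lastCaughtUpTimeMs = 0L,
> >                               lastFetchRequestTimeMs = 14000L)
> >     println(isOutOfSync(nowMs = 15000L, state)) // false
> >   }
> > }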
> >
> > Thanks,
> > Harsha
> >
> >
> >
> > On Mon, Feb 10, 2020 at 9:26 PM, Lucas Bradstreet <lu...@confluent.io> wrote:
> >
> > > Hi Harsha,
> > >
> > > Is the problem you'd like addressed the following?
> > >
> > > Assume 3 replicas: L, F1, and F2.
> > >
> > > 1. F1 and F2 are alive and sending fetch requests to L.
> > > 2. L starts encountering disk issues; any requests being processed by
> > > the request handler threads become blocked.
> > > 3. L's zookeeper connection is still alive, so it remains the leader
> > > for the partition.
> > > 4. Given that F1 and F2 have not successfully fetched, L shrinks the
> > > ISR to itself.
> > >
> > > While KIP-501 may help prevent a shrink in partitions where a replica
> > > fetch request has started processing, any fetch requests still sitting
> > > in the request queue will have no effect. Generally, when these
> > > slow/failing disk issues occur, all of the request handler threads end
> > > up blocked and requests pile up in the request queue. For example, all
> > > of the request handler threads may end up stuck in
> > > KafkaApis.handleProduceRequest handling produce requests, at which
> > > point all of the replica fetcher fetch requests remain queued in the
> > > request queue. If this happens, there will be no tracked fetch
> > > requests to prevent a shrink, as the sketch below illustrates.
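> > >
> > > A small sketch of the gap (hypothetical names, not broker code): if
> > > the follower's "sign of life" is only recorded when a handler thread
> > > dequeues the fetch, a fetch stuck behind blocked handlers records
> > > nothing:
> > >
> > > object QueuedFetchGapSketch {
> > >   import java.util.concurrent.LinkedBlockingQueue
> > >
> > >   final case class FetchRequest(follower: String)
> > >
> > >   val requestQueue = new LinkedBlockingQueue[FetchRequest]()
> > >   var lastFetchRequestTimeMs = Map.empty[String, Long]
> > >
> > >   // Only runs once a request handler thread picks the fetch up. If
> > >   // every handler thread is wedged in handleProduceRequest on a bad
> > >   // disk, it never runs for queued fetches.
> > >   def handleFetch(req: FetchRequest, nowMs: Long): Unit =
> > >     lastFetchRequestTimeMs += req.follower -> nowMs
> > >
> > >   def main(args: Array[String]): Unit = {
> > >     requestQueue.put(FetchRequest("follower-1"))
> > >     // No handler thread ever dequeues it, so no sign of life is
> > >     // tracked and the shrink proceeds anyway:
> > >     println(lastFetchRequestTimeMs.get("follower-1")) // None
> > >   }
> > > }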
> > >
> > > Solving this shrinking issue is tricky. It would be better if L
> > > resigned leadership when it enters a degraded state rather than
> > > merely avoiding a shrink. If L is no longer the leader in this
> > > situation, it will eventually become blocked fetching from the new
> > > leader, and the new leader will shrink the ISR, kicking out L.
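> > >
> > > One very rough, entirely hypothetical shape such a check could take,
> > > e.g. resigning when the oldest queued request has waited too long:
> > >
> > > object ResignWhenDegradedSketch {
> > >   val maxQueuedRequestAgeMs = 30000L
> > >
> > >   // Heuristic (ours, not Kafka's): the broker is degraded if the
> > >   // oldest queued request has waited longer than the threshold, a
> > >   // sign the request handler threads are wedged.
> > >   def isDegraded(nowMs: Long, oldestEnqueueMs: Option[Long]): Boolean =
> > >     oldestEnqueueMs.exists(nowMs - _ > maxQueuedRequestAgeMs)
> > >
> > >   def main(args: Array[String]): Unit = {
> > >     if (isDegraded(nowMs = 60000L, oldestEnqueueMs = Some(0L)))
> > >       println("resign leadership for all led partitions")
> > >   }
> > > }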
> > >
> > > Cheers,
> > >
> > > Lucas
> > >
>