Posted to dev@kafka.apache.org by Harsha Chintalapani <ka...@harsha.io> on 2019/12/05 05:08:36 UTC

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Hi Jason,
         As Satish said, just increasing the replica max lag will not work in
this case. Just before a disk dies, reads become really slow and it's hard to
estimate by how much, as the range we noticed is pretty wide. Overall it
doesn't make sense to knock good replicas out just because a leader is
slower in processing reads or serving fetch requests, which may be due
to disk issues in this case but could be other issues as well. I think this
KIP addresses all of these issues in general.
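
To make the current behavior concrete, the shrink decision today is driven
only by how recently a follower caught up to the leader's log end, roughly as
in the following simplified Scala sketch (illustrative scaffolding, not the
actual Partition/ReplicaManager code):

// Simplified sketch of the existing check driven by replica.lag.time.max.ms
// (whose default KIP-537 raised from 10s to 30s). lastCaughtUpTimeMs only
// advances after the leader has processed the follower's fetch, so a leader
// stalled on a failing disk pushes perfectly healthy followers past the limit.
class FollowerState(@volatile var lastCaughtUpTimeMs: Long)

def isOutOfSync(f: FollowerState, nowMs: Long, replicaLagTimeMaxMs: Long): Boolean =
  nowMs - f.lastCaughtUpTimeMs > replicaLagTimeMaxMs

def shrinkIsr(isr: Set[Int], leaderId: Int, states: Map[Int, FollowerState],
              nowMs: Long, replicaLagTimeMaxMs: Long): Set[Int] =
  isr.filter { id =>
    id == leaderId || states.get(id).exists(s => !isOutOfSync(s, nowMs, replicaLagTimeMaxMs))
  }

Raising replica.lag.time.max.ms only stretches that window; the clock still
keeps running while the leader itself is the slow party.
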
         Do you still have questions on the current approach? If not, we can
take it to a vote.
Thanks,
Harsha


On Mon, Nov 18, 2019 at 7:05 PM, Satish Duggana <sa...@gmail.com>
wrote:

> Hi Jason,
> Thanks for looking into the KIP. Apologies for my late reply. Increasing
> the replica max lag to 30-45 secs did not help, as we observed that a few
> fetch requests took more than 1-2 minutes. We do not want to increase it
> further, as that raises the upper bound on commit latency and we have
> strict SLAs on end-to-end (producer to consumer) latency for some of the
> clusters. This proposal improves the availability of partitions when
> followers are trying their best to stay in sync even when leaders are
> slow in processing those requests.
> I have updated the KIP to have a single config for backward compatibility,
> and I think this config is more comprehensible than the earlier ones. But
> I believe there is no need for a config at all, because the proposal in
> the KIP is an enhancement to the existing behavior. Please let me know
> your comments.
>
> Thanks,
> Satish.
>
> On Thu, Nov 14, 2019 at 10:57 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> Hi Satish,
>
> Thanks for the KIP. I'm wondering how much of this problem can be
> addressed just by increasing the replication max lag? That was one of the
> purposes of KIP-537 (the default increased from 10s to 30s). Also, the new
> configurations seem quite low-level. I think they will be hard for users to
> understand (even after reading them a couple of times I'm not sure I
> understand them fully). I think if there's a way to improve this behavior without
> requiring any new configurations, it would be much more attractive.
>
> Best,
> Jason
>
> On Wed, Nov 6, 2019 at 8:14 AM Satish Duggana <sa...@gmail.com>
> wrote:
>
> Hi Dhruvil,
> Thanks for looking into the KIP.
>
> 10. I have an initial sketch of KIP-501 in commit [a], which shows how the
> pending fetch requests are tracked. Tracking is not done in
> Partition#readRecords because, if reading any of the partitions takes
> longer, we do not want any of the replicas covered by that fetch request
> to go out of sync.
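>
> Roughly, the idea is to register the follower's fetch as pending before any
> log reads start and clear it once the response is done. A minimal sketch
> with hypothetical names (not the exact code in commit [a]):
>
> import java.util.concurrent.ConcurrentHashMap
>
> // Register the follower's fetch before reading the log, so a slow read cannot
> // make the follower look dead. Fetches for different followers may be handled
> // concurrently, hence the thread-safe map.
> class PendingFetchTracker {
>   // followerId -> time (ms) at which its fetch request started being processed
>   private val pending = new ConcurrentHashMap[Int, java.lang.Long]()
>
>   def fetchStarted(followerId: Int, nowMs: Long): Unit = pending.put(followerId, nowMs)
>   def fetchFinished(followerId: Int): Unit = pending.remove(followerId)
>   def hasPendingFetch(followerId: Int): Boolean = pending.containsKey(followerId)
> }
>
> The ISR shrink check can then skip any follower that has a fetch in flight,
> instead of looking only at when it last caught up.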
>
> 11. I think the `Replica` class should be thread-safe to handle the remote
> scenario of concurrent requests running for the same follower replica. Or
> I may be missing something here. This is a separate issue from KIP-501; I
> will file a separate JIRA to discuss it.
>
> a - https://github.com/satishd/kafka/commit/c69b525abe8f6aad5059236076a003cdec4c4eb7
>
> Thanks,
> Satish.
>
> On Tue, Oct 29, 2019 at 10:57 AM Dhruvil Shah <dh...@confluent.io>
> wrote:
>
> Hi Satish,
>
> Thanks for the KIP, this seems very useful. Could you elaborate on how
> pending fetch requests are tracked?
>
> Thanks,
> Dhruvil
>
> On Mon, Oct 28, 2019 at 9:43 PM Satish Duggana <satish.duggana@gmail.com>
> wrote:
>
> Hi All,
> I wrote a short KIP about avoiding out-of-sync or offline partitions when
> follower fetch requests are not processed in time by the leader replica.
> KIP-501 is located at https://s.apache.org/jhbpn
>
> Please take a look, I would like to hear your feedback and suggestions.
>
> JIRA: https://issues.apache.org/jira/browse/KAFKA-8733
>
> Thanks,
> Satish.
>
>

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Posted by Satish Duggana <sa...@gmail.com>.
Hi Jun,
I updated KIP-501 with more details. Please take a look and
provide your comments.

This issue occurred several times in multiple production
environments (Uber, Yelp, Twitter, etc.).

Thanks,
Satish.


On Thu, 13 Feb 2020 at 17:04, Satish Duggana <sa...@gmail.com> wrote:
>
> Hi Lucas,
> Thanks for looking into the KIP and providing your comments.
>
> Adding to what Harsha mentioned, I do not think there is a foolproof
> solution here for cases like pending requests sitting in the request
> queue. We also thought about the option of relinquishing the
> leadership, but the followers might already be out of the ISR, which
> would result in offline partitions. This was added as a rejected
> alternative in the KIP.
> The broker should try its best to keep the followers that are sending
> fetch requests in sync.
>
> ~Satish.
>
> On Tue, Feb 11, 2020 at 11:45 PM Harsha Chintalapani <ka...@harsha.io> wrote:
> >
> > Hi Lucas,
> >            Yes the case you mentioned is true. I do understand KIP-501
> > might not fully solve this particular use case where there might blocked
> > fetch requests. But the issue we noticed multiple times  and continue to
> > notice is
> >           1. Fetch request comes from Follower
> >           2. Leader tries to fetch data from disk which takes longer than
> > replica.lag.time.max.ms
> >          3. Async thread on leader side which checks the ISR marks the
> > follower who sent a fetch request as not in ISR
> >          4. Leader dies during this request due to disk errors and now we
> > have offline partitions because Leader kicked out healthy followers out of
> > ISR
> >
> > Instead of considering this from a disk issue. Lets look at how we maintain
> > the ISR
> >
> >    1. Currently we do not consider a follower as healthy even when its able
> >    to send fetch requests
> >    2. ISR is controlled on how healthy a broker is, ie if it takes longer
> >    than replica.lag.time.max.ms we mark followers out of sync instead of
> >    relinquishing the leadership.
> >
> >
> > What we are proposing in this KIP, we should look at the time when a
> > follower sends a fetch request and keep that as basis for marking a
> > follower out of ISR or to keep it in the ISR and leave the disk read time
> > on leader side out of this.
> >
> > Thanks,
> > Harsha
> >
> >
> >
> > On Mon, Feb 10, 2020 at 9:26 PM, Lucas Bradstreet <lu...@confluent.io>
> > wrote:
> >
> > > Hi Harsha,
> > >
> > > Is the problem you'd like addressed the following?
> > >
> > > Assume 3 replicas, L and F1 and F2.
> > >
> > > 1. F1 and F2 are alive and sending fetch requests to L.
> > > 2. L starts encountering disk issues, any requests being processed by the
> > > request handler threads become blocked.
> > > 3. L's zookeeper connection is still alive so it remains the leader for
> > > the partition.
> > > 4. Given that F1 and F2 have not successfully fetched, L shrinks the ISR
> > > to itself.
> > >
> > > While KIP-501 may help prevent a shrink in partitions where a replica
> > > fetch request has started processing, any fetch requests in the request
> > > queue will have no effect. Generally when these slow/failing disk issues
> > > occur, all of the request handler threads end up blocked and requests queue
> > > up in the request queue. For example, all of the request handler threads
> > > may end up stuck in
> > > KafkaApis.handleProduceRequest handling produce requests, at which point
> > > all of the replica fetcher fetch requests remain queued in the request
> > > queue. If this happens, there will be no tracked fetch requests to prevent
> > > a shrink.
> > >
> > > Solving this shrinking issue is tricky. It would be better if L resigns
> > > leadership when it enters a degraded state rather than avoiding a shrink.
> > > If L is no longer the leader in this situation, it will eventually become
> > > blocked fetching from the new leader and the new leader will shrink the
> > > ISR, kicking out L.
> > >
> > > Cheers,
> > >
> > > Lucas
> > >

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Posted by Satish Duggana <sa...@gmail.com>.
Hi Lucas,
Thanks for looking into the KIP and providing your comments.

Adding to what Harsha mentioned, I do not think there is a foolproof
solution here for cases like pending requests sitting in the request
queue. We also thought about the option of relinquishing the
leadership, but the followers might already be out of the ISR, which
would result in offline partitions. This was added as a rejected
alternative in the KIP.
The broker should try its best to keep the followers that are sending
fetch requests in sync.
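
To make the rejected alternative concrete: once the ISR has already shrunk to
just the slow leader, stepping down cannot produce a healthy leader without an
unclean election, which is exactly the offline-partition outcome we want to
avoid. A toy sketch of that election constraint (illustrative only, not the
controller's actual code):

// Toy illustration: if the ISR has shrunk to the failing leader alone, there is
// no clean candidate left, so resigning leadership takes the partition offline
// unless unclean leader election is enabled (at the risk of data loss).
def electLeader(isr: Set[Int], liveReplicas: Seq[Int], failedLeader: Int,
                uncleanElectionEnabled: Boolean): Option[Int] = {
  val cleanCandidates = liveReplicas.filter(r => r != failedLeader && isr.contains(r))
  cleanCandidates.headOption.orElse {
    if (uncleanElectionEnabled) liveReplicas.find(_ != failedLeader) // may lose data
    else None                                                        // partition stays offline
  }
}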

~Satish.

On Tue, Feb 11, 2020 at 11:45 PM Harsha Chintalapani <ka...@harsha.io> wrote:
>
> Hi Lucas,
>            Yes the case you mentioned is true. I do understand KIP-501
> might not fully solve this particular use case where there might blocked
> fetch requests. But the issue we noticed multiple times  and continue to
> notice is
>           1. Fetch request comes from Follower
>           2. Leader tries to fetch data from disk which takes longer than
> replica.lag.time.max.ms
>          3. Async thread on leader side which checks the ISR marks the
> follower who sent a fetch request as not in ISR
>          4. Leader dies during this request due to disk errors and now we
> have offline partitions because Leader kicked out healthy followers out of
> ISR
>
> Instead of considering this from a disk issue. Lets look at how we maintain
> the ISR
>
>    1. Currently we do not consider a follower as healthy even when its able
>    to send fetch requests
>    2. ISR is controlled on how healthy a broker is, ie if it takes longer
>    than replica.lag.time.max.ms we mark followers out of sync instead of
>    relinquishing the leadership.
>
>
> What we are proposing in this KIP, we should look at the time when a
> follower sends a fetch request and keep that as basis for marking a
> follower out of ISR or to keep it in the ISR and leave the disk read time
> on leader side out of this.
>
> Thanks,
> Harsha
>
>
>
> On Mon, Feb 10, 2020 at 9:26 PM, Lucas Bradstreet <lu...@confluent.io>
> wrote:
>
> > Hi Harsha,
> >
> > Is the problem you'd like addressed the following?
> >
> > Assume 3 replicas, L and F1 and F2.
> >
> > 1. F1 and F2 are alive and sending fetch requests to L.
> > 2. L starts encountering disk issues, any requests being processed by the
> > request handler threads become blocked.
> > 3. L's zookeeper connection is still alive so it remains the leader for
> > the partition.
> > 4. Given that F1 and F2 have not successfully fetched, L shrinks the ISR
> > to itself.
> >
> > While KIP-501 may help prevent a shrink in partitions where a replica
> > fetch request has started processing, any fetch requests in the request
> > queue will have no effect. Generally when these slow/failing disk issues
> > occur, all of the request handler threads end up blocked and requests queue
> > up in the request queue. For example, all of the request handler threads
> > may end up stuck in
> > KafkaApis.handleProduceRequest handling produce requests, at which point
> > all of the replica fetcher fetch requests remain queued in the request
> > queue. If this happens, there will be no tracked fetch requests to prevent
> > a shrink.
> >
> > Solving this shrinking issue is tricky. It would be better if L resigns
> > leadership when it enters a degraded state rather than avoiding a shrink.
> > If L is no longer the leader in this situation, it will eventually become
> > blocked fetching from the new leader and the new leader will shrink the
> > ISR, kicking out L.
> >
> > Cheers,
> >
> > Lucas
> >

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Posted by Harsha Chintalapani <ka...@harsha.io>.
Hi Lucas,
           Yes, the case you mentioned is true. I do understand KIP-501
might not fully solve this particular use case where there might be blocked
fetch requests. But the issue we noticed multiple times, and continue to
notice, is:
          1. A fetch request comes from a follower.
          2. The leader tries to fetch data from disk, which takes longer
than replica.lag.time.max.ms.
          3. The async thread on the leader side which checks the ISR marks
the follower that sent the fetch request as not in the ISR.
          4. The leader dies during this request due to disk errors, and now
we have offline partitions because the leader kicked healthy followers out
of the ISR.

Instead of treating this purely as a disk issue, let's look at how we
maintain the ISR:

   1. Currently we do not consider a follower healthy even when it is able
   to send fetch requests.
   2. ISR membership effectively depends on how healthy the leader broker
   is, i.e. if serving a fetch takes longer than replica.lag.time.max.ms we
   mark followers out of sync instead of relinquishing the leadership.


What we are proposing in this KIP is that we should look at the time when a
follower sends a fetch request and use that as the basis for marking a
follower out of the ISR or keeping it in the ISR, leaving the leader-side
disk read time out of the decision.
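
Concretely, the check would key off the time the follower's fetch request
reached the leader rather than the time the leader finished serving it. A
rough sketch under that reading of the proposal (illustrative names, not the
KIP's exact code):

// Illustrative sketch: a follower stays in the ISR as long as its latest fetch
// request *arrived* recently, even if the leader is still slow reading from
// disk to serve it. (The KIP adds further conditions; this shows the core idea.)
class FollowerFetchState(@volatile var lastFetchRequestReceivedMs: Long,
                         @volatile var lastCaughtUpTimeMs: Long)

def shouldStayInIsr(f: FollowerFetchState, nowMs: Long, replicaLagTimeMaxMs: Long): Boolean = {
  val caughtUpRecently = nowMs - f.lastCaughtUpTimeMs <= replicaLagTimeMaxMs
  val fetchedRecently  = nowMs - f.lastFetchRequestReceivedMs <= replicaLagTimeMaxMs
  // Today only caughtUpRecently is checked; the proposal also keeps followers
  // that are actively sending fetch requests, so a slow leader-side read does
  // not knock a healthy follower out of the ISR.
  caughtUpRecently || fetchedRecently
}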

Thanks,
Harsha



On Mon, Feb 10, 2020 at 9:26 PM, Lucas Bradstreet <lu...@confluent.io>
wrote:

> Hi Harsha,
>
> Is the problem you'd like addressed the following?
>
> Assume 3 replicas, L and F1 and F2.
>
> 1. F1 and F2 are alive and sending fetch requests to L.
> 2. L starts encountering disk issues, any requests being processed by the
> request handler threads become blocked.
> 3. L's zookeeper connection is still alive so it remains the leader for
> the partition.
> 4. Given that F1 and F2 have not successfully fetched, L shrinks the ISR
> to itself.
>
> While KIP-501 may help prevent a shrink in partitions where a replica
> fetch request has started processing, any fetch requests in the request
> queue will have no effect. Generally when these slow/failing disk issues
> occur, all of the request handler threads end up blocked and requests queue
> up in the request queue. For example, all of the request handler threads
> may end up stuck in
> KafkaApis.handleProduceRequest handling produce requests, at which point
> all of the replica fetcher fetch requests remain queued in the request
> queue. If this happens, there will be no tracked fetch requests to prevent
> a shrink.
>
> Solving this shrinking issue is tricky. It would be better if L resigns
> leadership when it enters a degraded state rather than avoiding a shrink.
> If L is no longer the leader in this situation, it will eventually become
> blocked fetching from the new leader and the new leader will shrink the
> ISR, kicking out L.
>
> Cheers,
>
> Lucas
>

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Posted by Lucas Bradstreet <lu...@confluent.io>.
Hi Harsha,

Is the problem you'd like addressed the following?

Assume 3 replicas, L and F1 and F2.

1. F1 and F2 are alive and sending fetch requests to L.
2. L starts encountering disk issues, and any requests being processed by
the request handler threads become blocked.
3. L's zookeeper connection is still alive so it remains the leader
for the partition.
4. Given that F1 and F2 have not successfully fetched, L shrinks the
ISR to itself.

While KIP-501 may help prevent a shrink in partitions where a replica
fetch request has started processing, any fetch requests in the
request queue will have no effect. Generally when these slow/failing
disk issues occur, all of the request handler threads end up blocked
and requests queue up in the request queue. For example, all of the
request handler threads may end up stuck in
KafkaApis.handleProduceRequest handling produce requests, at which
point all of the replica fetcher fetch requests remain queued in the
request queue. If this happens, there will be no tracked fetch
requests to prevent a shrink.
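
To illustrate the gap: any pending-fetch tracking keyed off request handling
only begins once a handler thread dequeues the fetch, so a fetch stuck behind
blocked produce requests is invisible to it. A rough sketch (hypothetical
names, not the broker's actual RequestChannel/KafkaApis code):

import java.util.concurrent.LinkedBlockingQueue

sealed trait Request
case class ProduceRequest(payload: Array[Byte]) extends Request
case class FollowerFetch(followerId: Int) extends Request

// Minimal stand-in for whatever structure tracks in-flight follower fetches.
trait PendingFetches {
  def fetchStarted(followerId: Int, nowMs: Long): Unit
  def fetchFinished(followerId: Int): Unit
}

class RequestHandler(queue: LinkedBlockingQueue[Request], pending: PendingFetches) {
  def handleOne(nowMs: Long): Unit = queue.take() match {
    case ProduceRequest(_) =>
      // If this write blocks on a failing disk, the handler thread is stuck here
      // and never returns to the queue.
    case FollowerFetch(followerId) =>
      pending.fetchStarted(followerId, nowMs) // only reached once the fetch is dequeued
      try { /* read the log and build the response */ }
      finally pending.fetchFinished(followerId)
  }
}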

Solving this shrinking issue is tricky. It would be better if L
resigns leadership when it enters a degraded state rather than
avoiding a shrink. If L is no longer the leader in this situation, it
will eventually become blocked fetching from the new leader and the
new leader will shrink the ISR, kicking out L.

Cheers,

Lucas

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Posted by Harsha Ch <ha...@gmail.com>.
Hi Jason & Jun,

                 Do you have any feedback on the KIP, or is it OK to take it to a vote? It's good to have this config in Kafka to address the disk failure scenarios described in the KIP.

Thanks,

Harsha

On Mon, Feb 10, 2020 at 5:10 PM, Brian Sang < baisang@yelp.com.invalid > wrote:

> 
> 
> 
> Hi,
> 
> 
> 
> Just wanted to bump this discussion, since it happened to us again at Yelp
> 😂
> 
> 
> 
> It's particularly nasty since it can happen right before a disk failure,
> so right as the leader for the partition becomes the only ISR, the leader
> becomes unrecoverable right after, forcing us to do an unclean leader
> election to resolve the situation. Having offline partitions due to a
> single failure is really annoying. I'm curious if others have experienced
> this as well, but weren't able to trace it to this specific error.
> 
> 
> 
> Best,
> Brian
> 
> 
> 
> On 2020/01/22 03:28:34, Satish Duggana <satish.duggana@gmail.com> wrote:
> 
> 
>> 
>> 
>> Hi Jun,
>> Can you please review the KIP and let us know your comments?
>> 
>> 
>> 
>> If there are no comments/questions, we can start a vote thread.
>> 
>> 
>> 
>> It looks like Yelp folks also encountered the same issue as mentioned in
>> JIRA comment[1].
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> Flavien Raynaud added a comment - Yesterday
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> We've seen offline partitions happening for the same reason in one of our
>> clusters too, where only the broker leader for the offline partitions was
>> having disk issues. It looks like there has not been much progress/look on
>> the PR submitted since December 9th. Is there anything blocking this
>> change from moving forward?
>> 
>> 
>> 
>> 1. https://issues.apache.org/jira/browse/KAFKA-8733?focusedCommentId=17020083&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17020083
>> 
>> 
>> 
>> Thanks,
>> Satish.
>> 
>> 
>> 
>> On Thu, Dec 5, 2019 at 10:38 AM Harsha Chintalapani <kafka@harsha.io> wrote:
>> 
>> 
>>> 
>>> 
>>> Hi Jason,
>>> As Satish said just increase replica max lag will not work in this case.
>>> Just before a disk dies the reads becomes really slow and its hard to
>>> estimate how much this is, as we noticed range is pretty wide. Overall it
>>> doesn't make sense to knock good replicas out of just because a leader is
>>> slower in processing reads or serving the fetch requests which may be due
>>> to disk issues in this case but could be other issues as well. I think
>>> this kip addresses in general all of these issues.
>>> Do you still have questions on the current approach if not we can take it
>>> vote.
>>> Thanks,
>>> Harsha
>>> 
>>> 
>>> 
>>> On Mon, Nov 18, 2019 at 7:05 PM, Satish Duggana <satish.duggana@gmail.com> wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> Hi Jason,
>>>> Thanks for looking into the KIP. Apologies for my late reply. Increasing
>>>> replica max lag to 30-45 secs did not help as we observed that a few fetch
>>>> requests took more than 1-2 minutes. We do not want to increase further as
>>>> it increases upper bound on commit latency. We have strict SLAs on some of
>>>> the clusters on end to end(producer to consumer) latency. This proposal
>>>> improves the availability of partitions when followers are trying their
>>>> best to be insync even when leaders are slow in processing those requests.
>>>> I have updated the KIP to have a single config for giving backward
>>>> compatibility and I guess this config is more comprehensible than earlier.
>>>> But I believe there is no need to have config because the suggested
>>>> proposal in the KIP is an enhancement to the existing behavior. Please let
>>>> me know your comments.
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Satish.
>>>> 
>>>> 
>>>> 
>>>> On Thu, Nov 14, 2019 at 10:57 AM Jason Gustafson <jason@confluent.io> wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi Satish,
>>>> 
>>>> 
>>>> 
>>>> Thanks for the KIP. I'm wondering how much of this problem can be
>>>> addressed just by increasing the replication max lag? That was one of the
>>>> purposes of KIP-537 (the default increased from 10s to 30s). Also, the new
>>>> configurations seem quite low level. I think they will be hard for users
>>>> to understand (even reading through a couple times I'm not sure I
>>>> understand them fully). I think if there's a way to improve this behavior
>>>> without requiring any new configurations, it would be much more
>>>> attractive.
>>>> 
>>>> 
>>>> 
>>>> Best,
>>>> Jason
>>>> 
>>>> 
>>>> 
>>>> On Wed, Nov 6, 2019 at 8:14 AM Satish Duggana <satish.duggana@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi Dhruvil,
>>>> Thanks for looking into the KIP.
>>>> 
>>>> 
>>>> 
>>>> 10. I have an initial sketch of the KIP-500 in commit[a] which discusses
>>>> tracking the pending fetch requests. Tracking is not done in
>>>> Partition#readRecords because if it takes longer in reading any of the
>>>> partitions then we do not want any of the replicas of this fetch request
>>>> to go out of sync.
>>>> 
>>>> 
>>>> 
>>>> 11. I think `Replica` class should be thread-safe to handle the remote
>>>> scenario of concurrent requests running for a follower replica. Or I may
>>>> be missing something here. This is a separate issue from KIP-500. I will
>>>> file a separate JIRA to discuss that issue.
>>>> 
>>>> 
>>>> 
>>>> a - https://github.com/satishd/kafka/commit/c69b525abe8f6aad5059236076a003cdec4c4eb7
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Satish.
>>>> 
>>>> 
>>>> 
>>>> On Tue, Oct 29, 2019 at 10:57 AM Dhruvil Shah <dhruvil@confluent.io> wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi Satish,
>>>> 
>>>> 
>>>> 
>>>> Thanks for the KIP, those seems very useful. Could you elaborate on how
>>>> pending fetch requests are tracked?
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Dhruvil
>>>> 
>>>> 
>>>> 
>>>> On Mon, Oct 28, 2019 at 9:43 PM Satish Duggana <satish.duggana@gmail.com>
>>>> 
>>>> 
>>>> 
>>>> wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi All,
>>>> I wrote a short KIP about avoiding out-of-sync or offline partitions when
>>>> follower fetch requests are not processed in time by the leader replica.
>>>> KIP-501 is located at https://s.apache.org/jhbpn
>>>> 
>>>> 
>>>> 
>>>> Please take a look, I would like to hear your feedback and suggestions.
>>>> 
>>>> 
>>>> 
>>>> JIRA: https://issues.apache.org/jira/browse/KAFKA-8733
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> Satish.
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
>

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Posted by Brian Sang <ba...@yelp.com.INVALID>.
Hi,

Just wanted to bump this discussion, since it happened to us again at Yelp 😂 

It's particularly nasty since it can happen right before a disk failure: right as the leader for the partition becomes the only replica in the ISR, that leader becomes unrecoverable, forcing us to do an unclean leader election to resolve the situation. Having offline partitions due to a single failure is really annoying. I'm curious whether others have experienced this as well but weren't able to trace it to this specific error.

Best,
Brian

On 2020/01/22 03:28:34, Satish Duggana <sa...@gmail.com> wrote: 
> Hi Jun,
> Can you please review the KIP and let us know your comments?
> 
> If there are no comments/questions, we can start a vote thread.
> 
> It looks like Yelp folks also encountered the same issue as mentioned
> in JIRA comment[1].
> 
> >> Flavien Raynaud added a comment - Yesterday
> We've seen offline partitions happening for the same reason in one of
> our clusters too, where only the broker leader for the offline
> partitions was having disk issues. It looks like there has not been
> much progress/look on the PR submitted since December 9th. Is there
> anything blocking this change from moving forward?
> 
> 1. https://issues.apache.org/jira/browse/KAFKA-8733?focusedCommentId=17020083&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17020083
> 
> Thanks,
> Satish.
> 
> 
> On Thu, Dec 5, 2019 at 10:38 AM Harsha Chintalapani <ka...@harsha.io> wrote:
> >
> > Hi Jason,
> >          As Satish said just increase replica max lag will not work in this
> > case. Just before a disk dies the reads becomes really slow and its hard to
> > estimate how much this is, as we noticed range is pretty wide. Overall it
> > doesn't make sense to knock good replicas out of just because a leader is
> > slower in processing reads or serving the fetch requests which may be due
> > to disk issues in this case but could be other issues as well. I think this
> > kip addresses in general all of these issues.
> >          Do you still have questions on the current approach if not we can
> > take it vote.
> > Thanks,
> > Harsha
> >
> >
> > On Mon, Nov 18, 2019 at 7:05 PM, Satish Duggana <sa...@gmail.com>
> > wrote:
> >
> > > Hi Jason,
> > > Thanks for looking into the KIP. Apologies for my late reply. Increasing
> > > replica max lag to 30-45 secs did not help as we observed that a few fetch
> > > requests took more than 1-2 minutes. We do not want to increase further as
> > > it increases upper bound on commit latency. We have strict SLAs on some of
> > > the clusters on end to end(producer to consumer) latency. This proposal
> > > improves the availability of partitions when followers are trying their
> > > best to be insync even when leaders are slow in processing those requests.
> > > I have updated the KIP to have a single config for giving backward
> > > compatibility and I guess this config is more comprehensible than earlier.
> > > But I believe there is no need to have config because the suggested
> > > proposal in the KIP is an enhancement to the existing behavior. Please let
> > > me know your comments.
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Thu, Nov 14, 2019 at 10:57 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > Hi Satish,
> > >
> > > Thanks for the KIP. I'm wondering how much of this problem can be
> > > addressed just by increasing the replication max lag? That was one of the
> > > purposes of KIP-537 (the default increased from 10s to 30s). Also, the new
> > > configurations seem quite low level. I think they will be hard for users to
> > > understand (even reading through a couple times I'm not sure I understand
> > > them fully). I think if there's a way to improve this behavior without
> > > requiring any new configurations, it would be much more attractive.
> > >
> > > Best,
> > > Jason
> > >
> > > On Wed, Nov 6, 2019 at 8:14 AM Satish Duggana <sa...@gmail.com>
> > > wrote:
> > >
> > > Hi Dhruvil,
> > > Thanks for looking into the KIP.
> > >
> > > 10. I have an initial sketch of the KIP-500 in commit[a] which discusses
> > > tracking the pending fetch requests. Tracking is not done in
> > > Partition#readRecords because if it takes longer in reading any of the
> > > partitions then we do not want any of the replicas of this fetch request to
> > > go out of sync.
> > >
> > > 11. I think `Replica` class should be thread-safe to handle the remote
> > > scenario of concurrent requests running for a follower replica. Or I may be
> > > missing something here. This is a separate issue from KIP-500. I will file
> > > a separate JIRA to discuss that issue.
> > >
> > > a -
> > > https://github.com/satishd/kafka/commit/
> > > c69b525abe8f6aad5059236076a003cdec4c4eb7
> > >
> > > Thanks,
> > > Satish.
> > >
> > > On Tue, Oct 29, 2019 at 10:57 AM Dhruvil Shah <dh...@confluent.io>
> > > wrote:
> > >
> > > Hi Satish,
> > >
> > > Thanks for the KIP, those seems very useful. Could you elaborate on how
> > > pending fetch requests are tracked?
> > >
> > > Thanks,
> > > Dhruvil
> > >
> > > On Mon, Oct 28, 2019 at 9:43 PM Satish Duggana <satish.duggana@gmail.com
> > >
> > > wrote:
> > >
> > > Hi All,
> > > I wrote a short KIP about avoiding out-of-sync or offline partitions when
> > > follower fetch requests are not processed in time by the leader replica.
> > > KIP-501 is located at https://s.apache.org/jhbpn
> > >
> > > Please take a look, I would like to hear your feedback and suggestions.
> > >
> > > JIRA: https://issues.apache.org/jira/browse/KAFKA-8733
> > >
> > > Thanks,
> > > Satish.
> > >
> > >
> 

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

Posted by Satish Duggana <sa...@gmail.com>.
Hi Jun,
Can you please review the KIP and let us know your comments?

If there are no comments/questions, we can start a vote thread.

It looks like Yelp folks also encountered the same issue, as mentioned
in the JIRA comment [1].

>> Flavien Raynaud added a comment - Yesterday
We've seen offline partitions happening for the same reason in one of
our clusters too, where only the broker leader for the offline
partitions was having disk issues. It looks like there has not been
much progress/look on the PR submitted since December 9th. Is there
anything blocking this change from moving forward?

1. https://issues.apache.org/jira/browse/KAFKA-8733?focusedCommentId=17020083&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17020083

Thanks,
Satish.


On Thu, Dec 5, 2019 at 10:38 AM Harsha Chintalapani <ka...@harsha.io> wrote:
>
> Hi Jason,
>          As Satish said just increase replica max lag will not work in this
> case. Just before a disk dies the reads becomes really slow and its hard to
> estimate how much this is, as we noticed range is pretty wide. Overall it
> doesn't make sense to knock good replicas out of just because a leader is
> slower in processing reads or serving the fetch requests which may be due
> to disk issues in this case but could be other issues as well. I think this
> kip addresses in general all of these issues.
>          Do you still have questions on the current approach if not we can
> take it vote.
> Thanks,
> Harsha
>
>
> On Mon, Nov 18, 2019 at 7:05 PM, Satish Duggana <sa...@gmail.com>
> wrote:
>
> > Hi Jason,
> > Thanks for looking into the KIP. Apologies for my late reply. Increasing
> > replica max lag to 30-45 secs did not help as we observed that a few fetch
> > requests took more than 1-2 minutes. We do not want to increase further as
> > it increases upper bound on commit latency. We have strict SLAs on some of
> > the clusters on end to end(producer to consumer) latency. This proposal
> > improves the availability of partitions when followers are trying their
> > best to be insync even when leaders are slow in processing those requests.
> > I have updated the KIP to have a single config for giving backward
> > compatibility and I guess this config is more comprehensible than earlier.
> > But I believe there is no need to have config because the suggested
> > proposal in the KIP is an enhancement to the existing behavior. Please let
> > me know your comments.
> >
> > Thanks,
> > Satish.
> >
> > On Thu, Nov 14, 2019 at 10:57 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > Hi Satish,
> >
> > Thanks for the KIP. I'm wondering how much of this problem can be
> > addressed just by increasing the replication max lag? That was one of the
> > purposes of KIP-537 (the default increased from 10s to 30s). Also, the new
> > configurations seem quite low level. I think they will be hard for users to
> > understand (even reading through a couple times I'm not sure I understand
> > them fully). I think if there's a way to improve this behavior without
> > requiring any new configurations, it would be much more attractive.
> >
> > Best,
> > Jason
> >
> > On Wed, Nov 6, 2019 at 8:14 AM Satish Duggana <sa...@gmail.com>
> > wrote:
> >
> > Hi Dhruvil,
> > Thanks for looking into the KIP.
> >
> > 10. I have an initial sketch of the KIP-500 in commit[a] which discusses
> > tracking the pending fetch requests. Tracking is not done in
> > Partition#readRecords because if it takes longer in reading any of the
> > partitions then we do not want any of the replicas of this fetch request to
> > go out of sync.
> >
> > 11. I think `Replica` class should be thread-safe to handle the remote
> > scenario of concurrent requests running for a follower replica. Or I may be
> > missing something here. This is a separate issue from KIP-500. I will file
> > a separate JIRA to discuss that issue.
> >
> > a -
> > https://github.com/satishd/kafka/commit/
> > c69b525abe8f6aad5059236076a003cdec4c4eb7
> >
> > Thanks,
> > Satish.
> >
> > On Tue, Oct 29, 2019 at 10:57 AM Dhruvil Shah <dh...@confluent.io>
> > wrote:
> >
> > Hi Satish,
> >
> > Thanks for the KIP, those seems very useful. Could you elaborate on how
> > pending fetch requests are tracked?
> >
> > Thanks,
> > Dhruvil
> >
> > On Mon, Oct 28, 2019 at 9:43 PM Satish Duggana <satish.duggana@gmail.com
> >
> > wrote:
> >
> > Hi All,
> > I wrote a short KIP about avoiding out-of-sync or offline partitions when
> > follower fetch requests are not processed in time by the leader replica.
> > KIP-501 is located at https://s.apache.org/jhbpn
> >
> > Please take a look, I would like to hear your feedback and suggestions.
> >
> > JIRA: https://issues.apache.org/jira/browse/KAFKA-8733
> >
> > Thanks,
> > Satish.
> >
> >