
Posted to dev@kafka.apache.org by Colin McCabe <cm...@apache.org> on 2017/05/01 17:18:14 UTC

Re: [DISCUSS] KIP-144: Exponential backoff for broker reconnect attempts

Thanks for the KIP, Ismael & Dana!  This could be pretty important for
avoiding congestion collapse when there are a lot of clients.

It seems like a good idea to keep the "ms" suffix, like we have with
"reconnect.backoff.ms".  So maybe we should use
"reconnect.backoff.max.ms"?  In general unitless timeouts can be the
source of a lot of confusion (is it seconds, milliseconds, etc.?)

It's good that the KIP injects random delays (jitter) into the timeout.
As per Gwen's point, does it make sense to put an upper bound on the
jitter, though?  If someone sets reconnect.backoff.max to 5 minutes,
they probably would be a little surprised to find it doing three retries
after 100 ms in a row (as it could under the current scheme.)  Maybe a
maximum jitter configuration would help address that, and make the
behavior a little more intuitive.
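
To make the concern concrete, here is a minimal sketch (Python, not
Kafka client code) that compares the range of possible delays under
full jitter with the range under a hypothetical maximum-jitter setting.
The function names and the 20% figure are illustrative assumptions, not
something taken from the KIP text.

```python
def capped_backoff_ms(base_ms, max_ms, failures):
    """Exponential backoff before jitter: base * 2^(failures - 1), capped at max_ms."""
    return min(base_ms * 2 ** (failures - 1), max_ms)


def full_jitter_bounds(backoff_ms):
    # Full jitter draws uniformly from [0, backoff], so the lower bound
    # stays at 0 no matter how large the configured maximum is.
    return (0, backoff_ms)


def bounded_jitter_bounds(backoff_ms, max_jitter_pct=20):
    # Hypothetical "maximum jitter" knob: stay within +/- max_jitter_pct
    # of the calculated backoff. Note the upper bound may exceed the cap.
    low = backoff_ms * (100 - max_jitter_pct) // 100
    high = backoff_ms * (100 + max_jitter_pct) // 100
    return (low, high)


five_minutes_ms = 5 * 60 * 1000
# After enough consecutive failures the backoff has reached the 5 minute cap.
backoff = capped_backoff_ms(100, five_minutes_ms, failures=13)
print("full jitter range (ms):   ", full_jitter_bounds(backoff))
print("bounded jitter range (ms):", bounded_jitter_bounds(backoff))
```

With an unbounded scheme the shortest possible delay never grows, which
is how several quick retries in a row remain possible; bounding the
jitter ties the shortest delay to the calculated backoff.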

best,
Colin

On Thu, Apr 27, 2017, at 09:39, Gwen Shapira wrote:
> This is a great suggestion. I like how we just do it by default instead of
> making it a choice users need to figure out.
> Avoiding connection storms is great.
> 
> One concern. If I understand the formula for effective maximum backoff
> correctly, then with default maximum of 1000ms and default backoff of
> 100ms, the effective maximum backoff will be 450ms rather than 1000ms. This
> isn't exactly intuitive.
> I'm wondering if it makes more sense to allow "one last doubling" which may
> bring us slightly over the maximum, but much closer to it. I.e. have the
> effective maximum be in [max.backoff - backoff, max.backoff + backoff]
> range rather than half that. Does that make sense?
> 
> Gwen
> 
> On Thu, Apr 27, 2017 at 9:06 AM, Ismael Juma <is...@juma.me.uk> wrote:
> 
> > Hi all,
> >
> > Dana Powers posted a PR a while back for exponential backoff for broker
> > reconnect attempts. Because it adds a config, a KIP is required and Dana
> > seems to be busy so I posted it:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> >
> > Please take a look. Your feedback is appreciated.
> >
> > Thanks,
> > Ismael
> >

Re: [DISCUSS] KIP-144: Exponential backoff for broker reconnect attempts

Posted by Dana Powers <da...@gmail.com>.

s/back-office/backoff/

Final note: although the goal here is not to resolve contention (as in the
AWS article), I think we do still want a relatively smooth rate of
reconnects across all clients to avoid storm spikes. Full Jitter does that.
I expect that narrower jitter bands will lead to more clumping of
reconnects, but that's maybe ok.

Another idea would be to make jitter configurable as a percentage: Full
Jitter would be 100%, no jitter 0%, Equal Jitter 50%, and so on.
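
A rough sketch of what such a knob could look like, in Python. The
function name apply_jitter and the jitter_pct parameter are made up for
illustration; no such setting exists in the KIP as discussed here.

```python
import random


def apply_jitter(backoff_ms, jitter_pct, rng=random):
    """Draw a delay uniformly from the top jitter_pct percent of backoff_ms.

    jitter_pct=100 -> [0, backoff]           (Full Jitter)
    jitter_pct=50  -> [backoff / 2, backoff] (Equal Jitter)
    jitter_pct=0   -> exactly backoff        (no jitter)
    """
    if not 0 <= jitter_pct <= 100:
        raise ValueError("jitter_pct must be between 0 and 100")
    # The lowest delay this setting can produce.
    floor_ms = backoff_ms * (100 - jitter_pct) / 100
    return rng.uniform(floor_ms, backoff_ms)


rng = random.Random(42)  # seeded only so repeated runs are comparable
for pct in (0, 50, 100):
    print(pct, round(apply_jitter(800, pct, rng), 1))
```

A single percentage keeps the calculated backoff as the upper bound and
only moves the lower bound, so narrower bands trade smoothness for
predictability.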

On May 8, 2017 5:28 PM, "Dana Powers" <da...@gmail.com> wrote:

> For some discussion of jitter and exponential back-office, I found this
> article useful:
>
> https://www.awsarchitectureblog.com/2015/03/backoff.html
>
> My initial POC used the "Full Jitter" approach described therein. Equal
> Jitter is good too, and may perform a little better. It is random
> distribution between 50% and 100% of calculated backoff.
>
> Dana
>

Re: [DISCUSS] KIP-144: Exponential backoff for broker reconnect attempts

Posted by Dana Powers <da...@gmail.com>.

For some discussion of jitter and exponential back-office, I found this
article useful:

https://www.awsarchitectureblog.com/2015/03/backoff.html

My initial POC used the "Full Jitter" approach described therein. Equal
Jitter is good too, and may perform a little better. It draws a random
delay uniformly between 50% and 100% of the calculated backoff.
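
For reference, the two strategies from the article can be sketched in
Python roughly as follows. This is a paraphrase of the article's
formulas, not the POC code; the function and parameter names are
assumptions chosen for illustration.

```python
import random


def exponential_backoff_ms(base_ms, cap_ms, attempt):
    # attempt counts from 0: base, 2 * base, 4 * base, ... capped at cap_ms.
    return min(cap_ms, base_ms * 2 ** attempt)


def full_jitter_ms(base_ms, cap_ms, attempt, rng=random):
    # Full Jitter: anywhere between 0 and the calculated backoff.
    return rng.uniform(0, exponential_backoff_ms(base_ms, cap_ms, attempt))


def equal_jitter_ms(base_ms, cap_ms, attempt, rng=random):
    # Equal Jitter: keep half the backoff, randomize the other half,
    # which yields a delay between 50% and 100% of the backoff.
    backoff = exponential_backoff_ms(base_ms, cap_ms, attempt)
    return backoff / 2 + rng.uniform(0, backoff / 2)


rng = random.Random(0)  # seeded only so repeated runs are comparable
for attempt in range(5):
    print(
        attempt,
        round(full_jitter_ms(100, 1000, attempt, rng)),
        round(equal_jitter_ms(100, 1000, attempt, rng)),
    )
```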

Dana

On May 4, 2017 8:50 PM, "Ismael Juma" <is...@juma.me.uk> wrote:

> Thanks for the feedback Gwen and Colin. I agree that the original formula
> was not intuitive. I updated it to include a max jitter as was suggested. I
> also updated the config name to include `ms`:
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=69408222&selectedPageVersions=3&selectedPageVersions=1
>
> If there are no other concerns, I will start the vote tomorrow.
>
> Ismael
>

Re: [DISCUSS] KIP-144: Exponential backoff for broker reconnect attempts

Posted by Ismael Juma <is...@juma.me.uk>.

Thanks for the feedback Gwen and Colin. I agree that the original formula
was not intuitive. I updated it to include a max jitter as was suggested. I
also updated the config name to include `ms`:

https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=69408222&selectedPageVersions=3&selectedPageVersions=1

If there are no other concerns, I will start the vote tomorrow.

Ismael
