Posted to dev@kafka.apache.org by Sanjana Kaundinya <sk...@gmail.com> on 2020/03/13 23:37:59 UTC

[DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Hi Everyone,

I’ve written a KIP about introducing exponential backoff for Kafka clients. Would appreciate any feedback on this.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients

Thanks,
Sanjana

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Konstantine Karantasis <ko...@confluent.io>.
Thank you Sanjana.

I checked the updates on the KIP and it looks good to me too now. I'll vote
on the voting thread.

Best,
Konstantine

On Wed, Mar 25, 2020 at 1:57 PM Sanjana Kaundinya <sk...@gmail.com>
wrote:

> Hi Konstantine,
>
> Thanks for the insightful feedback. I’ll address it here as well as update
> the KIP accordingly.
>
> I think it is important to call out the fact that we are leaving out
> Connect and Streams in the proposed changes, so that it can be addressed in
> future KIP/changes. As you pointed out, Kafka Connect does utilize
> ConsumerNetworkClient and Metadata for its rebalancing protocol, and as a
> result the changes made to exponential backoff would affect the
> WorkerGroupMember that utilizes these classes. Any Kafka client that
> utilizes these classes would be making use of exponential backoff instead
> of the current static backoff.
>
> That being said, although Kafka Connect will be affected with respect to
> those two things, not all of the backoff configs are being replaced here.
> As you correctly stated, classes such as AbstractCoordinator,
> ConsumerCoordinator, and the Heartbeat thread would be utilizing the static
> backoff behavior - no changes will be made with respect to rebalancing.
>
> With respect to Compatibility, I will add into that section the things
> I’ve mentioned above - the effects on Kafka Connect as well as no changes
> to anything related to the rebalance protocol. In addition, the reason why
> retry.backoff.max.ms shouldn’t default to the same value as
> retry.backoff.ms is that, if a user isn’t aware of this feature and
> doesn’t set it, they wouldn’t get the exponential backoff at all. Instead
> it’s important to ensure that we provide this as a default feature for all
> Kafka clients. In addition, defaulting the retry.backoff.max.ms to 1000 ms
> unconditionally wouldn’t give users the flexibility to tune their clients
> to their environments.
>
> Finally, yes you are correct, in order to have exponential backoff, we
> actually do need both configs, with retry.backoff.ms <
> retry.backoff.max.ms. I will update the KIP to reflect that as well as
> incorporate the wording change you have suggested.
>
> Thanks,
> Sanjana
>
> On Mar 25, 2020, 10:50 AM -0700, Konstantine Karantasis <
> konstantine@confluent.io>, wrote:
> > Hi Sanjana and thanks for the KIP!
> >
> > Sorry for the late response, but I still have a few questions that you
> > might find useful.
> >
> > The KIP currently does not mention Kafka Connect at all. I have read
> > the discussion above where it'd been decided to leave Connect and Streams
> > out of the proposed changes, but I feel this should be called out
> > explicitly. At the same time, Kafka Connect is also a Kafka client that
> > uses ConsumerNetworkClient and Metadata for its rebalancing protocol.
> It's
> > not clear to me whether changes in those classes will affect Connect
> > workers. Do you think it's worth clarifying that?
> >
> > Additionally, you might also want to add a section specifically to
> mention
> > how this new config affects the places where the current config
> > retry.backoff.ms is used today to back-off during rebalancing. Is
> > exponential backoff going to replace the old config in those places as
> > well? And if it does, should we add a mention that a very high value of
> the
> > new retry.backoff.max.ms might affect how quickly a consumer or worker
> > rejoins their group after it experiences a temporary network partitioning
> > from the broker coordinator?
> >
> > Places that explicitly use retry.backoff.ms at the moment include the
> > AbstractCoordinator, the ConsumerCoordinator and the Heartbeat thread. By
> > reading the previous discussion, I understand that these classes might
> keep
> > using the old static backoff. Even if that's the case, I think it's worth
> > mentioning that in the KIP for reference.
> >
> > In the rejected alternatives section, you mention that "existing behavior
> > is always maintained: for reasons explained in the compatibility
> section.".
> > However, the Compatibility section says that there are no compatibility
> > concerns. I'd suggest extending the compatibility section to help a bit
> > more in explaining why the alternatives were rejected. Also, in the
> > compatibility section you mention that the new config (
> retry.backoff.max.ms)
> > will replace the old one (retry.backoff.ms), but from reading at the
> > beginning, I understand that in order to have exponential increments, you
> > actually need both configs, with retry.backoff.ms < retry.backoff.max.ms
> .
> > Should the mention around replacement be removed?
> >
> > Finally, I have a minor suggestion that might help explain the following
> > sentence better:
> >
> > "If retry.backoff.ms is set to be greater than retry.backoff.max.ms,
> then
> > retry.backoff.max.ms will be used as a **constant backoff from the
> > beginning without exponential increase**." (highlighting the difference
> > only for reference here). Unless I misunderstood how the new backoff will
> > be used when it's smaller than the value of the old config, in which case
> > it might help clarifying a bit more as well.
> >
> >
> > Thanks for the KIP!
> > Really looking forward to more robust retries in Kafka clients
> >
> > Konstantine
> >
> >
> > On Tue, Mar 24, 2020 at 9:56 AM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > In Kafka clients, there are cases where we log a warning when
> overriding
> > > some conflicting configs and in some other cases we throw and let the
> > > brokers to die during startup --- you can check the
> > > `postProcessParsedConfig` function in Producer/ConsumerConfig for such
> > > logic.
> > >
> > > I think for this case, it is sufficient to log a warning if we find the
> > > `max` < `backoff`.
> > >
> > >
> > > Guozhang
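
For illustration, the warn-rather-than-fail approach Guozhang describes could be a check along these lines when the configs are parsed. This is only a minimal sketch with a hypothetical helper name; it is not the actual `postProcessParsedConfig` code in Producer/ConsumerConfig.

    // Hypothetical sketch: warn when retry.backoff.max.ms < retry.backoff.ms instead of failing.
    public class BackoffConfigCheck {
        public static void warnIfInconsistent(long retryBackoffMs, long retryBackoffMaxMs) {
            if (retryBackoffMaxMs < retryBackoffMs) {
                System.err.printf(
                    "retry.backoff.max.ms (%d) is less than retry.backoff.ms (%d); "
                        + "the backoff will stay constant at retry.backoff.max.ms%n",
                    retryBackoffMaxMs, retryBackoffMs);
            }
        }

        public static void main(String[] args) {
            warnIfInconsistent(2000L, 1000L); // misconfigured example: prints the warning
        }
    }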
> > >
> > > On Mon, Mar 23, 2020 at 9:18 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Got it, although I would still like to be aware of the actual
> backoff I
> > > > will be using in production, having the app crash seems like an
> > > > over-reaction. I don't think I have further questions :)
> > > >
> > > > On Mon, Mar 23, 2020 at 7:36 PM Sanjana Kaundinya <
> skaundinya@gmail.com>
> > > > wrote:
> > > >
> > > > > Hey Boyang,
> > > > >
> > > > > If a user provides no config at all then, as you mentioned, they
> > > > > will by default be able to make use of the exponential backoff
> > > > > feature introduced by the KIP. If backoff.ms is overridden to 2000
> > > > > ms, the lesser of either the max or the computed backoff will be
> > > > > chosen, so in this case the max will be chosen as it is 1000 ms. As
> > > > > Guozhang mentioned, if the user configures something like this then
> > > > > they would notice the behavior to not be in line with what they
> > > > > expect, and would see the KIP + release notes and know to configure
> > > > > it to be backoff.ms < max.backoff.ms. I’m not sure it’s a big enough
> > > > > error to reject the configuration if it’s configured like this, as
> > > > > the clients could still run in either case.
> > > > >
> > > > > To answer your second question, we are making the dynamic backoff
> > > > > the default and not allowing for static backoff (unless they set
> > > > > backoff.ms > max.backoff.ms, then that would in a sense be static).
> > > > > We will include this information in the release notes to make sure
> > > > > users are aware of this behavior change.
> > > > >
> > > > > Thanks,
> > > > > Sanjana
> > > > >
> > > > > On Mon, Mar 23, 2020 at 6:37 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hey Sanjana,
> > > > > >
> > > > > > my understanding with the update is that if a user provides no
> > > > > > config at all, a Producer/Consumer/Admin client user would by
> > > > > > default enjoy a starting backoff.ms of 100 ms and max.backoff.ms
> > > > > > of 1000 ms? If I already override the backoff.ms to 2000 ms for
> > > > > > instance, will I be choosing the default max.backoff here?
> > > > > >
> > > > > > I guess my question would be whether we should just reject a
> config
> > > > with
> > > > > > backoff.ms > max.backoff.ms in the first place, as this looks
> like
> > > > > > mis-configuration to me.
> > > > > >
> > > > > > Second question is whether we allow fallback to static backoffs
> if
> > > the
> > > > > user
> > > > > > wants to do so, or we should just ship this as an opt-in feature?
> > > > > >
> > > > > > Let me know your thoughts.
> > > > > >
> > > > > > Boyang
> > > > > >
> > > > > > On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <ct...@confluent.io>
> > > wrote:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > > On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <
> > > > skaundinya@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Ah yes that makes sense. I’ll update the KIP to reflect this.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Sanjana
> > > > > > > >
> > > > > > > > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <
> > > wangguoz@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Following the formula you have in the KIP, if it is simply:
> > > > > > > > >
> > > > > > > > > MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))
> > > > > > > > >
> > > > > > > > > then the behavior would stay consistent at retry.backoff.max.ms.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Guozhang
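
For reference, the formula quoted above works out to roughly the following sketch. It is illustrative only: the class and method names are made up, the 100 ms / 1000 ms values are the proposed defaults, and treating failures <= 0 as the plain starting backoff is an assumption, not something the thread specifies.

    import java.util.concurrent.ThreadLocalRandom;

    // Sketch of MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))
    public class ExponentialBackoffSketch {
        private final long retryBackoffMs;     // proposed default: 100 ms
        private final long retryBackoffMaxMs;  // proposed default: 1000 ms

        public ExponentialBackoffSketch(long retryBackoffMs, long retryBackoffMaxMs) {
            this.retryBackoffMs = retryBackoffMs;
            this.retryBackoffMaxMs = retryBackoffMaxMs;
        }

        public long backoffMs(int failures) {
            if (failures <= 0)
                return retryBackoffMs; // assumption: first retry uses the plain starting backoff
            double exp = retryBackoffMs * Math.pow(2.0, failures - 1);
            double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
            // If retry.backoff.ms exceeds retry.backoff.max.ms, the min keeps the backoff
            // constant at retry.backoff.max.ms, matching the behavior described above.
            return (long) Math.min(retryBackoffMaxMs, exp * jitter);
        }
    }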
> > > > > > > > >
> > > > > > > > > On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <
> > > > > > skaundinya@gmail.com
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > If that’s the case then what should we base the starting
> point
> > > > as?
> > > > > > > > > > Currently in the KIP the starting point is
> retry.backoff.ms
> > > and
> > > > it
> > > > > > > > > > exponentially goes up to retry.backoff.max.ms. If
> > > > > > retry.backoff.max.ms
> > > > > > > > > is
> > > > > > > > > > smaller than retry.backoff.ms then that could pose a
> bit of a
> > > > > > problem
> > > > > > > > > > there right?
> > > > > > > > > >
> > > > > > > > > > On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <
> > > > wangguoz@gmail.com
> > > > > > ,
> > > > > > > > > wrote:
> > > > > > > > > > > Thanks Sanjana, I did not capture the part that Jason
> referred
> > > > to,
> > > > > > so
> > > > > > > > > > > that's my bad :P
> > > > > > > > > > >
> > > > > > > > > > > Regarding your last statement, I actually feel that instead of
> > > > > > > > > > > taking the larger of the two, we should respect
> > > > > > > > > > > "retry.backoff.max.ms" even if it is smaller than "retry.backoff.ms".
> > > > > > > > > > > I do not have a very strong rationale except that it is logically
> > > > > > > > > > > more aligned with the config names.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Guozhang
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> > > > > > > > > skaundinya@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Jason and Guozhang,
> > > > > > > > > > > >
> > > > > > > > > > > > Jason is right, I took this inspiration from KIP-144
> (
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > > > > > > > > > > > )
> > > > > > > > > > > > which had the same logic in order to preserve the
> existing
> > > > > > behavior.
> > > > > > > > > In
> > > > > > > > > > > > this case however, if we are thinking to completely
> eliminate
> > > > the
> > > > > > > > > > static
> > > > > > > > > > > > backoff behavior, we can do that and as Jason
> mentioned put
> > > it
> > > > in
> > > > > > the
> > > > > > > > > > > > release notes and not add any special logic. In
> addition I
> > > > agree
> > > > > > that
> > > > > > > > > > we
> > > > > > > > > > > > should take the larger of the two of `
> retry.backoff.ms` and
> > > `
> > > > > > > > > > > > retry.backoff.max.ms`. I'll update the KIP to
> reflect this
> > > and
> > > > > > make
> > > > > > > > > it
> > > > > > > > > > > > clear that the old static retry backoff is getting
> replaced
> > > by
> > > > > the
> > > > > > > > > new
> > > > > > > > > > > > dynamic retry backoff.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Sanjana
> > > > > > > > > > > > On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <
> > > > > > jason@confluent.io>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey Guozhang,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I was referring to this:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > For users who have not set retry.backoff.ms
> explicitly,
> > > the
> > > > > > > > > > default
> > > > > > > > > > > > > behavior will change so that the backoff will grow
> up to
> > > 1000
> > > > > ms.
> > > > > > > > > For
> > > > > > > > > > > > users
> > > > > > > > > > > > > who have set retry.backoff.ms explicitly, the
> behavior will
> > > > > > remain
> > > > > > > > > > the
> > > > > > > > > > > > > same
> > > > > > > > > > > > > as they could have specific requirements.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I took this to mean that for users who have
> overridden `
> > > > > > > > > > retry.backoff.ms
> > > > > > > > > > > > `
> > > > > > > > > > > > > to 50ms (say), we will change the default `
> > > > retry.backoff.max.ms
> > > > > `
> > > > > > > > > to
> > > > > > > > > > 50ms
> > > > > > > > > > > > > as
> > > > > > > > > > > > > well in order to preserve existing backoff
> behavior. Is that
> > > > not
> > > > > > > > > > right?
> > > > > > > > > > > > In
> > > > > > > > > > > > > any case, I agree that we can use the maximum of
> the two
> > > > values
> > > > > as
> > > > > > > > > > the
> > > > > > > > > > > > > effective `retry.backoff.max.ms` to handle the
> case when
> > > the
> > > > > > > > > > configured
> > > > > > > > > > > > > value of `retry.backoff.ms` is larger than the
> default of
> > > 1s.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Jason
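
The "maximum of the two values" option Jason agrees with here is simple enough to state in code. This is just one of the alternatives discussed in this exchange, not necessarily what the KIP ends up specifying; the helper name is hypothetical.

    // One option discussed: if retry.backoff.ms is configured above retry.backoff.max.ms,
    // respect it by using the larger of the two as the effective cap.
    public class EffectiveBackoffMax {
        static long effectiveRetryBackoffMaxMs(long retryBackoffMs, long retryBackoffMaxMs) {
            return Math.max(retryBackoffMs, retryBackoffMaxMs);
        }

        public static void main(String[] args) {
            // e.g. retry.backoff.ms overridden to 2000 ms against a default max of 1000 ms
            System.out.println(effectiveRetryBackoffMaxMs(2000L, 1000L)); // prints 2000
        }
    }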
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <
> > > > > wangguoz@gmail.com
> > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey Jason,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My understanding is a bit different here: even
> if user has
> > > an
> > > > > > > > > > explicit
> > > > > > > > > > > > > > overridden "retry.backoff.ms", the exponential
> mechanism
> > > > still
> > > > > > > > > > > > triggers
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > the backoff would be increased till "
> retry.backoff.max.ms
> > > ";
> > > > > and
> > > > > > > > > > if the
> > > > > > > > > > > > > > specified "retry.backoff.ms" is already larger
> than the "
> > > > > > > > > > > > > > retry.backoff.max.ms", we would still take "
> > > > > retry.backoff.max.ms
> > > > > > > > > ".
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So if the user does override the "
> retry.backoff.ms" to a
> > > > value
> > > > > > > > > > larger
> > > > > > > > > > > > > than
> > > > > > > > > > > > > > 1s and is not aware of the new config, she would
> be
> > > surprised
> > > > > to
> > > > > > > > > > see
> > > > > > > > > > > > the
> > > > > > > > > > > > > > specified value seemingly not being respected,
> but she
> > > could
> > > > > > > > > still
> > > > > > > > > > > > learn
> > > > > > > > > > > > > > that afterwards by reading the release notes
> introducing
> > > this
> > > > > KIP
> > > > > > > > > > > > > anyways.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Guozhang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> > > > > > > > > > jason@confluent.io>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Sanjana,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The KIP looks good to me. I had just one
> question about
> > > the
> > > > > > > > > > default
> > > > > > > > > > > > > > > behavior. As I understand, if the user has
> specified `
> > > > > > > > > > > > retry.backoff.ms
> > > > > > > > > > > > > `
> > > > > > > > > > > > > > > explicitly, then we will not apply the default
> max
> > > backoff.
> > > > As
> > > > > > > > > > such,
> > > > > > > > > > > > > > > there's no way to get the benefit of this
> feature if you
> > > are
> > > > > > > > > > > > providing
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > `
> > > > > > > > > > > > > > > retry.backoff.ms` unless you also provide `
> > > > > > > > > retry.backoff.max.ms
> > > > > > > > > > `.
> > > > > > > > > > > > That
> > > > > > > > > > > > > > > makes sense if you assume the user is unaware
> of the new
> > > > > > > > > > > > configuration,
> > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > it is surprising otherwise. Since it's not a
> semantic
> > > change
> > > > > > > > > and
> > > > > > > > > > > > since
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > default you're proposing of 1s is fairly low
> already, I
> > > > wonder
> > > > > > > > > if
> > > > > > > > > > > > it's
> > > > > > > > > > > > > > good
> > > > > > > > > > > > > > > enough to mention the new configuration in the
> release
> > > notes
> > > > > > > > > and
> > > > > > > > > > not
> > > > > > > > > > > > > add
> > > > > > > > > > > > > > > any special logic. What do you think?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -Jason
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana
> Kaundinya <
> > > > > > > > > > > > > skaundinya@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you for the comments Guozhang.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I’ll leave this KIP out for discussion till
> the end of
> > > the
> > > > > > > > > > week and
> > > > > > > > > > > > > > then
> > > > > > > > > > > > > > > > start a vote for this early next week.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang
> Wang <
> > > > > > > > > > wangguoz@gmail.com
> > > > > > > > > > > > > ,
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > Hello Sanjana,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the proposed KIP, I think that
> makes a lot of
> > > > > > > > > > sense --
> > > > > > > > > > > > > as
> > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > mentioned in the motivation, we've indeed seen many issues with
> > > > > > > > > > > > > > > > > regard to the frequent retries. With bounded exponential backoff,
> > > > > > > > > > > > > > > > > in the scenario where there's a long connectivity issue we would
> > > > > > > > > > > > > > > > > effectively reduce the request load by a factor of 10 given the
> > > > > > > > > > > > > > > > > default configs.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > For higher-level Streams client and
> Connect frameworks,
> > > > > > > > > > today we
> > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > a retry logic but that's used in a
> slightly different
> > > way.
> > > > > > > > > > For
> > > > > > > > > > > > > > example
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > Streams, we tend to handle the retry logic
> at the
> > > > > > > > > > thread-level
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > hence
> > > > > > > > > > > > > > > > > very likely we'd like to change that
> mechanism in
> > > KIP-572
> > > > > > > > > > > > anyways.
> > > > > > > > > > > > > > For
> > > > > > > > > > > > > > > > > producer / consumer / admin clients, I think just applying this
> > > > > > > > > > > > > > > > > behavioral change across these clients makes a lot of sense. So I
> > > > > > > > > > > > > > > > > think we can just leave Streams / Connect out of the scope of this
> > > > > > > > > > > > > > > > > KIP, to be addressed in separate discussions.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I do not have further comments about this
> KIP :) LGTM.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Guozhang
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana
> Kaundinya <
> > > > > > > > > > > > > > > skaundinya@gmail.com
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the feedback Boyang.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > If there’s anyone else who has feedback
> regarding this
> > > > > > > > > KIP,
> > > > > > > > > > > > would
> > > > > > > > > > > > > > > > really
> > > > > > > > > > > > > > > > > > appreciate it hearing it!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang
> Chen <
> > > > > > > > > > > > > bchen11@outlook.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Sounds great!
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Get Outlook for iOS<
> https://aka.ms/o0ukef>
> > > > > > > > > > > > > > > > > > > ________________________________
> > > > > > > > > > > > > > > > > > > From: Sanjana Kaundinya <
> skaundinya@gmail.com>
> > > > > > > > > > > > > > > > > > > Sent: Tuesday, March 17, 2020 5:54:35
> PM
> > > > > > > > > > > > > > > > > > > To: dev@kafka.apache.org <
> dev@kafka.apache.org>
> > > > > > > > > > > > > > > > > > > Subject: Re: [DISCUSS] KIP-580:
> Exponential Backoff
> > > for
> > > > > > > > > > Kafka
> > > > > > > > > > > > > > > Clients
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the explanation Boyang. One
> of the most
> > > > > > > > > common
> > > > > > > > > > > > > > problems
> > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > have in Kafka is with respect to
> metadata fetches. For
> > > > > > > > > > > > example,
> > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > there
> > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > > a broker failure, all clients start to
> fetch metadata
> > > > > > > > > at
> > > > > > > > > > the
> > > > > > > > > > > > > same
> > > > > > > > > > > > > > > > time
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > it often takes a while for the
> metadata to converge.
> > > > > > > > > In a
> > > > > > > > > > > > high
> > > > > > > > > > > > > > load
> > > > > > > > > > > > > > > > > > > cluster, there are also issues where
> the volume of
> > > > > > > > > > metadata
> > > > > > > > > > > > has
> > > > > > > > > > > > > > > made
> > > > > > > > > > > > > > > > > > > convergence of metadata slower.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > For this case, exponential backoff
> helps as it reduces
> > > > > > > > > > the
> > > > > > > > > > > > > retry
> > > > > > > > > > > > > > > > rate and
> > > > > > > > > > > > > > > > > > > spaces out how often clients will
> retry, thereby
> > > > > > > > > bringing
> > > > > > > > > > > > down
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > time
> > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > convergence. Something that Jason
> mentioned that would
> > > > > > > > > > be a
> > > > > > > > > > > > > great
> > > > > > > > > > > > > > > > > > addition
> > > > > > > > > > > > > > > > > > > here would be if the backoff should be
> “jittered” as
> > > it
> > > > > > > > > > was
> > > > > > > > > > > > in
> > > > > > > > > > > > > > > > KIP-144
> > > > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > > respect to exponential reconnect
> backoff. This would
> > > > > > > > > help
> > > > > > > > > > > > > prevent
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > clients from being synchronized on
> when they retry,
> > > > > > > > > > thereby
> > > > > > > > > > > > > > spacing
> > > > > > > > > > > > > > > > out
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > number of requests being sent to the
> broker at the
> > > same
> > > > > > > > > > time.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I’ll add this example to the KIP and
> flush out more of
> > > > > > > > > > the
> > > > > > > > > > > > > > details
> > > > > > > > > > > > > > > -
> > > > > > > > > > > > > > > > so
> > > > > > > > > > > > > > > > > > > it’s more clear.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang
> Chen <
> > > > > > > > > > > > > > > > reluctanthero104@gmail.com
> > > > > > > > > > > > > > > > > > > ,
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > Thanks for the reply Sanjana. I
> guess I would like to
> > > > > > > > > > > > > rephrase
> > > > > > > > > > > > > > my
> > > > > > > > > > > > > > > > > > > question
> > > > > > > > > > > > > > > > > > > > 2 and 3 as my previous response is a
> little bit
> > > > > > > > > > > > unactionable.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > My specific point is that
> exponential backoff is not
> > > > > > > > > a
> > > > > > > > > > > > silver
> > > > > > > > > > > > > > > > bullet
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > > should consider using it to solve
> known problems,
> > > > > > > > > > instead
> > > > > > > > > > > > of
> > > > > > > > > > > > > > > > making the
> > > > > > > > > > > > > > > > > > > > holistic changes to all clients in
> Kafka ecosystem. I
> > > > > > > > > > do
> > > > > > > > > > > > like
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > exponential backoff idea and believe
> this would be of
> > > > > > > > > > great
> > > > > > > > > > > > > > > value,
> > > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > > > maybe we should focus on proposing
> some existing
> > > > > > > > > > modules
> > > > > > > > > > > > that
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > > suffering
> > > > > > > > > > > > > > > > > > > > from static retry, and only change
> them in this first
> > > > > > > > > > KIP.
> > > > > > > > > > > > If
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > future, some other component users
> believe they are
> > > > > > > > > > also
> > > > > > > > > > > > > > > > suffering, we
> > > > > > > > > > > > > > > > > > > > could get more minor KIPs to change
> the behavior as
> > > > > > > > > > well.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM
> Sanjana Kaundinya <
> > > > > > > > > > > > > > > > > > skaundinya@gmail.com
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thanks for the feedback Boyang, I
> will revise the
> > > > > > > > > KIP
> > > > > > > > > > > > with
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > mathematical relations as per your
> suggestion. To
> > > > > > > > > > address
> > > > > > > > > > > > > > your
> > > > > > > > > > > > > > > > > > > feedback:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > 1. Currently, with the default of
> 100 ms per retry
> > > > > > > > > > > > backoff,
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > > > second
> > > > > > > > > > > > > > > > > > > > > we would have 10 retries. In the
> case of using an
> > > > > > > > > > > > > exponential
> > > > > > > > > > > > > > > > > > backoff,
> > > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > > > would have a total of 4 retries in
> 1 second. Thus
> > > > > > > > > we
> > > > > > > > > > have
> > > > > > > > > > > > > > less
> > > > > > > > > > > > > > > > than
> > > > > > > > > > > > > > > > > > > half of
> > > > > > > > > > > > > > > > > > > > > the amount of retries in the same
> timeframe and can
> > > > > > > > > > > > lessen
> > > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > > > > pressure.
> > > > > > > > > > > > > > > > > > > > > This calculation is done as follows (using the formula laid
> > > > > > > > > > > > > > > > > > > > > out in the KIP):
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry ms is initially 100 ms)
> > > > > > > > > > > > > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > > > > > > > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > > > > > > > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > > > > > > > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max retry ms is 1000 ms)
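
The schedule above can be reproduced with a small loop (jitter left out so the numbers stay deterministic; this is purely an illustration, not client code):

    // Reproduces the walkthrough: start at 100 ms, double on each failure, cap at 1000 ms.
    public class BackoffScheduleDemo {
        public static void main(String[] args) {
            long maxBackoffMs = 1000L;
            long backoffMs = 100L;
            long timeMs = 0L;
            for (int failures = 0; failures < 5; failures++) {
                System.out.printf("Try %d at time %d ms, failures = %d, next retry in %d ms%n",
                        failures + 1, timeMs, failures, backoffMs);
                timeMs += backoffMs;
                backoffMs = Math.min(maxBackoffMs, backoffMs * 2);
            }
        }
    }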
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > For 2 and 3, could you elaborate
> more about what
> > > > > > > > > you
> > > > > > > > > > mean
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > respect
> > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > client timeouts? I’m not very
> familiar with the
> > > > > > > > > > Streams
> > > > > > > > > > > > > > > > framework, so
> > > > > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > > > > love to get more insight to how
> that currently
> > > > > > > > > works,
> > > > > > > > > > > > with
> > > > > > > > > > > > > > > > respect to
> > > > > > > > > > > > > > > > > > > > > producer transactions, so I can
> appropriately
> > > > > > > > > update
> > > > > > > > > > the
> > > > > > > > > > > > > KIP
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > address
> > > > > > > > > > > > > > > > > > > > > these scenarios.
> > > > > > > > > > > > > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700,
> Boyang Chen <
> > > > > > > > > > > > > > > > > > > reluctanthero104@gmail.com>,
> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP Sanjana. I think the motivation is good, but it
> > > > > > > > > > > > > > > > > > > > > > lacks more quantitative analysis. For instance:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > 1. How many retries are we saving by applying the exponential retry
> > > > > > > > > > > > > > > > > > > > > > vs static retry? There should be some mathematical relation between
> > > > > > > > > > > > > > > > > > > > > > the static retry ms, the initial exponential retry ms, and the max
> > > > > > > > > > > > > > > > > > > > > > exponential retry ms in a given time interval.
> > > > > > > > > > > > > > > > > > > > > > 2. How does this affect the client timeout? With exponential retry,
> > > > > > > > > > > > > > > > > > > > > > the client will time out more easily at a parent-level caller; for
> > > > > > > > > > > > > > > > > > > > > > instance, Streams attempts to retry initializing producer
> > > > > > > > > > > > > > > > > > > > > > transactions within a given 5 minute interval. With exponential
> > > > > > > > > > > > > > > > > > > > > > retry this mechanism could experience more frequent timeouts, which
> > > > > > > > > > > > > > > > > > > > > > we should be careful with.
> > > > > > > > > > > > > > > > > > > > > > 3. With regards to #2, we should have a more detailed checklist of
> > > > > > > > > > > > > > > > > > > > > > all the existing static retry scenarios, and adjust the initial
> > > > > > > > > > > > > > > > > > > > > > exponential retry ms to make sure we won't easily time out at a
> > > > > > > > > > > > > > > > > > > > > > higher level due to too few attempts.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM
> Sanjana
> > > > > > > > > Kaundinya <
> > > > > > > > > > > > > > > > > > > skaundinya@gmail.com>
> > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I’ve written a KIP about
> introducing
> > > > > > > > > exponential
> > > > > > > > > > > > > backoff
> > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > Kafka
> > > > > > > > > > > > > > > > > > > > > > > clients. Would appreciate any
> feedback on this.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > -- Guozhang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
Hi Konstantine,

Thanks for the insightful feedback. I’ll address it here as well as update the KIP accordingly.

I think it is important to call out the fact that we are leaving out Connect and Streams in the proposed changes, so that it can be addressed in future KIP/changes. As you pointed out, Kafka Connect does utilize ConsumerNetworkClient and Metadata for its rebalancing protocol, and as a result the changes made to exponential backoff would affect the WorkerGroupMember that utilizes these classes. Any Kafka client that utilizes these classes would be making use of exponential backoff instead of the current static backoff.

That being said, although Kafka Connect will be affected with respect to those two things, not all of the backoff configs are being replaced here. As you correctly stated, classes such as AbstractCoordinator, ConsumerCoordinator, and the Heartbeat thread would be utilizing the static backoff behavior - no changes will be made with respect to rebalancing.

With respect to Compatibility, I will add into that section the things I’ve mentioned above - the effects on Kafka Connect as well as no changes to anything related to the rebalance protocol. In addition, the reason why retry.backoff.max.ms shouldn’t default to the same value as retry.backoff.ms is that, if a user isn’t aware of this feature and doesn’t set it, they wouldn’t get the exponential backoff at all. Instead it’s important to ensure that we provide this as a default feature for all Kafka clients. In addition, defaulting the retry.backoff.max.ms to 1000 ms unconditionally wouldn’t give users the flexibility to tune their clients to their environments.

Finally, yes you are correct, in order to have exponential backoff, we actually do need both configs, with retry.backoff.ms < retry.backoff.max.ms. I will update the KIP to reflect that as well as incorporate the wording change you have suggested.

Thanks,
Sanjana
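
As a concrete illustration of the two configs working together, a client could be configured roughly as follows. This is only a sketch: retry.backoff.ms is an existing client config, while retry.backoff.max.ms is the new config proposed by this KIP, so its key is simply written out as a string here.

    import java.util.Properties;

    // Sketch: keep retry.backoff.ms (the starting backoff) below retry.backoff.max.ms (the cap).
    public class BackoffConfigExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("retry.backoff.ms", "100");      // starting backoff (existing config)
            props.put("retry.backoff.max.ms", "1000"); // upper bound proposed by KIP-580
            // A producer built from these properties would, under the KIP's proposal,
            // back off exponentially between retries, from 100 ms up to 1000 ms.
            System.out.println(props);
        }
    }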

> > for
> > > > > > > > > Kafka
> > > > > > > > > > > > > > Clients
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the explanation Boyang. One of the most
> > > > > > > > common
> > > > > > > > > > > > > problems
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > have in Kafka is with respect to metadata fetches. For
> > > > > > > > > > > example,
> > > > > > > > > > > > > if
> > > > > > > > > > > > > > > there
> > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > a broker failure, all clients start to fetch metadata
> > > > > > > > at
> > > > > > > > > the
> > > > > > > > > > > > same
> > > > > > > > > > > > > > > time
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > it often takes a while for the metadata to converge.
> > > > > > > > In a
> > > > > > > > > > > high
> > > > > > > > > > > > > load
> > > > > > > > > > > > > > > > > > cluster, there are also issues where the volume of
> > > > > > > > > metadata
> > > > > > > > > > > has
> > > > > > > > > > > > > > made
> > > > > > > > > > > > > > > > > > convergence of metadata slower.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > For this case, exponential backoff helps as it reduces
> > > > > > > > > the
> > > > > > > > > > > > retry
> > > > > > > > > > > > > > > rate and
> > > > > > > > > > > > > > > > > > spaces out how often clients will retry, thereby
> > > > > > > > bringing
> > > > > > > > > > > down
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > time
> > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > convergence. Something that Jason mentioned that would
> > > > > > > > > be a
> > > > > > > > > > > > great
> > > > > > > > > > > > > > > > > addition
> > > > > > > > > > > > > > > > > > here would be if the backoff should be “jittered” as
> > it
> > > > > > > > > was
> > > > > > > > > > > in
> > > > > > > > > > > > > > > KIP-144
> > > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > respect to exponential reconnect backoff. This would
> > > > > > > > help
> > > > > > > > > > > > prevent
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > clients from being synchronized on when they retry,
> > > > > > > > > thereby
> > > > > > > > > > > > > spacing
> > > > > > > > > > > > > > > out
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > number of requests being sent to the broker at the
> > same
> > > > > > > > > time.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I’ll add this example to the KIP and flush out more of
> > > > > > > > > the
> > > > > > > > > > > > > details
> > > > > > > > > > > > > > -
> > > > > > > > > > > > > > > so
> > > > > > > > > > > > > > > > > > it’s more clear.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > > > > > > > > > > > > reluctanthero104@gmail.com
> > > > > > > > > > > > > > > > > > ,
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > Thanks for the reply Sanjana. I guess I would like to
> > > > > > > > > > > > rephrase
> > > > > > > > > > > > > my
> > > > > > > > > > > > > > > > > > question
> > > > > > > > > > > > > > > > > > > 2 and 3 as my previous response is a little bit
> > > > > > > > > > > unactionable.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > My specific point is that exponential backoff is not
> > > > > > > > a
> > > > > > > > > > > silver
> > > > > > > > > > > > > > > bullet
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > should consider using it to solve known problems,
> > > > > > > > > instead
> > > > > > > > > > > of
> > > > > > > > > > > > > > > making the
> > > > > > > > > > > > > > > > > > > holistic changes to all clients in Kafka ecosystem. I
> > > > > > > > > do
> > > > > > > > > > > like
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > exponential backoff idea and believe this would be of
> > > > > > > > > great
> > > > > > > > > > > > > > value,
> > > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > > > maybe we should focus on proposing some existing
> > > > > > > > > modules
> > > > > > > > > > > that
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > suffering
> > > > > > > > > > > > > > > > > > > from static retry, and only change them in this first
> > > > > > > > > KIP.
> > > > > > > > > > > If
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > future, some other component users believe they are
> > > > > > > > > also
> > > > > > > > > > > > > > > suffering, we
> > > > > > > > > > > > > > > > > > > could get more minor KIPs to change the behavior as
> > > > > > > > > well.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > > > > > > > > > > > > > skaundinya@gmail.com
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the feedback Boyang, I will revise the
> > > > > > > > KIP
> > > > > > > > > > > with
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > mathematical relations as per your suggestion. To
> > > > > > > > > address
> > > > > > > > > > > > > your
> > > > > > > > > > > > > > > > > > feedback:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > 1. Currently, with the default of 100 ms per retry
> > > > > > > > > > > backoff,
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > > second
> > > > > > > > > > > > > > > > > > > > we would have 10 retries. In the case of using an
> > > > > > > > > > > > exponential
> > > > > > > > > > > > > > > > > backoff,
> > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > > > would have a total of 4 retries in 1 second. Thus
> > > > > > > > we
> > > > > > > > > have
> > > > > > > > > > > > > less
> > > > > > > > > > > > > > > than
> > > > > > > > > > > > > > > > > > half of
> > > > > > > > > > > > > > > > > > > > the amount of retries in the same timeframe and can
> > > > > > > > > > > lessen
> > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > > > pressure.
> > > > > > > > > > > > > > > > > > > > This calculation is done as following (using the
> > > > > > > > > formula
> > > > > > > > > > > > laid
> > > > > > > > > > > > > > > out in
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > KIP:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100
> > > > > > > > > ms
> > > > > > > > > > > > > (default
> > > > > > > > > > > > > > > retry
> > > > > > > > > > > > > > > > > > ms
> > > > > > > > > > > > > > > > > > > > is initially 100 ms)
> > > > > > > > > > > > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in
> > > > > > > > > 200 ms
> > > > > > > > > > > > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in
> > > > > > > > > 400 ms
> > > > > > > > > > > > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in
> > > > > > > > > 800 ms
> > > > > > > > > > > > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in
> > > > > > > > > 1000
> > > > > > > > > > > ms
> > > > > > > > > > > > > > > (default
> > > > > > > > > > > > > > > > > max
> > > > > > > > > > > > > > > > > > > > retry ms is 1000 ms)
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > For 2 and 3, could you elaborate more about what
> > > > > > > > you
> > > > > > > > > mean
> > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > respect
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > client timeouts? I’m not very familiar with the
> > > > > > > > > Streams
> > > > > > > > > > > > > > > framework, so
> > > > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > > > love to get more insight to how that currently
> > > > > > > > works,
> > > > > > > > > > > with
> > > > > > > > > > > > > > > respect to
> > > > > > > > > > > > > > > > > > > > producer transactions, so I can appropriately
> > > > > > > > update
> > > > > > > > > the
> > > > > > > > > > > > KIP
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > address
> > > > > > > > > > > > > > > > > > > > these scenarios.
> > > > > > > > > > > > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > > > > > > > > > > > > > > reluctanthero104@gmail.com>,
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > Thanks for the KIP Sanjana. I think the
> > > > > > > > motivation
> > > > > > > > > is
> > > > > > > > > > > > good,
> > > > > > > > > > > > > > but
> > > > > > > > > > > > > > > > > lack
> > > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > > > more quantitative analysis. For instance:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > 1. How much retries we are saving by applying the
> > > > > > > > > > > > > exponential
> > > > > > > > > > > > > > > retry
> > > > > > > > > > > > > > > > > > vs
> > > > > > > > > > > > > > > > > > > > > static retry? There should be some mathematical
> > > > > > > > > > > relations
> > > > > > > > > > > > > > > between
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > static retry ms, the initial exponential retry
> > > > > > > > ms,
> > > > > > > > > the
> > > > > > > > > > > > max
> > > > > > > > > > > > > > > > > > exponential
> > > > > > > > > > > > > > > > > > > > > retry ms in a given time interval.
> > > > > > > > > > > > > > > > > > > > > 2. How does this affect the client timeout? With
> > > > > > > > > > > > > exponential
> > > > > > > > > > > > > > > retry,
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > client shall be getting easier to timeout on a
> > > > > > > > > parent
> > > > > > > > > > > > level
> > > > > > > > > > > > > > > caller,
> > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > instance stream attempts to retry initializing
> > > > > > > > > producer
> > > > > > > > > > > > > > > > > transactions
> > > > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > > > > given 5 minute interval. With exponential retry
> > > > > > > > > this
> > > > > > > > > > > > > > mechanism
> > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > > > experience more frequent timeout which we should
> > > > > > > > be
> > > > > > > > > > > > careful
> > > > > > > > > > > > > > > with.
> > > > > > > > > > > > > > > > > > > > > 3. With regards to #2, we should have more
> > > > > > > > detailed
> > > > > > > > > > > > > checklist
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > existing static retry scenarios, and adjust the
> > > > > > > > > initial
> > > > > > > > > > > > > > > exponential
> > > > > > > > > > > > > > > > > > retry
> > > > > > > > > > > > > > > > > > > > > ms to make sure we won't get easily timeout in
> > > > > > > > high
> > > > > > > > > > > level
> > > > > > > > > > > > > due
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > too
> > > > > > > > > > > > > > > > > > few
> > > > > > > > > > > > > > > > > > > > > attempts.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> > > > > > > > Kaundinya <
> > > > > > > > > > > > > > > > > > skaundinya@gmail.com>
> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > I’ve written a KIP about introducing
> > > > > > > > exponential
> > > > > > > > > > > > backoff
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > Kafka
> > > > > > > > > > > > > > > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Konstantine Karantasis <ko...@confluent.io>.
Hi Sanjana and thanks for the KIP!

Sorry for the late response, but I still have a few questions that you
might find useful.

The KIP currently does not mention Kafka Connect at all. I have read
the discussion above where it'd been decided to leave Connect and Streams
out of the proposed changes, but I feel this should be called out
explicitly. At the same time, Kafka Connect is also a Kafka client that
uses ConsumerNetworkClient and Metadata for its rebalancing protocol. It's
not clear to me whether changes in those classes will affect Connect
workers. Do you think it's worth clarifying that?

Additionally, you might also want to add a section specifically to mention
how this new config affects the places where the current config
retry.backoff.ms is used today to back off during rebalancing. Is
exponential backoff going to replace the old config in those places as
well? And if it does, should we add a mention that a very high value of the
new retry.backoff.max.ms might affect how quickly a consumer or worker
rejoins their group after it experiences a temporary network partition
from the broker coordinator?

Places that explicitly use retry.backoff.ms at the moment include the
AbstractCoordinator, the ConsumerCoordinator and the Heartbeat thread. By
reading the previous discussion, I understand that these classes might keep
using the old static backoff. Even if that's the case, I think it's worth
mentioning that in the KIP for reference.

In the rejected alternatives section, you mention that "existing behavior
is always maintained: for reasons explained in the compatibility section.".
However, the Compatibility section says that there are no compatibility
concerns. I'd suggest extending the compatibility section to help a bit
more in explaining why the alternatives were rejected. Also, in the
compatibility section you mention that the new config (retry.backoff.max.ms)
will replace the old one (retry.backoff.ms), but from reading at the
beginning, I understand that in order to have exponential increments, you
actually need both configs, with retry.backoff.ms < retry.backoff.max.ms.
Should the mention around replacement be removed?

Finally, I have a minor suggestion that might help explain the following
sentence better:

"If retry.backoff.ms is set to be greater than retry.backoff.max.ms, then
retry.backoff.max.ms will be used as a **constant backoff from the
beginning without exponential increase**." (highlighting the difference
only for reference here). Unless I misunderstood how the new backoff will
be used when it's smaller than the value of the old config, in which case
clarifying that a bit more might help as well.
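
For what it's worth, here is how I am currently reading the proposed
computation, as a purely illustrative Java sketch (the class and method names
are mine, not actual client code), based on the MIN(...) formula quoted
earlier in this thread. It also shows why a retry.backoff.ms larger than
retry.backoff.max.ms degenerates into a constant backoff:

    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative sketch only; the names here are hypothetical, not real client code.
    public class ExponentialBackoffSketch {
        private final long retryBackoffMs;     // retry.backoff.ms
        private final long retryBackoffMaxMs;  // retry.backoff.max.ms

        public ExponentialBackoffSketch(long retryBackoffMs, long retryBackoffMaxMs) {
            this.retryBackoffMs = retryBackoffMs;
            this.retryBackoffMaxMs = retryBackoffMaxMs;
        }

        // failures = number of failed attempts so far, starting at 1 for the first
        // retry, matching the 100 ms, 200 ms, 400 ms, ... schedule in this thread.
        public long backoffMs(int failures) {
            double exp = retryBackoffMs * Math.pow(2, Math.max(0, failures - 1));
            double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
            // If retry.backoff.ms >= retry.backoff.max.ms, the cap always wins, i.e. a
            // constant backoff of retry.backoff.max.ms with no exponential increase.
            return (long) Math.min(retryBackoffMaxMs, exp * jitter);
        }
    }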


Thanks for the KIP!
Really looking forward to more robust retries in Kafka clients

Konstantine


On Tue, Mar 24, 2020 at 9:56 AM Guozhang Wang <wa...@gmail.com> wrote:

> In Kafka clients, there are cases where we log a warning when overriding
> some conflicting configs, and in some other cases we throw and let the
> brokers die during startup; you can check the
> `postProcessParsedConfig` function in Producer/ConsumerConfig for such
> logic.
>
> I think for this case, it is sufficient to log a warning if we find the
> `max` < `backoff`.
>
>
> Guozhang
>
> On Mon, Mar 23, 2020 at 9:18 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Got it, although I would still like to be aware of the actual backoff I
> > will be using in production; having the app crash seems like an
> > over-reaction. I don't think I have further questions :)
> >
> > On Mon, Mar 23, 2020 at 7:36 PM Sanjana Kaundinya <sk...@gmail.com>
> > wrote:
> >
> > > Hey Boyang,
> > >
> > > If a user provides no config at all then, as you mentioned, they will by
> > > default be able to make use of the exponential backoff feature introduced
> > > by the KIP. If backoff.ms is overridden to 2000 ms, the lesser of either
> > > the max or the computed backoff will be chosen, so in this case the max
> > > will be chosen as it is 1000 ms. As Guozhang mentioned, if the user
> > > configures something like this then they would notice the behavior not
> > > being in line with what they expect, and would see the KIP + Release
> > > notes and know to configure backoff.ms < max.backoff.ms. I’m not sure
> > > it’s a big enough error to warrant rejecting the configuration if it’s
> > > configured like this, as the clients could still run in either case.
> > >
> > > To answer your second question, we are making the dynamic backoff the
> > > default and not allowing for static backoff (unless they set
> > > backoff.ms > max.backoff.ms, in which case that would in a sense be
> > > static). We will include this information in the release notes to make
> > > sure users are aware of this behavior change.
> > >
> > > Thanks,
> > > Sanjana
> > >
> > > On Mon, Mar 23, 2020 at 6:37 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Hey Sanjana,
> > > >
> > > > my understanding with the update is that if a user provides no config
> > > > at all, a Producer/Consumer/Admin client user would by default enjoy a
> > > > starting backoff.ms of 100 ms and a max.backoff.ms of 1000 ms? If I
> > > > already override the backoff.ms to 2000 ms, for instance, will I be
> > > > choosing the default max.backoff here?
> > > >
> > > > I guess my question would be whether we should just reject a config
> > with
> > > > backoff.ms > max.backoff.ms in the first place, as this looks like
> > > > mis-configuration to me.
> > > >
> > > > Second question is whether we allow fallback to static backoffs if
> the
> > > user
> > > > wants to do so, or we should just ship this as an opt-in feature?
> > > >
> > > > Let me know your thoughts.
> > > >
> > > > Boyang
> > > >
> > > > On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <ct...@confluent.io>
> wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > > On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <
> > skaundinya@gmail.com
> > > >
> > > > > wrote:
> > > > > >
> > > > > > Ah yes that makes sense. I’ll update the KIP to reflect this.
> > > > > >
> > > > > > Thanks,
> > > > > > Sanjana
> > > > > >
> > > > > > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <
> wangguoz@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > >> Following the formula you have in the KIP, if it is simply:
> > > > > >>
> > > > > >> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))
> > > > > >>
> > > > > >> then the behavior would stay consistent at retry.backoff.max.ms.
> > > > > >>
> > > > > >>
> > > > > >> Guozhang
> > > > > >>
> > > > > >> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <
> > > > skaundinya@gmail.com
> > > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >>> If that’s the case then what should we base the starting point
> > as?
> > > > > >>> Currently in the KIP the starting point is retry.backoff.ms
> and
> > it
> > > > > >>> exponentially goes up to retry.backoff.max.ms. If
> > > > retry.backoff.max.ms
> > > > > >> is
> > > > > >>> smaller than retry.backoff.ms then that could pose a bit of a
> > > > problem
> > > > > >>> there right?
> > > > > >>>
> > > > > >>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <
> > wangguoz@gmail.com
> > > >,
> > > > > >> wrote:
> > > > > >>>> Thanks Sanjana, I did not capture the part that Jason referred
> > to,
> > > > so
> > > > > >>>> that's my bad :P
> > > > > >>>>
> > > > > >>>> Regarding your last statement, I actually feel that instead of
> > > take
> > > > > the
> > > > > >>>> larger of the two, we should respect "retry.backoff.max.ms"
> > even
> > > if
> > > > > it
> > > > > >>> is
> > > > > >>>> smaller than "retry.backoff.ms". I do not have a very strong
> > > > > rationale
> > > > > >>>> except it is logically more aligned to the config names.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Guozhang
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> > > > > >> skaundinya@gmail.com>
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Hey Jason and Guozhang,
> > > > > >>>>>
> > > > > >>>>> Jason is right, I took this inspiration from KIP-144 (
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > > > > >>>>> )
> > > > > >>>>> which had the same logic in order to preserve the existing
> > > > behavior.
> > > > > >> In
> > > > > >>>>> this case however, if we are thinking to completely eliminate
> > the
> > > > > >>> static
> > > > > >>>>> backoff behavior, we can do that and as Jason mentioned put
> it
> > in
> > > > the
> > > > > >>>>> release notes and not add any special logic. In addition I
> > agree
> > > > that
> > > > > >>> we
> > > > > >>>>> should take the larger of the two of `retry.backoff.ms` and
> `
> > > > > >>>>> retry.backoff.max.ms`. I'll update the KIP to reflect this
> and
> > > > make
> > > > > >> it
> > > > > >>>>> clear that the old static retry backoff is getting replaced
> by
> > > the
> > > > > >> new
> > > > > >>>>> dynamic retry backoff.
> > > > > >>>>>
> > > > > >>>>> Thanks,
> > > > > >>>>> Sanjana
> > > > > >>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <
> > > > jason@confluent.io>
> > > > > >>>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> Hey Guozhang,
> > > > > >>>>>>
> > > > > >>>>>> I was referring to this:
> > > > > >>>>>>
> > > > > >>>>>>> For users who have not set retry.backoff.ms explicitly,
> the
> > > > > >>> default
> > > > > >>>>>> behavior will change so that the backoff will grow up to
> 1000
> > > ms.
> > > > > >> For
> > > > > >>>>> users
> > > > > >>>>>> who have set retry.backoff.ms explicitly, the behavior will
> > > > remain
> > > > > >>> the
> > > > > >>>>>> same
> > > > > >>>>>> as they could have specific requirements.
> > > > > >>>>>>
> > > > > >>>>>> I took this to mean that for users who have overridden `
> > > > > >>> retry.backoff.ms
> > > > > >>>>> `
> > > > > >>>>>> to 50ms (say), we will change the default `
> > retry.backoff.max.ms
> > > `
> > > > > >> to
> > > > > >>> 50ms
> > > > > >>>>>> as
> > > > > >>>>>> well in order to preserve existing backoff behavior. Is that
> > not
> > > > > >>> right?
> > > > > >>>>> In
> > > > > >>>>>> any case, I agree that we can use the maximum of the two
> > values
> > > as
> > > > > >>> the
> > > > > >>>>>> effective `retry.backoff.max.ms` to handle the case when
> the
> > > > > >>> configured
> > > > > >>>>>> value of `retry.backoff.ms` is larger than the default of
> 1s.
> > > > > >>>>>>
> > > > > >>>>>> -Jason
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <
> > > wangguoz@gmail.com
> > > > >
> > > > > >>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Hey Jason,
> > > > > >>>>>>>
> > > > > >>>>>>> My understanding is a bit different here: even if user has
> an
> > > > > >>> explicit
> > > > > >>>>>>> overridden "retry.backoff.ms", the exponential mechanism
> > still
> > > > > >>>>> triggers
> > > > > >>>>>>> and
> > > > > >>>>>>> the backoff would be increased till "retry.backoff.max.ms
> ";
> > > and
> > > > > >>> if the
> > > > > >>>>>>> specified "retry.backoff.ms" is already larger than the "
> > > > > >>>>>>> retry.backoff.max.ms", we would still take "
> > > retry.backoff.max.ms
> > > > > >> ".
> > > > > >>>>>>>
> > > > > >>>>>>> So if the user does override the "retry.backoff.ms" to a
> > value
> > > > > >>> larger
> > > > > >>>>>> than
> > > > > >>>>>>> 1s and is not aware of the new config, she would be
> surprised
> > > to
> > > > > >>> see
> > > > > >>>>> the
> > > > > >>>>>>> specified value seemingly not being respected, but she
> could
> > > > > >> still
> > > > > >>>>> learn
> > > > > >>>>>>> that afterwards by reading the release notes introducing
> this
> > > KIP
> > > > > >>>>>> anyways.
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> Guozhang
> > > > > >>>>>>>
> > > > > >>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> > > > > >>> jason@confluent.io>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Hi Sanjana,
> > > > > >>>>>>>>
> > > > > >>>>>>>> The KIP looks good to me. I had just one question about
> the
> > > > > >>> default
> > > > > >>>>>>>> behavior. As I understand, if the user has specified `
> > > > > >>>>> retry.backoff.ms
> > > > > >>>>>> `
> > > > > >>>>>>>> explicitly, then we will not apply the default max
> backoff.
> > As
> > > > > >>> such,
> > > > > >>>>>>>> there's no way to get the benefit of this feature if you
> are
> > > > > >>>>> providing
> > > > > >>>>>> a
> > > > > >>>>>>> `
> > > > > >>>>>>>> retry.backoff.ms` unless you also provide `
> > > > > >> retry.backoff.max.ms
> > > > > >>> `.
> > > > > >>>>> That
> > > > > >>>>>>>> makes sense if you assume the user is unaware of the new
> > > > > >>>>> configuration,
> > > > > >>>>>>> but
> > > > > >>>>>>>> it is surprising otherwise. Since it's not a semantic
> change
> > > > > >> and
> > > > > >>>>> since
> > > > > >>>>>>> the
> > > > > >>>>>>>> default you're proposing of 1s is fairly low already, I
> > wonder
> > > > > >> if
> > > > > >>>>> it's
> > > > > >>>>>>> good
> > > > > >>>>>>>> enough to mention the new configuration in the release
> notes
> > > > > >> and
> > > > > >>> not
> > > > > >>>>>> add
> > > > > >>>>>>>> any special logic. What do you think?
> > > > > >>>>>>>>
> > > > > >>>>>>>> -Jason
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > > > >>>>>> skaundinya@gmail.com>
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Thank you for the comments Guozhang.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I’ll leave this KIP out for discussion till the end of
> the
> > > > > >>> week and
> > > > > >>>>>>> then
> > > > > >>>>>>>>> start a vote for this early next week.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Sanjana
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> > > > > >>> wangguoz@gmail.com
> > > > > >>>>>> ,
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>>> Hello Sanjana,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of
> > > > > >>> sense --
> > > > > >>>>>> as
> > > > > >>>>>>>> you
> > > > > >>>>>>>>>> mentioned in the motivation, we've indeed seen many
> issues
> > > > > >>> with
> > > > > >>>>>>> regard
> > > > > >>>>>>>> to
> > > > > >>>>>>>>>> the frequent retries, with bounded exponential backoff
> in
> > > > > >> the
> > > > > >>>>>>> scenario
> > > > > >>>>>>>>>> where there's a long connectivity issue we would
> > > > > >> effectively
> > > > > >>>>> reduce
> > > > > >>>>>>> the
> > > > > >>>>>>>>>> request load by 10 given the default configs.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> For higher-level Streams client and Connect frameworks,
> > > > > >>> today we
> > > > > >>>>>> also
> > > > > >>>>>>>>> have
> > > > > >>>>>>>>>> a retry logic but that's used in a slightly different
> way.
> > > > > >>> For
> > > > > >>>>>>> example
> > > > > >>>>>>>> in
> > > > > >>>>>>>>>> Streams, we tend to handle the retry logic at the
> > > > > >>> thread-level
> > > > > >>>>> and
> > > > > >>>>>>>> hence
> > > > > >>>>>>>>>> very likely we'd like to change that mechanism in
> KIP-572
> > > > > >>>>> anyways.
> > > > > >>>>>>> For
> > > > > >>>>>>>>>> producer / consumer / admin clients, I think just
> applying
> > > > > >>> this
> > > > > >>>>>>>>> behavioral
> > > > > >>>>>>>>>> change across these clients makes lot of sense. So I
> think
> > > > > >>> can
> > > > > >>>>> just
> > > > > >>>>>>>> leave
> > > > > >>>>>>>>>> the Streams / Connect out of the scope of this KIP to be
> > > > > >>>>> addressed
> > > > > >>>>>> in
> > > > > >>>>>>>>>> separate discussions.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I do not have further comments about this KIP :) LGTM.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Guozhang
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > > > >>>>>>>> skaundinya@gmail.com
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Thanks for the feedback Boyang.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> If there’s anyone else who has feedback regarding this
> > > > > >> KIP,
> > > > > >>>>> would
> > > > > >>>>>>>>> really
> > > > > >>>>>>>>>>> appreciate it hearing it!
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>> Sanjana
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > > > >>>>>> bchen11@outlook.com>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Sounds great!
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Get Outlook for iOS<https://aka.ms/o0ukef>
> > > > > >>>>>>>>>>>> ________________________________
> > > > > >>>>>>>>>>>> From: Sanjana Kaundinya <sk...@gmail.com>
> > > > > >>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > >>>>>>>>>>>> To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > > >>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff
> for
> > > > > >>> Kafka
> > > > > >>>>>>>> Clients
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Thanks for the explanation Boyang. One of the most
> > > > > >> common
> > > > > >>>>>>> problems
> > > > > >>>>>>>>> that
> > > > > >>>>>>>>>>> we
> > > > > >>>>>>>>>>>> have in Kafka is with respect to metadata fetches. For
> > > > > >>>>> example,
> > > > > >>>>>>> if
> > > > > >>>>>>>>> there
> > > > > >>>>>>>>>>> is
> > > > > >>>>>>>>>>>> a broker failure, all clients start to fetch metadata
> > > > > >> at
> > > > > >>> the
> > > > > >>>>>> same
> > > > > >>>>>>>>> time
> > > > > >>>>>>>>>>> and
> > > > > >>>>>>>>>>>> it often takes a while for the metadata to converge.
> > > > > >> In a
> > > > > >>>>> high
> > > > > >>>>>>> load
> > > > > >>>>>>>>>>>> cluster, there are also issues where the volume of
> > > > > >>> metadata
> > > > > >>>>> has
> > > > > >>>>>>>> made
> > > > > >>>>>>>>>>>> convergence of metadata slower.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> For this case, exponential backoff helps as it reduces
> > > > > >>> the
> > > > > >>>>>> retry
> > > > > >>>>>>>>> rate and
> > > > > >>>>>>>>>>>> spaces out how often clients will retry, thereby
> > > > > >> bringing
> > > > > >>>>> down
> > > > > >>>>>>> the
> > > > > >>>>>>>>> time
> > > > > >>>>>>>>>>> for
> > > > > >>>>>>>>>>>> convergence. Something that Jason mentioned that would
> > > > > >>> be a
> > > > > >>>>>> great
> > > > > >>>>>>>>>>> addition
> > > > > >>>>>>>>>>>> here would be if the backoff should be “jittered” as
> it
> > > > > >>> was
> > > > > >>>>> in
> > > > > >>>>>>>>> KIP-144
> > > > > >>>>>>>>>>> with
> > > > > >>>>>>>>>>>> respect to exponential reconnect backoff. This would
> > > > > >> help
> > > > > >>>>>> prevent
> > > > > >>>>>>>> the
> > > > > >>>>>>>>>>>> clients from being synchronized on when they retry,
> > > > > >>> thereby
> > > > > >>>>>>> spacing
> > > > > >>>>>>>>> out
> > > > > >>>>>>>>>>> the
> > > > > >>>>>>>>>>>> number of requests being sent to the broker at the
> same
> > > > > >>> time.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> I’ll add this example to the KIP and flush out more of
> > > > > >>> the
> > > > > >>>>>>> details
> > > > > >>>>>>>> -
> > > > > >>>>>>>>> so
> > > > > >>>>>>>>>>>> it’s more clear.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > > >>>>>>>>> reluctanthero104@gmail.com
> > > > > >>>>>>>>>>>> ,
> > > > > >>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to
> > > > > >>>>>> rephrase
> > > > > >>>>>>> my
> > > > > >>>>>>>>>>>> question
> > > > > >>>>>>>>>>>>> 2 and 3 as my previous response is a little bit
> > > > > >>>>> unactionable.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> My specific point is that exponential backoff is not
> > > > > >> a
> > > > > >>>>> silver
> > > > > >>>>>>>>> bullet
> > > > > >>>>>>>>>>> and
> > > > > >>>>>>>>>>>> we
> > > > > >>>>>>>>>>>>> should consider using it to solve known problems,
> > > > > >>> instead
> > > > > >>>>> of
> > > > > >>>>>>>>> making the
> > > > > >>>>>>>>>>>>> holistic changes to all clients in Kafka ecosystem. I
> > > > > >>> do
> > > > > >>>>> like
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>>>> exponential backoff idea and believe this would be of
> > > > > >>> great
> > > > > >>>>>>>> value,
> > > > > >>>>>>>>> but
> > > > > >>>>>>>>>>>>> maybe we should focus on proposing some existing
> > > > > >>> modules
> > > > > >>>>> that
> > > > > >>>>>>> are
> > > > > >>>>>>>>>>>> suffering
> > > > > >>>>>>>>>>>>> from static retry, and only change them in this first
> > > > > >>> KIP.
> > > > > >>>>> If
> > > > > >>>>>>> in
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>>> future, some other component users believe they are
> > > > > >>> also
> > > > > >>>>>>>>> suffering, we
> > > > > >>>>>>>>>>>>> could get more minor KIPs to change the behavior as
> > > > > >>> well.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Boyang
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > >>>>>>>>>>> skaundinya@gmail.com
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the
> > > > > >> KIP
> > > > > >>>>> with
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>>>>> mathematical relations as per your suggestion. To
> > > > > >>> address
> > > > > >>>>>>> your
> > > > > >>>>>>>>>>>> feedback:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry
> > > > > >>>>> backoff,
> > > > > >>>>>>> in
> > > > > >>>>>>>> 1
> > > > > >>>>>>>>>>> second
> > > > > >>>>>>>>>>>>>> we would have 10 retries. In the case of using an
> > > > > >>>>>> exponential
> > > > > >>>>>>>>>>> backoff,
> > > > > >>>>>>>>>>>> we
> > > > > >>>>>>>>>>>>>> would have a total of 4 retries in 1 second. Thus
> > > > > >> we
> > > > > >>> have
> > > > > >>>>>>> less
> > > > > >>>>>>>>> than
> > > > > >>>>>>>>>>>> half of
> > > > > >>>>>>>>>>>>>> the amount of retries in the same timeframe and can
> > > > > >>>>> lessen
> > > > > >>>>>>>> broker
> > > > > >>>>>>>>>>>> pressure.
> > > > > >>>>>>>>>>>>>> This calculation is done as following (using the
> > > > > >>> formula
> > > > > >>>>>> laid
> > > > > >>>>>>>>> out in
> > > > > >>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>> KIP:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100
> > > > > >>> ms
> > > > > >>>>>>> (default
> > > > > >>>>>>>>> retry
> > > > > >>>>>>>>>>>> ms
> > > > > >>>>>>>>>>>>>> is initially 100 ms)
> > > > > >>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in
> > > > > >>> 200 ms
> > > > > >>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in
> > > > > >>> 400 ms
> > > > > >>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in
> > > > > >>> 800 ms
> > > > > >>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in
> > > > > >>> 1000
> > > > > >>>>> ms
> > > > > >>>>>>>>> (default
> > > > > >>>>>>>>>>> max
> > > > > >>>>>>>>>>>>>> retry ms is 1000 ms)
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what
> > > > > >> you
> > > > > >>> mean
> > > > > >>>>>>> with
> > > > > >>>>>>>>>>> respect
> > > > > >>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>> client timeouts? I’m not very familiar with the
> > > > > >>> Streams
> > > > > >>>>>>>>> framework, so
> > > > > >>>>>>>>>>>> would
> > > > > >>>>>>>>>>>>>> love to get more insight to how that currently
> > > > > >> works,
> > > > > >>>>> with
> > > > > >>>>>>>>> respect to
> > > > > >>>>>>>>>>>>>> producer transactions, so I can appropriately
> > > > > >> update
> > > > > >>> the
> > > > > >>>>>> KIP
> > > > > >>>>>>> to
> > > > > >>>>>>>>>>> address
> > > > > >>>>>>>>>>>>>> these scenarios.
> > > > > >>>>>>>>>>>>>> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > >>>>>>>>>>>> reluctanthero104@gmail.com>,
> > > > > >>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the
> > > > > >> motivation
> > > > > >>> is
> > > > > >>>>>> good,
> > > > > >>>>>>>> but
> > > > > >>>>>>>>>>> lack
> > > > > >>>>>>>>>>>> of
> > > > > >>>>>>>>>>>>>>> more quantitative analysis. For instance:
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> 1. How much retries we are saving by applying the
> > > > > >>>>>>> exponential
> > > > > >>>>>>>>> retry
> > > > > >>>>>>>>>>>> vs
> > > > > >>>>>>>>>>>>>>> static retry? There should be some mathematical
> > > > > >>>>> relations
> > > > > >>>>>>>>> between
> > > > > >>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>> static retry ms, the initial exponential retry
> > > > > >> ms,
> > > > > >>> the
> > > > > >>>>>> max
> > > > > >>>>>>>>>>>> exponential
> > > > > >>>>>>>>>>>>>>> retry ms in a given time interval.
> > > > > >>>>>>>>>>>>>>> 2. How does this affect the client timeout? With
> > > > > >>>>>>> exponential
> > > > > >>>>>>>>> retry,
> > > > > >>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>> client shall be getting easier to timeout on a
> > > > > >>> parent
> > > > > >>>>>> level
> > > > > >>>>>>>>> caller,
> > > > > >>>>>>>>>>>> for
> > > > > >>>>>>>>>>>>>>> instance stream attempts to retry initializing
> > > > > >>> producer
> > > > > >>>>>>>>>>> transactions
> > > > > >>>>>>>>>>>> with
> > > > > >>>>>>>>>>>>>>> given 5 minute interval. With exponential retry
> > > > > >>> this
> > > > > >>>>>>>> mechanism
> > > > > >>>>>>>>>>> could
> > > > > >>>>>>>>>>>>>>> experience more frequent timeout which we should
> > > > > >> be
> > > > > >>>>>> careful
> > > > > >>>>>>>>> with.
> > > > > >>>>>>>>>>>>>>> 3. With regards to #2, we should have more
> > > > > >> detailed
> > > > > >>>>>>> checklist
> > > > > >>>>>>>>> of
> > > > > >>>>>>>>>>> all
> > > > > >>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>> existing static retry scenarios, and adjust the
> > > > > >>> initial
> > > > > >>>>>>>>> exponential
> > > > > >>>>>>>>>>>> retry
> > > > > >>>>>>>>>>>>>>> ms to make sure we won't get easily timeout in
> > > > > >> high
> > > > > >>>>> level
> > > > > >>>>>>> due
> > > > > >>>>>>>>> to
> > > > > >>>>>>>>>>> too
> > > > > >>>>>>>>>>>> few
> > > > > >>>>>>>>>>>>>>> attempts.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Boyang
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> > > > > >> Kaundinya <
> > > > > >>>>>>>>>>>> skaundinya@gmail.com>
> > > > > >>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Hi Everyone,
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> I’ve written a KIP about introducing
> > > > > >> exponential
> > > > > >>>>>> backoff
> > > > > >>>>>>>> for
> > > > > >>>>>>>>>>> Kafka
> > > > > >>>>>>>>>>>>>>>> clients. Would appreciate any feedback on this.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>> Sanjana
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> --
> > > > > >>>>>>>>>> -- Guozhang
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> --
> > > > > >>>>>>> -- Guozhang
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> -- Guozhang
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> -- Guozhang
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Guozhang Wang <wa...@gmail.com>.
In Kafka clients, there are cases where we log a warning when overriding
some conflicting configs, and in some other cases we throw and let the
brokers die during startup; you can check the
`postProcessParsedConfig` function in Producer/ConsumerConfig for such
logic.

I think for this case, it is sufficient to log a warning if we find the
`max` < `backoff`.
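
Roughly along these lines (a standalone, purely illustrative sketch using
SLF4J; the helper name is made up and this is not the actual
`postProcessParsedConfig` code):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical helper, for illustration only.
    public final class RetryBackoffValidation {
        private static final Logger log = LoggerFactory.getLogger(RetryBackoffValidation.class);

        // Warn, rather than fail startup, when retry.backoff.max.ms < retry.backoff.ms.
        public static void warnIfInconsistent(long retryBackoffMs, long retryBackoffMaxMs) {
            if (retryBackoffMaxMs < retryBackoffMs) {
                log.warn("retry.backoff.max.ms ({}) is smaller than retry.backoff.ms ({}); "
                        + "retry.backoff.max.ms will be used as a constant backoff.",
                        retryBackoffMaxMs, retryBackoffMs);
            }
        }
    }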


Guozhang

On Mon, Mar 23, 2020 at 9:18 PM Boyang Chen <re...@gmail.com>
wrote:

> Got it, although I would still like to be aware of the actual backoff I
> will be using in production; having the app crash seems like an
> over-reaction. I don't think I have further questions :)
>
> On Mon, Mar 23, 2020 at 7:36 PM Sanjana Kaundinya <sk...@gmail.com>
> wrote:
>
> > Hey Boyang,
> >
> > If a user provides no config at all then, as you mentioned, they will by
> > default be able to make use of the exponential backoff feature introduced
> > by the KIP. If backoff.ms is overridden to 2000 ms, the lesser of either
> > the max or the computed backoff will be chosen, so in this case the max
> > will be chosen as it is 1000 ms. As Guozhang mentioned, if the user
> > configures something like this then they would notice the behavior not
> > being in line with what they expect, and would see the KIP + Release
> > notes and know to configure backoff.ms < max.backoff.ms. I’m not sure
> > it’s a big enough error to warrant rejecting the configuration if it’s
> > configured like this, as the clients could still run in either case.
> >
> > To answer your second question, we are making the dynamic backoff the
> > default and not allowing for static backoff (unless they set
> > backoff.ms > max.backoff.ms, in which case that would in a sense be
> > static). We will include this information in the release notes to make
> > sure users are aware of this behavior change.
> >
> > Thanks,
> > Sanjana
> >
> > On Mon, Mar 23, 2020 at 6:37 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Hey Sanjana,
> > >
> > > my understanding with the update is that if a user provides no config
> > > at all, a Producer/Consumer/Admin client user would by default enjoy a
> > > starting backoff.ms of 100 ms and a max.backoff.ms of 1000 ms? If I
> > > already override the backoff.ms to 2000 ms, for instance, will I be
> > > choosing the default max.backoff here?
> > >
> > > I guess my question would be whether we should just reject a config
> with
> > > backoff.ms > max.backoff.ms in the first place, as this looks like
> > > mis-configuration to me.
> > >
> > > Second question is whether we allow fallback to static backoffs if the
> > user
> > > wants to do so, or we should just ship this as an opt-in feature?
> > >
> > > Let me know your thoughts.
> > >
> > > Boyang
> > >
> > > On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <ct...@confluent.io> wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > > On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <
> skaundinya@gmail.com
> > >
> > > > wrote:
> > > > >
> > > > > Ah yes that makes sense. I’ll update the KIP to reflect this.
> > > > >
> > > > > Thanks,
> > > > > Sanjana
> > > > >
> > > > > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <wa...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> Following the formula you have in the KIP, if it is simply:
> > > > >>
> > > > >> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))
> > > > >>
> > > > >> then the behavior would stay consistent at retry.backoff.max.ms.
> > > > >>
> > > > >>
> > > > >> Guozhang
> > > > >>
> > > > >> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <
> > > skaundinya@gmail.com
> > > > >
> > > > >> wrote:
> > > > >>
> > > > >>> If that’s the case then what should we base the starting point
> as?
> > > > >>> Currently in the KIP the starting point is retry.backoff.ms and
> it
> > > > >>> exponentially goes up to retry.backoff.max.ms. If
> > > retry.backoff.max.ms
> > > > >> is
> > > > >>> smaller than retry.backoff.ms then that could pose a bit of a
> > > problem
> > > > >>> there right?
> > > > >>>
> > > > >>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <
> wangguoz@gmail.com
> > >,
> > > > >> wrote:
> > > > >>>> Thanks Sanjana, I did not capture the part that Jason referred
> to,
> > > so
> > > > >>>> that's my bad :P
> > > > >>>>
> > > > >>>> Regarding your last statement, I actually feel that instead of
> > take
> > > > the
> > > > >>>> larger of the two, we should respect "retry.backoff.max.ms"
> even
> > if
> > > > it
> > > > >>> is
> > > > >>>> smaller than "retry.backoff.ms". I do not have a very strong
> > > > rationale
> > > > >>>> except it is logically more aligned to the config names.
> > > > >>>>
> > > > >>>>
> > > > >>>> Guozhang
> > > > >>>>
> > > > >>>>
> > > > >>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> > > > >> skaundinya@gmail.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Hey Jason and Guozhang,
> > > > >>>>>
> > > > >>>>> Jason is right, I took this inspiration from KIP-144 (
> > > > >>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > > > >>>>> )
> > > > >>>>> which had the same logic in order to preserve the existing
> > > behavior.
> > > > >> In
> > > > >>>>> this case however, if we are thinking to completely eliminate
> the
> > > > >>> static
> > > > >>>>> backoff behavior, we can do that and as Jason mentioned put it
> in
> > > the
> > > > >>>>> release notes and not add any special logic. In addition I
> agree
> > > that
> > > > >>> we
> > > > >>>>> should take the larger of the two of `retry.backoff.ms` and `
> > > > >>>>> retry.backoff.max.ms`. I'll update the KIP to reflect this and
> > > make
> > > > >> it
> > > > >>>>> clear that the old static retry backoff is getting replaced by
> > the
> > > > >> new
> > > > >>>>> dynamic retry backoff.
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> Sanjana
> > > > >>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <
> > > jason@confluent.io>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Hey Guozhang,
> > > > >>>>>>
> > > > >>>>>> I was referring to this:
> > > > >>>>>>
> > > > >>>>>>> For users who have not set retry.backoff.ms explicitly, the
> > > > >>> default
> > > > >>>>>> behavior will change so that the backoff will grow up to 1000
> > ms.
> > > > >> For
> > > > >>>>> users
> > > > >>>>>> who have set retry.backoff.ms explicitly, the behavior will
> > > remain
> > > > >>> the
> > > > >>>>>> same
> > > > >>>>>> as they could have specific requirements.
> > > > >>>>>>
> > > > >>>>>> I took this to mean that for users who have overridden `
> > > > >>> retry.backoff.ms
> > > > >>>>> `
> > > > >>>>>> to 50ms (say), we will change the default `
> retry.backoff.max.ms
> > `
> > > > >> to
> > > > >>> 50ms
> > > > >>>>>> as
> > > > >>>>>> well in order to preserve existing backoff behavior. Is that
> not
> > > > >>> right?
> > > > >>>>> In
> > > > >>>>>> any case, I agree that we can use the maximum of the two
> values
> > as
> > > > >>> the
> > > > >>>>>> effective `retry.backoff.max.ms` to handle the case when the
> > > > >>> configured
> > > > >>>>>> value of `retry.backoff.ms` is larger than the default of 1s.
> > > > >>>>>>
> > > > >>>>>> -Jason
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <
> > wangguoz@gmail.com
> > > >
> > > > >>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hey Jason,
> > > > >>>>>>>
> > > > >>>>>>> My understanding is a bit different here: even if user has an
> > > > >>> explicit
> > > > >>>>>>> overridden "retry.backoff.ms", the exponential mechanism
> still
> > > > >>>>> triggers
> > > > >>>>>>> and
> > > > >>>>>>> the backoff would be increased till "retry.backoff.max.ms";
> > and
> > > > >>> if the
> > > > >>>>>>> specified "retry.backoff.ms" is already larger than the "
> > > > >>>>>>> retry.backoff.max.ms", we would still take "
> > retry.backoff.max.ms
> > > > >> ".
> > > > >>>>>>>
> > > > >>>>>>> So if the user does override the "retry.backoff.ms" to a
> value
> > > > >>> larger
> > > > >>>>>> than
> > > > >>>>>>> 1s and is not aware of the new config, she would be surprised
> > to
> > > > >>> see
> > > > >>>>> the
> > > > >>>>>>> specified value seemingly not being respected, but she could
> > > > >> still
> > > > >>>>> learn
> > > > >>>>>>> that afterwards by reading the release notes introducing this
> > KIP
> > > > >>>>>> anyways.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Guozhang
> > > > >>>>>>>
> > > > >>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> > > > >>> jason@confluent.io>
> > > > >>>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi Sanjana,
> > > > >>>>>>>>
> > > > >>>>>>>> The KIP looks good to me. I had just one question about the
> > > > >>> default
> > > > >>>>>>>> behavior. As I understand, if the user has specified `
> > > > >>>>> retry.backoff.ms
> > > > >>>>>> `
> > > > >>>>>>>> explicitly, then we will not apply the default max backoff.
> As
> > > > >>> such,
> > > > >>>>>>>> there's no way to get the benefit of this feature if you are
> > > > >>>>> providing
> > > > >>>>>> a
> > > > >>>>>>> `
> > > > >>>>>>>> retry.backoff.ms` unless you also provide `
> > > > >> retry.backoff.max.ms
> > > > >>> `.
> > > > >>>>> That
> > > > >>>>>>>> makes sense if you assume the user is unaware of the new
> > > > >>>>> configuration,
> > > > >>>>>>> but
> > > > >>>>>>>> it is surprising otherwise. Since it's not a semantic change
> > > > >> and
> > > > >>>>> since
> > > > >>>>>>> the
> > > > >>>>>>>> default you're proposing of 1s is fairly low already, I
> wonder
> > > > >> if
> > > > >>>>> it's
> > > > >>>>>>> good
> > > > >>>>>>>> enough to mention the new configuration in the release notes
> > > > >> and
> > > > >>> not
> > > > >>>>>> add
> > > > >>>>>>>> any special logic. What do you think?
> > > > >>>>>>>>
> > > > >>>>>>>> -Jason
> > > > >>>>>>>>
> > > > >>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > > >>>>>> skaundinya@gmail.com>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Thank you for the comments Guozhang.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I’ll leave this KIP out for discussion till the end of the
> > > > >>> week and
> > > > >>>>>>> then
> > > > >>>>>>>>> start a vote for this early next week.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Sanjana
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> > > > >>> wangguoz@gmail.com
> > > > >>>>>> ,
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>>> Hello Sanjana,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of
> > > > >>> sense --
> > > > >>>>>> as
> > > > >>>>>>>> you
> > > > >>>>>>>>>> mentioned in the motivation, we've indeed seen many issues
> > > > >>> with
> > > > >>>>>>> regard
> > > > >>>>>>>> to
> > > > >>>>>>>>>> the frequent retries, with bounded exponential backoff in
> > > > >> the
> > > > >>>>>>> scenario
> > > > >>>>>>>>>> where there's a long connectivity issue we would
> > > > >> effectively
> > > > >>>>> reduce
> > > > >>>>>>> the
> > > > >>>>>>>>>> request load by 10 given the default configs.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> For higher-level Streams client and Connect frameworks,
> > > > >>> today we
> > > > >>>>>> also
> > > > >>>>>>>>> have
> > > > >>>>>>>>>> a retry logic but that's used in a slightly different way.
> > > > >>> For
> > > > >>>>>>> example
> > > > >>>>>>>> in
> > > > >>>>>>>>>> Streams, we tend to handle the retry logic at the
> > > > >>> thread-level
> > > > >>>>> and
> > > > >>>>>>>> hence
> > > > >>>>>>>>>> very likely we'd like to change that mechanism in KIP-572
> > > > >>>>> anyways.
> > > > >>>>>>> For
> > > > >>>>>>>>>> producer / consumer / admin clients, I think just applying
> > > > >>> this
> > > > >>>>>>>>> behavioral
> > > > >>>>>>>>>> change across these clients makes lot of sense. So I think
> > > > >>> can
> > > > >>>>> just
> > > > >>>>>>>> leave
> > > > >>>>>>>>>> the Streams / Connect out of the scope of this KIP to be
> > > > >>>>> addressed
> > > > >>>>>> in
> > > > >>>>>>>>>> separate discussions.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I do not have further comments about this KIP :) LGTM.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Guozhang
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > > >>>>>>>> skaundinya@gmail.com
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Thanks for the feedback Boyang.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> If there’s anyone else who has feedback regarding this
> > > > >> KIP,
> > > > >>>>> would
> > > > >>>>>>>>> really
> > > > >>>>>>>>>>> appreciate it hearing it!
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>> Sanjana
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > > >>>>>> bchen11@outlook.com>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Sounds great!
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Get Outlook for iOS<https://aka.ms/o0ukef>
> > > > >>>>>>>>>>>> ________________________________
> > > > >>>>>>>>>>>> From: Sanjana Kaundinya <sk...@gmail.com>
> > > > >>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > >>>>>>>>>>>> To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > >>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
> > > > >>> Kafka
> > > > >>>>>>>> Clients
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks for the explanation Boyang. One of the most
> > > > >> common
> > > > >>>>>>> problems
> > > > >>>>>>>>> that
> > > > >>>>>>>>>>> we
> > > > >>>>>>>>>>>> have in Kafka is with respect to metadata fetches. For
> > > > >>>>> example,
> > > > >>>>>>> if
> > > > >>>>>>>>> there
> > > > >>>>>>>>>>> is
> > > > >>>>>>>>>>>> a broker failure, all clients start to fetch metadata
> > > > >> at
> > > > >>> the
> > > > >>>>>> same
> > > > >>>>>>>>> time
> > > > >>>>>>>>>>> and
> > > > >>>>>>>>>>>> it often takes a while for the metadata to converge.
> > > > >> In a
> > > > >>>>> high
> > > > >>>>>>> load
> > > > >>>>>>>>>>>> cluster, there are also issues where the volume of
> > > > >>> metadata
> > > > >>>>> has
> > > > >>>>>>>> made
> > > > >>>>>>>>>>>> convergence of metadata slower.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> For this case, exponential backoff helps as it reduces
> > > > >>> the
> > > > >>>>>> retry
> > > > >>>>>>>>> rate and
> > > > >>>>>>>>>>>> spaces out how often clients will retry, thereby
> > > > >> bringing
> > > > >>>>> down
> > > > >>>>>>> the
> > > > >>>>>>>>> time
> > > > >>>>>>>>>>> for
> > > > >>>>>>>>>>>> convergence. Something that Jason mentioned that would
> > > > >>> be a
> > > > >>>>>> great
> > > > >>>>>>>>>>> addition
> > > > >>>>>>>>>>>> here would be if the backoff should be “jittered” as it
> > > > >>> was
> > > > >>>>> in
> > > > >>>>>>>>> KIP-144
> > > > >>>>>>>>>>> with
> > > > >>>>>>>>>>>> respect to exponential reconnect backoff. This would
> > > > >> help
> > > > >>>>>> prevent
> > > > >>>>>>>> the
> > > > >>>>>>>>>>>> clients from being synchronized on when they retry,
> > > > >>> thereby
> > > > >>>>>>> spacing
> > > > >>>>>>>>> out
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>> number of requests being sent to the broker at the same
> > > > >>> time.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I’ll add this example to the KIP and flush out more of
> > > > >>> the
> > > > >>>>>>> details
> > > > >>>>>>>> -
> > > > >>>>>>>>> so
> > > > >>>>>>>>>>>> it’s more clear.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > >>>>>>>>> reluctanthero104@gmail.com
> > > > >>>>>>>>>>>> ,
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to
> > > > >>>>>> rephrase
> > > > >>>>>>> my
> > > > >>>>>>>>>>>> question
> > > > >>>>>>>>>>>>> 2 and 3 as my previous response is a little bit
> > > > >>>>> unactionable.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> My specific point is that exponential backoff is not
> > > > >> a
> > > > >>>>> silver
> > > > >>>>>>>>> bullet
> > > > >>>>>>>>>>> and
> > > > >>>>>>>>>>>> we
> > > > >>>>>>>>>>>>> should consider using it to solve known problems,
> > > > >>> instead
> > > > >>>>> of
> > > > >>>>>>>>> making the
> > > > >>>>>>>>>>>>> holistic changes to all clients in Kafka ecosystem. I
> > > > >>> do
> > > > >>>>> like
> > > > >>>>>>> the
> > > > >>>>>>>>>>>>> exponential backoff idea and believe this would be of
> > > > >>> great
> > > > >>>>>>>> value,
> > > > >>>>>>>>> but
> > > > >>>>>>>>>>>>> maybe we should focus on proposing some existing
> > > > >>> modules
> > > > >>>>> that
> > > > >>>>>>> are
> > > > >>>>>>>>>>>> suffering
> > > > >>>>>>>>>>>>> from static retry, and only change them in this first
> > > > >>> KIP.
> > > > >>>>> If
> > > > >>>>>>> in
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>> future, some other component users believe they are
> > > > >>> also
> > > > >>>>>>>>> suffering, we
> > > > >>>>>>>>>>>>> could get more minor KIPs to change the behavior as
> > > > >>> well.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Boyang
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > >>>>>>>>>>> skaundinya@gmail.com
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the
> > > > >> KIP
> > > > >>>>> with
> > > > >>>>>>> the
> > > > >>>>>>>>>>>>>> mathematical relations as per your suggestion. To
> > > > >>> address
> > > > >>>>>>> your
> > > > >>>>>>>>>>>> feedback:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry
> > > > >>>>> backoff,
> > > > >>>>>>> in
> > > > >>>>>>>> 1
> > > > >>>>>>>>>>> second
> > > > >>>>>>>>>>>>>> we would have 10 retries. In the case of using an
> > > > >>>>>> exponential
> > > > >>>>>>>>>>> backoff,
> > > > >>>>>>>>>>>> we
> > > > >>>>>>>>>>>>>> would have a total of 4 retries in 1 second. Thus
> > > > >> we
> > > > >>> have
> > > > >>>>>>> less
> > > > >>>>>>>>> than
> > > > >>>>>>>>>>>> half of
> > > > >>>>>>>>>>>>>> the amount of retries in the same timeframe and can
> > > > >>>>> lessen
> > > > >>>>>>>> broker
> > > > >>>>>>>>>>>> pressure.
> > > > >>>>>>>>>>>>>> This calculation is done as following (using the
> > > > >>> formula
> > > > >>>>>> laid
> > > > >>>>>>>>> out in
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>> KIP:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100
> > > > >>> ms
> > > > >>>>>>> (default
> > > > >>>>>>>>> retry
> > > > >>>>>>>>>>>> ms
> > > > >>>>>>>>>>>>>> is initially 100 ms)
> > > > >>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in
> > > > >>> 200 ms
> > > > >>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in
> > > > >>> 400 ms
> > > > >>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in
> > > > >>> 800 ms
> > > > >>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in
> > > > >>> 1000
> > > > >>>>> ms
> > > > >>>>>>>>> (default
> > > > >>>>>>>>>>> max
> > > > >>>>>>>>>>>>>> retry ms is 1000 ms)
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what
> > > > >> you
> > > > >>> mean
> > > > >>>>>>> with
> > > > >>>>>>>>>>> respect
> > > > >>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> client timeouts? I’m not very familiar with the
> > > > >>> Streams
> > > > >>>>>>>>> framework, so
> > > > >>>>>>>>>>>> would
> > > > >>>>>>>>>>>>>> love to get more insight to how that currently
> > > > >> works,
> > > > >>>>> with
> > > > >>>>>>>>> respect to
> > > > >>>>>>>>>>>>>> producer transactions, so I can appropriately
> > > > >> update
> > > > >>> the
> > > > >>>>>> KIP
> > > > >>>>>>> to
> > > > >>>>>>>>>>> address
> > > > >>>>>>>>>>>>>> these scenarios.
> > > > >>>>>>>>>>>>>> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > >>>>>>>>>>>> reluctanthero104@gmail.com>,
> > > > >>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the
> > > > >> motivation
> > > > >>> is
> > > > >>>>>> good,
> > > > >>>>>>>> but
> > > > >>>>>>>>>>> lack
> > > > >>>>>>>>>>>> of
> > > > >>>>>>>>>>>>>>> more quantitative analysis. For instance:
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> 1. How much retries we are saving by applying the
> > > > >>>>>>> exponential
> > > > >>>>>>>>> retry
> > > > >>>>>>>>>>>> vs
> > > > >>>>>>>>>>>>>>> static retry? There should be some mathematical
> > > > >>>>> relations
> > > > >>>>>>>>> between
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> static retry ms, the initial exponential retry
> > > > >> ms,
> > > > >>> the
> > > > >>>>>> max
> > > > >>>>>>>>>>>> exponential
> > > > >>>>>>>>>>>>>>> retry ms in a given time interval.
> > > > >>>>>>>>>>>>>>> 2. How does this affect the client timeout? With
> > > > >>>>>>> exponential
> > > > >>>>>>>>> retry,
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> client shall be getting easier to timeout on a
> > > > >>> parent
> > > > >>>>>> level
> > > > >>>>>>>>> caller,
> > > > >>>>>>>>>>>> for
> > > > >>>>>>>>>>>>>>> instance stream attempts to retry initializing
> > > > >>> producer
> > > > >>>>>>>>>>> transactions
> > > > >>>>>>>>>>>> with
> > > > >>>>>>>>>>>>>>> given 5 minute interval. With exponential retry
> > > > >>> this
> > > > >>>>>>>> mechanism
> > > > >>>>>>>>>>> could
> > > > >>>>>>>>>>>>>>> experience more frequent timeout which we should
> > > > >> be
> > > > >>>>>> careful
> > > > >>>>>>>>> with.
> > > > >>>>>>>>>>>>>>> 3. With regards to #2, we should have more
> > > > >> detailed
> > > > >>>>>>> checklist
> > > > >>>>>>>>> of
> > > > >>>>>>>>>>> all
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> existing static retry scenarios, and adjust the
> > > > >>> initial
> > > > >>>>>>>>> exponential
> > > > >>>>>>>>>>>> retry
> > > > >>>>>>>>>>>>>>> ms to make sure we won't get easily timeout in
> > > > >> high
> > > > >>>>> level
> > > > >>>>>>> due
> > > > >>>>>>>>> to
> > > > >>>>>>>>>>> too
> > > > >>>>>>>>>>>> few
> > > > >>>>>>>>>>>>>>> attempts.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Boyang
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> > > > >> Kaundinya <
> > > > >>>>>>>>>>>> skaundinya@gmail.com>
> > > > >>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Hi Everyone,
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> I’ve written a KIP about introducing
> > > > >> exponential
> > > > >>>>>> backoff
> > > > >>>>>>>> for
> > > > >>>>>>>>>>> Kafka
> > > > >>>>>>>>>>>>>>>> clients. Would appreciate any feedback on this.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>> Sanjana
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> --
> > > > >>>>>>>>>> -- Guozhang
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> --
> > > > >>>>>>> -- Guozhang
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> --
> > > > >>>> -- Guozhang
> > > > >>>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> -- Guozhang
> > > > >>
> > > >
> > > >
> > >
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Boyang Chen <re...@gmail.com>.
Got it. Although I would still like to be aware of the actual backoff I
will be using in production, having the app crash seems like an
over-reaction. I don't think I have further questions :)

On Mon, Mar 23, 2020 at 7:36 PM Sanjana Kaundinya <sk...@gmail.com>
wrote:

> Hey Sanjana,
>
> Hey Boyang,
>
> If a user provides no config at all then as you mentioned they will be
> default be able to make use of the exponential back off feature introduced
> by the KIP. If the backoff.ms is overriden to 2000 ms, the lesser of
> either
> the max or the computed back off will be chosen, so in this case the max
> will be chosen as it is 1000 ms. As Guozhang mentioned if the user
> configures something like this then they would notice the behavior to not
> be in line what they expect and would see the KIP + Release notes and know
> to configure it to be backoff.ms < max backoff.ms. I’m not sure if its as
> big of an error to reject the configuration if it’s configured like this,
> as the clients could still run in either case.
>
> To answer your second question, we are making the dynamic backoff the
> default and not allowing for static backoff (unless they set backoff.ms >
> max.backof.ms, then that would in a sense be static) We will include this
> information in the release notes to make sure users are aware of this
> behavior change.
>
> Thanks,
> Sanjana
>
> On Mon, Mar 23, 2020 at 6:37 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Hey Sanjana,
> >
> > my understanding with the update is that if a user provides no config at
> > all, a Producer/Consumer/Admin client user would by default enjoying a
> > starting backoff.ms as 100 ms and max.backoff.ms as 1000 ms? If I
> already
> > override the backoff.ms to 2000 ms for instance, will I be choosing the
> > default max.backoff here?
> >
> > I guess my question would be whether we should just reject a config with
> > backoff.ms > max.backoff.ms in the first place, as this looks like
> > mis-configuration to me.
> >
> > Second question is whether we allow fallback to static backoffs if the
> user
> > wants to do so, or we should just ship this as an opt-in feature?
> >
> > Let me know your thoughts.
> >
> > Boyang
> >
> > On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <ct...@confluent.io> wrote:
> >
> > > +1 (non-binding)
> > >
> > > > On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <skaundinya@gmail.com
> >
> > > wrote:
> > > >
> > > > Ah yes that makes sense. I’ll update the KIP to reflect this.
> > > >
> > > > Thanks,
> > > > Sanjana
> > > >
> > > > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > >> Following the formula you have in the KIP, if it is simply:
> > > >>
> > > >> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) *
> > > random(
> > > >> 0.8, 1.2))
> > > >>
> > > >> then the behavior would stay consistent at retry.backoff.max.ms.
> > > >>
> > > >>
> > > >> Guozhang
> > > >>
> > > >> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <
> > skaundinya@gmail.com
> > > >
> > > >> wrote:
> > > >>
> > > >>> If that’s the case then what should we base the starting point as?
> > > >>> Currently in the KIP the starting point is retry.backoff.ms and it
> > > >>> exponentially goes up to retry.backoff.max.ms. If
> > retry.backoff.max.ms
> > > >> is
> > > >>> smaller than retry.backoff.ms then that could pose a bit of a
> > problem
> > > >>> there right?
> > > >>>
> > > >>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wangguoz@gmail.com
> >,
> > > >> wrote:
> > > >>>> Thanks Sanjana, I did not capture the part that Jason referred to,
> > so
> > > >>>> that's my bad :P
> > > >>>>
> > > >>>> Regarding your last statement, I actually feel that instead of
> take
> > > the
> > > >>>> larger of the two, we should respect "retry.backoff.max.ms" even
> if
> > > it
> > > >>> is
> > > >>>> smaller than "retry.backoff.ms". I do not have a very strong
> > > rationale
> > > >>>> except it is logically more aligned to the config names.
> > > >>>>
> > > >>>>
> > > >>>> Guozhang
> > > >>>>
> > > >>>>
> > > >>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> > > >> skaundinya@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hey Jason and Guozhang,
> > > >>>>>
> > > >>>>> Jason is right, I took this inspiration from KIP-144 (
> > > >>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > > >>>>> )
> > > >>>>> which had the same logic in order to preserve the existing
> > behavior.
> > > >> In
> > > >>>>> this case however, if we are thinking to completely eliminate the
> > > >>> static
> > > >>>>> backoff behavior, we can do that and as Jason mentioned put it in
> > the
> > > >>>>> release notes and not add any special logic. In addition I agree
> > that
> > > >>> we
> > > >>>>> should take the larger of the two of `retry.backoff.ms` and `
> > > >>>>> retry.backoff.max.ms`. I'll update the KIP to reflect this and
> > make
> > > >> it
> > > >>>>> clear that the old static retry backoff is getting replaced by
> the
> > > >> new
> > > >>>>> dynamic retry backoff.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Sanjana
> > > >>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <
> > jason@confluent.io>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hey Guozhang,
> > > >>>>>>
> > > >>>>>> I was referring to this:
> > > >>>>>>
> > > >>>>>>> For users who have not set retry.backoff.ms explicitly, the
> > > >>> default
> > > >>>>>> behavior will change so that the backoff will grow up to 1000
> ms.
> > > >> For
> > > >>>>> users
> > > >>>>>> who have set retry.backoff.ms explicitly, the behavior will
> > remain
> > > >>> the
> > > >>>>>> same
> > > >>>>>> as they could have specific requirements.
> > > >>>>>>
> > > >>>>>> I took this to mean that for users who have overridden `
> > > >>> retry.backoff.ms
> > > >>>>> `
> > > >>>>>> to 50ms (say), we will change the default `retry.backoff.max.ms
> `
> > > >> to
> > > >>> 50ms
> > > >>>>>> as
> > > >>>>>> well in order to preserve existing backoff behavior. Is that not
> > > >>> right?
> > > >>>>> In
> > > >>>>>> any case, I agree that we can use the maximum of the two values
> as
> > > >>> the
> > > >>>>>> effective `retry.backoff.max.ms` to handle the case when the
> > > >>> configured
> > > >>>>>> value of `retry.backoff.ms` is larger than the default of 1s.
> > > >>>>>>
> > > >>>>>> -Jason
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <
> wangguoz@gmail.com
> > >
> > > >>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hey Jason,
> > > >>>>>>>
> > > >>>>>>> My understanding is a bit different here: even if user has an
> > > >>> explicit
> > > >>>>>>> overridden "retry.backoff.ms", the exponential mechanism still
> > > >>>>> triggers
> > > >>>>>>> and
> > > >>>>>>> the backoff would be increased till "retry.backoff.max.ms";
> and
> > > >>> if the
> > > >>>>>>> specified "retry.backoff.ms" is already larger than the "
> > > >>>>>>> retry.backoff.max.ms", we would still take "
> retry.backoff.max.ms
> > > >> ".
> > > >>>>>>>
> > > >>>>>>> So if the user does override the "retry.backoff.ms" to a value
> > > >>> larger
> > > >>>>>> than
> > > >>>>>>> 1s and is not aware of the new config, she would be surprised
> to
> > > >>> see
> > > >>>>> the
> > > >>>>>>> specified value seemingly not being respected, but she could
> > > >> still
> > > >>>>> learn
> > > >>>>>>> that afterwards by reading the release notes introducing this
> KIP
> > > >>>>>> anyways.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Guozhang
> > > >>>>>>>
> > > >>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> > > >>> jason@confluent.io>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Sanjana,
> > > >>>>>>>>
> > > >>>>>>>> The KIP looks good to me. I had just one question about the
> > > >>> default
> > > >>>>>>>> behavior. As I understand, if the user has specified `
> > > >>>>> retry.backoff.ms
> > > >>>>>> `
> > > >>>>>>>> explicitly, then we will not apply the default max backoff. As
> > > >>> such,
> > > >>>>>>>> there's no way to get the benefit of this feature if you are
> > > >>>>> providing
> > > >>>>>> a
> > > >>>>>>> `
> > > >>>>>>>> retry.backoff.ms` unless you also provide `
> > > >> retry.backoff.max.ms
> > > >>> `.
> > > >>>>> That
> > > >>>>>>>> makes sense if you assume the user is unaware of the new
> > > >>>>> configuration,
> > > >>>>>>> but
> > > >>>>>>>> it is surprising otherwise. Since it's not a semantic change
> > > >> and
> > > >>>>> since
> > > >>>>>>> the
> > > >>>>>>>> default you're proposing of 1s is fairly low already, I wonder
> > > >> if
> > > >>>>> it's
> > > >>>>>>> good
> > > >>>>>>>> enough to mention the new configuration in the release notes
> > > >> and
> > > >>> not
> > > >>>>>> add
> > > >>>>>>>> any special logic. What do you think?
> > > >>>>>>>>
> > > >>>>>>>> -Jason
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > >>>>>> skaundinya@gmail.com>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Thank you for the comments Guozhang.
> > > >>>>>>>>>
> > > >>>>>>>>> I’ll leave this KIP out for discussion till the end of the
> > > >>> week and
> > > >>>>>>> then
> > > >>>>>>>>> start a vote for this early next week.
> > > >>>>>>>>>
> > > >>>>>>>>> Sanjana
> > > >>>>>>>>>
> > > >>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> > > >>> wangguoz@gmail.com
> > > >>>>>> ,
> > > >>>>>>>> wrote:
> > > >>>>>>>>>> Hello Sanjana,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of
> > > >>> sense --
> > > >>>>>> as
> > > >>>>>>>> you
> > > >>>>>>>>>> mentioned in the motivation, we've indeed seen many issues
> > > >>> with
> > > >>>>>>> regard
> > > >>>>>>>> to
> > > >>>>>>>>>> the frequent retries, with bounded exponential backoff in
> > > >> the
> > > >>>>>>> scenario
> > > >>>>>>>>>> where there's a long connectivity issue we would
> > > >> effectively
> > > >>>>> reduce
> > > >>>>>>> the
> > > >>>>>>>>>> request load by 10 given the default configs.
> > > >>>>>>>>>>
> > > >>>>>>>>>> For higher-level Streams client and Connect frameworks,
> > > >>> today we
> > > >>>>>> also
> > > >>>>>>>>> have
> > > >>>>>>>>>> a retry logic but that's used in a slightly different way.
> > > >>> For
> > > >>>>>>> example
> > > >>>>>>>> in
> > > >>>>>>>>>> Streams, we tend to handle the retry logic at the
> > > >>> thread-level
> > > >>>>> and
> > > >>>>>>>> hence
> > > >>>>>>>>>> very likely we'd like to change that mechanism in KIP-572
> > > >>>>> anyways.
> > > >>>>>>> For
> > > >>>>>>>>>> producer / consumer / admin clients, I think just applying
> > > >>> this
> > > >>>>>>>>> behavioral
> > > >>>>>>>>>> change across these clients makes lot of sense. So I think
> > > >>> can
> > > >>>>> just
> > > >>>>>>>> leave
> > > >>>>>>>>>> the Streams / Connect out of the scope of this KIP to be
> > > >>>>> addressed
> > > >>>>>> in
> > > >>>>>>>>>> separate discussions.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I do not have further comments about this KIP :) LGTM.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Guozhang
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > >>>>>>>> skaundinya@gmail.com
> > > >>>>>>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Thanks for the feedback Boyang.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> If there’s anyone else who has feedback regarding this
> > > >> KIP,
> > > >>>>> would
> > > >>>>>>>>> really
> > > >>>>>>>>>>> appreciate it hearing it!
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>> Sanjana
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > >>>>>> bchen11@outlook.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Sounds great!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Get Outlook for iOS<https://aka.ms/o0ukef>
> > > >>>>>>>>>>>> ________________________________
> > > >>>>>>>>>>>> From: Sanjana Kaundinya <sk...@gmail.com>
> > > >>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > >>>>>>>>>>>> To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > >>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
> > > >>> Kafka
> > > >>>>>>>> Clients
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks for the explanation Boyang. One of the most
> > > >> common
> > > >>>>>>> problems
> > > >>>>>>>>> that
> > > >>>>>>>>>>> we
> > > >>>>>>>>>>>> have in Kafka is with respect to metadata fetches. For
> > > >>>>> example,
> > > >>>>>>> if
> > > >>>>>>>>> there
> > > >>>>>>>>>>> is
> > > >>>>>>>>>>>> a broker failure, all clients start to fetch metadata
> > > >> at
> > > >>> the
> > > >>>>>> same
> > > >>>>>>>>> time
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>> it often takes a while for the metadata to converge.
> > > >> In a
> > > >>>>> high
> > > >>>>>>> load
> > > >>>>>>>>>>>> cluster, there are also issues where the volume of
> > > >>> metadata
> > > >>>>> has
> > > >>>>>>>> made
> > > >>>>>>>>>>>> convergence of metadata slower.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> For this case, exponential backoff helps as it reduces
> > > >>> the
> > > >>>>>> retry
> > > >>>>>>>>> rate and
> > > >>>>>>>>>>>> spaces out how often clients will retry, thereby
> > > >> bringing
> > > >>>>> down
> > > >>>>>>> the
> > > >>>>>>>>> time
> > > >>>>>>>>>>> for
> > > >>>>>>>>>>>> convergence. Something that Jason mentioned that would
> > > >>> be a
> > > >>>>>> great
> > > >>>>>>>>>>> addition
> > > >>>>>>>>>>>> here would be if the backoff should be “jittered” as it
> > > >>> was
> > > >>>>> in
> > > >>>>>>>>> KIP-144
> > > >>>>>>>>>>> with
> > > >>>>>>>>>>>> respect to exponential reconnect backoff. This would
> > > >> help
> > > >>>>>> prevent
> > > >>>>>>>> the
> > > >>>>>>>>>>>> clients from being synchronized on when they retry,
> > > >>> thereby
> > > >>>>>>> spacing
> > > >>>>>>>>> out
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>> number of requests being sent to the broker at the same
> > > >>> time.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I’ll add this example to the KIP and flush out more of
> > > >>> the
> > > >>>>>>> details
> > > >>>>>>>> -
> > > >>>>>>>>> so
> > > >>>>>>>>>>>> it’s more clear.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > >>>>>>>>> reluctanthero104@gmail.com
> > > >>>>>>>>>>>> ,
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to
> > > >>>>>> rephrase
> > > >>>>>>> my
> > > >>>>>>>>>>>> question
> > > >>>>>>>>>>>>> 2 and 3 as my previous response is a little bit
> > > >>>>> unactionable.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> My specific point is that exponential backoff is not
> > > >> a
> > > >>>>> silver
> > > >>>>>>>>> bullet
> > > >>>>>>>>>>> and
> > > >>>>>>>>>>>> we
> > > >>>>>>>>>>>>> should consider using it to solve known problems,
> > > >>> instead
> > > >>>>> of
> > > >>>>>>>>> making the
> > > >>>>>>>>>>>>> holistic changes to all clients in Kafka ecosystem. I
> > > >>> do
> > > >>>>> like
> > > >>>>>>> the
> > > >>>>>>>>>>>>> exponential backoff idea and believe this would be of
> > > >>> great
> > > >>>>>>>> value,
> > > >>>>>>>>> but
> > > >>>>>>>>>>>>> maybe we should focus on proposing some existing
> > > >>> modules
> > > >>>>> that
> > > >>>>>>> are
> > > >>>>>>>>>>>> suffering
> > > >>>>>>>>>>>>> from static retry, and only change them in this first
> > > >>> KIP.
> > > >>>>> If
> > > >>>>>>> in
> > > >>>>>>>>> the
> > > >>>>>>>>>>>>> future, some other component users believe they are
> > > >>> also
> > > >>>>>>>>> suffering, we
> > > >>>>>>>>>>>>> could get more minor KIPs to change the behavior as
> > > >>> well.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Boyang
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > >>>>>>>>>>> skaundinya@gmail.com
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the
> > > >> KIP
> > > >>>>> with
> > > >>>>>>> the
> > > >>>>>>>>>>>>>> mathematical relations as per your suggestion. To
> > > >>> address
> > > >>>>>>> your
> > > >>>>>>>>>>>> feedback:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry
> > > >>>>> backoff,
> > > >>>>>>> in
> > > >>>>>>>> 1
> > > >>>>>>>>>>> second
> > > >>>>>>>>>>>>>> we would have 10 retries. In the case of using an
> > > >>>>>> exponential
> > > >>>>>>>>>>> backoff,
> > > >>>>>>>>>>>> we
> > > >>>>>>>>>>>>>> would have a total of 4 retries in 1 second. Thus
> > > >> we
> > > >>> have
> > > >>>>>>> less
> > > >>>>>>>>> than
> > > >>>>>>>>>>>> half of
> > > >>>>>>>>>>>>>> the amount of retries in the same timeframe and can
> > > >>>>> lessen
> > > >>>>>>>> broker
> > > >>>>>>>>>>>> pressure.
> > > >>>>>>>>>>>>>> This calculation is done as following (using the
> > > >>> formula
> > > >>>>>> laid
> > > >>>>>>>>> out in
> > > >>>>>>>>>>>> the
> > > >>>>>>>>>>>>>> KIP:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100
> > > >>> ms
> > > >>>>>>> (default
> > > >>>>>>>>> retry
> > > >>>>>>>>>>>> ms
> > > >>>>>>>>>>>>>> is initially 100 ms)
> > > >>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in
> > > >>> 200 ms
> > > >>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in
> > > >>> 400 ms
> > > >>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in
> > > >>> 800 ms
> > > >>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in
> > > >>> 1000
> > > >>>>> ms
> > > >>>>>>>>> (default
> > > >>>>>>>>>>> max
> > > >>>>>>>>>>>>>> retry ms is 1000 ms)
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what
> > > >> you
> > > >>> mean
> > > >>>>>>> with
> > > >>>>>>>>>>> respect
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> client timeouts? I’m not very familiar with the
> > > >>> Streams
> > > >>>>>>>>> framework, so
> > > >>>>>>>>>>>> would
> > > >>>>>>>>>>>>>> love to get more insight to how that currently
> > > >> works,
> > > >>>>> with
> > > >>>>>>>>> respect to
> > > >>>>>>>>>>>>>> producer transactions, so I can appropriately
> > > >> update
> > > >>> the
> > > >>>>>> KIP
> > > >>>>>>> to
> > > >>>>>>>>>>> address
> > > >>>>>>>>>>>>>> these scenarios.
> > > >>>>>>>>>>>>>> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > >>>>>>>>>>>> reluctanthero104@gmail.com>,
> > > >>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the
> > > >> motivation
> > > >>> is
> > > >>>>>> good,
> > > >>>>>>>> but
> > > >>>>>>>>>>> lack
> > > >>>>>>>>>>>> of
> > > >>>>>>>>>>>>>>> more quantitative analysis. For instance:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> 1. How much retries we are saving by applying the
> > > >>>>>>> exponential
> > > >>>>>>>>> retry
> > > >>>>>>>>>>>> vs
> > > >>>>>>>>>>>>>>> static retry? There should be some mathematical
> > > >>>>> relations
> > > >>>>>>>>> between
> > > >>>>>>>>>>> the
> > > >>>>>>>>>>>>>>> static retry ms, the initial exponential retry
> > > >> ms,
> > > >>> the
> > > >>>>>> max
> > > >>>>>>>>>>>> exponential
> > > >>>>>>>>>>>>>>> retry ms in a given time interval.
> > > >>>>>>>>>>>>>>> 2. How does this affect the client timeout? With
> > > >>>>>>> exponential
> > > >>>>>>>>> retry,
> > > >>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>> client shall be getting easier to timeout on a
> > > >>> parent
> > > >>>>>> level
> > > >>>>>>>>> caller,
> > > >>>>>>>>>>>> for
> > > >>>>>>>>>>>>>>> instance stream attempts to retry initializing
> > > >>> producer
> > > >>>>>>>>>>> transactions
> > > >>>>>>>>>>>> with
> > > >>>>>>>>>>>>>>> given 5 minute interval. With exponential retry
> > > >>> this
> > > >>>>>>>> mechanism
> > > >>>>>>>>>>> could
> > > >>>>>>>>>>>>>>> experience more frequent timeout which we should
> > > >> be
> > > >>>>>> careful
> > > >>>>>>>>> with.
> > > >>>>>>>>>>>>>>> 3. With regards to #2, we should have more
> > > >> detailed
> > > >>>>>>> checklist
> > > >>>>>>>>> of
> > > >>>>>>>>>>> all
> > > >>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>> existing static retry scenarios, and adjust the
> > > >>> initial
> > > >>>>>>>>> exponential
> > > >>>>>>>>>>>> retry
> > > >>>>>>>>>>>>>>> ms to make sure we won't get easily timeout in
> > > >> high
> > > >>>>> level
> > > >>>>>>> due
> > > >>>>>>>>> to
> > > >>>>>>>>>>> too
> > > >>>>>>>>>>>> few
> > > >>>>>>>>>>>>>>> attempts.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Boyang
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> > > >> Kaundinya <
> > > >>>>>>>>>>>> skaundinya@gmail.com>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi Everyone,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> I’ve written a KIP about introducing
> > > >> exponential
> > > >>>>>> backoff
> > > >>>>>>>> for
> > > >>>>>>>>>>> Kafka
> > > >>>>>>>>>>>>>>>> clients. Would appreciate any feedback on this.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>> Sanjana
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> -- Guozhang
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> -- Guozhang
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> -- Guozhang
> > > >>>
> > > >>
> > > >>
> > > >> --
> > > >> -- Guozhang
> > > >>
> > >
> > >
> >
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
Hey Boyang,

If a user provides no config at all then, as you mentioned, they will by
default be able to make use of the exponential backoff feature introduced
by the KIP. If backoff.ms is overridden to 2000 ms, the lesser of either
the max or the computed backoff will be chosen, so in this case the max
will be chosen as it is 1000 ms. As Guozhang mentioned, if the user
configures something like this then they would notice the behavior not
being in line with what they expect, and would see the KIP + release notes
and know to configure it so that backoff.ms < max.backoff.ms. I’m not sure
it’s a big enough error to warrant rejecting the configuration outright,
as the clients could still run in either case.
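
To make this concrete, here is a minimal sketch of the bounded exponential
backoff formula discussed earlier in the thread. It is Java and purely
illustrative; the class and method names are mine and not part of the
actual client code:

import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {
    // MIN(retry.backoff.max.ms, retry.backoff.ms * 2^(failures - 1) * random(0.8, 1.2)),
    // following the formula quoted above; not the real client implementation.
    static long computeBackoffMs(long retryBackoffMs, long retryBackoffMaxMs, int failures) {
        double exponential = retryBackoffMs * Math.pow(2, failures - 1);
        double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
        return (long) Math.min((double) retryBackoffMaxMs, exponential * jitter);
    }

    public static void main(String[] args) {
        // With retry.backoff.ms = 2000 and retry.backoff.max.ms = 1000, the MIN clamps
        // every attempt to at most 1000 ms, which is the behavior described above.
        System.out.println(computeBackoffMs(2000, 1000, 1));
    }
}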

To answer your second question, we are making the dynamic backoff the
default and not allowing for static backoff (unless they set backoff.ms >
max.backoff.ms, in which case the backoff would in a sense be static). We
will include this information in the release notes to make sure users are
aware of this behavior change.

Thanks,
Sanjana

On Mon, Mar 23, 2020 at 6:37 PM Boyang Chen <re...@gmail.com>
wrote:

> Hey Sanjana,
>
> my understanding with the update is that if a user provides no config at
> all, a Producer/Consumer/Admin client user would by default enjoying a
> starting backoff.ms as 100 ms and max.backoff.ms as 1000 ms? If I already
> override the backoff.ms to 2000 ms for instance, will I be choosing the
> default max.backoff here?
>
> I guess my question would be whether we should just reject a config with
> backoff.ms > max.backoff.ms in the first place, as this looks like
> mis-configuration to me.
>
> Second question is whether we allow fallback to static backoffs if the user
> wants to do so, or we should just ship this as an opt-in feature?
>
> Let me know your thoughts.
>
> Boyang
>
> On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <ct...@confluent.io> wrote:
>
> > +1 (non-binding)
> >
> > > On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <sk...@gmail.com>
> > wrote:
> > >
> > > Ah yes that makes sense. I’ll update the KIP to reflect this.
> > >
> > > Thanks,
> > > Sanjana
> > >
> > > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > >> Following the formula you have in the KIP, if it is simply:
> > >>
> > >> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) *
> > random(
> > >> 0.8, 1.2))
> > >>
> > >> then the behavior would stay consistent at retry.backoff.max.ms.
> > >>
> > >>
> > >> Guozhang
> > >>
> > >> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <
> skaundinya@gmail.com
> > >
> > >> wrote:
> > >>
> > >>> If that’s the case then what should we base the starting point as?
> > >>> Currently in the KIP the starting point is retry.backoff.ms and it
> > >>> exponentially goes up to retry.backoff.max.ms. If
> retry.backoff.max.ms
> > >> is
> > >>> smaller than retry.backoff.ms then that could pose a bit of a
> problem
> > >>> there right?
> > >>>
> > >>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wa...@gmail.com>,
> > >> wrote:
> > >>>> Thanks Sanjana, I did not capture the part that Jason referred to,
> so
> > >>>> that's my bad :P
> > >>>>
> > >>>> Regarding your last statement, I actually feel that instead of take
> > the
> > >>>> larger of the two, we should respect "retry.backoff.max.ms" even if
> > it
> > >>> is
> > >>>> smaller than "retry.backoff.ms". I do not have a very strong
> > rationale
> > >>>> except it is logically more aligned to the config names.
> > >>>>
> > >>>>
> > >>>> Guozhang
> > >>>>
> > >>>>
> > >>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> > >> skaundinya@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hey Jason and Guozhang,
> > >>>>>
> > >>>>> Jason is right, I took this inspiration from KIP-144 (
> > >>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > >>>>> )
> > >>>>> which had the same logic in order to preserve the existing
> behavior.
> > >> In
> > >>>>> this case however, if we are thinking to completely eliminate the
> > >>> static
> > >>>>> backoff behavior, we can do that and as Jason mentioned put it in
> the
> > >>>>> release notes and not add any special logic. In addition I agree
> that
> > >>> we
> > >>>>> should take the larger of the two of `retry.backoff.ms` and `
> > >>>>> retry.backoff.max.ms`. I'll update the KIP to reflect this and
> make
> > >> it
> > >>>>> clear that the old static retry backoff is getting replaced by the
> > >> new
> > >>>>> dynamic retry backoff.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Sanjana
> > >>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <
> jason@confluent.io>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hey Guozhang,
> > >>>>>>
> > >>>>>> I was referring to this:
> > >>>>>>
> > >>>>>>> For users who have not set retry.backoff.ms explicitly, the
> > >>> default
> > >>>>>> behavior will change so that the backoff will grow up to 1000 ms.
> > >> For
> > >>>>> users
> > >>>>>> who have set retry.backoff.ms explicitly, the behavior will
> remain
> > >>> the
> > >>>>>> same
> > >>>>>> as they could have specific requirements.
> > >>>>>>
> > >>>>>> I took this to mean that for users who have overridden `
> > >>> retry.backoff.ms
> > >>>>> `
> > >>>>>> to 50ms (say), we will change the default `retry.backoff.max.ms`
> > >> to
> > >>> 50ms
> > >>>>>> as
> > >>>>>> well in order to preserve existing backoff behavior. Is that not
> > >>> right?
> > >>>>> In
> > >>>>>> any case, I agree that we can use the maximum of the two values as
> > >>> the
> > >>>>>> effective `retry.backoff.max.ms` to handle the case when the
> > >>> configured
> > >>>>>> value of `retry.backoff.ms` is larger than the default of 1s.
> > >>>>>>
> > >>>>>> -Jason
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wangguoz@gmail.com
> >
> > >>>>> wrote:
> > >>>>>>
> > >>>>>>> Hey Jason,
> > >>>>>>>
> > >>>>>>> My understanding is a bit different here: even if user has an
> > >>> explicit
> > >>>>>>> overridden "retry.backoff.ms", the exponential mechanism still
> > >>>>> triggers
> > >>>>>>> and
> > >>>>>>> the backoff would be increased till "retry.backoff.max.ms"; and
> > >>> if the
> > >>>>>>> specified "retry.backoff.ms" is already larger than the "
> > >>>>>>> retry.backoff.max.ms", we would still take "retry.backoff.max.ms
> > >> ".
> > >>>>>>>
> > >>>>>>> So if the user does override the "retry.backoff.ms" to a value
> > >>> larger
> > >>>>>> than
> > >>>>>>> 1s and is not aware of the new config, she would be surprised to
> > >>> see
> > >>>>> the
> > >>>>>>> specified value seemingly not being respected, but she could
> > >> still
> > >>>>> learn
> > >>>>>>> that afterwards by reading the release notes introducing this KIP
> > >>>>>> anyways.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Guozhang
> > >>>>>>>
> > >>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> > >>> jason@confluent.io>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Sanjana,
> > >>>>>>>>
> > >>>>>>>> The KIP looks good to me. I had just one question about the
> > >>> default
> > >>>>>>>> behavior. As I understand, if the user has specified `
> > >>>>> retry.backoff.ms
> > >>>>>> `
> > >>>>>>>> explicitly, then we will not apply the default max backoff. As
> > >>> such,
> > >>>>>>>> there's no way to get the benefit of this feature if you are
> > >>>>> providing
> > >>>>>> a
> > >>>>>>> `
> > >>>>>>>> retry.backoff.ms` unless you also provide `
> > >> retry.backoff.max.ms
> > >>> `.
> > >>>>> That
> > >>>>>>>> makes sense if you assume the user is unaware of the new
> > >>>>> configuration,
> > >>>>>>> but
> > >>>>>>>> it is surprising otherwise. Since it's not a semantic change
> > >> and
> > >>>>> since
> > >>>>>>> the
> > >>>>>>>> default you're proposing of 1s is fairly low already, I wonder
> > >> if
> > >>>>> it's
> > >>>>>>> good
> > >>>>>>>> enough to mention the new configuration in the release notes
> > >> and
> > >>> not
> > >>>>>> add
> > >>>>>>>> any special logic. What do you think?
> > >>>>>>>>
> > >>>>>>>> -Jason
> > >>>>>>>>
> > >>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > >>>>>> skaundinya@gmail.com>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thank you for the comments Guozhang.
> > >>>>>>>>>
> > >>>>>>>>> I’ll leave this KIP out for discussion till the end of the
> > >>> week and
> > >>>>>>> then
> > >>>>>>>>> start a vote for this early next week.
> > >>>>>>>>>
> > >>>>>>>>> Sanjana
> > >>>>>>>>>
> > >>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> > >>> wangguoz@gmail.com
> > >>>>>> ,
> > >>>>>>>> wrote:
> > >>>>>>>>>> Hello Sanjana,
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of
> > >>> sense --
> > >>>>>> as
> > >>>>>>>> you
> > >>>>>>>>>> mentioned in the motivation, we've indeed seen many issues
> > >>> with
> > >>>>>>> regard
> > >>>>>>>> to
> > >>>>>>>>>> the frequent retries, with bounded exponential backoff in
> > >> the
> > >>>>>>> scenario
> > >>>>>>>>>> where there's a long connectivity issue we would
> > >> effectively
> > >>>>> reduce
> > >>>>>>> the
> > >>>>>>>>>> request load by 10 given the default configs.
> > >>>>>>>>>>
> > >>>>>>>>>> For higher-level Streams client and Connect frameworks,
> > >>> today we
> > >>>>>> also
> > >>>>>>>>> have
> > >>>>>>>>>> a retry logic but that's used in a slightly different way.
> > >>> For
> > >>>>>>> example
> > >>>>>>>> in
> > >>>>>>>>>> Streams, we tend to handle the retry logic at the
> > >>> thread-level
> > >>>>> and
> > >>>>>>>> hence
> > >>>>>>>>>> very likely we'd like to change that mechanism in KIP-572
> > >>>>> anyways.
> > >>>>>>> For
> > >>>>>>>>>> producer / consumer / admin clients, I think just applying
> > >>> this
> > >>>>>>>>> behavioral
> > >>>>>>>>>> change across these clients makes lot of sense. So I think
> > >>> can
> > >>>>> just
> > >>>>>>>> leave
> > >>>>>>>>>> the Streams / Connect out of the scope of this KIP to be
> > >>>>> addressed
> > >>>>>> in
> > >>>>>>>>>> separate discussions.
> > >>>>>>>>>>
> > >>>>>>>>>> I do not have further comments about this KIP :) LGTM.
> > >>>>>>>>>>
> > >>>>>>>>>> Guozhang
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > >>>>>>>> skaundinya@gmail.com
> > >>>>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks for the feedback Boyang.
> > >>>>>>>>>>>
> > >>>>>>>>>>> If there’s anyone else who has feedback regarding this
> > >> KIP,
> > >>>>> would
> > >>>>>>>>> really
> > >>>>>>>>>>> appreciate it hearing it!
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> Sanjana
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > >>>>>> bchen11@outlook.com>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Sounds great!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Get Outlook for iOS<https://aka.ms/o0ukef>
> > >>>>>>>>>>>> ________________________________
> > >>>>>>>>>>>> From: Sanjana Kaundinya <sk...@gmail.com>
> > >>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
> > >>>>>>>>>>>> To: dev@kafka.apache.org <de...@kafka.apache.org>
> > >>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
> > >>> Kafka
> > >>>>>>>> Clients
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for the explanation Boyang. One of the most
> > >> common
> > >>>>>>> problems
> > >>>>>>>>> that
> > >>>>>>>>>>> we
> > >>>>>>>>>>>> have in Kafka is with respect to metadata fetches. For
> > >>>>> example,
> > >>>>>>> if
> > >>>>>>>>> there
> > >>>>>>>>>>> is
> > >>>>>>>>>>>> a broker failure, all clients start to fetch metadata
> > >> at
> > >>> the
> > >>>>>> same
> > >>>>>>>>> time
> > >>>>>>>>>>> and
> > >>>>>>>>>>>> it often takes a while for the metadata to converge.
> > >> In a
> > >>>>> high
> > >>>>>>> load
> > >>>>>>>>>>>> cluster, there are also issues where the volume of
> > >>> metadata
> > >>>>> has
> > >>>>>>>> made
> > >>>>>>>>>>>> convergence of metadata slower.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For this case, exponential backoff helps as it reduces
> > >>> the
> > >>>>>> retry
> > >>>>>>>>> rate and
> > >>>>>>>>>>>> spaces out how often clients will retry, thereby
> > >> bringing
> > >>>>> down
> > >>>>>>> the
> > >>>>>>>>> time
> > >>>>>>>>>>> for
> > >>>>>>>>>>>> convergence. Something that Jason mentioned that would
> > >>> be a
> > >>>>>> great
> > >>>>>>>>>>> addition
> > >>>>>>>>>>>> here would be if the backoff should be “jittered” as it
> > >>> was
> > >>>>> in
> > >>>>>>>>> KIP-144
> > >>>>>>>>>>> with
> > >>>>>>>>>>>> respect to exponential reconnect backoff. This would
> > >> help
> > >>>>>> prevent
> > >>>>>>>> the
> > >>>>>>>>>>>> clients from being synchronized on when they retry,
> > >>> thereby
> > >>>>>>> spacing
> > >>>>>>>>> out
> > >>>>>>>>>>> the
> > >>>>>>>>>>>> number of requests being sent to the broker at the same
> > >>> time.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I’ll add this example to the KIP and flush out more of
> > >>> the
> > >>>>>>> details
> > >>>>>>>> -
> > >>>>>>>>> so
> > >>>>>>>>>>>> it’s more clear.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > >>>>>>>>> reluctanthero104@gmail.com
> > >>>>>>>>>>>> ,
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to
> > >>>>>> rephrase
> > >>>>>>> my
> > >>>>>>>>>>>> question
> > >>>>>>>>>>>>> 2 and 3 as my previous response is a little bit
> > >>>>> unactionable.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> My specific point is that exponential backoff is not
> > >> a
> > >>>>> silver
> > >>>>>>>>> bullet
> > >>>>>>>>>>> and
> > >>>>>>>>>>>> we
> > >>>>>>>>>>>>> should consider using it to solve known problems,
> > >>> instead
> > >>>>> of
> > >>>>>>>>> making the
> > >>>>>>>>>>>>> holistic changes to all clients in Kafka ecosystem. I
> > >>> do
> > >>>>> like
> > >>>>>>> the
> > >>>>>>>>>>>>> exponential backoff idea and believe this would be of
> > >>> great
> > >>>>>>>> value,
> > >>>>>>>>> but
> > >>>>>>>>>>>>> maybe we should focus on proposing some existing
> > >>> modules
> > >>>>> that
> > >>>>>>> are
> > >>>>>>>>>>>> suffering
> > >>>>>>>>>>>>> from static retry, and only change them in this first
> > >>> KIP.
> > >>>>> If
> > >>>>>>> in
> > >>>>>>>>> the
> > >>>>>>>>>>>>> future, some other component users believe they are
> > >>> also
> > >>>>>>>>> suffering, we
> > >>>>>>>>>>>>> could get more minor KIPs to change the behavior as
> > >>> well.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Boyang
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > >>>>>>>>>>> skaundinya@gmail.com
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the
> > >> KIP
> > >>>>> with
> > >>>>>>> the
> > >>>>>>>>>>>>>> mathematical relations as per your suggestion. To
> > >>> address
> > >>>>>>> your
> > >>>>>>>>>>>> feedback:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry
> > >>>>> backoff,
> > >>>>>>> in
> > >>>>>>>> 1
> > >>>>>>>>>>> second
> > >>>>>>>>>>>>>> we would have 10 retries. In the case of using an
> > >>>>>> exponential
> > >>>>>>>>>>> backoff,
> > >>>>>>>>>>>> we
> > >>>>>>>>>>>>>> would have a total of 4 retries in 1 second. Thus
> > >> we
> > >>> have
> > >>>>>>> less
> > >>>>>>>>> than
> > >>>>>>>>>>>> half of
> > >>>>>>>>>>>>>> the amount of retries in the same timeframe and can
> > >>>>> lessen
> > >>>>>>>> broker
> > >>>>>>>>>>>> pressure.
> > >>>>>>>>>>>>>> This calculation is done as following (using the
> > >>> formula
> > >>>>>> laid
> > >>>>>>>>> out in
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>>> KIP:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100
> > >>> ms
> > >>>>>>> (default
> > >>>>>>>>> retry
> > >>>>>>>>>>>> ms
> > >>>>>>>>>>>>>> is initially 100 ms)
> > >>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in
> > >>> 200 ms
> > >>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in
> > >>> 400 ms
> > >>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in
> > >>> 800 ms
> > >>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in
> > >>> 1000
> > >>>>> ms
> > >>>>>>>>> (default
> > >>>>>>>>>>> max
> > >>>>>>>>>>>>>> retry ms is 1000 ms)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what
> > >> you
> > >>> mean
> > >>>>>>> with
> > >>>>>>>>>>> respect
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>>> client timeouts? I’m not very familiar with the
> > >>> Streams
> > >>>>>>>>> framework, so
> > >>>>>>>>>>>> would
> > >>>>>>>>>>>>>> love to get more insight to how that currently
> > >> works,
> > >>>>> with
> > >>>>>>>>> respect to
> > >>>>>>>>>>>>>> producer transactions, so I can appropriately
> > >> update
> > >>> the
> > >>>>>> KIP
> > >>>>>>> to
> > >>>>>>>>>>> address
> > >>>>>>>>>>>>>> these scenarios.
> > >>>>>>>>>>>>>> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > >>>>>>>>>>>> reluctanthero104@gmail.com>,
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the
> > >> motivation
> > >>> is
> > >>>>>> good,
> > >>>>>>>> but
> > >>>>>>>>>>> lack
> > >>>>>>>>>>>> of
> > >>>>>>>>>>>>>>> more quantitative analysis. For instance:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 1. How much retries we are saving by applying the
> > >>>>>>> exponential
> > >>>>>>>>> retry
> > >>>>>>>>>>>> vs
> > >>>>>>>>>>>>>>> static retry? There should be some mathematical
> > >>>>> relations
> > >>>>>>>>> between
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>> static retry ms, the initial exponential retry
> > >> ms,
> > >>> the
> > >>>>>> max
> > >>>>>>>>>>>> exponential
> > >>>>>>>>>>>>>>> retry ms in a given time interval.
> > >>>>>>>>>>>>>>> 2. How does this affect the client timeout? With
> > >>>>>>> exponential
> > >>>>>>>>> retry,
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>>>> client shall be getting easier to timeout on a
> > >>> parent
> > >>>>>> level
> > >>>>>>>>> caller,
> > >>>>>>>>>>>> for
> > >>>>>>>>>>>>>>> instance stream attempts to retry initializing
> > >>> producer
> > >>>>>>>>>>> transactions
> > >>>>>>>>>>>> with
> > >>>>>>>>>>>>>>> given 5 minute interval. With exponential retry
> > >>> this
> > >>>>>>>> mechanism
> > >>>>>>>>>>> could
> > >>>>>>>>>>>>>>> experience more frequent timeout which we should
> > >> be
> > >>>>>> careful
> > >>>>>>>>> with.
> > >>>>>>>>>>>>>>> 3. With regards to #2, we should have more
> > >> detailed
> > >>>>>>> checklist
> > >>>>>>>>> of
> > >>>>>>>>>>> all
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>>>> existing static retry scenarios, and adjust the
> > >>> initial
> > >>>>>>>>> exponential
> > >>>>>>>>>>>> retry
> > >>>>>>>>>>>>>>> ms to make sure we won't get easily timeout in
> > >> high
> > >>>>> level
> > >>>>>>> due
> > >>>>>>>>> to
> > >>>>>>>>>>> too
> > >>>>>>>>>>>> few
> > >>>>>>>>>>>>>>> attempts.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Boyang
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> > >> Kaundinya <
> > >>>>>>>>>>>> skaundinya@gmail.com>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Everyone,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I’ve written a KIP about introducing
> > >> exponential
> > >>>>>> backoff
> > >>>>>>>> for
> > >>>>>>>>>>> Kafka
> > >>>>>>>>>>>>>>>> clients. Would appreciate any feedback on this.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>> Sanjana
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> -- Guozhang
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> -- Guozhang
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> -- Guozhang
> > >>>
> > >>
> > >>
> > >> --
> > >> -- Guozhang
> > >>
> >
> >
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Boyang Chen <re...@gmail.com>.
Hey Sanjana,

my understanding of the update is that if a user provides no config at
all, a Producer/Consumer/Admin client user would by default enjoy a
starting backoff.ms of 100 ms and a max.backoff.ms of 1000 ms? If I
already override backoff.ms to 2000 ms, for instance, will I be choosing
the default max.backoff.ms here?

I guess my question would be whether we should just reject a config with
backoff.ms > max.backoff.ms in the first place, as this looks like
mis-configuration to me.
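
To illustrate what I mean by rejecting it up front, a minimal sketch
follows (the class and method names here are mine, not anything in the
actual client code):

public class BackoffConfigCheck {
    // Illustrative only: the real clients would perform a check like this in their
    // config classes rather than in a standalone helper.
    static void validateBackoffConfigs(long retryBackoffMs, long retryBackoffMaxMs) {
        if (retryBackoffMs > retryBackoffMaxMs) {
            throw new IllegalArgumentException("retry.backoff.ms (" + retryBackoffMs
                + ") must not be larger than retry.backoff.max.ms (" + retryBackoffMaxMs + ")");
        }
    }
}

The alternative, as discussed earlier in the thread, is to accept both
values and let the MIN() clamp effectively pin the backoff at
retry.backoff.max.ms.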

My second question is whether we should allow a fallback to static
backoff if the user wants to do so, or whether we should just ship this as
an opt-in feature?

Let me know your thoughts.

Boyang

On Mon, Mar 23, 2020 at 11:38 AM Cheng Tan <ct...@confluent.io> wrote:

> +1 (non-binding)
>
> > On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <sk...@gmail.com>
> wrote:
> >
> > Ah yes that makes sense. I’ll update the KIP to reflect this.
> >
> > Thanks,
> > Sanjana
> >
> > On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> >> Following the formula you have in the KIP, if it is simply:
> >>
> >> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) *
> random(
> >> 0.8, 1.2))
> >>
> >> then the behavior would stay consistent at retry.backoff.max.ms.
> >>
> >>
> >> Guozhang
> >>
> >> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <skaundinya@gmail.com
> >
> >> wrote:
> >>
> >>> If that’s the case then what should we base the starting point as?
> >>> Currently in the KIP the starting point is retry.backoff.ms and it
> >>> exponentially goes up to retry.backoff.max.ms. If retry.backoff.max.ms
> >> is
> >>> smaller than retry.backoff.ms then that could pose a bit of a problem
> >>> there right?
> >>>
> >>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wa...@gmail.com>,
> >> wrote:
> >>>> Thanks Sanjana, I did not capture the part that Jason referred to, so
> >>>> that's my bad :P
> >>>>
> >>>> Regarding your last statement, I actually feel that instead of take
> the
> >>>> larger of the two, we should respect "retry.backoff.max.ms" even if
> it
> >>> is
> >>>> smaller than "retry.backoff.ms". I do not have a very strong
> rationale
> >>>> except it is logically more aligned to the config names.
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> >> skaundinya@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hey Jason and Guozhang,
> >>>>>
> >>>>> Jason is right, I took this inspiration from KIP-144 (
> >>>>>
> >>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> >>>>> )
> >>>>> which had the same logic in order to preserve the existing behavior.
> >> In
> >>>>> this case however, if we are thinking to completely eliminate the
> >>> static
> >>>>> backoff behavior, we can do that and as Jason mentioned put it in the
> >>>>> release notes and not add any special logic. In addition I agree that
> >>> we
> >>>>> should take the larger of the two of `retry.backoff.ms` and `
> >>>>> retry.backoff.max.ms`. I'll update the KIP to reflect this and make
> >> it
> >>>>> clear that the old static retry backoff is getting replaced by the
> >> new
> >>>>> dynamic retry backoff.
> >>>>>
> >>>>> Thanks,
> >>>>> Sanjana
> >>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> >>>>> wrote:
> >>>>>
> >>>>>> Hey Guozhang,
> >>>>>>
> >>>>>> I was referring to this:
> >>>>>>
> >>>>>>> For users who have not set retry.backoff.ms explicitly, the
> >>> default
> >>>>>> behavior will change so that the backoff will grow up to 1000 ms.
> >> For
> >>>>> users
> >>>>>> who have set retry.backoff.ms explicitly, the behavior will remain
> >>> the
> >>>>>> same
> >>>>>> as they could have specific requirements.
> >>>>>>
> >>>>>> I took this to mean that for users who have overridden `
> >>> retry.backoff.ms
> >>>>> `
> >>>>>> to 50ms (say), we will change the default `retry.backoff.max.ms`
> >> to
> >>> 50ms
> >>>>>> as
> >>>>>> well in order to preserve existing backoff behavior. Is that not
> >>> right?
> >>>>> In
> >>>>>> any case, I agree that we can use the maximum of the two values as
> >>> the
> >>>>>> effective `retry.backoff.max.ms` to handle the case when the
> >>> configured
> >>>>>> value of `retry.backoff.ms` is larger than the default of 1s.
> >>>>>>
> >>>>>> -Jason
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wa...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Hey Jason,
> >>>>>>>
> >>>>>>> My understanding is a bit different here: even if user has an
> >>> explicit
> >>>>>>> overridden "retry.backoff.ms", the exponential mechanism still
> >>>>> triggers
> >>>>>>> and
> >>>>>>> the backoff would be increased till "retry.backoff.max.ms"; and
> >>> if the
> >>>>>>> specified "retry.backoff.ms" is already larger than the "
> >>>>>>> retry.backoff.max.ms", we would still take "retry.backoff.max.ms
> >> ".
> >>>>>>>
> >>>>>>> So if the user does override the "retry.backoff.ms" to a value
> >>> larger
> >>>>>> than
> >>>>>>> 1s and is not aware of the new config, she would be surprised to
> >>> see
> >>>>> the
> >>>>>>> specified value seemingly not being respected, but she could
> >> still
> >>>>> learn
> >>>>>>> that afterwards by reading the release notes introducing this KIP
> >>>>>> anyways.
> >>>>>>>
> >>>>>>>
> >>>>>>> Guozhang
> >>>>>>>
> >>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> >>> jason@confluent.io>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Sanjana,
> >>>>>>>>
> >>>>>>>> The KIP looks good to me. I had just one question about the
> >>> default
> >>>>>>>> behavior. As I understand, if the user has specified `
> >>>>> retry.backoff.ms
> >>>>>> `
> >>>>>>>> explicitly, then we will not apply the default max backoff. As
> >>> such,
> >>>>>>>> there's no way to get the benefit of this feature if you are
> >>>>> providing
> >>>>>> a
> >>>>>>> `
> >>>>>>>> retry.backoff.ms` unless you also provide `
> >> retry.backoff.max.ms
> >>> `.
> >>>>> That
> >>>>>>>> makes sense if you assume the user is unaware of the new
> >>>>> configuration,
> >>>>>>> but
> >>>>>>>> it is surprising otherwise. Since it's not a semantic change
> >> and
> >>>>> since
> >>>>>>> the
> >>>>>>>> default you're proposing of 1s is fairly low already, I wonder
> >> if
> >>>>> it's
> >>>>>>> good
> >>>>>>>> enough to mention the new configuration in the release notes
> >> and
> >>> not
> >>>>>> add
> >>>>>>>> any special logic. What do you think?
> >>>>>>>>
> >>>>>>>> -Jason
> >>>>>>>>
> >>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> >>>>>> skaundinya@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thank you for the comments Guozhang.
> >>>>>>>>>
> >>>>>>>>> I’ll leave this KIP out for discussion till the end of the
> >>> week and
> >>>>>>> then
> >>>>>>>>> start a vote for this early next week.
> >>>>>>>>>
> >>>>>>>>> Sanjana
> >>>>>>>>>
> >>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> >>> wangguoz@gmail.com
> >>>>>> ,
> >>>>>>>> wrote:
> >>>>>>>>>> Hello Sanjana,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of
> >>> sense --
> >>>>>> as
> >>>>>>>> you
> >>>>>>>>>> mentioned in the motivation, we've indeed seen many issues
> >>> with
> >>>>>>> regard
> >>>>>>>> to
> >>>>>>>>>> the frequent retries, with bounded exponential backoff in
> >> the
> >>>>>>> scenario
> >>>>>>>>>> where there's a long connectivity issue we would
> >> effectively
> >>>>> reduce
> >>>>>>> the
> >>>>>>>>>> request load by 10 given the default configs.
> >>>>>>>>>>
> >>>>>>>>>> For higher-level Streams client and Connect frameworks,
> >>> today we
> >>>>>> also
> >>>>>>>>> have
> >>>>>>>>>> a retry logic but that's used in a slightly different way.
> >>> For
> >>>>>>> example
> >>>>>>>> in
> >>>>>>>>>> Streams, we tend to handle the retry logic at the
> >>> thread-level
> >>>>> and
> >>>>>>>> hence
> >>>>>>>>>> very likely we'd like to change that mechanism in KIP-572
> >>>>> anyways.
> >>>>>>> For
> >>>>>>>>>> producer / consumer / admin clients, I think just applying
> >>> this
> >>>>>>>>> behavioral
> >>>>>>>>>> change across these clients makes lot of sense. So I think
> >>> can
> >>>>> just
> >>>>>>>> leave
> >>>>>>>>>> the Streams / Connect out of the scope of this KIP to be
> >>>>> addressed
> >>>>>> in
> >>>>>>>>>> separate discussions.
> >>>>>>>>>>
> >>>>>>>>>> I do not have further comments about this KIP :) LGTM.
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> >>>>>>>> skaundinya@gmail.com
> >>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Thanks for the feedback Boyang.
> >>>>>>>>>>>
> >>>>>>>>>>> If there’s anyone else who has feedback regarding this
> >> KIP,
> >>>>> would
> >>>>>>>>> really
> >>>>>>>>>>> appreciate it hearing it!
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Sanjana
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> >>>>>> bchen11@outlook.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Sounds great!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Get Outlook for iOS<https://aka.ms/o0ukef>
> >>>>>>>>>>>> ________________________________
> >>>>>>>>>>>> From: Sanjana Kaundinya <sk...@gmail.com>
> >>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
> >>>>>>>>>>>> To: dev@kafka.apache.org <de...@kafka.apache.org>
> >>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
> >>> Kafka
> >>>>>>>> Clients
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the explanation Boyang. One of the most
> >> common
> >>>>>>> problems
> >>>>>>>>> that
> >>>>>>>>>>> we
> >>>>>>>>>>>> have in Kafka is with respect to metadata fetches. For
> >>>>> example,
> >>>>>>> if
> >>>>>>>>> there
> >>>>>>>>>>> is
> >>>>>>>>>>>> a broker failure, all clients start to fetch metadata
> >> at
> >>> the
> >>>>>> same
> >>>>>>>>> time
> >>>>>>>>>>> and
> >>>>>>>>>>>> it often takes a while for the metadata to converge.
> >> In a
> >>>>> high
> >>>>>>> load
> >>>>>>>>>>>> cluster, there are also issues where the volume of
> >>> metadata
> >>>>> has
> >>>>>>>> made
> >>>>>>>>>>>> convergence of metadata slower.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For this case, exponential backoff helps as it reduces
> >>> the
> >>>>>> retry
> >>>>>>>>> rate and
> >>>>>>>>>>>> spaces out how often clients will retry, thereby
> >> bringing
> >>>>> down
> >>>>>>> the
> >>>>>>>>> time
> >>>>>>>>>>> for
> >>>>>>>>>>>> convergence. Something that Jason mentioned that would
> >>> be a
> >>>>>> great
> >>>>>>>>>>> addition
> >>>>>>>>>>>> here would be if the backoff should be “jittered” as it
> >>> was
> >>>>> in
> >>>>>>>>> KIP-144
> >>>>>>>>>>> with
> >>>>>>>>>>>> respect to exponential reconnect backoff. This would
> >> help
> >>>>>> prevent
> >>>>>>>> the
> >>>>>>>>>>>> clients from being synchronized on when they retry,
> >>> thereby
> >>>>>>> spacing
> >>>>>>>>> out
> >>>>>>>>>>> the
> >>>>>>>>>>>> number of requests being sent to the broker at the same
> >>> time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I’ll add this example to the KIP and flush out more of
> >>> the
> >>>>>>> details
> >>>>>>>> -
> >>>>>>>>> so
> >>>>>>>>>>>> it’s more clear.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> >>>>>>>>> reluctanthero104@gmail.com
> >>>>>>>>>>>> ,
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to
> >>>>>> rephrase
> >>>>>>> my
> >>>>>>>>>>>> question
> >>>>>>>>>>>>> 2 and 3 as my previous response is a little bit
> >>>>> unactionable.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> My specific point is that exponential backoff is not
> >> a
> >>>>> silver
> >>>>>>>>> bullet
> >>>>>>>>>>> and
> >>>>>>>>>>>> we
> >>>>>>>>>>>>> should consider using it to solve known problems,
> >>> instead
> >>>>> of
> >>>>>>>>> making the
> >>>>>>>>>>>>> holistic changes to all clients in Kafka ecosystem. I
> >>> do
> >>>>> like
> >>>>>>> the
> >>>>>>>>>>>>> exponential backoff idea and believe this would be of
> >>> great
> >>>>>>>> value,
> >>>>>>>>> but
> >>>>>>>>>>>>> maybe we should focus on proposing some existing
> >>> modules
> >>>>> that
> >>>>>>> are
> >>>>>>>>>>>> suffering
> >>>>>>>>>>>>> from static retry, and only change them in this first
> >>> KIP.
> >>>>> If
> >>>>>>> in
> >>>>>>>>> the
> >>>>>>>>>>>>> future, some other component users believe they are
> >>> also
> >>>>>>>>> suffering, we
> >>>>>>>>>>>>> could get more minor KIPs to change the behavior as
> >>> well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Boyang
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> >>>>>>>>>>> skaundinya@gmail.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the
> >> KIP
> >>>>> with
> >>>>>>> the
> >>>>>>>>>>>>>> mathematical relations as per your suggestion. To
> >>> address
> >>>>>>> your
> >>>>>>>>>>>> feedback:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry
> >>>>> backoff,
> >>>>>>> in
> >>>>>>>> 1
> >>>>>>>>>>> second
> >>>>>>>>>>>>>> we would have 10 retries. In the case of using an
> >>>>>> exponential
> >>>>>>>>>>> backoff,
> >>>>>>>>>>>> we
> >>>>>>>>>>>>>> would have a total of 4 retries in 1 second. Thus
> >> we
> >>> have
> >>>>>>> less
> >>>>>>>>> than
> >>>>>>>>>>>> half of
> >>>>>>>>>>>>>> the amount of retries in the same timeframe and can
> >>>>> lessen
> >>>>>>>> broker
> >>>>>>>>>>>> pressure.
> >>>>>>>>>>>>>> This calculation is done as following (using the
> >>> formula
> >>>>>> laid
> >>>>>>>>> out in
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> KIP:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100
> >>> ms
> >>>>>>> (default
> >>>>>>>>> retry
> >>>>>>>>>>>> ms
> >>>>>>>>>>>>>> is initially 100 ms)
> >>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in
> >>> 200 ms
> >>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in
> >>> 400 ms
> >>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in
> >>> 800 ms
> >>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in
> >>> 1000
> >>>>> ms
> >>>>>>>>> (default
> >>>>>>>>>>> max
> >>>>>>>>>>>>>> retry ms is 1000 ms)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what
> >> you
> >>> mean
> >>>>>>> with
> >>>>>>>>>>> respect
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>> client timeouts? I’m not very familiar with the
> >>> Streams
> >>>>>>>>> framework, so
> >>>>>>>>>>>> would
> >>>>>>>>>>>>>> love to get more insight to how that currently
> >> works,
> >>>>> with
> >>>>>>>>> respect to
> >>>>>>>>>>>>>> producer transactions, so I can appropriately
> >> update
> >>> the
> >>>>>> KIP
> >>>>>>> to
> >>>>>>>>>>> address
> >>>>>>>>>>>>>> these scenarios.
> >>>>>>>>>>>>>> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> >>>>>>>>>>>> reluctanthero104@gmail.com>,
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the
> >> motivation
> >>> is
> >>>>>> good,
> >>>>>>>> but
> >>>>>>>>>>> lack
> >>>>>>>>>>>> of
> >>>>>>>>>>>>>>> more quantitative analysis. For instance:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1. How much retries we are saving by applying the
> >>>>>>> exponential
> >>>>>>>>> retry
> >>>>>>>>>>>> vs
> >>>>>>>>>>>>>>> static retry? There should be some mathematical
> >>>>> relations
> >>>>>>>>> between
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>> static retry ms, the initial exponential retry
> >> ms,
> >>> the
> >>>>>> max
> >>>>>>>>>>>> exponential
> >>>>>>>>>>>>>>> retry ms in a given time interval.
> >>>>>>>>>>>>>>> 2. How does this affect the client timeout? With
> >>>>>>> exponential
> >>>>>>>>> retry,
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> client shall be getting easier to timeout on a
> >>> parent
> >>>>>> level
> >>>>>>>>> caller,
> >>>>>>>>>>>> for
> >>>>>>>>>>>>>>> instance stream attempts to retry initializing
> >>> producer
> >>>>>>>>>>> transactions
> >>>>>>>>>>>> with
> >>>>>>>>>>>>>>> given 5 minute interval. With exponential retry
> >>> this
> >>>>>>>> mechanism
> >>>>>>>>>>> could
> >>>>>>>>>>>>>>> experience more frequent timeout which we should
> >> be
> >>>>>> careful
> >>>>>>>>> with.
> >>>>>>>>>>>>>>> 3. With regards to #2, we should have more
> >> detailed
> >>>>>>> checklist
> >>>>>>>>> of
> >>>>>>>>>>> all
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>> existing static retry scenarios, and adjust the
> >>> initial
> >>>>>>>>> exponential
> >>>>>>>>>>>> retry
> >>>>>>>>>>>>>>> ms to make sure we won't get easily timeout in
> >> high
> >>>>> level
> >>>>>>> due
> >>>>>>>>> to
> >>>>>>>>>>> too
> >>>>>>>>>>>> few
> >>>>>>>>>>>>>>> attempts.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Boyang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> >> Kaundinya <
> >>>>>>>>>>>> skaundinya@gmail.com>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I’ve written a KIP about introducing
> >> exponential
> >>>>>> backoff
> >>>>>>>> for
> >>>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>> clients. Would appreciate any feedback on this.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Sanjana
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> -- Guozhang
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> -- Guozhang
> >>>
> >>
> >>
> >> --
> >> -- Guozhang
> >>
>
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Cheng Tan <ct...@confluent.io>.
+1 (non-binding)

> On Mar 19, 2020, at 7:27 PM, Sanjana Kaundinya <sk...@gmail.com> wrote:
> 
> Ah yes that makes sense. I’ll update the KIP to reflect this.
> 
> Thanks,
> Sanjana
> 
> On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <wa...@gmail.com> wrote:
> 
>> Following the formula you have in the KIP, if it is simply:
>> 
>> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(
>> 0.8, 1.2))
>> 
>> then the behavior would stay consistent at retry.backoff.max.ms.
>> 
>> 
>> Guozhang
>> 
>> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <sk...@gmail.com>
>> wrote:
>> 
>>> If that’s the case then what should we base the starting point as?
>>> Currently in the KIP the starting point is retry.backoff.ms and it
>>> exponentially goes up to retry.backoff.max.ms. If retry.backoff.max.ms
>> is
>>> smaller than retry.backoff.ms then that could pose a bit of a problem
>>> there right?
>>> 
>>> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wa...@gmail.com>,
>> wrote:
>>>> Thanks Sanjana, I did not capture the part that Jason referred to, so
>>>> that's my bad :P
>>>> 
>>>> Regarding your last statement, I actually feel that instead of take the
>>>> larger of the two, we should respect "retry.backoff.max.ms" even if it
>>> is
>>>> smaller than "retry.backoff.ms". I do not have a very strong rationale
>>>> except it is logically more aligned to the config names.
>>>> 
>>>> 
>>>> Guozhang
>>>> 
>>>> 
>>>> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
>> skaundinya@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hey Jason and Guozhang,
>>>>> 
>>>>> Jason is right, I took this inspiration from KIP-144 (
>>>>> 
>>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
>>>>> )
>>>>> which had the same logic in order to preserve the existing behavior.
>> In
>>>>> this case however, if we are thinking to completely eliminate the
>>> static
>>>>> backoff behavior, we can do that and as Jason mentioned put it in the
>>>>> release notes and not add any special logic. In addition I agree that
>>> we
>>>>> should take the larger of the two of `retry.backoff.ms` and `
>>>>> retry.backoff.max.ms`. I'll update the KIP to reflect this and make
>> it
>>>>> clear that the old static retry backoff is getting replaced by the
>> new
>>>>> dynamic retry backoff.
>>>>> 
>>>>> Thanks,
>>>>> Sanjana
>>>>> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
>>>>> wrote:
>>>>> 
>>>>>> Hey Guozhang,
>>>>>> 
>>>>>> I was referring to this:
>>>>>> 
>>>>>>> For users who have not set retry.backoff.ms explicitly, the
>>> default
>>>>>> behavior will change so that the backoff will grow up to 1000 ms.
>> For
>>>>> users
>>>>>> who have set retry.backoff.ms explicitly, the behavior will remain
>>> the
>>>>>> same
>>>>>> as they could have specific requirements.
>>>>>> 
>>>>>> I took this to mean that for users who have overridden `
>>> retry.backoff.ms
>>>>> `
>>>>>> to 50ms (say), we will change the default `retry.backoff.max.ms`
>> to
>>> 50ms
>>>>>> as
>>>>>> well in order to preserve existing backoff behavior. Is that not
>>> right?
>>>>> In
>>>>>> any case, I agree that we can use the maximum of the two values as
>>> the
>>>>>> effective `retry.backoff.max.ms` to handle the case when the
>>> configured
>>>>>> value of `retry.backoff.ms` is larger than the default of 1s.
>>>>>> 
>>>>>> -Jason
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wa...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> Hey Jason,
>>>>>>> 
>>>>>>> My understanding is a bit different here: even if user has an
>>> explicit
>>>>>>> overridden "retry.backoff.ms", the exponential mechanism still
>>>>> triggers
>>>>>>> and
>>>>>>> the backoff would be increased till "retry.backoff.max.ms"; and
>>> if the
>>>>>>> specified "retry.backoff.ms" is already larger than the "
>>>>>>> retry.backoff.max.ms", we would still take "retry.backoff.max.ms
>> ".
>>>>>>> 
>>>>>>> So if the user does override the "retry.backoff.ms" to a value
>>> larger
>>>>>> than
>>>>>>> 1s and is not aware of the new config, she would be surprised to
>>> see
>>>>> the
>>>>>>> specified value seemingly not being respected, but she could
>> still
>>>>> learn
>>>>>>> that afterwards by reading the release notes introducing this KIP
>>>>>> anyways.
>>>>>>> 
>>>>>>> 
>>>>>>> Guozhang
>>>>>>> 
>>>>>>> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
>>> jason@confluent.io>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Sanjana,
>>>>>>>> 
>>>>>>>> The KIP looks good to me. I had just one question about the
>>> default
>>>>>>>> behavior. As I understand, if the user has specified `
>>>>> retry.backoff.ms
>>>>>> `
>>>>>>>> explicitly, then we will not apply the default max backoff. As
>>> such,
>>>>>>>> there's no way to get the benefit of this feature if you are
>>>>> providing
>>>>>> a
>>>>>>> `
>>>>>>>> retry.backoff.ms` unless you also provide `
>> retry.backoff.max.ms
>>> `.
>>>>> That
>>>>>>>> makes sense if you assume the user is unaware of the new
>>>>> configuration,
>>>>>>> but
>>>>>>>> it is surprising otherwise. Since it's not a semantic change
>> and
>>>>> since
>>>>>>> the
>>>>>>>> default you're proposing of 1s is fairly low already, I wonder
>> if
>>>>> it's
>>>>>>> good
>>>>>>>> enough to mention the new configuration in the release notes
>> and
>>> not
>>>>>> add
>>>>>>>> any special logic. What do you think?
>>>>>>>> 
>>>>>>>> -Jason
>>>>>>>> 
>>>>>>>> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
>>>>>> skaundinya@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Thank you for the comments Guozhang.
>>>>>>>>> 
>>>>>>>>> I’ll leave this KIP out for discussion till the end of the
>>> week and
>>>>>>> then
>>>>>>>>> start a vote for this early next week.
>>>>>>>>> 
>>>>>>>>> Sanjana
>>>>>>>>> 
>>>>>>>>> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
>>> wangguoz@gmail.com
>>>>>> ,
>>>>>>>> wrote:
>>>>>>>>>> Hello Sanjana,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the proposed KIP, I think that makes a lot of
>>> sense --
>>>>>> as
>>>>>>>> you
>>>>>>>>>> mentioned in the motivation, we've indeed seen many issues
>>> with
>>>>>>> regard
>>>>>>>> to
>>>>>>>>>> the frequent retries, with bounded exponential backoff in
>> the
>>>>>>> scenario
>>>>>>>>>> where there's a long connectivity issue we would
>> effectively
>>>>> reduce
>>>>>>> the
>>>>>>>>>> request load by 10 given the default configs.
>>>>>>>>>> 
>>>>>>>>>> For higher-level Streams client and Connect frameworks,
>>> today we
>>>>>> also
>>>>>>>>> have
>>>>>>>>>> a retry logic but that's used in a slightly different way.
>>> For
>>>>>>> example
>>>>>>>> in
>>>>>>>>>> Streams, we tend to handle the retry logic at the
>>> thread-level
>>>>> and
>>>>>>>> hence
>>>>>>>>>> very likely we'd like to change that mechanism in KIP-572
>>>>> anyways.
>>>>>>> For
>>>>>>>>>> producer / consumer / admin clients, I think just applying
>>> this
>>>>>>>>> behavioral
>>>>>>>>>> change across these clients makes lot of sense. So I think
>>> can
>>>>> just
>>>>>>>> leave
>>>>>>>>>> the Streams / Connect out of the scope of this KIP to be
>>>>> addressed
>>>>>> in
>>>>>>>>>> separate discussions.
>>>>>>>>>> 
>>>>>>>>>> I do not have further comments about this KIP :) LGTM.
>>>>>>>>>> 
>>>>>>>>>> Guozhang
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
>>>>>>>> skaundinya@gmail.com
>>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks for the feedback Boyang.
>>>>>>>>>>> 
>>>>>>>>>>> If there’s anyone else who has feedback regarding this
>> KIP,
>>>>> would
>>>>>>>>> really
>>>>>>>>>>> appreciate it hearing it!
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Sanjana
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
>>>>>> bchen11@outlook.com>
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Sounds great!
>>>>>>>>>>>> 
>>>>>>>>>>>> Get Outlook for iOS<https://aka.ms/o0ukef>
>>>>>>>>>>>> ________________________________
>>>>>>>>>>>> From: Sanjana Kaundinya <sk...@gmail.com>
>>>>>>>>>>>> Sent: Tuesday, March 17, 2020 5:54:35 PM
>>>>>>>>>>>> To: dev@kafka.apache.org <de...@kafka.apache.org>
>>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
>>> Kafka
>>>>>>>> Clients
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the explanation Boyang. One of the most
>> common
>>>>>>> problems
>>>>>>>>> that
>>>>>>>>>>> we
>>>>>>>>>>>> have in Kafka is with respect to metadata fetches. For
>>>>> example,
>>>>>>> if
>>>>>>>>> there
>>>>>>>>>>> is
>>>>>>>>>>>> a broker failure, all clients start to fetch metadata
>> at
>>> the
>>>>>> same
>>>>>>>>> time
>>>>>>>>>>> and
>>>>>>>>>>>> it often takes a while for the metadata to converge.
>> In a
>>>>> high
>>>>>>> load
>>>>>>>>>>>> cluster, there are also issues where the volume of
>>> metadata
>>>>> has
>>>>>>>> made
>>>>>>>>>>>> convergence of metadata slower.
>>>>>>>>>>>> 
>>>>>>>>>>>> For this case, exponential backoff helps as it reduces
>>> the
>>>>>> retry
>>>>>>>>> rate and
>>>>>>>>>>>> spaces out how often clients will retry, thereby
>> bringing
>>>>> down
>>>>>>> the
>>>>>>>>> time
>>>>>>>>>>> for
>>>>>>>>>>>> convergence. Something that Jason mentioned that would
>>> be a
>>>>>> great
>>>>>>>>>>> addition
>>>>>>>>>>>> here would be if the backoff should be “jittered” as it
>>> was
>>>>> in
>>>>>>>>> KIP-144
>>>>>>>>>>> with
>>>>>>>>>>>> respect to exponential reconnect backoff. This would
>> help
>>>>>> prevent
>>>>>>>> the
>>>>>>>>>>>> clients from being synchronized on when they retry,
>>> thereby
>>>>>>> spacing
>>>>>>>>> out
>>>>>>>>>>> the
>>>>>>>>>>>> number of requests being sent to the broker at the same
>>> time.
>>>>>>>>>>>> 
>>>>>>>>>>>> I’ll add this example to the KIP and flush out more of
>>> the
>>>>>>> details
>>>>>>>> -
>>>>>>>>> so
>>>>>>>>>>>> it’s more clear.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
>>>>>>>>> reluctanthero104@gmail.com
>>>>>>>>>>>> ,
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Thanks for the reply Sanjana. I guess I would like to
>>>>>> rephrase
>>>>>>> my
>>>>>>>>>>>> question
>>>>>>>>>>>>> 2 and 3 as my previous response is a little bit
>>>>> unactionable.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My specific point is that exponential backoff is not
>> a
>>>>> silver
>>>>>>>>> bullet
>>>>>>>>>>> and
>>>>>>>>>>>> we
>>>>>>>>>>>>> should consider using it to solve known problems,
>>> instead
>>>>> of
>>>>>>>>> making the
>>>>>>>>>>>>> holistic changes to all clients in Kafka ecosystem. I
>>> do
>>>>> like
>>>>>>> the
>>>>>>>>>>>>> exponential backoff idea and believe this would be of
>>> great
>>>>>>>> value,
>>>>>>>>> but
>>>>>>>>>>>>> maybe we should focus on proposing some existing
>>> modules
>>>>> that
>>>>>>> are
>>>>>>>>>>>> suffering
>>>>>>>>>>>>> from static retry, and only change them in this first
>>> KIP.
>>>>> If
>>>>>>> in
>>>>>>>>> the
>>>>>>>>>>>>> future, some other component users believe they are
>>> also
>>>>>>>>> suffering, we
>>>>>>>>>>>>> could get more minor KIPs to change the behavior as
>>> well.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Boyang
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
>>>>>>>>>>> skaundinya@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for the feedback Boyang, I will revise the
>> KIP
>>>>> with
>>>>>>> the
>>>>>>>>>>>>>> mathematical relations as per your suggestion. To
>>> address
>>>>>>> your
>>>>>>>>>>>> feedback:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. Currently, with the default of 100 ms per retry
>>>>> backoff,
>>>>>>> in
>>>>>>>> 1
>>>>>>>>>>> second
>>>>>>>>>>>>>> we would have 10 retries. In the case of using an
>>>>>> exponential
>>>>>>>>>>> backoff,
>>>>>>>>>>>> we
>>>>>>>>>>>>>> would have a total of 4 retries in 1 second. Thus
>> we
>>> have
>>>>>>> less
>>>>>>>>> than
>>>>>>>>>>>> half of
>>>>>>>>>>>>>> the amount of retries in the same timeframe and can
>>>>> lessen
>>>>>>>> broker
>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>> This calculation is done as following (using the
>>> formula
>>>>>> laid
>>>>>>>>> out in
>>>>>>>>>>>> the
>>>>>>>>>>>>>> KIP:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Try 1 at time 0 ms, failures = 0, next retry in 100
>>> ms
>>>>>>> (default
>>>>>>>>> retry
>>>>>>>>>>>> ms
>>>>>>>>>>>>>> is initially 100 ms)
>>>>>>>>>>>>>> Try 2 at time 100 ms, failures = 1, next retry in
>>> 200 ms
>>>>>>>>>>>>>> Try 3 at time 300 ms, failures = 2, next retry in
>>> 400 ms
>>>>>>>>>>>>>> Try 4 at time 700 ms, failures = 3, next retry in
>>> 800 ms
>>>>>>>>>>>>>> Try 5 at time 1500 ms, failures = 4, next retry in
>>> 1000
>>>>> ms
>>>>>>>>> (default
>>>>>>>>>>> max
>>>>>>>>>>>>>> retry ms is 1000 ms)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For 2 and 3, could you elaborate more about what
>> you
>>> mean
>>>>>>> with
>>>>>>>>>>> respect
>>>>>>>>>>>> to
>>>>>>>>>>>>>> client timeouts? I’m not very familiar with the
>>> Streams
>>>>>>>>> framework, so
>>>>>>>>>>>> would
>>>>>>>>>>>>>> love to get more insight to how that currently
>> works,
>>>>> with
>>>>>>>>> respect to
>>>>>>>>>>>>>> producer transactions, so I can appropriately
>> update
>>> the
>>>>>> KIP
>>>>>>> to
>>>>>>>>>>> address
>>>>>>>>>>>>>> these scenarios.
>>>>>>>>>>>>>> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
>>>>>>>>>>>> reluctanthero104@gmail.com>,
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> Thanks for the KIP Sanjana. I think the
>> motivation
>>> is
>>>>>> good,
>>>>>>>> but
>>>>>>>>>>> lack
>>>>>>>>>>>> of
>>>>>>>>>>>>>>> more quantitative analysis. For instance:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. How much retries we are saving by applying the
>>>>>>> exponential
>>>>>>>>> retry
>>>>>>>>>>>> vs
>>>>>>>>>>>>>>> static retry? There should be some mathematical
>>>>> relations
>>>>>>>>> between
>>>>>>>>>>> the
>>>>>>>>>>>>>>> static retry ms, the initial exponential retry
>> ms,
>>> the
>>>>>> max
>>>>>>>>>>>> exponential
>>>>>>>>>>>>>>> retry ms in a given time interval.
>>>>>>>>>>>>>>> 2. How does this affect the client timeout? With
>>>>>>> exponential
>>>>>>>>> retry,
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> client shall be getting easier to timeout on a
>>> parent
>>>>>> level
>>>>>>>>> caller,
>>>>>>>>>>>> for
>>>>>>>>>>>>>>> instance stream attempts to retry initializing
>>> producer
>>>>>>>>>>> transactions
>>>>>>>>>>>> with
>>>>>>>>>>>>>>> given 5 minute interval. With exponential retry
>>> this
>>>>>>>> mechanism
>>>>>>>>>>> could
>>>>>>>>>>>>>>> experience more frequent timeout which we should
>> be
>>>>>> careful
>>>>>>>>> with.
>>>>>>>>>>>>>>> 3. With regards to #2, we should have more
>> detailed
>>>>>>> checklist
>>>>>>>>> of
>>>>>>>>>>> all
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> existing static retry scenarios, and adjust the
>>> initial
>>>>>>>>> exponential
>>>>>>>>>>>> retry
>>>>>>>>>>>>>>> ms to make sure we won't get easily timeout in
>> high
>>>>> level
>>>>>>> due
>>>>>>>>> to
>>>>>>>>>>> too
>>>>>>>>>>>> few
>>>>>>>>>>>>>>> attempts.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Boyang
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Mar 13, 2020 at 4:38 PM Sanjana
>> Kaundinya <
>>>>>>>>>>>> skaundinya@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I’ve written a KIP about introducing
>> exponential
>>>>>> backoff
>>>>>>>> for
>>>>>>>>>>> Kafka
>>>>>>>>>>>>>>>> clients. Would appreciate any feedback on this.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Sanjana
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> -- Guozhang
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> -- Guozhang
>>> 
>> 
>> 
>> --
>> -- Guozhang
>> 


Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
Ah yes that makes sense. I’ll update the KIP to reflect this.

Thanks,
Sanjana

On Thu, Mar 19, 2020 at 5:48 PM Guozhang Wang <wa...@gmail.com> wrote:

> Following the formula you have in the KIP, if it is simply:
>
> MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(
> 0.8, 1.2))
>
> then the behavior would stay consistent at retry.backoff.max.ms.
>
>
> Guozhang
>
> On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <sk...@gmail.com>
> wrote:
>
> > If that’s the case then what should we base the starting point as?
> > Currently in the KIP the starting point is retry.backoff.ms and it
> > exponentially goes up to retry.backoff.max.ms. If retry.backoff.max.ms
> is
> > smaller than retry.backoff.ms then that could pose a bit of a problem
> > there right?
> >
> > On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wa...@gmail.com>,
> wrote:
> > > Thanks Sanjana, I did not capture the part that Jason referred to, so
> > > that's my bad :P
> > >
> > > Regarding your last statement, I actually feel that instead of take the
> > > larger of the two, we should respect "retry.backoff.max.ms" even if it
> > is
> > > smaller than "retry.backoff.ms". I do not have a very strong rationale
> > > except it is logically more aligned to the config names.
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <
> skaundinya@gmail.com>
> > > wrote:
> > >
> > > > Hey Jason and Guozhang,
> > > >
> > > > Jason is right, I took this inspiration from KIP-144 (
> > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > > > )
> > > > which had the same logic in order to preserve the existing behavior.
> In
> > > > this case however, if we are thinking to completely eliminate the
> > static
> > > > backoff behavior, we can do that and as Jason mentioned put it in the
> > > > release notes and not add any special logic. In addition I agree that
> > we
> > > > should take the larger of the two of `retry.backoff.ms` and `
> > > > retry.backoff.max.ms`. I'll update the KIP to reflect this and make
> it
> > > > clear that the old static retry backoff is getting replaced by the
> new
> > > > dynamic retry backoff.
> > > >
> > > > Thanks,
> > > > Sanjana
> > > > On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hey Guozhang,
> > > > >
> > > > > I was referring to this:
> > > > >
> > > > > > For users who have not set retry.backoff.ms explicitly, the
> > default
> > > > > behavior will change so that the backoff will grow up to 1000 ms.
> For
> > > > users
> > > > > who have set retry.backoff.ms explicitly, the behavior will remain
> > the
> > > > > same
> > > > > as they could have specific requirements.
> > > > >
> > > > > I took this to mean that for users who have overridden `
> > retry.backoff.ms
> > > > `
> > > > > to 50ms (say), we will change the default `retry.backoff.max.ms`
> to
> > 50ms
> > > > > as
> > > > > well in order to preserve existing backoff behavior. Is that not
> > right?
> > > > In
> > > > > any case, I agree that we can use the maximum of the two values as
> > the
> > > > > effective `retry.backoff.max.ms` to handle the case when the
> > configured
> > > > > value of `retry.backoff.ms` is larger than the default of 1s.
> > > > >
> > > > > -Jason
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wa...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hey Jason,
> > > > > >
> > > > > > My understanding is a bit different here: even if user has an
> > explicit
> > > > > > overridden "retry.backoff.ms", the exponential mechanism still
> > > > triggers
> > > > > > and
> > > > > > the backoff would be increased till "retry.backoff.max.ms"; and
> > if the
> > > > > > specified "retry.backoff.ms" is already larger than the "
> > > > > > retry.backoff.max.ms", we would still take "retry.backoff.max.ms
> ".
> > > > > >
> > > > > > So if the user does override the "retry.backoff.ms" to a value
> > larger
> > > > > than
> > > > > > 1s and is not aware of the new config, she would be surprised to
> > see
> > > > the
> > > > > > specified value seemingly not being respected, but she could
> still
> > > > learn
> > > > > > that afterwards by reading the release notes introducing this KIP
> > > > > anyways.
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> > jason@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Sanjana,
> > > > > > >
> > > > > > > The KIP looks good to me. I had just one question about the
> > default
> > > > > > > behavior. As I understand, if the user has specified `
> > > > retry.backoff.ms
> > > > > `
> > > > > > > explicitly, then we will not apply the default max backoff. As
> > such,
> > > > > > > there's no way to get the benefit of this feature if you are
> > > > providing
> > > > > a
> > > > > > `
> > > > > > > retry.backoff.ms` unless you also provide `
> retry.backoff.max.ms
> > `.
> > > > That
> > > > > > > makes sense if you assume the user is unaware of the new
> > > > configuration,
> > > > > > but
> > > > > > > it is surprising otherwise. Since it's not a semantic change
> and
> > > > since
> > > > > > the
> > > > > > > default you're proposing of 1s is fairly low already, I wonder
> if
> > > > it's
> > > > > > good
> > > > > > > enough to mention the new configuration in the release notes
> and
> > not
> > > > > add
> > > > > > > any special logic. What do you think?
> > > > > > >
> > > > > > > -Jason
> > > > > > >
> > > > > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > > > skaundinya@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thank you for the comments Guozhang.
> > > > > > > >
> > > > > > > > I’ll leave this KIP out for discussion till the end of the
> > week and
> > > > > > then
> > > > > > > > start a vote for this early next week.
> > > > > > > >
> > > > > > > > Sanjana
> > > > > > > >
> > > > > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> > wangguoz@gmail.com
> > > > > ,
> > > > > > > wrote:
> > > > > > > > > Hello Sanjana,
> > > > > > > > >
> > > > > > > > > Thanks for the proposed KIP, I think that makes a lot of
> > sense --
> > > > > as
> > > > > > > you
> > > > > > > > > mentioned in the motivation, we've indeed seen many issues
> > with
> > > > > > regard
> > > > > > > to
> > > > > > > > > the frequent retries, with bounded exponential backoff in
> the
> > > > > > scenario
> > > > > > > > > where there's a long connectivity issue we would
> effectively
> > > > reduce
> > > > > > the
> > > > > > > > > request load by 10 given the default configs.
> > > > > > > > >
> > > > > > > > > For higher-level Streams client and Connect frameworks,
> > today we
> > > > > also
> > > > > > > > have
> > > > > > > > > a retry logic but that's used in a slightly different way.
> > For
> > > > > > example
> > > > > > > in
> > > > > > > > > Streams, we tend to handle the retry logic at the
> > thread-level
> > > > and
> > > > > > > hence
> > > > > > > > > very likely we'd like to change that mechanism in KIP-572
> > > > anyways.
> > > > > > For
> > > > > > > > > producer / consumer / admin clients, I think just applying
> > this
> > > > > > > > behavioral
> > > > > > > > > change across these clients makes lot of sense. So I think
> > can
> > > > just
> > > > > > > leave
> > > > > > > > > the Streams / Connect out of the scope of this KIP to be
> > > > addressed
> > > > > in
> > > > > > > > > separate discussions.
> > > > > > > > >
> > > > > > > > > I do not have further comments about this KIP :) LGTM.
> > > > > > > > >
> > > > > > > > > Guozhang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > > > > > skaundinya@gmail.com
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback Boyang.
> > > > > > > > > >
> > > > > > > > > > If there’s anyone else who has feedback regarding this
> KIP,
> > > > would
> > > > > > > > really
> > > > > > > > > > appreciate it hearing it!
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Sanjana
> > > > > > > > > >
> > > > > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > > > bchen11@outlook.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Sounds great!
> > > > > > > > > > >
> > > > > > > > > > > Get Outlook for iOS<https://aka.ms/o0ukef>
> > > > > > > > > > > ________________________________
> > > > > > > > > > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > > > > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > > > > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > > > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
> > Kafka
> > > > > > > Clients
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the explanation Boyang. One of the most
> common
> > > > > > problems
> > > > > > > > that
> > > > > > > > > > we
> > > > > > > > > > > have in Kafka is with respect to metadata fetches. For
> > > > example,
> > > > > > if
> > > > > > > > there
> > > > > > > > > > is
> > > > > > > > > > > a broker failure, all clients start to fetch metadata
> at
> > the
> > > > > same
> > > > > > > > time
> > > > > > > > > > and
> > > > > > > > > > > it often takes a while for the metadata to converge.
> In a
> > > > high
> > > > > > load
> > > > > > > > > > > cluster, there are also issues where the volume of
> > metadata
> > > > has
> > > > > > > made
> > > > > > > > > > > convergence of metadata slower.
> > > > > > > > > > >
> > > > > > > > > > > For this case, exponential backoff helps as it reduces
> > the
> > > > > retry
> > > > > > > > rate and
> > > > > > > > > > > spaces out how often clients will retry, thereby
> bringing
> > > > down
> > > > > > the
> > > > > > > > time
> > > > > > > > > > for
> > > > > > > > > > > convergence. Something that Jason mentioned that would
> > be a
> > > > > great
> > > > > > > > > > addition
> > > > > > > > > > > here would be if the backoff should be “jittered” as it
> > was
> > > > in
> > > > > > > > KIP-144
> > > > > > > > > > with
> > > > > > > > > > > respect to exponential reconnect backoff. This would
> help
> > > > > prevent
> > > > > > > the
> > > > > > > > > > > clients from being synchronized on when they retry,
> > thereby
> > > > > > spacing
> > > > > > > > out
> > > > > > > > > > the
> > > > > > > > > > > number of requests being sent to the broker at the same
> > time.
> > > > > > > > > > >
> > > > > > > > > > > I’ll add this example to the KIP and flush out more of
> > the
> > > > > > details
> > > > > > > -
> > > > > > > > so
> > > > > > > > > > > it’s more clear.
> > > > > > > > > > >
> > > > > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > > > > > reluctanthero104@gmail.com
> > > > > > > > > > > ,
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Thanks for the reply Sanjana. I guess I would like to
> > > > > rephrase
> > > > > > my
> > > > > > > > > > > question
> > > > > > > > > > > > 2 and 3 as my previous response is a little bit
> > > > unactionable.
> > > > > > > > > > > >
> > > > > > > > > > > > My specific point is that exponential backoff is not
> a
> > > > silver
> > > > > > > > bullet
> > > > > > > > > > and
> > > > > > > > > > > we
> > > > > > > > > > > > should consider using it to solve known problems,
> > instead
> > > > of
> > > > > > > > making the
> > > > > > > > > > > > holistic changes to all clients in Kafka ecosystem. I
> > do
> > > > like
> > > > > > the
> > > > > > > > > > > > exponential backoff idea and believe this would be of
> > great
> > > > > > > value,
> > > > > > > > but
> > > > > > > > > > > > maybe we should focus on proposing some existing
> > modules
> > > > that
> > > > > > are
> > > > > > > > > > > suffering
> > > > > > > > > > > > from static retry, and only change them in this first
> > KIP.
> > > > If
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > > future, some other component users believe they are
> > also
> > > > > > > > suffering, we
> > > > > > > > > > > > could get more minor KIPs to change the behavior as
> > well.
> > > > > > > > > > > >
> > > > > > > > > > > > Boyang
> > > > > > > > > > > >
> > > > > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > > > > > > skaundinya@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the feedback Boyang, I will revise the
> KIP
> > > > with
> > > > > > the
> > > > > > > > > > > > > mathematical relations as per your suggestion. To
> > address
> > > > > > your
> > > > > > > > > > > feedback:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Currently, with the default of 100 ms per retry
> > > > backoff,
> > > > > > in
> > > > > > > 1
> > > > > > > > > > second
> > > > > > > > > > > > > we would have 10 retries. In the case of using an
> > > > > exponential
> > > > > > > > > > backoff,
> > > > > > > > > > > we
> > > > > > > > > > > > > would have a total of 4 retries in 1 second. Thus
> we
> > have
> > > > > > less
> > > > > > > > than
> > > > > > > > > > > half of
> > > > > > > > > > > > > the amount of retries in the same timeframe and can
> > > > lessen
> > > > > > > broker
> > > > > > > > > > > pressure.
> > > > > > > > > > > > > This calculation is done as following (using the
> > formula
> > > > > laid
> > > > > > > > out in
> > > > > > > > > > > the
> > > > > > > > > > > > > KIP:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100
> > ms
> > > > > > (default
> > > > > > > > retry
> > > > > > > > > > > ms
> > > > > > > > > > > > > is initially 100 ms)
> > > > > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in
> > 200 ms
> > > > > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in
> > 400 ms
> > > > > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in
> > 800 ms
> > > > > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in
> > 1000
> > > > ms
> > > > > > > > (default
> > > > > > > > > > max
> > > > > > > > > > > > > retry ms is 1000 ms)
> > > > > > > > > > > > >
> > > > > > > > > > > > > For 2 and 3, could you elaborate more about what
> you
> > mean
> > > > > > with
> > > > > > > > > > respect
> > > > > > > > > > > to
> > > > > > > > > > > > > client timeouts? I’m not very familiar with the
> > Streams
> > > > > > > > framework, so
> > > > > > > > > > > would
> > > > > > > > > > > > > love to get more insight to how that currently
> works,
> > > > with
> > > > > > > > respect to
> > > > > > > > > > > > > producer transactions, so I can appropriately
> update
> > the
> > > > > KIP
> > > > > > to
> > > > > > > > > > address
> > > > > > > > > > > > > these scenarios.
> > > > > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > > > > > > > reluctanthero104@gmail.com>,
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > Thanks for the KIP Sanjana. I think the
> motivation
> > is
> > > > > good,
> > > > > > > but
> > > > > > > > > > lack
> > > > > > > > > > > of
> > > > > > > > > > > > > > more quantitative analysis. For instance:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. How much retries we are saving by applying the
> > > > > > exponential
> > > > > > > > retry
> > > > > > > > > > > vs
> > > > > > > > > > > > > > static retry? There should be some mathematical
> > > > relations
> > > > > > > > between
> > > > > > > > > > the
> > > > > > > > > > > > > > static retry ms, the initial exponential retry
> ms,
> > the
> > > > > max
> > > > > > > > > > > exponential
> > > > > > > > > > > > > > retry ms in a given time interval.
> > > > > > > > > > > > > > 2. How does this affect the client timeout? With
> > > > > > exponential
> > > > > > > > retry,
> > > > > > > > > > > the
> > > > > > > > > > > > > > client shall be getting easier to timeout on a
> > parent
> > > > > level
> > > > > > > > caller,
> > > > > > > > > > > for
> > > > > > > > > > > > > > instance stream attempts to retry initializing
> > producer
> > > > > > > > > > transactions
> > > > > > > > > > > with
> > > > > > > > > > > > > > given 5 minute interval. With exponential retry
> > this
> > > > > > > mechanism
> > > > > > > > > > could
> > > > > > > > > > > > > > experience more frequent timeout which we should
> be
> > > > > careful
> > > > > > > > with.
> > > > > > > > > > > > > > 3. With regards to #2, we should have more
> detailed
> > > > > > checklist
> > > > > > > > of
> > > > > > > > > > all
> > > > > > > > > > > the
> > > > > > > > > > > > > > existing static retry scenarios, and adjust the
> > initial
> > > > > > > > exponential
> > > > > > > > > > > retry
> > > > > > > > > > > > > > ms to make sure we won't get easily timeout in
> high
> > > > level
> > > > > > due
> > > > > > > > to
> > > > > > > > > > too
> > > > > > > > > > > few
> > > > > > > > > > > > > > attempts.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana
> Kaundinya <
> > > > > > > > > > > skaundinya@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I’ve written a KIP about introducing
> exponential
> > > > > backoff
> > > > > > > for
> > > > > > > > > > Kafka
> > > > > > > > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Guozhang Wang <wa...@gmail.com>.
Following the formula you have in the KIP, if it is simply:

MIN(retry.backoff.max.ms, (retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2))

then even when retry.backoff.ms is configured to be larger than
retry.backoff.max.ms, the behavior would stay consistent: the backoff is
simply capped at retry.backoff.max.ms.
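
In code form, a minimal sketch of that computation (illustrative only, not
the actual client implementation) would look something like:

    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative sketch of the bounded exponential backoff with jitter
    // discussed above; names and structure are assumptions, not the real code.
    public class ExponentialBackoffSketch {
        private final long retryBackoffMs;     // retry.backoff.ms
        private final long retryBackoffMaxMs;  // retry.backoff.max.ms

        public ExponentialBackoffSketch(long retryBackoffMs, long retryBackoffMaxMs) {
            this.retryBackoffMs = retryBackoffMs;
            this.retryBackoffMaxMs = retryBackoffMaxMs;
        }

        // failures = number of consecutive failed attempts so far (>= 1)
        public long backoffMs(int failures) {
            double exp = retryBackoffMs * Math.pow(2, failures - 1);
            double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
            return (long) Math.min((double) retryBackoffMaxMs, exp * jitter);
        }
    }

For example, with retry.backoff.ms = 2000 and retry.backoff.max.ms = 1000
this always returns 1000 ms, i.e. the cap wins, which is the consistent
behavior I meant above.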


Guozhang

On Thu, Mar 19, 2020 at 5:46 PM Sanjana Kaundinya <sk...@gmail.com>
wrote:

> If that’s the case then what should we base the starting point as?
> Currently in the KIP the starting point is retry.backoff.ms and it
> exponentially goes up to retry.backoff.max.ms. If retry.backoff.max.ms is
> smaller than retry.backoff.ms then that could pose a bit of a problem
> there right?
>
> On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wa...@gmail.com>, wrote:
> > Thanks Sanjana, I did not capture the part that Jason referred to, so
> > that's my bad :P
> >
> > Regarding your last statement, I actually feel that instead of take the
> > larger of the two, we should respect "retry.backoff.max.ms" even if it
> is
> > smaller than "retry.backoff.ms". I do not have a very strong rationale
> > except it is logically more aligned to the config names.
> >
> >
> > Guozhang
> >
> >
> > On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <sk...@gmail.com>
> > wrote:
> >
> > > Hey Jason and Guozhang,
> > >
> > > Jason is right, I took this inspiration from KIP-144 (
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > > )
> > > which had the same logic in order to preserve the existing behavior. In
> > > this case however, if we are thinking to completely eliminate the
> static
> > > backoff behavior, we can do that and as Jason mentioned put it in the
> > > release notes and not add any special logic. In addition I agree that
> we
> > > should take the larger of the two of `retry.backoff.ms` and `
> > > retry.backoff.max.ms`. I'll update the KIP to reflect this and make it
> > > clear that the old static retry backoff is getting replaced by the new
> > > dynamic retry backoff.
> > >
> > > Thanks,
> > > Sanjana
> > > On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > I was referring to this:
> > > >
> > > > > For users who have not set retry.backoff.ms explicitly, the
> default
> > > > behavior will change so that the backoff will grow up to 1000 ms. For
> > > users
> > > > who have set retry.backoff.ms explicitly, the behavior will remain
> the
> > > > same
> > > > as they could have specific requirements.
> > > >
> > > > I took this to mean that for users who have overridden `
> retry.backoff.ms
> > > `
> > > > to 50ms (say), we will change the default `retry.backoff.max.ms` to
> 50ms
> > > > as
> > > > well in order to preserve existing backoff behavior. Is that not
> right?
> > > In
> > > > any case, I agree that we can use the maximum of the two values as
> the
> > > > effective `retry.backoff.max.ms` to handle the case when the
> configured
> > > > value of `retry.backoff.ms` is larger than the default of 1s.
> > > >
> > > > -Jason
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > Hey Jason,
> > > > >
> > > > > My understanding is a bit different here: even if user has an
> explicit
> > > > > overridden "retry.backoff.ms", the exponential mechanism still
> > > triggers
> > > > > and
> > > > > the backoff would be increased till "retry.backoff.max.ms"; and
> if the
> > > > > specified "retry.backoff.ms" is already larger than the "
> > > > > retry.backoff.max.ms", we would still take "retry.backoff.max.ms".
> > > > >
> > > > > So if the user does override the "retry.backoff.ms" to a value
> larger
> > > > than
> > > > > 1s and is not aware of the new config, she would be surprised to
> see
> > > the
> > > > > specified value seemingly not being respected, but she could still
> > > learn
> > > > > that afterwards by reading the release notes introducing this KIP
> > > > anyways.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <
> jason@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi Sanjana,
> > > > > >
> > > > > > The KIP looks good to me. I had just one question about the
> default
> > > > > > behavior. As I understand, if the user has specified `
> > > retry.backoff.ms
> > > > `
> > > > > > explicitly, then we will not apply the default max backoff. As
> such,
> > > > > > there's no way to get the benefit of this feature if you are
> > > providing
> > > > a
> > > > > `
> > > > > > retry.backoff.ms` unless you also provide `retry.backoff.max.ms
> `.
> > > That
> > > > > > makes sense if you assume the user is unaware of the new
> > > configuration,
> > > > > but
> > > > > > it is surprising otherwise. Since it's not a semantic change and
> > > since
> > > > > the
> > > > > > default you're proposing of 1s is fairly low already, I wonder if
> > > it's
> > > > > good
> > > > > > enough to mention the new configuration in the release notes and
> not
> > > > add
> > > > > > any special logic. What do you think?
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > > skaundinya@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thank you for the comments Guozhang.
> > > > > > >
> > > > > > > I’ll leave this KIP out for discussion till the end of the
> week and
> > > > > then
> > > > > > > start a vote for this early next week.
> > > > > > >
> > > > > > > Sanjana
> > > > > > >
> > > > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <
> wangguoz@gmail.com
> > > > ,
> > > > > > wrote:
> > > > > > > > Hello Sanjana,
> > > > > > > >
> > > > > > > > Thanks for the proposed KIP, I think that makes a lot of
> sense --
> > > > as
> > > > > > you
> > > > > > > > mentioned in the motivation, we've indeed seen many issues
> with
> > > > > regard
> > > > > > to
> > > > > > > > the frequent retries, with bounded exponential backoff in the
> > > > > scenario
> > > > > > > > where there's a long connectivity issue we would effectively
> > > reduce
> > > > > the
> > > > > > > > request load by 10 given the default configs.
> > > > > > > >
> > > > > > > > For higher-level Streams client and Connect frameworks,
> today we
> > > > also
> > > > > > > have
> > > > > > > > a retry logic but that's used in a slightly different way.
> For
> > > > > example
> > > > > > in
> > > > > > > > Streams, we tend to handle the retry logic at the
> thread-level
> > > and
> > > > > > hence
> > > > > > > > very likely we'd like to change that mechanism in KIP-572
> > > anyways.
> > > > > For
> > > > > > > > producer / consumer / admin clients, I think just applying
> this
> > > > > > > behavioral
> > > > > > > > change across these clients makes lot of sense. So I think
> can
> > > just
> > > > > > leave
> > > > > > > > the Streams / Connect out of the scope of this KIP to be
> > > addressed
> > > > in
> > > > > > > > separate discussions.
> > > > > > > >
> > > > > > > > I do not have further comments about this KIP :) LGTM.
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > > > > skaundinya@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback Boyang.
> > > > > > > > >
> > > > > > > > > If there’s anyone else who has feedback regarding this KIP,
> > > would
> > > > > > > really
> > > > > > > > > appreciate it hearing it!
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Sanjana
> > > > > > > > >
> > > > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > > bchen11@outlook.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Sounds great!
> > > > > > > > > >
> > > > > > > > > > ________________________________
> > > > > > > > > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > > > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > > > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for
> Kafka
> > > > > > Clients
> > > > > > > > > >
> > > > > > > > > > Thanks for the explanation Boyang. One of the most common
> > > > > problems
> > > > > > > that
> > > > > > > > > we
> > > > > > > > > > have in Kafka is with respect to metadata fetches. For
> > > example,
> > > > > if
> > > > > > > there
> > > > > > > > > is
> > > > > > > > > > a broker failure, all clients start to fetch metadata at
> the
> > > > same
> > > > > > > time
> > > > > > > > > and
> > > > > > > > > > it often takes a while for the metadata to converge. In a
> > > high
> > > > > load
> > > > > > > > > > cluster, there are also issues where the volume of
> metadata
> > > has
> > > > > > made
> > > > > > > > > > convergence of metadata slower.
> > > > > > > > > >
> > > > > > > > > > For this case, exponential backoff helps as it reduces
> the
> > > > retry
> > > > > > > rate and
> > > > > > > > > > spaces out how often clients will retry, thereby bringing
> > > down
> > > > > the
> > > > > > > time
> > > > > > > > > for
> > > > > > > > > > convergence. Something that Jason mentioned that would
> be a
> > > > great
> > > > > > > > > addition
> > > > > > > > > > here would be if the backoff should be “jittered” as it
> was
> > > in
> > > > > > > KIP-144
> > > > > > > > > with
> > > > > > > > > > respect to exponential reconnect backoff. This would help
> > > > prevent
> > > > > > the
> > > > > > > > > > clients from being synchronized on when they retry,
> thereby
> > > > > spacing
> > > > > > > out
> > > > > > > > > the
> > > > > > > > > > number of requests being sent to the broker at the same
> time.
> > > > > > > > > >
> > > > > > > > > > I’ll add this example to the KIP and flush out more of
> the
> > > > > details
> > > > > > -
> > > > > > > so
> > > > > > > > > > it’s more clear.
> > > > > > > > > >
> > > > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > > > > reluctanthero104@gmail.com
> > > > > > > > > > ,
> > > > > > > > > > wrote:
> > > > > > > > > > > Thanks for the reply Sanjana. I guess I would like to
> > > > rephrase
> > > > > my
> > > > > > > > > > question
> > > > > > > > > > > 2 and 3 as my previous response is a little bit
> > > unactionable.
> > > > > > > > > > >
> > > > > > > > > > > My specific point is that exponential backoff is not a
> > > silver
> > > > > > > bullet
> > > > > > > > > and
> > > > > > > > > > we
> > > > > > > > > > > should consider using it to solve known problems,
> instead
> > > of
> > > > > > > making the
> > > > > > > > > > > holistic changes to all clients in Kafka ecosystem. I
> do
> > > like
> > > > > the
> > > > > > > > > > > exponential backoff idea and believe this would be of
> great
> > > > > > value,
> > > > > > > but
> > > > > > > > > > > maybe we should focus on proposing some existing
> modules
> > > that
> > > > > are
> > > > > > > > > > suffering
> > > > > > > > > > > from static retry, and only change them in this first
> KIP.
> > > If
> > > > > in
> > > > > > > the
> > > > > > > > > > > future, some other component users believe they are
> also
> > > > > > > suffering, we
> > > > > > > > > > > could get more minor KIPs to change the behavior as
> well.
> > > > > > > > > > >
> > > > > > > > > > > Boyang
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > > > > > skaundinya@gmail.com
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback Boyang, I will revise the KIP
> > > with
> > > > > the
> > > > > > > > > > > > mathematical relations as per your suggestion. To
> address
> > > > > your
> > > > > > > > > > feedback:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Currently, with the default of 100 ms per retry
> > > backoff,
> > > > > in
> > > > > > 1
> > > > > > > > > second
> > > > > > > > > > > > we would have 10 retries. In the case of using an
> > > > exponential
> > > > > > > > > backoff,
> > > > > > > > > > we
> > > > > > > > > > > > would have a total of 4 retries in 1 second. Thus we
> have
> > > > > less
> > > > > > > than
> > > > > > > > > > half of
> > > > > > > > > > > > the amount of retries in the same timeframe and can
> > > lessen
> > > > > > broker
> > > > > > > > > > pressure.
> > > > > > > > > > > > This calculation is done as following (using the
> formula
> > > > laid
> > > > > > > out in
> > > > > > > > > > the
> > > > > > > > > > > > KIP:
> > > > > > > > > > > >
> > > > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100
> ms
> > > > > (default
> > > > > > > retry
> > > > > > > > > > ms
> > > > > > > > > > > > is initially 100 ms)
> > > > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in
> 200 ms
> > > > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in
> 400 ms
> > > > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in
> 800 ms
> > > > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in
> 1000
> > > ms
> > > > > > > (default
> > > > > > > > > max
> > > > > > > > > > > > retry ms is 1000 ms)
> > > > > > > > > > > >
> > > > > > > > > > > > For 2 and 3, could you elaborate more about what you
> mean
> > > > > with
> > > > > > > > > respect
> > > > > > > > > > to
> > > > > > > > > > > > client timeouts? I’m not very familiar with the
> Streams
> > > > > > > framework, so
> > > > > > > > > > would
> > > > > > > > > > > > love to get more insight to how that currently works,
> > > with
> > > > > > > respect to
> > > > > > > > > > > > producer transactions, so I can appropriately update
> the
> > > > KIP
> > > > > to
> > > > > > > > > address
> > > > > > > > > > > > these scenarios.
> > > > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > > > > > > reluctanthero104@gmail.com>,
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > Thanks for the KIP Sanjana. I think the motivation
> is
> > > > good,
> > > > > > but
> > > > > > > > > lack
> > > > > > > > > > of
> > > > > > > > > > > > > more quantitative analysis. For instance:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. How much retries we are saving by applying the
> > > > > exponential
> > > > > > > retry
> > > > > > > > > > vs
> > > > > > > > > > > > > static retry? There should be some mathematical
> > > relations
> > > > > > > between
> > > > > > > > > the
> > > > > > > > > > > > > static retry ms, the initial exponential retry ms,
> the
> > > > max
> > > > > > > > > > exponential
> > > > > > > > > > > > > retry ms in a given time interval.
> > > > > > > > > > > > > 2. How does this affect the client timeout? With
> > > > > exponential
> > > > > > > retry,
> > > > > > > > > > the
> > > > > > > > > > > > > client shall be getting easier to timeout on a
> parent
> > > > level
> > > > > > > caller,
> > > > > > > > > > for
> > > > > > > > > > > > > instance stream attempts to retry initializing
> producer
> > > > > > > > > transactions
> > > > > > > > > > with
> > > > > > > > > > > > > given 5 minute interval. With exponential retry
> this
> > > > > > mechanism
> > > > > > > > > could
> > > > > > > > > > > > > experience more frequent timeout which we should be
> > > > careful
> > > > > > > with.
> > > > > > > > > > > > > 3. With regards to #2, we should have more detailed
> > > > > checklist
> > > > > > > of
> > > > > > > > > all
> > > > > > > > > > the
> > > > > > > > > > > > > existing static retry scenarios, and adjust the
> initial
> > > > > > > exponential
> > > > > > > > > > retry
> > > > > > > > > > > > > ms to make sure we won't get easily timeout in high
> > > level
> > > > > due
> > > > > > > to
> > > > > > > > > too
> > > > > > > > > > few
> > > > > > > > > > > > > attempts.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Boyang
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > > > > > > > > skaundinya@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I’ve written a KIP about introducing exponential
> > > > backoff
> > > > > > for
> > > > > > > > > Kafka
> > > > > > > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
If that’s the case, then what should we use as the starting point? Currently in the KIP the starting point is retry.backoff.ms, and the backoff grows exponentially up to retry.backoff.max.ms. If retry.backoff.max.ms is smaller than retry.backoff.ms, that could pose a bit of a problem, right?
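
For reference, a rough sketch of the bounded exponential backoff under discussion, assuming a doubling policy that starts at retry.backoff.ms and is capped at retry.backoff.max.ms (illustrative only, not the KIP's actual implementation; the class, the method names, and the +/-20% jitter factor are made up for the sketch):

    import java.util.concurrent.ThreadLocalRandom;

    public class BackoffSketch {
        // Backoff to wait after the given number of consecutive failures:
        // initialMs * 2^failures, capped at maxMs. Note that if maxMs is
        // configured below initialMs, every backoff is immediately clamped
        // to maxMs, which is the corner case raised above.
        static long backoffMs(long initialMs, long maxMs, int failures) {
            double exp = initialMs * Math.pow(2, failures);
            return (long) Math.min(exp, maxMs);
        }

        // Optional jitter, in the spirit of the KIP-144 suggestion earlier in
        // the thread, so clients that failed together do not retry together.
        static long jittered(long backoffMs) {
            return (long) (backoffMs * ThreadLocalRandom.current().nextDouble(0.8, 1.2));
        }

        public static void main(String[] args) {
            long initialMs = 100; // retry.backoff.ms default
            long maxMs = 1000;    // proposed retry.backoff.max.ms default
            long elapsedMs = 0;
            for (int failures = 0; failures < 5; failures++) {
                long wait = backoffMs(initialMs, maxMs, failures);
                System.out.printf("Try %d at %d ms, next retry in %d ms%n",
                        failures + 1, elapsedMs, wait);
                elapsedMs += wait;
            }
        }
    }

With the defaults above this reproduces the 100/200/400/800/1000 ms progression worked out elsewhere in this thread (tries at 0, 100, 300, 700 and 1500 ms), versus a retry every 100 ms with the current static backoff.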

On Mar 19, 2020, 5:44 PM -0700, Guozhang Wang <wa...@gmail.com>, wrote:
> Thanks Sanjana, I did not capture the part that Jason referred to, so
> that's my bad :P
>
> Regarding your last statement, I actually feel that instead of take the
> larger of the two, we should respect "retry.backoff.max.ms" even if it is
> smaller than "retry.backoff.ms". I do not have a very strong rationale
> except it is logically more aligned to the config names.
>
>
> Guozhang
>
>
> On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <sk...@gmail.com>
> wrote:
>
> > Hey Jason and Guozhang,
> >
> > Jason is right, I took this inspiration from KIP-144 (
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> > )
> > which had the same logic in order to preserve the existing behavior. In
> > this case however, if we are thinking to completely eliminate the static
> > backoff behavior, we can do that and as Jason mentioned put it in the
> > release notes and not add any special logic. In addition I agree that we
> > should take the larger of the two of `retry.backoff.ms` and `
> > retry.backoff.max.ms`. I'll update the KIP to reflect this and make it
> > clear that the old static retry backoff is getting replaced by the new
> > dynamic retry backoff.
> >
> > Thanks,
> > Sanjana
> > On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Guozhang,
> > >
> > > I was referring to this:
> > >
> > > > For users who have not set retry.backoff.ms explicitly, the default
> > > behavior will change so that the backoff will grow up to 1000 ms. For
> > users
> > > who have set retry.backoff.ms explicitly, the behavior will remain the
> > > same
> > > as they could have specific requirements.
> > >
> > > I took this to mean that for users who have overridden `retry.backoff.ms
> > `
> > > to 50ms (say), we will change the default `retry.backoff.max.ms` to 50ms
> > > as
> > > well in order to preserve existing backoff behavior. Is that not right?
> > In
> > > any case, I agree that we can use the maximum of the two values as the
> > > effective `retry.backoff.max.ms` to handle the case when the configured
> > > value of `retry.backoff.ms` is larger than the default of 1s.
> > >
> > > -Jason
> > >
> > >
> > >
> > >
> > > On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Hey Jason,
> > > >
> > > > My understanding is a bit different here: even if user has an explicit
> > > > overridden "retry.backoff.ms", the exponential mechanism still
> > triggers
> > > > and
> > > > the backoff would be increased till "retry.backoff.max.ms"; and if the
> > > > specified "retry.backoff.ms" is already larger than the "
> > > > retry.backoff.max.ms", we would still take "retry.backoff.max.ms".
> > > >
> > > > So if the user does override the "retry.backoff.ms" to a value larger
> > > than
> > > > 1s and is not aware of the new config, she would be surprised to see
> > the
> > > > specified value seemingly not being respected, but she could still
> > learn
> > > > that afterwards by reading the release notes introducing this KIP
> > > anyways.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hi Sanjana,
> > > > >
> > > > > The KIP looks good to me. I had just one question about the default
> > > > > behavior. As I understand, if the user has specified `
> > retry.backoff.ms
> > > `
> > > > > explicitly, then we will not apply the default max backoff. As such,
> > > > > there's no way to get the benefit of this feature if you are
> > providing
> > > a
> > > > `
> > > > > retry.backoff.ms` unless you also provide `retry.backoff.max.ms`.
> > That
> > > > > makes sense if you assume the user is unaware of the new
> > configuration,
> > > > but
> > > > > it is surprising otherwise. Since it's not a semantic change and
> > since
> > > > the
> > > > > default you're proposing of 1s is fairly low already, I wonder if
> > it's
> > > > good
> > > > > enough to mention the new configuration in the release notes and not
> > > add
> > > > > any special logic. What do you think?
> > > > >
> > > > > -Jason
> > > > >
> > > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > > skaundinya@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thank you for the comments Guozhang.
> > > > > >
> > > > > > I’ll leave this KIP out for discussion till the end of the week and
> > > > then
> > > > > > start a vote for this early next week.
> > > > > >
> > > > > > Sanjana
> > > > > >
> > > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wangguoz@gmail.com
> > > ,
> > > > > wrote:
> > > > > > > Hello Sanjana,
> > > > > > >
> > > > > > > Thanks for the proposed KIP, I think that makes a lot of sense --
> > > as
> > > > > you
> > > > > > > mentioned in the motivation, we've indeed seen many issues with
> > > > regard
> > > > > to
> > > > > > > the frequent retries, with bounded exponential backoff in the
> > > > scenario
> > > > > > > where there's a long connectivity issue we would effectively
> > reduce
> > > > the
> > > > > > > request load by 10 given the default configs.
> > > > > > >
> > > > > > > For higher-level Streams client and Connect frameworks, today we
> > > also
> > > > > > have
> > > > > > > a retry logic but that's used in a slightly different way. For
> > > > example
> > > > > in
> > > > > > > Streams, we tend to handle the retry logic at the thread-level
> > and
> > > > > hence
> > > > > > > very likely we'd like to change that mechanism in KIP-572
> > anyways.
> > > > For
> > > > > > > producer / consumer / admin clients, I think just applying this
> > > > > > behavioral
> > > > > > > change across these clients makes lot of sense. So I think can
> > just
> > > > > leave
> > > > > > > the Streams / Connect out of the scope of this KIP to be
> > addressed
> > > in
> > > > > > > separate discussions.
> > > > > > >
> > > > > > > I do not have further comments about this KIP :) LGTM.
> > > > > > >
> > > > > > > Guozhang
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > > > skaundinya@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback Boyang.
> > > > > > > >
> > > > > > > > If there’s anyone else who has feedback regarding this KIP,
> > would
> > > > > > really
> > > > > > > > appreciate it hearing it!
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Sanjana
> > > > > > > >
> > > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > > bchen11@outlook.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Sounds great!
> > > > > > > > >
> > > > > > > > > ________________________________
> > > > > > > > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka
> > > > > Clients
> > > > > > > > >
> > > > > > > > > Thanks for the explanation Boyang. One of the most common
> > > > problems
> > > > > > that
> > > > > > > > we
> > > > > > > > > have in Kafka is with respect to metadata fetches. For
> > example,
> > > > if
> > > > > > there
> > > > > > > > is
> > > > > > > > > a broker failure, all clients start to fetch metadata at the
> > > same
> > > > > > time
> > > > > > > > and
> > > > > > > > > it often takes a while for the metadata to converge. In a
> > high
> > > > load
> > > > > > > > > cluster, there are also issues where the volume of metadata
> > has
> > > > > made
> > > > > > > > > convergence of metadata slower.
> > > > > > > > >
> > > > > > > > > For this case, exponential backoff helps as it reduces the
> > > retry
> > > > > > rate and
> > > > > > > > > spaces out how often clients will retry, thereby bringing
> > down
> > > > the
> > > > > > time
> > > > > > > > for
> > > > > > > > > convergence. Something that Jason mentioned that would be a
> > > great
> > > > > > > > addition
> > > > > > > > > here would be if the backoff should be “jittered” as it was
> > in
> > > > > > KIP-144
> > > > > > > > with
> > > > > > > > > respect to exponential reconnect backoff. This would help
> > > prevent
> > > > > the
> > > > > > > > > clients from being synchronized on when they retry, thereby
> > > > spacing
> > > > > > out
> > > > > > > > the
> > > > > > > > > number of requests being sent to the broker at the same time.
> > > > > > > > >
> > > > > > > > > I’ll add this example to the KIP and flush out more of the
> > > > details
> > > > > -
> > > > > > so
> > > > > > > > > it’s more clear.
> > > > > > > > >
> > > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > > > reluctanthero104@gmail.com
> > > > > > > > > ,
> > > > > > > > > wrote:
> > > > > > > > > > Thanks for the reply Sanjana. I guess I would like to
> > > rephrase
> > > > my
> > > > > > > > > question
> > > > > > > > > > 2 and 3 as my previous response is a little bit
> > unactionable.
> > > > > > > > > >
> > > > > > > > > > My specific point is that exponential backoff is not a
> > silver
> > > > > > bullet
> > > > > > > > and
> > > > > > > > > we
> > > > > > > > > > should consider using it to solve known problems, instead
> > of
> > > > > > making the
> > > > > > > > > > holistic changes to all clients in Kafka ecosystem. I do
> > like
> > > > the
> > > > > > > > > > exponential backoff idea and believe this would be of great
> > > > > value,
> > > > > > but
> > > > > > > > > > maybe we should focus on proposing some existing modules
> > that
> > > > are
> > > > > > > > > suffering
> > > > > > > > > > from static retry, and only change them in this first KIP.
> > If
> > > > in
> > > > > > the
> > > > > > > > > > future, some other component users believe they are also
> > > > > > suffering, we
> > > > > > > > > > could get more minor KIPs to change the behavior as well.
> > > > > > > > > >
> > > > > > > > > > Boyang
> > > > > > > > > >
> > > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > > > > skaundinya@gmail.com
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback Boyang, I will revise the KIP
> > with
> > > > the
> > > > > > > > > > > mathematical relations as per your suggestion. To address
> > > > your
> > > > > > > > > feedback:
> > > > > > > > > > >
> > > > > > > > > > > 1. Currently, with the default of 100 ms per retry
> > backoff,
> > > > in
> > > > > 1
> > > > > > > > second
> > > > > > > > > > > we would have 10 retries. In the case of using an
> > > exponential
> > > > > > > > backoff,
> > > > > > > > > we
> > > > > > > > > > > would have a total of 4 retries in 1 second. Thus we have
> > > > less
> > > > > > than
> > > > > > > > > half of
> > > > > > > > > > > the amount of retries in the same timeframe and can
> > lessen
> > > > > broker
> > > > > > > > > pressure.
> > > > > > > > > > > This calculation is done as following (using the formula
> > > laid
> > > > > > out in
> > > > > > > > > the
> > > > > > > > > > > KIP:
> > > > > > > > > > >
> > > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms
> > > > (default
> > > > > > retry
> > > > > > > > > ms
> > > > > > > > > > > is initially 100 ms)
> > > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000
> > ms
> > > > > > (default
> > > > > > > > max
> > > > > > > > > > > retry ms is 1000 ms)
> > > > > > > > > > >
> > > > > > > > > > > For 2 and 3, could you elaborate more about what you mean
> > > > with
> > > > > > > > respect
> > > > > > > > > to
> > > > > > > > > > > client timeouts? I’m not very familiar with the Streams
> > > > > > framework, so
> > > > > > > > > would
> > > > > > > > > > > love to get more insight to how that currently works,
> > with
> > > > > > respect to
> > > > > > > > > > > producer transactions, so I can appropriately update the
> > > KIP
> > > > to
> > > > > > > > address
> > > > > > > > > > > these scenarios.
> > > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > > > > > reluctanthero104@gmail.com>,
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Thanks for the KIP Sanjana. I think the motivation is
> > > good,
> > > > > but
> > > > > > > > lack
> > > > > > > > > of
> > > > > > > > > > > > more quantitative analysis. For instance:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. How much retries we are saving by applying the
> > > > exponential
> > > > > > retry
> > > > > > > > > vs
> > > > > > > > > > > > static retry? There should be some mathematical
> > relations
> > > > > > between
> > > > > > > > the
> > > > > > > > > > > > static retry ms, the initial exponential retry ms, the
> > > max
> > > > > > > > > exponential
> > > > > > > > > > > > retry ms in a given time interval.
> > > > > > > > > > > > 2. How does this affect the client timeout? With
> > > > exponential
> > > > > > retry,
> > > > > > > > > the
> > > > > > > > > > > > client shall be getting easier to timeout on a parent
> > > level
> > > > > > caller,
> > > > > > > > > for
> > > > > > > > > > > > instance stream attempts to retry initializing producer
> > > > > > > > transactions
> > > > > > > > > with
> > > > > > > > > > > > given 5 minute interval. With exponential retry this
> > > > > mechanism
> > > > > > > > could
> > > > > > > > > > > > experience more frequent timeout which we should be
> > > careful
> > > > > > with.
> > > > > > > > > > > > 3. With regards to #2, we should have more detailed
> > > > checklist
> > > > > > of
> > > > > > > > all
> > > > > > > > > the
> > > > > > > > > > > > existing static retry scenarios, and adjust the initial
> > > > > > exponential
> > > > > > > > > retry
> > > > > > > > > > > > ms to make sure we won't get easily timeout in high
> > level
> > > > due
> > > > > > to
> > > > > > > > too
> > > > > > > > > few
> > > > > > > > > > > > attempts.
> > > > > > > > > > > >
> > > > > > > > > > > > Boyang
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > > > > > > > skaundinya@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I’ve written a KIP about introducing exponential
> > > backoff
> > > > > for
> > > > > > > > Kafka
> > > > > > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Sanjana
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>
>
> --
> -- Guozhang

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks Sanjana, I did not capture the part that Jason referred to, so
that's my bad :P

Regarding your last statement, I actually feel that instead of taking the
larger of the two, we should respect "retry.backoff.max.ms" even if it is
smaller than "retry.backoff.ms". I do not have a very strong rationale,
except that it is logically more aligned with the config names.
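
To make the two options concrete, a hypothetical sketch (not code from the KIP) of how the effective cap could be resolved under each reading:

    public class EffectiveCapSketch {
        // Reading 1: always respect retry.backoff.max.ms as the cap, even
        // when it is configured below retry.backoff.ms (retryBackoffMs is
        // deliberately ignored here).
        static long capRespectingMaxConfig(long retryBackoffMs, long retryBackoffMaxMs) {
            return retryBackoffMaxMs;
        }

        // Reading 2: never cap below the configured retry.backoff.ms, i.e.
        // take the larger of the two values as the effective cap.
        static long capAsLargerOfTheTwo(long retryBackoffMs, long retryBackoffMaxMs) {
            return Math.max(retryBackoffMs, retryBackoffMaxMs);
        }

        public static void main(String[] args) {
            long retryBackoffMs = 2000;    // user override above the proposed default cap
            long retryBackoffMaxMs = 1000; // proposed retry.backoff.max.ms default
            System.out.println("respect max config: "
                    + capRespectingMaxConfig(retryBackoffMs, retryBackoffMaxMs) + " ms");
            System.out.println("larger of the two:  "
                    + capAsLargerOfTheTwo(retryBackoffMs, retryBackoffMaxMs) + " ms");
        }
    }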


Guozhang


On Thu, Mar 19, 2020 at 5:39 PM Sanjana Kaundinya <sk...@gmail.com>
wrote:

> Hey Jason and Guozhang,
>
> Jason is right, I took this inspiration from KIP-144 (
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts
> )
> which had the same logic in order to preserve the existing behavior. In
> this case however, if we are thinking to completely eliminate the static
> backoff behavior, we can do that and as Jason mentioned put it in the
> release notes and not add any special logic. In addition I agree that we
> should take the larger of the two of `retry.backoff.ms` and `
> retry.backoff.max.ms`. I'll update the KIP to reflect this and make it
> clear that the old static retry backoff is getting replaced by the new
> dynamic retry backoff.
>
> Thanks,
> Sanjana
> On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Guozhang,
> >
> > I was referring to this:
> >
> > > For users who have not set retry.backoff.ms explicitly, the default
> > behavior will change so that the backoff will grow up to 1000 ms. For
> users
> > who have set retry.backoff.ms explicitly, the behavior will remain the
> > same
> > as they could have specific requirements.
> >
> > I took this to mean that for users who have overridden `retry.backoff.ms
> `
> > to 50ms (say), we will change the default `retry.backoff.max.ms` to 50ms
> > as
> > well in order to preserve existing backoff behavior. Is that not right?
> In
> > any case, I agree that we can use the maximum of the two values as the
> > effective `retry.backoff.max.ms` to handle the case when the configured
> > value of `retry.backoff.ms` is larger than the default of 1s.
> >
> > -Jason
> >
> >
> >
> >
> > On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Hey Jason,
> > >
> > > My understanding is a bit different here: even if user has an explicit
> > > overridden "retry.backoff.ms", the exponential mechanism still
> triggers
> > > and
> > > the backoff would be increased till "retry.backoff.max.ms"; and if the
> > > specified "retry.backoff.ms" is already larger than the "
> > > retry.backoff.max.ms", we would still take "retry.backoff.max.ms".
> > >
> > > So if the user does override the "retry.backoff.ms" to a value larger
> > than
> > > 1s and is not aware of the new config, she would be surprised to see
> the
> > > specified value seemingly not being respected, but she could still
> learn
> > > that afterwards by reading the release notes introducing this KIP
> > anyways.
> > >
> > >
> > > Guozhang
> > >
> > > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Sanjana,
> > > >
> > > > The KIP looks good to me. I had just one question about the default
> > > > behavior. As I understand, if the user has specified `
> retry.backoff.ms
> > `
> > > > explicitly, then we will not apply the default max backoff. As such,
> > > > there's no way to get the benefit of this feature if you are
> providing
> > a
> > > `
> > > > retry.backoff.ms` unless you also provide `retry.backoff.max.ms`.
> That
> > > > makes sense if you assume the user is unaware of the new
> configuration,
> > > but
> > > > it is surprising otherwise. Since it's not a semantic change and
> since
> > > the
> > > > default you're proposing of 1s is fairly low already, I wonder if
> it's
> > > good
> > > > enough to mention the new configuration in the release notes and not
> > add
> > > > any special logic. What do you think?
> > > >
> > > > -Jason
> > > >
> > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> > skaundinya@gmail.com>
> > > > wrote:
> > > >
> > > > > Thank you for the comments Guozhang.
> > > > >
> > > > > I’ll leave this KIP out for discussion till the end of the week and
> > > then
> > > > > start a vote for this early next week.
> > > > >
> > > > > Sanjana
> > > > >
> > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wangguoz@gmail.com
> >,
> > > > wrote:
> > > > > > Hello Sanjana,
> > > > > >
> > > > > > Thanks for the proposed KIP, I think that makes a lot of sense --
> > as
> > > > you
> > > > > > mentioned in the motivation, we've indeed seen many issues with
> > > regard
> > > > to
> > > > > > the frequent retries, with bounded exponential backoff in the
> > > scenario
> > > > > > where there's a long connectivity issue we would effectively
> reduce
> > > the
> > > > > > request load by 10 given the default configs.
> > > > > >
> > > > > > For higher-level Streams client and Connect frameworks, today we
> > also
> > > > > have
> > > > > > a retry logic but that's used in a slightly different way. For
> > > example
> > > > in
> > > > > > Streams, we tend to handle the retry logic at the thread-level
> and
> > > > hence
> > > > > > very likely we'd like to change that mechanism in KIP-572
> anyways.
> > > For
> > > > > > producer / consumer / admin clients, I think just applying this
> > > > > behavioral
> > > > > > change across these clients makes lot of sense. So I think can
> just
> > > > leave
> > > > > > the Streams / Connect out of the scope of this KIP to be
> addressed
> > in
> > > > > > separate discussions.
> > > > > >
> > > > > > I do not have further comments about this KIP :) LGTM.
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > > skaundinya@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the feedback Boyang.
> > > > > > >
> > > > > > > If there’s anyone else who has feedback regarding this KIP,
> would
> > > > > really
> > > > > > > appreciate it hearing it!
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sanjana
> > > > > > >
> > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> > bchen11@outlook.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Sounds great!
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka
> > > > Clients
> > > > > > > >
> > > > > > > > Thanks for the explanation Boyang. One of the most common
> > > problems
> > > > > that
> > > > > > > we
> > > > > > > > have in Kafka is with respect to metadata fetches. For
> example,
> > > if
> > > > > there
> > > > > > > is
> > > > > > > > a broker failure, all clients start to fetch metadata at the
> > same
> > > > > time
> > > > > > > and
> > > > > > > > it often takes a while for the metadata to converge. In a
> high
> > > load
> > > > > > > > cluster, there are also issues where the volume of metadata
> has
> > > > made
> > > > > > > > convergence of metadata slower.
> > > > > > > >
> > > > > > > > For this case, exponential backoff helps as it reduces the
> > retry
> > > > > rate and
> > > > > > > > spaces out how often clients will retry, thereby bringing
> down
> > > the
> > > > > time
> > > > > > > for
> > > > > > > > convergence. Something that Jason mentioned that would be a
> > great
> > > > > > > addition
> > > > > > > > here would be if the backoff should be “jittered” as it was
> in
> > > > > KIP-144
> > > > > > > with
> > > > > > > > respect to exponential reconnect backoff. This would help
> > prevent
> > > > the
> > > > > > > > clients from being synchronized on when they retry, thereby
> > > spacing
> > > > > out
> > > > > > > the
> > > > > > > > number of requests being sent to the broker at the same time.
> > > > > > > >
> > > > > > > > I’ll add this example to the KIP and flush out more of the
> > > details
> > > > -
> > > > > so
> > > > > > > > it’s more clear.
> > > > > > > >
> > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > > reluctanthero104@gmail.com
> > > > > > > > ,
> > > > > > > > wrote:
> > > > > > > > > Thanks for the reply Sanjana. I guess I would like to
> > rephrase
> > > my
> > > > > > > > question
> > > > > > > > > 2 and 3 as my previous response is a little bit
> unactionable.
> > > > > > > > >
> > > > > > > > > My specific point is that exponential backoff is not a
> silver
> > > > > bullet
> > > > > > > and
> > > > > > > > we
> > > > > > > > > should consider using it to solve known problems, instead
> of
> > > > > making the
> > > > > > > > > holistic changes to all clients in Kafka ecosystem. I do
> like
> > > the
> > > > > > > > > exponential backoff idea and believe this would be of great
> > > > value,
> > > > > but
> > > > > > > > > maybe we should focus on proposing some existing modules
> that
> > > are
> > > > > > > > suffering
> > > > > > > > > from static retry, and only change them in this first KIP.
> If
> > > in
> > > > > the
> > > > > > > > > future, some other component users believe they are also
> > > > > suffering, we
> > > > > > > > > could get more minor KIPs to change the behavior as well.
> > > > > > > > >
> > > > > > > > > Boyang
> > > > > > > > >
> > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > > > skaundinya@gmail.com
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback Boyang, I will revise the KIP
> with
> > > the
> > > > > > > > > > mathematical relations as per your suggestion. To address
> > > your
> > > > > > > > feedback:
> > > > > > > > > >
> > > > > > > > > > 1. Currently, with the default of 100 ms per retry
> backoff,
> > > in
> > > > 1
> > > > > > > second
> > > > > > > > > > we would have 10 retries. In the case of using an
> > exponential
> > > > > > > backoff,
> > > > > > > > we
> > > > > > > > > > would have a total of 4 retries in 1 second. Thus we have
> > > less
> > > > > than
> > > > > > > > half of
> > > > > > > > > > the amount of retries in the same timeframe and can
> lessen
> > > > broker
> > > > > > > > pressure.
> > > > > > > > > > This calculation is done as following (using the formula
> > laid
> > > > > out in
> > > > > > > > the
> > > > > > > > > > KIP:
> > > > > > > > > >
> > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms
> > > (default
> > > > > retry
> > > > > > > > ms
> > > > > > > > > > is initially 100 ms)
> > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000
> ms
> > > > > (default
> > > > > > > max
> > > > > > > > > > retry ms is 1000 ms)
> > > > > > > > > >
> > > > > > > > > > For 2 and 3, could you elaborate more about what you mean
> > > with
> > > > > > > respect
> > > > > > > > to
> > > > > > > > > > client timeouts? I’m not very familiar with the Streams
> > > > > framework, so
> > > > > > > > would
> > > > > > > > > > love to get more insight to how that currently works,
> with
> > > > > respect to
> > > > > > > > > > producer transactions, so I can appropriately update the
> > KIP
> > > to
> > > > > > > address
> > > > > > > > > > these scenarios.
> > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > > > > reluctanthero104@gmail.com>,
> > > > > > > > > > wrote:
> > > > > > > > > > > Thanks for the KIP Sanjana. I think the motivation is
> > good,
> > > > but
> > > > > > > lack
> > > > > > > > of
> > > > > > > > > > > more quantitative analysis. For instance:
> > > > > > > > > > >
> > > > > > > > > > > 1. How much retries we are saving by applying the
> > > exponential
> > > > > retry
> > > > > > > > vs
> > > > > > > > > > > static retry? There should be some mathematical
> relations
> > > > > between
> > > > > > > the
> > > > > > > > > > > static retry ms, the initial exponential retry ms, the
> > max
> > > > > > > > exponential
> > > > > > > > > > > retry ms in a given time interval.
> > > > > > > > > > > 2. How does this affect the client timeout? With
> > > exponential
> > > > > retry,
> > > > > > > > the
> > > > > > > > > > > client shall be getting easier to timeout on a parent
> > level
> > > > > caller,
> > > > > > > > for
> > > > > > > > > > > instance stream attempts to retry initializing producer
> > > > > > > transactions
> > > > > > > > with
> > > > > > > > > > > given 5 minute interval. With exponential retry this
> > > > mechanism
> > > > > > > could
> > > > > > > > > > > experience more frequent timeout which we should be
> > careful
> > > > > with.
> > > > > > > > > > > 3. With regards to #2, we should have more detailed
> > > checklist
> > > > > of
> > > > > > > all
> > > > > > > > the
> > > > > > > > > > > existing static retry scenarios, and adjust the initial
> > > > > exponential
> > > > > > > > retry
> > > > > > > > > > > ms to make sure we won't get easily timeout in high
> level
> > > due
> > > > > to
> > > > > > > too
> > > > > > > > few
> > > > > > > > > > > attempts.
> > > > > > > > > > >
> > > > > > > > > > > Boyang
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > > > > > > skaundinya@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > >
> > > > > > > > > > > > I’ve written a KIP about introducing exponential
> > backoff
> > > > for
> > > > > > > Kafka
> > > > > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Sanjana
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
Hey Jason and Guozhang,

Jason is right, I took this inspiration from KIP-144 (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts)
which had the same logic in order to preserve the existing behavior. In
this case, however, if we are thinking of completely eliminating the static
backoff behavior, we can do that and, as Jason mentioned, call it out in the
release notes without adding any special logic. In addition, I agree that we
should take the larger of `retry.backoff.ms` and
`retry.backoff.max.ms`. I'll update the KIP to reflect this and make it
clear that the old static retry backoff is getting replaced by the new
dynamic retry backoff.

Thanks,
Sanjana
On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Guozhang,
>
> I was referring to this:
>
> > For users who have not set retry.backoff.ms explicitly, the default
> behavior will change so that the backoff will grow up to 1000 ms. For users
> who have set retry.backoff.ms explicitly, the behavior will remain the
> same
> as they could have specific requirements.
>
> I took this to mean that for users who have overridden `retry.backoff.ms`
> to 50ms (say), we will change the default `retry.backoff.max.ms` to 50ms
> as
> well in order to preserve existing backoff behavior. Is that not right? In
> any case, I agree that we can use the maximum of the two values as the
> effective `retry.backoff.max.ms` to handle the case when the configured
> value of `retry.backoff.ms` is larger than the default of 1s.
>
> -Jason
>
>
>
>
> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hey Jason,
> >
> > My understanding is a bit different here: even if user has an explicit
> > overridden "retry.backoff.ms", the exponential mechanism still triggers
> > and
> > the backoff would be increased till "retry.backoff.max.ms"; and if the
> > specified "retry.backoff.ms" is already larger than the "
> > retry.backoff.max.ms", we would still take "retry.backoff.max.ms".
> >
> > So if the user does override the "retry.backoff.ms" to a value larger
> than
> > 1s and is not aware of the new config, she would be surprised to see the
> > specified value seemingly not being respected, but she could still learn
> > that afterwards by reading the release notes introducing this KIP
> anyways.
> >
> >
> > Guozhang
> >
> > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Sanjana,
> > >
> > > The KIP looks good to me. I had just one question about the default
> > > behavior. As I understand, if the user has specified `retry.backoff.ms
> `
> > > explicitly, then we will not apply the default max backoff. As such,
> > > there's no way to get the benefit of this feature if you are providing
> a
> > `
> > > retry.backoff.ms` unless you also provide `retry.backoff.max.ms`. That
> > > makes sense if you assume the user is unaware of the new configuration,
> > but
> > > it is surprising otherwise. Since it's not a semantic change and since
> > the
> > > default you're proposing of 1s is fairly low already, I wonder if it's
> > good
> > > enough to mention the new configuration in the release notes and not
> add
> > > any special logic. What do you think?
> > >
> > > -Jason
> > >
> > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <
> skaundinya@gmail.com>
> > > wrote:
> > >
> > > > Thank you for the comments Guozhang.
> > > >
> > > > I’ll leave this KIP out for discussion till the end of the week and
> > then
> > > > start a vote for this early next week.
> > > >
> > > > Sanjana
> > > >
> > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wa...@gmail.com>,
> > > wrote:
> > > > > Hello Sanjana,
> > > > >
> > > > > Thanks for the proposed KIP, I think that makes a lot of sense --
> as
> > > you
> > > > > mentioned in the motivation, we've indeed seen many issues with
> > regard
> > > to
> > > > > the frequent retries, with bounded exponential backoff in the
> > scenario
> > > > > where there's a long connectivity issue we would effectively reduce
> > the
> > > > > request load by 10 given the default configs.
> > > > >
> > > > > For higher-level Streams client and Connect frameworks, today we
> also
> > > > have
> > > > > a retry logic but that's used in a slightly different way. For
> > example
> > > in
> > > > > Streams, we tend to handle the retry logic at the thread-level and
> > > hence
> > > > > very likely we'd like to change that mechanism in KIP-572 anyways.
> > For
> > > > > producer / consumer / admin clients, I think just applying this
> > > > behavioral
> > > > > change across these clients makes lot of sense. So I think can just
> > > leave
> > > > > the Streams / Connect out of the scope of this KIP to be addressed
> in
> > > > > separate discussions.
> > > > >
> > > > > I do not have further comments about this KIP :) LGTM.
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > > skaundinya@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks for the feedback Boyang.
> > > > > >
> > > > > > If there’s anyone else who has feedback regarding this KIP, would
> > > > really
> > > > > > appreciate it hearing it!
> > > > > >
> > > > > > Thanks,
> > > > > > Sanjana
> > > > > >
> > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <
> bchen11@outlook.com>
> > > > wrote:
> > > > > >
> > > > > > > Sounds great!
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka
> > > Clients
> > > > > > >
> > > > > > > Thanks for the explanation Boyang. One of the most common
> > problems
> > > > that
> > > > > > we
> > > > > > > have in Kafka is with respect to metadata fetches. For example,
> > if
> > > > there
> > > > > > is
> > > > > > > a broker failure, all clients start to fetch metadata at the
> same
> > > > time
> > > > > > and
> > > > > > > it often takes a while for the metadata to converge. In a high
> > load
> > > > > > > cluster, there are also issues where the volume of metadata has
> > > made
> > > > > > > convergence of metadata slower.
> > > > > > >
> > > > > > > For this case, exponential backoff helps as it reduces the
> retry
> > > > rate and
> > > > > > > spaces out how often clients will retry, thereby bringing down
> > the
> > > > time
> > > > > > for
> > > > > > > convergence. Something that Jason mentioned that would be a
> great
> > > > > > addition
> > > > > > > here would be if the backoff should be “jittered” as it was in
> > > > KIP-144
> > > > > > with
> > > > > > > respect to exponential reconnect backoff. This would help
> prevent
> > > the
> > > > > > > clients from being synchronized on when they retry, thereby
> > spacing
> > > > out
> > > > > > the
> > > > > > > number of requests being sent to the broker at the same time.
> > > > > > >
> > > > > > > I’ll add this example to the KIP and flush out more of the
> > details
> > > -
> > > > so
> > > > > > > it’s more clear.
> > > > > > >
> > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > > reluctanthero104@gmail.com
> > > > > > > ,
> > > > > > > wrote:
> > > > > > > > Thanks for the reply Sanjana. I guess I would like to
> rephrase
> > my
> > > > > > > question
> > > > > > > > 2 and 3 as my previous response is a little bit unactionable.
> > > > > > > >
> > > > > > > > My specific point is that exponential backoff is not a silver
> > > > bullet
> > > > > > and
> > > > > > > we
> > > > > > > > should consider using it to solve known problems, instead of
> > > > making the
> > > > > > > > holistic changes to all clients in Kafka ecosystem. I do like
> > the
> > > > > > > > exponential backoff idea and believe this would be of great
> > > value,
> > > > but
> > > > > > > > maybe we should focus on proposing some existing modules that
> > are
> > > > > > > suffering
> > > > > > > > from static retry, and only change them in this first KIP. If
> > in
> > > > the
> > > > > > > > future, some other component users believe they are also
> > > > suffering, we
> > > > > > > > could get more minor KIPs to change the behavior as well.
> > > > > > > >
> > > > > > > > Boyang
> > > > > > > >
> > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > > skaundinya@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback Boyang, I will revise the KIP with
> > the
> > > > > > > > > mathematical relations as per your suggestion. To address
> > your
> > > > > > > feedback:
> > > > > > > > >
> > > > > > > > > 1. Currently, with the default of 100 ms per retry backoff,
> > in
> > > 1
> > > > > > second
> > > > > > > > > we would have 10 retries. In the case of using an
> exponential
> > > > > > backoff,
> > > > > > > we
> > > > > > > > > would have a total of 4 retries in 1 second. Thus we have
> > less
> > > > than
> > > > > > > half of
> > > > > > > > > the amount of retries in the same timeframe and can lessen
> > > broker
> > > > > > > pressure.
> > > > > > > > > This calculation is done as following (using the formula
> laid
> > > > out in
> > > > > > > the
> > > > > > > > > KIP:
> > > > > > > > >
> > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms
> > (default
> > > > retry
> > > > > > > ms
> > > > > > > > > is initially 100 ms)
> > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms
> > > > (default
> > > > > > max
> > > > > > > > > retry ms is 1000 ms)
> > > > > > > > >
> > > > > > > > > For 2 and 3, could you elaborate more about what you mean
> > with
> > > > > > respect
> > > > > > > to
> > > > > > > > > client timeouts? I’m not very familiar with the Streams
> > > > framework, so
> > > > > > > would
> > > > > > > > > love to get more insight to how that currently works, with
> > > > respect to
> > > > > > > > > producer transactions, so I can appropriately update the
> KIP
> > to
> > > > > > address
> > > > > > > > > these scenarios.
> > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > > > reluctanthero104@gmail.com>,
> > > > > > > > > wrote:
> > > > > > > > > > Thanks for the KIP Sanjana. I think the motivation is
> good,
> > > but
> > > > > > lack
> > > > > > > of
> > > > > > > > > > more quantitative analysis. For instance:
> > > > > > > > > >
> > > > > > > > > > 1. How much retries we are saving by applying the
> > exponential
> > > > retry
> > > > > > > vs
> > > > > > > > > > static retry? There should be some mathematical relations
> > > > between
> > > > > > the
> > > > > > > > > > static retry ms, the initial exponential retry ms, the
> max
> > > > > > > exponential
> > > > > > > > > > retry ms in a given time interval.
> > > > > > > > > > 2. How does this affect the client timeout? With
> > exponential
> > > > retry,
> > > > > > > the
> > > > > > > > > > client shall be getting easier to timeout on a parent
> level
> > > > caller,
> > > > > > > for
> > > > > > > > > > instance stream attempts to retry initializing producer
> > > > > > transactions
> > > > > > > with
> > > > > > > > > > given 5 minute interval. With exponential retry this
> > > mechanism
> > > > > > could
> > > > > > > > > > experience more frequent timeout which we should be
> careful
> > > > with.
> > > > > > > > > > 3. With regards to #2, we should have more detailed
> > checklist
> > > > of
> > > > > > all
> > > > > > > the
> > > > > > > > > > existing static retry scenarios, and adjust the initial
> > > > exponential
> > > > > > > retry
> > > > > > > > > > ms to make sure we won't get easily timeout in high level
> > due
> > > > to
> > > > > > too
> > > > > > > few
> > > > > > > > > > attempts.
> > > > > > > > > >
> > > > > > > > > > Boyang
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > > > > > skaundinya@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Everyone,
> > > > > > > > > > >
> > > > > > > > > > > I’ve written a KIP about introducing exponential
> backoff
> > > for
> > > > > > Kafka
> > > > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Sanjana
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Guozhang,

I was referring to this:

> For users who have not set retry.backoff.ms explicitly, the default
behavior will change so that the backoff will grow up to 1000 ms. For users
who have set retry.backoff.ms explicitly, the behavior will remain the same
as they could have specific requirements.

I took this to mean that for users who have overridden `retry.backoff.ms`
to 50ms (say), we will change the default `retry.backoff.max.ms` to 50ms as
well in order to preserve existing backoff behavior. Is that not right? In
any case, I agree that we can use the maximum of the two values as the
effective `retry.backoff.max.ms` to handle the case when the configured
value of `retry.backoff.ms` is larger than the default of 1s.
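
As a concrete illustration of that rule (plain string keys, purely a sketch; no claim about actual client config constants, and retry.backoff.max.ms is still only the KIP's proposed name):

    import java.util.Properties;

    public class ClientOverrideSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Existing property: the initial backoff between retries.
            props.put("retry.backoff.ms", "2000");
            // Proposed property: the cap on the exponential backoff. If this
            // were left at its 1000 ms default, the "maximum of the two" rule
            // above would make the effective cap max(2000, 1000) = 2000 ms.
            props.put("retry.backoff.max.ms", "5000");
            System.out.println(props);
        }
    }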

-Jason




On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wa...@gmail.com> wrote:

> Hey Jason,
>
> My understanding is a bit different here: even if user has an explicit
> overridden "retry.backoff.ms", the exponential mechanism still triggers
> and
> the backoff would be increased till "retry.backoff.max.ms"; and if the
> specified "retry.backoff.ms" is already larger than the "
> retry.backoff.max.ms", we would still take "retry.backoff.max.ms".
>
> So if the user does override the "retry.backoff.ms" to a value larger than
> 1s and is not aware of the new config, she would be surprised to see the
> specified value seemingly not being respected, but she could still learn
> that afterwards by reading the release notes introducing this KIP anyways.
>
>
> Guozhang
>
> On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Sanjana,
> >
> > The KIP looks good to me. I had just one question about the default
> > behavior. As I understand, if the user has specified `retry.backoff.ms`
> > explicitly, then we will not apply the default max backoff. As such,
> > there's no way to get the benefit of this feature if you are providing a
> `
> > retry.backoff.ms` unless you also provide `retry.backoff.max.ms`. That
> > makes sense if you assume the user is unaware of the new configuration,
> but
> > it is surprising otherwise. Since it's not a semantic change and since
> the
> > default you're proposing of 1s is fairly low already, I wonder if it's
> good
> > enough to mention the new configuration in the release notes and not add
> > any special logic. What do you think?
> >
> > -Jason
> >
> > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <sk...@gmail.com>
> > wrote:
> >
> > > Thank you for the comments Guozhang.
> > >
> > > I’ll leave this KIP out for discussion till the end of the week and
> then
> > > start a vote for this early next week.
> > >
> > > Sanjana
> > >
> > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wa...@gmail.com>,
> > wrote:
> > > > Hello Sanjana,
> > > >
> > > > Thanks for the proposed KIP, I think that makes a lot of sense -- as
> > you
> > > > mentioned in the motivation, we've indeed seen many issues with
> regard
> > to
> > > > the frequent retries, with bounded exponential backoff in the
> scenario
> > > > where there's a long connectivity issue we would effectively reduce
> the
> > > > request load by 10 given the default configs.
> > > >
> > > > For higher-level Streams client and Connect frameworks, today we also
> > > have
> > > > a retry logic but that's used in a slightly different way. For
> example
> > in
> > > > Streams, we tend to handle the retry logic at the thread-level and
> > hence
> > > > very likely we'd like to change that mechanism in KIP-572 anyways.
> For
> > > > producer / consumer / admin clients, I think just applying this
> > > behavioral
> > > > change across these clients makes lot of sense. So I think can just
> > leave
> > > > the Streams / Connect out of the scope of this KIP to be addressed in
> > > > separate discussions.
> > > >
> > > > I do not have further comments about this KIP :) LGTM.
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> > skaundinya@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Thanks for the feedback Boyang.
> > > > >
> > > > > If there’s anyone else who has feedback regarding this KIP, would
> > > really
> > > > > appreciate it hearing it!
> > > > >
> > > > > Thanks,
> > > > > Sanjana
> > > > >
> > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <bc...@outlook.com>
> > > wrote:
> > > > >
> > > > > > Sounds great!
> > > > > >
> > > > > > Get Outlook for iOS<https://aka.ms/o0ukef>
> > > > > > ________________________________
> > > > > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka
> > Clients
> > > > > >
> > > > > > Thanks for the explanation Boyang. One of the most common
> problems
> > > that
> > > > > we
> > > > > > have in Kafka is with respect to metadata fetches. For example,
> if
> > > there
> > > > > is
> > > > > > a broker failure, all clients start to fetch metadata at the same
> > > time
> > > > > and
> > > > > > it often takes a while for the metadata to converge. In a high
> load
> > > > > > cluster, there are also issues where the volume of metadata has
> > made
> > > > > > convergence of metadata slower.
> > > > > >
> > > > > > For this case, exponential backoff helps as it reduces the retry
> > > rate and
> > > > > > spaces out how often clients will retry, thereby bringing down
> the
> > > time
> > > > > for
> > > > > > convergence. Something that Jason mentioned that would be a great
> > > > > addition
> > > > > > here would be if the backoff should be “jittered” as it was in
> > > KIP-144
> > > > > with
> > > > > > respect to exponential reconnect backoff. This would help prevent
> > the
> > > > > > clients from being synchronized on when they retry, thereby
> spacing
> > > out
> > > > > the
> > > > > > number of requests being sent to the broker at the same time.
> > > > > >
> > > > > > I’ll add this example to the KIP and flush out more of the
> details
> > -
> > > so
> > > > > > it’s more clear.
> > > > > >
> > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > > reluctanthero104@gmail.com
> > > > > > ,
> > > > > > wrote:
> > > > > > > Thanks for the reply Sanjana. I guess I would like to rephrase
> my
> > > > > > question
> > > > > > > 2 and 3 as my previous response is a little bit unactionable.
> > > > > > >
> > > > > > > My specific point is that exponential backoff is not a silver
> > > bullet
> > > > > and
> > > > > > we
> > > > > > > should consider using it to solve known problems, instead of
> > > making the
> > > > > > > holistic changes to all clients in Kafka ecosystem. I do like
> the
> > > > > > > exponential backoff idea and believe this would be of great
> > value,
> > > but
> > > > > > > maybe we should focus on proposing some existing modules that
> are
> > > > > > suffering
> > > > > > > from static retry, and only change them in this first KIP. If
> in
> > > the
> > > > > > > future, some other component users believe they are also
> > > suffering, we
> > > > > > > could get more minor KIPs to change the behavior as well.
> > > > > > >
> > > > > > > Boyang
> > > > > > >
> > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > > skaundinya@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback Boyang, I will revise the KIP with
> the
> > > > > > > > mathematical relations as per your suggestion. To address
> your
> > > > > > feedback:
> > > > > > > >
> > > > > > > > 1. Currently, with the default of 100 ms per retry backoff,
> in
> > 1
> > > > > second
> > > > > > > > we would have 10 retries. In the case of using an exponential
> > > > > backoff,
> > > > > > we
> > > > > > > > would have a total of 4 retries in 1 second. Thus we have
> less
> > > than
> > > > > > half of
> > > > > > > > the amount of retries in the same timeframe and can lessen
> > broker
> > > > > > pressure.
> > > > > > > > This calculation is done as following (using the formula laid
> > > out in
> > > > > > the
> > > > > > > > KIP:
> > > > > > > >
> > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms
> (default
> > > retry
> > > > > > ms
> > > > > > > > is initially 100 ms)
> > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms
> > > (default
> > > > > max
> > > > > > > > retry ms is 1000 ms)
> > > > > > > >
> > > > > > > > For 2 and 3, could you elaborate more about what you mean
> with
> > > > > respect
> > > > > > to
> > > > > > > > client timeouts? I’m not very familiar with the Streams
> > > framework, so
> > > > > > would
> > > > > > > > love to get more insight to how that currently works, with
> > > respect to
> > > > > > > > producer transactions, so I can appropriately update the KIP
> to
> > > > > address
> > > > > > > > these scenarios.
> > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > > reluctanthero104@gmail.com>,
> > > > > > > > wrote:
> > > > > > > > > Thanks for the KIP Sanjana. I think the motivation is good,
> > but
> > > > > lack
> > > > > > of
> > > > > > > > > more quantitative analysis. For instance:
> > > > > > > > >
> > > > > > > > > 1. How much retries we are saving by applying the
> exponential
> > > retry
> > > > > > vs
> > > > > > > > > static retry? There should be some mathematical relations
> > > between
> > > > > the
> > > > > > > > > static retry ms, the initial exponential retry ms, the max
> > > > > > exponential
> > > > > > > > > retry ms in a given time interval.
> > > > > > > > > 2. How does this affect the client timeout? With
> exponential
> > > retry,
> > > > > > the
> > > > > > > > > client shall be getting easier to timeout on a parent level
> > > caller,
> > > > > > for
> > > > > > > > > instance stream attempts to retry initializing producer
> > > > > transactions
> > > > > > with
> > > > > > > > > given 5 minute interval. With exponential retry this
> > mechanism
> > > > > could
> > > > > > > > > experience more frequent timeout which we should be careful
> > > with.
> > > > > > > > > 3. With regards to #2, we should have more detailed
> checklist
> > > of
> > > > > all
> > > > > > the
> > > > > > > > > existing static retry scenarios, and adjust the initial
> > > exponential
> > > > > > retry
> > > > > > > > > ms to make sure we won't get easily timeout in high level
> due
> > > to
> > > > > too
> > > > > > few
> > > > > > > > > attempts.
> > > > > > > > >
> > > > > > > > > Boyang
> > > > > > > > >
> > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > > > > skaundinya@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Everyone,
> > > > > > > > > >
> > > > > > > > > > I’ve written a KIP about introducing exponential backoff
> > for
> > > > > Kafka
> > > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Sanjana
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Guozhang Wang <wa...@gmail.com>.
Hey Jason,

My understanding is a bit different here: even if a user has explicitly
overridden "retry.backoff.ms", the exponential mechanism still triggers and
the backoff would be increased up to "retry.backoff.max.ms"; and if the
specified "retry.backoff.ms" is already larger than
"retry.backoff.max.ms", we would still take "retry.backoff.max.ms".

So if the user does override "retry.backoff.ms" to a value larger than
1s and is not aware of the new config, she would be surprised to see the
specified value seemingly not being respected, but she could still learn
that afterwards by reading the release notes introducing this KIP anyway.
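
As an illustration, a user who tunes the backoff today would set both configs
once the KIP lands; the property values below are made-up examples, and
"retry.backoff.max.ms" is the new config proposed by this KIP:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class BackoffConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Existing config: initial retry backoff, here overridden above the proposed 1000 ms default cap.
        props.put("retry.backoff.ms", "2000");
        // Proposed config: upper bound that the exponential backoff grows to.
        props.put("retry.backoff.max.ms", "10000");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Produce records as usual; retries of failed requests back off exponentially
            // between retry.backoff.ms and the effective retry.backoff.max.ms.
        }
    }
}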


Guozhang

On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Sanjana,
>
> The KIP looks good to me. I had just one question about the default
> behavior. As I understand, if the user has specified `retry.backoff.ms`
> explicitly, then we will not apply the default max backoff. As such,
> there's no way to get the benefit of this feature if you are providing a `
> retry.backoff.ms` unless you also provide `retry.backoff.max.ms`. That
> makes sense if you assume the user is unaware of the new configuration, but
> it is surprising otherwise. Since it's not a semantic change and since the
> default you're proposing of 1s is fairly low already, I wonder if it's good
> enough to mention the new configuration in the release notes and not add
> any special logic. What do you think?
>
> -Jason
>
> On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <sk...@gmail.com>
> wrote:
>
> > Thank you for the comments Guozhang.
> >
> > I’ll leave this KIP out for discussion till the end of the week and then
> > start a vote for this early next week.
> >
> > Sanjana
> >
> > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wa...@gmail.com>,
> wrote:
> > > Hello Sanjana,
> > >
> > > Thanks for the proposed KIP, I think that makes a lot of sense -- as
> you
> > > mentioned in the motivation, we've indeed seen many issues with regard
> to
> > > the frequent retries, with bounded exponential backoff in the scenario
> > > where there's a long connectivity issue we would effectively reduce the
> > > request load by 10 given the default configs.
> > >
> > > For higher-level Streams client and Connect frameworks, today we also
> > have
> > > a retry logic but that's used in a slightly different way. For example
> in
> > > Streams, we tend to handle the retry logic at the thread-level and
> hence
> > > very likely we'd like to change that mechanism in KIP-572 anyways. For
> > > producer / consumer / admin clients, I think just applying this
> > behavioral
> > > change across these clients makes lot of sense. So I think can just
> leave
> > > the Streams / Connect out of the scope of this KIP to be addressed in
> > > separate discussions.
> > >
> > > I do not have further comments about this KIP :) LGTM.
> > >
> > > Guozhang
> > >
> > >
> > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <
> skaundinya@gmail.com
> > >
> > > wrote:
> > >
> > > > Thanks for the feedback Boyang.
> > > >
> > > > If there’s anyone else who has feedback regarding this KIP, would
> > really
> > > > appreciate it hearing it!
> > > >
> > > > Thanks,
> > > > Sanjana
> > > >
> > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <bc...@outlook.com>
> > wrote:
> > > >
> > > > > Sounds great!
> > > > >
> > > > > Get Outlook for iOS<https://aka.ms/o0ukef>
> > > > > ________________________________
> > > > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka
> Clients
> > > > >
> > > > > Thanks for the explanation Boyang. One of the most common problems
> > that
> > > > we
> > > > > have in Kafka is with respect to metadata fetches. For example, if
> > there
> > > > is
> > > > > a broker failure, all clients start to fetch metadata at the same
> > time
> > > > and
> > > > > it often takes a while for the metadata to converge. In a high load
> > > > > cluster, there are also issues where the volume of metadata has
> made
> > > > > convergence of metadata slower.
> > > > >
> > > > > For this case, exponential backoff helps as it reduces the retry
> > rate and
> > > > > spaces out how often clients will retry, thereby bringing down the
> > time
> > > > for
> > > > > convergence. Something that Jason mentioned that would be a great
> > > > addition
> > > > > here would be if the backoff should be “jittered” as it was in
> > KIP-144
> > > > with
> > > > > respect to exponential reconnect backoff. This would help prevent
> the
> > > > > clients from being synchronized on when they retry, thereby spacing
> > out
> > > > the
> > > > > number of requests being sent to the broker at the same time.
> > > > >
> > > > > I’ll add this example to the KIP and flush out more of the details
> -
> > so
> > > > > it’s more clear.
> > > > >
> > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> > reluctanthero104@gmail.com
> > > > > ,
> > > > > wrote:
> > > > > > Thanks for the reply Sanjana. I guess I would like to rephrase my
> > > > > question
> > > > > > 2 and 3 as my previous response is a little bit unactionable.
> > > > > >
> > > > > > My specific point is that exponential backoff is not a silver
> > bullet
> > > > and
> > > > > we
> > > > > > should consider using it to solve known problems, instead of
> > making the
> > > > > > holistic changes to all clients in Kafka ecosystem. I do like the
> > > > > > exponential backoff idea and believe this would be of great
> value,
> > but
> > > > > > maybe we should focus on proposing some existing modules that are
> > > > > suffering
> > > > > > from static retry, and only change them in this first KIP. If in
> > the
> > > > > > future, some other component users believe they are also
> > suffering, we
> > > > > > could get more minor KIPs to change the behavior as well.
> > > > > >
> > > > > > Boyang
> > > > > >
> > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > > skaundinya@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the feedback Boyang, I will revise the KIP with the
> > > > > > > mathematical relations as per your suggestion. To address your
> > > > > feedback:
> > > > > > >
> > > > > > > 1. Currently, with the default of 100 ms per retry backoff, in
> 1
> > > > second
> > > > > > > we would have 10 retries. In the case of using an exponential
> > > > backoff,
> > > > > we
> > > > > > > would have a total of 4 retries in 1 second. Thus we have less
> > than
> > > > > half of
> > > > > > > the amount of retries in the same timeframe and can lessen
> broker
> > > > > pressure.
> > > > > > > This calculation is done as following (using the formula laid
> > out in
> > > > > the
> > > > > > > KIP:
> > > > > > >
> > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default
> > retry
> > > > > ms
> > > > > > > is initially 100 ms)
> > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms
> > (default
> > > > max
> > > > > > > retry ms is 1000 ms)
> > > > > > >
> > > > > > > For 2 and 3, could you elaborate more about what you mean with
> > > > respect
> > > > > to
> > > > > > > client timeouts? I’m not very familiar with the Streams
> > framework, so
> > > > > would
> > > > > > > love to get more insight to how that currently works, with
> > respect to
> > > > > > > producer transactions, so I can appropriately update the KIP to
> > > > address
> > > > > > > these scenarios.
> > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > > reluctanthero104@gmail.com>,
> > > > > > > wrote:
> > > > > > > > Thanks for the KIP Sanjana. I think the motivation is good,
> but
> > > > lack
> > > > > of
> > > > > > > > more quantitative analysis. For instance:
> > > > > > > >
> > > > > > > > 1. How much retries we are saving by applying the exponential
> > retry
> > > > > vs
> > > > > > > > static retry? There should be some mathematical relations
> > between
> > > > the
> > > > > > > > static retry ms, the initial exponential retry ms, the max
> > > > > exponential
> > > > > > > > retry ms in a given time interval.
> > > > > > > > 2. How does this affect the client timeout? With exponential
> > retry,
> > > > > the
> > > > > > > > client shall be getting easier to timeout on a parent level
> > caller,
> > > > > for
> > > > > > > > instance stream attempts to retry initializing producer
> > > > transactions
> > > > > with
> > > > > > > > given 5 minute interval. With exponential retry this
> mechanism
> > > > could
> > > > > > > > experience more frequent timeout which we should be careful
> > with.
> > > > > > > > 3. With regards to #2, we should have more detailed checklist
> > of
> > > > all
> > > > > the
> > > > > > > > existing static retry scenarios, and adjust the initial
> > exponential
> > > > > retry
> > > > > > > > ms to make sure we won't get easily timeout in high level due
> > to
> > > > too
> > > > > few
> > > > > > > > attempts.
> > > > > > > >
> > > > > > > > Boyang
> > > > > > > >
> > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > > > skaundinya@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Everyone,
> > > > > > > > >
> > > > > > > > > I’ve written a KIP about introducing exponential backoff
> for
> > > > Kafka
> > > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Sanjana
> > > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Sanjana,

The KIP looks good to me. I had just one question about the default
behavior. As I understand, if the user has specified `retry.backoff.ms`
explicitly, then we will not apply the default max backoff. As such,
there's no way to get the benefit of this feature if you are providing a `
retry.backoff.ms` unless you also provide `retry.backoff.max.ms`. That
makes sense if you assume the user is unaware of the new configuration, but
it is surprising otherwise. Since it's not a semantic change and since the
default you're proposing of 1s is fairly low already, I wonder if it's good
enough to mention the new configuration in the release notes and not add
any special logic. What do you think?

-Jason

On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <sk...@gmail.com>
wrote:

> Thank you for the comments Guozhang.
>
> I’ll leave this KIP out for discussion till the end of the week and then
> start a vote for this early next week.
>
> Sanjana
>
> On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wa...@gmail.com>, wrote:
> > Hello Sanjana,
> >
> > Thanks for the proposed KIP, I think that makes a lot of sense -- as you
> > mentioned in the motivation, we've indeed seen many issues with regard to
> > the frequent retries, with bounded exponential backoff in the scenario
> > where there's a long connectivity issue we would effectively reduce the
> > request load by 10 given the default configs.
> >
> > For higher-level Streams client and Connect frameworks, today we also
> have
> > a retry logic but that's used in a slightly different way. For example in
> > Streams, we tend to handle the retry logic at the thread-level and hence
> > very likely we'd like to change that mechanism in KIP-572 anyways. For
> > producer / consumer / admin clients, I think just applying this
> behavioral
> > change across these clients makes lot of sense. So I think can just leave
> > the Streams / Connect out of the scope of this KIP to be addressed in
> > separate discussions.
> >
> > I do not have further comments about this KIP :) LGTM.
> >
> > Guozhang
> >
> >
> > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <skaundinya@gmail.com
> >
> > wrote:
> >
> > > Thanks for the feedback Boyang.
> > >
> > > If there’s anyone else who has feedback regarding this KIP, would
> really
> > > appreciate it hearing it!
> > >
> > > Thanks,
> > > Sanjana
> > >
> > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <bc...@outlook.com>
> wrote:
> > >
> > > > Sounds great!
> > > >
> > > > Get Outlook for iOS<https://aka.ms/o0ukef>
> > > > ________________________________
> > > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients
> > > >
> > > > Thanks for the explanation Boyang. One of the most common problems
> that
> > > we
> > > > have in Kafka is with respect to metadata fetches. For example, if
> there
> > > is
> > > > a broker failure, all clients start to fetch metadata at the same
> time
> > > and
> > > > it often takes a while for the metadata to converge. In a high load
> > > > cluster, there are also issues where the volume of metadata has made
> > > > convergence of metadata slower.
> > > >
> > > > For this case, exponential backoff helps as it reduces the retry
> rate and
> > > > spaces out how often clients will retry, thereby bringing down the
> time
> > > for
> > > > convergence. Something that Jason mentioned that would be a great
> > > addition
> > > > here would be if the backoff should be “jittered” as it was in
> KIP-144
> > > with
> > > > respect to exponential reconnect backoff. This would help prevent the
> > > > clients from being synchronized on when they retry, thereby spacing
> out
> > > the
> > > > number of requests being sent to the broker at the same time.
> > > >
> > > > I’ll add this example to the KIP and flush out more of the details -
> so
> > > > it’s more clear.
> > > >
> > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <
> reluctanthero104@gmail.com
> > > > ,
> > > > wrote:
> > > > > Thanks for the reply Sanjana. I guess I would like to rephrase my
> > > > question
> > > > > 2 and 3 as my previous response is a little bit unactionable.
> > > > >
> > > > > My specific point is that exponential backoff is not a silver
> bullet
> > > and
> > > > we
> > > > > should consider using it to solve known problems, instead of
> making the
> > > > > holistic changes to all clients in Kafka ecosystem. I do like the
> > > > > exponential backoff idea and believe this would be of great value,
> but
> > > > > maybe we should focus on proposing some existing modules that are
> > > > suffering
> > > > > from static retry, and only change them in this first KIP. If in
> the
> > > > > future, some other component users believe they are also
> suffering, we
> > > > > could get more minor KIPs to change the behavior as well.
> > > > >
> > > > > Boyang
> > > > >
> > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > > skaundinya@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Thanks for the feedback Boyang, I will revise the KIP with the
> > > > > > mathematical relations as per your suggestion. To address your
> > > > feedback:
> > > > > >
> > > > > > 1. Currently, with the default of 100 ms per retry backoff, in 1
> > > second
> > > > > > we would have 10 retries. In the case of using an exponential
> > > backoff,
> > > > we
> > > > > > would have a total of 4 retries in 1 second. Thus we have less
> than
> > > > half of
> > > > > > the amount of retries in the same timeframe and can lessen broker
> > > > pressure.
> > > > > > This calculation is done as following (using the formula laid
> out in
> > > > the
> > > > > > KIP:
> > > > > >
> > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default
> retry
> > > > ms
> > > > > > is initially 100 ms)
> > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms
> (default
> > > max
> > > > > > retry ms is 1000 ms)
> > > > > >
> > > > > > For 2 and 3, could you elaborate more about what you mean with
> > > respect
> > > > to
> > > > > > client timeouts? I’m not very familiar with the Streams
> framework, so
> > > > would
> > > > > > love to get more insight to how that currently works, with
> respect to
> > > > > > producer transactions, so I can appropriately update the KIP to
> > > address
> > > > > > these scenarios.
> > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > > reluctanthero104@gmail.com>,
> > > > > > wrote:
> > > > > > > Thanks for the KIP Sanjana. I think the motivation is good, but
> > > lack
> > > > of
> > > > > > > more quantitative analysis. For instance:
> > > > > > >
> > > > > > > 1. How much retries we are saving by applying the exponential
> retry
> > > > vs
> > > > > > > static retry? There should be some mathematical relations
> between
> > > the
> > > > > > > static retry ms, the initial exponential retry ms, the max
> > > > exponential
> > > > > > > retry ms in a given time interval.
> > > > > > > 2. How does this affect the client timeout? With exponential
> retry,
> > > > the
> > > > > > > client shall be getting easier to timeout on a parent level
> caller,
> > > > for
> > > > > > > instance stream attempts to retry initializing producer
> > > transactions
> > > > with
> > > > > > > given 5 minute interval. With exponential retry this mechanism
> > > could
> > > > > > > experience more frequent timeout which we should be careful
> with.
> > > > > > > 3. With regards to #2, we should have more detailed checklist
> of
> > > all
> > > > the
> > > > > > > existing static retry scenarios, and adjust the initial
> exponential
> > > > retry
> > > > > > > ms to make sure we won't get easily timeout in high level due
> to
> > > too
> > > > few
> > > > > > > attempts.
> > > > > > >
> > > > > > > Boyang
> > > > > > >
> > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > > skaundinya@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Everyone,
> > > > > > > >
> > > > > > > > I’ve written a KIP about introducing exponential backoff for
> > > Kafka
> > > > > > > > clients. Would appreciate any feedback on this.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Sanjana
> > > > > > > >
> > > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
Thank you for the comments Guozhang.

I’ll leave this KIP open for discussion till the end of the week and then start a vote for it early next week.

Sanjana

On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wa...@gmail.com>, wrote:
> Hello Sanjana,
>
> Thanks for the proposed KIP, I think that makes a lot of sense -- as you
> mentioned in the motivation, we've indeed seen many issues with regard to
> the frequent retries, with bounded exponential backoff in the scenario
> where there's a long connectivity issue we would effectively reduce the
> request load by 10 given the default configs.
>
> For higher-level Streams client and Connect frameworks, today we also have
> a retry logic but that's used in a slightly different way. For example in
> Streams, we tend to handle the retry logic at the thread-level and hence
> very likely we'd like to change that mechanism in KIP-572 anyways. For
> producer / consumer / admin clients, I think just applying this behavioral
> change across these clients makes lot of sense. So I think can just leave
> the Streams / Connect out of the scope of this KIP to be addressed in
> separate discussions.
>
> I do not have further comments about this KIP :) LGTM.
>
> Guozhang
>
>
> On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <sk...@gmail.com>
> wrote:
>
> > Thanks for the feedback Boyang.
> >
> > If there’s anyone else who has feedback regarding this KIP, would really
> > appreciate it hearing it!
> >
> > Thanks,
> > Sanjana
> >
> > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <bc...@outlook.com> wrote:
> >
> > > Sounds great!
> > >
> > > Get Outlook for iOS<https://aka.ms/o0ukef>
> > > ________________________________
> > > From: Sanjana Kaundinya <sk...@gmail.com>
> > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients
> > >
> > > Thanks for the explanation Boyang. One of the most common problems that
> > we
> > > have in Kafka is with respect to metadata fetches. For example, if there
> > is
> > > a broker failure, all clients start to fetch metadata at the same time
> > and
> > > it often takes a while for the metadata to converge. In a high load
> > > cluster, there are also issues where the volume of metadata has made
> > > convergence of metadata slower.
> > >
> > > For this case, exponential backoff helps as it reduces the retry rate and
> > > spaces out how often clients will retry, thereby bringing down the time
> > for
> > > convergence. Something that Jason mentioned that would be a great
> > addition
> > > here would be if the backoff should be “jittered” as it was in KIP-144
> > with
> > > respect to exponential reconnect backoff. This would help prevent the
> > > clients from being synchronized on when they retry, thereby spacing out
> > the
> > > number of requests being sent to the broker at the same time.
> > >
> > > I’ll add this example to the KIP and flush out more of the details - so
> > > it’s more clear.
> > >
> > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <reluctanthero104@gmail.com
> > > ,
> > > wrote:
> > > > Thanks for the reply Sanjana. I guess I would like to rephrase my
> > > question
> > > > 2 and 3 as my previous response is a little bit unactionable.
> > > >
> > > > My specific point is that exponential backoff is not a silver bullet
> > and
> > > we
> > > > should consider using it to solve known problems, instead of making the
> > > > holistic changes to all clients in Kafka ecosystem. I do like the
> > > > exponential backoff idea and believe this would be of great value, but
> > > > maybe we should focus on proposing some existing modules that are
> > > suffering
> > > > from static retry, and only change them in this first KIP. If in the
> > > > future, some other component users believe they are also suffering, we
> > > > could get more minor KIPs to change the behavior as well.
> > > >
> > > > Boyang
> > > >
> > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> > skaundinya@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Thanks for the feedback Boyang, I will revise the KIP with the
> > > > > mathematical relations as per your suggestion. To address your
> > > feedback:
> > > > >
> > > > > 1. Currently, with the default of 100 ms per retry backoff, in 1
> > second
> > > > > we would have 10 retries. In the case of using an exponential
> > backoff,
> > > we
> > > > > would have a total of 4 retries in 1 second. Thus we have less than
> > > half of
> > > > > the amount of retries in the same timeframe and can lessen broker
> > > pressure.
> > > > > This calculation is done as following (using the formula laid out in
> > > the
> > > > > KIP:
> > > > >
> > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry
> > > ms
> > > > > is initially 100 ms)
> > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default
> > max
> > > > > retry ms is 1000 ms)
> > > > >
> > > > > For 2 and 3, could you elaborate more about what you mean with
> > respect
> > > to
> > > > > client timeouts? I’m not very familiar with the Streams framework, so
> > > would
> > > > > love to get more insight to how that currently works, with respect to
> > > > > producer transactions, so I can appropriately update the KIP to
> > address
> > > > > these scenarios.
> > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > > reluctanthero104@gmail.com>,
> > > > > wrote:
> > > > > > Thanks for the KIP Sanjana. I think the motivation is good, but
> > lack
> > > of
> > > > > > more quantitative analysis. For instance:
> > > > > >
> > > > > > 1. How much retries we are saving by applying the exponential retry
> > > vs
> > > > > > static retry? There should be some mathematical relations between
> > the
> > > > > > static retry ms, the initial exponential retry ms, the max
> > > exponential
> > > > > > retry ms in a given time interval.
> > > > > > 2. How does this affect the client timeout? With exponential retry,
> > > the
> > > > > > client shall be getting easier to timeout on a parent level caller,
> > > for
> > > > > > instance stream attempts to retry initializing producer
> > transactions
> > > with
> > > > > > given 5 minute interval. With exponential retry this mechanism
> > could
> > > > > > experience more frequent timeout which we should be careful with.
> > > > > > 3. With regards to #2, we should have more detailed checklist of
> > all
> > > the
> > > > > > existing static retry scenarios, and adjust the initial exponential
> > > retry
> > > > > > ms to make sure we won't get easily timeout in high level due to
> > too
> > > few
> > > > > > attempts.
> > > > > >
> > > > > > Boyang
> > > > > >
> > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > > skaundinya@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Everyone,
> > > > > > >
> > > > > > > I’ve written a KIP about introducing exponential backoff for
> > Kafka
> > > > > > > clients. Would appreciate any feedback on this.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sanjana
> > > > > > >
> > > > >
> > >
> >
>
>
> --
> -- Guozhang

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Sanjana,

Thanks for the proposed KIP, I think that makes a lot of sense -- as you
mentioned in the motivation, we've indeed seen many issues with regard to
frequent retries. With bounded exponential backoff, in the scenario
where there's a long connectivity issue, we would effectively reduce the
request load by a factor of 10 given the default configs.
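
As a rough, illustrative check of that factor-of-10 estimate (the window length
and loop below are a back-of-the-envelope sketch, not anything from the KIP):

public class RetryLoadComparison {
    public static void main(String[] args) {
        long windowMs = 60_000; // a 60 s connectivity outage

        // Today: static backoff of 100 ms between retries.
        long staticRetries = windowMs / 100;

        // Proposed: exponential backoff 100, 200, 400, 800 ms, then capped at 1000 ms.
        long exponentialRetries = 0;
        long elapsedMs = 0;
        long backoffMs = 100;
        while (elapsedMs + backoffMs <= windowMs) {
            elapsedMs += backoffMs;
            exponentialRetries++;
            backoffMs = Math.min(1000, backoffMs * 2);
        }

        // Prints roughly "static: 600, exponential: 62", i.e. close to a 10x reduction.
        System.out.printf("static: %d retries, exponential: %d retries%n",
                staticRetries, exponentialRetries);
    }
}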

For the higher-level Streams client and Connect frameworks, today we also
have retry logic, but it is used in a slightly different way. For example, in
Streams we tend to handle the retry logic at the thread level, and hence
very likely we'd like to change that mechanism in KIP-572 anyway. For the
producer / consumer / admin clients, I think just applying this behavioral
change across these clients makes a lot of sense. So I think we can just
leave Streams / Connect out of the scope of this KIP, to be addressed in
separate discussions.

I do not have further comments about this KIP :) LGTM.

Guozhang


On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <sk...@gmail.com>
wrote:

> Thanks for the feedback Boyang.
>
> If there’s anyone else who has feedback regarding this KIP, would really
> appreciate it hearing it!
>
> Thanks,
> Sanjana
>
> On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <bc...@outlook.com> wrote:
>
> > Sounds great!
> >
> > Get Outlook for iOS<https://aka.ms/o0ukef>
> > ________________________________
> > From: Sanjana Kaundinya <sk...@gmail.com>
> > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > To: dev@kafka.apache.org <de...@kafka.apache.org>
> > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients
> >
> > Thanks for the explanation Boyang. One of the most common problems that
> we
> > have in Kafka is with respect to metadata fetches. For example, if there
> is
> > a broker failure, all clients start to fetch metadata at the same time
> and
> > it often takes a while for the metadata to converge. In a high load
> > cluster, there are also issues where the volume of metadata has made
> > convergence of metadata slower.
> >
> > For this case, exponential backoff helps as it reduces the retry rate and
> > spaces out how often clients will retry, thereby bringing down the time
> for
> > convergence. Something that Jason mentioned that would be a great
> addition
> > here would be if the backoff should be “jittered” as it was in KIP-144
> with
> > respect to exponential reconnect backoff. This would help prevent the
> > clients from being synchronized on when they retry, thereby spacing out
> the
> > number of requests being sent to the broker at the same time.
> >
> > I’ll add this example to the KIP and flush out more of the details - so
> > it’s more clear.
> >
> > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <reluctanthero104@gmail.com
> >,
> > wrote:
> > > Thanks for the reply Sanjana. I guess I would like to rephrase my
> > question
> > > 2 and 3 as my previous response is a little bit unactionable.
> > >
> > > My specific point is that exponential backoff is not a silver bullet
> and
> > we
> > > should consider using it to solve known problems, instead of making the
> > > holistic changes to all clients in Kafka ecosystem. I do like the
> > > exponential backoff idea and believe this would be of great value, but
> > > maybe we should focus on proposing some existing modules that are
> > suffering
> > > from static retry, and only change them in this first KIP. If in the
> > > future, some other component users believe they are also suffering, we
> > > could get more minor KIPs to change the behavior as well.
> > >
> > > Boyang
> > >
> > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <
> skaundinya@gmail.com
> > >
> > > wrote:
> > >
> > > > Thanks for the feedback Boyang, I will revise the KIP with the
> > > > mathematical relations as per your suggestion. To address your
> > feedback:
> > > >
> > > > 1. Currently, with the default of 100 ms per retry backoff, in 1
> second
> > > > we would have 10 retries. In the case of using an exponential
> backoff,
> > we
> > > > would have a total of 4 retries in 1 second. Thus we have less than
> > half of
> > > > the amount of retries in the same timeframe and can lessen broker
> > pressure.
> > > > This calculation is done as following (using the formula laid out in
> > the
> > > > KIP:
> > > >
> > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry
> > ms
> > > > is initially 100 ms)
> > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default
> max
> > > > retry ms is 1000 ms)
> > > >
> > > > For 2 and 3, could you elaborate more about what you mean with
> respect
> > to
> > > > client timeouts? I’m not very familiar with the Streams framework, so
> > would
> > > > love to get more insight to how that currently works, with respect to
> > > > producer transactions, so I can appropriately update the KIP to
> address
> > > > these scenarios.
> > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> > reluctanthero104@gmail.com>,
> > > > wrote:
> > > > > Thanks for the KIP Sanjana. I think the motivation is good, but
> lack
> > of
> > > > > more quantitative analysis. For instance:
> > > > >
> > > > > 1. How much retries we are saving by applying the exponential retry
> > vs
> > > > > static retry? There should be some mathematical relations between
> the
> > > > > static retry ms, the initial exponential retry ms, the max
> > exponential
> > > > > retry ms in a given time interval.
> > > > > 2. How does this affect the client timeout? With exponential retry,
> > the
> > > > > client shall be getting easier to timeout on a parent level caller,
> > for
> > > > > instance stream attempts to retry initializing producer
> transactions
> > with
> > > > > given 5 minute interval. With exponential retry this mechanism
> could
> > > > > experience more frequent timeout which we should be careful with.
> > > > > 3. With regards to #2, we should have more detailed checklist of
> all
> > the
> > > > > existing static retry scenarios, and adjust the initial exponential
> > retry
> > > > > ms to make sure we won't get easily timeout in high level due to
> too
> > few
> > > > > attempts.
> > > > >
> > > > > Boyang
> > > > >
> > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> > skaundinya@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > I’ve written a KIP about introducing exponential backoff for
> Kafka
> > > > > > clients. Would appreciate any feedback on this.
> > > > > >
> > > > > >
> > > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > >
> > > > > > Thanks,
> > > > > > Sanjana
> > > > > >
> > > >
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
Thanks for the feedback Boyang.

If there’s anyone else who has feedback regarding this KIP, I would really
appreciate hearing it!

Thanks,
Sanjana

On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <bc...@outlook.com> wrote:

> Sounds great!
>
> Get Outlook for iOS<https://aka.ms/o0ukef>
> ________________________________
> From: Sanjana Kaundinya <sk...@gmail.com>
> Sent: Tuesday, March 17, 2020 5:54:35 PM
> To: dev@kafka.apache.org <de...@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients
>
> Thanks for the explanation Boyang. One of the most common problems that we
> have in Kafka is with respect to metadata fetches. For example, if there is
> a broker failure, all clients start to fetch metadata at the same time and
> it often takes a while for the metadata to converge. In a high load
> cluster, there are also issues where the volume of metadata has made
> convergence of metadata slower.
>
> For this case, exponential backoff helps as it reduces the retry rate and
> spaces out how often clients will retry, thereby bringing down the time for
> convergence. Something that Jason mentioned that would be a great addition
> here would be if the backoff should be “jittered” as it was in KIP-144 with
> respect to exponential reconnect backoff. This would help prevent the
> clients from being synchronized on when they retry, thereby spacing out the
> number of requests being sent to the broker at the same time.
>
> I’ll add this example to the KIP and flush out more of the details - so
> it’s more clear.
>
> On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <re...@gmail.com>,
> wrote:
> > Thanks for the reply Sanjana. I guess I would like to rephrase my
> question
> > 2 and 3 as my previous response is a little bit unactionable.
> >
> > My specific point is that exponential backoff is not a silver bullet and
> we
> > should consider using it to solve known problems, instead of making the
> > holistic changes to all clients in Kafka ecosystem. I do like the
> > exponential backoff idea and believe this would be of great value, but
> > maybe we should focus on proposing some existing modules that are
> suffering
> > from static retry, and only change them in this first KIP. If in the
> > future, some other component users believe they are also suffering, we
> > could get more minor KIPs to change the behavior as well.
> >
> > Boyang
> >
> > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <skaundinya@gmail.com
> >
> > wrote:
> >
> > > Thanks for the feedback Boyang, I will revise the KIP with the
> > > mathematical relations as per your suggestion. To address your
> feedback:
> > >
> > > 1. Currently, with the default of 100 ms per retry backoff, in 1 second
> > > we would have 10 retries. In the case of using an exponential backoff,
> we
> > > would have a total of 4 retries in 1 second. Thus we have less than
> half of
> > > the amount of retries in the same timeframe and can lessen broker
> pressure.
> > > This calculation is done as following (using the formula laid out in
> the
> > > KIP:
> > >
> > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry
> ms
> > > is initially 100 ms)
> > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max
> > > retry ms is 1000 ms)
> > >
> > > For 2 and 3, could you elaborate more about what you mean with respect
> to
> > > client timeouts? I’m not very familiar with the Streams framework, so
> would
> > > love to get more insight to how that currently works, with respect to
> > > producer transactions, so I can appropriately update the KIP to address
> > > these scenarios.
> > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <
> reluctanthero104@gmail.com>,
> > > wrote:
> > > > Thanks for the KIP Sanjana. I think the motivation is good, but lack
> of
> > > > more quantitative analysis. For instance:
> > > >
> > > > 1. How much retries we are saving by applying the exponential retry
> vs
> > > > static retry? There should be some mathematical relations between the
> > > > static retry ms, the initial exponential retry ms, the max
> exponential
> > > > retry ms in a given time interval.
> > > > 2. How does this affect the client timeout? With exponential retry,
> the
> > > > client shall be getting easier to timeout on a parent level caller,
> for
> > > > instance stream attempts to retry initializing producer transactions
> with
> > > > given 5 minute interval. With exponential retry this mechanism could
> > > > experience more frequent timeout which we should be careful with.
> > > > 3. With regards to #2, we should have more detailed checklist of all
> the
> > > > existing static retry scenarios, and adjust the initial exponential
> retry
> > > > ms to make sure we won't get easily timeout in high level due to too
> few
> > > > attempts.
> > > >
> > > > Boyang
> > > >
> > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <
> skaundinya@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Everyone,
> > > > >
> > > > > I’ve written a KIP about introducing exponential backoff for Kafka
> > > > > clients. Would appreciate any feedback on this.
> > > > >
> > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > >
> > > > > Thanks,
> > > > > Sanjana
> > > > >
> > >
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Boyang Chen <bc...@outlook.com>.
Sounds great!

Get Outlook for iOS<https://aka.ms/o0ukef>
________________________________
From: Sanjana Kaundinya <sk...@gmail.com>
Sent: Tuesday, March 17, 2020 5:54:35 PM
To: dev@kafka.apache.org <de...@kafka.apache.org>
Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Thanks for the explanation Boyang. One of the most common problems that we have in Kafka is with respect to metadata fetches. For example, if there is a broker failure, all clients start to fetch metadata at the same time and it often takes a while for the metadata to converge. In a high load cluster, there are also issues where the volume of metadata has made convergence of metadata slower.

For this case, exponential backoff helps as it reduces the retry rate and spaces out how often clients will retry, thereby bringing down the time for convergence. Something that Jason mentioned that would be a great addition here would be if the backoff should be “jittered” as it was in KIP-144 with respect to exponential reconnect backoff. This would help prevent the clients from being synchronized on when they retry, thereby spacing out the number of requests being sent to the broker at the same time.

I’ll add this example to the KIP and flush out more of the details - so it’s more clear.

On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <re...@gmail.com>, wrote:
> Thanks for the reply Sanjana. I guess I would like to rephrase my question
> 2 and 3 as my previous response is a little bit unactionable.
>
> My specific point is that exponential backoff is not a silver bullet and we
> should consider using it to solve known problems, instead of making the
> holistic changes to all clients in Kafka ecosystem. I do like the
> exponential backoff idea and believe this would be of great value, but
> maybe we should focus on proposing some existing modules that are suffering
> from static retry, and only change them in this first KIP. If in the
> future, some other component users believe they are also suffering, we
> could get more minor KIPs to change the behavior as well.
>
> Boyang
>
> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <sk...@gmail.com>
> wrote:
>
> > Thanks for the feedback Boyang, I will revise the KIP with the
> > mathematical relations as per your suggestion. To address your feedback:
> >
> > 1. Currently, with the default of 100 ms per retry backoff, in 1 second
> > we would have 10 retries. In the case of using an exponential backoff, we
> > would have a total of 4 retries in 1 second. Thus we have less than half of
> > the amount of retries in the same timeframe and can lessen broker pressure.
> > This calculation is done as following (using the formula laid out in the
> > KIP:
> >
> > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry ms
> > is initially 100 ms)
> > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max
> > retry ms is 1000 ms)
> >
> > For 2 and 3, could you elaborate more about what you mean with respect to
> > client timeouts? I’m not very familiar with the Streams framework, so would
> > love to get more insight to how that currently works, with respect to
> > producer transactions, so I can appropriately update the KIP to address
> > these scenarios.
> > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <re...@gmail.com>,
> > wrote:
> > > Thanks for the KIP Sanjana. I think the motivation is good, but lack of
> > > more quantitative analysis. For instance:
> > >
> > > 1. How much retries we are saving by applying the exponential retry vs
> > > static retry? There should be some mathematical relations between the
> > > static retry ms, the initial exponential retry ms, the max exponential
> > > retry ms in a given time interval.
> > > 2. How does this affect the client timeout? With exponential retry, the
> > > client shall be getting easier to timeout on a parent level caller, for
> > > instance stream attempts to retry initializing producer transactions with
> > > given 5 minute interval. With exponential retry this mechanism could
> > > experience more frequent timeout which we should be careful with.
> > > 3. With regards to #2, we should have more detailed checklist of all the
> > > existing static retry scenarios, and adjust the initial exponential retry
> > > ms to make sure we won't get easily timeout in high level due to too few
> > > attempts.
> > >
> > > Boyang
> > >
> > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <sk...@gmail.com>
> > > wrote:
> > >
> > > > Hi Everyone,
> > > >
> > > > I’ve written a KIP about introducing exponential backoff for Kafka
> > > > clients. Would appreciate any feedback on this.
> > > >
> > > >
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > >
> > > > Thanks,
> > > > Sanjana
> > > >
> >

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
Thanks for the explanation Boyang. One of the most common problems we have in Kafka is with respect to metadata fetches. For example, if there is a broker failure, all clients start to fetch metadata at the same time and it often takes a while for the metadata to converge. In a high-load cluster, there are also issues where the sheer volume of metadata has made convergence of metadata slower.

For this case, exponential backoff helps as it reduces the retry rate and spaces out how often clients retry, thereby bringing down the time for convergence. Something Jason mentioned that would be a great addition here is that the backoff should be “jittered”, as it was in KIP-144 for the exponential reconnect backoff. This would help prevent the clients from being synchronized on when they retry, thereby spacing out the number of requests being sent to the broker at the same time.

I’ll add this example to the KIP and flesh out more of the details - so it’s clearer.
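
To sketch what a jittered exponential backoff could look like (the 20% jitter
factor and the class below are illustrative assumptions, not the KIP's final
implementation):

import java.util.concurrent.ThreadLocalRandom;

public final class JitteredBackoff {
    private final long initialMs; // retry.backoff.ms, e.g. 100
    private final long maxMs;     // retry.backoff.max.ms, e.g. 1000

    public JitteredBackoff(long initialMs, long maxMs) {
        this.initialMs = initialMs;
        this.maxMs = maxMs;
    }

    // Doubles per failure, caps at maxMs, then randomizes by +/-20% so that
    // clients which failed at the same moment do not all retry in lockstep.
    public long nextBackoffMs(int failures) {
        double exp = Math.min(maxMs, initialMs * Math.pow(2, failures));
        double jitter = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
        return (long) (exp * jitter);
    }

    public static void main(String[] args) {
        JitteredBackoff backoff = new JitteredBackoff(100, 1000);
        for (int failures = 0; failures < 6; failures++)
            System.out.printf("failures=%d -> ~%d ms%n", failures, backoff.nextBackoffMs(failures));
    }
}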

On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <re...@gmail.com>, wrote:
> Thanks for the reply Sanjana. I guess I would like to rephrase my question
> 2 and 3 as my previous response is a little bit unactionable.
>
> My specific point is that exponential backoff is not a silver bullet and we
> should consider using it to solve known problems, instead of making the
> holistic changes to all clients in Kafka ecosystem. I do like the
> exponential backoff idea and believe this would be of great value, but
> maybe we should focus on proposing some existing modules that are suffering
> from static retry, and only change them in this first KIP. If in the
> future, some other component users believe they are also suffering, we
> could get more minor KIPs to change the behavior as well.
>
> Boyang
>
> On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <sk...@gmail.com>
> wrote:
>
> > Thanks for the feedback Boyang, I will revise the KIP with the
> > mathematical relations as per your suggestion. To address your feedback:
> >
> > 1. Currently, with the default of 100 ms per retry backoff, in 1 second
> > we would have 10 retries. In the case of using an exponential backoff, we
> > would have a total of 4 retries in 1 second. Thus we have less than half of
> > the amount of retries in the same timeframe and can lessen broker pressure.
> > This calculation is done as following (using the formula laid out in the
> > KIP:
> >
> > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry ms
> > is initially 100 ms)
> > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max
> > retry ms is 1000 ms)
> >
> > For 2 and 3, could you elaborate more on what you mean with respect to
> > client timeouts? I’m not very familiar with the Streams framework, so I
> > would love to get more insight into how that currently works with respect
> > to producer transactions, so I can appropriately update the KIP to address
> > these scenarios.
> > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <re...@gmail.com>,
> > wrote:
> > > Thanks for the KIP Sanjana. I think the motivation is good, but it lacks
> > > more quantitative analysis. For instance:
> > >
> > > 1. How many retries are we saving by applying exponential retry vs.
> > > static retry? There should be some mathematical relation between the
> > > static retry ms, the initial exponential retry ms, and the max
> > > exponential retry ms in a given time interval.
> > > 2. How does this affect client timeouts? With exponential retry, the
> > > client is more likely to time out at a parent-level caller; for
> > > instance, Streams attempts to retry initializing producer transactions
> > > within a given 5-minute interval. With exponential retry this mechanism
> > > could experience more frequent timeouts, which we should be careful with.
> > > 3. With regard to #2, we should have a more detailed checklist of all the
> > > existing static retry scenarios, and adjust the initial exponential retry
> > > ms to make sure we don't time out too easily at a higher level due to too
> > > few attempts.
> > >
> > > Boyang
> > >
> > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <sk...@gmail.com>
> > > wrote:
> > >
> > > > Hi Everyone,
> > > >
> > > > I’ve written a KIP about introducing exponential backoff for Kafka
> > > > clients. Would appreciate any feedback on this.
> > > >
> > > >
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > >
> > > > Thanks,
> > > > Sanjana
> > > >
> >

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Boyang Chen <re...@gmail.com>.
Thanks for the reply Sanjana. I guess I would like to rephrase my questions
2 and 3, as my previous response was a little bit unactionable.

My specific point is that exponential backoff is not a silver bullet, and we
should consider using it to solve known problems instead of making holistic
changes to every client in the Kafka ecosystem. I do like the exponential
backoff idea and believe it would be of great value, but maybe we should
focus on identifying some existing modules that are suffering from static
retry, and only change those in this first KIP. If, in the future, users of
other components believe they are also suffering, we could file smaller
follow-up KIPs to change the behavior there as well.

Boyang

On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <sk...@gmail.com>
wrote:

> Thanks for the feedback Boyang. I will revise the KIP with the
> mathematical relations as per your suggestion. To address your feedback:
>
> 1. Currently, with the default of 100 ms per retry backoff, in 1 second
> we would have 10 retries. In the case of using an exponential backoff, we
> would have a total of 4 retries in 1 second. Thus we have less than half
> the number of retries in the same timeframe and can lessen broker
> pressure. This calculation is done as follows (using the formula laid out
> in the KIP):
>
> Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry ms
> is initially 100 ms)
> Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max
> retry ms is 1000 ms)
>
> For 2 and 3, could you elaborate more on what you mean with respect to
> client timeouts? I’m not very familiar with the Streams framework, so I
> would love to get more insight into how that currently works with respect
> to producer transactions, so I can appropriately update the KIP to address
> these scenarios.
> On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <re...@gmail.com>,
> wrote:
> > Thanks for the KIP Sanjana. I think the motivation is good, but it lacks
> > more quantitative analysis. For instance:
> >
> > 1. How many retries are we saving by applying exponential retry vs.
> > static retry? There should be some mathematical relation between the
> > static retry ms, the initial exponential retry ms, and the max
> > exponential retry ms in a given time interval.
> > 2. How does this affect client timeouts? With exponential retry, the
> > client is more likely to time out at a parent-level caller; for
> > instance, Streams attempts to retry initializing producer transactions
> > within a given 5-minute interval. With exponential retry this mechanism
> > could experience more frequent timeouts, which we should be careful with.
> > 3. With regard to #2, we should have a more detailed checklist of all the
> > existing static retry scenarios, and adjust the initial exponential retry
> > ms to make sure we don't time out too easily at a higher level due to too
> > few attempts.
> >
> > Boyang
> >
> > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <sk...@gmail.com>
> > wrote:
> >
> > > Hi Everyone,
> > >
> > > I’ve written a KIP about introducing exponential backoff for Kafka
> > > clients. Would appreciate any feedback on this.
> > >
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > >
> > > Thanks,
> > > Sanjana
> > >
>

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Sanjana Kaundinya <sk...@gmail.com>.
Thanks for the feedback Boyang. I will revise the KIP with the mathematical relations as per your suggestion. To address your feedback:

1. Currently, with the default of 100 ms per retry backoff, in 1 second we would have 10 retries. In the case of using an exponential backoff, we would have a total of 4 retries in 1 second. Thus we have less than half the number of retries in the same timeframe and can lessen broker pressure. This calculation is done as follows (using the formula laid out in the KIP):

Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry ms is initially 100 ms)
Try 2 at time 100 ms, failures = 1, next retry in 200 ms
Try 3 at time 300 ms, failures = 2, next retry in 400 ms
Try 4 at time 700 ms, failures = 3, next retry in 800 ms
Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max retry ms is 1000 ms)
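
For reference, here is a small self-contained sketch that reproduces this schedule. It is illustrative only: it hard-codes the 100 ms initial and 1000 ms maximum backoff defaults mentioned above and leaves out any jitter.

// Illustrative sketch only: doubles the backoff after each failure and caps
// it at the maximum, reproducing the schedule listed above.
public class BackoffSchedule {
    public static void main(String[] args) {
        long retryBackoffMs = 100;      // assumed initial backoff
        long retryBackoffMaxMs = 1000;  // assumed maximum backoff
        long elapsedMs = 0;
        for (int failures = 0; failures < 5; failures++) {
            long nextRetryMs = Math.min(retryBackoffMs * (1L << failures), retryBackoffMaxMs);
            System.out.printf("Try %d at time %d ms, failures = %d, next retry in %d ms%n",
                    failures + 1, elapsedMs, failures, nextRetryMs);
            elapsedMs += nextRetryMs;
        }
        // Prints attempts at 0, 100, 300, 700 and 1500 ms: four attempts fall
        // in the first second, versus ten with a static 100 ms backoff.
    }
}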

For 2 and 3, could you elaborate more on what you mean with respect to client timeouts? I’m not very familiar with the Streams framework, so I would love to get more insight into how that currently works with respect to producer transactions, so I can appropriately update the KIP to address these scenarios.
On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <re...@gmail.com>, wrote:
> Thanks for the KIP Sanjana. I think the motivation is good, but it lacks
> more quantitative analysis. For instance:
>
> 1. How many retries are we saving by applying exponential retry vs.
> static retry? There should be some mathematical relation between the
> static retry ms, the initial exponential retry ms, and the max
> exponential retry ms in a given time interval.
> 2. How does this affect client timeouts? With exponential retry, the
> client is more likely to time out at a parent-level caller; for
> instance, Streams attempts to retry initializing producer transactions
> within a given 5-minute interval. With exponential retry this mechanism
> could experience more frequent timeouts, which we should be careful with.
> 3. With regard to #2, we should have a more detailed checklist of all the
> existing static retry scenarios, and adjust the initial exponential retry
> ms to make sure we don't time out too easily at a higher level due to too
> few attempts.
>
> Boyang
>
> On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <sk...@gmail.com>
> wrote:
>
> > Hi Everyone,
> >
> > I’ve written a KIP about introducing exponential backoff for Kafka
> > clients. Would appreciate any feedback on this.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> >
> > Thanks,
> > Sanjana
> >

Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients

Posted by Boyang Chen <re...@gmail.com>.
Thanks for the KIP Sanjana. I think the motivation is good, but it lacks
more quantitative analysis. For instance:

1. How many retries are we saving by applying exponential retry vs.
static retry? There should be some mathematical relation between the
static retry ms, the initial exponential retry ms, and the max
exponential retry ms in a given time interval.
2. How does this affect client timeouts? With exponential retry, the
client is more likely to time out at a parent-level caller; for
instance, Streams attempts to retry initializing producer transactions
within a given 5-minute interval. With exponential retry this mechanism
could experience more frequent timeouts, which we should be careful with.
3. With regard to #2, we should have a more detailed checklist of all the
existing static retry scenarios, and adjust the initial exponential retry
ms to make sure we don't time out too easily at a higher level due to too
few attempts.

Boyang

On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <sk...@gmail.com>
wrote:

> Hi Everyone,
>
> I’ve written a KIP about introducing exponential backoff for Kafka
> clients. Would appreciate any feedback on this.
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
>
> Thanks,
> Sanjana
>