You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Philip Nee <ph...@gmail.com> on 2023/02/23 17:29:23 UTC

[DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Hi all!

I want to start a discussion thread about how we can handle client
bootstrap failure due DNS lookup.  This requires a bit of behavioral
change, so a KIP is proposed and attached to this email. Let me know what
you think!


*A small remark here*: *As the title of this KIP might sound
familiar/similar to KIP-899, it is not the same.*

*In Summary:* I want to propose a KIP to change the existing bootstrap
(upon instantiation) strategy because it is reasonable to allow clients to
retry

*KIP: *
https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution

Thanks!
Philip

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hey Ivan,

(Also, thanks John!)

Looping you in just for transparency. Let me know what do you think.

Thanks!
P


On Thu, Feb 23, 2023 at 12:03 PM John Roesler <vv...@apache.org> wrote:

> Thanks for the KIP, Philip!
>
> I think your proposal makes sense. I suspect the reason that we previously
> did the DNS resolution in the constructor is to fail fast if the name is
> wrong. On the other hand, it's generally a hassle to do failure-prone or
> slow operations in a constructor, so I'm in favor of moving it to poll.
>
> I'm also in favor of throwing NetworkException (or some other
> RetriableException), since failing to resolve the DNS entry for the brokers
> shouldn't poison the state of the client, and it should be fine for users
> to retry if they want to.
>
> I actually do think there might be some overlap with KIP-899. If we go
> ahead and move DNS resolution to poll, then KIP-899 becomes just a question
> of whether we should call poll at other points after the first resolution.
> It seems like these could potentially be merged into one proposal, or you
> and Ivan could coordinate on symbiotic KIPs.
>
> Thanks again,
> -John
>
> On 2023/02/23 17:29:23 Philip Nee wrote:
> > Hi all!
> >
> > I want to start a discussion thread about how we can handle client
> > bootstrap failure due DNS lookup.  This requires a bit of behavioral
> > change, so a KIP is proposed and attached to this email. Let me know what
> > you think!
> >
> >
> > *A small remark here*: *As the title of this KIP might sound
> > familiar/similar to KIP-899, it is not the same.*
> >
> > *In Summary:* I want to propose a KIP to change the existing bootstrap
> > (upon instantiation) strategy because it is reasonable to allow clients
> to
> > retry
> >
> > *KIP: *
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> >
> > Thanks!
> > Philip
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by John Roesler <vv...@apache.org>.
Thanks for the KIP, Philip!

I think your proposal makes sense. I suspect the reason that we previously did the DNS resolution in the constructor is to fail fast if the name is wrong. On the other hand, it's generally a hassle to do failure-prone or slow operations in a constructor, so I'm in favor of moving it to poll.

I'm also in favor of throwing NetworkException (or some other RetriableException), since failing to resolve the DNS entry for the brokers shouldn't poison the state of the client, and it should be fine for users to retry if they want to.

I actually do think there might be some overlap with KIP-899. If we go ahead and move DNS resolution to poll, then KIP-899 becomes just a question of whether we should call poll at other points after the first resolution. It seems like these could potentially be merged into one proposal, or you and Ivan could coordinate on symbiotic KIPs.

Thanks again,
-John

On 2023/02/23 17:29:23 Philip Nee wrote:
> Hi all!
> 
> I want to start a discussion thread about how we can handle client
> bootstrap failure due DNS lookup.  This requires a bit of behavioral
> change, so a KIP is proposed and attached to this email. Let me know what
> you think!
> 
> 
> *A small remark here*: *As the title of this KIP might sound
> familiar/similar to KIP-899, it is not the same.*
> 
> *In Summary:* I want to propose a KIP to change the existing bootstrap
> (upon instantiation) strategy because it is reasonable to allow clients to
> retry
> 
> *KIP: *
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> 
> Thanks!
> Philip
> 

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <pn...@confluent.io.INVALID>.
Hi,

This KIP isn't implemented yet but that sounds good.

Thanks!
P

On Mon, Apr 29, 2024 at 3:17 AM Federico Valeri <fe...@gmail.com>
wrote:

> Hi Philip, thanks for this KIP. As other mentioned, I think it is
> useful in dynamic environment like Kubernetes.
>
> I was wondering if we should also update the "examples" modules with
> this new non-retriable exception. Wdyt?
>
> On Fri, Apr 21, 2023 at 11:13 PM Philip Nee <ph...@gmail.com> wrote:
> >
> > Hey Jason,
> >
> > Thanks again. I've updated the KIP.
> >
> > P
> >
> > On Fri, Apr 21, 2023 at 9:56 AM Jason Gustafson
> <ja...@confluent.io.invalid>
> > wrote:
> >
> > > Hey Philip, that sounds good to me.
> > >
> > > -Jason
> > >
> > >
> > >
> > > On Thu, Apr 20, 2023 at 1:20 PM Philip Nee <ph...@gmail.com>
> wrote:
> > >
> > > > Hey Jason,
> > > >
> > > > Thanks for the response. I agree with your suggestions. Just to
> clarify,
> > > we
> > > > want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS
> > > > resolution time, and we can throw a fatal,
> BootstrapResolutionException
> > > (so
> > > > not connection exception anymore) afterward.
> > > >
> > > > I think that aligns with the goal of this KIP.
> > > >
> > > > P
> > > >
> > > > On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson
> > > <jason@confluent.io.invalid
> > > > >
> > > > wrote:
> > > >
> > > > > Hey Philip,
> > > > >
> > > > > Yeah, I see your point. Here is the challenge I'm considering.
> Today,
> > > the
> > > > > client does not forward connection-related errors to the
> application.
> > > It
> > > > is
> > > > > debatable whether it should, but that is the behavior that
> applications
> > > > > expect today. The only case of a fatal error is DNS resolution and
> the
> > > > > application must handle it because they cannot even construct the
> > > client.
> > > > > Once we have this KIP, we will introduce a separate fatal failure
> > > > mechanism
> > > > > through the `BootstrapConnectionException`. And we will use it not
> only
> > > > for
> > > > > DNS failures, but general bootstrap connection failures . The
> problem
> > > is
> > > > > that existing applications will not know that this should be
> treated
> > > as a
> > > > > fatal error. So they may continue to retry. Does it help in this
> > > > situation
> > > > > to poison the client so that it cannot be used? I'm not sure.
> Perhaps
> > > it
> > > > > would reduce the risk a bit if we change `
> > > > bootstrap.connection.timeout.ms`
> > > > > to `bootstrap.resolve.timeout.ms` or something like that? Then we
> > > retain
> > > > > the current behavior if DNS succeeds, but connections fail. I know
> I'm
> > > > the
> > > > > one that suggested generalizing the configuration, but I feel some
> > > > > hesitation after thinking about it more.
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <ph...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hey Jason,
> > > > > >
> > > > > > Thanks for your review.  I think if we make it a retriable error,
> > > does
> > > > it
> > > > > > make sense to have a configurable timeout still? as we expect the
> > > user
> > > > to
> > > > > > continue to retry anyway.
> > > > > >
> > > > > > I'm considering the case of bad configuration. If the user
> retries
> > > the
> > > > > > error, then we rely on the error/warning to alert the user.  In
> this
> > > > > case,
> > > > > > maybe we continue using the proposed behavior, i.e. warn on each
> poll
> > > > > after
> > > > > > the timeout period.
> > > > > >
> > > > > > If you agree that a configuration is needed, maybe we can call
> this
> > > > > > *bootstrap.auto.retry.ms
> > > > > > <http://bootstrap.auto.retry.ms> *instead, to indicate a
> > > configurable
> > > > > > period of automatic retry. What do you think?
> > > > > >
> > > > > > Cheers,
> > > > > > P
> > > > > >
> > > > > > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson
> > > > > <jason@confluent.io.invalid
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Phillip,
> > > > > > >
> > > > > > > The KIP looks good. 5 minutes seems like a reasonable
> tradeoff. I
> > > do
> > > > > > wonder
> > > > > > > if it is necessary to treat bootstrap timeout as a fatal error
> > > > though.
> > > > > It
> > > > > > > seems possible that the exception might be caught by handlers
> in
> > > > > existing
> > > > > > > applications which may not expect that the client needs to be
> > > > > restarted.
> > > > > > > Perhaps it would be safer to make it retriable? As long as the
> > > > > > application
> > > > > > > continues trying to use the client, we could continue trying to
> > > reach
> > > > > the
> > > > > > > bootstrap servers perhaps? That would be closer to behavior
> today
> > > > which
> > > > > > > only treats DNS resolution failures as fatal. What do you
> think?
> > > > > > >
> > > > > > > Best,
> > > > > > > Jason
> > > > > > >
> > > > > > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <
> philipnee@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Thanks, everyone: I'm starting a vote today.  Here's the
> recap
> > > for
> > > > > some
> > > > > > > of
> > > > > > > > the questions:
> > > > > > > >
> > > > > > > > John: I changed the proposal to throw a non-retriable
> exception
> > > > after
> > > > > > the
> > > > > > > > timeout elapses. I feel it might be necessary to poison the
> > > client
> > > > > > after
> > > > > > > > retry expires, as it might indicate a real issue.
> > > > > > > > Ismael: The proposal is to add a configuration for the retry
> and
> > > it
> > > > > > will
> > > > > > > > throw a non-retriable exception after the time expires.
> > > > > > > > Chris: Addressed some unclarity that you mentioned, and a
> new API
> > > > > won't
> > > > > > > be
> > > > > > > > introduced in this KIP.  Maybe up for future discussion.
> > > > > > > > Jason: I'm proposing adding a timeout config and a bootstrap
> > > > > exception
> > > > > > > per
> > > > > > > > your suggestion.
> > > > > > > > Kirk: I'm proposing throwing a non-retriable exception in the
> > > > network
> > > > > > > > client. See previous comment.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> > > > > <chrise@aiven.io.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Philip,
> > > > > > > > >
> > > > > > > > > Yeah,  "DNS resolution should occur..." seems like a better
> > > fit.
> > > > 👍
> > > > > > > > >
> > > > > > > > > One other question I have is whether we should expose some
> kind
> > > > of
> > > > > > > public
> > > > > > > > > API for performing preflight validation of the bootstrap
> URLs.
> > > If
> > > > > we
> > > > > > > > change
> > > > > > > > > the behavior of a client configured with a silly typo
> (e.g.,
> > > > > > > > > "loclahost instead of localhost") from failing in the
> > > constructor
> > > > > to
> > > > > > > > > failing with a retriable exception, this might lead some
> client
> > > > > > > > > applications to handle that failure by, well, retrying. For
> > > > > > reference,
> > > > > > > > this
> > > > > > > > > is exactly what we do in Kafka Connect right now; see [1]
> and
> > > > [2].
> > > > > > IMO
> > > > > > > > it'd
> > > > > > > > > be nice to be able to opt into keeping the current
> behavior so
> > > > that
> > > > > > > > > projects like Connect could still do preflight checks of
> the
> > > > > > > > > bootstrap.servers property for connectors before starting
> them,
> > > > and
> > > > > > > > report
> > > > > > > > > any issues by failing fast instead of continuously writing
> > > > > > > warning/error
> > > > > > > > > messages to their logs.
> > > > > > > > >
> > > > > > > > > I'm not sure about where this new API could go, but a few
> > > options
> > > > > > might
> > > > > > > > be:
> > > > > > > > >
> > > > > > > > > - Expose a public variant of the existing ClientUtils class
> > > > > > > > > - Add static methods to the ConsumerConfig,
> ProducerConfig, and
> > > > > > > > > AdminClientConfig classes
> > > > > > > > > - Add those same static methods to the KafkaConsumer,
> > > > > KafkaProducer,
> > > > > > > and
> > > > > > > > > KafkaAdminClient classes
> > > > > > > > >
> > > > > > > > > If this seems reasonable, we should probably also specify
> in
> > > the
> > > > > KIP
> > > > > > > that
> > > > > > > > > Kafka Connect will leverage this preflight validation logic
> > > > before
> > > > > > > > > instantiating any Kafka clients for use by connectors or
> tasks,
> > > > and
> > > > > > > > > continue to fail fast if there are typos in the
> > > bootstrap.servers
> > > > > > > > property,
> > > > > > > > > or if temporary DNS resolution issues come up.
> > > > > > > > >
> > > > > > > > > [1] -
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > > > > > > [2] -
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Chris
> > > > > > > > >
> > > > > > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <
> > > philipnee@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Chris,
> > > > > > > > > >
> > > > > > > > > > Thanks for the quick response, and I apologize for the
> > > unclear
> > > > > > > wording
> > > > > > > > > > there, I guess "DNS lookup" would be a more appropriate
> > > wording
> > > > > > here.
> > > > > > > > So
> > > > > > > > > > what I meant there was, to delegate the DNS lookup in the
> > > > > > constructor
> > > > > > > > to
> > > > > > > > > > the network client poll, and it will happen on the very
> first
> > > > > poll.
> > > > > > > I
> > > > > > > > > > guess the logic could look like this:
> > > > > > > > > >
> > > > > > > > > > - if the client has been bootstrapped, do nothing.
> > > > > > > > > > - Otherwise, perform DNS lookup, and acquire the
> bootstrap
> > > > server
> > > > > > > > > address.
> > > > > > > > > >
> > > > > > > > > > Thanks for the comment there, I'll change up the wording.
> > > > Maybe
> > > > > > > revise
> > > > > > > > > it
> > > > > > > > > > as "DNS resolution should occur in the poll...." ?
> > > > > > > > > >
> > > > > > > > > > P
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > > > > > <chrise@aiven.io.invalid
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Philip,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the KIP!
> > > > > > > > > > >
> > > > > > > > > > > QQ: In the "Proposed Changes" section, the KIP states
> that
> > > > > > > > > "Bootstrapping
> > > > > > > > > > > should now occur in the poll method before attempting
> to
> > > > update
> > > > > > the
> > > > > > > > > > > metadata. This includes resolving the addresses and
> > > > > bootstrapping
> > > > > > > the
> > > > > > > > > > > metadata.". By "bootstrapping the metadata" do we mean
> > > > actually
> > > > > > > > > > contacting
> > > > > > > > > > > the bootstrap servers, or just setting some internal
> state
> > > > > > related
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > > current set of servers that can be contacted for
> metadata?
> > > I
> > > > > ask
> > > > > > > > > because
> > > > > > > > > > it
> > > > > > > > > > > seems like the language here implies the former, but if
> > > > that's
> > > > > > the
> > > > > > > > > case,
> > > > > > > > > > > this is already happening in poll (or at least, the
> first
> > > > > > > invocation
> > > > > > > > of
> > > > > > > > > > > it), and if it's the latter, it's probably not
> necessary to
> > > > > > mention
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > KIP since it doesn't really impact user-facing
> behavior. It
> > > > > also
> > > > > > > > seems
> > > > > > > > > > like
> > > > > > > > > > > that detail might impact how intertwined this and
> KIP-899
> > > > are,
> > > > > > > though
> > > > > > > > > the
> > > > > > > > > > > similarity could still be superficial either way.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > >
> > > > > > > > > > > Chris
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> > > > > philipnee@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Ismael,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the feedback! The proposal is not to retry
> > > > > > > automatically
> > > > > > > > > but
> > > > > > > > > > > > relies on the user polling the NetworkClient
> (basically,
> > > > > > > > > consumer.poll)
> > > > > > > > > > > to
> > > > > > > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > > > > > NetworkException
> > > > > > > > > > > > (retriable) will be thrown.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > P
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > > > > ismael@juma.me.uk
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the KIP. Not sure if I missed it, but
> how
> > > long
> > > > > > will
> > > > > > > we
> > > > > > > > > > retry
> > > > > > > > > > > > for
> > > > > > > > > > > > > and when do we give up and propagate the failure
> to the
> > > > > user?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ismael
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > > > > > philipnee@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I want to start a discussion thread about how we
> can
> > > > > handle
> > > > > > > > > client
> > > > > > > > > > > > > > bootstrap failure due DNS lookup.  This requires
> a
> > > bit
> > > > of
> > > > > > > > > > behavioral
> > > > > > > > > > > > > > change, so a KIP is proposed and attached to this
> > > > email.
> > > > > > Let
> > > > > > > me
> > > > > > > > > > know
> > > > > > > > > > > > what
> > > > > > > > > > > > > > you think!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *A small remark here*: *As the title of this KIP
> > > might
> > > > > > sound
> > > > > > > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *In Summary:* I want to propose a KIP to change
> the
> > > > > > existing
> > > > > > > > > > > bootstrap
> > > > > > > > > > > > > > (upon instantiation) strategy because it is
> > > reasonable
> > > > to
> > > > > > > allow
> > > > > > > > > > > clients
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > retry
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *KIP: *
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > Philip
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Federico Valeri <fe...@gmail.com>.
Hi Philip, thanks for this KIP. As other mentioned, I think it is
useful in dynamic environment like Kubernetes.

I was wondering if we should also update the "examples" modules with
this new non-retriable exception. Wdyt?

On Fri, Apr 21, 2023 at 11:13 PM Philip Nee <ph...@gmail.com> wrote:
>
> Hey Jason,
>
> Thanks again. I've updated the KIP.
>
> P
>
> On Fri, Apr 21, 2023 at 9:56 AM Jason Gustafson <ja...@confluent.io.invalid>
> wrote:
>
> > Hey Philip, that sounds good to me.
> >
> > -Jason
> >
> >
> >
> > On Thu, Apr 20, 2023 at 1:20 PM Philip Nee <ph...@gmail.com> wrote:
> >
> > > Hey Jason,
> > >
> > > Thanks for the response. I agree with your suggestions. Just to clarify,
> > we
> > > want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS
> > > resolution time, and we can throw a fatal, BootstrapResolutionException
> > (so
> > > not connection exception anymore) afterward.
> > >
> > > I think that aligns with the goal of this KIP.
> > >
> > > P
> > >
> > > On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson
> > <jason@confluent.io.invalid
> > > >
> > > wrote:
> > >
> > > > Hey Philip,
> > > >
> > > > Yeah, I see your point. Here is the challenge I'm considering. Today,
> > the
> > > > client does not forward connection-related errors to the application.
> > It
> > > is
> > > > debatable whether it should, but that is the behavior that applications
> > > > expect today. The only case of a fatal error is DNS resolution and the
> > > > application must handle it because they cannot even construct the
> > client.
> > > > Once we have this KIP, we will introduce a separate fatal failure
> > > mechanism
> > > > through the `BootstrapConnectionException`. And we will use it not only
> > > for
> > > > DNS failures, but general bootstrap connection failures . The problem
> > is
> > > > that existing applications will not know that this should be treated
> > as a
> > > > fatal error. So they may continue to retry. Does it help in this
> > > situation
> > > > to poison the client so that it cannot be used? I'm not sure. Perhaps
> > it
> > > > would reduce the risk a bit if we change `
> > > bootstrap.connection.timeout.ms`
> > > > to `bootstrap.resolve.timeout.ms` or something like that? Then we
> > retain
> > > > the current behavior if DNS succeeds, but connections fail. I know I'm
> > > the
> > > > one that suggested generalizing the configuration, but I feel some
> > > > hesitation after thinking about it more.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <ph...@gmail.com>
> > wrote:
> > > >
> > > > > Hey Jason,
> > > > >
> > > > > Thanks for your review.  I think if we make it a retriable error,
> > does
> > > it
> > > > > make sense to have a configurable timeout still? as we expect the
> > user
> > > to
> > > > > continue to retry anyway.
> > > > >
> > > > > I'm considering the case of bad configuration. If the user retries
> > the
> > > > > error, then we rely on the error/warning to alert the user.  In this
> > > > case,
> > > > > maybe we continue using the proposed behavior, i.e. warn on each poll
> > > > after
> > > > > the timeout period.
> > > > >
> > > > > If you agree that a configuration is needed, maybe we can call this
> > > > > *bootstrap.auto.retry.ms
> > > > > <http://bootstrap.auto.retry.ms> *instead, to indicate a
> > configurable
> > > > > period of automatic retry. What do you think?
> > > > >
> > > > > Cheers,
> > > > > P
> > > > >
> > > > > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson
> > > > <jason@confluent.io.invalid
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hey Phillip,
> > > > > >
> > > > > > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I
> > do
> > > > > wonder
> > > > > > if it is necessary to treat bootstrap timeout as a fatal error
> > > though.
> > > > It
> > > > > > seems possible that the exception might be caught by handlers in
> > > > existing
> > > > > > applications which may not expect that the client needs to be
> > > > restarted.
> > > > > > Perhaps it would be safer to make it retriable? As long as the
> > > > > application
> > > > > > continues trying to use the client, we could continue trying to
> > reach
> > > > the
> > > > > > bootstrap servers perhaps? That would be closer to behavior today
> > > which
> > > > > > only treats DNS resolution failures as fatal. What do you think?
> > > > > >
> > > > > > Best,
> > > > > > Jason
> > > > > >
> > > > > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <ph...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Thanks, everyone: I'm starting a vote today.  Here's the recap
> > for
> > > > some
> > > > > > of
> > > > > > > the questions:
> > > > > > >
> > > > > > > John: I changed the proposal to throw a non-retriable exception
> > > after
> > > > > the
> > > > > > > timeout elapses. I feel it might be necessary to poison the
> > client
> > > > > after
> > > > > > > retry expires, as it might indicate a real issue.
> > > > > > > Ismael: The proposal is to add a configuration for the retry and
> > it
> > > > > will
> > > > > > > throw a non-retriable exception after the time expires.
> > > > > > > Chris: Addressed some unclarity that you mentioned, and a new API
> > > > won't
> > > > > > be
> > > > > > > introduced in this KIP.  Maybe up for future discussion.
> > > > > > > Jason: I'm proposing adding a timeout config and a bootstrap
> > > > exception
> > > > > > per
> > > > > > > your suggestion.
> > > > > > > Kirk: I'm proposing throwing a non-retriable exception in the
> > > network
> > > > > > > client. See previous comment.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> > > > <chrise@aiven.io.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Philip,
> > > > > > > >
> > > > > > > > Yeah,  "DNS resolution should occur..." seems like a better
> > fit.
> > > 👍
> > > > > > > >
> > > > > > > > One other question I have is whether we should expose some kind
> > > of
> > > > > > public
> > > > > > > > API for performing preflight validation of the bootstrap URLs.
> > If
> > > > we
> > > > > > > change
> > > > > > > > the behavior of a client configured with a silly typo (e.g.,
> > > > > > > > "loclahost instead of localhost") from failing in the
> > constructor
> > > > to
> > > > > > > > failing with a retriable exception, this might lead some client
> > > > > > > > applications to handle that failure by, well, retrying. For
> > > > > reference,
> > > > > > > this
> > > > > > > > is exactly what we do in Kafka Connect right now; see [1] and
> > > [2].
> > > > > IMO
> > > > > > > it'd
> > > > > > > > be nice to be able to opt into keeping the current behavior so
> > > that
> > > > > > > > projects like Connect could still do preflight checks of the
> > > > > > > > bootstrap.servers property for connectors before starting them,
> > > and
> > > > > > > report
> > > > > > > > any issues by failing fast instead of continuously writing
> > > > > > warning/error
> > > > > > > > messages to their logs.
> > > > > > > >
> > > > > > > > I'm not sure about where this new API could go, but a few
> > options
> > > > > might
> > > > > > > be:
> > > > > > > >
> > > > > > > > - Expose a public variant of the existing ClientUtils class
> > > > > > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > > > > > AdminClientConfig classes
> > > > > > > > - Add those same static methods to the KafkaConsumer,
> > > > KafkaProducer,
> > > > > > and
> > > > > > > > KafkaAdminClient classes
> > > > > > > >
> > > > > > > > If this seems reasonable, we should probably also specify in
> > the
> > > > KIP
> > > > > > that
> > > > > > > > Kafka Connect will leverage this preflight validation logic
> > > before
> > > > > > > > instantiating any Kafka clients for use by connectors or tasks,
> > > and
> > > > > > > > continue to fail fast if there are typos in the
> > bootstrap.servers
> > > > > > > property,
> > > > > > > > or if temporary DNS resolution issues come up.
> > > > > > > >
> > > > > > > > [1] -
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > > > > > [2] -
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Chris
> > > > > > > >
> > > > > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <
> > philipnee@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Chris,
> > > > > > > > >
> > > > > > > > > Thanks for the quick response, and I apologize for the
> > unclear
> > > > > > wording
> > > > > > > > > there, I guess "DNS lookup" would be a more appropriate
> > wording
> > > > > here.
> > > > > > > So
> > > > > > > > > what I meant there was, to delegate the DNS lookup in the
> > > > > constructor
> > > > > > > to
> > > > > > > > > the network client poll, and it will happen on the very first
> > > > poll.
> > > > > > I
> > > > > > > > > guess the logic could look like this:
> > > > > > > > >
> > > > > > > > > - if the client has been bootstrapped, do nothing.
> > > > > > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap
> > > server
> > > > > > > > address.
> > > > > > > > >
> > > > > > > > > Thanks for the comment there, I'll change up the wording.
> > > Maybe
> > > > > > revise
> > > > > > > > it
> > > > > > > > > as "DNS resolution should occur in the poll...." ?
> > > > > > > > >
> > > > > > > > > P
> > > > > > > > >
> > > > > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > > > > <chrise@aiven.io.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Philip,
> > > > > > > > > >
> > > > > > > > > > Thanks for the KIP!
> > > > > > > > > >
> > > > > > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > > > > > "Bootstrapping
> > > > > > > > > > should now occur in the poll method before attempting to
> > > update
> > > > > the
> > > > > > > > > > metadata. This includes resolving the addresses and
> > > > bootstrapping
> > > > > > the
> > > > > > > > > > metadata.". By "bootstrapping the metadata" do we mean
> > > actually
> > > > > > > > > contacting
> > > > > > > > > > the bootstrap servers, or just setting some internal state
> > > > > related
> > > > > > to
> > > > > > > > the
> > > > > > > > > > current set of servers that can be contacted for metadata?
> > I
> > > > ask
> > > > > > > > because
> > > > > > > > > it
> > > > > > > > > > seems like the language here implies the former, but if
> > > that's
> > > > > the
> > > > > > > > case,
> > > > > > > > > > this is already happening in poll (or at least, the first
> > > > > > invocation
> > > > > > > of
> > > > > > > > > > it), and if it's the latter, it's probably not necessary to
> > > > > mention
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > KIP since it doesn't really impact user-facing behavior. It
> > > > also
> > > > > > > seems
> > > > > > > > > like
> > > > > > > > > > that detail might impact how intertwined this and KIP-899
> > > are,
> > > > > > though
> > > > > > > > the
> > > > > > > > > > similarity could still be superficial either way.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > > > Chris
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> > > > philipnee@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Ismael,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the feedback! The proposal is not to retry
> > > > > > automatically
> > > > > > > > but
> > > > > > > > > > > relies on the user polling the NetworkClient (basically,
> > > > > > > > consumer.poll)
> > > > > > > > > > to
> > > > > > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > > > > NetworkException
> > > > > > > > > > > (retriable) will be thrown.
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > > P
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > > > ismael@juma.me.uk
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the KIP. Not sure if I missed it, but how
> > long
> > > > > will
> > > > > > we
> > > > > > > > > retry
> > > > > > > > > > > for
> > > > > > > > > > > > and when do we give up and propagate the failure to the
> > > > user?
> > > > > > > > > > > >
> > > > > > > > > > > > Ismael
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > > > > philipnee@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all!
> > > > > > > > > > > > >
> > > > > > > > > > > > > I want to start a discussion thread about how we can
> > > > handle
> > > > > > > > client
> > > > > > > > > > > > > bootstrap failure due DNS lookup.  This requires a
> > bit
> > > of
> > > > > > > > > behavioral
> > > > > > > > > > > > > change, so a KIP is proposed and attached to this
> > > email.
> > > > > Let
> > > > > > me
> > > > > > > > > know
> > > > > > > > > > > what
> > > > > > > > > > > > > you think!
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > *A small remark here*: *As the title of this KIP
> > might
> > > > > sound
> > > > > > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > > > > > >
> > > > > > > > > > > > > *In Summary:* I want to propose a KIP to change the
> > > > > existing
> > > > > > > > > > bootstrap
> > > > > > > > > > > > > (upon instantiation) strategy because it is
> > reasonable
> > > to
> > > > > > allow
> > > > > > > > > > clients
> > > > > > > > > > > > to
> > > > > > > > > > > > > retry
> > > > > > > > > > > > >
> > > > > > > > > > > > > *KIP: *
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > Philip
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hey Jason,

Thanks again. I've updated the KIP.

P

On Fri, Apr 21, 2023 at 9:56 AM Jason Gustafson <ja...@confluent.io.invalid>
wrote:

> Hey Philip, that sounds good to me.
>
> -Jason
>
>
>
> On Thu, Apr 20, 2023 at 1:20 PM Philip Nee <ph...@gmail.com> wrote:
>
> > Hey Jason,
> >
> > Thanks for the response. I agree with your suggestions. Just to clarify,
> we
> > want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS
> > resolution time, and we can throw a fatal, BootstrapResolutionException
> (so
> > not connection exception anymore) afterward.
> >
> > I think that aligns with the goal of this KIP.
> >
> > P
> >
> > On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson
> <jason@confluent.io.invalid
> > >
> > wrote:
> >
> > > Hey Philip,
> > >
> > > Yeah, I see your point. Here is the challenge I'm considering. Today,
> the
> > > client does not forward connection-related errors to the application.
> It
> > is
> > > debatable whether it should, but that is the behavior that applications
> > > expect today. The only case of a fatal error is DNS resolution and the
> > > application must handle it because they cannot even construct the
> client.
> > > Once we have this KIP, we will introduce a separate fatal failure
> > mechanism
> > > through the `BootstrapConnectionException`. And we will use it not only
> > for
> > > DNS failures, but general bootstrap connection failures . The problem
> is
> > > that existing applications will not know that this should be treated
> as a
> > > fatal error. So they may continue to retry. Does it help in this
> > situation
> > > to poison the client so that it cannot be used? I'm not sure. Perhaps
> it
> > > would reduce the risk a bit if we change `
> > bootstrap.connection.timeout.ms`
> > > to `bootstrap.resolve.timeout.ms` or something like that? Then we
> retain
> > > the current behavior if DNS succeeds, but connections fail. I know I'm
> > the
> > > one that suggested generalizing the configuration, but I feel some
> > > hesitation after thinking about it more.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <ph...@gmail.com>
> wrote:
> > >
> > > > Hey Jason,
> > > >
> > > > Thanks for your review.  I think if we make it a retriable error,
> does
> > it
> > > > make sense to have a configurable timeout still? as we expect the
> user
> > to
> > > > continue to retry anyway.
> > > >
> > > > I'm considering the case of bad configuration. If the user retries
> the
> > > > error, then we rely on the error/warning to alert the user.  In this
> > > case,
> > > > maybe we continue using the proposed behavior, i.e. warn on each poll
> > > after
> > > > the timeout period.
> > > >
> > > > If you agree that a configuration is needed, maybe we can call this
> > > > *bootstrap.auto.retry.ms
> > > > <http://bootstrap.auto.retry.ms> *instead, to indicate a
> configurable
> > > > period of automatic retry. What do you think?
> > > >
> > > > Cheers,
> > > > P
> > > >
> > > > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson
> > > <jason@confluent.io.invalid
> > > > >
> > > > wrote:
> > > >
> > > > > Hey Phillip,
> > > > >
> > > > > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I
> do
> > > > wonder
> > > > > if it is necessary to treat bootstrap timeout as a fatal error
> > though.
> > > It
> > > > > seems possible that the exception might be caught by handlers in
> > > existing
> > > > > applications which may not expect that the client needs to be
> > > restarted.
> > > > > Perhaps it would be safer to make it retriable? As long as the
> > > > application
> > > > > continues trying to use the client, we could continue trying to
> reach
> > > the
> > > > > bootstrap servers perhaps? That would be closer to behavior today
> > which
> > > > > only treats DNS resolution failures as fatal. What do you think?
> > > > >
> > > > > Best,
> > > > > Jason
> > > > >
> > > > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <ph...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Thanks, everyone: I'm starting a vote today.  Here's the recap
> for
> > > some
> > > > > of
> > > > > > the questions:
> > > > > >
> > > > > > John: I changed the proposal to throw a non-retriable exception
> > after
> > > > the
> > > > > > timeout elapses. I feel it might be necessary to poison the
> client
> > > > after
> > > > > > retry expires, as it might indicate a real issue.
> > > > > > Ismael: The proposal is to add a configuration for the retry and
> it
> > > > will
> > > > > > throw a non-retriable exception after the time expires.
> > > > > > Chris: Addressed some unclarity that you mentioned, and a new API
> > > won't
> > > > > be
> > > > > > introduced in this KIP.  Maybe up for future discussion.
> > > > > > Jason: I'm proposing adding a timeout config and a bootstrap
> > > exception
> > > > > per
> > > > > > your suggestion.
> > > > > > Kirk: I'm proposing throwing a non-retriable exception in the
> > network
> > > > > > client. See previous comment.
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> > > <chrise@aiven.io.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Philip,
> > > > > > >
> > > > > > > Yeah,  "DNS resolution should occur..." seems like a better
> fit.
> > 👍
> > > > > > >
> > > > > > > One other question I have is whether we should expose some kind
> > of
> > > > > public
> > > > > > > API for performing preflight validation of the bootstrap URLs.
> If
> > > we
> > > > > > change
> > > > > > > the behavior of a client configured with a silly typo (e.g.,
> > > > > > > "loclahost instead of localhost") from failing in the
> constructor
> > > to
> > > > > > > failing with a retriable exception, this might lead some client
> > > > > > > applications to handle that failure by, well, retrying. For
> > > > reference,
> > > > > > this
> > > > > > > is exactly what we do in Kafka Connect right now; see [1] and
> > [2].
> > > > IMO
> > > > > > it'd
> > > > > > > be nice to be able to opt into keeping the current behavior so
> > that
> > > > > > > projects like Connect could still do preflight checks of the
> > > > > > > bootstrap.servers property for connectors before starting them,
> > and
> > > > > > report
> > > > > > > any issues by failing fast instead of continuously writing
> > > > > warning/error
> > > > > > > messages to their logs.
> > > > > > >
> > > > > > > I'm not sure about where this new API could go, but a few
> options
> > > > might
> > > > > > be:
> > > > > > >
> > > > > > > - Expose a public variant of the existing ClientUtils class
> > > > > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > > > > AdminClientConfig classes
> > > > > > > - Add those same static methods to the KafkaConsumer,
> > > KafkaProducer,
> > > > > and
> > > > > > > KafkaAdminClient classes
> > > > > > >
> > > > > > > If this seems reasonable, we should probably also specify in
> the
> > > KIP
> > > > > that
> > > > > > > Kafka Connect will leverage this preflight validation logic
> > before
> > > > > > > instantiating any Kafka clients for use by connectors or tasks,
> > and
> > > > > > > continue to fail fast if there are typos in the
> bootstrap.servers
> > > > > > property,
> > > > > > > or if temporary DNS resolution issues come up.
> > > > > > >
> > > > > > > [1] -
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > > > > [2] -
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <
> philipnee@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hey Chris,
> > > > > > > >
> > > > > > > > Thanks for the quick response, and I apologize for the
> unclear
> > > > > wording
> > > > > > > > there, I guess "DNS lookup" would be a more appropriate
> wording
> > > > here.
> > > > > > So
> > > > > > > > what I meant there was, to delegate the DNS lookup in the
> > > > constructor
> > > > > > to
> > > > > > > > the network client poll, and it will happen on the very first
> > > poll.
> > > > > I
> > > > > > > > guess the logic could look like this:
> > > > > > > >
> > > > > > > > - if the client has been bootstrapped, do nothing.
> > > > > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap
> > server
> > > > > > > address.
> > > > > > > >
> > > > > > > > Thanks for the comment there, I'll change up the wording.
> > Maybe
> > > > > revise
> > > > > > > it
> > > > > > > > as "DNS resolution should occur in the poll...." ?
> > > > > > > >
> > > > > > > > P
> > > > > > > >
> > > > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > > > <chrise@aiven.io.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Philip,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP!
> > > > > > > > >
> > > > > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > > > > "Bootstrapping
> > > > > > > > > should now occur in the poll method before attempting to
> > update
> > > > the
> > > > > > > > > metadata. This includes resolving the addresses and
> > > bootstrapping
> > > > > the
> > > > > > > > > metadata.". By "bootstrapping the metadata" do we mean
> > actually
> > > > > > > > contacting
> > > > > > > > > the bootstrap servers, or just setting some internal state
> > > > related
> > > > > to
> > > > > > > the
> > > > > > > > > current set of servers that can be contacted for metadata?
> I
> > > ask
> > > > > > > because
> > > > > > > > it
> > > > > > > > > seems like the language here implies the former, but if
> > that's
> > > > the
> > > > > > > case,
> > > > > > > > > this is already happening in poll (or at least, the first
> > > > > invocation
> > > > > > of
> > > > > > > > > it), and if it's the latter, it's probably not necessary to
> > > > mention
> > > > > > in
> > > > > > > > the
> > > > > > > > > KIP since it doesn't really impact user-facing behavior. It
> > > also
> > > > > > seems
> > > > > > > > like
> > > > > > > > > that detail might impact how intertwined this and KIP-899
> > are,
> > > > > though
> > > > > > > the
> > > > > > > > > similarity could still be superficial either way.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Chris
> > > > > > > > >
> > > > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> > > philipnee@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Ismael,
> > > > > > > > > >
> > > > > > > > > > Thanks for the feedback! The proposal is not to retry
> > > > > automatically
> > > > > > > but
> > > > > > > > > > relies on the user polling the NetworkClient (basically,
> > > > > > > consumer.poll)
> > > > > > > > > to
> > > > > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > > > NetworkException
> > > > > > > > > > (retriable) will be thrown.
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > P
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > > ismael@juma.me.uk
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the KIP. Not sure if I missed it, but how
> long
> > > > will
> > > > > we
> > > > > > > > retry
> > > > > > > > > > for
> > > > > > > > > > > and when do we give up and propagate the failure to the
> > > user?
> > > > > > > > > > >
> > > > > > > > > > > Ismael
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > > > philipnee@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all!
> > > > > > > > > > > >
> > > > > > > > > > > > I want to start a discussion thread about how we can
> > > handle
> > > > > > > client
> > > > > > > > > > > > bootstrap failure due DNS lookup.  This requires a
> bit
> > of
> > > > > > > > behavioral
> > > > > > > > > > > > change, so a KIP is proposed and attached to this
> > email.
> > > > Let
> > > > > me
> > > > > > > > know
> > > > > > > > > > what
> > > > > > > > > > > > you think!
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > *A small remark here*: *As the title of this KIP
> might
> > > > sound
> > > > > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > > > > >
> > > > > > > > > > > > *In Summary:* I want to propose a KIP to change the
> > > > existing
> > > > > > > > > bootstrap
> > > > > > > > > > > > (upon instantiation) strategy because it is
> reasonable
> > to
> > > > > allow
> > > > > > > > > clients
> > > > > > > > > > > to
> > > > > > > > > > > > retry
> > > > > > > > > > > >
> > > > > > > > > > > > *KIP: *
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > Philip
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Jason Gustafson <ja...@confluent.io.INVALID>.
Hey Philip, that sounds good to me.

-Jason



On Thu, Apr 20, 2023 at 1:20 PM Philip Nee <ph...@gmail.com> wrote:

> Hey Jason,
>
> Thanks for the response. I agree with your suggestions. Just to clarify, we
> want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS
> resolution time, and we can throw a fatal, BootstrapResolutionException (so
> not connection exception anymore) afterward.
>
> I think that aligns with the goal of this KIP.
>
> P
>
> On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson <jason@confluent.io.invalid
> >
> wrote:
>
> > Hey Philip,
> >
> > Yeah, I see your point. Here is the challenge I'm considering. Today, the
> > client does not forward connection-related errors to the application. It
> is
> > debatable whether it should, but that is the behavior that applications
> > expect today. The only case of a fatal error is DNS resolution and the
> > application must handle it because they cannot even construct the client.
> > Once we have this KIP, we will introduce a separate fatal failure
> mechanism
> > through the `BootstrapConnectionException`. And we will use it not only
> for
> > DNS failures, but general bootstrap connection failures . The problem is
> > that existing applications will not know that this should be treated as a
> > fatal error. So they may continue to retry. Does it help in this
> situation
> > to poison the client so that it cannot be used? I'm not sure. Perhaps it
> > would reduce the risk a bit if we change `
> bootstrap.connection.timeout.ms`
> > to `bootstrap.resolve.timeout.ms` or something like that? Then we retain
> > the current behavior if DNS succeeds, but connections fail. I know I'm
> the
> > one that suggested generalizing the configuration, but I feel some
> > hesitation after thinking about it more.
> >
> > Thanks,
> > Jason
> >
> > On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <ph...@gmail.com> wrote:
> >
> > > Hey Jason,
> > >
> > > Thanks for your review.  I think if we make it a retriable error, does
> it
> > > make sense to have a configurable timeout still? as we expect the user
> to
> > > continue to retry anyway.
> > >
> > > I'm considering the case of bad configuration. If the user retries the
> > > error, then we rely on the error/warning to alert the user.  In this
> > case,
> > > maybe we continue using the proposed behavior, i.e. warn on each poll
> > after
> > > the timeout period.
> > >
> > > If you agree that a configuration is needed, maybe we can call this
> > > *bootstrap.auto.retry.ms
> > > <http://bootstrap.auto.retry.ms> *instead, to indicate a configurable
> > > period of automatic retry. What do you think?
> > >
> > > Cheers,
> > > P
> > >
> > > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson
> > <jason@confluent.io.invalid
> > > >
> > > wrote:
> > >
> > > > Hey Phillip,
> > > >
> > > > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do
> > > wonder
> > > > if it is necessary to treat bootstrap timeout as a fatal error
> though.
> > It
> > > > seems possible that the exception might be caught by handlers in
> > existing
> > > > applications which may not expect that the client needs to be
> > restarted.
> > > > Perhaps it would be safer to make it retriable? As long as the
> > > application
> > > > continues trying to use the client, we could continue trying to reach
> > the
> > > > bootstrap servers perhaps? That would be closer to behavior today
> which
> > > > only treats DNS resolution failures as fatal. What do you think?
> > > >
> > > > Best,
> > > > Jason
> > > >
> > > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <ph...@gmail.com>
> > wrote:
> > > >
> > > > > Thanks, everyone: I'm starting a vote today.  Here's the recap for
> > some
> > > > of
> > > > > the questions:
> > > > >
> > > > > John: I changed the proposal to throw a non-retriable exception
> after
> > > the
> > > > > timeout elapses. I feel it might be necessary to poison the client
> > > after
> > > > > retry expires, as it might indicate a real issue.
> > > > > Ismael: The proposal is to add a configuration for the retry and it
> > > will
> > > > > throw a non-retriable exception after the time expires.
> > > > > Chris: Addressed some unclarity that you mentioned, and a new API
> > won't
> > > > be
> > > > > introduced in this KIP.  Maybe up for future discussion.
> > > > > Jason: I'm proposing adding a timeout config and a bootstrap
> > exception
> > > > per
> > > > > your suggestion.
> > > > > Kirk: I'm proposing throwing a non-retriable exception in the
> network
> > > > > client. See previous comment.
> > > > >
> > > > >
> > > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> > <chrise@aiven.io.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Philip,
> > > > > >
> > > > > > Yeah,  "DNS resolution should occur..." seems like a better fit.
> 👍
> > > > > >
> > > > > > One other question I have is whether we should expose some kind
> of
> > > > public
> > > > > > API for performing preflight validation of the bootstrap URLs. If
> > we
> > > > > change
> > > > > > the behavior of a client configured with a silly typo (e.g.,
> > > > > > "loclahost instead of localhost") from failing in the constructor
> > to
> > > > > > failing with a retriable exception, this might lead some client
> > > > > > applications to handle that failure by, well, retrying. For
> > > reference,
> > > > > this
> > > > > > is exactly what we do in Kafka Connect right now; see [1] and
> [2].
> > > IMO
> > > > > it'd
> > > > > > be nice to be able to opt into keeping the current behavior so
> that
> > > > > > projects like Connect could still do preflight checks of the
> > > > > > bootstrap.servers property for connectors before starting them,
> and
> > > > > report
> > > > > > any issues by failing fast instead of continuously writing
> > > > warning/error
> > > > > > messages to their logs.
> > > > > >
> > > > > > I'm not sure about where this new API could go, but a few options
> > > might
> > > > > be:
> > > > > >
> > > > > > - Expose a public variant of the existing ClientUtils class
> > > > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > > > AdminClientConfig classes
> > > > > > - Add those same static methods to the KafkaConsumer,
> > KafkaProducer,
> > > > and
> > > > > > KafkaAdminClient classes
> > > > > >
> > > > > > If this seems reasonable, we should probably also specify in the
> > KIP
> > > > that
> > > > > > Kafka Connect will leverage this preflight validation logic
> before
> > > > > > instantiating any Kafka clients for use by connectors or tasks,
> and
> > > > > > continue to fail fast if there are typos in the bootstrap.servers
> > > > > property,
> > > > > > or if temporary DNS resolution issues come up.
> > > > > >
> > > > > > [1] -
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > > > [2] -
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hey Chris,
> > > > > > >
> > > > > > > Thanks for the quick response, and I apologize for the unclear
> > > > wording
> > > > > > > there, I guess "DNS lookup" would be a more appropriate wording
> > > here.
> > > > > So
> > > > > > > what I meant there was, to delegate the DNS lookup in the
> > > constructor
> > > > > to
> > > > > > > the network client poll, and it will happen on the very first
> > poll.
> > > > I
> > > > > > > guess the logic could look like this:
> > > > > > >
> > > > > > > - if the client has been bootstrapped, do nothing.
> > > > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap
> server
> > > > > > address.
> > > > > > >
> > > > > > > Thanks for the comment there, I'll change up the wording.
> Maybe
> > > > revise
> > > > > > it
> > > > > > > as "DNS resolution should occur in the poll...." ?
> > > > > > >
> > > > > > > P
> > > > > > >
> > > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > > <chrise@aiven.io.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Philip,
> > > > > > > >
> > > > > > > > Thanks for the KIP!
> > > > > > > >
> > > > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > > > "Bootstrapping
> > > > > > > > should now occur in the poll method before attempting to
> update
> > > the
> > > > > > > > metadata. This includes resolving the addresses and
> > bootstrapping
> > > > the
> > > > > > > > metadata.". By "bootstrapping the metadata" do we mean
> actually
> > > > > > > contacting
> > > > > > > > the bootstrap servers, or just setting some internal state
> > > related
> > > > to
> > > > > > the
> > > > > > > > current set of servers that can be contacted for metadata? I
> > ask
> > > > > > because
> > > > > > > it
> > > > > > > > seems like the language here implies the former, but if
> that's
> > > the
> > > > > > case,
> > > > > > > > this is already happening in poll (or at least, the first
> > > > invocation
> > > > > of
> > > > > > > > it), and if it's the latter, it's probably not necessary to
> > > mention
> > > > > in
> > > > > > > the
> > > > > > > > KIP since it doesn't really impact user-facing behavior. It
> > also
> > > > > seems
> > > > > > > like
> > > > > > > > that detail might impact how intertwined this and KIP-899
> are,
> > > > though
> > > > > > the
> > > > > > > > similarity could still be superficial either way.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Chris
> > > > > > > >
> > > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> > philipnee@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Ismael,
> > > > > > > > >
> > > > > > > > > Thanks for the feedback! The proposal is not to retry
> > > > automatically
> > > > > > but
> > > > > > > > > relies on the user polling the NetworkClient (basically,
> > > > > > consumer.poll)
> > > > > > > > to
> > > > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > > NetworkException
> > > > > > > > > (retriable) will be thrown.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > P
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > ismael@juma.me.uk
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> > > will
> > > > we
> > > > > > > retry
> > > > > > > > > for
> > > > > > > > > > and when do we give up and propagate the failure to the
> > user?
> > > > > > > > > >
> > > > > > > > > > Ismael
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > > philipnee@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all!
> > > > > > > > > > >
> > > > > > > > > > > I want to start a discussion thread about how we can
> > handle
> > > > > > client
> > > > > > > > > > > bootstrap failure due DNS lookup.  This requires a bit
> of
> > > > > > > behavioral
> > > > > > > > > > > change, so a KIP is proposed and attached to this
> email.
> > > Let
> > > > me
> > > > > > > know
> > > > > > > > > what
> > > > > > > > > > > you think!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > *A small remark here*: *As the title of this KIP might
> > > sound
> > > > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > > > >
> > > > > > > > > > > *In Summary:* I want to propose a KIP to change the
> > > existing
> > > > > > > > bootstrap
> > > > > > > > > > > (upon instantiation) strategy because it is reasonable
> to
> > > > allow
> > > > > > > > clients
> > > > > > > > > > to
> > > > > > > > > > > retry
> > > > > > > > > > >
> > > > > > > > > > > *KIP: *
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > > Philip
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hey Jason,

Thanks for the response. I agree with your suggestions. Just to clarify, we
want the timeout, bootstrap.resolve.timeout.ms to only bound the DNS
resolution time, and we can throw a fatal, BootstrapResolutionException (so
not connection exception anymore) afterward.

I think that aligns with the goal of this KIP.

P

On Thu, Apr 20, 2023 at 9:23 AM Jason Gustafson <ja...@confluent.io.invalid>
wrote:

> Hey Philip,
>
> Yeah, I see your point. Here is the challenge I'm considering. Today, the
> client does not forward connection-related errors to the application. It is
> debatable whether it should, but that is the behavior that applications
> expect today. The only case of a fatal error is DNS resolution and the
> application must handle it because they cannot even construct the client.
> Once we have this KIP, we will introduce a separate fatal failure mechanism
> through the `BootstrapConnectionException`. And we will use it not only for
> DNS failures, but general bootstrap connection failures . The problem is
> that existing applications will not know that this should be treated as a
> fatal error. So they may continue to retry. Does it help in this situation
> to poison the client so that it cannot be used? I'm not sure. Perhaps it
> would reduce the risk a bit if we change `bootstrap.connection.timeout.ms`
> to `bootstrap.resolve.timeout.ms` or something like that? Then we retain
> the current behavior if DNS succeeds, but connections fail. I know I'm the
> one that suggested generalizing the configuration, but I feel some
> hesitation after thinking about it more.
>
> Thanks,
> Jason
>
> On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <ph...@gmail.com> wrote:
>
> > Hey Jason,
> >
> > Thanks for your review.  I think if we make it a retriable error, does it
> > make sense to have a configurable timeout still? as we expect the user to
> > continue to retry anyway.
> >
> > I'm considering the case of bad configuration. If the user retries the
> > error, then we rely on the error/warning to alert the user.  In this
> case,
> > maybe we continue using the proposed behavior, i.e. warn on each poll
> after
> > the timeout period.
> >
> > If you agree that a configuration is needed, maybe we can call this
> > *bootstrap.auto.retry.ms
> > <http://bootstrap.auto.retry.ms> *instead, to indicate a configurable
> > period of automatic retry. What do you think?
> >
> > Cheers,
> > P
> >
> > On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson
> <jason@confluent.io.invalid
> > >
> > wrote:
> >
> > > Hey Phillip,
> > >
> > > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do
> > wonder
> > > if it is necessary to treat bootstrap timeout as a fatal error though.
> It
> > > seems possible that the exception might be caught by handlers in
> existing
> > > applications which may not expect that the client needs to be
> restarted.
> > > Perhaps it would be safer to make it retriable? As long as the
> > application
> > > continues trying to use the client, we could continue trying to reach
> the
> > > bootstrap servers perhaps? That would be closer to behavior today which
> > > only treats DNS resolution failures as fatal. What do you think?
> > >
> > > Best,
> > > Jason
> > >
> > > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <ph...@gmail.com>
> wrote:
> > >
> > > > Thanks, everyone: I'm starting a vote today.  Here's the recap for
> some
> > > of
> > > > the questions:
> > > >
> > > > John: I changed the proposal to throw a non-retriable exception after
> > the
> > > > timeout elapses. I feel it might be necessary to poison the client
> > after
> > > > retry expires, as it might indicate a real issue.
> > > > Ismael: The proposal is to add a configuration for the retry and it
> > will
> > > > throw a non-retriable exception after the time expires.
> > > > Chris: Addressed some unclarity that you mentioned, and a new API
> won't
> > > be
> > > > introduced in this KIP.  Maybe up for future discussion.
> > > > Jason: I'm proposing adding a timeout config and a bootstrap
> exception
> > > per
> > > > your suggestion.
> > > > Kirk: I'm proposing throwing a non-retriable exception in the network
> > > > client. See previous comment.
> > > >
> > > >
> > > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> <chrise@aiven.io.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Philip,
> > > > >
> > > > > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> > > > >
> > > > > One other question I have is whether we should expose some kind of
> > > public
> > > > > API for performing preflight validation of the bootstrap URLs. If
> we
> > > > change
> > > > > the behavior of a client configured with a silly typo (e.g.,
> > > > > "loclahost instead of localhost") from failing in the constructor
> to
> > > > > failing with a retriable exception, this might lead some client
> > > > > applications to handle that failure by, well, retrying. For
> > reference,
> > > > this
> > > > > is exactly what we do in Kafka Connect right now; see [1] and [2].
> > IMO
> > > > it'd
> > > > > be nice to be able to opt into keeping the current behavior so that
> > > > > projects like Connect could still do preflight checks of the
> > > > > bootstrap.servers property for connectors before starting them, and
> > > > report
> > > > > any issues by failing fast instead of continuously writing
> > > warning/error
> > > > > messages to their logs.
> > > > >
> > > > > I'm not sure about where this new API could go, but a few options
> > might
> > > > be:
> > > > >
> > > > > - Expose a public variant of the existing ClientUtils class
> > > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > > AdminClientConfig classes
> > > > > - Add those same static methods to the KafkaConsumer,
> KafkaProducer,
> > > and
> > > > > KafkaAdminClient classes
> > > > >
> > > > > If this seems reasonable, we should probably also specify in the
> KIP
> > > that
> > > > > Kafka Connect will leverage this preflight validation logic before
> > > > > instantiating any Kafka clients for use by connectors or tasks, and
> > > > > continue to fail fast if there are typos in the bootstrap.servers
> > > > property,
> > > > > or if temporary DNS resolution issues come up.
> > > > >
> > > > > [1] -
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > > [2] -
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hey Chris,
> > > > > >
> > > > > > Thanks for the quick response, and I apologize for the unclear
> > > wording
> > > > > > there, I guess "DNS lookup" would be a more appropriate wording
> > here.
> > > > So
> > > > > > what I meant there was, to delegate the DNS lookup in the
> > constructor
> > > > to
> > > > > > the network client poll, and it will happen on the very first
> poll.
> > > I
> > > > > > guess the logic could look like this:
> > > > > >
> > > > > > - if the client has been bootstrapped, do nothing.
> > > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > > > > address.
> > > > > >
> > > > > > Thanks for the comment there, I'll change up the wording.  Maybe
> > > revise
> > > > > it
> > > > > > as "DNS resolution should occur in the poll...." ?
> > > > > >
> > > > > > P
> > > > > >
> > > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > <chrise@aiven.io.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Philip,
> > > > > > >
> > > > > > > Thanks for the KIP!
> > > > > > >
> > > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > > "Bootstrapping
> > > > > > > should now occur in the poll method before attempting to update
> > the
> > > > > > > metadata. This includes resolving the addresses and
> bootstrapping
> > > the
> > > > > > > metadata.". By "bootstrapping the metadata" do we mean actually
> > > > > > contacting
> > > > > > > the bootstrap servers, or just setting some internal state
> > related
> > > to
> > > > > the
> > > > > > > current set of servers that can be contacted for metadata? I
> ask
> > > > > because
> > > > > > it
> > > > > > > seems like the language here implies the former, but if that's
> > the
> > > > > case,
> > > > > > > this is already happening in poll (or at least, the first
> > > invocation
> > > > of
> > > > > > > it), and if it's the latter, it's probably not necessary to
> > mention
> > > > in
> > > > > > the
> > > > > > > KIP since it doesn't really impact user-facing behavior. It
> also
> > > > seems
> > > > > > like
> > > > > > > that detail might impact how intertwined this and KIP-899 are,
> > > though
> > > > > the
> > > > > > > similarity could still be superficial either way.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> philipnee@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hey Ismael,
> > > > > > > >
> > > > > > > > Thanks for the feedback! The proposal is not to retry
> > > automatically
> > > > > but
> > > > > > > > relies on the user polling the NetworkClient (basically,
> > > > > consumer.poll)
> > > > > > > to
> > > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > NetworkException
> > > > > > > > (retriable) will be thrown.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > P
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> ismael@juma.me.uk
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> > will
> > > we
> > > > > > retry
> > > > > > > > for
> > > > > > > > > and when do we give up and propagate the failure to the
> user?
> > > > > > > > >
> > > > > > > > > Ismael
> > > > > > > > >
> > > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > philipnee@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all!
> > > > > > > > > >
> > > > > > > > > > I want to start a discussion thread about how we can
> handle
> > > > > client
> > > > > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > > > > > behavioral
> > > > > > > > > > change, so a KIP is proposed and attached to this email.
> > Let
> > > me
> > > > > > know
> > > > > > > > what
> > > > > > > > > > you think!
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > *A small remark here*: *As the title of this KIP might
> > sound
> > > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > > >
> > > > > > > > > > *In Summary:* I want to propose a KIP to change the
> > existing
> > > > > > > bootstrap
> > > > > > > > > > (upon instantiation) strategy because it is reasonable to
> > > allow
> > > > > > > clients
> > > > > > > > > to
> > > > > > > > > > retry
> > > > > > > > > >
> > > > > > > > > > *KIP: *
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > Philip
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Jason Gustafson <ja...@confluent.io.INVALID>.
Hey Philip,

Yeah, I see your point. Here is the challenge I'm considering. Today, the
client does not forward connection-related errors to the application. It is
debatable whether it should, but that is the behavior that applications
expect today. The only case of a fatal error is DNS resolution and the
application must handle it because they cannot even construct the client.
Once we have this KIP, we will introduce a separate fatal failure mechanism
through the `BootstrapConnectionException`. And we will use it not only for
DNS failures, but general bootstrap connection failures . The problem is
that existing applications will not know that this should be treated as a
fatal error. So they may continue to retry. Does it help in this situation
to poison the client so that it cannot be used? I'm not sure. Perhaps it
would reduce the risk a bit if we change `bootstrap.connection.timeout.ms`
to `bootstrap.resolve.timeout.ms` or something like that? Then we retain
the current behavior if DNS succeeds, but connections fail. I know I'm the
one that suggested generalizing the configuration, but I feel some
hesitation after thinking about it more.

Thanks,
Jason

On Wed, Apr 19, 2023 at 8:10 PM Philip Nee <ph...@gmail.com> wrote:

> Hey Jason,
>
> Thanks for your review.  I think if we make it a retriable error, does it
> make sense to have a configurable timeout still? as we expect the user to
> continue to retry anyway.
>
> I'm considering the case of bad configuration. If the user retries the
> error, then we rely on the error/warning to alert the user.  In this case,
> maybe we continue using the proposed behavior, i.e. warn on each poll after
> the timeout period.
>
> If you agree that a configuration is needed, maybe we can call this
> *bootstrap.auto.retry.ms
> <http://bootstrap.auto.retry.ms> *instead, to indicate a configurable
> period of automatic retry. What do you think?
>
> Cheers,
> P
>
> On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson <jason@confluent.io.invalid
> >
> wrote:
>
> > Hey Phillip,
> >
> > The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do
> wonder
> > if it is necessary to treat bootstrap timeout as a fatal error though. It
> > seems possible that the exception might be caught by handlers in existing
> > applications which may not expect that the client needs to be restarted.
> > Perhaps it would be safer to make it retriable? As long as the
> application
> > continues trying to use the client, we could continue trying to reach the
> > bootstrap servers perhaps? That would be closer to behavior today which
> > only treats DNS resolution failures as fatal. What do you think?
> >
> > Best,
> > Jason
> >
> > On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <ph...@gmail.com> wrote:
> >
> > > Thanks, everyone: I'm starting a vote today.  Here's the recap for some
> > of
> > > the questions:
> > >
> > > John: I changed the proposal to throw a non-retriable exception after
> the
> > > timeout elapses. I feel it might be necessary to poison the client
> after
> > > retry expires, as it might indicate a real issue.
> > > Ismael: The proposal is to add a configuration for the retry and it
> will
> > > throw a non-retriable exception after the time expires.
> > > Chris: Addressed some unclarity that you mentioned, and a new API won't
> > be
> > > introduced in this KIP.  Maybe up for future discussion.
> > > Jason: I'm proposing adding a timeout config and a bootstrap exception
> > per
> > > your suggestion.
> > > Kirk: I'm proposing throwing a non-retriable exception in the network
> > > client. See previous comment.
> > >
> > >
> > > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <chrise@aiven.io.invalid
> >
> > > wrote:
> > >
> > > > Hi Philip,
> > > >
> > > > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> > > >
> > > > One other question I have is whether we should expose some kind of
> > public
> > > > API for performing preflight validation of the bootstrap URLs. If we
> > > change
> > > > the behavior of a client configured with a silly typo (e.g.,
> > > > "loclahost instead of localhost") from failing in the constructor to
> > > > failing with a retriable exception, this might lead some client
> > > > applications to handle that failure by, well, retrying. For
> reference,
> > > this
> > > > is exactly what we do in Kafka Connect right now; see [1] and [2].
> IMO
> > > it'd
> > > > be nice to be able to opt into keeping the current behavior so that
> > > > projects like Connect could still do preflight checks of the
> > > > bootstrap.servers property for connectors before starting them, and
> > > report
> > > > any issues by failing fast instead of continuously writing
> > warning/error
> > > > messages to their logs.
> > > >
> > > > I'm not sure about where this new API could go, but a few options
> might
> > > be:
> > > >
> > > > - Expose a public variant of the existing ClientUtils class
> > > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > > AdminClientConfig classes
> > > > - Add those same static methods to the KafkaConsumer, KafkaProducer,
> > and
> > > > KafkaAdminClient classes
> > > >
> > > > If this seems reasonable, we should probably also specify in the KIP
> > that
> > > > Kafka Connect will leverage this preflight validation logic before
> > > > instantiating any Kafka clients for use by connectors or tasks, and
> > > > continue to fail fast if there are typos in the bootstrap.servers
> > > property,
> > > > or if temporary DNS resolution issues come up.
> > > >
> > > > [1] -
> > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > > [2] -
> > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com>
> > wrote:
> > > >
> > > > > Hey Chris,
> > > > >
> > > > > Thanks for the quick response, and I apologize for the unclear
> > wording
> > > > > there, I guess "DNS lookup" would be a more appropriate wording
> here.
> > > So
> > > > > what I meant there was, to delegate the DNS lookup in the
> constructor
> > > to
> > > > > the network client poll, and it will happen on the very first poll.
> > I
> > > > > guess the logic could look like this:
> > > > >
> > > > > - if the client has been bootstrapped, do nothing.
> > > > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > > > address.
> > > > >
> > > > > Thanks for the comment there, I'll change up the wording.  Maybe
> > revise
> > > > it
> > > > > as "DNS resolution should occur in the poll...." ?
> > > > >
> > > > > P
> > > > >
> > > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > <chrise@aiven.io.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Philip,
> > > > > >
> > > > > > Thanks for the KIP!
> > > > > >
> > > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > > "Bootstrapping
> > > > > > should now occur in the poll method before attempting to update
> the
> > > > > > metadata. This includes resolving the addresses and bootstrapping
> > the
> > > > > > metadata.". By "bootstrapping the metadata" do we mean actually
> > > > > contacting
> > > > > > the bootstrap servers, or just setting some internal state
> related
> > to
> > > > the
> > > > > > current set of servers that can be contacted for metadata? I ask
> > > > because
> > > > > it
> > > > > > seems like the language here implies the former, but if that's
> the
> > > > case,
> > > > > > this is already happening in poll (or at least, the first
> > invocation
> > > of
> > > > > > it), and if it's the latter, it's probably not necessary to
> mention
> > > in
> > > > > the
> > > > > > KIP since it doesn't really impact user-facing behavior. It also
> > > seems
> > > > > like
> > > > > > that detail might impact how intertwined this and KIP-899 are,
> > though
> > > > the
> > > > > > similarity could still be superficial either way.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hey Ismael,
> > > > > > >
> > > > > > > Thanks for the feedback! The proposal is not to retry
> > automatically
> > > > but
> > > > > > > relies on the user polling the NetworkClient (basically,
> > > > consumer.poll)
> > > > > > to
> > > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > NetworkException
> > > > > > > (retriable) will be thrown.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > P
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <ismael@juma.me.uk
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> will
> > we
> > > > > retry
> > > > > > > for
> > > > > > > > and when do we give up and propagate the failure to the user?
> > > > > > > >
> > > > > > > > Ismael
> > > > > > > >
> > > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > philipnee@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all!
> > > > > > > > >
> > > > > > > > > I want to start a discussion thread about how we can handle
> > > > client
> > > > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > > > > behavioral
> > > > > > > > > change, so a KIP is proposed and attached to this email.
> Let
> > me
> > > > > know
> > > > > > > what
> > > > > > > > > you think!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *A small remark here*: *As the title of this KIP might
> sound
> > > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > > >
> > > > > > > > > *In Summary:* I want to propose a KIP to change the
> existing
> > > > > > bootstrap
> > > > > > > > > (upon instantiation) strategy because it is reasonable to
> > allow
> > > > > > clients
> > > > > > > > to
> > > > > > > > > retry
> > > > > > > > >
> > > > > > > > > *KIP: *
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Philip
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hey Jason,

Thanks for your review.  I think if we make it a retriable error, does it
make sense to have a configurable timeout still? as we expect the user to
continue to retry anyway.

I'm considering the case of bad configuration. If the user retries the
error, then we rely on the error/warning to alert the user.  In this case,
maybe we continue using the proposed behavior, i.e. warn on each poll after
the timeout period.

If you agree that a configuration is needed, maybe we can call this
*bootstrap.auto.retry.ms
<http://bootstrap.auto.retry.ms> *instead, to indicate a configurable
period of automatic retry. What do you think?

Cheers,
P

On Wed, Apr 19, 2023 at 7:17 PM Jason Gustafson <ja...@confluent.io.invalid>
wrote:

> Hey Phillip,
>
> The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do wonder
> if it is necessary to treat bootstrap timeout as a fatal error though. It
> seems possible that the exception might be caught by handlers in existing
> applications which may not expect that the client needs to be restarted.
> Perhaps it would be safer to make it retriable? As long as the application
> continues trying to use the client, we could continue trying to reach the
> bootstrap servers perhaps? That would be closer to behavior today which
> only treats DNS resolution failures as fatal. What do you think?
>
> Best,
> Jason
>
> On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <ph...@gmail.com> wrote:
>
> > Thanks, everyone: I'm starting a vote today.  Here's the recap for some
> of
> > the questions:
> >
> > John: I changed the proposal to throw a non-retriable exception after the
> > timeout elapses. I feel it might be necessary to poison the client after
> > retry expires, as it might indicate a real issue.
> > Ismael: The proposal is to add a configuration for the retry and it will
> > throw a non-retriable exception after the time expires.
> > Chris: Addressed some unclarity that you mentioned, and a new API won't
> be
> > introduced in this KIP.  Maybe up for future discussion.
> > Jason: I'm proposing adding a timeout config and a bootstrap exception
> per
> > your suggestion.
> > Kirk: I'm proposing throwing a non-retriable exception in the network
> > client. See previous comment.
> >
> >
> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Philip,
> > >
> > > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> > >
> > > One other question I have is whether we should expose some kind of
> public
> > > API for performing preflight validation of the bootstrap URLs. If we
> > change
> > > the behavior of a client configured with a silly typo (e.g.,
> > > "loclahost instead of localhost") from failing in the constructor to
> > > failing with a retriable exception, this might lead some client
> > > applications to handle that failure by, well, retrying. For reference,
> > this
> > > is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO
> > it'd
> > > be nice to be able to opt into keeping the current behavior so that
> > > projects like Connect could still do preflight checks of the
> > > bootstrap.servers property for connectors before starting them, and
> > report
> > > any issues by failing fast instead of continuously writing
> warning/error
> > > messages to their logs.
> > >
> > > I'm not sure about where this new API could go, but a few options might
> > be:
> > >
> > > - Expose a public variant of the existing ClientUtils class
> > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > AdminClientConfig classes
> > > - Add those same static methods to the KafkaConsumer, KafkaProducer,
> and
> > > KafkaAdminClient classes
> > >
> > > If this seems reasonable, we should probably also specify in the KIP
> that
> > > Kafka Connect will leverage this preflight validation logic before
> > > instantiating any Kafka clients for use by connectors or tasks, and
> > > continue to fail fast if there are typos in the bootstrap.servers
> > property,
> > > or if temporary DNS resolution issues come up.
> > >
> > > [1] -
> > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > [2] -
> > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com>
> wrote:
> > >
> > > > Hey Chris,
> > > >
> > > > Thanks for the quick response, and I apologize for the unclear
> wording
> > > > there, I guess "DNS lookup" would be a more appropriate wording here.
> > So
> > > > what I meant there was, to delegate the DNS lookup in the constructor
> > to
> > > > the network client poll, and it will happen on the very first poll.
> I
> > > > guess the logic could look like this:
> > > >
> > > > - if the client has been bootstrapped, do nothing.
> > > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > > address.
> > > >
> > > > Thanks for the comment there, I'll change up the wording.  Maybe
> revise
> > > it
> > > > as "DNS resolution should occur in the poll...." ?
> > > >
> > > > P
> > > >
> > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> <chrise@aiven.io.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Philip,
> > > > >
> > > > > Thanks for the KIP!
> > > > >
> > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > "Bootstrapping
> > > > > should now occur in the poll method before attempting to update the
> > > > > metadata. This includes resolving the addresses and bootstrapping
> the
> > > > > metadata.". By "bootstrapping the metadata" do we mean actually
> > > > contacting
> > > > > the bootstrap servers, or just setting some internal state related
> to
> > > the
> > > > > current set of servers that can be contacted for metadata? I ask
> > > because
> > > > it
> > > > > seems like the language here implies the former, but if that's the
> > > case,
> > > > > this is already happening in poll (or at least, the first
> invocation
> > of
> > > > > it), and if it's the latter, it's probably not necessary to mention
> > in
> > > > the
> > > > > KIP since it doesn't really impact user-facing behavior. It also
> > seems
> > > > like
> > > > > that detail might impact how intertwined this and KIP-899 are,
> though
> > > the
> > > > > similarity could still be superficial either way.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hey Ismael,
> > > > > >
> > > > > > Thanks for the feedback! The proposal is not to retry
> automatically
> > > but
> > > > > > relies on the user polling the NetworkClient (basically,
> > > consumer.poll)
> > > > > to
> > > > > > reattempt the bootstrap. If bootstrapping fails, a
> NetworkException
> > > > > > (retriable) will be thrown.
> > > > > >
> > > > > > Thanks!
> > > > > > P
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk>
> > > wrote:
> > > > > >
> > > > > > > Thanks for the KIP. Not sure if I missed it, but how long will
> we
> > > > retry
> > > > > > for
> > > > > > > and when do we give up and propagate the failure to the user?
> > > > > > >
> > > > > > > Ismael
> > > > > > >
> > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> philipnee@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi all!
> > > > > > > >
> > > > > > > > I want to start a discussion thread about how we can handle
> > > client
> > > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > > > behavioral
> > > > > > > > change, so a KIP is proposed and attached to this email. Let
> me
> > > > know
> > > > > > what
> > > > > > > > you think!
> > > > > > > >
> > > > > > > >
> > > > > > > > *A small remark here*: *As the title of this KIP might sound
> > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > >
> > > > > > > > *In Summary:* I want to propose a KIP to change the existing
> > > > > bootstrap
> > > > > > > > (upon instantiation) strategy because it is reasonable to
> allow
> > > > > clients
> > > > > > > to
> > > > > > > > retry
> > > > > > > >
> > > > > > > > *KIP: *
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Philip
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Jason Gustafson <ja...@confluent.io.INVALID>.
Hey Phillip,

The KIP looks good. 5 minutes seems like a reasonable tradeoff. I do wonder
if it is necessary to treat bootstrap timeout as a fatal error though. It
seems possible that the exception might be caught by handlers in existing
applications which may not expect that the client needs to be restarted.
Perhaps it would be safer to make it retriable? As long as the application
continues trying to use the client, we could continue trying to reach the
bootstrap servers perhaps? That would be closer to behavior today which
only treats DNS resolution failures as fatal. What do you think?

Best,
Jason

On Mon, Apr 10, 2023 at 1:53 PM Philip Nee <ph...@gmail.com> wrote:

> Thanks, everyone: I'm starting a vote today.  Here's the recap for some of
> the questions:
>
> John: I changed the proposal to throw a non-retriable exception after the
> timeout elapses. I feel it might be necessary to poison the client after
> retry expires, as it might indicate a real issue.
> Ismael: The proposal is to add a configuration for the retry and it will
> throw a non-retriable exception after the time expires.
> Chris: Addressed some unclarity that you mentioned, and a new API won't be
> introduced in this KIP.  Maybe up for future discussion.
> Jason: I'm proposing adding a timeout config and a bootstrap exception per
> your suggestion.
> Kirk: I'm proposing throwing a non-retriable exception in the network
> client. See previous comment.
>
>
> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Philip,
> >
> > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> >
> > One other question I have is whether we should expose some kind of public
> > API for performing preflight validation of the bootstrap URLs. If we
> change
> > the behavior of a client configured with a silly typo (e.g.,
> > "loclahost instead of localhost") from failing in the constructor to
> > failing with a retriable exception, this might lead some client
> > applications to handle that failure by, well, retrying. For reference,
> this
> > is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO
> it'd
> > be nice to be able to opt into keeping the current behavior so that
> > projects like Connect could still do preflight checks of the
> > bootstrap.servers property for connectors before starting them, and
> report
> > any issues by failing fast instead of continuously writing warning/error
> > messages to their logs.
> >
> > I'm not sure about where this new API could go, but a few options might
> be:
> >
> > - Expose a public variant of the existing ClientUtils class
> > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > AdminClientConfig classes
> > - Add those same static methods to the KafkaConsumer, KafkaProducer, and
> > KafkaAdminClient classes
> >
> > If this seems reasonable, we should probably also specify in the KIP that
> > Kafka Connect will leverage this preflight validation logic before
> > instantiating any Kafka clients for use by connectors or tasks, and
> > continue to fail fast if there are typos in the bootstrap.servers
> property,
> > or if temporary DNS resolution issues come up.
> >
> > [1] -
> >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > [2] -
> >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> >
> > Cheers,
> >
> > Chris
> >
> > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com> wrote:
> >
> > > Hey Chris,
> > >
> > > Thanks for the quick response, and I apologize for the unclear wording
> > > there, I guess "DNS lookup" would be a more appropriate wording here.
> So
> > > what I meant there was, to delegate the DNS lookup in the constructor
> to
> > > the network client poll, and it will happen on the very first poll.  I
> > > guess the logic could look like this:
> > >
> > > - if the client has been bootstrapped, do nothing.
> > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > address.
> > >
> > > Thanks for the comment there, I'll change up the wording.  Maybe revise
> > it
> > > as "DNS resolution should occur in the poll...." ?
> > >
> > > P
> > >
> > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton <chrise@aiven.io.invalid
> >
> > > wrote:
> > >
> > > > Hi Philip,
> > > >
> > > > Thanks for the KIP!
> > > >
> > > > QQ: In the "Proposed Changes" section, the KIP states that
> > "Bootstrapping
> > > > should now occur in the poll method before attempting to update the
> > > > metadata. This includes resolving the addresses and bootstrapping the
> > > > metadata.". By "bootstrapping the metadata" do we mean actually
> > > contacting
> > > > the bootstrap servers, or just setting some internal state related to
> > the
> > > > current set of servers that can be contacted for metadata? I ask
> > because
> > > it
> > > > seems like the language here implies the former, but if that's the
> > case,
> > > > this is already happening in poll (or at least, the first invocation
> of
> > > > it), and if it's the latter, it's probably not necessary to mention
> in
> > > the
> > > > KIP since it doesn't really impact user-facing behavior. It also
> seems
> > > like
> > > > that detail might impact how intertwined this and KIP-899 are, though
> > the
> > > > similarity could still be superficial either way.
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com>
> > wrote:
> > > >
> > > > > Hey Ismael,
> > > > >
> > > > > Thanks for the feedback! The proposal is not to retry automatically
> > but
> > > > > relies on the user polling the NetworkClient (basically,
> > consumer.poll)
> > > > to
> > > > > reattempt the bootstrap. If bootstrapping fails, a NetworkException
> > > > > (retriable) will be thrown.
> > > > >
> > > > > Thanks!
> > > > > P
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk>
> > wrote:
> > > > >
> > > > > > Thanks for the KIP. Not sure if I missed it, but how long will we
> > > retry
> > > > > for
> > > > > > and when do we give up and propagate the failure to the user?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi all!
> > > > > > >
> > > > > > > I want to start a discussion thread about how we can handle
> > client
> > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > > behavioral
> > > > > > > change, so a KIP is proposed and attached to this email. Let me
> > > know
> > > > > what
> > > > > > > you think!
> > > > > > >
> > > > > > >
> > > > > > > *A small remark here*: *As the title of this KIP might sound
> > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > >
> > > > > > > *In Summary:* I want to propose a KIP to change the existing
> > > > bootstrap
> > > > > > > (upon instantiation) strategy because it is reasonable to allow
> > > > clients
> > > > > > to
> > > > > > > retry
> > > > > > >
> > > > > > > *KIP: *
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Philip
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Thanks, everyone: I'm starting a vote today.  Here's the recap for some of
the questions:

John: I changed the proposal to throw a non-retriable exception after the
timeout elapses. I feel it might be necessary to poison the client after
retry expires, as it might indicate a real issue.
Ismael: The proposal is to add a configuration for the retry and it will
throw a non-retriable exception after the time expires.
Chris: Addressed some unclarity that you mentioned, and a new API won't be
introduced in this KIP.  Maybe up for future discussion.
Jason: I'm proposing adding a timeout config and a bootstrap exception per
your suggestion.
Kirk: I'm proposing throwing a non-retriable exception in the network
client. See previous comment.


On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <ch...@aiven.io.invalid>
wrote:

> Hi Philip,
>
> Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
>
> One other question I have is whether we should expose some kind of public
> API for performing preflight validation of the bootstrap URLs. If we change
> the behavior of a client configured with a silly typo (e.g.,
> "loclahost instead of localhost") from failing in the constructor to
> failing with a retriable exception, this might lead some client
> applications to handle that failure by, well, retrying. For reference, this
> is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO it'd
> be nice to be able to opt into keeping the current behavior so that
> projects like Connect could still do preflight checks of the
> bootstrap.servers property for connectors before starting them, and report
> any issues by failing fast instead of continuously writing warning/error
> messages to their logs.
>
> I'm not sure about where this new API could go, but a few options might be:
>
> - Expose a public variant of the existing ClientUtils class
> - Add static methods to the ConsumerConfig, ProducerConfig, and
> AdminClientConfig classes
> - Add those same static methods to the KafkaConsumer, KafkaProducer, and
> KafkaAdminClient classes
>
> If this seems reasonable, we should probably also specify in the KIP that
> Kafka Connect will leverage this preflight validation logic before
> instantiating any Kafka clients for use by connectors or tasks, and
> continue to fail fast if there are typos in the bootstrap.servers property,
> or if temporary DNS resolution issues come up.
>
> [1] -
>
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> [2] -
>
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
>
> Cheers,
>
> Chris
>
> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com> wrote:
>
> > Hey Chris,
> >
> > Thanks for the quick response, and I apologize for the unclear wording
> > there, I guess "DNS lookup" would be a more appropriate wording here. So
> > what I meant there was, to delegate the DNS lookup in the constructor to
> > the network client poll, and it will happen on the very first poll.  I
> > guess the logic could look like this:
> >
> > - if the client has been bootstrapped, do nothing.
> > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> address.
> >
> > Thanks for the comment there, I'll change up the wording.  Maybe revise
> it
> > as "DNS resolution should occur in the poll...." ?
> >
> > P
> >
> > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Philip,
> > >
> > > Thanks for the KIP!
> > >
> > > QQ: In the "Proposed Changes" section, the KIP states that
> "Bootstrapping
> > > should now occur in the poll method before attempting to update the
> > > metadata. This includes resolving the addresses and bootstrapping the
> > > metadata.". By "bootstrapping the metadata" do we mean actually
> > contacting
> > > the bootstrap servers, or just setting some internal state related to
> the
> > > current set of servers that can be contacted for metadata? I ask
> because
> > it
> > > seems like the language here implies the former, but if that's the
> case,
> > > this is already happening in poll (or at least, the first invocation of
> > > it), and if it's the latter, it's probably not necessary to mention in
> > the
> > > KIP since it doesn't really impact user-facing behavior. It also seems
> > like
> > > that detail might impact how intertwined this and KIP-899 are, though
> the
> > > similarity could still be superficial either way.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com>
> wrote:
> > >
> > > > Hey Ismael,
> > > >
> > > > Thanks for the feedback! The proposal is not to retry automatically
> but
> > > > relies on the user polling the NetworkClient (basically,
> consumer.poll)
> > > to
> > > > reattempt the bootstrap. If bootstrapping fails, a NetworkException
> > > > (retriable) will be thrown.
> > > >
> > > > Thanks!
> > > > P
> > > >
> > > >
> > > >
> > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk>
> wrote:
> > > >
> > > > > Thanks for the KIP. Not sure if I missed it, but how long will we
> > retry
> > > > for
> > > > > and when do we give up and propagate the failure to the user?
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi all!
> > > > > >
> > > > > > I want to start a discussion thread about how we can handle
> client
> > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > behavioral
> > > > > > change, so a KIP is proposed and attached to this email. Let me
> > know
> > > > what
> > > > > > you think!
> > > > > >
> > > > > >
> > > > > > *A small remark here*: *As the title of this KIP might sound
> > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > >
> > > > > > *In Summary:* I want to propose a KIP to change the existing
> > > bootstrap
> > > > > > (upon instantiation) strategy because it is reasonable to allow
> > > clients
> > > > > to
> > > > > > retry
> > > > > >
> > > > > > *KIP: *
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > >
> > > > > > Thanks!
> > > > > > Philip
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Kirk True <ki...@kirktrue.pro>.
Sounds good. I think it's ready to call a vote. Thanks Philip!

On Wed, Apr 5, 2023, at 11:24 AM, Philip Nee wrote:
> Hi all,
> 
> The KIP has been around for some time, and I've updated the document
> according to the previous comments. Here is the outline of the proposed
> changes:
> 1. Adding a timeout configuration.
> 2. Adding a new exception type
> 3. Adding a WARN level log message
> 
> The obvious changes on the client side are:
> 1. They won't attempt to resolve for DNS at the constructor level
> 2. They will try to bootstrap once network client poll is invoked.
> 
> In terms of client-side behavior:
> *Consumer*: Either the poll timer runs out and returns an empty record, or
> BootstrapConnectionException thrown
> *Producer*: API calls will be blocked on waitOnMetadata until the timer
> runs out, either the max block ms, or the bootstrap timer.
> *Admin*: The API calls timeout if API timeout expires first; otherwise, it
> throws a BootstrapConnectionException.
> 
> Let me know your thoughts: I would like to start voting in a week or so.
> 
> Link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+DNS+Resolution+Failure+Should+Not+Fail+the+Clients
> 
> Thanks,
> P
> 
> On Thu, Mar 23, 2023 at 12:59 PM Philip Nee <ph...@gmail.com> wrote:
> 
> > Hey Kirk,
> >
> > Sorry about omitting your response; it slipped through the cracks...
> >
> > *I’m not sure if producers and consumers likewise do DNS resolution in
> > their constructors?*
> > Yes, both producer and consumer bootstrap in the constructor.
> >
> > *I agree that moving the DNS resolution to poll(), it would be hard to
> > distinguish hard failures (host name resolution) from transient network
> > issues. After all, the DNS resolution issue we saw was technically a
> > transient issue.*
> > I'm amending a timeout configuration in my KIP because that should resolve
> > any transient issues if it doesn't persist for too long.  If it is a hard
> > failure (like, misconfiguration of the host name), it should be logged and
> > errored out after the expiration.
> >
> > P
> >
> > On Wed, Mar 8, 2023 at 8:12 AM Kirk True <ki...@kirktrue.pro> wrote:
> >
> >> Hi Philip,
> >>
> >> I’m understanding the options proposed as consisting of these questions:
> >>
> >> Should we throw an Exception or not?
> >> Where should the DNS resolution/bootstrapping occur—in the constructor,
> >> poll, or somewhere else?
> >> Should there be a timeout, and if so, what configuration drives it?
> >>
> >> We were seeing instances of this issue when constructing KafkaAdminClient
> >> instances. The constructor performs the DNS lookup in that client. I’m not
> >> sure if producers and consumers likewise do DNS resolution in their
> >> constructors?
> >>
> >> I agree that moving the DNS resolution to poll(), it would be hard to
> >> distinguish hard failures (host name resolution) from transient network
> >> issues. After all, the DNS resolution issue we saw was technically a
> >> transient issue.
> >>
> >> Is this an accurate summary of the current thinking:
> >>
> >> Per Jason's suggestion, introduce a new BootstrapConnectionException
> >> Per Chris’ suggestion, introduce a new
> >> API—performInitialDnsResolution()—that an application developer can call to
> >> perform a fast-fail check, throwing BootstrapConnectionException
> >> Otherwise, move the DNS resolution to poll()
> >> Handle BootstrapConnectionException failures in poll() similar to how we
> >> handle NetworkException today, with retries, timeouts, etc., i.e. we don’t
> >> introduce any new configuration
> >> Improve logging to distinguish the DNS resolution case
> >>
> >> Thanks,
> >> Kirk
> >>
> >> > On Mar 6, 2023, at 9:15 AM, Philip Nee <ph...@gmail.com> wrote:
> >> >
> >> > Cheers Kafka Community,
> >> >
> >> > I just wanna give this thread bump, as it has been a bit quiet for the
> >> past
> >> > week. I have not updated the KIP based on Chris and Jason's feedback,
> >> as I
> >> > would also like to know more about what do people think.
> >> >
> >> > Jason - Thanks for the suggestion, I think your suggestion makes a lot
> >> of
> >> > sense.
> >> >
> >> > Thanks!
> >> > P
> >> >
> >> > On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson
> >> <ja...@confluent.io.invalid>
> >> > wrote:
> >> >
> >> >> Hi Philip,
> >> >>
> >> >>> Having an overall timeout also seems reasonable, but I wonder what
> >> should
> >> >> the client do after running out of the time? Should we throw a
> >> >> non-retriable exception (instead of TimeoutExceptoin to stop the client
> >> >> from retrying) and alert the user to examine the config and the DNS
> >> server?
> >> >>
> >> >> Yeah, not sure exactly. I'd probably suggest a
> >> >> `BootstrapConnectionException` or something like that with a clear
> >> message
> >> >> indicating the problem. What the user does with it is up to them, but
> >> at
> >> >> least it gives them the option to fail their application if that is
> >> what
> >> >> they prefer to do in this case. If they catch it and ignore it, I would
> >> >> expect the client to just continue retrying. Logging for bootstrap
> >> >> dns/connection failures will be helpful in any case.
> >> >>
> >> >> -Jason
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <ph...@gmail.com>
> >> wrote:
> >> >>
> >> >>> Jason:
> >> >>> Thanks for the feedback.  Now giving it a second thought, I think your
> >> >>> suggestion of logging the error might make sense, as I could imagine
> >> most
> >> >>> users would just continue to retry, so it might not be necessary to
> >> throw
> >> >>> an exception anyway.
> >> >>> Having an overall timeout also seems reasonable, but I wonder what
> >> should
> >> >>> the client do after running out of the time? Should we throw a
> >> >>> non-retriable exception (instead of TimeoutExceptoin to stop the
> >> client
> >> >>> from retrying) and alert the user to examine the config and the DNS
> >> >> server?
> >> >>>
> >> >>> Chris:
> >> >>> I feel I still haven't answered your question about the pre-flight
> >> check,
> >> >>> as it seems exposing an API might be harder to push through.
> >> >>>
> >> >>> Thanks!
> >> >>> P
> >> >>>
> >> >>> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
> >> >>> <ja...@confluent.io.invalid>
> >> >>> wrote:
> >> >>>
> >> >>>> One more random thought I had just as I pushed send. We're currently
> >> >>>> treating this problem somewhat narrowly by focusing only on the DNS
> >> >>>> resolution of the bootstrap servers. Even if the servers resolve,
> >> >> there's
> >> >>>> no guarantee that they are reachable by the client. Would it make
> >> sense
> >> >>> to
> >> >>>> have a timeout which bounds the total time that the client should
> >> wait
> >> >> to
> >> >>>> connect to the bootstrap servers? Something like `
> >> >>>> bootstrap.servers.connection.timeout.ms`.
> >> >>>>
> >> >>>> -Jason
> >> >>>>
> >> >>>> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <jason@confluent.io
> >> >
> >> >>>> wrote:
> >> >>>>
> >> >>>>> Hi Philip,
> >> >>>>>
> >> >>>>> An alternative is not to fail at all. Every other network error is
> >> >>> caught
> >> >>>>> and handled internally in the client. We see this case as different
> >> >>>> because
> >> >>>>> a DNS resolution error may imply misconfiguration. Could it also
> >> >> imply
> >> >>>> that
> >> >>>>> the DNS server is unavailable? I'm not sure why that case should be
> >> >>>> handled
> >> >>>>> differently than if the bootstrap servers themselves are
> >> unavailable.
> >> >>>> Would
> >> >>>>> it be enough to log a clear warning in the logs if the bootstrap
> >> >>> servers
> >> >>>>> could not resolve?
> >> >>>>>
> >> >>>>> On the whole, the current fail-fast approach probably does more good
> >> >>> than
> >> >>>>> bad, but it does seem somewhat inconsistent overall and my guess is
> >> >>> that
> >> >>>>> dynamic environments will become increasingly common. It would be
> >> >> nice
> >> >>> to
> >> >>>>> have a reasonable default behavior which could handle these cases
> >> >>>>> gracefully without any additional logic. In any case, it would be
> >> >> nice
> >> >>> to
> >> >>>>> see this option in the rejected alternatives at least if we do not
> >> >> take
> >> >>>> it.
> >> >>>>>
> >> >>>>> If we want to take the route of throwing an exception, then I think
> >> >>> we're
> >> >>>>> probably going to need a new configuration since I can't see what a
> >> >>>>> reasonable timeout we would use as a default. The benefit of a
> >> >>>>> configuration is that it would let us retain the current default
> >> >>> behavior
> >> >>>>> with timeout effectively set to 0 and it would also let users
> >> >>> effectively
> >> >>>>> disable the timeout by using a very large value. Otherwise, it seems
> >> >>>> like a
> >> >>>>> potential compatibility break to have a new exception type thrown at
> >> >>> some
> >> >>>>> arbitrary time without giving the user any control over it.
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Jason
> >> >>>>>
> >> >>>>>
> >> >>>>> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton
> >> >> <chrise@aiven.io.invalid
> >> >>>>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>> Hi Philip,
> >> >>>>>>
> >> >>>>>> Yeah, it's basically DNS resolution we're talking about, though
> >> >>> there's
> >> >>>>>> some additional subtlety there with the logic introduced by KIP-235
> >> >>> [1].
> >> >>>>>> Essentially it should cover any scenario that causes a client
> >> >>>> constructor
> >> >>>>>> to fail with the current logic but would not after this KIP is
> >> >>> released.
> >> >>>>>>
> >> >>>>>> We can generalize the Connect use case like this: a client
> >> >> application
> >> >>>>>> that
> >> >>>>>> may connect to different Kafka clusters with a public-facing,
> >> >>>> easy-to-use
> >> >>>>>> API for restarting failed tasks and automatic handling of retriable
> >> >>>>>> exceptions. The ease with which failed tasks can be restarted is
> >> >>>>>> significant because it reduces the cost of failing on non-retriable
> >> >>>>>> exceptions and makes fail-fast behavior easier to work with. And,
> >> in
> >> >>>> cases
> >> >>>>>> like this where we can't really know whether the error we're
> >> dealing
> >> >>>> with
> >> >>>>>> is retriable or not, it's better IMO to continue to allow
> >> >> applications
> >> >>>>>> like
> >> >>>>>> these to fail fast. I do agree that it'd be nice to get input from
> >> >> the
> >> >>>>>> community, though.
> >> >>>>>>
> >> >>>>>> I was toying with the idea of a NetworkException subclass too. It's
> >> >> a
> >> >>>>>> simpler API, but it doesn't allow for preflight validation, which
> >> >> can
> >> >>> be
> >> >>>>>> useful in scenarios where submitting new configurations for client
> >> >>>>>> applications is expensive in terms of time or resources. Then
> >> >> again, I
> >> >>>>>> don't see why the two are mutually exclusive, and we might opt to
> >> >> use
> >> >>>> the
> >> >>>>>> NetworkException subclass in this KIP and pursue an opt-in
> >> >> validation
> >> >>>> API
> >> >>>>>> later on. Thoughts?
> >> >>>>>>
> >> >>>>>> [1] -
> >> >>>>>>
> >> >>>>>>
> >> >>>>
> >> >>>
> >> >>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
> >> >>>>>>
> >> >>>>>> Cheers,
> >> >>>>>>
> >> >>>>>> Chris
> >> >>>>>>
> >> >>>>>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com>
> >> >>> wrote:
> >> >>>>>>
> >> >>>>>>> Hey Chris,
> >> >>>>>>>
> >> >>>>>>> Thanks again for the feedback!
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> For the preflight DNS check (are we basically trying to resolve
> >> >> the
> >> >>>> DNS
> >> >>>>>>> there?): Maybe it makes more sense to add it to the Config
> >> >> modules?
> >> >>> I
> >> >>>>>> would
> >> >>>>>>> like to hear what the community says as I'm not familiar with the
> >> >>>>>> Connect
> >> >>>>>>> use case.
> >> >>>>>>>
> >> >>>>>>> A "slower failing" alternative - I wonder if it makes sense for us
> >> >>> to
> >> >>>>>>> extend the NetworkException so that clients can be smarter at
> >> >>> handling
> >> >>>>>>> these exceptions. Of course, it is still retriable and requires
> >> >>>> polling
> >> >>>>>> the
> >> >>>>>>> consumer, but then we can distinguish the DNS resolution error
> >> >> from
> >> >>>>>> other
> >> >>>>>>> network errors.
> >> >>>>>>>
> >> >>>>>>> Thanks!
> >> >>>>>>> P
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> >> >>> <chrise@aiven.io.invalid
> >> >>>>>
> >> >>>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hi Philip,
> >> >>>>>>>>
> >> >>>>>>>> Yeah,  "DNS resolution should occur..." seems like a better fit.
> >> >>> 👍
> >> >>>>>>>>
> >> >>>>>>>> One other question I have is whether we should expose some kind
> >> >> of
> >> >>>>>> public
> >> >>>>>>>> API for performing preflight validation of the bootstrap URLs.
> >> >> If
> >> >>> we
> >> >>>>>>> change
> >> >>>>>>>> the behavior of a client configured with a silly typo (e.g.,
> >> >>>>>>>> "loclahost instead of localhost") from failing in the
> >> >> constructor
> >> >>> to
> >> >>>>>>>> failing with a retriable exception, this might lead some client
> >> >>>>>>>> applications to handle that failure by, well, retrying. For
> >> >>>> reference,
> >> >>>>>>> this
> >> >>>>>>>> is exactly what we do in Kafka Connect right now; see [1] and
> >> >> [2].
> >> >>>> IMO
> >> >>>>>>> it'd
> >> >>>>>>>> be nice to be able to opt into keeping the current behavior so
> >> >>> that
> >> >>>>>>>> projects like Connect could still do preflight checks of the
> >> >>>>>>>> bootstrap.servers property for connectors before starting them,
> >> >>> and
> >> >>>>>>> report
> >> >>>>>>>> any issues by failing fast instead of continuously writing
> >> >>>>>> warning/error
> >> >>>>>>>> messages to their logs.
> >> >>>>>>>>
> >> >>>>>>>> I'm not sure about where this new API could go, but a few
> >> >> options
> >> >>>>>> might
> >> >>>>>>> be:
> >> >>>>>>>>
> >> >>>>>>>> - Expose a public variant of the existing ClientUtils class
> >> >>>>>>>> - Add static methods to the ConsumerConfig, ProducerConfig, and
> >> >>>>>>>> AdminClientConfig classes
> >> >>>>>>>> - Add those same static methods to the KafkaConsumer,
> >> >>> KafkaProducer,
> >> >>>>>> and
> >> >>>>>>>> KafkaAdminClient classes
> >> >>>>>>>>
> >> >>>>>>>> If this seems reasonable, we should probably also specify in the
> >> >>> KIP
> >> >>>>>> that
> >> >>>>>>>> Kafka Connect will leverage this preflight validation logic
> >> >> before
> >> >>>>>>>> instantiating any Kafka clients for use by connectors or tasks,
> >> >>> and
> >> >>>>>>>> continue to fail fast if there are typos in the
> >> >> bootstrap.servers
> >> >>>>>>> property,
> >> >>>>>>>> or if temporary DNS resolution issues come up.
> >> >>>>>>>>
> >> >>>>>>>> [1] -
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>
> >> >>>
> >> >>
> >> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> >> >>>>>>>> [2] -
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>
> >> >>>
> >> >>
> >> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> >> >>>>>>>>
> >> >>>>>>>> Cheers,
> >> >>>>>>>>
> >> >>>>>>>> Chris
> >> >>>>>>>>
> >> >>>>>>>> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philipnee@gmail.com
> >> >>>
> >> >>>>>> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Hey Chris,
> >> >>>>>>>>>
> >> >>>>>>>>> Thanks for the quick response, and I apologize for the unclear
> >> >>>>>> wording
> >> >>>>>>>>> there, I guess "DNS lookup" would be a more appropriate
> >> >> wording
> >> >>>>>> here.
> >> >>>>>>> So
> >> >>>>>>>>> what I meant there was, to delegate the DNS lookup in the
> >> >>>>>> constructor
> >> >>>>>>> to
> >> >>>>>>>>> the network client poll, and it will happen on the very first
> >> >>>>>> poll.  I
> >> >>>>>>>>> guess the logic could look like this:
> >> >>>>>>>>>
> >> >>>>>>>>> - if the client has been bootstrapped, do nothing.
> >> >>>>>>>>> - Otherwise, perform DNS lookup, and acquire the bootstrap
> >> >>> server
> >> >>>>>>>> address.
> >> >>>>>>>>>
> >> >>>>>>>>> Thanks for the comment there, I'll change up the wording.
> >> >> Maybe
> >> >>>>>> revise
> >> >>>>>>>> it
> >> >>>>>>>>> as "DNS resolution should occur in the poll...." ?
> >> >>>>>>>>>
> >> >>>>>>>>> P
> >> >>>>>>>>>
> >> >>>>>>>>> On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> >> >>>>>> <chrise@aiven.io.invalid
> >> >>>>>>>>
> >> >>>>>>>>> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Philip,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thanks for the KIP!
> >> >>>>>>>>>>
> >> >>>>>>>>>> QQ: In the "Proposed Changes" section, the KIP states that
> >> >>>>>>>> "Bootstrapping
> >> >>>>>>>>>> should now occur in the poll method before attempting to
> >> >>> update
> >> >>>>>> the
> >> >>>>>>>>>> metadata. This includes resolving the addresses and
> >> >>>> bootstrapping
> >> >>>>>> the
> >> >>>>>>>>>> metadata.". By "bootstrapping the metadata" do we mean
> >> >>> actually
> >> >>>>>>>>> contacting
> >> >>>>>>>>>> the bootstrap servers, or just setting some internal state
> >> >>>>>> related to
> >> >>>>>>>> the
> >> >>>>>>>>>> current set of servers that can be contacted for metadata? I
> >> >>> ask
> >> >>>>>>>> because
> >> >>>>>>>>> it
> >> >>>>>>>>>> seems like the language here implies the former, but if
> >> >> that's
> >> >>>> the
> >> >>>>>>>> case,
> >> >>>>>>>>>> this is already happening in poll (or at least, the first
> >> >>>>>> invocation
> >> >>>>>>> of
> >> >>>>>>>>>> it), and if it's the latter, it's probably not necessary to
> >> >>>>>> mention
> >> >>>>>>> in
> >> >>>>>>>>> the
> >> >>>>>>>>>> KIP since it doesn't really impact user-facing behavior. It
> >> >>> also
> >> >>>>>>> seems
> >> >>>>>>>>> like
> >> >>>>>>>>>> that detail might impact how intertwined this and KIP-899
> >> >> are,
> >> >>>>>> though
> >> >>>>>>>> the
> >> >>>>>>>>>> similarity could still be superficial either way.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Cheers,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Chris
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> >> >>> philipnee@gmail.com
> >> >>>>>
> >> >>>>>>>> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hey Ismael,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks for the feedback! The proposal is not to retry
> >> >>>>>> automatically
> >> >>>>>>>> but
> >> >>>>>>>>>>> relies on the user polling the NetworkClient (basically,
> >> >>>>>>>> consumer.poll)
> >> >>>>>>>>>> to
> >> >>>>>>>>>>> reattempt the bootstrap. If bootstrapping fails, a
> >> >>>>>> NetworkException
> >> >>>>>>>>>>> (retriable) will be thrown.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks!
> >> >>>>>>>>>>> P
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> >> >>>> ismael@juma.me.uk>
> >> >>>>>>>> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>> Thanks for the KIP. Not sure if I missed it, but how
> >> >> long
> >> >>>>>> will we
> >> >>>>>>>>> retry
> >> >>>>>>>>>>> for
> >> >>>>>>>>>>>> and when do we give up and propagate the failure to the
> >> >>>> user?
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Ismael
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> >> >>>>>> philipnee@gmail.com>
> >> >>>>>>>>>> wrote:
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>> Hi all!
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> I want to start a discussion thread about how we can
> >> >>>> handle
> >> >>>>>>>> client
> >> >>>>>>>>>>>>> bootstrap failure due DNS lookup.  This requires a bit
> >> >>> of
> >> >>>>>>>>> behavioral
> >> >>>>>>>>>>>>> change, so a KIP is proposed and attached to this
> >> >> email.
> >> >>>>>> Let me
> >> >>>>>>>>> know
> >> >>>>>>>>>>> what
> >> >>>>>>>>>>>>> you think!
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> *A small remark here*: *As the title of this KIP might
> >> >>>> sound
> >> >>>>>>>>>>>>> familiar/similar to KIP-899, it is not the same.*
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> *In Summary:* I want to propose a KIP to change the
> >> >>>> existing
> >> >>>>>>>>>> bootstrap
> >> >>>>>>>>>>>>> (upon instantiation) strategy because it is reasonable
> >> >>> to
> >> >>>>>> allow
> >> >>>>>>>>>> clients
> >> >>>>>>>>>>>> to
> >> >>>>>>>>>>>>> retry
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> *KIP: *
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>
> >> >>>
> >> >>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> Thanks!
> >> >>>>>>>>>>>>> Philip
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >>
> >>
> 

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hi all,

The KIP has been around for some time, and I've updated the document
according to the previous comments. Here is the outline of the proposed
changes:
1. Adding a timeout configuration.
2. Adding a new exception type
3. Adding a WARN level log message

The obvious changes on the client side are:
1. They won't attempt to resolve for DNS at the constructor level
2. They will try to bootstrap once network client poll is invoked.

In terms of client-side behavior:
*Consumer*: Either the poll timer runs out and returns an empty record, or
BootstrapConnectionException thrown
*Producer*: API calls will be blocked on waitOnMetadata until the timer
runs out, either the max block ms, or the bootstrap timer.
*Admin*: The API calls timeout if API timeout expires first; otherwise, it
throws a BootstrapConnectionException.

Let me know your thoughts: I would like to start voting in a week or so.

Link:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+DNS+Resolution+Failure+Should+Not+Fail+the+Clients

Thanks,
P

On Thu, Mar 23, 2023 at 12:59 PM Philip Nee <ph...@gmail.com> wrote:

> Hey Kirk,
>
> Sorry about omitting your response; it slipped through the cracks...
>
> *I’m not sure if producers and consumers likewise do DNS resolution in
> their constructors?*
> Yes, both producer and consumer bootstrap in the constructor.
>
> *I agree that moving the DNS resolution to poll(), it would be hard to
> distinguish hard failures (host name resolution) from transient network
> issues. After all, the DNS resolution issue we saw was technically a
> transient issue.*
> I'm amending a timeout configuration in my KIP because that should resolve
> any transient issues if it doesn't persist for too long.  If it is a hard
> failure (like, misconfiguration of the host name), it should be logged and
> errored out after the expiration.
>
> P
>
> On Wed, Mar 8, 2023 at 8:12 AM Kirk True <ki...@kirktrue.pro> wrote:
>
>> Hi Philip,
>>
>> I’m understanding the options proposed as consisting of these questions:
>>
>> Should we throw an Exception or not?
>> Where should the DNS resolution/bootstrapping occur—in the constructor,
>> poll, or somewhere else?
>> Should there be a timeout, and if so, what configuration drives it?
>>
>> We were seeing instances of this issue when constructing KafkaAdminClient
>> instances. The constructor performs the DNS lookup in that client. I’m not
>> sure if producers and consumers likewise do DNS resolution in their
>> constructors?
>>
>> I agree that moving the DNS resolution to poll(), it would be hard to
>> distinguish hard failures (host name resolution) from transient network
>> issues. After all, the DNS resolution issue we saw was technically a
>> transient issue.
>>
>> Is this an accurate summary of the current thinking:
>>
>> Per Jason's suggestion, introduce a new BootstrapConnectionException
>> Per Chris’ suggestion, introduce a new
>> API—performInitialDnsResolution()—that an application developer can call to
>> perform a fast-fail check, throwing BootstrapConnectionException
>> Otherwise, move the DNS resolution to poll()
>> Handle BootstrapConnectionException failures in poll() similar to how we
>> handle NetworkException today, with retries, timeouts, etc., i.e. we don’t
>> introduce any new configuration
>> Improve logging to distinguish the DNS resolution case
>>
>> Thanks,
>> Kirk
>>
>> > On Mar 6, 2023, at 9:15 AM, Philip Nee <ph...@gmail.com> wrote:
>> >
>> > Cheers Kafka Community,
>> >
>> > I just wanna give this thread bump, as it has been a bit quiet for the
>> past
>> > week. I have not updated the KIP based on Chris and Jason's feedback,
>> as I
>> > would also like to know more about what do people think.
>> >
>> > Jason - Thanks for the suggestion, I think your suggestion makes a lot
>> of
>> > sense.
>> >
>> > Thanks!
>> > P
>> >
>> > On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson
>> <ja...@confluent.io.invalid>
>> > wrote:
>> >
>> >> Hi Philip,
>> >>
>> >>> Having an overall timeout also seems reasonable, but I wonder what
>> should
>> >> the client do after running out of the time? Should we throw a
>> >> non-retriable exception (instead of TimeoutExceptoin to stop the client
>> >> from retrying) and alert the user to examine the config and the DNS
>> server?
>> >>
>> >> Yeah, not sure exactly. I'd probably suggest a
>> >> `BootstrapConnectionException` or something like that with a clear
>> message
>> >> indicating the problem. What the user does with it is up to them, but
>> at
>> >> least it gives them the option to fail their application if that is
>> what
>> >> they prefer to do in this case. If they catch it and ignore it, I would
>> >> expect the client to just continue retrying. Logging for bootstrap
>> >> dns/connection failures will be helpful in any case.
>> >>
>> >> -Jason
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <ph...@gmail.com>
>> wrote:
>> >>
>> >>> Jason:
>> >>> Thanks for the feedback.  Now giving it a second thought, I think your
>> >>> suggestion of logging the error might make sense, as I could imagine
>> most
>> >>> users would just continue to retry, so it might not be necessary to
>> throw
>> >>> an exception anyway.
>> >>> Having an overall timeout also seems reasonable, but I wonder what
>> should
>> >>> the client do after running out of the time? Should we throw a
>> >>> non-retriable exception (instead of TimeoutExceptoin to stop the
>> client
>> >>> from retrying) and alert the user to examine the config and the DNS
>> >> server?
>> >>>
>> >>> Chris:
>> >>> I feel I still haven't answered your question about the pre-flight
>> check,
>> >>> as it seems exposing an API might be harder to push through.
>> >>>
>> >>> Thanks!
>> >>> P
>> >>>
>> >>> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
>> >>> <ja...@confluent.io.invalid>
>> >>> wrote:
>> >>>
>> >>>> One more random thought I had just as I pushed send. We're currently
>> >>>> treating this problem somewhat narrowly by focusing only on the DNS
>> >>>> resolution of the bootstrap servers. Even if the servers resolve,
>> >> there's
>> >>>> no guarantee that they are reachable by the client. Would it make
>> sense
>> >>> to
>> >>>> have a timeout which bounds the total time that the client should
>> wait
>> >> to
>> >>>> connect to the bootstrap servers? Something like `
>> >>>> bootstrap.servers.connection.timeout.ms`.
>> >>>>
>> >>>> -Jason
>> >>>>
>> >>>> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <jason@confluent.io
>> >
>> >>>> wrote:
>> >>>>
>> >>>>> Hi Philip,
>> >>>>>
>> >>>>> An alternative is not to fail at all. Every other network error is
>> >>> caught
>> >>>>> and handled internally in the client. We see this case as different
>> >>>> because
>> >>>>> a DNS resolution error may imply misconfiguration. Could it also
>> >> imply
>> >>>> that
>> >>>>> the DNS server is unavailable? I'm not sure why that case should be
>> >>>> handled
>> >>>>> differently than if the bootstrap servers themselves are
>> unavailable.
>> >>>> Would
>> >>>>> it be enough to log a clear warning in the logs if the bootstrap
>> >>> servers
>> >>>>> could not resolve?
>> >>>>>
>> >>>>> On the whole, the current fail-fast approach probably does more good
>> >>> than
>> >>>>> bad, but it does seem somewhat inconsistent overall and my guess is
>> >>> that
>> >>>>> dynamic environments will become increasingly common. It would be
>> >> nice
>> >>> to
>> >>>>> have a reasonable default behavior which could handle these cases
>> >>>>> gracefully without any additional logic. In any case, it would be
>> >> nice
>> >>> to
>> >>>>> see this option in the rejected alternatives at least if we do not
>> >> take
>> >>>> it.
>> >>>>>
>> >>>>> If we want to take the route of throwing an exception, then I think
>> >>> we're
>> >>>>> probably going to need a new configuration since I can't see what a
>> >>>>> reasonable timeout we would use as a default. The benefit of a
>> >>>>> configuration is that it would let us retain the current default
>> >>> behavior
>> >>>>> with timeout effectively set to 0 and it would also let users
>> >>> effectively
>> >>>>> disable the timeout by using a very large value. Otherwise, it seems
>> >>>> like a
>> >>>>> potential compatibility break to have a new exception type thrown at
>> >>> some
>> >>>>> arbitrary time without giving the user any control over it.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Jason
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton
>> >> <chrise@aiven.io.invalid
>> >>>>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi Philip,
>> >>>>>>
>> >>>>>> Yeah, it's basically DNS resolution we're talking about, though
>> >>> there's
>> >>>>>> some additional subtlety there with the logic introduced by KIP-235
>> >>> [1].
>> >>>>>> Essentially it should cover any scenario that causes a client
>> >>>> constructor
>> >>>>>> to fail with the current logic but would not after this KIP is
>> >>> released.
>> >>>>>>
>> >>>>>> We can generalize the Connect use case like this: a client
>> >> application
>> >>>>>> that
>> >>>>>> may connect to different Kafka clusters with a public-facing,
>> >>>> easy-to-use
>> >>>>>> API for restarting failed tasks and automatic handling of retriable
>> >>>>>> exceptions. The ease with which failed tasks can be restarted is
>> >>>>>> significant because it reduces the cost of failing on non-retriable
>> >>>>>> exceptions and makes fail-fast behavior easier to work with. And,
>> in
>> >>>> cases
>> >>>>>> like this where we can't really know whether the error we're
>> dealing
>> >>>> with
>> >>>>>> is retriable or not, it's better IMO to continue to allow
>> >> applications
>> >>>>>> like
>> >>>>>> these to fail fast. I do agree that it'd be nice to get input from
>> >> the
>> >>>>>> community, though.
>> >>>>>>
>> >>>>>> I was toying with the idea of a NetworkException subclass too. It's
>> >> a
>> >>>>>> simpler API, but it doesn't allow for preflight validation, which
>> >> can
>> >>> be
>> >>>>>> useful in scenarios where submitting new configurations for client
>> >>>>>> applications is expensive in terms of time or resources. Then
>> >> again, I
>> >>>>>> don't see why the two are mutually exclusive, and we might opt to
>> >> use
>> >>>> the
>> >>>>>> NetworkException subclass in this KIP and pursue an opt-in
>> >> validation
>> >>>> API
>> >>>>>> later on. Thoughts?
>> >>>>>>
>> >>>>>> [1] -
>> >>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>>
>> >>>>>> Chris
>> >>>>>>
>> >>>>>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com>
>> >>> wrote:
>> >>>>>>
>> >>>>>>> Hey Chris,
>> >>>>>>>
>> >>>>>>> Thanks again for the feedback!
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> For the preflight DNS check (are we basically trying to resolve
>> >> the
>> >>>> DNS
>> >>>>>>> there?): Maybe it makes more sense to add it to the Config
>> >> modules?
>> >>> I
>> >>>>>> would
>> >>>>>>> like to hear what the community says as I'm not familiar with the
>> >>>>>> Connect
>> >>>>>>> use case.
>> >>>>>>>
>> >>>>>>> A "slower failing" alternative - I wonder if it makes sense for us
>> >>> to
>> >>>>>>> extend the NetworkException so that clients can be smarter at
>> >>> handling
>> >>>>>>> these exceptions. Of course, it is still retriable and requires
>> >>>> polling
>> >>>>>> the
>> >>>>>>> consumer, but then we can distinguish the DNS resolution error
>> >> from
>> >>>>>> other
>> >>>>>>> network errors.
>> >>>>>>>
>> >>>>>>> Thanks!
>> >>>>>>> P
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
>> >>> <chrise@aiven.io.invalid
>> >>>>>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> Hi Philip,
>> >>>>>>>>
>> >>>>>>>> Yeah,  "DNS resolution should occur..." seems like a better fit.
>> >>> 👍
>> >>>>>>>>
>> >>>>>>>> One other question I have is whether we should expose some kind
>> >> of
>> >>>>>> public
>> >>>>>>>> API for performing preflight validation of the bootstrap URLs.
>> >> If
>> >>> we
>> >>>>>>> change
>> >>>>>>>> the behavior of a client configured with a silly typo (e.g.,
>> >>>>>>>> "loclahost instead of localhost") from failing in the
>> >> constructor
>> >>> to
>> >>>>>>>> failing with a retriable exception, this might lead some client
>> >>>>>>>> applications to handle that failure by, well, retrying. For
>> >>>> reference,
>> >>>>>>> this
>> >>>>>>>> is exactly what we do in Kafka Connect right now; see [1] and
>> >> [2].
>> >>>> IMO
>> >>>>>>> it'd
>> >>>>>>>> be nice to be able to opt into keeping the current behavior so
>> >>> that
>> >>>>>>>> projects like Connect could still do preflight checks of the
>> >>>>>>>> bootstrap.servers property for connectors before starting them,
>> >>> and
>> >>>>>>> report
>> >>>>>>>> any issues by failing fast instead of continuously writing
>> >>>>>> warning/error
>> >>>>>>>> messages to their logs.
>> >>>>>>>>
>> >>>>>>>> I'm not sure about where this new API could go, but a few
>> >> options
>> >>>>>> might
>> >>>>>>> be:
>> >>>>>>>>
>> >>>>>>>> - Expose a public variant of the existing ClientUtils class
>> >>>>>>>> - Add static methods to the ConsumerConfig, ProducerConfig, and
>> >>>>>>>> AdminClientConfig classes
>> >>>>>>>> - Add those same static methods to the KafkaConsumer,
>> >>> KafkaProducer,
>> >>>>>> and
>> >>>>>>>> KafkaAdminClient classes
>> >>>>>>>>
>> >>>>>>>> If this seems reasonable, we should probably also specify in the
>> >>> KIP
>> >>>>>> that
>> >>>>>>>> Kafka Connect will leverage this preflight validation logic
>> >> before
>> >>>>>>>> instantiating any Kafka clients for use by connectors or tasks,
>> >>> and
>> >>>>>>>> continue to fail fast if there are typos in the
>> >> bootstrap.servers
>> >>>>>>> property,
>> >>>>>>>> or if temporary DNS resolution issues come up.
>> >>>>>>>>
>> >>>>>>>> [1] -
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
>> >>>>>>>> [2] -
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
>> >>>>>>>>
>> >>>>>>>> Cheers,
>> >>>>>>>>
>> >>>>>>>> Chris
>> >>>>>>>>
>> >>>>>>>> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philipnee@gmail.com
>> >>>
>> >>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hey Chris,
>> >>>>>>>>>
>> >>>>>>>>> Thanks for the quick response, and I apologize for the unclear
>> >>>>>> wording
>> >>>>>>>>> there, I guess "DNS lookup" would be a more appropriate
>> >> wording
>> >>>>>> here.
>> >>>>>>> So
>> >>>>>>>>> what I meant there was, to delegate the DNS lookup in the
>> >>>>>> constructor
>> >>>>>>> to
>> >>>>>>>>> the network client poll, and it will happen on the very first
>> >>>>>> poll.  I
>> >>>>>>>>> guess the logic could look like this:
>> >>>>>>>>>
>> >>>>>>>>> - if the client has been bootstrapped, do nothing.
>> >>>>>>>>> - Otherwise, perform DNS lookup, and acquire the bootstrap
>> >>> server
>> >>>>>>>> address.
>> >>>>>>>>>
>> >>>>>>>>> Thanks for the comment there, I'll change up the wording.
>> >> Maybe
>> >>>>>> revise
>> >>>>>>>> it
>> >>>>>>>>> as "DNS resolution should occur in the poll...." ?
>> >>>>>>>>>
>> >>>>>>>>> P
>> >>>>>>>>>
>> >>>>>>>>> On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
>> >>>>>> <chrise@aiven.io.invalid
>> >>>>>>>>
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi Philip,
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks for the KIP!
>> >>>>>>>>>>
>> >>>>>>>>>> QQ: In the "Proposed Changes" section, the KIP states that
>> >>>>>>>> "Bootstrapping
>> >>>>>>>>>> should now occur in the poll method before attempting to
>> >>> update
>> >>>>>> the
>> >>>>>>>>>> metadata. This includes resolving the addresses and
>> >>>> bootstrapping
>> >>>>>> the
>> >>>>>>>>>> metadata.". By "bootstrapping the metadata" do we mean
>> >>> actually
>> >>>>>>>>> contacting
>> >>>>>>>>>> the bootstrap servers, or just setting some internal state
>> >>>>>> related to
>> >>>>>>>> the
>> >>>>>>>>>> current set of servers that can be contacted for metadata? I
>> >>> ask
>> >>>>>>>> because
>> >>>>>>>>> it
>> >>>>>>>>>> seems like the language here implies the former, but if
>> >> that's
>> >>>> the
>> >>>>>>>> case,
>> >>>>>>>>>> this is already happening in poll (or at least, the first
>> >>>>>> invocation
>> >>>>>>> of
>> >>>>>>>>>> it), and if it's the latter, it's probably not necessary to
>> >>>>>> mention
>> >>>>>>> in
>> >>>>>>>>> the
>> >>>>>>>>>> KIP since it doesn't really impact user-facing behavior. It
>> >>> also
>> >>>>>>> seems
>> >>>>>>>>> like
>> >>>>>>>>>> that detail might impact how intertwined this and KIP-899
>> >> are,
>> >>>>>> though
>> >>>>>>>> the
>> >>>>>>>>>> similarity could still be superficial either way.
>> >>>>>>>>>>
>> >>>>>>>>>> Cheers,
>> >>>>>>>>>>
>> >>>>>>>>>> Chris
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
>> >>> philipnee@gmail.com
>> >>>>>
>> >>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hey Ismael,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for the feedback! The proposal is not to retry
>> >>>>>> automatically
>> >>>>>>>> but
>> >>>>>>>>>>> relies on the user polling the NetworkClient (basically,
>> >>>>>>>> consumer.poll)
>> >>>>>>>>>> to
>> >>>>>>>>>>> reattempt the bootstrap. If bootstrapping fails, a
>> >>>>>> NetworkException
>> >>>>>>>>>>> (retriable) will be thrown.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks!
>> >>>>>>>>>>> P
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
>> >>>> ismael@juma.me.uk>
>> >>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Thanks for the KIP. Not sure if I missed it, but how
>> >> long
>> >>>>>> will we
>> >>>>>>>>> retry
>> >>>>>>>>>>> for
>> >>>>>>>>>>>> and when do we give up and propagate the failure to the
>> >>>> user?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Ismael
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
>> >>>>>> philipnee@gmail.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi all!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I want to start a discussion thread about how we can
>> >>>> handle
>> >>>>>>>> client
>> >>>>>>>>>>>>> bootstrap failure due DNS lookup.  This requires a bit
>> >>> of
>> >>>>>>>>> behavioral
>> >>>>>>>>>>>>> change, so a KIP is proposed and attached to this
>> >> email.
>> >>>>>> Let me
>> >>>>>>>>> know
>> >>>>>>>>>>> what
>> >>>>>>>>>>>>> you think!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> *A small remark here*: *As the title of this KIP might
>> >>>> sound
>> >>>>>>>>>>>>> familiar/similar to KIP-899, it is not the same.*
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> *In Summary:* I want to propose a KIP to change the
>> >>>> existing
>> >>>>>>>>>> bootstrap
>> >>>>>>>>>>>>> (upon instantiation) strategy because it is reasonable
>> >>> to
>> >>>>>> allow
>> >>>>>>>>>> clients
>> >>>>>>>>>>>> to
>> >>>>>>>>>>>>> retry
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> *KIP: *
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks!
>> >>>>>>>>>>>>> Philip
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>>
>>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hey Kirk,

Sorry about omitting your response; it slipped through the cracks...

*I’m not sure if producers and consumers likewise do DNS resolution in
their constructors?*
Yes, both producer and consumer bootstrap in the constructor.

*I agree that moving the DNS resolution to poll(), it would be hard to
distinguish hard failures (host name resolution) from transient network
issues. After all, the DNS resolution issue we saw was technically a
transient issue.*
I'm amending a timeout configuration in my KIP because that should resolve
any transient issues if it doesn't persist for too long.  If it is a hard
failure (like, misconfiguration of the host name), it should be logged and
errored out after the expiration.

P

On Wed, Mar 8, 2023 at 8:12 AM Kirk True <ki...@kirktrue.pro> wrote:

> Hi Philip,
>
> I’m understanding the options proposed as consisting of these questions:
>
> Should we throw an Exception or not?
> Where should the DNS resolution/bootstrapping occur—in the constructor,
> poll, or somewhere else?
> Should there be a timeout, and if so, what configuration drives it?
>
> We were seeing instances of this issue when constructing KafkaAdminClient
> instances. The constructor performs the DNS lookup in that client. I’m not
> sure if producers and consumers likewise do DNS resolution in their
> constructors?
>
> I agree that moving the DNS resolution to poll(), it would be hard to
> distinguish hard failures (host name resolution) from transient network
> issues. After all, the DNS resolution issue we saw was technically a
> transient issue.
>
> Is this an accurate summary of the current thinking:
>
> Per Jason's suggestion, introduce a new BootstrapConnectionException
> Per Chris’ suggestion, introduce a new
> API—performInitialDnsResolution()—that an application developer can call to
> perform a fast-fail check, throwing BootstrapConnectionException
> Otherwise, move the DNS resolution to poll()
> Handle BootstrapConnectionException failures in poll() similar to how we
> handle NetworkException today, with retries, timeouts, etc., i.e. we don’t
> introduce any new configuration
> Improve logging to distinguish the DNS resolution case
>
> Thanks,
> Kirk
>
> > On Mar 6, 2023, at 9:15 AM, Philip Nee <ph...@gmail.com> wrote:
> >
> > Cheers Kafka Community,
> >
> > I just wanna give this thread bump, as it has been a bit quiet for the
> past
> > week. I have not updated the KIP based on Chris and Jason's feedback, as
> I
> > would also like to know more about what do people think.
> >
> > Jason - Thanks for the suggestion, I think your suggestion makes a lot of
> > sense.
> >
> > Thanks!
> > P
> >
> > On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson
> <ja...@confluent.io.invalid>
> > wrote:
> >
> >> Hi Philip,
> >>
> >>> Having an overall timeout also seems reasonable, but I wonder what
> should
> >> the client do after running out of the time? Should we throw a
> >> non-retriable exception (instead of TimeoutExceptoin to stop the client
> >> from retrying) and alert the user to examine the config and the DNS
> server?
> >>
> >> Yeah, not sure exactly. I'd probably suggest a
> >> `BootstrapConnectionException` or something like that with a clear
> message
> >> indicating the problem. What the user does with it is up to them, but at
> >> least it gives them the option to fail their application if that is what
> >> they prefer to do in this case. If they catch it and ignore it, I would
> >> expect the client to just continue retrying. Logging for bootstrap
> >> dns/connection failures will be helpful in any case.
> >>
> >> -Jason
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <ph...@gmail.com>
> wrote:
> >>
> >>> Jason:
> >>> Thanks for the feedback.  Now giving it a second thought, I think your
> >>> suggestion of logging the error might make sense, as I could imagine
> most
> >>> users would just continue to retry, so it might not be necessary to
> throw
> >>> an exception anyway.
> >>> Having an overall timeout also seems reasonable, but I wonder what
> should
> >>> the client do after running out of the time? Should we throw a
> >>> non-retriable exception (instead of TimeoutExceptoin to stop the client
> >>> from retrying) and alert the user to examine the config and the DNS
> >> server?
> >>>
> >>> Chris:
> >>> I feel I still haven't answered your question about the pre-flight
> check,
> >>> as it seems exposing an API might be harder to push through.
> >>>
> >>> Thanks!
> >>> P
> >>>
> >>> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
> >>> <ja...@confluent.io.invalid>
> >>> wrote:
> >>>
> >>>> One more random thought I had just as I pushed send. We're currently
> >>>> treating this problem somewhat narrowly by focusing only on the DNS
> >>>> resolution of the bootstrap servers. Even if the servers resolve,
> >> there's
> >>>> no guarantee that they are reachable by the client. Would it make
> sense
> >>> to
> >>>> have a timeout which bounds the total time that the client should wait
> >> to
> >>>> connect to the bootstrap servers? Something like `
> >>>> bootstrap.servers.connection.timeout.ms`.
> >>>>
> >>>> -Jason
> >>>>
> >>>> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io>
> >>>> wrote:
> >>>>
> >>>>> Hi Philip,
> >>>>>
> >>>>> An alternative is not to fail at all. Every other network error is
> >>> caught
> >>>>> and handled internally in the client. We see this case as different
> >>>> because
> >>>>> a DNS resolution error may imply misconfiguration. Could it also
> >> imply
> >>>> that
> >>>>> the DNS server is unavailable? I'm not sure why that case should be
> >>>> handled
> >>>>> differently than if the bootstrap servers themselves are unavailable.
> >>>> Would
> >>>>> it be enough to log a clear warning in the logs if the bootstrap
> >>> servers
> >>>>> could not resolve?
> >>>>>
> >>>>> On the whole, the current fail-fast approach probably does more good
> >>> than
> >>>>> bad, but it does seem somewhat inconsistent overall and my guess is
> >>> that
> >>>>> dynamic environments will become increasingly common. It would be
> >> nice
> >>> to
> >>>>> have a reasonable default behavior which could handle these cases
> >>>>> gracefully without any additional logic. In any case, it would be
> >> nice
> >>> to
> >>>>> see this option in the rejected alternatives at least if we do not
> >> take
> >>>> it.
> >>>>>
> >>>>> If we want to take the route of throwing an exception, then I think
> >>> we're
> >>>>> probably going to need a new configuration since I can't see what a
> >>>>> reasonable timeout we would use as a default. The benefit of a
> >>>>> configuration is that it would let us retain the current default
> >>> behavior
> >>>>> with timeout effectively set to 0 and it would also let users
> >>> effectively
> >>>>> disable the timeout by using a very large value. Otherwise, it seems
> >>>> like a
> >>>>> potential compatibility break to have a new exception type thrown at
> >>> some
> >>>>> arbitrary time without giving the user any control over it.
> >>>>>
> >>>>> Thanks,
> >>>>> Jason
> >>>>>
> >>>>>
> >>>>> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton
> >> <chrise@aiven.io.invalid
> >>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Philip,
> >>>>>>
> >>>>>> Yeah, it's basically DNS resolution we're talking about, though
> >>> there's
> >>>>>> some additional subtlety there with the logic introduced by KIP-235
> >>> [1].
> >>>>>> Essentially it should cover any scenario that causes a client
> >>>> constructor
> >>>>>> to fail with the current logic but would not after this KIP is
> >>> released.
> >>>>>>
> >>>>>> We can generalize the Connect use case like this: a client
> >> application
> >>>>>> that
> >>>>>> may connect to different Kafka clusters with a public-facing,
> >>>> easy-to-use
> >>>>>> API for restarting failed tasks and automatic handling of retriable
> >>>>>> exceptions. The ease with which failed tasks can be restarted is
> >>>>>> significant because it reduces the cost of failing on non-retriable
> >>>>>> exceptions and makes fail-fast behavior easier to work with. And, in
> >>>> cases
> >>>>>> like this where we can't really know whether the error we're dealing
> >>>> with
> >>>>>> is retriable or not, it's better IMO to continue to allow
> >> applications
> >>>>>> like
> >>>>>> these to fail fast. I do agree that it'd be nice to get input from
> >> the
> >>>>>> community, though.
> >>>>>>
> >>>>>> I was toying with the idea of a NetworkException subclass too. It's
> >> a
> >>>>>> simpler API, but it doesn't allow for preflight validation, which
> >> can
> >>> be
> >>>>>> useful in scenarios where submitting new configurations for client
> >>>>>> applications is expensive in terms of time or resources. Then
> >> again, I
> >>>>>> don't see why the two are mutually exclusive, and we might opt to
> >> use
> >>>> the
> >>>>>> NetworkException subclass in this KIP and pursue an opt-in
> >> validation
> >>>> API
> >>>>>> later on. Thoughts?
> >>>>>>
> >>>>>> [1] -
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Chris
> >>>>>>
> >>>>>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com>
> >>> wrote:
> >>>>>>
> >>>>>>> Hey Chris,
> >>>>>>>
> >>>>>>> Thanks again for the feedback!
> >>>>>>>
> >>>>>>>
> >>>>>>> For the preflight DNS check (are we basically trying to resolve
> >> the
> >>>> DNS
> >>>>>>> there?): Maybe it makes more sense to add it to the Config
> >> modules?
> >>> I
> >>>>>> would
> >>>>>>> like to hear what the community says as I'm not familiar with the
> >>>>>> Connect
> >>>>>>> use case.
> >>>>>>>
> >>>>>>> A "slower failing" alternative - I wonder if it makes sense for us
> >>> to
> >>>>>>> extend the NetworkException so that clients can be smarter at
> >>> handling
> >>>>>>> these exceptions. Of course, it is still retriable and requires
> >>>> polling
> >>>>>> the
> >>>>>>> consumer, but then we can distinguish the DNS resolution error
> >> from
> >>>>>> other
> >>>>>>> network errors.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> P
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> >>> <chrise@aiven.io.invalid
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Philip,
> >>>>>>>>
> >>>>>>>> Yeah,  "DNS resolution should occur..." seems like a better fit.
> >>> 👍
> >>>>>>>>
> >>>>>>>> One other question I have is whether we should expose some kind
> >> of
> >>>>>> public
> >>>>>>>> API for performing preflight validation of the bootstrap URLs.
> >> If
> >>> we
> >>>>>>> change
> >>>>>>>> the behavior of a client configured with a silly typo (e.g.,
> >>>>>>>> "loclahost instead of localhost") from failing in the
> >> constructor
> >>> to
> >>>>>>>> failing with a retriable exception, this might lead some client
> >>>>>>>> applications to handle that failure by, well, retrying. For
> >>>> reference,
> >>>>>>> this
> >>>>>>>> is exactly what we do in Kafka Connect right now; see [1] and
> >> [2].
> >>>> IMO
> >>>>>>> it'd
> >>>>>>>> be nice to be able to opt into keeping the current behavior so
> >>> that
> >>>>>>>> projects like Connect could still do preflight checks of the
> >>>>>>>> bootstrap.servers property for connectors before starting them,
> >>> and
> >>>>>>> report
> >>>>>>>> any issues by failing fast instead of continuously writing
> >>>>>> warning/error
> >>>>>>>> messages to their logs.
> >>>>>>>>
> >>>>>>>> I'm not sure about where this new API could go, but a few
> >> options
> >>>>>> might
> >>>>>>> be:
> >>>>>>>>
> >>>>>>>> - Expose a public variant of the existing ClientUtils class
> >>>>>>>> - Add static methods to the ConsumerConfig, ProducerConfig, and
> >>>>>>>> AdminClientConfig classes
> >>>>>>>> - Add those same static methods to the KafkaConsumer,
> >>> KafkaProducer,
> >>>>>> and
> >>>>>>>> KafkaAdminClient classes
> >>>>>>>>
> >>>>>>>> If this seems reasonable, we should probably also specify in the
> >>> KIP
> >>>>>> that
> >>>>>>>> Kafka Connect will leverage this preflight validation logic
> >> before
> >>>>>>>> instantiating any Kafka clients for use by connectors or tasks,
> >>> and
> >>>>>>>> continue to fail fast if there are typos in the
> >> bootstrap.servers
> >>>>>>> property,
> >>>>>>>> or if temporary DNS resolution issues come up.
> >>>>>>>>
> >>>>>>>> [1] -
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> >>>>>>>> [2] -
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>> Chris
> >>>>>>>>
> >>>>>>>> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philipnee@gmail.com
> >>>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Chris,
> >>>>>>>>>
> >>>>>>>>> Thanks for the quick response, and I apologize for the unclear
> >>>>>> wording
> >>>>>>>>> there, I guess "DNS lookup" would be a more appropriate
> >> wording
> >>>>>> here.
> >>>>>>> So
> >>>>>>>>> what I meant there was, to delegate the DNS lookup in the
> >>>>>> constructor
> >>>>>>> to
> >>>>>>>>> the network client poll, and it will happen on the very first
> >>>>>> poll.  I
> >>>>>>>>> guess the logic could look like this:
> >>>>>>>>>
> >>>>>>>>> - if the client has been bootstrapped, do nothing.
> >>>>>>>>> - Otherwise, perform DNS lookup, and acquire the bootstrap
> >>> server
> >>>>>>>> address.
> >>>>>>>>>
> >>>>>>>>> Thanks for the comment there, I'll change up the wording.
> >> Maybe
> >>>>>> revise
> >>>>>>>> it
> >>>>>>>>> as "DNS resolution should occur in the poll...." ?
> >>>>>>>>>
> >>>>>>>>> P
> >>>>>>>>>
> >>>>>>>>> On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> >>>>>> <chrise@aiven.io.invalid
> >>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Philip,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the KIP!
> >>>>>>>>>>
> >>>>>>>>>> QQ: In the "Proposed Changes" section, the KIP states that
> >>>>>>>> "Bootstrapping
> >>>>>>>>>> should now occur in the poll method before attempting to
> >>> update
> >>>>>> the
> >>>>>>>>>> metadata. This includes resolving the addresses and
> >>>> bootstrapping
> >>>>>> the
> >>>>>>>>>> metadata.". By "bootstrapping the metadata" do we mean
> >>> actually
> >>>>>>>>> contacting
> >>>>>>>>>> the bootstrap servers, or just setting some internal state
> >>>>>> related to
> >>>>>>>> the
> >>>>>>>>>> current set of servers that can be contacted for metadata? I
> >>> ask
> >>>>>>>> because
> >>>>>>>>> it
> >>>>>>>>>> seems like the language here implies the former, but if
> >> that's
> >>>> the
> >>>>>>>> case,
> >>>>>>>>>> this is already happening in poll (or at least, the first
> >>>>>> invocation
> >>>>>>> of
> >>>>>>>>>> it), and if it's the latter, it's probably not necessary to
> >>>>>> mention
> >>>>>>> in
> >>>>>>>>> the
> >>>>>>>>>> KIP since it doesn't really impact user-facing behavior. It
> >>> also
> >>>>>>> seems
> >>>>>>>>> like
> >>>>>>>>>> that detail might impact how intertwined this and KIP-899
> >> are,
> >>>>>> though
> >>>>>>>> the
> >>>>>>>>>> similarity could still be superficial either way.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> Chris
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> >>> philipnee@gmail.com
> >>>>>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey Ismael,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the feedback! The proposal is not to retry
> >>>>>> automatically
> >>>>>>>> but
> >>>>>>>>>>> relies on the user polling the NetworkClient (basically,
> >>>>>>>> consumer.poll)
> >>>>>>>>>> to
> >>>>>>>>>>> reattempt the bootstrap. If bootstrapping fails, a
> >>>>>> NetworkException
> >>>>>>>>>>> (retriable) will be thrown.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>> P
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> >>>> ismael@juma.me.uk>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the KIP. Not sure if I missed it, but how
> >> long
> >>>>>> will we
> >>>>>>>>> retry
> >>>>>>>>>>> for
> >>>>>>>>>>>> and when do we give up and propagate the failure to the
> >>>> user?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ismael
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> >>>>>> philipnee@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I want to start a discussion thread about how we can
> >>>> handle
> >>>>>>>> client
> >>>>>>>>>>>>> bootstrap failure due DNS lookup.  This requires a bit
> >>> of
> >>>>>>>>> behavioral
> >>>>>>>>>>>>> change, so a KIP is proposed and attached to this
> >> email.
> >>>>>> Let me
> >>>>>>>>> know
> >>>>>>>>>>> what
> >>>>>>>>>>>>> you think!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *A small remark here*: *As the title of this KIP might
> >>>> sound
> >>>>>>>>>>>>> familiar/similar to KIP-899, it is not the same.*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *In Summary:* I want to propose a KIP to change the
> >>>> existing
> >>>>>>>>>> bootstrap
> >>>>>>>>>>>>> (upon instantiation) strategy because it is reasonable
> >>> to
> >>>>>> allow
> >>>>>>>>>> clients
> >>>>>>>>>>>> to
> >>>>>>>>>>>>> retry
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *KIP: *
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>> Philip
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Kirk True <ki...@kirktrue.pro>.
Hi Philip,

I’m understanding the options proposed as consisting of these questions:

Should we throw an Exception or not?
Where should the DNS resolution/bootstrapping occur—in the constructor, poll, or somewhere else?
Should there be a timeout, and if so, what configuration drives it?

We were seeing instances of this issue when constructing KafkaAdminClient instances. The constructor performs the DNS lookup in that client. I’m not sure if producers and consumers likewise do DNS resolution in their constructors?

I agree that moving the DNS resolution to poll(), it would be hard to distinguish hard failures (host name resolution) from transient network issues. After all, the DNS resolution issue we saw was technically a transient issue.

Is this an accurate summary of the current thinking:

Per Jason's suggestion, introduce a new BootstrapConnectionException
Per Chris’ suggestion, introduce a new API—performInitialDnsResolution()—that an application developer can call to perform a fast-fail check, throwing BootstrapConnectionException
Otherwise, move the DNS resolution to poll()
Handle BootstrapConnectionException failures in poll() similar to how we handle NetworkException today, with retries, timeouts, etc., i.e. we don’t introduce any new configuration
Improve logging to distinguish the DNS resolution case

Thanks,
Kirk

> On Mar 6, 2023, at 9:15 AM, Philip Nee <ph...@gmail.com> wrote:
> 
> Cheers Kafka Community,
> 
> I just wanna give this thread bump, as it has been a bit quiet for the past
> week. I have not updated the KIP based on Chris and Jason's feedback, as I
> would also like to know more about what do people think.
> 
> Jason - Thanks for the suggestion, I think your suggestion makes a lot of
> sense.
> 
> Thanks!
> P
> 
> On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson <ja...@confluent.io.invalid>
> wrote:
> 
>> Hi Philip,
>> 
>>> Having an overall timeout also seems reasonable, but I wonder what should
>> the client do after running out of the time? Should we throw a
>> non-retriable exception (instead of TimeoutExceptoin to stop the client
>> from retrying) and alert the user to examine the config and the DNS server?
>> 
>> Yeah, not sure exactly. I'd probably suggest a
>> `BootstrapConnectionException` or something like that with a clear message
>> indicating the problem. What the user does with it is up to them, but at
>> least it gives them the option to fail their application if that is what
>> they prefer to do in this case. If they catch it and ignore it, I would
>> expect the client to just continue retrying. Logging for bootstrap
>> dns/connection failures will be helpful in any case.
>> 
>> -Jason
>> 
>> 
>> 
>> 
>> 
>> 
>> On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <ph...@gmail.com> wrote:
>> 
>>> Jason:
>>> Thanks for the feedback.  Now giving it a second thought, I think your
>>> suggestion of logging the error might make sense, as I could imagine most
>>> users would just continue to retry, so it might not be necessary to throw
>>> an exception anyway.
>>> Having an overall timeout also seems reasonable, but I wonder what should
>>> the client do after running out of the time? Should we throw a
>>> non-retriable exception (instead of TimeoutExceptoin to stop the client
>>> from retrying) and alert the user to examine the config and the DNS
>> server?
>>> 
>>> Chris:
>>> I feel I still haven't answered your question about the pre-flight check,
>>> as it seems exposing an API might be harder to push through.
>>> 
>>> Thanks!
>>> P
>>> 
>>> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
>>> <ja...@confluent.io.invalid>
>>> wrote:
>>> 
>>>> One more random thought I had just as I pushed send. We're currently
>>>> treating this problem somewhat narrowly by focusing only on the DNS
>>>> resolution of the bootstrap servers. Even if the servers resolve,
>> there's
>>>> no guarantee that they are reachable by the client. Would it make sense
>>> to
>>>> have a timeout which bounds the total time that the client should wait
>> to
>>>> connect to the bootstrap servers? Something like `
>>>> bootstrap.servers.connection.timeout.ms`.
>>>> 
>>>> -Jason
>>>> 
>>>> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io>
>>>> wrote:
>>>> 
>>>>> Hi Philip,
>>>>> 
>>>>> An alternative is not to fail at all. Every other network error is
>>> caught
>>>>> and handled internally in the client. We see this case as different
>>>> because
>>>>> a DNS resolution error may imply misconfiguration. Could it also
>> imply
>>>> that
>>>>> the DNS server is unavailable? I'm not sure why that case should be
>>>> handled
>>>>> differently than if the bootstrap servers themselves are unavailable.
>>>> Would
>>>>> it be enough to log a clear warning in the logs if the bootstrap
>>> servers
>>>>> could not resolve?
>>>>> 
>>>>> On the whole, the current fail-fast approach probably does more good
>>> than
>>>>> bad, but it does seem somewhat inconsistent overall and my guess is
>>> that
>>>>> dynamic environments will become increasingly common. It would be
>> nice
>>> to
>>>>> have a reasonable default behavior which could handle these cases
>>>>> gracefully without any additional logic. In any case, it would be
>> nice
>>> to
>>>>> see this option in the rejected alternatives at least if we do not
>> take
>>>> it.
>>>>> 
>>>>> If we want to take the route of throwing an exception, then I think
>>> we're
>>>>> probably going to need a new configuration since I can't see what a
>>>>> reasonable timeout we would use as a default. The benefit of a
>>>>> configuration is that it would let us retain the current default
>>> behavior
>>>>> with timeout effectively set to 0 and it would also let users
>>> effectively
>>>>> disable the timeout by using a very large value. Otherwise, it seems
>>>> like a
>>>>> potential compatibility break to have a new exception type thrown at
>>> some
>>>>> arbitrary time without giving the user any control over it.
>>>>> 
>>>>> Thanks,
>>>>> Jason
>>>>> 
>>>>> 
>>>>> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton
>> <chrise@aiven.io.invalid
>>>> 
>>>>> wrote:
>>>>> 
>>>>>> Hi Philip,
>>>>>> 
>>>>>> Yeah, it's basically DNS resolution we're talking about, though
>>> there's
>>>>>> some additional subtlety there with the logic introduced by KIP-235
>>> [1].
>>>>>> Essentially it should cover any scenario that causes a client
>>>> constructor
>>>>>> to fail with the current logic but would not after this KIP is
>>> released.
>>>>>> 
>>>>>> We can generalize the Connect use case like this: a client
>> application
>>>>>> that
>>>>>> may connect to different Kafka clusters with a public-facing,
>>>> easy-to-use
>>>>>> API for restarting failed tasks and automatic handling of retriable
>>>>>> exceptions. The ease with which failed tasks can be restarted is
>>>>>> significant because it reduces the cost of failing on non-retriable
>>>>>> exceptions and makes fail-fast behavior easier to work with. And, in
>>>> cases
>>>>>> like this where we can't really know whether the error we're dealing
>>>> with
>>>>>> is retriable or not, it's better IMO to continue to allow
>> applications
>>>>>> like
>>>>>> these to fail fast. I do agree that it'd be nice to get input from
>> the
>>>>>> community, though.
>>>>>> 
>>>>>> I was toying with the idea of a NetworkException subclass too. It's
>> a
>>>>>> simpler API, but it doesn't allow for preflight validation, which
>> can
>>> be
>>>>>> useful in scenarios where submitting new configurations for client
>>>>>> applications is expensive in terms of time or resources. Then
>> again, I
>>>>>> don't see why the two are mutually exclusive, and we might opt to
>> use
>>>> the
>>>>>> NetworkException subclass in this KIP and pursue an opt-in
>> validation
>>>> API
>>>>>> later on. Thoughts?
>>>>>> 
>>>>>> [1] -
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Chris
>>>>>> 
>>>>>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>>> Hey Chris,
>>>>>>> 
>>>>>>> Thanks again for the feedback!
>>>>>>> 
>>>>>>> 
>>>>>>> For the preflight DNS check (are we basically trying to resolve
>> the
>>>> DNS
>>>>>>> there?): Maybe it makes more sense to add it to the Config
>> modules?
>>> I
>>>>>> would
>>>>>>> like to hear what the community says as I'm not familiar with the
>>>>>> Connect
>>>>>>> use case.
>>>>>>> 
>>>>>>> A "slower failing" alternative - I wonder if it makes sense for us
>>> to
>>>>>>> extend the NetworkException so that clients can be smarter at
>>> handling
>>>>>>> these exceptions. Of course, it is still retriable and requires
>>>> polling
>>>>>> the
>>>>>>> consumer, but then we can distinguish the DNS resolution error
>> from
>>>>>> other
>>>>>>> network errors.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> P
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
>>> <chrise@aiven.io.invalid
>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Philip,
>>>>>>>> 
>>>>>>>> Yeah,  "DNS resolution should occur..." seems like a better fit.
>>> 👍
>>>>>>>> 
>>>>>>>> One other question I have is whether we should expose some kind
>> of
>>>>>> public
>>>>>>>> API for performing preflight validation of the bootstrap URLs.
>> If
>>> we
>>>>>>> change
>>>>>>>> the behavior of a client configured with a silly typo (e.g.,
>>>>>>>> "loclahost instead of localhost") from failing in the
>> constructor
>>> to
>>>>>>>> failing with a retriable exception, this might lead some client
>>>>>>>> applications to handle that failure by, well, retrying. For
>>>> reference,
>>>>>>> this
>>>>>>>> is exactly what we do in Kafka Connect right now; see [1] and
>> [2].
>>>> IMO
>>>>>>> it'd
>>>>>>>> be nice to be able to opt into keeping the current behavior so
>>> that
>>>>>>>> projects like Connect could still do preflight checks of the
>>>>>>>> bootstrap.servers property for connectors before starting them,
>>> and
>>>>>>> report
>>>>>>>> any issues by failing fast instead of continuously writing
>>>>>> warning/error
>>>>>>>> messages to their logs.
>>>>>>>> 
>>>>>>>> I'm not sure about where this new API could go, but a few
>> options
>>>>>> might
>>>>>>> be:
>>>>>>>> 
>>>>>>>> - Expose a public variant of the existing ClientUtils class
>>>>>>>> - Add static methods to the ConsumerConfig, ProducerConfig, and
>>>>>>>> AdminClientConfig classes
>>>>>>>> - Add those same static methods to the KafkaConsumer,
>>> KafkaProducer,
>>>>>> and
>>>>>>>> KafkaAdminClient classes
>>>>>>>> 
>>>>>>>> If this seems reasonable, we should probably also specify in the
>>> KIP
>>>>>> that
>>>>>>>> Kafka Connect will leverage this preflight validation logic
>> before
>>>>>>>> instantiating any Kafka clients for use by connectors or tasks,
>>> and
>>>>>>>> continue to fail fast if there are typos in the
>> bootstrap.servers
>>>>>>> property,
>>>>>>>> or if temporary DNS resolution issues come up.
>>>>>>>> 
>>>>>>>> [1] -
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
>>>>>>>> [2] -
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philipnee@gmail.com
>>> 
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hey Chris,
>>>>>>>>> 
>>>>>>>>> Thanks for the quick response, and I apologize for the unclear
>>>>>> wording
>>>>>>>>> there, I guess "DNS lookup" would be a more appropriate
>> wording
>>>>>> here.
>>>>>>> So
>>>>>>>>> what I meant there was, to delegate the DNS lookup in the
>>>>>> constructor
>>>>>>> to
>>>>>>>>> the network client poll, and it will happen on the very first
>>>>>> poll.  I
>>>>>>>>> guess the logic could look like this:
>>>>>>>>> 
>>>>>>>>> - if the client has been bootstrapped, do nothing.
>>>>>>>>> - Otherwise, perform DNS lookup, and acquire the bootstrap
>>> server
>>>>>>>> address.
>>>>>>>>> 
>>>>>>>>> Thanks for the comment there, I'll change up the wording.
>> Maybe
>>>>>> revise
>>>>>>>> it
>>>>>>>>> as "DNS resolution should occur in the poll...." ?
>>>>>>>>> 
>>>>>>>>> P
>>>>>>>>> 
>>>>>>>>> On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
>>>>>> <chrise@aiven.io.invalid
>>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Philip,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the KIP!
>>>>>>>>>> 
>>>>>>>>>> QQ: In the "Proposed Changes" section, the KIP states that
>>>>>>>> "Bootstrapping
>>>>>>>>>> should now occur in the poll method before attempting to
>>> update
>>>>>> the
>>>>>>>>>> metadata. This includes resolving the addresses and
>>>> bootstrapping
>>>>>> the
>>>>>>>>>> metadata.". By "bootstrapping the metadata" do we mean
>>> actually
>>>>>>>>> contacting
>>>>>>>>>> the bootstrap servers, or just setting some internal state
>>>>>> related to
>>>>>>>> the
>>>>>>>>>> current set of servers that can be contacted for metadata? I
>>> ask
>>>>>>>> because
>>>>>>>>> it
>>>>>>>>>> seems like the language here implies the former, but if
>> that's
>>>> the
>>>>>>>> case,
>>>>>>>>>> this is already happening in poll (or at least, the first
>>>>>> invocation
>>>>>>> of
>>>>>>>>>> it), and if it's the latter, it's probably not necessary to
>>>>>> mention
>>>>>>> in
>>>>>>>>> the
>>>>>>>>>> KIP since it doesn't really impact user-facing behavior. It
>>> also
>>>>>>> seems
>>>>>>>>> like
>>>>>>>>>> that detail might impact how intertwined this and KIP-899
>> are,
>>>>>> though
>>>>>>>> the
>>>>>>>>>> similarity could still be superficial either way.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> 
>>>>>>>>>> Chris
>>>>>>>>>> 
>>>>>>>>>> On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
>>> philipnee@gmail.com
>>>>> 
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hey Ismael,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the feedback! The proposal is not to retry
>>>>>> automatically
>>>>>>>> but
>>>>>>>>>>> relies on the user polling the NetworkClient (basically,
>>>>>>>> consumer.poll)
>>>>>>>>>> to
>>>>>>>>>>> reattempt the bootstrap. If bootstrapping fails, a
>>>>>> NetworkException
>>>>>>>>>>> (retriable) will be thrown.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks!
>>>>>>>>>>> P
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
>>>> ismael@juma.me.uk>
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the KIP. Not sure if I missed it, but how
>> long
>>>>>> will we
>>>>>>>>> retry
>>>>>>>>>>> for
>>>>>>>>>>>> and when do we give up and propagate the failure to the
>>>> user?
>>>>>>>>>>>> 
>>>>>>>>>>>> Ismael
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
>>>>>> philipnee@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi all!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I want to start a discussion thread about how we can
>>>> handle
>>>>>>>> client
>>>>>>>>>>>>> bootstrap failure due DNS lookup.  This requires a bit
>>> of
>>>>>>>>> behavioral
>>>>>>>>>>>>> change, so a KIP is proposed and attached to this
>> email.
>>>>>> Let me
>>>>>>>>> know
>>>>>>>>>>> what
>>>>>>>>>>>>> you think!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *A small remark here*: *As the title of this KIP might
>>>> sound
>>>>>>>>>>>>> familiar/similar to KIP-899, it is not the same.*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *In Summary:* I want to propose a KIP to change the
>>>> existing
>>>>>>>>>> bootstrap
>>>>>>>>>>>>> (upon instantiation) strategy because it is reasonable
>>> to
>>>>>> allow
>>>>>>>>>> clients
>>>>>>>>>>>> to
>>>>>>>>>>>>> retry
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *KIP: *
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> Philip
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Cheers Kafka Community,

I just wanna give this thread bump, as it has been a bit quiet for the past
week. I have not updated the KIP based on Chris and Jason's feedback, as I
would also like to know more about what do people think.

Jason - Thanks for the suggestion, I think your suggestion makes a lot of
sense.

Thanks!
P

On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson <ja...@confluent.io.invalid>
wrote:

> Hi Philip,
>
> > Having an overall timeout also seems reasonable, but I wonder what should
> the client do after running out of the time? Should we throw a
> non-retriable exception (instead of TimeoutExceptoin to stop the client
> from retrying) and alert the user to examine the config and the DNS server?
>
> Yeah, not sure exactly. I'd probably suggest a
> `BootstrapConnectionException` or something like that with a clear message
> indicating the problem. What the user does with it is up to them, but at
> least it gives them the option to fail their application if that is what
> they prefer to do in this case. If they catch it and ignore it, I would
> expect the client to just continue retrying. Logging for bootstrap
> dns/connection failures will be helpful in any case.
>
> -Jason
>
>
>
>
>
>
> On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <ph...@gmail.com> wrote:
>
> > Jason:
> > Thanks for the feedback.  Now giving it a second thought, I think your
> > suggestion of logging the error might make sense, as I could imagine most
> > users would just continue to retry, so it might not be necessary to throw
> > an exception anyway.
> > Having an overall timeout also seems reasonable, but I wonder what should
> > the client do after running out of the time? Should we throw a
> > non-retriable exception (instead of TimeoutExceptoin to stop the client
> > from retrying) and alert the user to examine the config and the DNS
> server?
> >
> > Chris:
> > I feel I still haven't answered your question about the pre-flight check,
> > as it seems exposing an API might be harder to push through.
> >
> > Thanks!
> > P
> >
> > On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
> > <ja...@confluent.io.invalid>
> > wrote:
> >
> > > One more random thought I had just as I pushed send. We're currently
> > > treating this problem somewhat narrowly by focusing only on the DNS
> > > resolution of the bootstrap servers. Even if the servers resolve,
> there's
> > > no guarantee that they are reachable by the client. Would it make sense
> > to
> > > have a timeout which bounds the total time that the client should wait
> to
> > > connect to the bootstrap servers? Something like `
> > > bootstrap.servers.connection.timeout.ms`.
> > >
> > > -Jason
> > >
> > > On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Philip,
> > > >
> > > > An alternative is not to fail at all. Every other network error is
> > caught
> > > > and handled internally in the client. We see this case as different
> > > because
> > > > a DNS resolution error may imply misconfiguration. Could it also
> imply
> > > that
> > > > the DNS server is unavailable? I'm not sure why that case should be
> > > handled
> > > > differently than if the bootstrap servers themselves are unavailable.
> > > Would
> > > > it be enough to log a clear warning in the logs if the bootstrap
> > servers
> > > > could not resolve?
> > > >
> > > > On the whole, the current fail-fast approach probably does more good
> > than
> > > > bad, but it does seem somewhat inconsistent overall and my guess is
> > that
> > > > dynamic environments will become increasingly common. It would be
> nice
> > to
> > > > have a reasonable default behavior which could handle these cases
> > > > gracefully without any additional logic. In any case, it would be
> nice
> > to
> > > > see this option in the rejected alternatives at least if we do not
> take
> > > it.
> > > >
> > > > If we want to take the route of throwing an exception, then I think
> > we're
> > > > probably going to need a new configuration since I can't see what a
> > > > reasonable timeout we would use as a default. The benefit of a
> > > > configuration is that it would let us retain the current default
> > behavior
> > > > with timeout effectively set to 0 and it would also let users
> > effectively
> > > > disable the timeout by using a very large value. Otherwise, it seems
> > > like a
> > > > potential compatibility break to have a new exception type thrown at
> > some
> > > > arbitrary time without giving the user any control over it.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > >
> > > > On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton
> <chrise@aiven.io.invalid
> > >
> > > > wrote:
> > > >
> > > >> Hi Philip,
> > > >>
> > > >> Yeah, it's basically DNS resolution we're talking about, though
> > there's
> > > >> some additional subtlety there with the logic introduced by KIP-235
> > [1].
> > > >> Essentially it should cover any scenario that causes a client
> > > constructor
> > > >> to fail with the current logic but would not after this KIP is
> > released.
> > > >>
> > > >> We can generalize the Connect use case like this: a client
> application
> > > >> that
> > > >> may connect to different Kafka clusters with a public-facing,
> > > easy-to-use
> > > >> API for restarting failed tasks and automatic handling of retriable
> > > >> exceptions. The ease with which failed tasks can be restarted is
> > > >> significant because it reduces the cost of failing on non-retriable
> > > >> exceptions and makes fail-fast behavior easier to work with. And, in
> > > cases
> > > >> like this where we can't really know whether the error we're dealing
> > > with
> > > >> is retriable or not, it's better IMO to continue to allow
> applications
> > > >> like
> > > >> these to fail fast. I do agree that it'd be nice to get input from
> the
> > > >> community, though.
> > > >>
> > > >> I was toying with the idea of a NetworkException subclass too. It's
> a
> > > >> simpler API, but it doesn't allow for preflight validation, which
> can
> > be
> > > >> useful in scenarios where submitting new configurations for client
> > > >> applications is expensive in terms of time or resources. Then
> again, I
> > > >> don't see why the two are mutually exclusive, and we might opt to
> use
> > > the
> > > >> NetworkException subclass in this KIP and pursue an opt-in
> validation
> > > API
> > > >> later on. Thoughts?
> > > >>
> > > >> [1] -
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
> > > >>
> > > >> Cheers,
> > > >>
> > > >> Chris
> > > >>
> > > >> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com>
> > wrote:
> > > >>
> > > >> > Hey Chris,
> > > >> >
> > > >> > Thanks again for the feedback!
> > > >> >
> > > >> >
> > > >> > For the preflight DNS check (are we basically trying to resolve
> the
> > > DNS
> > > >> > there?): Maybe it makes more sense to add it to the Config
> modules?
> > I
> > > >> would
> > > >> > like to hear what the community says as I'm not familiar with the
> > > >> Connect
> > > >> > use case.
> > > >> >
> > > >> > A "slower failing" alternative - I wonder if it makes sense for us
> > to
> > > >> > extend the NetworkException so that clients can be smarter at
> > handling
> > > >> > these exceptions. Of course, it is still retriable and requires
> > > polling
> > > >> the
> > > >> > consumer, but then we can distinguish the DNS resolution error
> from
> > > >> other
> > > >> > network errors.
> > > >> >
> > > >> > Thanks!
> > > >> > P
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> > <chrise@aiven.io.invalid
> > > >
> > > >> > wrote:
> > > >> >
> > > >> > > Hi Philip,
> > > >> > >
> > > >> > > Yeah,  "DNS resolution should occur..." seems like a better fit.
> > 👍
> > > >> > >
> > > >> > > One other question I have is whether we should expose some kind
> of
> > > >> public
> > > >> > > API for performing preflight validation of the bootstrap URLs.
> If
> > we
> > > >> > change
> > > >> > > the behavior of a client configured with a silly typo (e.g.,
> > > >> > > "loclahost instead of localhost") from failing in the
> constructor
> > to
> > > >> > > failing with a retriable exception, this might lead some client
> > > >> > > applications to handle that failure by, well, retrying. For
> > > reference,
> > > >> > this
> > > >> > > is exactly what we do in Kafka Connect right now; see [1] and
> [2].
> > > IMO
> > > >> > it'd
> > > >> > > be nice to be able to opt into keeping the current behavior so
> > that
> > > >> > > projects like Connect could still do preflight checks of the
> > > >> > > bootstrap.servers property for connectors before starting them,
> > and
> > > >> > report
> > > >> > > any issues by failing fast instead of continuously writing
> > > >> warning/error
> > > >> > > messages to their logs.
> > > >> > >
> > > >> > > I'm not sure about where this new API could go, but a few
> options
> > > >> might
> > > >> > be:
> > > >> > >
> > > >> > > - Expose a public variant of the existing ClientUtils class
> > > >> > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > >> > > AdminClientConfig classes
> > > >> > > - Add those same static methods to the KafkaConsumer,
> > KafkaProducer,
> > > >> and
> > > >> > > KafkaAdminClient classes
> > > >> > >
> > > >> > > If this seems reasonable, we should probably also specify in the
> > KIP
> > > >> that
> > > >> > > Kafka Connect will leverage this preflight validation logic
> before
> > > >> > > instantiating any Kafka clients for use by connectors or tasks,
> > and
> > > >> > > continue to fail fast if there are typos in the
> bootstrap.servers
> > > >> > property,
> > > >> > > or if temporary DNS resolution issues come up.
> > > >> > >
> > > >> > > [1] -
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > >> > > [2] -
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > > >> > >
> > > >> > > Cheers,
> > > >> > >
> > > >> > > Chris
> > > >> > >
> > > >> > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philipnee@gmail.com
> >
> > > >> wrote:
> > > >> > >
> > > >> > > > Hey Chris,
> > > >> > > >
> > > >> > > > Thanks for the quick response, and I apologize for the unclear
> > > >> wording
> > > >> > > > there, I guess "DNS lookup" would be a more appropriate
> wording
> > > >> here.
> > > >> > So
> > > >> > > > what I meant there was, to delegate the DNS lookup in the
> > > >> constructor
> > > >> > to
> > > >> > > > the network client poll, and it will happen on the very first
> > > >> poll.  I
> > > >> > > > guess the logic could look like this:
> > > >> > > >
> > > >> > > > - if the client has been bootstrapped, do nothing.
> > > >> > > > - Otherwise, perform DNS lookup, and acquire the bootstrap
> > server
> > > >> > > address.
> > > >> > > >
> > > >> > > > Thanks for the comment there, I'll change up the wording.
> Maybe
> > > >> revise
> > > >> > > it
> > > >> > > > as "DNS resolution should occur in the poll...." ?
> > > >> > > >
> > > >> > > > P
> > > >> > > >
> > > >> > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > > >> <chrise@aiven.io.invalid
> > > >> > >
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hi Philip,
> > > >> > > > >
> > > >> > > > > Thanks for the KIP!
> > > >> > > > >
> > > >> > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > >> > > "Bootstrapping
> > > >> > > > > should now occur in the poll method before attempting to
> > update
> > > >> the
> > > >> > > > > metadata. This includes resolving the addresses and
> > > bootstrapping
> > > >> the
> > > >> > > > > metadata.". By "bootstrapping the metadata" do we mean
> > actually
> > > >> > > > contacting
> > > >> > > > > the bootstrap servers, or just setting some internal state
> > > >> related to
> > > >> > > the
> > > >> > > > > current set of servers that can be contacted for metadata? I
> > ask
> > > >> > > because
> > > >> > > > it
> > > >> > > > > seems like the language here implies the former, but if
> that's
> > > the
> > > >> > > case,
> > > >> > > > > this is already happening in poll (or at least, the first
> > > >> invocation
> > > >> > of
> > > >> > > > > it), and if it's the latter, it's probably not necessary to
> > > >> mention
> > > >> > in
> > > >> > > > the
> > > >> > > > > KIP since it doesn't really impact user-facing behavior. It
> > also
> > > >> > seems
> > > >> > > > like
> > > >> > > > > that detail might impact how intertwined this and KIP-899
> are,
> > > >> though
> > > >> > > the
> > > >> > > > > similarity could still be superficial either way.
> > > >> > > > >
> > > >> > > > > Cheers,
> > > >> > > > >
> > > >> > > > > Chris
> > > >> > > > >
> > > >> > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> > philipnee@gmail.com
> > > >
> > > >> > > wrote:
> > > >> > > > >
> > > >> > > > > > Hey Ismael,
> > > >> > > > > >
> > > >> > > > > > Thanks for the feedback! The proposal is not to retry
> > > >> automatically
> > > >> > > but
> > > >> > > > > > relies on the user polling the NetworkClient (basically,
> > > >> > > consumer.poll)
> > > >> > > > > to
> > > >> > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > > >> NetworkException
> > > >> > > > > > (retriable) will be thrown.
> > > >> > > > > >
> > > >> > > > > > Thanks!
> > > >> > > > > > P
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > > ismael@juma.me.uk>
> > > >> > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Thanks for the KIP. Not sure if I missed it, but how
> long
> > > >> will we
> > > >> > > > retry
> > > >> > > > > > for
> > > >> > > > > > > and when do we give up and propagate the failure to the
> > > user?
> > > >> > > > > > >
> > > >> > > > > > > Ismael
> > > >> > > > > > >
> > > >> > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > > >> philipnee@gmail.com>
> > > >> > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Hi all!
> > > >> > > > > > > >
> > > >> > > > > > > > I want to start a discussion thread about how we can
> > > handle
> > > >> > > client
> > > >> > > > > > > > bootstrap failure due DNS lookup.  This requires a bit
> > of
> > > >> > > > behavioral
> > > >> > > > > > > > change, so a KIP is proposed and attached to this
> email.
> > > >> Let me
> > > >> > > > know
> > > >> > > > > > what
> > > >> > > > > > > > you think!
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > *A small remark here*: *As the title of this KIP might
> > > sound
> > > >> > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > >> > > > > > > >
> > > >> > > > > > > > *In Summary:* I want to propose a KIP to change the
> > > existing
> > > >> > > > > bootstrap
> > > >> > > > > > > > (upon instantiation) strategy because it is reasonable
> > to
> > > >> allow
> > > >> > > > > clients
> > > >> > > > > > > to
> > > >> > > > > > > > retry
> > > >> > > > > > > >
> > > >> > > > > > > > *KIP: *
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > >> > > > > > > >
> > > >> > > > > > > > Thanks!
> > > >> > > > > > > > Philip
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Jason Gustafson <ja...@confluent.io.INVALID>.
Hi Philip,

> Having an overall timeout also seems reasonable, but I wonder what should
the client do after running out of the time? Should we throw a
non-retriable exception (instead of TimeoutExceptoin to stop the client
from retrying) and alert the user to examine the config and the DNS server?

Yeah, not sure exactly. I'd probably suggest a
`BootstrapConnectionException` or something like that with a clear message
indicating the problem. What the user does with it is up to them, but at
least it gives them the option to fail their application if that is what
they prefer to do in this case. If they catch it and ignore it, I would
expect the client to just continue retrying. Logging for bootstrap
dns/connection failures will be helpful in any case.

-Jason






On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <ph...@gmail.com> wrote:

> Jason:
> Thanks for the feedback.  Now giving it a second thought, I think your
> suggestion of logging the error might make sense, as I could imagine most
> users would just continue to retry, so it might not be necessary to throw
> an exception anyway.
> Having an overall timeout also seems reasonable, but I wonder what should
> the client do after running out of the time? Should we throw a
> non-retriable exception (instead of TimeoutExceptoin to stop the client
> from retrying) and alert the user to examine the config and the DNS server?
>
> Chris:
> I feel I still haven't answered your question about the pre-flight check,
> as it seems exposing an API might be harder to push through.
>
> Thanks!
> P
>
> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
> <ja...@confluent.io.invalid>
> wrote:
>
> > One more random thought I had just as I pushed send. We're currently
> > treating this problem somewhat narrowly by focusing only on the DNS
> > resolution of the bootstrap servers. Even if the servers resolve, there's
> > no guarantee that they are reachable by the client. Would it make sense
> to
> > have a timeout which bounds the total time that the client should wait to
> > connect to the bootstrap servers? Something like `
> > bootstrap.servers.connection.timeout.ms`.
> >
> > -Jason
> >
> > On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Philip,
> > >
> > > An alternative is not to fail at all. Every other network error is
> caught
> > > and handled internally in the client. We see this case as different
> > because
> > > a DNS resolution error may imply misconfiguration. Could it also imply
> > that
> > > the DNS server is unavailable? I'm not sure why that case should be
> > handled
> > > differently than if the bootstrap servers themselves are unavailable.
> > Would
> > > it be enough to log a clear warning in the logs if the bootstrap
> servers
> > > could not resolve?
> > >
> > > On the whole, the current fail-fast approach probably does more good
> than
> > > bad, but it does seem somewhat inconsistent overall and my guess is
> that
> > > dynamic environments will become increasingly common. It would be nice
> to
> > > have a reasonable default behavior which could handle these cases
> > > gracefully without any additional logic. In any case, it would be nice
> to
> > > see this option in the rejected alternatives at least if we do not take
> > it.
> > >
> > > If we want to take the route of throwing an exception, then I think
> we're
> > > probably going to need a new configuration since I can't see what a
> > > reasonable timeout we would use as a default. The benefit of a
> > > configuration is that it would let us retain the current default
> behavior
> > > with timeout effectively set to 0 and it would also let users
> effectively
> > > disable the timeout by using a very large value. Otherwise, it seems
> > like a
> > > potential compatibility break to have a new exception type thrown at
> some
> > > arbitrary time without giving the user any control over it.
> > >
> > > Thanks,
> > > Jason
> > >
> > >
> > > On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton <chrise@aiven.io.invalid
> >
> > > wrote:
> > >
> > >> Hi Philip,
> > >>
> > >> Yeah, it's basically DNS resolution we're talking about, though
> there's
> > >> some additional subtlety there with the logic introduced by KIP-235
> [1].
> > >> Essentially it should cover any scenario that causes a client
> > constructor
> > >> to fail with the current logic but would not after this KIP is
> released.
> > >>
> > >> We can generalize the Connect use case like this: a client application
> > >> that
> > >> may connect to different Kafka clusters with a public-facing,
> > easy-to-use
> > >> API for restarting failed tasks and automatic handling of retriable
> > >> exceptions. The ease with which failed tasks can be restarted is
> > >> significant because it reduces the cost of failing on non-retriable
> > >> exceptions and makes fail-fast behavior easier to work with. And, in
> > cases
> > >> like this where we can't really know whether the error we're dealing
> > with
> > >> is retriable or not, it's better IMO to continue to allow applications
> > >> like
> > >> these to fail fast. I do agree that it'd be nice to get input from the
> > >> community, though.
> > >>
> > >> I was toying with the idea of a NetworkException subclass too. It's a
> > >> simpler API, but it doesn't allow for preflight validation, which can
> be
> > >> useful in scenarios where submitting new configurations for client
> > >> applications is expensive in terms of time or resources. Then again, I
> > >> don't see why the two are mutually exclusive, and we might opt to use
> > the
> > >> NetworkException subclass in this KIP and pursue an opt-in validation
> > API
> > >> later on. Thoughts?
> > >>
> > >> [1] -
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
> > >>
> > >> Cheers,
> > >>
> > >> Chris
> > >>
> > >> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com>
> wrote:
> > >>
> > >> > Hey Chris,
> > >> >
> > >> > Thanks again for the feedback!
> > >> >
> > >> >
> > >> > For the preflight DNS check (are we basically trying to resolve the
> > DNS
> > >> > there?): Maybe it makes more sense to add it to the Config modules?
> I
> > >> would
> > >> > like to hear what the community says as I'm not familiar with the
> > >> Connect
> > >> > use case.
> > >> >
> > >> > A "slower failing" alternative - I wonder if it makes sense for us
> to
> > >> > extend the NetworkException so that clients can be smarter at
> handling
> > >> > these exceptions. Of course, it is still retriable and requires
> > polling
> > >> the
> > >> > consumer, but then we can distinguish the DNS resolution error from
> > >> other
> > >> > network errors.
> > >> >
> > >> > Thanks!
> > >> > P
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
> <chrise@aiven.io.invalid
> > >
> > >> > wrote:
> > >> >
> > >> > > Hi Philip,
> > >> > >
> > >> > > Yeah,  "DNS resolution should occur..." seems like a better fit.
> 👍
> > >> > >
> > >> > > One other question I have is whether we should expose some kind of
> > >> public
> > >> > > API for performing preflight validation of the bootstrap URLs. If
> we
> > >> > change
> > >> > > the behavior of a client configured with a silly typo (e.g.,
> > >> > > "loclahost instead of localhost") from failing in the constructor
> to
> > >> > > failing with a retriable exception, this might lead some client
> > >> > > applications to handle that failure by, well, retrying. For
> > reference,
> > >> > this
> > >> > > is exactly what we do in Kafka Connect right now; see [1] and [2].
> > IMO
> > >> > it'd
> > >> > > be nice to be able to opt into keeping the current behavior so
> that
> > >> > > projects like Connect could still do preflight checks of the
> > >> > > bootstrap.servers property for connectors before starting them,
> and
> > >> > report
> > >> > > any issues by failing fast instead of continuously writing
> > >> warning/error
> > >> > > messages to their logs.
> > >> > >
> > >> > > I'm not sure about where this new API could go, but a few options
> > >> might
> > >> > be:
> > >> > >
> > >> > > - Expose a public variant of the existing ClientUtils class
> > >> > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > >> > > AdminClientConfig classes
> > >> > > - Add those same static methods to the KafkaConsumer,
> KafkaProducer,
> > >> and
> > >> > > KafkaAdminClient classes
> > >> > >
> > >> > > If this seems reasonable, we should probably also specify in the
> KIP
> > >> that
> > >> > > Kafka Connect will leverage this preflight validation logic before
> > >> > > instantiating any Kafka clients for use by connectors or tasks,
> and
> > >> > > continue to fail fast if there are typos in the bootstrap.servers
> > >> > property,
> > >> > > or if temporary DNS resolution issues come up.
> > >> > >
> > >> > > [1] -
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > >> > > [2] -
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > >> > >
> > >> > > Cheers,
> > >> > >
> > >> > > Chris
> > >> > >
> > >> > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hey Chris,
> > >> > > >
> > >> > > > Thanks for the quick response, and I apologize for the unclear
> > >> wording
> > >> > > > there, I guess "DNS lookup" would be a more appropriate wording
> > >> here.
> > >> > So
> > >> > > > what I meant there was, to delegate the DNS lookup in the
> > >> constructor
> > >> > to
> > >> > > > the network client poll, and it will happen on the very first
> > >> poll.  I
> > >> > > > guess the logic could look like this:
> > >> > > >
> > >> > > > - if the client has been bootstrapped, do nothing.
> > >> > > > - Otherwise, perform DNS lookup, and acquire the bootstrap
> server
> > >> > > address.
> > >> > > >
> > >> > > > Thanks for the comment there, I'll change up the wording.  Maybe
> > >> revise
> > >> > > it
> > >> > > > as "DNS resolution should occur in the poll...." ?
> > >> > > >
> > >> > > > P
> > >> > > >
> > >> > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> > >> <chrise@aiven.io.invalid
> > >> > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi Philip,
> > >> > > > >
> > >> > > > > Thanks for the KIP!
> > >> > > > >
> > >> > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > >> > > "Bootstrapping
> > >> > > > > should now occur in the poll method before attempting to
> update
> > >> the
> > >> > > > > metadata. This includes resolving the addresses and
> > bootstrapping
> > >> the
> > >> > > > > metadata.". By "bootstrapping the metadata" do we mean
> actually
> > >> > > > contacting
> > >> > > > > the bootstrap servers, or just setting some internal state
> > >> related to
> > >> > > the
> > >> > > > > current set of servers that can be contacted for metadata? I
> ask
> > >> > > because
> > >> > > > it
> > >> > > > > seems like the language here implies the former, but if that's
> > the
> > >> > > case,
> > >> > > > > this is already happening in poll (or at least, the first
> > >> invocation
> > >> > of
> > >> > > > > it), and if it's the latter, it's probably not necessary to
> > >> mention
> > >> > in
> > >> > > > the
> > >> > > > > KIP since it doesn't really impact user-facing behavior. It
> also
> > >> > seems
> > >> > > > like
> > >> > > > > that detail might impact how intertwined this and KIP-899 are,
> > >> though
> > >> > > the
> > >> > > > > similarity could still be superficial either way.
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > >
> > >> > > > > Chris
> > >> > > > >
> > >> > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
> philipnee@gmail.com
> > >
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > Hey Ismael,
> > >> > > > > >
> > >> > > > > > Thanks for the feedback! The proposal is not to retry
> > >> automatically
> > >> > > but
> > >> > > > > > relies on the user polling the NetworkClient (basically,
> > >> > > consumer.poll)
> > >> > > > > to
> > >> > > > > > reattempt the bootstrap. If bootstrapping fails, a
> > >> NetworkException
> > >> > > > > > (retriable) will be thrown.
> > >> > > > > >
> > >> > > > > > Thanks!
> > >> > > > > > P
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> > ismael@juma.me.uk>
> > >> > > wrote:
> > >> > > > > >
> > >> > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> > >> will we
> > >> > > > retry
> > >> > > > > > for
> > >> > > > > > > and when do we give up and propagate the failure to the
> > user?
> > >> > > > > > >
> > >> > > > > > > Ismael
> > >> > > > > > >
> > >> > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> > >> philipnee@gmail.com>
> > >> > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi all!
> > >> > > > > > > >
> > >> > > > > > > > I want to start a discussion thread about how we can
> > handle
> > >> > > client
> > >> > > > > > > > bootstrap failure due DNS lookup.  This requires a bit
> of
> > >> > > > behavioral
> > >> > > > > > > > change, so a KIP is proposed and attached to this email.
> > >> Let me
> > >> > > > know
> > >> > > > > > what
> > >> > > > > > > > you think!
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > *A small remark here*: *As the title of this KIP might
> > sound
> > >> > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > >> > > > > > > >
> > >> > > > > > > > *In Summary:* I want to propose a KIP to change the
> > existing
> > >> > > > > bootstrap
> > >> > > > > > > > (upon instantiation) strategy because it is reasonable
> to
> > >> allow
> > >> > > > > clients
> > >> > > > > > > to
> > >> > > > > > > > retry
> > >> > > > > > > >
> > >> > > > > > > > *KIP: *
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > >> > > > > > > >
> > >> > > > > > > > Thanks!
> > >> > > > > > > > Philip
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Jason:
Thanks for the feedback.  Now giving it a second thought, I think your
suggestion of logging the error might make sense, as I could imagine most
users would just continue to retry, so it might not be necessary to throw
an exception anyway.
Having an overall timeout also seems reasonable, but I wonder what should
the client do after running out of the time? Should we throw a
non-retriable exception (instead of TimeoutExceptoin to stop the client
from retrying) and alert the user to examine the config and the DNS server?

Chris:
I feel I still haven't answered your question about the pre-flight check,
as it seems exposing an API might be harder to push through.

Thanks!
P

On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson <ja...@confluent.io.invalid>
wrote:

> One more random thought I had just as I pushed send. We're currently
> treating this problem somewhat narrowly by focusing only on the DNS
> resolution of the bootstrap servers. Even if the servers resolve, there's
> no guarantee that they are reachable by the client. Would it make sense to
> have a timeout which bounds the total time that the client should wait to
> connect to the bootstrap servers? Something like `
> bootstrap.servers.connection.timeout.ms`.
>
> -Jason
>
> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Philip,
> >
> > An alternative is not to fail at all. Every other network error is caught
> > and handled internally in the client. We see this case as different
> because
> > a DNS resolution error may imply misconfiguration. Could it also imply
> that
> > the DNS server is unavailable? I'm not sure why that case should be
> handled
> > differently than if the bootstrap servers themselves are unavailable.
> Would
> > it be enough to log a clear warning in the logs if the bootstrap servers
> > could not resolve?
> >
> > On the whole, the current fail-fast approach probably does more good than
> > bad, but it does seem somewhat inconsistent overall and my guess is that
> > dynamic environments will become increasingly common. It would be nice to
> > have a reasonable default behavior which could handle these cases
> > gracefully without any additional logic. In any case, it would be nice to
> > see this option in the rejected alternatives at least if we do not take
> it.
> >
> > If we want to take the route of throwing an exception, then I think we're
> > probably going to need a new configuration since I can't see what a
> > reasonable timeout we would use as a default. The benefit of a
> > configuration is that it would let us retain the current default behavior
> > with timeout effectively set to 0 and it would also let users effectively
> > disable the timeout by using a very large value. Otherwise, it seems
> like a
> > potential compatibility break to have a new exception type thrown at some
> > arbitrary time without giving the user any control over it.
> >
> > Thanks,
> > Jason
> >
> >
> > On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> >> Hi Philip,
> >>
> >> Yeah, it's basically DNS resolution we're talking about, though there's
> >> some additional subtlety there with the logic introduced by KIP-235 [1].
> >> Essentially it should cover any scenario that causes a client
> constructor
> >> to fail with the current logic but would not after this KIP is released.
> >>
> >> We can generalize the Connect use case like this: a client application
> >> that
> >> may connect to different Kafka clusters with a public-facing,
> easy-to-use
> >> API for restarting failed tasks and automatic handling of retriable
> >> exceptions. The ease with which failed tasks can be restarted is
> >> significant because it reduces the cost of failing on non-retriable
> >> exceptions and makes fail-fast behavior easier to work with. And, in
> cases
> >> like this where we can't really know whether the error we're dealing
> with
> >> is retriable or not, it's better IMO to continue to allow applications
> >> like
> >> these to fail fast. I do agree that it'd be nice to get input from the
> >> community, though.
> >>
> >> I was toying with the idea of a NetworkException subclass too. It's a
> >> simpler API, but it doesn't allow for preflight validation, which can be
> >> useful in scenarios where submitting new configurations for client
> >> applications is expensive in terms of time or resources. Then again, I
> >> don't see why the two are mutually exclusive, and we might opt to use
> the
> >> NetworkException subclass in this KIP and pursue an opt-in validation
> API
> >> later on. Thoughts?
> >>
> >> [1] -
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
> >>
> >> Cheers,
> >>
> >> Chris
> >>
> >> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com> wrote:
> >>
> >> > Hey Chris,
> >> >
> >> > Thanks again for the feedback!
> >> >
> >> >
> >> > For the preflight DNS check (are we basically trying to resolve the
> DNS
> >> > there?): Maybe it makes more sense to add it to the Config modules? I
> >> would
> >> > like to hear what the community says as I'm not familiar with the
> >> Connect
> >> > use case.
> >> >
> >> > A "slower failing" alternative - I wonder if it makes sense for us to
> >> > extend the NetworkException so that clients can be smarter at handling
> >> > these exceptions. Of course, it is still retriable and requires
> polling
> >> the
> >> > consumer, but then we can distinguish the DNS resolution error from
> >> other
> >> > network errors.
> >> >
> >> > Thanks!
> >> > P
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <chrise@aiven.io.invalid
> >
> >> > wrote:
> >> >
> >> > > Hi Philip,
> >> > >
> >> > > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> >> > >
> >> > > One other question I have is whether we should expose some kind of
> >> public
> >> > > API for performing preflight validation of the bootstrap URLs. If we
> >> > change
> >> > > the behavior of a client configured with a silly typo (e.g.,
> >> > > "loclahost instead of localhost") from failing in the constructor to
> >> > > failing with a retriable exception, this might lead some client
> >> > > applications to handle that failure by, well, retrying. For
> reference,
> >> > this
> >> > > is exactly what we do in Kafka Connect right now; see [1] and [2].
> IMO
> >> > it'd
> >> > > be nice to be able to opt into keeping the current behavior so that
> >> > > projects like Connect could still do preflight checks of the
> >> > > bootstrap.servers property for connectors before starting them, and
> >> > report
> >> > > any issues by failing fast instead of continuously writing
> >> warning/error
> >> > > messages to their logs.
> >> > >
> >> > > I'm not sure about where this new API could go, but a few options
> >> might
> >> > be:
> >> > >
> >> > > - Expose a public variant of the existing ClientUtils class
> >> > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> >> > > AdminClientConfig classes
> >> > > - Add those same static methods to the KafkaConsumer, KafkaProducer,
> >> and
> >> > > KafkaAdminClient classes
> >> > >
> >> > > If this seems reasonable, we should probably also specify in the KIP
> >> that
> >> > > Kafka Connect will leverage this preflight validation logic before
> >> > > instantiating any Kafka clients for use by connectors or tasks, and
> >> > > continue to fail fast if there are typos in the bootstrap.servers
> >> > property,
> >> > > or if temporary DNS resolution issues come up.
> >> > >
> >> > > [1] -
> >> > >
> >> > >
> >> >
> >>
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> >> > > [2] -
> >> > >
> >> > >
> >> >
> >>
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> >> > >
> >> > > Cheers,
> >> > >
> >> > > Chris
> >> > >
> >> > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com>
> >> wrote:
> >> > >
> >> > > > Hey Chris,
> >> > > >
> >> > > > Thanks for the quick response, and I apologize for the unclear
> >> wording
> >> > > > there, I guess "DNS lookup" would be a more appropriate wording
> >> here.
> >> > So
> >> > > > what I meant there was, to delegate the DNS lookup in the
> >> constructor
> >> > to
> >> > > > the network client poll, and it will happen on the very first
> >> poll.  I
> >> > > > guess the logic could look like this:
> >> > > >
> >> > > > - if the client has been bootstrapped, do nothing.
> >> > > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> >> > > address.
> >> > > >
> >> > > > Thanks for the comment there, I'll change up the wording.  Maybe
> >> revise
> >> > > it
> >> > > > as "DNS resolution should occur in the poll...." ?
> >> > > >
> >> > > > P
> >> > > >
> >> > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> >> <chrise@aiven.io.invalid
> >> > >
> >> > > > wrote:
> >> > > >
> >> > > > > Hi Philip,
> >> > > > >
> >> > > > > Thanks for the KIP!
> >> > > > >
> >> > > > > QQ: In the "Proposed Changes" section, the KIP states that
> >> > > "Bootstrapping
> >> > > > > should now occur in the poll method before attempting to update
> >> the
> >> > > > > metadata. This includes resolving the addresses and
> bootstrapping
> >> the
> >> > > > > metadata.". By "bootstrapping the metadata" do we mean actually
> >> > > > contacting
> >> > > > > the bootstrap servers, or just setting some internal state
> >> related to
> >> > > the
> >> > > > > current set of servers that can be contacted for metadata? I ask
> >> > > because
> >> > > > it
> >> > > > > seems like the language here implies the former, but if that's
> the
> >> > > case,
> >> > > > > this is already happening in poll (or at least, the first
> >> invocation
> >> > of
> >> > > > > it), and if it's the latter, it's probably not necessary to
> >> mention
> >> > in
> >> > > > the
> >> > > > > KIP since it doesn't really impact user-facing behavior. It also
> >> > seems
> >> > > > like
> >> > > > > that detail might impact how intertwined this and KIP-899 are,
> >> though
> >> > > the
> >> > > > > similarity could still be superficial either way.
> >> > > > >
> >> > > > > Cheers,
> >> > > > >
> >> > > > > Chris
> >> > > > >
> >> > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <philipnee@gmail.com
> >
> >> > > wrote:
> >> > > > >
> >> > > > > > Hey Ismael,
> >> > > > > >
> >> > > > > > Thanks for the feedback! The proposal is not to retry
> >> automatically
> >> > > but
> >> > > > > > relies on the user polling the NetworkClient (basically,
> >> > > consumer.poll)
> >> > > > > to
> >> > > > > > reattempt the bootstrap. If bootstrapping fails, a
> >> NetworkException
> >> > > > > > (retriable) will be thrown.
> >> > > > > >
> >> > > > > > Thanks!
> >> > > > > > P
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
> ismael@juma.me.uk>
> >> > > wrote:
> >> > > > > >
> >> > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
> >> will we
> >> > > > retry
> >> > > > > > for
> >> > > > > > > and when do we give up and propagate the failure to the
> user?
> >> > > > > > >
> >> > > > > > > Ismael
> >> > > > > > >
> >> > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> >> philipnee@gmail.com>
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi all!
> >> > > > > > > >
> >> > > > > > > > I want to start a discussion thread about how we can
> handle
> >> > > client
> >> > > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> >> > > > behavioral
> >> > > > > > > > change, so a KIP is proposed and attached to this email.
> >> Let me
> >> > > > know
> >> > > > > > what
> >> > > > > > > > you think!
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > *A small remark here*: *As the title of this KIP might
> sound
> >> > > > > > > > familiar/similar to KIP-899, it is not the same.*
> >> > > > > > > >
> >> > > > > > > > *In Summary:* I want to propose a KIP to change the
> existing
> >> > > > > bootstrap
> >> > > > > > > > (upon instantiation) strategy because it is reasonable to
> >> allow
> >> > > > > clients
> >> > > > > > > to
> >> > > > > > > > retry
> >> > > > > > > >
> >> > > > > > > > *KIP: *
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> >> > > > > > > >
> >> > > > > > > > Thanks!
> >> > > > > > > > Philip
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Jason Gustafson <ja...@confluent.io.INVALID>.
One more random thought I had just as I pushed send. We're currently
treating this problem somewhat narrowly by focusing only on the DNS
resolution of the bootstrap servers. Even if the servers resolve, there's
no guarantee that they are reachable by the client. Would it make sense to
have a timeout which bounds the total time that the client should wait to
connect to the bootstrap servers? Something like `
bootstrap.servers.connection.timeout.ms`.

-Jason

On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Philip,
>
> An alternative is not to fail at all. Every other network error is caught
> and handled internally in the client. We see this case as different because
> a DNS resolution error may imply misconfiguration. Could it also imply that
> the DNS server is unavailable? I'm not sure why that case should be handled
> differently than if the bootstrap servers themselves are unavailable. Would
> it be enough to log a clear warning in the logs if the bootstrap servers
> could not resolve?
>
> On the whole, the current fail-fast approach probably does more good than
> bad, but it does seem somewhat inconsistent overall and my guess is that
> dynamic environments will become increasingly common. It would be nice to
> have a reasonable default behavior which could handle these cases
> gracefully without any additional logic. In any case, it would be nice to
> see this option in the rejected alternatives at least if we do not take it.
>
> If we want to take the route of throwing an exception, then I think we're
> probably going to need a new configuration since I can't see what a
> reasonable timeout we would use as a default. The benefit of a
> configuration is that it would let us retain the current default behavior
> with timeout effectively set to 0 and it would also let users effectively
> disable the timeout by using a very large value. Otherwise, it seems like a
> potential compatibility break to have a new exception type thrown at some
> arbitrary time without giving the user any control over it.
>
> Thanks,
> Jason
>
>
> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
>> Hi Philip,
>>
>> Yeah, it's basically DNS resolution we're talking about, though there's
>> some additional subtlety there with the logic introduced by KIP-235 [1].
>> Essentially it should cover any scenario that causes a client constructor
>> to fail with the current logic but would not after this KIP is released.
>>
>> We can generalize the Connect use case like this: a client application
>> that
>> may connect to different Kafka clusters with a public-facing, easy-to-use
>> API for restarting failed tasks and automatic handling of retriable
>> exceptions. The ease with which failed tasks can be restarted is
>> significant because it reduces the cost of failing on non-retriable
>> exceptions and makes fail-fast behavior easier to work with. And, in cases
>> like this where we can't really know whether the error we're dealing with
>> is retriable or not, it's better IMO to continue to allow applications
>> like
>> these to fail fast. I do agree that it'd be nice to get input from the
>> community, though.
>>
>> I was toying with the idea of a NetworkException subclass too. It's a
>> simpler API, but it doesn't allow for preflight validation, which can be
>> useful in scenarios where submitting new configurations for client
>> applications is expensive in terms of time or resources. Then again, I
>> don't see why the two are mutually exclusive, and we might opt to use the
>> NetworkException subclass in this KIP and pursue an opt-in validation API
>> later on. Thoughts?
>>
>> [1] -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
>>
>> Cheers,
>>
>> Chris
>>
>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com> wrote:
>>
>> > Hey Chris,
>> >
>> > Thanks again for the feedback!
>> >
>> >
>> > For the preflight DNS check (are we basically trying to resolve the DNS
>> > there?): Maybe it makes more sense to add it to the Config modules? I
>> would
>> > like to hear what the community says as I'm not familiar with the
>> Connect
>> > use case.
>> >
>> > A "slower failing" alternative - I wonder if it makes sense for us to
>> > extend the NetworkException so that clients can be smarter at handling
>> > these exceptions. Of course, it is still retriable and requires polling
>> the
>> > consumer, but then we can distinguish the DNS resolution error from
>> other
>> > network errors.
>> >
>> > Thanks!
>> > P
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <ch...@aiven.io.invalid>
>> > wrote:
>> >
>> > > Hi Philip,
>> > >
>> > > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
>> > >
>> > > One other question I have is whether we should expose some kind of
>> public
>> > > API for performing preflight validation of the bootstrap URLs. If we
>> > change
>> > > the behavior of a client configured with a silly typo (e.g.,
>> > > "loclahost instead of localhost") from failing in the constructor to
>> > > failing with a retriable exception, this might lead some client
>> > > applications to handle that failure by, well, retrying. For reference,
>> > this
>> > > is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO
>> > it'd
>> > > be nice to be able to opt into keeping the current behavior so that
>> > > projects like Connect could still do preflight checks of the
>> > > bootstrap.servers property for connectors before starting them, and
>> > report
>> > > any issues by failing fast instead of continuously writing
>> warning/error
>> > > messages to their logs.
>> > >
>> > > I'm not sure about where this new API could go, but a few options
>> might
>> > be:
>> > >
>> > > - Expose a public variant of the existing ClientUtils class
>> > > - Add static methods to the ConsumerConfig, ProducerConfig, and
>> > > AdminClientConfig classes
>> > > - Add those same static methods to the KafkaConsumer, KafkaProducer,
>> and
>> > > KafkaAdminClient classes
>> > >
>> > > If this seems reasonable, we should probably also specify in the KIP
>> that
>> > > Kafka Connect will leverage this preflight validation logic before
>> > > instantiating any Kafka clients for use by connectors or tasks, and
>> > > continue to fail fast if there are typos in the bootstrap.servers
>> > property,
>> > > or if temporary DNS resolution issues come up.
>> > >
>> > > [1] -
>> > >
>> > >
>> >
>> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
>> > > [2] -
>> > >
>> > >
>> >
>> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
>> > >
>> > > Cheers,
>> > >
>> > > Chris
>> > >
>> > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com>
>> wrote:
>> > >
>> > > > Hey Chris,
>> > > >
>> > > > Thanks for the quick response, and I apologize for the unclear
>> wording
>> > > > there, I guess "DNS lookup" would be a more appropriate wording
>> here.
>> > So
>> > > > what I meant there was, to delegate the DNS lookup in the
>> constructor
>> > to
>> > > > the network client poll, and it will happen on the very first
>> poll.  I
>> > > > guess the logic could look like this:
>> > > >
>> > > > - if the client has been bootstrapped, do nothing.
>> > > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
>> > > address.
>> > > >
>> > > > Thanks for the comment there, I'll change up the wording.  Maybe
>> revise
>> > > it
>> > > > as "DNS resolution should occur in the poll...." ?
>> > > >
>> > > > P
>> > > >
>> > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
>> <chrise@aiven.io.invalid
>> > >
>> > > > wrote:
>> > > >
>> > > > > Hi Philip,
>> > > > >
>> > > > > Thanks for the KIP!
>> > > > >
>> > > > > QQ: In the "Proposed Changes" section, the KIP states that
>> > > "Bootstrapping
>> > > > > should now occur in the poll method before attempting to update
>> the
>> > > > > metadata. This includes resolving the addresses and bootstrapping
>> the
>> > > > > metadata.". By "bootstrapping the metadata" do we mean actually
>> > > > contacting
>> > > > > the bootstrap servers, or just setting some internal state
>> related to
>> > > the
>> > > > > current set of servers that can be contacted for metadata? I ask
>> > > because
>> > > > it
>> > > > > seems like the language here implies the former, but if that's the
>> > > case,
>> > > > > this is already happening in poll (or at least, the first
>> invocation
>> > of
>> > > > > it), and if it's the latter, it's probably not necessary to
>> mention
>> > in
>> > > > the
>> > > > > KIP since it doesn't really impact user-facing behavior. It also
>> > seems
>> > > > like
>> > > > > that detail might impact how intertwined this and KIP-899 are,
>> though
>> > > the
>> > > > > similarity could still be superficial either way.
>> > > > >
>> > > > > Cheers,
>> > > > >
>> > > > > Chris
>> > > > >
>> > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Hey Ismael,
>> > > > > >
>> > > > > > Thanks for the feedback! The proposal is not to retry
>> automatically
>> > > but
>> > > > > > relies on the user polling the NetworkClient (basically,
>> > > consumer.poll)
>> > > > > to
>> > > > > > reattempt the bootstrap. If bootstrapping fails, a
>> NetworkException
>> > > > > > (retriable) will be thrown.
>> > > > > >
>> > > > > > Thanks!
>> > > > > > P
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk>
>> > > wrote:
>> > > > > >
>> > > > > > > Thanks for the KIP. Not sure if I missed it, but how long
>> will we
>> > > > retry
>> > > > > > for
>> > > > > > > and when do we give up and propagate the failure to the user?
>> > > > > > >
>> > > > > > > Ismael
>> > > > > > >
>> > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
>> philipnee@gmail.com>
>> > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi all!
>> > > > > > > >
>> > > > > > > > I want to start a discussion thread about how we can handle
>> > > client
>> > > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
>> > > > behavioral
>> > > > > > > > change, so a KIP is proposed and attached to this email.
>> Let me
>> > > > know
>> > > > > > what
>> > > > > > > > you think!
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > *A small remark here*: *As the title of this KIP might sound
>> > > > > > > > familiar/similar to KIP-899, it is not the same.*
>> > > > > > > >
>> > > > > > > > *In Summary:* I want to propose a KIP to change the existing
>> > > > > bootstrap
>> > > > > > > > (upon instantiation) strategy because it is reasonable to
>> allow
>> > > > > clients
>> > > > > > > to
>> > > > > > > > retry
>> > > > > > > >
>> > > > > > > > *KIP: *
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
>> > > > > > > >
>> > > > > > > > Thanks!
>> > > > > > > > Philip
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Jason Gustafson <ja...@confluent.io.INVALID>.
Hi Philip,

An alternative is not to fail at all. Every other network error is caught
and handled internally in the client. We see this case as different because
a DNS resolution error may imply misconfiguration. Could it also imply that
the DNS server is unavailable? I'm not sure why that case should be handled
differently than if the bootstrap servers themselves are unavailable. Would
it be enough to log a clear warning in the logs if the bootstrap servers
could not resolve?

On the whole, the current fail-fast approach probably does more good than
bad, but it does seem somewhat inconsistent overall and my guess is that
dynamic environments will become increasingly common. It would be nice to
have a reasonable default behavior which could handle these cases
gracefully without any additional logic. In any case, it would be nice to
see this option in the rejected alternatives at least if we do not take it.

If we want to take the route of throwing an exception, then I think we're
probably going to need a new configuration since I can't see what a
reasonable timeout we would use as a default. The benefit of a
configuration is that it would let us retain the current default behavior
with timeout effectively set to 0 and it would also let users effectively
disable the timeout by using a very large value. Otherwise, it seems like a
potential compatibility break to have a new exception type thrown at some
arbitrary time without giving the user any control over it.

Thanks,
Jason


On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton <ch...@aiven.io.invalid>
wrote:

> Hi Philip,
>
> Yeah, it's basically DNS resolution we're talking about, though there's
> some additional subtlety there with the logic introduced by KIP-235 [1].
> Essentially it should cover any scenario that causes a client constructor
> to fail with the current logic but would not after this KIP is released.
>
> We can generalize the Connect use case like this: a client application that
> may connect to different Kafka clusters with a public-facing, easy-to-use
> API for restarting failed tasks and automatic handling of retriable
> exceptions. The ease with which failed tasks can be restarted is
> significant because it reduces the cost of failing on non-retriable
> exceptions and makes fail-fast behavior easier to work with. And, in cases
> like this where we can't really know whether the error we're dealing with
> is retriable or not, it's better IMO to continue to allow applications like
> these to fail fast. I do agree that it'd be nice to get input from the
> community, though.
>
> I was toying with the idea of a NetworkException subclass too. It's a
> simpler API, but it doesn't allow for preflight validation, which can be
> useful in scenarios where submitting new configurations for client
> applications is expensive in terms of time or resources. Then again, I
> don't see why the two are mutually exclusive, and we might opt to use the
> NetworkException subclass in this KIP and pursue an opt-in validation API
> later on. Thoughts?
>
> [1] -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
>
> Cheers,
>
> Chris
>
> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com> wrote:
>
> > Hey Chris,
> >
> > Thanks again for the feedback!
> >
> >
> > For the preflight DNS check (are we basically trying to resolve the DNS
> > there?): Maybe it makes more sense to add it to the Config modules? I
> would
> > like to hear what the community says as I'm not familiar with the Connect
> > use case.
> >
> > A "slower failing" alternative - I wonder if it makes sense for us to
> > extend the NetworkException so that clients can be smarter at handling
> > these exceptions. Of course, it is still retriable and requires polling
> the
> > consumer, but then we can distinguish the DNS resolution error from other
> > network errors.
> >
> > Thanks!
> > P
> >
> >
> >
> >
> >
> > On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Philip,
> > >
> > > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> > >
> > > One other question I have is whether we should expose some kind of
> public
> > > API for performing preflight validation of the bootstrap URLs. If we
> > change
> > > the behavior of a client configured with a silly typo (e.g.,
> > > "loclahost instead of localhost") from failing in the constructor to
> > > failing with a retriable exception, this might lead some client
> > > applications to handle that failure by, well, retrying. For reference,
> > this
> > > is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO
> > it'd
> > > be nice to be able to opt into keeping the current behavior so that
> > > projects like Connect could still do preflight checks of the
> > > bootstrap.servers property for connectors before starting them, and
> > report
> > > any issues by failing fast instead of continuously writing
> warning/error
> > > messages to their logs.
> > >
> > > I'm not sure about where this new API could go, but a few options might
> > be:
> > >
> > > - Expose a public variant of the existing ClientUtils class
> > > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > > AdminClientConfig classes
> > > - Add those same static methods to the KafkaConsumer, KafkaProducer,
> and
> > > KafkaAdminClient classes
> > >
> > > If this seems reasonable, we should probably also specify in the KIP
> that
> > > Kafka Connect will leverage this preflight validation logic before
> > > instantiating any Kafka clients for use by connectors or tasks, and
> > > continue to fail fast if there are typos in the bootstrap.servers
> > property,
> > > or if temporary DNS resolution issues come up.
> > >
> > > [1] -
> > >
> > >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > > [2] -
> > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com>
> wrote:
> > >
> > > > Hey Chris,
> > > >
> > > > Thanks for the quick response, and I apologize for the unclear
> wording
> > > > there, I guess "DNS lookup" would be a more appropriate wording here.
> > So
> > > > what I meant there was, to delegate the DNS lookup in the constructor
> > to
> > > > the network client poll, and it will happen on the very first poll.
> I
> > > > guess the logic could look like this:
> > > >
> > > > - if the client has been bootstrapped, do nothing.
> > > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > > address.
> > > >
> > > > Thanks for the comment there, I'll change up the wording.  Maybe
> revise
> > > it
> > > > as "DNS resolution should occur in the poll...." ?
> > > >
> > > > P
> > > >
> > > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
> <chrise@aiven.io.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Philip,
> > > > >
> > > > > Thanks for the KIP!
> > > > >
> > > > > QQ: In the "Proposed Changes" section, the KIP states that
> > > "Bootstrapping
> > > > > should now occur in the poll method before attempting to update the
> > > > > metadata. This includes resolving the addresses and bootstrapping
> the
> > > > > metadata.". By "bootstrapping the metadata" do we mean actually
> > > > contacting
> > > > > the bootstrap servers, or just setting some internal state related
> to
> > > the
> > > > > current set of servers that can be contacted for metadata? I ask
> > > because
> > > > it
> > > > > seems like the language here implies the former, but if that's the
> > > case,
> > > > > this is already happening in poll (or at least, the first
> invocation
> > of
> > > > > it), and if it's the latter, it's probably not necessary to mention
> > in
> > > > the
> > > > > KIP since it doesn't really impact user-facing behavior. It also
> > seems
> > > > like
> > > > > that detail might impact how intertwined this and KIP-899 are,
> though
> > > the
> > > > > similarity could still be superficial either way.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hey Ismael,
> > > > > >
> > > > > > Thanks for the feedback! The proposal is not to retry
> automatically
> > > but
> > > > > > relies on the user polling the NetworkClient (basically,
> > > consumer.poll)
> > > > > to
> > > > > > reattempt the bootstrap. If bootstrapping fails, a
> NetworkException
> > > > > > (retriable) will be thrown.
> > > > > >
> > > > > > Thanks!
> > > > > > P
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk>
> > > wrote:
> > > > > >
> > > > > > > Thanks for the KIP. Not sure if I missed it, but how long will
> we
> > > > retry
> > > > > > for
> > > > > > > and when do we give up and propagate the failure to the user?
> > > > > > >
> > > > > > > Ismael
> > > > > > >
> > > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
> philipnee@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi all!
> > > > > > > >
> > > > > > > > I want to start a discussion thread about how we can handle
> > > client
> > > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > > > behavioral
> > > > > > > > change, so a KIP is proposed and attached to this email. Let
> me
> > > > know
> > > > > > what
> > > > > > > > you think!
> > > > > > > >
> > > > > > > >
> > > > > > > > *A small remark here*: *As the title of this KIP might sound
> > > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > > >
> > > > > > > > *In Summary:* I want to propose a KIP to change the existing
> > > > > bootstrap
> > > > > > > > (upon instantiation) strategy because it is reasonable to
> allow
> > > > > clients
> > > > > > > to
> > > > > > > > retry
> > > > > > > >
> > > > > > > > *KIP: *
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Philip
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Philip,

Yeah, it's basically DNS resolution we're talking about, though there's
some additional subtlety there with the logic introduced by KIP-235 [1].
Essentially it should cover any scenario that causes a client constructor
to fail with the current logic but would not after this KIP is released.

We can generalize the Connect use case like this: a client application that
may connect to different Kafka clusters with a public-facing, easy-to-use
API for restarting failed tasks and automatic handling of retriable
exceptions. The ease with which failed tasks can be restarted is
significant because it reduces the cost of failing on non-retriable
exceptions and makes fail-fast behavior easier to work with. And, in cases
like this where we can't really know whether the error we're dealing with
is retriable or not, it's better IMO to continue to allow applications like
these to fail fast. I do agree that it'd be nice to get input from the
community, though.

I was toying with the idea of a NetworkException subclass too. It's a
simpler API, but it doesn't allow for preflight validation, which can be
useful in scenarios where submitting new configurations for client
applications is expensive in terms of time or resources. Then again, I
don't see why the two are mutually exclusive, and we might opt to use the
NetworkException subclass in this KIP and pursue an opt-in validation API
later on. Thoughts?

[1] -
https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection

Cheers,

Chris

On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <ph...@gmail.com> wrote:

> Hey Chris,
>
> Thanks again for the feedback!
>
>
> For the preflight DNS check (are we basically trying to resolve the DNS
> there?): Maybe it makes more sense to add it to the Config modules? I would
> like to hear what the community says as I'm not familiar with the Connect
> use case.
>
> A "slower failing" alternative - I wonder if it makes sense for us to
> extend the NetworkException so that clients can be smarter at handling
> these exceptions. Of course, it is still retriable and requires polling the
> consumer, but then we can distinguish the DNS resolution error from other
> network errors.
>
> Thanks!
> P
>
>
>
>
>
> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Philip,
> >
> > Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
> >
> > One other question I have is whether we should expose some kind of public
> > API for performing preflight validation of the bootstrap URLs. If we
> change
> > the behavior of a client configured with a silly typo (e.g.,
> > "loclahost instead of localhost") from failing in the constructor to
> > failing with a retriable exception, this might lead some client
> > applications to handle that failure by, well, retrying. For reference,
> this
> > is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO
> it'd
> > be nice to be able to opt into keeping the current behavior so that
> > projects like Connect could still do preflight checks of the
> > bootstrap.servers property for connectors before starting them, and
> report
> > any issues by failing fast instead of continuously writing warning/error
> > messages to their logs.
> >
> > I'm not sure about where this new API could go, but a few options might
> be:
> >
> > - Expose a public variant of the existing ClientUtils class
> > - Add static methods to the ConsumerConfig, ProducerConfig, and
> > AdminClientConfig classes
> > - Add those same static methods to the KafkaConsumer, KafkaProducer, and
> > KafkaAdminClient classes
> >
> > If this seems reasonable, we should probably also specify in the KIP that
> > Kafka Connect will leverage this preflight validation logic before
> > instantiating any Kafka clients for use by connectors or tasks, and
> > continue to fail fast if there are typos in the bootstrap.servers
> property,
> > or if temporary DNS resolution issues come up.
> >
> > [1] -
> >
> >
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> > [2] -
> >
> >
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
> >
> > Cheers,
> >
> > Chris
> >
> > On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com> wrote:
> >
> > > Hey Chris,
> > >
> > > Thanks for the quick response, and I apologize for the unclear wording
> > > there, I guess "DNS lookup" would be a more appropriate wording here.
> So
> > > what I meant there was, to delegate the DNS lookup in the constructor
> to
> > > the network client poll, and it will happen on the very first poll.  I
> > > guess the logic could look like this:
> > >
> > > - if the client has been bootstrapped, do nothing.
> > > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> > address.
> > >
> > > Thanks for the comment there, I'll change up the wording.  Maybe revise
> > it
> > > as "DNS resolution should occur in the poll...." ?
> > >
> > > P
> > >
> > > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton <chrise@aiven.io.invalid
> >
> > > wrote:
> > >
> > > > Hi Philip,
> > > >
> > > > Thanks for the KIP!
> > > >
> > > > QQ: In the "Proposed Changes" section, the KIP states that
> > "Bootstrapping
> > > > should now occur in the poll method before attempting to update the
> > > > metadata. This includes resolving the addresses and bootstrapping the
> > > > metadata.". By "bootstrapping the metadata" do we mean actually
> > > contacting
> > > > the bootstrap servers, or just setting some internal state related to
> > the
> > > > current set of servers that can be contacted for metadata? I ask
> > because
> > > it
> > > > seems like the language here implies the former, but if that's the
> > case,
> > > > this is already happening in poll (or at least, the first invocation
> of
> > > > it), and if it's the latter, it's probably not necessary to mention
> in
> > > the
> > > > KIP since it doesn't really impact user-facing behavior. It also
> seems
> > > like
> > > > that detail might impact how intertwined this and KIP-899 are, though
> > the
> > > > similarity could still be superficial either way.
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com>
> > wrote:
> > > >
> > > > > Hey Ismael,
> > > > >
> > > > > Thanks for the feedback! The proposal is not to retry automatically
> > but
> > > > > relies on the user polling the NetworkClient (basically,
> > consumer.poll)
> > > > to
> > > > > reattempt the bootstrap. If bootstrapping fails, a NetworkException
> > > > > (retriable) will be thrown.
> > > > >
> > > > > Thanks!
> > > > > P
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk>
> > wrote:
> > > > >
> > > > > > Thanks for the KIP. Not sure if I missed it, but how long will we
> > > retry
> > > > > for
> > > > > > and when do we give up and propagate the failure to the user?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi all!
> > > > > > >
> > > > > > > I want to start a discussion thread about how we can handle
> > client
> > > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > > behavioral
> > > > > > > change, so a KIP is proposed and attached to this email. Let me
> > > know
> > > > > what
> > > > > > > you think!
> > > > > > >
> > > > > > >
> > > > > > > *A small remark here*: *As the title of this KIP might sound
> > > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > > >
> > > > > > > *In Summary:* I want to propose a KIP to change the existing
> > > > bootstrap
> > > > > > > (upon instantiation) strategy because it is reasonable to allow
> > > > clients
> > > > > > to
> > > > > > > retry
> > > > > > >
> > > > > > > *KIP: *
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Philip
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hey Chris,

Thanks again for the feedback!


For the preflight DNS check (are we basically trying to resolve the DNS
there?): Maybe it makes more sense to add it to the Config modules? I would
like to hear what the community says as I'm not familiar with the Connect
use case.

A "slower failing" alternative - I wonder if it makes sense for us to
extend the NetworkException so that clients can be smarter at handling
these exceptions. Of course, it is still retriable and requires polling the
consumer, but then we can distinguish the DNS resolution error from other
network errors.

Thanks!
P





On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton <ch...@aiven.io.invalid>
wrote:

> Hi Philip,
>
> Yeah,  "DNS resolution should occur..." seems like a better fit. 👍
>
> One other question I have is whether we should expose some kind of public
> API for performing preflight validation of the bootstrap URLs. If we change
> the behavior of a client configured with a silly typo (e.g.,
> "loclahost instead of localhost") from failing in the constructor to
> failing with a retriable exception, this might lead some client
> applications to handle that failure by, well, retrying. For reference, this
> is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO it'd
> be nice to be able to opt into keeping the current behavior so that
> projects like Connect could still do preflight checks of the
> bootstrap.servers property for connectors before starting them, and report
> any issues by failing fast instead of continuously writing warning/error
> messages to their logs.
>
> I'm not sure about where this new API could go, but a few options might be:
>
> - Expose a public variant of the existing ClientUtils class
> - Add static methods to the ConsumerConfig, ProducerConfig, and
> AdminClientConfig classes
> - Add those same static methods to the KafkaConsumer, KafkaProducer, and
> KafkaAdminClient classes
>
> If this seems reasonable, we should probably also specify in the KIP that
> Kafka Connect will leverage this preflight validation logic before
> instantiating any Kafka clients for use by connectors or tasks, and
> continue to fail fast if there are typos in the bootstrap.servers property,
> or if temporary DNS resolution issues come up.
>
> [1] -
>
> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
> [2] -
>
> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
>
> Cheers,
>
> Chris
>
> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com> wrote:
>
> > Hey Chris,
> >
> > Thanks for the quick response, and I apologize for the unclear wording
> > there, I guess "DNS lookup" would be a more appropriate wording here. So
> > what I meant there was, to delegate the DNS lookup in the constructor to
> > the network client poll, and it will happen on the very first poll.  I
> > guess the logic could look like this:
> >
> > - if the client has been bootstrapped, do nothing.
> > - Otherwise, perform DNS lookup, and acquire the bootstrap server
> address.
> >
> > Thanks for the comment there, I'll change up the wording.  Maybe revise
> it
> > as "DNS resolution should occur in the poll...." ?
> >
> > P
> >
> > On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Philip,
> > >
> > > Thanks for the KIP!
> > >
> > > QQ: In the "Proposed Changes" section, the KIP states that
> "Bootstrapping
> > > should now occur in the poll method before attempting to update the
> > > metadata. This includes resolving the addresses and bootstrapping the
> > > metadata.". By "bootstrapping the metadata" do we mean actually
> > contacting
> > > the bootstrap servers, or just setting some internal state related to
> the
> > > current set of servers that can be contacted for metadata? I ask
> because
> > it
> > > seems like the language here implies the former, but if that's the
> case,
> > > this is already happening in poll (or at least, the first invocation of
> > > it), and if it's the latter, it's probably not necessary to mention in
> > the
> > > KIP since it doesn't really impact user-facing behavior. It also seems
> > like
> > > that detail might impact how intertwined this and KIP-899 are, though
> the
> > > similarity could still be superficial either way.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com>
> wrote:
> > >
> > > > Hey Ismael,
> > > >
> > > > Thanks for the feedback! The proposal is not to retry automatically
> but
> > > > relies on the user polling the NetworkClient (basically,
> consumer.poll)
> > > to
> > > > reattempt the bootstrap. If bootstrapping fails, a NetworkException
> > > > (retriable) will be thrown.
> > > >
> > > > Thanks!
> > > > P
> > > >
> > > >
> > > >
> > > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk>
> wrote:
> > > >
> > > > > Thanks for the KIP. Not sure if I missed it, but how long will we
> > retry
> > > > for
> > > > > and when do we give up and propagate the failure to the user?
> > > > >
> > > > > Ismael
> > > > >
> > > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi all!
> > > > > >
> > > > > > I want to start a discussion thread about how we can handle
> client
> > > > > > bootstrap failure due DNS lookup.  This requires a bit of
> > behavioral
> > > > > > change, so a KIP is proposed and attached to this email. Let me
> > know
> > > > what
> > > > > > you think!
> > > > > >
> > > > > >
> > > > > > *A small remark here*: *As the title of this KIP might sound
> > > > > > familiar/similar to KIP-899, it is not the same.*
> > > > > >
> > > > > > *In Summary:* I want to propose a KIP to change the existing
> > > bootstrap
> > > > > > (upon instantiation) strategy because it is reasonable to allow
> > > clients
> > > > > to
> > > > > > retry
> > > > > >
> > > > > > *KIP: *
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > > >
> > > > > > Thanks!
> > > > > > Philip
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Philip,

Yeah,  "DNS resolution should occur..." seems like a better fit. 👍

One other question I have is whether we should expose some kind of public
API for performing preflight validation of the bootstrap URLs. If we change
the behavior of a client configured with a silly typo (e.g.,
"loclahost instead of localhost") from failing in the constructor to
failing with a retriable exception, this might lead some client
applications to handle that failure by, well, retrying. For reference, this
is exactly what we do in Kafka Connect right now; see [1] and [2]. IMO it'd
be nice to be able to opt into keeping the current behavior so that
projects like Connect could still do preflight checks of the
bootstrap.servers property for connectors before starting them, and report
any issues by failing fast instead of continuously writing warning/error
messages to their logs.

I'm not sure about where this new API could go, but a few options might be:

- Expose a public variant of the existing ClientUtils class
- Add static methods to the ConsumerConfig, ProducerConfig, and
AdminClientConfig classes
- Add those same static methods to the KafkaConsumer, KafkaProducer, and
KafkaAdminClient classes

If this seems reasonable, we should probably also specify in the KIP that
Kafka Connect will leverage this preflight validation logic before
instantiating any Kafka clients for use by connectors or tasks, and
continue to fail fast if there are typos in the bootstrap.servers property,
or if temporary DNS resolution issues come up.

[1] -
https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
[2] -
https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439

Cheers,

Chris

On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <ph...@gmail.com> wrote:

> Hey Chris,
>
> Thanks for the quick response, and I apologize for the unclear wording
> there, I guess "DNS lookup" would be a more appropriate wording here. So
> what I meant there was, to delegate the DNS lookup in the constructor to
> the network client poll, and it will happen on the very first poll.  I
> guess the logic could look like this:
>
> - if the client has been bootstrapped, do nothing.
> - Otherwise, perform DNS lookup, and acquire the bootstrap server address.
>
> Thanks for the comment there, I'll change up the wording.  Maybe revise it
> as "DNS resolution should occur in the poll...." ?
>
> P
>
> On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Philip,
> >
> > Thanks for the KIP!
> >
> > QQ: In the "Proposed Changes" section, the KIP states that "Bootstrapping
> > should now occur in the poll method before attempting to update the
> > metadata. This includes resolving the addresses and bootstrapping the
> > metadata.". By "bootstrapping the metadata" do we mean actually
> contacting
> > the bootstrap servers, or just setting some internal state related to the
> > current set of servers that can be contacted for metadata? I ask because
> it
> > seems like the language here implies the former, but if that's the case,
> > this is already happening in poll (or at least, the first invocation of
> > it), and if it's the latter, it's probably not necessary to mention in
> the
> > KIP since it doesn't really impact user-facing behavior. It also seems
> like
> > that detail might impact how intertwined this and KIP-899 are, though the
> > similarity could still be superficial either way.
> >
> > Cheers,
> >
> > Chris
> >
> > On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com> wrote:
> >
> > > Hey Ismael,
> > >
> > > Thanks for the feedback! The proposal is not to retry automatically but
> > > relies on the user polling the NetworkClient (basically, consumer.poll)
> > to
> > > reattempt the bootstrap. If bootstrapping fails, a NetworkException
> > > (retriable) will be thrown.
> > >
> > > Thanks!
> > > P
> > >
> > >
> > >
> > > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk> wrote:
> > >
> > > > Thanks for the KIP. Not sure if I missed it, but how long will we
> retry
> > > for
> > > > and when do we give up and propagate the failure to the user?
> > > >
> > > > Ismael
> > > >
> > > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all!
> > > > >
> > > > > I want to start a discussion thread about how we can handle client
> > > > > bootstrap failure due DNS lookup.  This requires a bit of
> behavioral
> > > > > change, so a KIP is proposed and attached to this email. Let me
> know
> > > what
> > > > > you think!
> > > > >
> > > > >
> > > > > *A small remark here*: *As the title of this KIP might sound
> > > > > familiar/similar to KIP-899, it is not the same.*
> > > > >
> > > > > *In Summary:* I want to propose a KIP to change the existing
> > bootstrap
> > > > > (upon instantiation) strategy because it is reasonable to allow
> > clients
> > > > to
> > > > > retry
> > > > >
> > > > > *KIP: *
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > > >
> > > > > Thanks!
> > > > > Philip
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hey Chris,

Thanks for the quick response, and I apologize for the unclear wording
there, I guess "DNS lookup" would be a more appropriate wording here. So
what I meant there was, to delegate the DNS lookup in the constructor to
the network client poll, and it will happen on the very first poll.  I
guess the logic could look like this:

- if the client has been bootstrapped, do nothing.
- Otherwise, perform DNS lookup, and acquire the bootstrap server address.

Thanks for the comment there, I'll change up the wording.  Maybe revise it
as "DNS resolution should occur in the poll...." ?

P

On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton <ch...@aiven.io.invalid>
wrote:

> Hi Philip,
>
> Thanks for the KIP!
>
> QQ: In the "Proposed Changes" section, the KIP states that "Bootstrapping
> should now occur in the poll method before attempting to update the
> metadata. This includes resolving the addresses and bootstrapping the
> metadata.". By "bootstrapping the metadata" do we mean actually contacting
> the bootstrap servers, or just setting some internal state related to the
> current set of servers that can be contacted for metadata? I ask because it
> seems like the language here implies the former, but if that's the case,
> this is already happening in poll (or at least, the first invocation of
> it), and if it's the latter, it's probably not necessary to mention in the
> KIP since it doesn't really impact user-facing behavior. It also seems like
> that detail might impact how intertwined this and KIP-899 are, though the
> similarity could still be superficial either way.
>
> Cheers,
>
> Chris
>
> On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com> wrote:
>
> > Hey Ismael,
> >
> > Thanks for the feedback! The proposal is not to retry automatically but
> > relies on the user polling the NetworkClient (basically, consumer.poll)
> to
> > reattempt the bootstrap. If bootstrapping fails, a NetworkException
> > (retriable) will be thrown.
> >
> > Thanks!
> > P
> >
> >
> >
> > On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk> wrote:
> >
> > > Thanks for the KIP. Not sure if I missed it, but how long will we retry
> > for
> > > and when do we give up and propagate the failure to the user?
> > >
> > > Ismael
> > >
> > > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com>
> wrote:
> > >
> > > > Hi all!
> > > >
> > > > I want to start a discussion thread about how we can handle client
> > > > bootstrap failure due DNS lookup.  This requires a bit of behavioral
> > > > change, so a KIP is proposed and attached to this email. Let me know
> > what
> > > > you think!
> > > >
> > > >
> > > > *A small remark here*: *As the title of this KIP might sound
> > > > familiar/similar to KIP-899, it is not the same.*
> > > >
> > > > *In Summary:* I want to propose a KIP to change the existing
> bootstrap
> > > > (upon instantiation) strategy because it is reasonable to allow
> clients
> > > to
> > > > retry
> > > >
> > > > *KIP: *
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > > >
> > > > Thanks!
> > > > Philip
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Philip,

Thanks for the KIP!

QQ: In the "Proposed Changes" section, the KIP states that "Bootstrapping
should now occur in the poll method before attempting to update the
metadata. This includes resolving the addresses and bootstrapping the
metadata.". By "bootstrapping the metadata" do we mean actually contacting
the bootstrap servers, or just setting some internal state related to the
current set of servers that can be contacted for metadata? I ask because it
seems like the language here implies the former, but if that's the case,
this is already happening in poll (or at least, the first invocation of
it), and if it's the latter, it's probably not necessary to mention in the
KIP since it doesn't really impact user-facing behavior. It also seems like
that detail might impact how intertwined this and KIP-899 are, though the
similarity could still be superficial either way.

Cheers,

Chris

On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <ph...@gmail.com> wrote:

> Hey Ismael,
>
> Thanks for the feedback! The proposal is not to retry automatically but
> relies on the user polling the NetworkClient (basically, consumer.poll) to
> reattempt the bootstrap. If bootstrapping fails, a NetworkException
> (retriable) will be thrown.
>
> Thanks!
> P
>
>
>
> On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk> wrote:
>
> > Thanks for the KIP. Not sure if I missed it, but how long will we retry
> for
> > and when do we give up and propagate the failure to the user?
> >
> > Ismael
> >
> > On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com> wrote:
> >
> > > Hi all!
> > >
> > > I want to start a discussion thread about how we can handle client
> > > bootstrap failure due DNS lookup.  This requires a bit of behavioral
> > > change, so a KIP is proposed and attached to this email. Let me know
> what
> > > you think!
> > >
> > >
> > > *A small remark here*: *As the title of this KIP might sound
> > > familiar/similar to KIP-899, it is not the same.*
> > >
> > > *In Summary:* I want to propose a KIP to change the existing bootstrap
> > > (upon instantiation) strategy because it is reasonable to allow clients
> > to
> > > retry
> > >
> > > *KIP: *
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> > >
> > > Thanks!
> > > Philip
> > >
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Philip Nee <ph...@gmail.com>.
Hey Ismael,

Thanks for the feedback! The proposal is not to retry automatically but
relies on the user polling the NetworkClient (basically, consumer.poll) to
reattempt the bootstrap. If bootstrapping fails, a NetworkException
(retriable) will be thrown.

Thanks!
P



On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <is...@juma.me.uk> wrote:

> Thanks for the KIP. Not sure if I missed it, but how long will we retry for
> and when do we give up and propagate the failure to the user?
>
> Ismael
>
> On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com> wrote:
>
> > Hi all!
> >
> > I want to start a discussion thread about how we can handle client
> > bootstrap failure due DNS lookup.  This requires a bit of behavioral
> > change, so a KIP is proposed and attached to this email. Let me know what
> > you think!
> >
> >
> > *A small remark here*: *As the title of this KIP might sound
> > familiar/similar to KIP-899, it is not the same.*
> >
> > *In Summary:* I want to propose a KIP to change the existing bootstrap
> > (upon instantiation) strategy because it is reasonable to allow clients
> to
> > retry
> >
> > *KIP: *
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
> >
> > Thanks!
> > Philip
> >
>

Re: [DISCUSS] KIP-909: Allow clients to rebootstrap DNS lookup failure

Posted by Ismael Juma <is...@juma.me.uk>.
Thanks for the KIP. Not sure if I missed it, but how long will we retry for
and when do we give up and propagate the failure to the user?

Ismael

On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <ph...@gmail.com> wrote:

> Hi all!
>
> I want to start a discussion thread about how we can handle client
> bootstrap failure due DNS lookup.  This requires a bit of behavioral
> change, so a KIP is proposed and attached to this email. Let me know what
> you think!
>
>
> *A small remark here*: *As the title of this KIP might sound
> familiar/similar to KIP-899, it is not the same.*
>
> *In Summary:* I want to propose a KIP to change the existing bootstrap
> (upon instantiation) strategy because it is reasonable to allow clients to
> retry
>
> *KIP: *
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
>
> Thanks!
> Philip
>