Posted to dev@kafka.apache.org by Yash Mayya <ya...@gmail.com> on 2022/11/02 14:06:43 UTC

[DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Hi all,

I'd like to start a discussion thread on this small KIP -
https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable

It proposes the addition of a new Kafka Connect worker configuration to
allow configuring REST API request timeouts.
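
As a rough sketch, the new setting would behave like any other worker
configuration, e.g. in connect-distributed.properties (the property name
and value below are illustrative; the KIP defines the exact name and
default):

    # illustrative only - see the KIP for the exact name and default
    rest.api.request.timeout.ms=120000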

Thanks,
Yash

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Yash Mayya <ya...@gmail.com>.
Hi Greg,

> Applying this to a hypothetical extension of timeouts
> to other endpoints: my concern was primarily about
> whether this design applied everywhere felt natural or
> awkward. In my opinion, a request parameter would
> draw a lot of outsized attention in documentation, and
> be the first parameter for a lot of the endpoints, while a
> header could be specified as a more obscure global
> modifier to all requests.

That's a fair point; I agree that the choice of a request header now
seems more appropriate.

> My chief concern was about the clarity of the API in the
> future when both configurable timeouts and asynchronous
> APIs exist. Will the configurable timeouts still make sense,
> and will they be a feature that people will use?

I do believe that there's a place for both of these to co-exist - there
will always be use-cases that require config validation errors to be
reported synchronously so that connector configurations can be corrected or
iterated upon. I don't really see valid use cases for a timeout of 20 or
60 minutes as you suggested, but I do think that the genuine edge cases
that exist today - where config validation needs somewhat more than the
current 90 seconds (2-3 minutes perhaps) - deserve a way to be handled
until we have an asynchronous config validation API (if we ever do go
down that path).

I've updated the KIP to reflect the change to use a request header rather
than a query parameter to configure the timeout value.
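
To make this concrete, a client could pass the proposed header along the
lines of the sketch below. This is illustrative only: the header name
follows this discussion, but the millisecond interpretation of the value,
the plugin name, and the connector config are assumptions - the KIP
defines the actual contract.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ValidateWithTimeout {
        public static void main(String[] args) throws Exception {
            // Hypothetical connector config to validate (plugin name and
            // settings are examples only)
            String config = "{\"connector.class\": \"org.example.MySourceConnector\","
                    + " \"topics\": \"test-topic\"}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8083/connector-plugins/"
                            + "MySourceConnector/config/validate"))
                    .header("Content-Type", "application/json")
                    // Proposed per-request timeout header; a millisecond
                    // value is assumed here for illustration
                    .header("Request-Timeout", "180000")
                    .PUT(HttpRequest.BodyPublishers.ofString(config))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
        }
    }

Such a request would, of course, still be subject to the worker-level
upper bound on the timeout discussed earlier in the thread.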

Thanks,
Yash

On Tue, Mar 14, 2023 at 2:46 AM Greg Harris <gr...@aiven.io.invalid>
wrote:

> Yash,
>
> 1. I think that Request-Timeout is a fine choice. I did not see any
> standard header for a use-case like this, so we are free to choose a name
> for ourselves.
>
> 2. While such proposals do not exist, I don't think that means that we
> should exclude them from consideration. If we do, then we run the risk of
> making future implementations more complicated, impossible, or awkward to
> use. This is a balance of course, as we cannot let future features paralyze
> us from making improvements today. We cannot hold ourselves responsible for
> handling unknown unknowns, but we can at least try to plan around known
> unknowns.
> Applying this to a hypothetical extension of timeouts to other endpoints:
> my concern was primarily about whether this design applied everywhere felt
> natural or awkward. In my opinion, a request parameter would draw a lot of
> outsized attention in documentation, and be the first parameter for a lot
> of the endpoints, while a header could be specified as a more obscure
> global modifier to all requests.
>
> 3. 4. Applying the same standard to a hypothetical asynchronous API: I don't
> believe it makes async validation substantially more complicated, except
> the "request timeout interrupts connector creation" interaction. I also do
> not think that it would make it impossible to implement an async validation
> scheme in the future, since the synchronization is already in place, the
> timeout is just different.
> My chief concern was about the clarity of the API in the future when both
> configurable timeouts and asynchronous APIs exist. Will the configurable
> timeouts still make sense, and will they be a feature that people will use?
> Are we expecting users to hold a single REST request open for 5, 20, or 60
> minutes to accommodate a connector with an excessively long validation
> step? Successfully validating a request requires the caller and the network
> to remain stable for the entire duration of the request to have a chance of
> success, and consumes some (small) amount of resources during that time.
> And what if they decide they no longer want the validation result, and wish
> to cancel the operation to free resources (both connect and external
> system)? There is no standard mechanism to discard or cancel an HTTP
> request (not even a connection loss).
>
> You have said that the default timeout for other endpoints is more than
> satisfactory for the typical user, and I agree. If other operations
> (deleting, restarting, pause/unpause, etc) had a substantial synchronous
> component that would benefit from a configurable timeout, I would be more
> inclined to support configurable timeouts (all at once, or validation first
> and other operations later).
> But as it stands, I don't think I see the value-add of configurable
> timeouts once there is another solution which targets the validate call
> duration specifically, and I'm concerned that configurable timeouts would
> be made irrelevant in that case.
>
> Thanks,
> Greg
>
> On Wed, Mar 8, 2023 at 7:21 AM Yash Mayya <ya...@gmail.com> wrote:
>
> > Hi Greg,
> >
> > Thanks for the response!
> >
> > 1. Hm that's a fair point, and while there aren't any strict guidelines or
> > rules governing whether something should be a query parameter versus a
> > request header, I agree that it might be more idiomatic for a request
> > timeout value to be accepted via a custom request header. What do you
> > think about using a header name like "X-Request-Timeout", or maybe simply
> > "Request-Timeout" if we want to take into account
> > https://www.rfc-editor.org/rfc/rfc6648?
> >
> > 2. 3. 4. Adding asynchronous validations via new endpoints / new request
> > parameters on existing endpoints is definitely an interesting idea but I'm
> > not sure whether this KIP should take into consideration a hypothetical
> > future proposal? If there were already such a concrete proposal, I would
> > definitely agree that this KIP should take that into consideration - but
> > that isn't the case today. Furthermore, I'm not sure I understand why the
> > addition of request timeout configurability to the existing endpoints
> > would preclude the introduction of an asynchronous validation API in the
> > future?
> >
> > Thanks,
> > Yash
> >
> > On Sat, Mar 4, 2023 at 1:13 AM Greg Harris <greg.harris@aiven.io.invalid>
> > wrote:
> >
> > > Yash,
> > >
> > > 1.
> > > Currently the request parameters in the REST API are individual and
> > > pertain to just one endpoint.
> > > They also change the content of the query result, or change the action
> > > taken on the cluster.
> > > I think that a request timeout is a property of the HTTP request more
> > > than it is a property of the cluster query or cluster action.
> > > The two solutions have very similar tradeoffs, but I'm interested in
> > > whether one is more idiomatic and more obvious to users.
> > >
> > > 2.
> > > I understand that only these three endpoints are in need of increased
> > > timeouts at this time due to long connector validations.
> > > From another perspective, this change is making the API more irregular
> > > and someone in the future might choose to make it more regular by
> > > standardizing the configurable timeout functionality.
> > > I wouldn't (in this KIP) dismiss someone's desire to configure other
> > > timeouts in the future (in another KIP), and design them into a corner.
> > > It is acceptable to limit the scope of this change to just the three
> > > endpoints due to practical reasons, but I don't think that should
> > > prevent us from trying to ensure that this design fits in the "end goal"
> > > state of the Connect service.
> > >
> > > 3. 4.
> > > I am not suggesting an incompatible change, as the current synchronous
> > > behavior is still a useful API for certain situations. I think that it
> > > is possible to add asynchronous validations in a backwards compatible
> > > way, using new endpoints or other new request parameters.
> > > The interface could be designed such that users with connectors that
> > > exceed the synchronous timeouts can utilize the asynchronous API.
> > > Tooling can use the asynchronous API when it is available, and fall
> > > back to the synchronous API when it is not.
> > > I think that it also may be more in-line with the design of the rest of
> > > the REST API, where nearly every other request is asynchronous. That's
> > > why you're only targeting these three endpoints, they're the only ones
> > > with a synchronicity constraint.
> > > Again, I'm not necessarily saying that you must implement such an
> > > asynchronous validation scheme in this KIP, but we should consider if
> > > that is a more extensible solution. If we decided to implement
> > > configurable synchronous timeouts now, how would that complement an
> > > asynchronous API in the future?
> > >
> > > On Thu, Mar 2, 2023 at 10:00 PM Yash Mayya <ya...@gmail.com> wrote:
> > >
> > > > Hi Greg,
> > > >
> > > > Thanks for taking a look!
> > > >
> > > > 1. I believe Chris suggested the use of a query parameter above as we
> > > > already have precedents for using query parameters to configure per
> > > > request behavior in Kafka Connect (the restart connectors API and the
> > > > get connectors API for instance). Also, the limited choice of
> > > > endpoints targeted is intentional (see my reply to the next point).
> > > >
> > > > 2. I intentionally targeted just the three listed endpoints where
> > > > synchronous connector config validations come into the picture. This
> > > > is because of the legitimate cases where config validation for
> > > > specific connector plugins might exceed the default request timeout in
> > > > edge case scenarios (outlined in the KIP's motivation section). Other
> > > > Connect REST endpoints shouldn't be taking longer than the default 90
> > > > second request timeout; if they do so, it would either be indicative
> > > > of a bug in the Connect framework or a cluster health issue - neither
> > > > of which should be covered up by manually setting a longer request
> > > > timeout.
> > > >
> > > > 3. 4. I think changing the config validation behavior would be a
> > > > backward incompatible change and I wanted to avoid that in this
> > > > particular KIP. There are multiple programmatic UIs out there which
> > > > rely on the current synchronous config validation behavior and
> > > > breaking the existing contract would definitely require a larger
> > > > discussion.
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Fri, Mar 3, 2023 at 12:04 AM Greg Harris
> > > > <greg.harris@aiven.io.invalid> wrote:
> > > >
> > > > > Hey Yash,
> > > > >
> > > > > Thanks for the KIP, and sorry for the late review.
> > > > >
> > > > > 1. Have you considered a HTTP header to provide the
> > > > > client-configurable timeout? A header might more naturally extend to
> > > > > all of the other endpoints in the future, rather than duplicating
> > > > > the query parameter across endpoints.
> > > > >
> > > > > 2. I understand that this change is targeted at just long duration
> > > > > Connector::validation calls, is that due to voluntary scope
> > > > > constraints? Implementing configurable timeouts for all endpoints in
> > > > > a uniform way could be desirable, even if the default timeout will
> > > > > work for nearly all of the other calls.
> > > > >
> > > > > 3. Did you consider adding asynchronous validation as a user-facing
> > > > > feature? I think that relying on the synchronous validation results
> > > > > in a single HTTP request is a bit of an anti-pattern, and one that
> > > > > we've inherited from the original REST design. It seems useful when
> > > > > using the REST API by hand, but seems to be a liability when used in
> > > > > environments with an external management layer.
> > > > >
> > > > > 4. Perhaps rather than allowing synchronous calls to
> > > > > Connector::validate to increase in duration, we should provide a way
> > > > > for connectors to surface validation problems later in their
> > > > > lifecycle. Currently there is the ConnectorContext::raiseError that
> > > > > transitions the task to FAILED, perhaps a similar API could
> > > > > asynchronously emit re-validation results. We've also had a problem
> > > > > with long start() duration for the same reasons as validate().
> > > > >
> > > > > I understand if you want to keep this KIP as tightly focused as
> > > > > possible, but I'm worried that it is addressing the symptom and not
> > > > > the problem. I want to make sure that this change is impactful and
> > > > > isn't obsoleted by a later improvement.
> > > > >
> > > > > Thanks,
> > > > > Greg
> > > > >
> > > > >
> > > > > On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <ya...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Thanks for all the feedback and discussion. I've renamed the KIP
> > > > > > title to "Kafka Connect REST API timeout improvements" since we're
> > > > > > introducing a couple of improvements (cancelling create / update
> > > > > > connector requests when config validations timeout and avoiding
> > > > > > double config validations in distributed mode) along with making
> > > > > > the request timeouts configurable. The new KIP link is
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > > > > > and I've called for a vote for the KIP here -
> > > > > > https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.
> > > > > >
> > > > > > Thanks,
> > > > > > Yash
> > > > > >
> > > > > > On Sat, Nov 5, 2022 at 11:42 PM Sagar <sagarmeansocean@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Yash,
> > > > > > >
> > > > > > > Thanks for the explanation. I think it should be fine to
> > > > > > > delegate the validation directly to the leader.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Sagar.
> > > > > > >
> > > > > > > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya
> > > > > > > <yash.mayya@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Sagar,
> > > > > > > >
> > > > > > > > Thanks for chiming in!
> > > > > > > >
> > > > > > > > > Having said that, why does the worker forward to the
> > > > > > > > > leader? I am thinking if the worker can perform the
> > > > > > > > > validation on its own, we could let it do the validation
> > > > > > > > > instead of forwarding everything to the leader
> > > > > > > >
> > > > > > > > Only the leader is allowed to perform writes to the config
> > > > > > > > topic, so any request that requires a config topic write ends
> > > > > > > > up being forwarded to the leader. The `POST /connectors` and
> > > > > > > > `PUT /connectors/{connector}/config` endpoints call
> > > > > > > > `Herder::putConnectorConfig` which internally does a config
> > > > > > > > validation first before initiating another herder request to
> > > > > > > > write to the config topic (in distributed mode). If we want to
> > > > > > > > allow the first worker to do the config validation, and then
> > > > > > > > forward the request to the leader just for the write to the
> > > > > > > > config topic, we'd either need something like a skip
> > > > > > > > validations query parameter on the endpoint like Chris talks
> > > > > > > > about above or else a new internal only endpoint that just
> > > > > > > > does a write to the config topic without any prior config
> > > > > > > > validation. However, we agreed that this optimization doesn't
> > > > > > > > really seem necessary for now and can be done later if deemed
> > > > > > > > useful. I'd be happy to hear differing thoughts if any,
> > > > > > > > however.
> > > > > > > >
> > > > > > > > > I think a bound is certainly needed but IMO it shouldn't go
> > > > > > > > > beyond 10 mins considering this is just validation
> > > > > > > >
> > > > > > > > Yeah, I agree that this seems like a fair value - I've
> > > > > > > > elected to go with a default value of 10 minutes for the
> > > > > > > > proposed worker configuration that sets an upper bound for
> > > > > > > > the timeout query parameter.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yash
> > > > > > > >
> > > > > > > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya
> > > > > > > > <yash.mayya@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Chris,
> > > > > > > > >
> > > > > > > > > Thanks again for your feedback. I think a worker
> > > > > > > > > configuration for the upper bound makes sense - I initially
> > > > > > > > > thought we could hardcode it (just like the current request
> > > > > > > > > timeout is), but there's no reason to set another
> > > > > > > > > artificial bound that isn't user configurable which is
> > > > > > > > > exactly what we're trying to change here in the first
> > > > > > > > > place. I've updated the KIP based on all our discussion
> > > > > > > > > above.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Yash
> > > > > > > > >
> > > > > > > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar
> > > > > > > > > <sagarmeansocean@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > >> Hey Yash,
> > > > > > > > >>
> > > > > > > > >> Thanks for the KIP! This looks like a useful feature.
> > > > > > > > >>
> > > > > > > > >> I think the discussion thread already has some great
> > > > > > > > >> points by Chris. Just a couple of points/clarifications =>
> > > > > > > > >>
> > > > > > > > >> Regarding pt#2, I guess it might be better to forward to
> > > > > > > > >> the leader as suggested by Yash. Having said that, why
> > > > > > > > >> does the worker forward to the leader? I am thinking if
> > > > > > > > >> the worker can perform the validation on its own, we could
> > > > > > > > >> let it do the validation instead of forwarding everything
> > > > > > > > >> to the leader (even though it might be cheap to forward
> > > > > > > > >> all requests to the leader).
> > > > > > > > >>
> > > > > > > > >> Pt#3 => I think a bound is certainly needed but IMO it
> > > > > > > > >> shouldn't go beyond 10 mins considering this is just
> > > > > > > > >> validation. We shouldn't end up in a situation where a few
> > > > > > > > >> faulty connectors end up blocking a lot of request
> > > > > > > > >> processing threads, so while increasing the config is
> > > > > > > > >> certainly helpful, we shouldn't set too high a value IMO.
> > > > > > > > >> Of course I am also open to suggestions here.
> > > > > > > > >>
> > > > > > > > >> Thanks!
> > > > > > > > >> Sagar.
> > > > > > > > >>
> > > > > > > > >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton
> > > > > > > > >> <chrise@aiven.io.invalid> wrote:
> > > > > > > > >>
> > > > > > > > >> > Hi Yash,
> > > > > > > > >> >
> > > > > > > > >> > RE 2: That's a great point about validations already
> > > > > > > > >> > being performed by the leader. For completeness's sake,
> > > > > > > > >> > I'd like to note that this only holds for valid
> > > > > > > > >> > configurations; invalid ones are caught right now before
> > > > > > > > >> > being forwarded to the leader. Still, I think it's fine
> > > > > > > > >> > to forward to the leader for now and optimize further in
> > > > > > > > >> > the future if necessary. If frequent validations are
> > > > > > > > >> > taking place they should be conducted via the `PUT
> > > > > > > > >> > /connector-plugins/{pluginName}/config/validate`
> > > > > > > > >> > endpoint, which won't do any forwarding at all.
> > > > > > > > >> >
> > > > > > > > >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the
> > > > > > > > >> > timeout also seem reasonable... maybe a low-importance
> > > > > > > > >> > worker property could work for that? Not sure what would
> > > > > > > > >> > make sense for a default; probably somewhere in the
> > > > > > > > >> > 10-60 minute range but would be interested in others'
> > > > > > > > >> > thoughts.
> > > > > > > > >> >
> > > > > > > > >> > Thanks for the clarification on the zombie fencing
> > > > > > > > >> > logic. I think we might want to have some more subtle
> > > > > > > > >> > logic around the interaction between calls to
> > > > > > > > >> > Admin::fenceProducers and a worker-level timeout
> > > > > > > > >> > property if we go that route, but we can cross that
> > > > > > > > >> > particular bridge if we get back to it.
> > > > > > > > >> >
> > > > > > > > >> > Cheers,
> > > > > > > > >> >
> > > > > > > > >> > Chris
> > > > > > > > >> >
> > > > > > > > >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya
> > > > > > > > >> > <yash.mayya@gmail.com> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Hi Chris,
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks a lot for the super quick response and the
> > > > > > > > >> > > great feedback!
> > > > > > > > >> > >
> > > > > > > > >> > > 1. I think that makes a lot of sense, and I'd be
> > > > > > > > >> > > happy to update the KIP to include this change in the
> > > > > > > > >> > > scope. The current behavior where the API response
> > > > > > > > >> > > indicates a time out but the connector is
> > > > > > > > >> > > created/updated eventually anyway can be pretty
> > > > > > > > >> > > confusing and is generally not a good user experience
> > > > > > > > >> > > IMO.
> > > > > > > > >> > >
> > > > > > > > >> > > 2. Wow, thanks for pointing this out - it's a really
> > > > > > > > >> > > good catch and something I hadn't noticed was
> > > > > > > > >> > > happening with the current implementation. While I do
> > > > > > > > >> > > like the idea of having a query parameter that
> > > > > > > > >> > > determines whether validations can be skipped, I'm
> > > > > > > > >> > > wondering if it might not be easier and cleaner to
> > > > > > > > >> > > just do the leader check earlier and avoid doing the
> > > > > > > > >> > > unnecessary config validation on the first worker?
> > > > > > > > >> > > Since each config validation happens on its own
> > > > > > > > >> > > thread, I'm not so sure about the concern of
> > > > > > > > >> > > overloading the leader even on larger clusters,
> > > > > > > > >> > > especially since validations aren't typically long
> > > > > > > > >> > > running operations. Furthermore, even with the current
> > > > > > > > >> > > implementation, the leader will always be doing a
> > > > > > > > >> > > config validation for connector create / update REST
> > > > > > > > >> > > API requests on any worker.
> > > > > > > > >> > >
> > > > > > > > >> > > 3. That's a good point, and this way we can also
> > > > > > > > >> > > restrict the APIs whose timeouts are configurable -
> > > > > > > > >> > > I'm thinking `PUT
> > > > > > > > >> > > /connector-plugins/{pluginName}/config/validate`,
> > > > > > > > >> > > `POST /connectors` and `PUT
> > > > > > > > >> > > /connectors/{connector}/config` are the ones where
> > > > > > > > >> > > such a timeout parameter could be useful. Also, do you
> > > > > > > > >> > > think we should enforce some reasonable bounds for the
> > > > > > > > >> > > timeout config?
> > > > > > > > >> > >
> > > > > > > > >> > > On the zombie fencing point, the implication was that
> > > > > > > > >> > > the new worker property would not control the timeout
> > > > > > > > >> > > used for the call to Admin::fenceProducers. However,
> > > > > > > > >> > > if we go with a timeout query parameter approach, even
> > > > > > > > >> > > the timeout for the `PUT
> > > > > > > > >> > > /connectors/{connector}/fence` endpoint will remain
> > > > > > > > >> > > unaffected.
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks,
> > > > > > > > >> > > Yash
> > > > > > > > >> > >
> > > > > > > > >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton
> > > > > > > > >> > > <chrise@aiven.io.invalid> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Hi Yash,
> > > > > > > > >> > > >
> > > > > > > > >> > > > Thanks for the KIP. It's a nice, focused change.
> > > > > > > > >> > > > Initially I was hesitant to support cases where
> > > > > > > > >> > > > connector validation takes this long, but
> > > > > > > > >> > > > considering the alternative is that we give users a
> > > > > > > > >> > > > 500 error response but leave the request to
> > > > > > > > >> > > > create/modify the connector queued up in the herder,
> > > > > > > > >> > > > I think I can get behind the motivation here.
> > > > > > > > >> > > > There's also an argument to be made about keeping
> > > > > > > > >> > > > Kafka Connect available even when the systems that
> > > > > > > > >> > > > it connects to are in a degraded state.
> > > > > > > > >> > > >
> > > > > > > > >> > > > I have a few alternatives I'd be interested in your
> > > > > > > > >> > > > thoughts on:
> > > > > > > > >> > > >
> > > > > > > > >> > > > 1. Since the primary concern here seems to be that
> > > > > > > > >> > > > custom connector validation logic can take too
> > > > > > > > >> > > > long, do we have any thoughts on adding logic to
> > > > > > > > >> > > > check for request timeout after validation has
> > > > > > > > >> > > > completed and, if it has, aborting the attempt to
> > > > > > > > >> > > > create/modify the connector?
> > > > > > > > >> > > >
> > > > > > > > >> > > > 2. Right now it's possible that we'll perform two
> > > > > > > > >> > > > connector config validations per create/modify
> > > > > > > > >> > > > request; once on the worker that initially receives
> > > > > > > > >> > > > the request, and then again if that worker is not
> > > > > > > > >> > > > the leader of the cluster and has to forward the
> > > > > > > > >> > > > request to the leader. Any thoughts on optimizing
> > > > > > > > >> > > > this to only require a single validation per
> > > > > > > > >> > > > request? We probably wouldn't want to force all
> > > > > > > > >> > > > validations to take place on the leader (could lead
> > > > > > > > >> > > > to overloading it pretty quickly in large clusters),
> > > > > > > > >> > > > but we could add an internal-only query parameter to
> > > > > > > > >> > > > skip validation and then use that parameter when
> > > > > > > > >> > > > forwarding requests from followers to the leader.
> > > > > > > > >> > > >
> > > > > > > > >> > > > 3. A worker property is pretty coarse-grained, and
> > > > > > > > >> > > > difficult to change. We might allow per-request
> > > > > > > > >> > > > toggling of the timeout by adding a URL query
> > > > > > > > >> > > > parameter like '?timeout=90s' to the REST API to
> > > > > > > > >> > > > allow tweaking of the timeout on a more granular
> > > > > > > > >> > > > basis, and without having to perform a worker
> > > > > > > > >> > > > restart.
> > > > > > > > >> > > >
> > > > > > > > >> > > > I'd also like to clarify a point about the rejected
> > > > > > > > >> > > > alternative "Allow configuring producer zombie
> > > > > > > > >> > > > fencing admin request timeout" - is the implication
> > > > > > > > >> > > > here that the "rest.api.request.timeout.ms" property
> > > > > > > > >> > > > will not control the REST timeout for requests to
> > > > > > > > >> > > > the 'PUT /connectors/{connector}/fence' endpoint, or
> > > > > > > > >> > > > just that it won't control the timeout that we use
> > > > > > > > >> > > > for the call to Admin::fenceProducers?
> > > > > > > > >> > > >
> > > > > > > > >> > > > Cheers,
> > > > > > > > >> > > >
> > > > > > > > >> > > > Chris
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya
> > > > > > > > >> > > > <yash.mayya@gmail.com> wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Hi all,
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > I'd like to start a discussion thread on this
> > > > > > > > >> > > > > small KIP -
> > > > > > > > >> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > It proposes the addition of a new Kafka Connect
> > > > > > > > >> > > > > worker configuration to allow configuring REST
> > > > > > > > >> > > > > API request timeouts.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > Thanks,
> > > > > > > > >> > > > > Yash

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Greg Harris <gr...@aiven.io.INVALID>.
Yash,

1. I think that Request-Timeout is a fine choice. I did not see any
standard header for a use-case like this, so we are free to choose a name
for ourselves.

2. While such proposals do not exist, I don't think that means that we
should exclude them from consideration. If we do, then we run the risk of
making future implementations more complicated, impossible, or awkward to
use. This is a balance of course, as we cannot let future features paralyze
us from making improvements today. We cannot hold ourselves responsible for
handling unknown unknowns, but we can at least try to plan around known
unknowns.
Applying this to a hypothetical extension of timeouts to other endpoints:
my concern was primarily about whether this design applied everywhere felt
natural or awkward. In my opinion, a request parameter would draw a lot of
outsized attention in documentation, and be the first parameter for a lot
of the endpoints, while a header could be specified as a more obscure
global modifier to all requests.

3. 4. Applying the same standard to a hypothetical asynchronous I don't
believe it makes async validation substantially more complicated, except
the "request timeout interrupts connector creation" interaction. I also do
not think that it would make it impossible to implement an async validation
scheme in the future, since the synchronization is already in place, the
timeout is just different.
My chief concern was about the clarity of the API in the future when both
configurable timeouts and asynchronous APIs exist. Will the configurable
timeouts still make sense, and will they be a feature that people will use?
Are we expecting users to hold a single REST request open for 5, 20, or 60
minutes to accommodate a connector with an excessively long validation
step? Successfully validating a request requires the caller and the network
to remain stable for the entire duration of the request to have a chance of
success, and consumes (some small) amount of resources during that time.
And what if they decide they no longer want the validation result, and wish
to cancel the operation to free resources (both connect and external
system)? There is no standard mechanism to discard or cancel an HTTP
request (not even a connection loss).

You have said that the default timeout for other endpoints is more than
satisfactory for the typical user, and I agree. If other operations
(deleting, restarting, pause/unpause, etc) had a substantial synchronous
component that would benefit from a configurable timeout, I would be more
inclined to support configurable timeouts (all at once, or validation first
and other operations later).
But as it stands, I don't think I see the value-add of configurable
timeouts once there is another solution which targets the validate call
duration specifically, and I'm concerned that configurable timeouts would
be made irrelevant in that case.

Thanks,
Greg

On Wed, Mar 8, 2023 at 7:21 AM Yash Mayya <ya...@gmail.com> wrote:

> Hi Greg,
>
> Thanks for the response!
>
> 1. Hm that's a fair point, and while there aren't any strict guidelines or
> rules governing whether something should be a query parameter versus a
> request header, I agree that it might be more idiomatic for a request
> timeout value to be accepted via a custom request header. What do you
> think about using a header name like "X-Request-Timeout", or maybe simply
> "Request-Timeout" if we want to take into account
> https://www.rfc-editor.org/rfc/rfc6648?
>
> 2. 3. 4. Adding asynchronous validations via new endpoints / new request
> parameters on existing endpoints is definitely an interesting idea but I'm
> not sure whether this KIP should take into consideration a hypothetical
> future proposal? If there were already such a concrete proposal, I would
> definitely agree that this KIP should take that into consideration - but
> that isn't the case today. Furthermore, I'm not sure I understand why the
> addition of request timeout configurability to the existing endpoints would
> preclude the introduction of an asynchronous validation API in the future?
>
> Thanks,
> Yash
>
> On Sat, Mar 4, 2023 at 1:13 AM Greg Harris <gr...@aiven.io.invalid>
> wrote:
>
> > Yash,
> >
> > 1.
> > Currently the request parameters in the REST API are individual and
> pertain
> > to just one endpoint.
> > They also change the content of the query result, or change the action
> > taken on the cluster.
> > I think that a request timeout is a property of the HTTP request more
> than
> > it is a property of the cluster query or cluster action.
> > The two solutions have very similar tradeoffs, but I'm interested in
> > whether one is more idiomatic and more obvious to users.
> >
> > 2.
> > I understand that only these three endpoints are in need of increased
> > timeouts at this time due to long connector validations.
> > From another perspective, this change is making the API more irregular
> and
> > someone in the future might choose to make it more regular by
> standardizing
> > the configurable timeout functionality.
> > I wouldn't (in this KIP) dismiss someone's desire to configure other
> > timeouts in the future (in another KIP), and design them into a corner.
> > It is acceptable to limit the scope of this change to just the three
> > endpoints due to practical reasons, but I don't think that should prevent
> > us from trying to ensure that this design fits in the "end goal" state of
> > the Connect service.
> >
> > 3. 4.
> > I am not suggesting an incompatible change, as the current synchronous
> > behavior is still a useful API for certain situations. I think that it is
> > possible to add asynchronous validations in a backwards compatible way,
> > using new endpoints or other new request parameters.
> > The interface could be designed such that users with connectors that
> exceed
> > the synchronous timeouts can utilize the asynchronous API. Tooling can
> use
> > the asynchronous API when it is available, and fall back to the
> synchronous
> > API when it is not.
> > I think that it also may be more in-line with the design of the rest of
> the
> > REST API, where nearly every other request is asynchronous. That's why
> > you're only targeting these three endpoints, they're the only ones with a
> > synchronicity constraint.
> > Again, I'm not necessarily saying that you must implement such an
> > asynchronous validation scheme in this KIP, but we should consider if
> that
> > is a more extensible solution. If we decided to implement configurable
> > synchronous timeouts now, how would that complement an asynchronous API
> in
> > the future?
> >
> > On Thu, Mar 2, 2023 at 10:00 PM Yash Mayya <ya...@gmail.com> wrote:
> >
> > > Hi Greg,
> > >
> > > Thanks for taking a look!
> > >
> > > 1. I believe Chris suggested the use of a query parameter above as we
> > > already have precedents for using query parameters to configure per
> > request
> > > behavior in Kafka Connect (the restart connectors API and the get
> > > connectors API for instance). Also, the limited choice of endpoints
> > > targeted is intentional (see my reply to the next point).
> > >
> > > 2. I intentionally targeted just the three listed endpoints where
> > > synchronous connector config validations come into the picture. This is
> > > because of the legitimate cases where config validation for specific
> > > connector plugins might exceed the default request timeout in edge case
> > > scenarios (outlined in the KIP's motivation section). Other Connect
> REST
> > > endpoints shouldn't be taking longer than the default 90 second request
> > > timeout; if they do so, it would either be indicative of a bug in the
> > > Connect framework or a cluster health issue - neither of which should
> be
> > > covered up by manually setting a longer request timeout.
> > >
> > > 3. 4. I think changing the config validation behavior would be a
> backward
> > > incompatible change and I wanted to avoid that in this particular KIP.
> > > There are multiple programmatic UIs out there which rely on the current
> > > synchronous config validation behavior and breaking the existing
> contract
> > > would definitely require a larger discussion.
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Fri, Mar 3, 2023 at 12:04 AM Greg Harris
> <greg.harris@aiven.io.invalid
> > >
> > > wrote:
> > >
> > > > Hey Yash,
> > > >
> > > > Thanks for the KIP, and sorry for the late review.
> > > >
> > > > 1. Have you considered a HTTP header to provide the
> client-configurable
> > > > timeout? A header might more naturally extend to all of the other
> > > endpoints
> > > > in the future, rather than duplicating the query parameter across
> > > > endpoints.
> > > >
> > > > 2. I understand that this change is targeted at just long duration
> > > > Connector::validation calls, is that due to voluntary scope
> > constraints?
> > > > Implementing configurable timeouts for all endpoints in a uniform way
> > > could
> > > > be desirable, even if the default timeout will work for nearly all of
> > the
> > > > other calls.
> > > >
> > > > 3. Did you consider adding asynchronous validation as a user-facing
> > > > feature? I think that relying on the synchronous validation results
> in
> > a
> > > > single HTTP request is a bit of an anti-pattern, and one that we've
> > > > inherited from the original REST design. It seems useful when using
> the
> > > > REST API by hand, but seems to be a liability when used in
> environments
> > > > with an external management layer.
> > > >
> > > > 4. Perhaps rather than allowing synchronous calls to
> Connector:validate
> > > to
> > > > increase in duration, we should provide a way for connectors to
> surface
> > > > validation problems later in their lifecycle. Currently there is the
> > > > ConnectorContext::raiseError that transitions the task to FAILED,
> > > perhaps a
> > > > similar API could asynchronously emit re-validation results. We've
> also
> > > had
> > > > a problem with long start() duration for the same reasons as
> > validate().
> > > >
> > > > I understand if you want to keep this KIP as tightly focused as
> > possible,
> > > > but i'm worried that it is addressing the symptom and not the
> problem.
> > I
> > > > want to make sure that this change is impactful and isn't obsoleted
> by
> > a
> > > > later improvement.
> > > >
> > > > Thanks,
> > > > Greg
> > > >
> > > >
> > > > On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <ya...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Thanks for all the feedback and discussion. I've renamed the KIP
> > title
> > > to
> > > > > "Kafka Connect REST API timeout improvements" since we're
> > introducing a
> > > > > couple of improvements (cancelling create / update connector
> requests
> > > > when
> > > > > config validations timeout and avoiding double config validations
> in
> > > > > distributed mode) along with making the request timeouts
> > configurable.
> > > > The
> > > > > new KIP link is
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > > > > and I've called for a vote for the KIP here -
> > > > > https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.
> > > > >
> > > > > Thanks,
> > > > > Yash
> > > > >
> > > > > On Sat, Nov 5, 2022 at 11:42 PM Sagar <sa...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hey Yash,
> > > > > >
> > > > > > Thanks for the explanation. I think it should be fine to delegate
> > the
> > > > > > validation directly to the leader.
> > > > > >
> > > > > > Thanks!
> > > > > > Sagar.
> > > > > >
> > > > > > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <yash.mayya@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Sagar,
> > > > > > >
> > > > > > > Thanks for chiming in!
> > > > > > >
> > > > > > > > Having said that, why does the worker forward to the
> > > > > > > > leader? I am thinking if the worker can perform the
> validation
> > on
> > > > > it's
> > > > > > > own,
> > > > > > > > we could let it do the validation instead of forwarding
> > > everything
> > > > to
> > > > > > the
> > > > > > > > leader
> > > > > > >
> > > > > > > Only the leader is allowed to perform writes to the config
> topic,
> > > so
> > > > > any
> > > > > > > request that requires a config topic write ends up being
> > forwarded
> > > to
> > > > > the
> > > > > > > leader. The `POST /connectors` and `PUT
> > > > /connectors/{connector}/config`
> > > > > > > endpoints call `Herder::putConnectorConfig` which internally
> > does a
> > > > > > config
> > > > > > > validation first before initiating another herder request to
> > write
> > > to
> > > > > the
> > > > > > > config topic (in distributed mode). If we want to allow the
> first
> > > > > worker
> > > > > > to
> > > > > > > do the config validation, and then forward the request to the
> > > leader
> > > > > just
> > > > > > > for the write to the config topic, we'd either need something
> > like
> > > a
> > > > > skip
> > > > > > > validations query parameter on the endpoint like Chris talks
> > about
> > > > > above
> > > > > > or
> > > > > > > else a new internal only endpoint that just does a write to the
> > > > config
> > > > > > > topic without any prior config validation. However, we agreed
> > that
> > > > this
> > > > > > > optimization doesn't really seem necessary for now and can be
> > done
> > > > > later
> > > > > > if
> > > > > > > deemed useful. I'd be happy to hear differing thoughts if any,
> > > > however.
> > > > > > >
> > > > > > > > I think a bound is certainly needed but IMO it shouldn't go
> > > beyond
> > > > > > > > 10 mins considering this is just validation
> > > > > > >
> > > > > > > Yeah, I agree that this seems like a fair value - I've elected
> to
> > > go
> > > > > > with a
> > > > > > > default value of 10 minutes for the proposed worker
> configuration
> > > > that
> > > > > > sets
> > > > > > > an upper bound for the timeout query parameter.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yash
> > > > > > >
> > > > > > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <
> yash.mayya@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Chris,
> > > > > > > >
> > > > > > > > Thanks again for your feedback. I think a worker
> configuration
> > > for
> > > > > the
> > > > > > > > upper bound makes sense - I initially thought we could
> hardcode
> > > it
> > > > > > (just
> > > > > > > > like the current request timeout is), but there's no reason
> to
> > > set
> > > > > > > another
> > > > > > > > artificial bound that isn't user configurable which is
> exactly
> > > what
> > > > > > we're
> > > > > > > > trying to change here in the first place. I've updated the
> KIP
> > > > based
> > > > > on
> > > > > > > all
> > > > > > > > our discussion above.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yash
> > > > > > > >
> > > > > > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <
> > sagarmeansocean@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hey Yash,
> > > > > > > >>
> > > > > > > >> Thanks for the KIP! This looks like a useful feature.
> > > > > > > >>
> > > > > > > >> I think the discussion thread already has some great points
> by
> > > > > Chris.
> > > > > > > Just
> > > > > > > >> a couple of points/clarifications=>
> > > > > > > >>
> > > > > > > >> Regarding, pt#2 , I guess it might be better to forward to
> the
> > > > > leader
> > > > > > as
> > > > > > > >> suggested by Yash. Having said that, why does the worker
> > forward
> > > > to
> > > > > > the
> > > > > > > >> leader? I am thinking if the worker can perform the
> validation
> > > on
> > > > > it's
> > > > > > > >> own,
> > > > > > > >> we could let it do the validation instead of forwarding
> > > everything
> > > > > to
> > > > > > > the
> > > > > > > >> leader(even though it might be cheap to forward all requests
> > to
> > > > the
> > > > > > > >> leader).
> > > > > > > >>
> > > > > > > >> Pt#3 => I think a bound is certainly needed but IMO it
> > shouldn't
> > > > go
> > > > > > > beyond
> > > > > > > >> 10 mins considering this is just validation. We shouldn't
> end
> > up
> > > > in
> > > > > a
> > > > > > > >> situation where a few faulty connectors end up blocking a
> lot
> > of
> > > > > > request
> > > > > > > >> processing threads, so while increasing the config is
> > certainly
> > > > > > helpful,
> > > > > > > >> we
> > > > > > > >> shouldn't set too high a value IMO. Of course I am also open
> > to
> > > > > > > >> suggestions
> > > > > > > >> here.
> > > > > > > >>
> > > > > > > >> Thanks!
> > > > > > > >> Sagar.
> > > > > > > >>
> > > > > > > >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton
> > > > > <chrise@aiven.io.invalid
> > > > > > >
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hi Yash,
> > > > > > > >> >
> > > > > > > >> > RE 2: That's a great point about validations already being
> > > > > performed
> > > > > > > by
> > > > > > > >> the
> > > > > > > >> > leader. For completeness's sake, I'd like to note that
> this
> > > only
> > > > > > holds
> > > > > > > >> for
> > > > > > > >> > valid configurations; invalid ones are caught right now
> > before
> > > > > being
> > > > > > > >> > forwarded to the leader. Still, I think it's fine to
> forward
> > > to
> > > > > the
> > > > > > > >> leader
> > > > > > > >> > for now and optimize further in the future if necessary.
> If
> > > > > frequent
> > > > > > > >> > validations are taking place they should be conducted via
> > the
> > > > `PUT
> > > > > > > >> > /connector-plugins/{pluginName}/config/validate` endpoint,
> > > which
> > > > > > won't
> > > > > > > >> do
> > > > > > > >> > any forwarding at all.
> > > > > > > >> >
> > > > > > > >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the
> > > timeout
> > > > > also
> > > > > > > >> seem
> > > > > > > >> > reasonable... maybe a low-importance worker property could
> > > work
> > > > > for
> > > > > > > >> that?
> > > > > > > >> > Not sure what would make sense for a default; probably
> > > somewhere
> > > > > in
> > > > > > > the
> > > > > > > >> > 10-60 minute range but would be interested in others'
> > > thoughts.
> > > > > > > >> >
> > > > > > > >> > Thanks for the clarification on the zombie fencing logic.
> I
> > > > think
> > > > > we
> > > > > > > >> might
> > > > > > > >> > want to have some more subtle logic around the interaction
> > > > between
> > > > > > > >> calls to
> > > > > > > >> > Admin::fenceProducers and a worker-level timeout property
> if
> > > we
> > > > go
> > > > > > > that
> > > > > > > >> > route, but we can cross that particular bridge if we get
> > back
> > > to
> > > > > it.
> > > > > > > >> >
> > > > > > > >> > Cheers,
> > > > > > > >> >
> > > > > > > >> > Chris
> > > > > > > >> >
> > > > > > > >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <
> > > yash.mayya@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi Chris,
> > > > > > > >> > >
> > > > > > > >> > > Thanks a lot for the super quick response and the great
> > > > > feedback!
> > > > > > > >> > >
> > > > > > > >> > > 1. I think that makes a lot of sense, and I'd be happy
> to
> > > > update
> > > > > > the
> > > > > > > >> KIP
> > > > > > > >> > to
> > > > > > > >> > > include this change in the scope. The current behavior
> > where
> > > > the
> > > > > > API
> > > > > > > >> > > response indicates a time out but the connector is
> > > > > created/updated
> > > > > > > >> > > eventually anyway can be pretty confusing and is
> generally
> > > > not a
> > > > > > > good
> > > > > > > >> > user
> > > > > > > >> > > experience IMO.
> > > > > > > >> > >
> > > > > > > >> > > 2. Wow, thanks for pointing this out - it's a really
> good
> > > > catch
> > > > > > and
> > > > > > > >> > > something I hadn't noticed was happening with the
> current
> > > > > > > >> implementation.
> > > > > > > >> > > While I do like the idea of having a query parameter
> that
> > > > > > determines
> > > > > > > >> > > whether validations can be skipped, I'm wondering if it
> > > might
> > > > > not
> > > > > > be
> > > > > > > >> > easier
> > > > > > > >> > > and cleaner to just do the leader check earlier and
> avoid
> > > > doing
> > > > > > the
> > > > > > > >> > > unnecessary config validation on the first worker? Since
> > > each
> > > > > > config
> > > > > > > >> > > validation happens on its own thread, I'm not so sure
> > about
> > > > the
> > > > > > > >> concern
> > > > > > > >> > of
> > > > > > > >> > > overloading the leader even on larger clusters,
> especially
> > > > since
> > > > > > > >> > > validations aren't typically long running operations.
> > > > > Furthermore,
> > > > > > > >> even
> > > > > > > >> > > with the current implementation, the leader will always
> be
> > > > > doing a
> > > > > > > >> config
> > > > > > > >> > > validation for connector create / update REST API
> requests
> > > on
> > > > > any
> > > > > > > >> worker.
> > > > > > > >> > >
> > > > > > > >> > > 3. That's a good point, and this way we can also
> restrict
> > > the
> > > > > APIs
> > > > > > > >> whose
> > > > > > > >> > > timeouts are configurable - I'm thinking `PUT
> > > > > > > >> > > /connector-plugins/{pluginName}/config/validate`, `POST
> > > > > > /connectors`
> > > > > > > >> and
> > > > > > > >> > > `PUT /connectors/{connector}/config` are the ones where
> > > such a
> > > > > > > timeout
> > > > > > > >> > > parameter could be useful. Also, do you think we should
> > > > enforce
> > > > > > some
> > > > > > > >> > > reasonable bounds for the timeout config?
> > > > > > > >> > >
> > > > > > > >> > > On the zombie fencing point, the implication was that
> the
> > > new
> > > > > > worker
> > > > > > > >> > > property would not control the timeout used for the call
> > to
> > > > > > > >> > > Admin::fenceProducers. However, if we go with a timeout
> > > query
> > > > > > > >> parameter
> > > > > > > >> > > approach, even the timeout for the `PUT
> > > > > > > /connectors/{connector}/fence'
> > > > > > > >> > > endpoint will remain unaffected.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Yash
> > > > > > > >> > >
> > > > > > > >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton
> > > > > > > <chrise@aiven.io.invalid
> > > > > > > >> >
> > > > > > > >> > > wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Hi Yash,
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks for the KIP. It's a nice, focused change.
> > > Initially I
> > > > > was
> > > > > > > >> > hesitant
> > > > > > > >> > > > to support cases where connector validation takes this
> > > long,
> > > > > but
> > > > > > > >> > > > considering the alternative is that we give users a
> 500
> > > > error
> > > > > > > >> response
> > > > > > > >> > > but
> > > > > > > >> > > > leave the request to create/modify the connector
> queued
> > up
> > > > in
> > > > > > the
> > > > > > > >> > > herder, I
> > > > > > > >> > > > think I can get behind the motivation here. There's
> also
> > > an
> > > > > > > >> argument to
> > > > > > > >> > > be
> > > > > > > >> > > > made about keeping Kafka Connect available even when
> the
> > > > > systems
> > > > > > > >> that
> > > > > > > >> > it
> > > > > > > >> > > > connects to are in a degraded state.
> > > > > > > >> > > >
> > > > > > > >> > > > I have a few alternatives I'd be interested in your
> > > thoughts
> > > > > on:
> > > > > > > >> > > >
> > > > > > > >> > > > 1. Since the primary concern here seems to be that
> > custom
> > > > > > > connector
> > > > > > > >> > > > validation logic can take too long, do we have any
> > > thoughts
> > > > on
> > > > > > > >> adding
> > > > > > > >> > > logic
> > > > > > > >> > > > to check for request timeout after validation has
> > > completed
> > > > > and,
> > > > > > > if
> > > > > > > >> it
> > > > > > > >> > > has,
> > > > > > > >> > > > aborting the attempt to create/modify the connector?
> > > > > > > >> > > >
> > > > > > > >> > > > 2. Right now it's possible that we'll perform two
> > > connector
> > > > > > config
> > > > > > > >> > > > validations per create/modify request; once on the
> > worker
> > > > that
> > > > > > > >> > initially
> > > > > > > >> > > > receives the request, and then again if that worker is
> > not
> > > > the
> > > > > > > >> leader
> > > > > > > >> > of
> > > > > > > >> > > > the cluster and has to forward the request to the
> > leader.
> > > > Any
> > > > > > > >> thoughts
> > > > > > > >> > on
> > > > > > > >> > > > optimizing this to only require a single validation
> per
> > > > > request?
> > > > > > > We
> > > > > > > >> > > > probably wouldn't want to force all validations to
> take
> > > > place
> > > > > on
> > > > > > > the
> > > > > > > >> > > leader
> > > > > > > >> > > > (could lead to overloading it pretty quickly in large
> > > > > clusters),
> > > > > > > >> but we
> > > > > > > >> > > > could add an internal-only query parameter to skip
> > > > validation
> > > > > > and
> > > > > > > >> then
> > > > > > > >> > > use
> > > > > > > >> > > > that parameter when forwarding requests from followers
> > to
> > > > the
> > > > > > > >> leader.
> > > > > > > >> > > >
> > > > > > > >> > > > 3. A worker property is pretty coarse-grained, and
> > > difficult
> > > > > to
> > > > > > > >> change.
> > > > > > > >> > > We
> > > > > > > >> > > > might allow per-request toggling of the timeout by
> > adding
> > > a
> > > > > URL
> > > > > > > >> query
> > > > > > > >> > > > parameter like '?timeout=90s' to the REST API to allow
> > > > > tweaking
> > > > > > of
> > > > > > > >> the
> > > > > > > >> > > > timeout on a more granular basis, and without having
> to
> > > > > perform
> > > > > > a
> > > > > > > >> > worker
> > > > > > > >> > > > restart.
> > > > > > > >> > > >
> > > > > > > >> > > > I'd also like to clarify a point about the rejected
> > > > > alternative
> > > > > > > >> "Allow
> > > > > > > >> > > > configuring producer zombie fencing admin request
> > > > timeout"--is
> > > > > > the
> > > > > > > >> > > > implication here that the "
> rest.api.request.timeout.ms"
> > > > > > property
> > > > > > > >> will
> > > > > > > >> > > not
> > > > > > > >> > > > control the REST timeout for requests to the 'PUT
> > > > > > > >> > > > /connectors/{connector}/fence' endpoint, or just that
> it
> > > > won't
> > > > > > > >> control
> > > > > > > >> > > the
> > > > > > > >> > > > timeout that we use for the call to
> > Admin::fenceProducers?
> > > > > > > >> > > >
> > > > > > > >> > > > Cheers,
> > > > > > > >> > > >
> > > > > > > >> > > > Chris
> > > > > > > >> > > >
> > > > > > > >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <
> > > > > > yash.mayya@gmail.com>
> > > > > > > >> > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi all,
> > > > > > > >> > > > >
> > > > > > > >> > > > > I'd like to start a discussion thread on this small
> > KIP
> > > -
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
> > > > > > > >> > > > >
> > > > > > > >> > > > > It proposes the addition of a new Kafka Connect
> worker
> > > > > > > >> configuration
> > > > > > > >> > to
> > > > > > > >> > > > > allow configuring REST API request timeouts.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks,
> > > > > > > >> > > > > Yash
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Yash Mayya <ya...@gmail.com>.
Hi Greg,

Thanks for the response!

1. Hm, that's a fair point - while there aren't any strict guidelines or
rules governing whether something should be a query parameter versus a
request header, I agree that it might be more idiomatic for a request
timeout value to be accepted via a custom request header. What do you
think about using a header name like "X-Request-Timeout", or maybe simply
"Request-Timeout" if we want to take into account
https://www.rfc-editor.org/rfc/rfc6648 (which deprecates the "X-" prefix
for application protocol parameters)?
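
As a minimal sketch of what the client side would look like - assuming the
header ends up being named "Request-Timeout" and takes a value in
milliseconds, and using a placeholder worker URL, plugin name, and config -
a validation request with a five minute timeout could be issued like this:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ValidateWithTimeout {
        public static void main(String[] args) throws Exception {
            // Placeholder connector config, for illustration only
            String configJson =
                "{\"connector.class\": \"com.example.MySinkConnector\", \"topics\": \"test\"}";
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                // Placeholder worker URL and plugin name
                .uri(URI.create(
                    "http://localhost:8083/connector-plugins/MySinkConnector/config/validate"))
                .header("Content-Type", "application/json")
                // Proposed per-request timeout header; the name and the
                // millisecond units are assumptions pending this discussion
                .header("Request-Timeout", "300000")
                .PUT(HttpRequest.BodyPublishers.ofString(configJson))
                .build();
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
        }
    }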

2. 3. 4. Adding asynchronous validations via new endpoints / new request
parameters on existing endpoints is definitely an interesting idea, but I'm
not sure whether this KIP should take a hypothetical future proposal into
consideration. If there were already such a concrete proposal, I would
definitely agree that this KIP should account for it - but that isn't the
case today. Furthermore, I'm not sure I understand why the addition of
request timeout configurability to the existing endpoints would preclude
the introduction of an asynchronous validation API in the future?

Thanks,
Yash

On Sat, Mar 4, 2023 at 1:13 AM Greg Harris <gr...@aiven.io.invalid>
wrote:

> Yash,
>
> 1.
> Currently the request parameters in the REST API are individual and pertain
> to just one endpoint.
> They also change the content of the query result, or change the action
> taken on the cluster.
> I think that a request timeout is a property of the HTTP request more than
> it is a property of the cluster query or cluster action.
> The two solutions have very similar tradeoffs, but I'm interested in
> whether one is more idiomatic and more obvious to users.
>
> 2.
> I understand that only these three endpoints are in need of increased
> timeouts at this time due to long connector validations.
> From another perspective, this change is making the API more irregular and
> someone in the future might choose to make it more regular by standardizing
> the configurable timeout functionality.
> I wouldn't (in this KIP) dismiss someone's desire to configure other
> timeouts in the future (in another KIP), and design them into a corner.
> It is acceptable to limit the scope of this change to just the three
> endpoints due to practical reasons, but I don't think that should prevent
> us from trying to ensure that this design fits in the "end goal" state of
> the Connect service.
>
> 3. 4.
> I am not suggesting an incompatible change, as the current synchronous
> behavior is still a useful API for certain situations. I think that it is
> possible to add asynchronous validations in a backwards compatible way,
> using new endpoints or other new request parameters.
> The interface could be designed such that users with connectors that exceed
> the synchronous timeouts can utilize the asynchronous API. Tooling can use
> the asynchronous API when it is available, and fall back to the synchronous
> API when it is not.
> I think that it also may be more in-line with the design of the rest of the
> REST API, where nearly every other request is asynchronous. That's why
> you're only targeting these three endpoints, they're the only ones with a
> synchronicity constraint.
> Again, I'm not necessarily saying that you must implement such an
> asynchronous validation scheme in this KIP, but we should consider if that
> is a more extensible solution. If we decided to implement configurable
> synchronous timeouts now, how would that complement an asynchronous API in
> the future?
>
> On Thu, Mar 2, 2023 at 10:00 PM Yash Mayya <ya...@gmail.com> wrote:
>
> > Hi Greg,
> >
> > Thanks for taking a look!
> >
> > 1. I believe Chris suggested the use of a query parameter above as we
> > already have precedents for using query parameters to configure per
> request
> > behavior in Kafka Connect (the restart connectors API and the get
> > connectors API for instance). Also, the limited choice of endpoints
> > targeted is intentional (see my reply to the next point).
> >
> > 2. I intentionally targeted just the three listed endpoints where
> > synchronous connector config validations come into the picture. This is
> > because of the legitimate cases where config validation for specific
> > connector plugins might exceed the default request timeout in edge case
> > scenarios (outlined in the KIP's motivation section). Other Connect REST
> > endpoints shouldn't be taking longer than the default 90 second request
> > timeout; if they do so, it would either be indicative of a bug in the
> > Connect framework or a cluster health issue - neither of which should be
> > covered up by manually setting a longer request timeout.
> >
> > 3. 4. I think changing the config validation behavior would be a backward
> > incompatible change and I wanted to avoid that in this particular KIP.
> > There are multiple programmatic UIs out there which rely on the current
> > synchronous config validation behavior and breaking the existing contract
> > would definitely require a larger discussion.
> >
> > Thanks,
> > Yash
> >
> > On Fri, Mar 3, 2023 at 12:04 AM Greg Harris <greg.harris@aiven.io.invalid
> >
> > wrote:
> >
> > > Hey Yash,
> > >
> > > Thanks for the KIP, and sorry for the late review.
> > >
> > > 1. Have you considered a HTTP header to provide the client-configurable
> > > timeout? A header might more naturally extend to all of the other
> > endpoints
> > > in the future, rather than duplicating the query parameter across
> > > endpoints.
> > >
> > > 2. I understand that this change is targeted at just long duration
> > > Connector::validation calls, is that due to voluntary scope
> constraints?
> > > Implementing configurable timeouts for all endpoints in a uniform way
> > could
> > > be desirable, even if the default timeout will work for nearly all of
> the
> > > other calls.
> > >
> > > 3. Did you consider adding asynchronous validation as a user-facing
> > > feature? I think that relying on the synchronous validation results in
> a
> > > single HTTP request is a bit of an anti-pattern, and one that we've
> > > inherited from the original REST design. It seems useful when using the
> > > REST API by hand, but seems to be a liability when used in environments
> > > with an external management layer.
> > >
> > > 4. Perhaps rather than allowing synchronous calls to Connector:validate
> > to
> > > increase in duration, we should provide a way for connectors to surface
> > > validation problems later in their lifecycle. Currently there is the
> > > ConnectorContext::raiseError that transitions the task to FAILED,
> > perhaps a
> > > similar API could asynchronously emit re-validation results. We've also
> > had
> > > a problem with long start() duration for the same reasons as
> validate().
> > >
> > > I understand if you want to keep this KIP as tightly focused as
> possible,
> > > but i'm worried that it is addressing the symptom and not the problem.
> I
> > > want to make sure that this change is impactful and isn't obsoleted by
> a
> > > later improvement.
> > >
> > > Thanks,
> > > Greg
> > >
> > >
> > > On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <ya...@gmail.com>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thanks for all the feedback and discussion. I've renamed the KIP
> title
> > to
> > > > "Kafka Connect REST API timeout improvements" since we're
> introducing a
> > > > couple of improvements (cancelling create / update connector requests
> > > when
> > > > config validations timeout and avoiding double config validations in
> > > > distributed mode) along with making the request timeouts
> configurable.
> > > The
> > > > new KIP link is
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > > > and I've called for a vote for the KIP here -
> > > > https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Sat, Nov 5, 2022 at 11:42 PM Sagar <sa...@gmail.com>
> > wrote:
> > > >
> > > > > Hey Yash,
> > > > >
> > > > > Thanks for the explanation. I think it should be fine to delegate
> the
> > > > > validation directly to the leader.
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > > > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <ya...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Sagar,
> > > > > >
> > > > > > Thanks for chiming in!
> > > > > >
> > > > > > > Having said that, why does the worker forward to the
> > > > > > > leader? I am thinking if the worker can perform the validation
> on
> > > > it's
> > > > > > own,
> > > > > > > we could let it do the validation instead of forwarding
> > everything
> > > to
> > > > > the
> > > > > > > leader
> > > > > >
> > > > > > Only the leader is allowed to perform writes to the config topic,
> > so
> > > > any
> > > > > > request that requires a config topic write ends up being
> forwarded
> > to
> > > > the
> > > > > > leader. The `POST /connectors` and `PUT
> > > /connectors/{connector}/config`
> > > > > > endpoints call `Herder::putConnectorConfig` which internally
> does a
> > > > > config
> > > > > > validation first before initiating another herder request to
> write
> > to
> > > > the
> > > > > > config topic (in distributed mode). If we want to allow the first
> > > > worker
> > > > > to
> > > > > > do the config validation, and then forward the request to the
> > leader
> > > > just
> > > > > > for the write to the config topic, we'd either need something
> like
> > a
> > > > skip
> > > > > > validations query parameter on the endpoint like Chris talks
> about
> > > > above
> > > > > or
> > > > > > else a new internal only endpoint that just does a write to the
> > > config
> > > > > > topic without any prior config validation. However, we agreed
> that
> > > this
> > > > > > optimization doesn't really seem necessary for now and can be
> done
> > > > later
> > > > > if
> > > > > > deemed useful. I'd be happy to hear differing thoughts if any,
> > > however.
> > > > > >
> > > > > > > I think a bound is certainly needed but IMO it shouldn't go
> > beyond
> > > > > > > 10 mins considering this is just validation
> > > > > >
> > > > > > Yeah, I agree that this seems like a fair value - I've elected to
> > go
> > > > > with a
> > > > > > default value of 10 minutes for the proposed worker configuration
> > > that
> > > > > sets
> > > > > > an upper bound for the timeout query parameter.
> > > > > >
> > > > > > Thanks,
> > > > > > Yash
> > > > > >
> > > > > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <yash.mayya@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Chris,
> > > > > > >
> > > > > > > Thanks again for your feedback. I think a worker configuration
> > for
> > > > the
> > > > > > > upper bound makes sense - I initially thought we could hardcode
> > it
> > > > > (just
> > > > > > > like the current request timeout is), but there's no reason to
> > set
> > > > > > another
> > > > > > > artificial bound that isn't user configurable which is exactly
> > what
> > > > > we're
> > > > > > > trying to change here in the first place. I've updated the KIP
> > > based
> > > > on
> > > > > > all
> > > > > > > our discussion above.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yash
> > > > > > >
> > > > > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <
> sagarmeansocean@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > >> Hey Yash,
> > > > > > >>
> > > > > > >> Thanks for the KIP! This looks like a useful feature.
> > > > > > >>
> > > > > > >> I think the discussion thread already has some great points by
> > > > Chris.
> > > > > > Just
> > > > > > >> a couple of points/clarifications=>
> > > > > > >>
> > > > > > >> Regarding, pt#2 , I guess it might be better to forward to the
> > > > leader
> > > > > as
> > > > > > >> suggested by Yash. Having said that, why does the worker
> forward
> > > to
> > > > > the
> > > > > > >> leader? I am thinking if the worker can perform the validation
> > on
> > > > it's
> > > > > > >> own,
> > > > > > >> we could let it do the validation instead of forwarding
> > everything
> > > > to
> > > > > > the
> > > > > > >> leader(even though it might be cheap to forward all requests
> to
> > > the
> > > > > > >> leader).
> > > > > > >>
> > > > > > >> Pt#3 => I think a bound is certainly needed but IMO it
> shouldn't
> > > go
> > > > > > beyond
> > > > > > >> 10 mins considering this is just validation. We shouldn't end
> up
> > > in
> > > > a
> > > > > > >> situation where a few faulty connectors end up blocking a lot
> of
> > > > > request
> > > > > > >> processing threads, so while increasing the config is
> certainly
> > > > > helpful,
> > > > > > >> we
> > > > > > >> shouldn't set too high a value IMO. Of course I am also open
> to
> > > > > > >> suggestions
> > > > > > >> here.
> > > > > > >>
> > > > > > >> Thanks!
> > > > > > >> Sagar.
> > > > > > >>
> > > > > > >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton
> > > > <chrise@aiven.io.invalid
> > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hi Yash,
> > > > > > >> >
> > > > > > >> > RE 2: That's a great point about validations already being
> > > > performed
> > > > > > by
> > > > > > >> the
> > > > > > >> > leader. For completeness's sake, I'd like to note that this
> > only
> > > > > holds
> > > > > > >> for
> > > > > > >> > valid configurations; invalid ones are caught right now
> before
> > > > being
> > > > > > >> > forwarded to the leader. Still, I think it's fine to forward
> > to
> > > > the
> > > > > > >> leader
> > > > > > >> > for now and optimize further in the future if necessary. If
> > > > frequent
> > > > > > >> > validations are taking place they should be conducted via
> the
> > > `PUT
> > > > > > >> > /connector-plugins/{pluginName}/config/validate` endpoint,
> > which
> > > > > won't
> > > > > > >> do
> > > > > > >> > any forwarding at all.
> > > > > > >> >
> > > > > > >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the
> > timeout
> > > > also
> > > > > > >> seem
> > > > > > >> > reasonable... maybe a low-importance worker property could
> > work
> > > > for
> > > > > > >> that?
> > > > > > >> > Not sure what would make sense for a default; probably
> > somewhere
> > > > in
> > > > > > the
> > > > > > >> > 10-60 minute range but would be interested in others'
> > thoughts.
> > > > > > >> >
> > > > > > >> > Thanks for the clarification on the zombie fencing logic. I
> > > think
> > > > we
> > > > > > >> might
> > > > > > >> > want to have some more subtle logic around the interaction
> > > between
> > > > > > >> calls to
> > > > > > >> > Admin::fenceProducers and a worker-level timeout property if
> > we
> > > go
> > > > > > that
> > > > > > >> > route, but we can cross that particular bridge if we get
> back
> > to
> > > > it.
> > > > > > >> >
> > > > > > >> > Cheers,
> > > > > > >> >
> > > > > > >> > Chris
> > > > > > >> >
> > > > > > >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <
> > yash.mayya@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >> >
> > > > > > >> > > Hi Chris,
> > > > > > >> > >
> > > > > > >> > > Thanks a lot for the super quick response and the great
> > > > feedback!
> > > > > > >> > >
> > > > > > >> > > 1. I think that makes a lot of sense, and I'd be happy to
> > > update
> > > > > the
> > > > > > >> KIP
> > > > > > >> > to
> > > > > > >> > > include this change in the scope. The current behavior
> where
> > > the
> > > > > API
> > > > > > >> > > response indicates a time out but the connector is
> > > > created/updated
> > > > > > >> > > eventually anyway can be pretty confusing and is generally
> > > not a
> > > > > > good
> > > > > > >> > user
> > > > > > >> > > experience IMO.
> > > > > > >> > >
> > > > > > >> > > 2. Wow, thanks for pointing this out - it's a really good
> > > catch
> > > > > and
> > > > > > >> > > something I hadn't noticed was happening with the current
> > > > > > >> implementation.
> > > > > > >> > > While I do like the idea of having a query parameter that
> > > > > determines
> > > > > > >> > > whether validations can be skipped, I'm wondering if it
> > might
> > > > not
> > > > > be
> > > > > > >> > easier
> > > > > > >> > > and cleaner to just do the leader check earlier and avoid
> > > doing
> > > > > the
> > > > > > >> > > unnecessary config validation on the first worker? Since
> > each
> > > > > config
> > > > > > >> > > validation happens on its own thread, I'm not so sure
> about
> > > the
> > > > > > >> concern
> > > > > > >> > of
> > > > > > >> > > overloading the leader even on larger clusters, especially
> > > since
> > > > > > >> > > validations aren't typically long running operations.
> > > > Furthermore,
> > > > > > >> even
> > > > > > >> > > with the current implementation, the leader will always be
> > > > doing a
> > > > > > >> config
> > > > > > >> > > validation for connector create / update REST API requests
> > on
> > > > any
> > > > > > >> worker.
> > > > > > >> > >
> > > > > > >> > > 3. That's a good point, and this way we can also restrict
> > the
> > > > APIs
> > > > > > >> whose
> > > > > > >> > > timeouts are configurable - I'm thinking `PUT
> > > > > > >> > > /connector-plugins/{pluginName}/config/validate`, `POST
> > > > > /connectors`
> > > > > > >> and
> > > > > > >> > > `PUT /connectors/{connector}/config` are the ones where
> > such a
> > > > > > timeout
> > > > > > >> > > parameter could be useful. Also, do you think we should
> > > enforce
> > > > > some
> > > > > > >> > > reasonable bounds for the timeout config?
> > > > > > >> > >
> > > > > > >> > > On the zombie fencing point, the implication was that the
> > new
> > > > > worker
> > > > > > >> > > property would not control the timeout used for the call
> to
> > > > > > >> > > Admin::fenceProducers. However, if we go with a timeout
> > query
> > > > > > >> parameter
> > > > > > >> > > approach, even the timeout for the `PUT
> > > > > > /connectors/{connector}/fence'
> > > > > > >> > > endpoint will remain unaffected.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Yash
> > > > > > >> > >
> > > > > > >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton
> > > > > > <chrise@aiven.io.invalid
> > > > > > >> >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Hi Yash,
> > > > > > >> > > >
> > > > > > >> > > > Thanks for the KIP. It's a nice, focused change.
> > Initially I
> > > > was
> > > > > > >> > hesitant
> > > > > > >> > > > to support cases where connector validation takes this
> > long,
> > > > but
> > > > > > >> > > > considering the alternative is that we give users a 500
> > > error
> > > > > > >> response
> > > > > > >> > > but
> > > > > > >> > > > leave the request to create/modify the connector queued
> up
> > > in
> > > > > the
> > > > > > >> > > herder, I
> > > > > > >> > > > think I can get behind the motivation here. There's also
> > an
> > > > > > >> argument to
> > > > > > >> > > be
> > > > > > >> > > > made about keeping Kafka Connect available even when the
> > > > systems
> > > > > > >> that
> > > > > > >> > it
> > > > > > >> > > > connects to are in a degraded state.
> > > > > > >> > > >
> > > > > > >> > > > I have a few alternatives I'd be interested in your
> > thoughts
> > > > on:
> > > > > > >> > > >
> > > > > > >> > > > 1. Since the primary concern here seems to be that
> custom
> > > > > > connector
> > > > > > >> > > > validation logic can take too long, do we have any
> > thoughts
> > > on
> > > > > > >> adding
> > > > > > >> > > logic
> > > > > > >> > > > to check for request timeout after validation has
> > completed
> > > > and,
> > > > > > if
> > > > > > >> it
> > > > > > >> > > has,
> > > > > > >> > > > aborting the attempt to create/modify the connector?
> > > > > > >> > > >
> > > > > > >> > > > 2. Right now it's possible that we'll perform two
> > connector
> > > > > config
> > > > > > >> > > > validations per create/modify request; once on the
> worker
> > > that
> > > > > > >> > initially
> > > > > > >> > > > receives the request, and then again if that worker is
> not
> > > the
> > > > > > >> leader
> > > > > > >> > of
> > > > > > >> > > > the cluster and has to forward the request to the
> leader.
> > > Any
> > > > > > >> thoughts
> > > > > > >> > on
> > > > > > >> > > > optimizing this to only require a single validation per
> > > > request?
> > > > > > We
> > > > > > >> > > > probably wouldn't want to force all validations to take
> > > place
> > > > on
> > > > > > the
> > > > > > >> > > leader
> > > > > > >> > > > (could lead to overloading it pretty quickly in large
> > > > clusters),
> > > > > > >> but we
> > > > > > >> > > > could add an internal-only query parameter to skip
> > > validation
> > > > > and
> > > > > > >> then
> > > > > > >> > > use
> > > > > > >> > > > that parameter when forwarding requests from followers
> to
> > > the
> > > > > > >> leader.
> > > > > > >> > > >
> > > > > > >> > > > 3. A worker property is pretty coarse-grained, and
> > difficult
> > > > to
> > > > > > >> change.
> > > > > > >> > > We
> > > > > > >> > > > might allow per-request toggling of the timeout by
> adding
> > a
> > > > URL
> > > > > > >> query
> > > > > > >> > > > parameter like '?timeout=90s' to the REST API to allow
> > > > tweaking
> > > > > of
> > > > > > >> the
> > > > > > >> > > > timeout on a more granular basis, and without having to
> > > > perform
> > > > > a
> > > > > > >> > worker
> > > > > > >> > > > restart.
> > > > > > >> > > >
> > > > > > >> > > > I'd also like to clarify a point about the rejected
> > > > alternative
> > > > > > >> "Allow
> > > > > > >> > > > configuring producer zombie fencing admin request
> > > timeout"--is
> > > > > the
> > > > > > >> > > > implication here that the "rest.api.request.timeout.ms"
> > > > > property
> > > > > > >> will
> > > > > > >> > > not
> > > > > > >> > > > control the REST timeout for requests to the 'PUT
> > > > > > >> > > > /connectors/{connector}/fence' endpoint, or just that it
> > > won't
> > > > > > >> control
> > > > > > >> > > the
> > > > > > >> > > > timeout that we use for the call to
> Admin::fenceProducers?
> > > > > > >> > > >
> > > > > > >> > > > Cheers,
> > > > > > >> > > >
> > > > > > >> > > > Chris
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <
> > > > > yash.mayya@gmail.com>
> > > > > > >> > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi all,
> > > > > > >> > > > >
> > > > > > >> > > > > I'd like to start a discussion thread on this small
> KIP
> > -
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
> > > > > > >> > > > >
> > > > > > >> > > > > It proposes the addition of a new Kafka Connect worker
> > > > > > >> configuration
> > > > > > >> > to
> > > > > > >> > > > > allow configuring REST API request timeouts.
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > > Yash
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Greg Harris <gr...@aiven.io.INVALID>.
Yash,

1.
Currently, the request parameters in the REST API are individual, each
pertaining to just one endpoint.
They also change the content of the query result, or change the action
taken on the cluster.
I think that a request timeout is a property of the HTTP request more than
it is a property of the cluster query or cluster action.
The two solutions have very similar tradeoffs, but I'm interested in
whether one is more idiomatic and more obvious to users.
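
To make the contrast concrete, here is the same request in both styles
(the parameter/header names and the millisecond value are placeholders,
not settled choices):

    # timeout as a query parameter - reads as part of the endpoint's query
    PUT /connector-plugins/MySinkConnector/config/validate?timeout=300000

    # timeout as a header - reads as a property of the HTTP request itself
    PUT /connector-plugins/MySinkConnector/config/validate
    Request-Timeout: 300000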

2.
I understand that only these three endpoints are in need of increased
timeouts at this time due to long connector validations.
From another perspective, this change makes the API more irregular, and
someone in the future might choose to make it regular again by
standardizing the configurable timeout functionality.
I wouldn't (in this KIP) dismiss someone's desire to configure other
timeouts in the future (in another KIP), or design them into a corner.
It is acceptable to limit the scope of this change to just the three
endpoints due to practical reasons, but I don't think that should prevent
us from trying to ensure that this design fits in the "end goal" state of
the Connect service.

3. 4.
I am not suggesting an incompatible change, as the current synchronous
behavior is still a useful API for certain situations. I think that it is
possible to add asynchronous validations in a backwards compatible way,
using new endpoints or other new request parameters.
The interface could be designed such that users with connectors that exceed
the synchronous timeouts can utilize the asynchronous API. Tooling can use
the asynchronous API when it is available, and fall back to the synchronous
API when it is not.
I think that it also may be more in line with the design of the rest of the
REST API, where nearly every other request is asynchronous. That's why
you're only targeting these three endpoints: they're the only ones with a
synchronicity constraint.
Again, I'm not necessarily saying that you must implement such an
asynchronous validation scheme in this KIP, but we should consider if that
is a more extensible solution. If we decided to implement configurable
synchronous timeouts now, how would that complement an asynchronous API in
the future?
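
For illustration only, a backwards compatible asynchronous flow might look
something like the sketch below - none of these endpoint shapes exist or
are being formally proposed here, they're just to make the idea concrete:

    # kick off validation without blocking; returns a validation id
    PUT /connector-plugins/MySinkConnector/config/validate?async=true
        -> 202 Accepted, response body carries a validation id

    # poll for the outcome using the returned id
    GET /connector-plugins/MySinkConnector/config/validate/status/{id}
        -> 202 while validation is still running
        -> 200 with the usual validation results once complete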

On Thu, Mar 2, 2023 at 10:00 PM Yash Mayya <ya...@gmail.com> wrote:

> Hi Greg,
>
> Thanks for taking a look!
>
> 1. I believe Chris suggested the use of a query parameter above as we
> already have precedents for using query parameters to configure per request
> behavior in Kafka Connect (the restart connectors API and the get
> connectors API for instance). Also, the limited choice of endpoints
> targeted is intentional (see my reply to the next point).
>
> 2. I intentionally targeted just the three listed endpoints where
> synchronous connector config validations come into the picture. This is
> because of the legitimate cases where config validation for specific
> connector plugins might exceed the default request timeout in edge case
> scenarios (outlined in the KIP's motivation section). Other Connect REST
> endpoints shouldn't be taking longer than the default 90 second request
> timeout; if they do so, it would either be indicative of a bug in the
> Connect framework or a cluster health issue - neither of which should be
> covered up by manually setting a longer request timeout.
>
> 3. 4. I think changing the config validation behavior would be a backward
> incompatible change and I wanted to avoid that in this particular KIP.
> There are multiple programmatic UIs out there which rely on the current
> synchronous config validation behavior and breaking the existing contract
> would definitely require a larger discussion.
>
> Thanks,
> Yash
>
> On Fri, Mar 3, 2023 at 12:04 AM Greg Harris <gr...@aiven.io.invalid>
> wrote:
>
> > Hey Yash,
> >
> > Thanks for the KIP, and sorry for the late review.
> >
> > 1. Have you considered a HTTP header to provide the client-configurable
> > timeout? A header might more naturally extend to all of the other
> endpoints
> > in the future, rather than duplicating the query parameter across
> > endpoints.
> >
> > 2. I understand that this change is targeted at just long duration
> > Connector::validation calls, is that due to voluntary scope constraints?
> > Implementing configurable timeouts for all endpoints in a uniform way
> could
> > be desirable, even if the default timeout will work for nearly all of the
> > other calls.
> >
> > 3. Did you consider adding asynchronous validation as a user-facing
> > feature? I think that relying on the synchronous validation results in a
> > single HTTP request is a bit of an anti-pattern, and one that we've
> > inherited from the original REST design. It seems useful when using the
> > REST API by hand, but seems to be a liability when used in environments
> > with an external management layer.
> >
> > 4. Perhaps rather than allowing synchronous calls to Connector:validate
> to
> > increase in duration, we should provide a way for connectors to surface
> > validation problems later in their lifecycle. Currently there is the
> > ConnectorContext::raiseError that transitions the task to FAILED,
> perhaps a
> > similar API could asynchronously emit re-validation results. We've also
> had
> > a problem with long start() duration for the same reasons as validate().
> >
> > I understand if you want to keep this KIP as tightly focused as possible,
> > but i'm worried that it is addressing the symptom and not the problem. I
> > want to make sure that this change is impactful and isn't obsoleted by a
> > later improvement.
> >
> > Thanks,
> > Greg
> >
> >
> > On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <ya...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Thanks for all the feedback and discussion. I've renamed the KIP title
> to
> > > "Kafka Connect REST API timeout improvements" since we're introducing a
> > > couple of improvements (cancelling create / update connector requests
> > when
> > > config validations timeout and avoiding double config validations in
> > > distributed mode) along with making the request timeouts configurable.
> > The
> > > new KIP link is
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > > and I've called for a vote for the KIP here -
> > > https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Sat, Nov 5, 2022 at 11:42 PM Sagar <sa...@gmail.com>
> wrote:
> > >
> > > > Hey Yash,
> > > >
> > > > Thanks for the explanation. I think it should be fine to delegate the
> > > > validation directly to the leader.
> > > >
> > > > Thanks!
> > > > Sagar.
> > > >
> > > > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <ya...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Sagar,
> > > > >
> > > > > Thanks for chiming in!
> > > > >
> > > > > > Having said that, why does the worker forward to the
> > > > > > leader? I am thinking if the worker can perform the validation on
> > > it's
> > > > > own,
> > > > > > we could let it do the validation instead of forwarding
> everything
> > to
> > > > the
> > > > > > leader
> > > > >
> > > > > Only the leader is allowed to perform writes to the config topic,
> so
> > > any
> > > > > request that requires a config topic write ends up being forwarded
> to
> > > the
> > > > > leader. The `POST /connectors` and `PUT
> > /connectors/{connector}/config`
> > > > > endpoints call `Herder::putConnectorConfig` which internally does a
> > > > config
> > > > > validation first before initiating another herder request to write
> to
> > > the
> > > > > config topic (in distributed mode). If we want to allow the first
> > > worker
> > > > to
> > > > > do the config validation, and then forward the request to the
> leader
> > > just
> > > > > for the write to the config topic, we'd either need something like
> a
> > > skip
> > > > > validations query parameter on the endpoint like Chris talks about
> > > above
> > > > or
> > > > > else a new internal only endpoint that just does a write to the
> > config
> > > > > topic without any prior config validation. However, we agreed that
> > this
> > > > > optimization doesn't really seem necessary for now and can be done
> > > later
> > > > if
> > > > > deemed useful. I'd be happy to hear differing thoughts if any,
> > however.
> > > > >
> > > > > > I think a bound is certainly needed but IMO it shouldn't go
> beyond
> > > > > > 10 mins considering this is just validation
> > > > >
> > > > > Yeah, I agree that this seems like a fair value - I've elected to
> go
> > > > with a
> > > > > default value of 10 minutes for the proposed worker configuration
> > that
> > > > sets
> > > > > an upper bound for the timeout query parameter.
> > > > >
> > > > > Thanks,
> > > > > Yash
> > > > >
> > > > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <ya...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Chris,
> > > > > >
> > > > > > Thanks again for your feedback. I think a worker configuration
> for
> > > the
> > > > > > upper bound makes sense - I initially thought we could hardcode
> it
> > > > (just
> > > > > > like the current request timeout is), but there's no reason to
> set
> > > > > another
> > > > > > artificial bound that isn't user configurable which is exactly
> what
> > > > we're
> > > > > > trying to change here in the first place. I've updated the KIP
> > based
> > > on
> > > > > all
> > > > > > our discussion above.
> > > > > >
> > > > > > Thanks,
> > > > > > Yash
> > > > > >
> > > > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <sagarmeansocean@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > >> Hey Yash,
> > > > > >>
> > > > > >> Thanks for the KIP! This looks like a useful feature.
> > > > > >>
> > > > > >> I think the discussion thread already has some great points by
> > > Chris.
> > > > > Just
> > > > > >> a couple of points/clarifications=>
> > > > > >>
> > > > > >> Regarding, pt#2 , I guess it might be better to forward to the
> > > leader
> > > > as
> > > > > >> suggested by Yash. Having said that, why does the worker forward
> > to
> > > > the
> > > > > >> leader? I am thinking if the worker can perform the validation
> on
> > > it's
> > > > > >> own,
> > > > > >> we could let it do the validation instead of forwarding
> everything
> > > to
> > > > > the
> > > > > >> leader(even though it might be cheap to forward all requests to
> > the
> > > > > >> leader).
> > > > > >>
> > > > > >> Pt#3 => I think a bound is certainly needed but IMO it shouldn't
> > go
> > > > > beyond
> > > > > >> 10 mins considering this is just validation. We shouldn't end up
> > in
> > > a
> > > > > >> situation where a few faulty connectors end up blocking a lot of
> > > > request
> > > > > >> processing threads, so while increasing the config is certainly
> > > > helpful,
> > > > > >> we
> > > > > >> shouldn't set too high a value IMO. Of course I am also open to
> > > > > >> suggestions
> > > > > >> here.
> > > > > >>
> > > > > >> Thanks!
> > > > > >> Sagar.
> > > > > >>
> > > > > >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton
> > > <chrise@aiven.io.invalid
> > > > >
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Yash,
> > > > > >> >
> > > > > >> > RE 2: That's a great point about validations already being
> > > performed
> > > > > by
> > > > > >> the
> > > > > >> > leader. For completeness's sake, I'd like to note that this
> only
> > > > holds
> > > > > >> for
> > > > > >> > valid configurations; invalid ones are caught right now before
> > > being
> > > > > >> > forwarded to the leader. Still, I think it's fine to forward
> to
> > > the
> > > > > >> leader
> > > > > >> > for now and optimize further in the future if necessary. If
> > > frequent
> > > > > >> > validations are taking place they should be conducted via the
> > `PUT
> > > > > >> > /connector-plugins/{pluginName}/config/validate` endpoint,
> which
> > > > won't
> > > > > >> do
> > > > > >> > any forwarding at all.
> > > > > >> >
> > > > > >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the
> timeout
> > > also
> > > > > >> seem
> > > > > >> > reasonable... maybe a low-importance worker property could
> work
> > > for
> > > > > >> that?
> > > > > >> > Not sure what would make sense for a default; probably
> somewhere
> > > in
> > > > > the
> > > > > >> > 10-60 minute range but would be interested in others'
> thoughts.
> > > > > >> >
> > > > > >> > Thanks for the clarification on the zombie fencing logic. I
> > think
> > > we
> > > > > >> might
> > > > > >> > want to have some more subtle logic around the interaction
> > between
> > > > > >> calls to
> > > > > >> > Admin::fenceProducers and a worker-level timeout property if
> we
> > go
> > > > > that
> > > > > >> > route, but we can cross that particular bridge if we get back
> to
> > > it.
> > > > > >> >
> > > > > >> > Cheers,
> > > > > >> >
> > > > > >> > Chris
> > > > > >> >
> > > > > >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <
> yash.mayya@gmail.com
> > >
> > > > > wrote:
> > > > > >> >
> > > > > >> > > Hi Chris,
> > > > > >> > >
> > > > > >> > > Thanks a lot for the super quick response and the great
> > > feedback!
> > > > > >> > >
> > > > > >> > > 1. I think that makes a lot of sense, and I'd be happy to
> > update
> > > > the
> > > > > >> KIP
> > > > > >> > to
> > > > > >> > > include this change in the scope. The current behavior where
> > the
> > > > API
> > > > > >> > > response indicates a time out but the connector is
> > > created/updated
> > > > > >> > > eventually anyway can be pretty confusing and is generally
> > not a
> > > > > good
> > > > > >> > user
> > > > > >> > > experience IMO.
> > > > > >> > >
> > > > > >> > > 2. Wow, thanks for pointing this out - it's a really good
> > catch
> > > > and
> > > > > >> > > something I hadn't noticed was happening with the current
> > > > > >> implementation.
> > > > > >> > > While I do like the idea of having a query parameter that
> > > > determines
> > > > > >> > > whether validations can be skipped, I'm wondering if it
> might
> > > not
> > > > be
> > > > > >> > easier
> > > > > >> > > and cleaner to just do the leader check earlier and avoid
> > doing
> > > > the
> > > > > >> > > unnecessary config validation on the first worker? Since
> each
> > > > config
> > > > > >> > > validation happens on its own thread, I'm not so sure about
> > the
> > > > > >> concern
> > > > > >> > of
> > > > > >> > > overloading the leader even on larger clusters, especially
> > since
> > > > > >> > > validations aren't typically long running operations.
> > > Furthermore,
> > > > > >> even
> > > > > >> > > with the current implementation, the leader will always be
> > > doing a
> > > > > >> config
> > > > > >> > > validation for connector create / update REST API requests
> on
> > > any
> > > > > >> worker.
> > > > > >> > >
> > > > > >> > > 3. That's a good point, and this way we can also restrict
> the
> > > APIs
> > > > > >> whose
> > > > > >> > > timeouts are configurable - I'm thinking `PUT
> > > > > >> > > /connector-plugins/{pluginName}/config/validate`, `POST
> > > > /connectors`
> > > > > >> and
> > > > > >> > > `PUT /connectors/{connector}/config` are the ones where
> such a
> > > > > timeout
> > > > > >> > > parameter could be useful. Also, do you think we should
> > enforce
> > > > some
> > > > > >> > > reasonable bounds for the timeout config?
> > > > > >> > >
> > > > > >> > > On the zombie fencing point, the implication was that the
> new
> > > > worker
> > > > > >> > > property would not control the timeout used for the call to
> > > > > >> > > Admin::fenceProducers. However, if we go with a timeout
> query
> > > > > >> parameter
> > > > > >> > > approach, even the timeout for the `PUT
> > > > > /connectors/{connector}/fence'
> > > > > >> > > endpoint will remain unaffected.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Yash
> > > > > >> > >
> > > > > >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton
> > > > > <chrise@aiven.io.invalid
> > > > > >> >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Hi Yash,
> > > > > >> > > >
> > > > > >> > > > Thanks for the KIP. It's a nice, focused change.
> Initially I
> > > was
> > > > > >> > hesitant
> > > > > >> > > > to support cases where connector validation takes this
> long,
> > > but
> > > > > >> > > > considering the alternative is that we give users a 500
> > error
> > > > > >> response
> > > > > >> > > but
> > > > > >> > > > leave the request to create/modify the connector queued up
> > in
> > > > the
> > > > > >> > > herder, I
> > > > > >> > > > think I can get behind the motivation here. There's also
> an
> > > > > >> argument to
> > > > > >> > > be
> > > > > >> > > > made about keeping Kafka Connect available even when the
> > > systems
> > > > > >> that
> > > > > >> > it
> > > > > >> > > > connects to are in a degraded state.
> > > > > >> > > >
> > > > > >> > > > I have a few alternatives I'd be interested in your
> thoughts
> > > on:
> > > > > >> > > >
> > > > > >> > > > 1. Since the primary concern here seems to be that custom
> > > > > connector
> > > > > >> > > > validation logic can take too long, do we have any
> thoughts
> > on
> > > > > >> adding
> > > > > >> > > logic
> > > > > >> > > > to check for request timeout after validation has
> completed
> > > and,
> > > > > if
> > > > > >> it
> > > > > >> > > has,
> > > > > >> > > > aborting the attempt to create/modify the connector?
> > > > > >> > > >
> > > > > >> > > > 2. Right now it's possible that we'll perform two
> connector
> > > > config
> > > > > >> > > > validations per create/modify request; once on the worker
> > that
> > > > > >> > initially
> > > > > >> > > > receives the request, and then again if that worker is not
> > the
> > > > > >> leader
> > > > > >> > of
> > > > > >> > > > the cluster and has to forward the request to the leader.
> > Any
> > > > > >> thoughts
> > > > > >> > on
> > > > > >> > > > optimizing this to only require a single validation per
> > > request?
> > > > > We
> > > > > >> > > > probably wouldn't want to force all validations to take
> > place
> > > on
> > > > > the
> > > > > >> > > leader
> > > > > >> > > > (could lead to overloading it pretty quickly in large
> > > clusters),
> > > > > >> but we
> > > > > >> > > > could add an internal-only query parameter to skip
> > validation
> > > > and
> > > > > >> then
> > > > > >> > > use
> > > > > >> > > > that parameter when forwarding requests from followers to
> > the
> > > > > >> leader.
> > > > > >> > > >
> > > > > >> > > > 3. A worker property is pretty coarse-grained, and
> difficult
> > > to
> > > > > >> change.
> > > > > >> > > We
> > > > > >> > > > might allow per-request toggling of the timeout by adding
> a
> > > URL
> > > > > >> query
> > > > > >> > > > parameter like '?timeout=90s' to the REST API to allow
> > > tweaking
> > > > of
> > > > > >> the
> > > > > >> > > > timeout on a more granular basis, and without having to
> > > perform
> > > > a
> > > > > >> > worker
> > > > > >> > > > restart.
> > > > > >> > > >
> > > > > >> > > > I'd also like to clarify a point about the rejected
> > > alternative
> > > > > >> "Allow
> > > > > >> > > > configuring producer zombie fencing admin request
> > timeout"--is
> > > > the
> > > > > >> > > > implication here that the "rest.api.request.timeout.ms"
> > > > property
> > > > > >> will
> > > > > >> > > not
> > > > > >> > > > control the REST timeout for requests to the 'PUT
> > > > > >> > > > /connectors/{connector}/fence' endpoint, or just that it
> > won't
> > > > > >> control
> > > > > >> > > the
> > > > > >> > > > timeout that we use for the call to Admin::fenceProducers?
> > > > > >> > > >
> > > > > >> > > > Cheers,
> > > > > >> > > >
> > > > > >> > > > Chris
> > > > > >> > > >
> > > > > >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <
> > > > yash.mayya@gmail.com>
> > > > > >> > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi all,
> > > > > >> > > > >
> > > > > >> > > > > I'd like to start a discussion thread on this small KIP
> -
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
> > > > > >> > > > >
> > > > > >> > > > > It proposes the addition of a new Kafka Connect worker
> > > > > >> configuration
> > > > > >> > to
> > > > > >> > > > > allow configuring REST API request timeouts.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > > Yash
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Yash Mayya <ya...@gmail.com>.
Hi Greg,

Thanks for taking a look!

1. I believe Chris suggested the use of a query parameter above as we
already have precedents for using query parameters to configure
per-request behavior in Kafka Connect (the restart connectors API and the
get connectors API, for instance). Also, the limited choice of endpoints
targeted is intentional (see my reply to the next point).
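
For reference, those existing precedents look like this (the connector
name is a placeholder, but the query parameters themselves are real):

    # restart a connector along with its failed tasks
    POST /connectors/my-connector/restart?includeTasks=true&onlyFailed=true

    # list connectors with their status and metadata expanded inline
    GET /connectors?expand=status&expand=info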

2. I intentionally targeted just the three listed endpoints where
synchronous connector config validations come into the picture. This is
because of the legitimate cases where config validation for specific
connector plugins might exceed the default request timeout in edge-case
scenarios (outlined in the KIP's motivation section). Other Connect REST
endpoints shouldn't be taking longer than the default 90-second request
timeout; if they do, it would be indicative of either a bug in the
Connect framework or a cluster health issue - neither of which should be
covered up by manually setting a longer request timeout.
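
For context, the 90-second value is currently hardcoded in the worker. As
a rough sketch, assuming the "rest.api.request.timeout.ms" property name
referenced elsewhere in this thread, the worker-level override would look
something like:

    # connect-distributed.properties - raise the REST request timeout to 2 minutes
    rest.api.request.timeout.ms=120000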

3. 4. I think changing the config validation behavior would be a
backward-incompatible change, and I wanted to avoid that in this
particular KIP. There are multiple programmatic UIs out there that rely
on the current synchronous config validation behavior, and breaking the
existing contract would definitely require a larger discussion.

Thanks,
Yash

On Fri, Mar 3, 2023 at 12:04 AM Greg Harris <gr...@aiven.io.invalid>
wrote:

> Hey Yash,
>
> Thanks for the KIP, and sorry for the late review.
>
> 1. Have you considered a HTTP header to provide the client-configurable
> timeout? A header might more naturally extend to all of the other endpoints
> in the future, rather than duplicating the query parameter across
> endpoints.
>
> 2. I understand that this change is targeted at just long duration
> Connector::validation calls, is that due to voluntary scope constraints?
> Implementing configurable timeouts for all endpoints in a uniform way could
> be desirable, even if the default timeout will work for nearly all of the
> other calls.
>
> 3. Did you consider adding asynchronous validation as a user-facing
> feature? I think that relying on the synchronous validation results in a
> single HTTP request is a bit of an anti-pattern, and one that we've
> inherited from the original REST design. It seems useful when using the
> REST API by hand, but seems to be a liability when used in environments
> with an external management layer.
>
> 4. Perhaps rather than allowing synchronous calls to Connector:validate to
> increase in duration, we should provide a way for connectors to surface
> validation problems later in their lifecycle. Currently there is the
> ConnectorContext::raiseError that transitions the task to FAILED, perhaps a
> similar API could asynchronously emit re-validation results. We've also had
> a problem with long start() duration for the same reasons as validate().
>
> I understand if you want to keep this KIP as tightly focused as possible,
> but i'm worried that it is addressing the symptom and not the problem. I
> want to make sure that this change is impactful and isn't obsoleted by a
> later improvement.
>
> Thanks,
> Greg
>
>
> On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <ya...@gmail.com> wrote:
>
> > Hi all,
> >
> > Thanks for all the feedback and discussion. I've renamed the KIP title to
> > "Kafka Connect REST API timeout improvements" since we're introducing a
> > couple of improvements (cancelling create / update connector requests
> when
> > config validations timeout and avoiding double config validations in
> > distributed mode) along with making the request timeouts configurable.
> The
> > new KIP link is
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > and I've called for a vote for the KIP here -
> > https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.
> >
> > Thanks,
> > Yash
> >
> > On Sat, Nov 5, 2022 at 11:42 PM Sagar <sa...@gmail.com> wrote:
> >
> > > Hey Yash,
> > >
> > > Thanks for the explanation. I think it should be fine to delegate the
> > > validation directly to the leader.
> > >
> > > Thanks!
> > > Sagar.
> > >
> > > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <ya...@gmail.com>
> wrote:
> > >
> > > > Hi Sagar,
> > > >
> > > > Thanks for chiming in!
> > > >
> > > > > Having said that, why does the worker forward to the
> > > > > leader? I am thinking if the worker can perform the validation on
> > it's
> > > > own,
> > > > > we could let it do the validation instead of forwarding everything
> to
> > > the
> > > > > leader
> > > >
> > > > Only the leader is allowed to perform writes to the config topic, so
> > any
> > > > request that requires a config topic write ends up being forwarded to
> > the
> > > > leader. The `POST /connectors` and `PUT
> /connectors/{connector}/config`
> > > > endpoints call `Herder::putConnectorConfig` which internally does a
> > > config
> > > > validation first before initiating another herder request to write to
> > the
> > > > config topic (in distributed mode). If we want to allow the first
> > worker
> > > to
> > > > do the config validation, and then forward the request to the leader
> > just
> > > > for the write to the config topic, we'd either need something like a
> > skip
> > > > validations query parameter on the endpoint like Chris talks about
> > above
> > > or
> > > > else a new internal only endpoint that just does a write to the
> config
> > > > topic without any prior config validation. However, we agreed that
> this
> > > > optimization doesn't really seem necessary for now and can be done
> > later
> > > if
> > > > deemed useful. I'd be happy to hear differing thoughts if any,
> however.
> > > >
> > > > > I think a bound is certainly needed but IMO it shouldn't go beyond
> > > > > 10 mins considering this is just validation
> > > >
> > > > Yeah, I agree that this seems like a fair value - I've elected to go
> > > with a
> > > > default value of 10 minutes for the proposed worker configuration
> that
> > > sets
> > > > an upper bound for the timeout query parameter.
> > > >
> > > > Thanks,
> > > > Yash

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Greg Harris <gr...@aiven.io.INVALID>.
Hey Yash,

Thanks for the KIP, and sorry for the late review.

1. Have you considered an HTTP header to provide the client-configurable
timeout? A header might more naturally extend to all of the other endpoints
in the future, rather than duplicating the query parameter across endpoints.
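
For illustration, a client-side call with such a header might look like the
sketch below (the "Request-Timeout" header name and the plugin/config used
here are placeholders, not something this KIP has defined):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ValidateWithTimeoutHeader {
        public static void main(String[] args) throws Exception {
            // Hypothetical per-request timeout supplied via an HTTP header,
            // value in milliseconds; the worker default would apply if absent.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8083/connector-plugins/"
                            + "FileStreamSource/config/validate"))
                    .header("Content-Type", "application/json")
                    .header("Request-Timeout", "300000")
                    .PUT(HttpRequest.BodyPublishers.ofString(
                            "{\"connector.class\":\"FileStreamSource\",\"tasks.max\":\"1\"}"))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
        }
    }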

2. I understand that this change is targeted at just long-duration
Connector::validate calls; is that due to voluntary scope constraints?
Implementing configurable timeouts for all endpoints in a uniform way could
be desirable, even if the default timeout will work for nearly all of the
other calls.

3. Did you consider adding asynchronous validation as a user-facing
feature? I think that relying on receiving validation results synchronously
in a single HTTP request is a bit of an anti-pattern, and one that we've
inherited from the original REST design. It seems useful when using the
REST API by hand, but it seems to be a liability in environments
with an external management layer.
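
Just to illustrate the shape of what I mean, an asynchronous flavour could
look something like this (entirely hypothetical endpoints, not a concrete
proposal):

    PUT /connector-plugins/{pluginName}/config/validate?async=true
        -> 202 Accepted, response body carries a validation ID
    GET /connector-plugins/{pluginName}/config/validate/{id}
        -> 200 OK with the validation results once they're ready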

4. Perhaps rather than allowing synchronous calls to Connector::validate to
increase in duration, we should provide a way for connectors to surface
validation problems later in their lifecycle. Currently there is
ConnectorContext::raiseError, which transitions the connector to FAILED;
perhaps a similar API could asynchronously emit re-validation results. We've
also had problems with long start() durations for the same reasons as
validate().
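
For reference, the existing mechanism I'm alluding to looks roughly like
this from a connector's point of view (a sketch only; the background health
check that triggers it is invented):

    import org.apache.kafka.connect.errors.ConnectException;
    import org.apache.kafka.connect.source.SourceConnector;

    public abstract class ExampleConnector extends SourceConnector {
        // Invented trigger: imagine a background health check owned by the
        // connector. 'context' is the ConnectorContext the framework supplies;
        // raiseError() fails the connector asynchronously instead of blocking
        // validate()/start() on a slow external check.
        protected void onBackgroundCheckFailed(Throwable cause) {
            context.raiseError(new ConnectException("External system unreachable", cause));
        }
    }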

I understand if you want to keep this KIP as tightly focused as possible,
but I'm worried that it is addressing the symptom and not the problem. I
want to make sure that this change is impactful and isn't obsoleted by a
later improvement.

Thanks,
Greg

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Yash Mayya <ya...@gmail.com>.
Hi all,

Thanks for all the feedback and discussion. I've renamed the KIP title to
"Kafka Connect REST API timeout improvements" since we're introducing a
couple of improvements (cancelling create / update connector requests when
config validations time out and avoiding double config validations in
distributed mode) along with making the request timeouts configurable. The
new KIP link is
https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
and I've called for a vote for the KIP here -
https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.

Thanks,
Yash

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Sagar <sa...@gmail.com>.
Hey Yash,

Thanks for the explanation. I think it should be fine to delegate the
validation directly to the leader.

Thanks!
Sagar.

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Yash Mayya <ya...@gmail.com>.
Hi Sagar,

Thanks for chiming in!

> Having said that, why does the worker forward to the
> leader? I am thinking if the worker can perform the validation on its
> own,
> we could let it do the validation instead of forwarding everything to the
> leader

Only the leader is allowed to perform writes to the config topic, so any
request that requires a config topic write ends up being forwarded to the
leader. The `POST /connectors` and `PUT /connectors/{connector}/config`
endpoints call `Herder::putConnectorConfig` which internally does a config
validation first before initiating another herder request to write to the
config topic (in distributed mode). If we want to allow the first worker to
do the config validation, and then forward the request to the leader just
for the write to the config topic, we'd either need something like a skip
validations query parameter on the endpoint like Chris talks about above or
else a new internal-only endpoint that just does a write to the config
topic without any prior config validation. However, we agreed that this
optimization doesn't really seem necessary for now and can be done later if
deemed useful. I'd be happy to hear differing thoughts if any, however.
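
In heavily simplified pseudocode, the flow described above is something like
this (not the actual herder code; every helper name other than
`putConnectorConfig` is made up for illustration):

    void putConnectorConfig(String connName, Map<String, String> config) {
        // Runs Connector::validate among other checks; this is the slow part
        ConfigInfos validation = validateConnectorConfig(config);
        if (validation.errorCount() > 0) {
            throw new BadRequestException("Connector configuration is invalid");
        }
        if (!isLeader()) {
            forwardToLeader(connName, config);    // leader currently re-validates
        } else {
            writeToConfigTopic(connName, config); // only the leader may do this
        }
    }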

> I think a bound is certainly needed but IMO it shouldn't go beyond
> 10 mins considering this is just validation

Yeah, I agree that this seems like a fair value - I've elected to go with a
default value of 10 minutes for the proposed worker configuration that sets
an upper bound for the timeout query parameter.
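
Concretely, a worker configuration would then look something like this (the
first property name appears elsewhere in this thread; the second is purely
illustrative, since the final name for the upper bound is whatever we settle
on):

    # Default timeout for REST API requests (90 seconds, matching today's
    # hardcoded value)
    rest.api.request.timeout.ms=90000
    # Illustrative name: upper bound on any per-request timeout override
    rest.api.request.timeout.max.ms=600000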

Thanks,
Yash

On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <ya...@gmail.com> wrote:

> Hi Chris,
>
> Thanks again for your feedback. I think a worker configuration for the
> upper bound makes sense - I initially thought we could hardcode it (just
> like the current request timeout is), but there's no reason to set another
> artificial bound that isn't user configurable which is exactly what we're
> trying to change here in the first place. I've updated the KIP based on all
> our discussion above.
>
> Thanks,
> Yash
>
> On Thu, Nov 3, 2022 at 11:01 PM Sagar <sa...@gmail.com> wrote:
>
>> Hey Yash,
>>
>> Thanks for the KIP! This looks like a useful feature.
>>
>> I think the discussion thread already has some great points by Chris. Just
>> a couple of points/clarifications=>
>>
>> Regarding, pt#2 , I guess it might be better to forward to the leader as
>> suggested by Yash. Having said that, why does the worker forward to the
>> leader? I am thinking if the worker can perform the validation on it's
>> own,
>> we could let it do the validation instead of forwarding everything to the
>> leader(even though it might be cheap to forward all requests to the
>> leader).
>>
>> Pt#3 => I think a bound is certainly needed but IMO it shouldn't go beyond
>> 10 mins considering this is just validation. We shouldn't end up in a
>> situation where a few faulty connectors end up blocking a lot of request
>> processing threads, so while increasing the config is certainly helpful,
>> we
>> shouldn't set too high a value IMO. Of course I am also open to
>> suggestions
>> here.
>>
>> Thanks!
>> Sagar.
>>
>> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <ch...@aiven.io.invalid>
>> wrote:
>>
>> > Hi Yash,
>> >
>> > RE 2: That's a great point about validations already being performed by
>> the
>> > leader. For completeness's sake, I'd like to note that this only holds
>> for
>> > valid configurations; invalid ones are caught right now before being
>> > forwarded to the leader. Still, I think it's fine to forward to the
>> leader
>> > for now and optimize further in the future if necessary. If frequent
>> > validations are taking place they should be conducted via the `PUT
>> > /connector-plugins/{pluginName}/config/validate` endpoint, which won't
>> do
>> > any forwarding at all.
>> >
>> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also
>> seem
>> > reasonable... maybe a low-importance worker property could work for
>> that?
>> > Not sure what would make sense for a default; probably somewhere in the
>> > 10-60 minute range but would be interested in others' thoughts.
>> >
>> > Thanks for the clarification on the zombie fencing logic. I think we
>> might
>> > want to have some more subtle logic around the interaction between
>> calls to
>> > Admin::fenceProducers and a worker-level timeout property if we go that
>> > route, but we can cross that particular bridge if we get back to it.
>> >
>> > Cheers,
>> >
>> > Chris
>> >
>> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <ya...@gmail.com> wrote:
>> >
>> > > Hi Chris,
>> > >
>> > > Thanks a lot for the super quick response and the great feedback!
>> > >
>> > > 1. I think that makes a lot of sense, and I'd be happy to update the
>> KIP
>> > to
>> > > include this change in the scope. The current behavior where the API
>> > > response indicates a time out but the connector is created/updated
>> > > eventually anyway can be pretty confusing and is generally not a good
>> > user
>> > > experience IMO.
>> > >
>> > > 2. Wow, thanks for pointing this out - it's a really good catch and
>> > > something I hadn't noticed was happening with the current
>> implementation.
>> > > While I do like the idea of having a query parameter that determines
>> > > whether validations can be skipped, I'm wondering if it might not be
>> > easier
>> > > and cleaner to just do the leader check earlier and avoid doing the
>> > > unnecessary config validation on the first worker? Since each config
>> > > validation happens on its own thread, I'm not so sure about the
>> concern
>> > of
>> > > overloading the leader even on larger clusters, especially since
>> > > validations aren't typically long running operations. Furthermore,
>> even
>> > > with the current implementation, the leader will always be doing a
>> config
>> > > validation for connector create / update REST API requests on any
>> worker.
>> > >
>> > > 3. That's a good point, and this way we can also restrict the APIs
>> whose
>> > > timeouts are configurable - I'm thinking `PUT
>> > > /connector-plugins/{pluginName}/config/validate`, `POST /connectors`
>> and
>> > > `PUT /connectors/{connector}/config` are the ones where such a timeout
>> > > parameter could be useful. Also, do you think we should enforce some
>> > > reasonable bounds for the timeout config?
>> > >
>> > > On the zombie fencing point, the implication was that the new worker
>> > > property would not control the timeout used for the call to
>> > > Admin::fenceProducers. However, if we go with a timeout query
>> parameter
>> > > approach, even the timeout for the `PUT /connectors/{connector}/fence'
>> > > endpoint will remain unaffected.
>> > >
>> > > Thanks,
>> > > Yash
>> > >
>> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <chrise@aiven.io.invalid
>> >
>> > > wrote:
>> > >
>> > > > Hi Yash,
>> > > >
>> > > > Thanks for the KIP. It's a nice, focused change. Initially I was
>> > hesitant
>> > > > to support cases where connector validation takes this long, but
>> > > > considering the alternative is that we give users a 500 error
>> response
>> > > but
>> > > > leave the request to create/modify the connector queued up in the
>> > > herder, I
>> > > > think I can get behind the motivation here. There's also an
>> argument to
>> > > be
>> > > > made about keeping Kafka Connect available even when the systems
>> that
>> > it
>> > > > connects to are in a degraded state.
>> > > >
>> > > > I have a few alternatives I'd be interested in your thoughts on:
>> > > >
>> > > > 1. Since the primary concern here seems to be that custom connector
>> > > > validation logic can take too long, do we have any thoughts on
>> adding
>> > > logic
>> > > > to check for request timeout after validation has completed and, if
>> it
>> > > has,
>> > > > aborting the attempt to create/modify the connector?
>> > > >
>> > > > 2. Right now it's possible that we'll perform two connector config
>> > > > validations per create/modify request; once on the worker that
>> > initially
>> > > > receives the request, and then again if that worker is not the
>> leader
>> > of
>> > > > the cluster and has to forward the request to the leader. Any
>> thoughts
>> > on
>> > > > optimizing this to only require a single validation per request? We
>> > > > probably wouldn't want to force all validations to take place on the
>> > > leader
>> > > > (could lead to overloading it pretty quickly in large clusters),
>> but we
>> > > > could add an internal-only query parameter to skip validation and
>> then
>> > > use
>> > > > that parameter when forwarding requests from followers to the
>> leader.
>> > > >
>> > > > 3. A worker property is pretty coarse-grained, and difficult to
>> change.
>> > > We
>> > > > might allow per-request toggling of the timeout by adding a URL
>> query
>> > > > parameter like '?timeout=90s' to the REST API to allow tweaking of
>> the
>> > > > timeout on a more granular basis, and without having to perform a
>> > worker
>> > > > restart.
>> > > >
>> > > > I'd also like to clarify a point about the rejected alternative
>> "Allow
>> > > > configuring producer zombie fencing admin request timeout"--is the
>> > > > implication here that the "rest.api.request.timeout.ms" property
>> will
>> > > not
>> > > > control the REST timeout for requests to the 'PUT
>> > > > /connectors/{connector}/fence' endpoint, or just that it won't
>> control
>> > > the
>> > > > timeout that we use for the call to Admin::fenceProducers?
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Chris
>> > > >
>> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <ya...@gmail.com>
>> > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I'd like to start a discussion thread on this small KIP -
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
>> > > > >
>> > > > > It proposes the addition of a new Kafka Connect worker
>> configuration
>> > to
>> > > > > allow configuring REST API request timeouts.
>> > > > >
>> > > > > Thanks,
>> > > > > Yash
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Yash Mayya <ya...@gmail.com>.
Hi Chris,

Thanks again for your feedback. I think a worker configuration for the
upper bound makes sense - I initially thought we could hardcode it (just
like the current request timeout is), but there's no reason to introduce
another artificial bound that isn't user-configurable, which is exactly what
we're trying to change here in the first place. I've updated the KIP based
on all of our discussion above.
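
For illustration, the worker configuration might end up looking something
like this - the name of the upper-bound property and both defaults below are
placeholders, not final KIP text:

  # Sketch only: everything other than rest.api.request.timeout.ms is a
  # placeholder pending the final KIP text.
  rest.api.request.timeout.ms=90000          # default REST request timeout
  rest.api.request.max.timeout.ms=600000     # cap on any per-request override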

Thanks,
Yash

On Thu, Nov 3, 2022 at 11:01 PM Sagar <sa...@gmail.com> wrote:

> Hey Yash,
>
> Thanks for the KIP! This looks like a useful feature.
>
> I think the discussion thread already has some great points by Chris. Just
> a couple of points/clarifications=>
>
> Regarding pt#2, I guess it might be better to forward to the leader as
> suggested by Yash. Having said that, why does the worker forward to the
> leader? I am thinking if the worker can perform the validation on its own,
> we could let it do the validation instead of forwarding everything to the
> leader (even though it might be cheap to forward all requests to the
> leader).
>
> Pt#3 => I think a bound is certainly needed but IMO it shouldn't go beyond
> 10 mins considering this is just validation. We shouldn't end up in a
> situation where a few faulty connectors end up blocking a lot of request
> processing threads, so while increasing the config is certainly helpful, we
> shouldn't set too high a value IMO. Of course I am also open to suggestions
> here.
>
> Thanks!
> Sagar.
>
> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Yash,
> >
> > RE 2: That's a great point about validations already being performed by
> the
> > leader. For completeness's sake, I'd like to note that this only holds
> for
> > valid configurations; invalid ones are caught right now before being
> > forwarded to the leader. Still, I think it's fine to forward to the
> leader
> > for now and optimize further in the future if necessary. If frequent
> > validations are taking place they should be conducted via the `PUT
> > /connector-plugins/{pluginName}/config/validate` endpoint, which won't do
> > any forwarding at all.
> >
> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also seem
> > reasonable... maybe a low-importance worker property could work for that?
> > Not sure what would make sense for a default; probably somewhere in the
> > 10-60 minute range but would be interested in others' thoughts.
> >
> > Thanks for the clarification on the zombie fencing logic. I think we
> might
> > want to have some more subtle logic around the interaction between calls
> to
> > Admin::fenceProducers and a worker-level timeout property if we go that
> > route, but we can cross that particular bridge if we get back to it.
> >
> > Cheers,
> >
> > Chris
> >
> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <ya...@gmail.com> wrote:
> >
> > > Hi Chris,
> > >
> > > Thanks a lot for the super quick response and the great feedback!
> > >
> > > 1. I think that makes a lot of sense, and I'd be happy to update the
> KIP
> > to
> > > include this change in the scope. The current behavior where the API
> > > response indicates a timeout but the connector is created/updated
> > > eventually anyway can be pretty confusing and is generally not a good
> > user
> > > experience IMO.
> > >
> > > 2. Wow, thanks for pointing this out - it's a really good catch and
> > > something I hadn't noticed was happening with the current
> implementation.
> > > While I do like the idea of having a query parameter that determines
> > > whether validations can be skipped, I'm wondering if it might not be
> > easier
> > > and cleaner to just do the leader check earlier and avoid doing the
> > > unnecessary config validation on the first worker? Since each config
> > > validation happens on its own thread, I'm not so sure about the concern
> > of
> > > overloading the leader even on larger clusters, especially since
> > > validations aren't typically long running operations. Furthermore, even
> > > with the current implementation, the leader will always be doing a
> config
> > > validation for connector create / update REST API requests on any
> worker.
> > >
> > > 3. That's a good point, and this way we can also restrict the APIs
> whose
> > > timeouts are configurable - I'm thinking `PUT
> > > /connector-plugins/{pluginName}/config/validate`, `POST /connectors`
> and
> > > `PUT /connectors/{connector}/config` are the ones where such a timeout
> > > parameter could be useful. Also, do you think we should enforce some
> > > reasonable bounds for the timeout config?
> > >
> > > On the zombie fencing point, the implication was that the new worker
> > > property would not control the timeout used for the call to
> > > Admin::fenceProducers. However, if we go with a timeout query parameter
> > > approach, even the timeout for the `PUT /connectors/{connector}/fence'
> > > endpoint will remain unaffected.
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <ch...@aiven.io.invalid>
> > > wrote:
> > >
> > > > Hi Yash,
> > > >
> > > > Thanks for the KIP. It's a nice, focused change. Initially I was
> > hesitant
> > > > to support cases where connector validation takes this long, but
> > > > considering the alternative is that we give users a 500 error
> response
> > > but
> > > > leave the request to create/modify the connector queued up in the
> > > herder, I
> > > > think I can get behind the motivation here. There's also an argument
> to
> > > be
> > > > made about keeping Kafka Connect available even when the systems that
> > it
> > > > connects to are in a degraded state.
> > > >
> > > > I have a few alternatives I'd be interested in your thoughts on:
> > > >
> > > > 1. Since the primary concern here seems to be that custom connector
> > > > validation logic can take too long, do we have any thoughts on adding
> > > logic
> > > > to check for request timeout after validation has completed and, if
> it
> > > has,
> > > > aborting the attempt to create/modify the connector?
> > > >
> > > > 2. Right now it's possible that we'll perform two connector config
> > > > validations per create/modify request; once on the worker that
> > initially
> > > > receives the request, and then again if that worker is not the leader
> > of
> > > > the cluster and has to forward the request to the leader. Any
> thoughts
> > on
> > > > optimizing this to only require a single validation per request? We
> > > > probably wouldn't want to force all validations to take place on the
> > > leader
> > > > (could lead to overloading it pretty quickly in large clusters), but
> we
> > > > could add an internal-only query parameter to skip validation and
> then
> > > use
> > > > that parameter when forwarding requests from followers to the leader.
> > > >
> > > > 3. A worker property is pretty coarse-grained, and difficult to
> change.
> > > We
> > > > might allow per-request toggling of the timeout by adding a URL query
> > > > parameter like '?timeout=90s' to the REST API to allow tweaking of
> the
> > > > timeout on a more granular basis, and without having to perform a
> > worker
> > > > restart.
> > > >
> > > > I'd also like to clarify a point about the rejected alternative
> "Allow
> > > > configuring producer zombie fencing admin request timeout"--is the
> > > > implication here that the "rest.api.request.timeout.ms" property
> will
> > > not
> > > > control the REST timeout for requests to the 'PUT
> > > > /connectors/{connector}/fence' endpoint, or just that it won't
> control
> > > the
> > > > timeout that we use for the call to Admin::fenceProducers?
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <ya...@gmail.com>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to start a discussion thread on this small KIP -
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
> > > > >
> > > > > It proposes the addition of a new Kafka Connect worker
> configuration
> > to
> > > > > allow configuring REST API request timeouts.
> > > > >
> > > > > Thanks,
> > > > > Yash
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Sagar <sa...@gmail.com>.
Hey Yash,

Thanks for the KIP! This looks like a useful feature.

I think the discussion thread already has some great points by Chris. Just
a couple of points/clarifications=>

Regarding pt#2, I guess it might be better to forward to the leader as
suggested by Yash. Having said that, why does the worker forward to the
leader? I am thinking if the worker can perform the validation on its own,
we could let it do the validation instead of forwarding everything to the
leader (even though it might be cheap to forward all requests to the leader).

Pt#3 => I think a bound is certainly needed but IMO it shouldn't go beyond
10 mins considering this is just validation. We shouldn't end up in a
situation where a few faulty connectors end up blocking a lot of request
processing threads, so while increasing the config is certainly helpful, we
shouldn't set too high a value IMO. Of course I am also open to suggestions
here.
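
To make that concrete, the bound could be as simple as a clamp on the worker
side - a rough Java sketch with invented names, not actual Connect code:

  // Rough sketch, names invented: clamp any per-request timeout override to
  // a worker-level maximum so a few slow validations can't hog REST threads.
  final class TimeoutBounds {
      static long effectiveTimeoutMs(long requestedMs, long defaultMs, long maxMs) {
          if (requestedMs <= 0)
              return defaultMs;                // absent/invalid -> use default
          return Math.min(requestedMs, maxMs); // never exceed the worker cap
      }
  }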

Thanks!
Sagar.

On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <ch...@aiven.io.invalid>
wrote:

> Hi Yash,
>
> RE 2: That's a great point about validations already being performed by the
> leader. For completeness's sake, I'd like to note that this only holds for
> valid configurations; invalid ones are caught right now before being
> forwarded to the leader. Still, I think it's fine to forward to the leader
> for now and optimize further in the future if necessary. If frequent
> validations are taking place they should be conducted via the `PUT
> /connector-plugins/{pluginName}/config/validate` endpoint, which won't do
> any forwarding at all.
>
> RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also seem
> reasonable... maybe a low-importance worker property could work for that?
> Not sure what would make sense for a default; probably somewhere in the
> 10-60 minute range but would be interested in others' thoughts.
>
> Thanks for the clarification on the zombie fencing logic. I think we might
> want to have some more subtle logic around the interaction between calls to
> Admin::fenceProducers and a worker-level timeout property if we go that
> route, but we can cross that particular bridge if we get back to it.
>
> Cheers,
>
> Chris
>
> On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <ya...@gmail.com> wrote:
>
> > Hi Chris,
> >
> > Thanks a lot for the super quick response and the great feedback!
> >
> > 1. I think that makes a lot of sense, and I'd be happy to update the KIP
> to
> > include this change in the scope. The current behavior where the API
> > response indicates a timeout but the connector is created/updated
> > eventually anyway can be pretty confusing and is generally not a good
> user
> > experience IMO.
> >
> > 2. Wow, thanks for pointing this out - it's a really good catch and
> > something I hadn't noticed was happening with the current implementation.
> > While I do like the idea of having a query parameter that determines
> > whether validations can be skipped, I'm wondering if it might not be
> easier
> > and cleaner to just do the leader check earlier and avoid doing the
> > unnecessary config validation on the first worker? Since each config
> > validation happens on its own thread, I'm not so sure about the concern
> of
> > overloading the leader even on larger clusters, especially since
> > validations aren't typically long running operations. Furthermore, even
> > with the current implementation, the leader will always be doing a config
> > validation for connector create / update REST API requests on any worker.
> >
> > 3. That's a good point, and this way we can also restrict the APIs whose
> > timeouts are configurable - I'm thinking `PUT
> > /connector-plugins/{pluginName}/config/validate`, `POST /connectors` and
> > `PUT /connectors/{connector}/config` are the ones where such a timeout
> > parameter could be useful. Also, do you think we should enforce some
> > reasonable bounds for the timeout config?
> >
> > On the zombie fencing point, the implication was that the new worker
> > property would not control the timeout used for the call to
> > Admin::fenceProducers. However, if we go with a timeout query parameter
> > approach, even the timeout for the `PUT /connectors/{connector}/fence'
> > endpoint will remain unaffected.
> >
> > Thanks,
> > Yash
> >
> > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Yash,
> > >
> > > Thanks for the KIP. It's a nice, focused change. Initially I was
> hesitant
> > > to support cases where connector validation takes this long, but
> > > considering the alternative is that we give users a 500 error response
> > but
> > > leave the request to create/modify the connector queued up in the
> > herder, I
> > > think I can get behind the motivation here. There's also an argument to
> > be
> > > made about keeping Kafka Connect available even when the systems that
> it
> > > connects to are in a degraded state.
> > >
> > > I have a few alternatives I'd be interested in your thoughts on:
> > >
> > > 1. Since the primary concern here seems to be that custom connector
> > > validation logic can take too long, do we have any thoughts on adding
> > logic
> > > to check for request timeout after validation has completed and, if it
> > has,
> > > aborting the attempt to create/modify the connector?
> > >
> > > 2. Right now it's possible that we'll perform two connector config
> > > validations per create/modify request; once on the worker that
> initially
> > > receives the request, and then again if that worker is not the leader
> of
> > > the cluster and has to forward the request to the leader. Any thoughts
> on
> > > optimizing this to only require a single validation per request? We
> > > probably wouldn't want to force all validations to take place on the
> > leader
> > > (could lead to overloading it pretty quickly in large clusters), but we
> > > could add an internal-only query parameter to skip validation and then
> > use
> > > that parameter when forwarding requests from followers to the leader.
> > >
> > > 3. A worker property is pretty coarse-grained, and difficult to change.
> > We
> > > might allow per-request toggling of the timeout by adding a URL query
> > > parameter like '?timeout=90s' to the REST API to allow tweaking of the
> > > timeout on a more granular basis, and without having to perform a
> worker
> > > restart.
> > >
> > > I'd also like to clarify a point about the rejected alternative "Allow
> > > configuring producer zombie fencing admin request timeout"--is the
> > > implication here that the "rest.api.request.timeout.ms" property will
> > not
> > > control the REST timeout for requests to the 'PUT
> > > /connectors/{connector}/fence' endpoint, or just that it won't control
> > the
> > > timeout that we use for the call to Admin::fenceProducers?
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <ya...@gmail.com>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to start a discussion thread on this small KIP -
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
> > > >
> > > > It proposes the addition of a new Kafka Connect worker configuration
> to
> > > > allow configuring REST API request timeouts.
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Yash,

RE 2: That's a great point about validations already being performed by the
leader. For completeness's sake, I'd like to note that this only holds for
valid configurations; invalid ones are caught right now before being
forwarded to the leader. Still, I think it's fine to forward to the leader
for now and optimize further in the future if necessary. If frequent
validations are taking place they should be conducted via the `PUT
/connector-plugins/{pluginName}/config/validate` endpoint, which won't do
any forwarding at all.
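
For example, a client that validates frequently can hit any worker directly
with something like this (host, port, and plugin/class names are
placeholders):

  curl -s -X PUT -H "Content-Type: application/json" \
       --data '{"connector.class": "org.example.MySinkConnector", "topics": "test"}' \
       http://localhost:8083/connector-plugins/MySinkConnector/config/validate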

RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also seem
reasonable... maybe a low-importance worker property could work for that?
Not sure what would make sense for a default; probably somewhere in the
10-60 minute range but would be interested in others' thoughts.

Thanks for the clarification on the zombie fencing logic. I think we might
want to have some more subtle logic around the interaction between calls to
Admin::fenceProducers and a worker-level timeout property if we go that
route, but we can cross that particular bridge if we get back to it.
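
For reference, the fencing call already accepts a per-call timeout via its
options object, so one possible shape - purely a sketch, nothing settled, and
the 30s value is illustrative - would be:

  import java.util.Collection;
  import org.apache.kafka.clients.admin.Admin;
  import org.apache.kafka.clients.admin.FenceProducersOptions;

  final class FencingSketch {
      // Sketch only: give the fencing call its own explicit timeout rather
      // than deriving it from the REST request timeout.
      static void fence(Admin admin, Collection<String> txnIds) throws Exception {
          admin.fenceProducers(txnIds, new FenceProducersOptions().timeoutMs(30_000))
               .all()
               .get();
      }
  }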

Cheers,

Chris

On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <ya...@gmail.com> wrote:

> Hi Chris,
>
> Thanks a lot for the super quick response and the great feedback!
>
> 1. I think that makes a lot of sense, and I'd be happy to update the KIP to
> include this change in the scope. The current behavior where the API
> response indicates a timeout but the connector is created/updated
> eventually anyway can be pretty confusing and is generally not a good user
> experience IMO.
>
> 2. Wow, thanks for pointing this out - it's a really good catch and
> something I hadn't noticed was happening with the current implementation.
> While I do like the idea of having a query parameter that determines
> whether validations can be skipped, I'm wondering if it might not be easier
> and cleaner to just do the leader check earlier and avoid doing the
> unnecessary config validation on the first worker? Since each config
> validation happens on its own thread, I'm not so sure about the concern of
> overloading the leader even on larger clusters, especially since
> validations aren't typically long running operations. Furthermore, even
> with the current implementation, the leader will always be doing a config
> validation for connector create / update REST API requests on any worker.
>
> 3. That's a good point, and this way we can also restrict the APIs whose
> timeouts are configurable - I'm thinking `PUT
> /connector-plugins/{pluginName}/config/validate`, `POST /connectors` and
> `PUT /connectors/{connector}/config` are the ones where such a timeout
> parameter could be useful. Also, do you think we should enforce some
> reasonable bounds for the timeout config?
>
> On the zombie fencing point, the implication was that the new worker
> property would not control the timeout used for the call to
> Admin::fenceProducers. However, if we go with a timeout query parameter
> approach, even the timeout for the `PUT /connectors/{connector}/fence'
> endpoint will remain unaffected.
>
> Thanks,
> Yash
>
> On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Yash,
> >
> > Thanks for the KIP. It's a nice, focused change. Initially I was hesitant
> > to support cases where connector validation takes this long, but
> > considering the alternative is that we give users a 500 error response
> but
> > leave the request to create/modify the connector queued up in the
> herder, I
> > think I can get behind the motivation here. There's also an argument to
> be
> > made about keeping Kafka Connect available even when the systems that it
> > connects to are in a degraded state.
> >
> > I have a few alternatives I'd be interested in your thoughts on:
> >
> > 1. Since the primary concern here seems to be that custom connector
> > validation logic can take too long, do we have any thoughts on adding
> logic
> > to check for request timeout after validation has completed and, if it
> has,
> > aborting the attempt to create/modify the connector?
> >
> > 2. Right now it's possible that we'll perform two connector config
> > validations per create/modify request; once on the worker that initially
> > receives the request, and then again if that worker is not the leader of
> > the cluster and has to forward the request to the leader. Any thoughts on
> > optimizing this to only require a single validation per request? We
> > probably wouldn't want to force all validations to take place on the
> leader
> > (could lead to overloading it pretty quickly in large clusters), but we
> > could add an internal-only query parameter to skip validation and then
> use
> > that parameter when forwarding requests from followers to the leader.
> >
> > 3. A worker property is pretty coarse-grained, and difficult to change.
> We
> > might allow per-request toggling of the timeout by adding a URL query
> > parameter like '?timeout=90s' to the REST API to allow tweaking of the
> > timeout on a more granular basis, and without having to perform a worker
> > restart.
> >
> > I'd also like to clarify a point about the rejected alternative "Allow
> > configuring producer zombie fencing admin request timeout"--is the
> > implication here that the "rest.api.request.timeout.ms" property will
> not
> > control the REST timeout for requests to the 'PUT
> > /connectors/{connector}/fence' endpoint, or just that it won't control
> the
> > timeout that we use for the call to Admin::fenceProducers?
> >
> > Cheers,
> >
> > Chris
> >
> > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <ya...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I'd like to start a discussion thread on this small KIP -
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
> > >
> > > It proposes the addition of a new Kafka Connect worker configuration to
> > > allow configuring REST API request timeouts.
> > >
> > > Thanks,
> > > Yash
> > >
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Yash Mayya <ya...@gmail.com>.
Hi Chris,

Thanks a lot for the super quick response and the great feedback!

1. I think that makes a lot of sense, and I'd be happy to update the KIP to
include this change in the scope. The current behavior where the API
response indicates a timeout but the connector is created/updated
eventually anyway can be pretty confusing and is generally not a good user
experience IMO.

2. Wow, thanks for pointing this out - it's a really good catch and
something I hadn't noticed was happening with the current implementation.
While I do like the idea of having a query parameter that determines
whether validations can be skipped, I'm wondering if it might not be easier
and cleaner to just do the leader check earlier and avoid doing the
unnecessary config validation on the first worker? Since each config
validation happens on its own thread, I'm not so sure about the concern of
overloading the leader even on larger clusters, especially since
validations aren't typically long-running operations. Furthermore, even
with the current implementation, the leader will always be doing a config
validation for connector create/update REST API requests on any worker.
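
In other words, something shaped roughly like this - method names are
invented for the sketch and aren't the actual herder API:

  import java.util.Map;

  final class LeaderFirstSketch {
      interface Herder {
          boolean isLeader();
          void forwardToLeader(Map<String, String> config);
          void validateConfig(Map<String, String> config);
          void writeConfig(Map<String, String> config);
      }

      // Invented names: check leadership before validating, so each
      // create/update request is validated exactly once, on the leader.
      static void putConnectorConfig(Herder herder, Map<String, String> config) {
          if (!herder.isLeader()) {
              herder.forwardToLeader(config);  // leader validates and persists
              return;
          }
          herder.validateConfig(config);
          herder.writeConfig(config);
      }
  }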

3. That's a good point, and this way we can also restrict the APIs whose
timeouts are configurable - I'm thinking `PUT
/connector-plugins/{pluginName}/config/validate`, `POST /connectors` and
`PUT /connectors/{connector}/config` are the ones where such a timeout
parameter could be useful. Also, do you think we should enforce some
reasonable bounds for the timeout config?

On the zombie fencing point, the implication was that the new worker
property would not control the timeout used for the call to
Admin::fenceProducers. However, if we go with a timeout query parameter
approach, even the timeout for the `PUT /connectors/{connector}/fence`
endpoint will remain unaffected.

Thanks,
Yash

On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <ch...@aiven.io.invalid>
wrote:

> Hi Yash,
>
> Thanks for the KIP. It's a nice, focused change. Initially I was hesitant
> to support cases where connector validation takes this long, but
> considering the alternative is that we give users a 500 error response but
> leave the request to create/modify the connector queued up in the herder, I
> think I can get behind the motivation here. There's also an argument to be
> made about keeping Kafka Connect available even when the systems that it
> connects to are in a degraded state.
>
> I have a few alternatives I'd be interested in your thoughts on:
>
> 1. Since the primary concern here seems to be that custom connector
> validation logic can take too long, do we have any thoughts on adding logic
> to check for request timeout after validation has completed and, if it has,
> aborting the attempt to create/modify the connector?
>
> 2. Right now it's possible that we'll perform two connector config
> validations per create/modify request; once on the worker that initially
> receives the request, and then again if that worker is not the leader of
> the cluster and has to forward the request to the leader. Any thoughts on
> optimizing this to only require a single validation per request? We
> probably wouldn't want to force all validations to take place on the leader
> (could lead to overloading it pretty quickly in large clusters), but we
> could add an internal-only query parameter to skip validation and then use
> that parameter when forwarding requests from followers to the leader.
>
> 3. A worker property is pretty coarse-grained, and difficult to change. We
> might allow per-request toggling of the timeout by adding a URL query
> parameter like '?timeout=90s' to the REST API to allow tweaking of the
> timeout on a more granular basis, and without having to perform a worker
> restart.
>
> I'd also like to clarify a point about the rejected alternative "Allow
> configuring producer zombie fencing admin request timeout"--is the
> implication here that the "rest.api.request.timeout.ms" property will not
> control the REST timeout for requests to the 'PUT
> /connectors/{connector}/fence' endpoint, or just that it won't control the
> timeout that we use for the call to Admin::fenceProducers?
>
> Cheers,
>
> Chris
>
> On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <ya...@gmail.com> wrote:
>
> > Hi all,
> >
> > I'd like to start a discussion thread on this small KIP -
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
> >
> > It proposes the addition of a new Kafka Connect worker configuration to
> > allow configuring REST API request timeouts.
> >
> > Thanks,
> > Yash
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Yash,

Thanks for the KIP. It's a nice, focused change. Initially I was hesitant
to support cases where connector validation takes this long, but
considering the alternative is that we give users a 500 error response but
leave the request to create/modify the connector queued up in the herder, I
think I can get behind the motivation here. There's also an argument to be
made about keeping Kafka Connect available even when the systems that it
connects to are in a degraded state.

I have a few alternatives I'd be interested in your thoughts on:

1. Since the primary concern here seems to be that custom connector
validation logic can take too long, do we have any thoughts on adding logic
to check, once validation has completed, whether the request has already
timed out and, if it has, aborting the attempt to create/modify the
connector?
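
Something along these lines, as a sketch (names invented, not actual Connect
internals):

  import java.util.concurrent.TimeoutException;

  final class DeadlineCheckSketch {
      // Invented names: after a possibly slow validation finishes, abort
      // instead of persisting if the REST request has already timed out.
      static void validateThenWrite(Runnable validate, Runnable write,
                                    long deadlineMs) throws TimeoutException {
          validate.run();
          if (System.currentTimeMillis() > deadlineMs)
              throw new TimeoutException("request timed out during validation");
          write.run();
      }
  }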

2. Right now it's possible that we'll perform two connector config
validations per create/modify request: once on the worker that initially
receives the request, and then again if that worker is not the leader of
the cluster and has to forward the request to the leader. Any thoughts on
optimizing this to only require a single validation per request? We
probably wouldn't want to force all validations to take place on the leader
(could lead to overloading it pretty quickly in large clusters), but we
could add an internal-only query parameter to skip validation and then use
that parameter when forwarding requests from followers to the leader.
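
As a sketch of that mechanism - the parameter name is invented purely for
illustration - the follower would validate first and then forward along the
lines of:

  PUT /connectors/my-connector/config?skip_validation=true HTTP/1.1
  Host: leader-worker:8083
  Content-Type: application/json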

3. A worker property is pretty coarse-grained, and difficult to change. We
might allow per-request toggling of the timeout by adding a URL query
parameter like '?timeout=90s' to the REST API to allow tweaking of the
timeout on a more granular basis, and without having to perform a worker
restart.
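
For instance (illustrative only - the exact parameter syntax would be pinned
down in the KIP):

  curl -s -X PUT -H "Content-Type: application/json" \
       --data @my-connector-config.json \
       "http://localhost:8083/connectors/my-connector/config?timeout=90s"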

I'd also like to clarify a point about the rejected alternative "Allow
configuring producer zombie fencing admin request timeout"--is the
implication here that the "rest.api.request.timeout.ms" property will not
control the REST timeout for requests to the 'PUT
/connectors/{connector}/fence' endpoint, or just that it won't control the
timeout that we use for the call to Admin::fenceProducers?

Cheers,

Chris

On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <ya...@gmail.com> wrote:

> Hi all,
>
> I'd like to start a discussion thread on this small KIP -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Make+Kafka+Connect+REST+API+request+timeouts+configurable
>
> It proposes the addition of a new Kafka Connect worker configuration to
> allow configuring REST API request timeouts.
>
> Thanks,
> Yash
>