You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by "Hector Geraldino (BLOOMBERG/ 919 3RD A)" <hg...@bloomberg.net> on 2022/11/03 18:54:41 UTC

[DISCUSS] KIP-883: Add delete callback method to Connector API

Hi everyone,

I've submitted KIP-883, which introduces a callback to the public Connector API called when deleting a connector:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-883%3A+Add+delete+callback+method+to+Connector+API

It adds a new `deleted()` method (open to better naming suggestions) to the org.apache.kafka.connect.connector.Connector abstract class, which will be invoked by connect Workers when a connector is being deleted. 

Feedback and comments are welcome.

Thank you!
Hector

Re: [DISCUSS] KIP-883: Add delete callback method to Connector API

Posted by Greg Harris <gr...@aiven.io.INVALID>.

Hi Hector,

Thanks for the KIP!

This is certainly missing functionality from the native Connect framework,
and we should try to make it possible to inform connectors about this part
of their lifecycle.
However, as with most functionality that was left out of the initial
implementation of the framework, the details are more challenging to work
out.

1. What happens when the destroy call throws an error, how does the
framework respond?

This is unspecified in the KIP, and it appears that your proposed changes
could cause the herder to fail.
From the perspective of operators & connector developers, what is a
reasonable expectation to have for failure of a destroy?
I could see operators wanting both a graceful-delete to make use of this
new feature, and a force-delete for when the graceful-delete fails.
A connector developer could choose to swallow all errors encountered, or
fail-fast to indicate to the operator that there is an issue with the
graceful-delete flow.
If the alternative is crashing the herder, connector developers may choose
to hide serious errors, which is undesirable.

2. What happens when the destroy() call takes a long time to complete, or
is interrupted?

It appears that your implementation serially destroy()s each appropriate
connector, and may prevent the herder thread from making progress while the
operation is ongoing.
We have previously had to patch Connect to perform all connector and task
operations on a background thread, because some connector method
implementations can stall indefinitely.
Connect also has the notion of "cancelling" a connector/task if a graceful
shutdown timeout operation takes too long. Perhaps some of that design or
machinery may be useful to protect this method call as well.

More specific to the destroy() call itself, what happens when a connector
completes part of a destroy operation and then cannot complete the
remainder, either due to timing out or a worker crashing?
What is the contract with the connector developer about this method? Is the
destroy() only started exactly once during the lifetime of the connector,
or may it be retried?

3. What should be considered a reasonable custom implementation of the
destroy() call? What resources should it clean up by default?

I think we can broadly categorize the state a connector mutates among the
following
* Framework-managed state (e.g. source offsets, consumer offsets)
* Implementation detail state (e.g. debezium db history topic, audit
tables, temporary accounts)
* Third party system data (e.g. the actual data being written by a sink
connector)
* Third party system metadata (e.g. tables in a database, delivery
receipts, permissions)

I think it's apparent that the framework-managed state cannot/should not be
interacted with by the destroy() call. However, the framework could be
changed to clean up these resources at the same time that destroy() is
called. Is that out-of-scope of this proposal, and better handled by manual
intervention?
From the text of the KIP, I think it explicitly includes the Implementation
detail state, which should not be depended on externally and should be safe
to clean up during a destroy(). I think this is completely reasonable.
Are the third-party data and metadata out-of-scope for this proposal? Can
we officially recommend against it, or should we accommodate users and
connector developers that wish to clean up data/metadata during destroy()?

4. How should connector implementations of destroy handle backwards
compatibility?

In terms of backward-compatibility for the framework vs connector versions,
I think the default-noop method is very reasonable.
However, what happens when someone upgrades from a version of a connector
without a destroy() implementation to one with an implementation, and
maintain backwards compatibility?
To replicate the same behavior, the connector might include something like
an `enable.cleanup` config which allows users to opt-in to the new
behavior. This could mean the proliferation of many different
configurations to handle this behavior.
Maybe we can provide some recommendations to developers, or some mechanism
to standardize this opt-in behavior.

I'm interested to hear if you have any experience with the above, if you've
experimented with this feature in your fork.

Thanks,
Greg

On Thu, Nov 3, 2022 at 11:55 AM Hector Geraldino (BLOOMBERG/ 919 3RD A) <
hgeraldino@bloomberg.net> wrote:

> Hi everyone,
>
> I've submitted KIP-883, which introduces a callback to the public
> Connector API called when deleting a connector:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-883%3A+Add+delete+callback+method+to+Connector+API
>
> It adds a new `deleted()` method (open to better naming suggestions) to
> the org.apache.kafka.connect.connector.Connector abstract class, which will
> be invoked by connect Workers when a connector is being deleted.
>
> Feedback and comments are welcome.
>
> Thank you!
> Hector
>
>