Posted to dev@kafka.apache.org by Chris Egerton <ch...@confluent.io.INVALID> on 2021/12/29 17:35:06 UTC

Re: Do we want to add more SMTs to Apache Kafka?

Hi all,

I think restricting the set of out-of-the-box SMTs that we provide with
Connect is reasonable. I do think Joshua raises a valuable point, though.
At the risk of reiterating his ideas, we can gain a few things from
improving the existing SMTs provided with Connect: first, we can establish
precedents for how SMTs are configured and implemented in more complex
scenarios (such as handling explicitly-specified nested fields or
traversing an entire key/value recursively), which can save time for both
developers and users if we do a good enough job for others to start
following the examples we set. Second, we decrease the likelihood that
someone forks, e.g., the InsertField SMT just to add their own small tweak
on top, which both adds unnecessary work for that developer and complicates
the experience for users of Connect ("which SMT do I use now?").
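
To make the two nested-field styles mentioned above concrete, here is a minimal sketch. It assumes plain java.util.Map / List values as a stand-in for a schemaless Connect record; it is not an existing Connect API, and the class and method names (NestedFieldSketch, getByPath, findByName) are hypothetical and used only for illustration. It also assumes that field names do not themselves contain dots.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Kafka Connect.
final class NestedFieldSketch {

    // Style 1: address one nested field explicitly with a dotted path,
    // e.g. "address.city". Returns null if a segment is missing or if a
    // non-map value is reached before the path ends.
    static Object getByPath(Map<String, Object> root, String dottedPath) {
        Object current = root;
        for (String segment : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current;
    }

    // Style 2: traverse the whole value recursively and collect the path of
    // every field with the given name, at any depth (including inside lists).
    static List<String> findByName(Object value, String fieldName) {
        List<String> matches = new ArrayList<>();
        collect(value, fieldName, "", matches);
        return matches;
    }

    private static void collect(Object value, String fieldName, String prefix, List<String> matches) {
        if (value instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                String key = String.valueOf(entry.getKey());
                String path = prefix.isEmpty() ? key : prefix + "." + key;
                if (fieldName.equals(key)) {
                    matches.add(path);
                }
                collect(entry.getValue(), fieldName, path, matches);
            }
        } else if (value instanceof List) {
            List<?> list = (List<?>) value;
            for (int i = 0; i < list.size(); i++) {
                collect(list.get(i), fieldName, prefix + "[" + i + "]", matches);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Oslo");
        address.put("id", 2);
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("id", 1);
        root.put("address", address);
        root.put("tags", Arrays.asList("a", "b"));

        System.out.println(getByPath(root, "address.city"));
        System.out.println(findByName(root, "id"));
    }
}
```

The path style is explicit and cheap but needs a rule for field names that contain the separator; the recursive style needs no path syntax but touches every field in the record. Either way, the point is that whichever convention is chosen would ideally be shared across SMTs.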

Additionally, I like Gunnar and Brandon's suggestion of a way to discover
SMTs. There's precedent for this with the "Kafka Connector Hub" link on the
https://cwiki.apache.org/confluence/display/KAFKA/ page, which currently
leads to a page on Confluent's website containing a fairly large list of
connectors from a variety of sources (
https://www.confluent.io/product/connectors/). In practice I'm not sure how
many new Kafka users end up visiting the wiki as their first stop, though.
Perhaps we could add a section to the docs page at
https://kafka.apache.org/documentation.html, for connectors,
transformations, and maybe even other pluggable components (converters,
config providers, etc.)?

Cheers,

Chris

On Sun, Nov 21, 2021 at 12:05 PM Joshua Grisham <gr...@gmail.com>
wrote:

> Hi all,
>
> From my perspective, the types of transformations already covered by the
> existing SMTs are quite good (but anyone else please say if you feel you
> are missing something that feels "standard"). The biggest issue is the
> limitations that many of them have, which make them very hard to use in a
> real production scenario.
>
> In my mind, the single biggest gap is the inability to handle nested fields
> or anything more than records that essentially look like simple key-value
> pairs. (One exception: if you chain the flatten transform first, you can
> apply others to the flattened result, but this assumes that the flatten
> transform can actually handle the message in the first place! If you have
> nested arrays then you are toast ;) And wait, maybe you didn't actually
> want to flatten anyway?)
>
> I am not sure the best way to approach this (e.g. allow for some kind of
> path notation so users can address nested fields directly vs allow for
> recursion to match a field name at no matter what level, or both, or
> something else?) but I would say that some kind of standardized approach
> that was implemented in all of the SMTs (where it makes sense) would
> certainly be best! (at least, from a user perspective that the
> configuration to address nested fields is consistent across each transform
> that allows it).  I did this one way in a proposed change for KIP-683 but
> this is only one of the possible ways (
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-683%3A+Add+recursive+support+to+Connect+Cast+and+ReplaceField+transforms%2C+and+support+for+casting+complex+types+to+either+a+native+or+JSON+string
> )
>
> Past that, there are a few tweaks or enhancements which could be made to
> some of the existing SMTs to help prevent them from blocking or failing in
> most general scenarios (some of the changes I proposed in the past but
> haven't since had time to follow up on fit in this category, I think). For
> example, the ability to "cast" a more complicated structure (such as an
> array) to a string (Connect API or JSON), so the record can then be
> flattened and inserted into a database table or something similar, would
> remove a lot of what are IMO currently roadblocks that users often hit in
> Sink scenarios.
>
> Then there are some small tweaks which maybe can be made for specific
> cases, some of which Randall already mentioned, such as:
>
> * The Filter implementation is of limited use, mostly due to the lack of
> some "standard-feeling" predicates (field value filtering is very often
> what I think people are looking for), so the Confluent one or another is
> often used instead.
> * A bit more can be done with InsertField IMO (e.g. giving a wallclock
> timestamp instead of the record's produced timestamp is one example that
> often seems to pop up).
> * Some standardized way to "move" one field to another place e.g. to move
> it out of or into a nested record.
> * Limitations on only processing one field per transformation, e.g. with
> the TimestampConverter like I had proposed with KIP-682 (
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-682%3A+Connect+TimestampConverter+support+for+multiple+fields+and+multiple+input+formats
> )
> feel just a little annoying and can add to processing time in high
> volume scenarios.
>
> (By the way apologies to Randall that I have not had a chance to get back
> yet on KIP-682 but will try to do so in the discussion thread in the coming
> days if I can!)
>
> And finally I also feel like some of the SMTs are a bit disjointed from
> each other when it comes to how the classes are actually designed and how
> the configuration works when using them (both from a user implementing the
> transform, and a transform developer perspective). Some of the class design
> difference might be necessary due to the nature of the transformation
> itself, but I wonder if in the future some kind of standardization could be
> built into a type of base class or something instead, or some enhancements
> to the requirements specified by the interface, which would help to drive a
> more standardized approach?  Or maybe at least just a once-through on the
> code for all of them to align things like how Config string constants/enums
> etc are handled, method names and position within the code, that they are
> all refactored in a similar way, etc.
>
> In the end, I do feel it makes sense to try and sort of aim for the 80/20
> rule with the standard SMTs to be able to support "real world" scenarios,
> but some of these limitations cause them to fall a bit short today.
>
> Hope this is helpful at least to spark other ideas anyway!
>
> Have a nice (rest of the) weekend!
> Joshua Grisham
>
>
> On Sat, Nov 20, 2021 at 01:16, Brandon Brown <brandon@bbrownsound.com>
> wrote:
>
> > I agree, if the desire is to keep the internal SMT collection small, then
> > providing ease of discovery, as Gunnar suggests, would be extremely
> > helpful.
> >
> > Brandon Brown
> >
> > > On Nov 19, 2021, at 6:13 PM, Gunnar Morling
> > <gu...@googlemail.com.invalid> wrote:
> > >
> > > Hi all,
> > >
> > > Just came across this thread, I hope the late reply is ok.
> > >
> > > FWIW, we're in a similar situation in Debezium, where users often request
> > > new (Debezium-specific) SMTs, and we generally tend to recommend them to be
> > > maintained by users themselves, unless they are truly generic. This
> > > excludes a share of users though who aren't Java developers.
> > >
> > > What might help is having means of simple discoverability of externally
> > > hosted SMTs, e.g. via some kind of catalog hosted on kafka.apache.org. That
> > > way, people would have it easier to find and obtain SMTs from other places,
> > > reducing the pressure to get them added to Apache Kafka proper.
> > >
> > > Best,
> > >
> > > --Gunnar
> > >
> > >
> > >
> > >
> > >> On Sun, Nov 7, 2021 at 21:49, Brandon Brown <brandon@bbrownsound.com> wrote:
> > >>
> > >> I like the idea of a select number of SMTs being offered and supported out
> > >> of the box. The addition of SMTs via this process is nice because it allows
> > >> for a rich set to be supported out of the box and without the need for
> > >> extra work to deploy.
> > >>
> > >> Perhaps this is a spot where the community could express interest in
> > >> additional SMTs, which may be available via an open source library, and if
> > >> enough usage occurs there could be a path to fold them into the Kafka
> > >> project at large?
> > >>
> > >> Brandon Brown
> > >>
> > >>
> > >>>> On Nov 7, 2021, at 1:19 PM, Randall Hauch <rh...@gmail.com> wrote:
> > >>>
> > >>> We have had several requests to add more Connect Single Message
> > >>> Transforms (SMTs) to the project. When SMTs were first introduced with
> > >>> KIP-66 (ref 1) in Jun 2017, the KIP mentioned the following:
> > >>>
> > >>>> Criteria: SMTs that are shipped with Kafka Connect should be general
> > >>>> enough to apply to many data sources & serialization formats. They should
> > >>>> also be simple enough to not cause any additional library dependency to be
> > >>>> introduced.
> > >>>> Beyond those being initially included with this KIP, transformations
> > >>>> can be adopted for inclusion in future with JIRA/ML discussion to weigh the
> > >>>> tradeoffs.
> > >>>
> > >>> In the 4+ years that we've had SMTs in the project, we've only
> > >>> enhanced the framework with KIP-585 (ref 2), and fixed the initial
> > >>> SMTs (including KIP-437, ref 3). We recently have had quite a few
> > >>> requests to add new SMTs; a few samples of these include:
> > >>> * https://issues.apache.org/jira/browse/KAFKA-10299
> > >>> * https://issues.apache.org/jira/browse/KAFKA-9436
> > >>> * https://issues.apache.org/jira/browse/KAFKA-9318
> > >>> * https://issues.apache.org/jira/browse/KAFKA-12443
> > >>>
> > >>> Adding new or changing existing SMTs in the Apache Kafka project comes
> > >>> with requirements. First, AK releases are infrequent and necessarily
> > >>> involve the entire project. Second, adding an SMT is an API change and
> > >>> therefore requires a KIP. Third, all changes in behavior to SMTs
> > >>> included in a prior AK release must be backward compatible, and
> > >>> adding or changing an SMT's configuration requires a KIP. This last
> > >>> one is also challenging if we're limiting ourselves to truly general
> > >>> SMTs, since these are notoriously difficult to get right the first
> > >>> time. All of these aspects mean that it's difficult to add, maintain,
> > >>> and evolve/improve SMTs in AK. And unless a bug fix is critical, we're
> > >>> likely not to create a patch release for AK just to fix a bug in an
> > >>> SMT, simply because of the effort involved.
> > >>>
> > >>> On the other hand, anyone can easily implement their own SMT and
> > >>> deploy it as a Connect plugin, whether that's part of a connector
> > >>> plugin or a separate plugin dedicated to one or more SMTs. Interestingly,
> > >>> it's far simpler to implement and maintain custom SMTs outside of AK,
> > >>> especially since those plugins can be released and deployed in any
> > >>> Connect runtime version since at least 0.11.0. And if custom SMTs are
> > >>> maintained in a relatively small project, they can be released often.
> > >>>
> > >>> Finally, KIP-26 (ref 4) specifically rejected maintaining connector
> > >>> implementations in the AK project. So we have precedent for choosing
> > >>> not to accept implementations.
> > >>>
> > >>> Given the above, I wonder if the time has come for us to prefer only
> > >>> maintaining the SMT framework and existing SMTs, and to decline adding
> > >>> new SMTs.
> > >>>
> > >>> Thoughts?
> > >>>
> > >>> Best regards,
> > >>>
> > >>> Randall Hauch
> > >>>
> > >>> (1) https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> > >>> (2) https://cwiki.apache.org/confluence/display/KAFKA/KIP-585%3A+Filter+and+Conditional+SMTs
> > >>> (3) https://cwiki.apache.org/confluence/display/KAFKA/KIP-437%3A+Custom+replacement+for+MaskField+SMT
> > >>> (4) https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767