Posted to dev@kafka.apache.org by Shikhar Bhushan <sh...@confluent.io> on 2016/12/07 19:46:44 UTC

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Hi all,

I have another iteration of the proposal for this feature here:
https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design

I'd welcome your feedback and comments.

Thanks,

Shikhar

On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> >
> >
> > Hmm, operating on ConnectRecords probably doesn't work since you need to
> > emit the right type of record, which might mean instantiating a new one. I
> > think that means we either need 2 methods, one for SourceRecord, one for
> > SinkRecord, or we'd need to limit what parts of the message you can modify
> > (e.g. you can change the key/value via something like
> > transformKey(ConnectRecord) and transformValue(ConnectRecord), but other
> > fields would remain the same and the framework would handle allocating new
> > Source/SinkRecords if needed)
> >
>
> Good point, perhaps we could add an abstract method on ConnectRecord that
> takes all the shared fields as parameters and the implementations return a
> copy of the narrower SourceRecord/SinkRecord type as appropriate.
> Transformers would only operate on ConnectRecord rather than caring about
> SourceRecord or SinkRecord (in theory they could instanceof/cast, but the
> API should discourage it)
>
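A minimal sketch of the copy-method idea described above, under stated assumptions: `SketchRecord`, `SketchSourceRecord`, `SketchSinkRecord` and `UppercaseValue` are hypothetical stand-ins invented for illustration, with far fewer fields than the real Connect record types. The point is only that a transformer written against the base type gets back the narrower type without an instanceof check or cast.

```java
import java.util.Locale;

// Hypothetical, simplified stand-in for the shared record base type.
abstract class SketchRecord {
    final String topic;
    final Object key;
    final Object value;

    SketchRecord(String topic, Object key, Object value) {
        this.topic = topic;
        this.key = key;
        this.value = value;
    }

    // Takes the shared fields; each subclass returns a copy of its own type.
    abstract SketchRecord newRecord(String topic, Object key, Object value);
}

final class SketchSourceRecord extends SketchRecord {
    final String sourceOffset; // source-only field, carried over on copy

    SketchSourceRecord(String topic, Object key, Object value, String sourceOffset) {
        super(topic, key, value);
        this.sourceOffset = sourceOffset;
    }

    @Override
    SketchRecord newRecord(String topic, Object key, Object value) {
        return new SketchSourceRecord(topic, key, value, sourceOffset);
    }
}

final class SketchSinkRecord extends SketchRecord {
    final long kafkaOffset; // sink-only field, carried over on copy

    SketchSinkRecord(String topic, Object key, Object value, long kafkaOffset) {
        super(topic, key, value);
        this.kafkaOffset = kafkaOffset;
    }

    @Override
    SketchRecord newRecord(String topic, Object key, Object value) {
        return new SketchSinkRecord(topic, key, value, kafkaOffset);
    }
}

// A transformer that only knows about the base type.
final class UppercaseValue {
    static SketchRecord apply(SketchRecord record) {
        Object v = record.value;
        Object upper = (v instanceof String) ? ((String) v).toUpperCase(Locale.ROOT) : v;
        return record.newRecord(record.topic, record.key, upper);
    }
}
```

Passing a `SketchSourceRecord` through `UppercaseValue.apply` yields another `SketchSourceRecord` with its `sourceOffset` intact, and likewise for the sink type.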
>
> > Is there a use case for hanging on to the original? I can't think of a
> > transformation where you'd need to do that (or couldn't just order things
> > differently so it isn't a problem).
>
>
> Yeah maybe this isn't really necessary. No strong preference here.
>
> > That said, I do worry a bit that farming too much stuff out to transformers
> > can result in "programming via config", i.e. a lot of the simplicity you
> > get from Connect disappears in long config files. Standardization would be
> > nice and might just avoid this (and doesn't cost that much implementing it
> > in each connector), and I'd personally prefer something a bit less flexible
> > but consistent and easy to configure.
>
>
> Not sure what you're suggesting :-) Standardized config properties for
> a small set of transformations, leaving it up to connectors to integrate?
>

I just mean that you get to the point where you're practically writing a
Kafka Streams application, you're just doing it through either an
incredibly convoluted set of transformers and configs, or a single
transformer with an incredibly convoluted set of configs. You basically get to
the point where your config is a mini DSL and you're not really saving
that much.

The real question is how much we want to venture into the "T" part of ETL.
I tend to favor minimizing how much we take on since the rest of Connect
isn't designed for it, it's designed around the E & L parts.

-Ewen


> > Personally I'm skeptical of that level of flexibility in transformers --
> > it's getting awfully complex and certainly takes us pretty far from "config
> > only" realtime data integration. It's not clear to me what the use cases
> > are that aren't covered by a small set of common transformations that can
> > be chained together (e.g. rename/remove fields, mask values, and maybe a
> > couple more).
> >
>
> I agree that we should have some standard transformations that we ship with
> Connect that users would ideally lean towards for routine tasks. The ones
> you mention are some good candidates where I'd imagine we can expose simple
> config, e.g.
>    transform.filter.whitelist=x,y,z # filter to a whitelist of fields
>    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
>    topic.rename.replace=-/_
>    topic.rename.prefix=kafka_
> etc.
>
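To make the rename spec above concrete, here is a rough sketch of how a value such as `oldName1=>newName1, oldName2=>newName2` could be parsed and applied. The `RenameSpec` class and its methods are hypothetical names chosen for illustration, and a record is modeled as a flat field map, which is a simplification and not how Connect represents data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RenameSpec {
    // Parses "old1=>new1, old2=>new2" into an ordered old-name -> new-name map.
    static Map<String, String> parse(String spec) {
        Map<String, String> renames = new LinkedHashMap<>();
        for (String pair : spec.split(",")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            String[] parts = trimmed.split("=>", -1);
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Bad rename entry: " + trimmed);
            }
            renames.put(parts[0].trim(), parts[1].trim());
        }
        return renames;
    }

    // Applies the renames to a flat field map; unlisted fields pass through.
    static Map<String, Object> apply(Map<String, String> renames, Map<String, Object> fields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            out.put(renames.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }
}
```

Rejecting malformed entries up front is a design choice in this sketch: a config typo then fails at startup instead of silently leaving a field unrenamed.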
> However the ecosystem will invariably have more complex transformers if we
> make this pluggable. And because ETL is messy, that's probably a good thing
> if folks are able to do their data munging orthogonally to connectors, so
> that connectors can focus on the logic of how data should be copied from/to
> datastores and Kafka.
>
>
> > In any case, we'd probably also have to change configs of connectors if we
> > allowed configs like that since presumably transformer configs will just be
> > part of the connector config.
> >
>
> Yeah, haven't thought much about how all the configuration would tie
> together...
>
> I think we'd need the ability to:
> - spec transformer chain (fully-qualified class names? perhaps special
> aliases for built-in ones? perhaps third-party fqcns can be assigned
> aliases by users in the chain spec, for easier configuration and to
> uniquely identify a transformation when it occurs more than one time in a
> chain?)
> - configure each transformer -- all properties prefixed with that
> transformer's ID (fqcn / alias) get destined to it
>
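The two abilities listed above (an ordered chain spec, and routing of prefixed properties to each transformer) could look roughly like the following. This is only a sketch under assumptions: the `transforms` key, the `TransformConfigSketch` class and the example aliases are hypothetical, and nothing here is taken from an agreed design.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TransformConfigSketch {
    // Reads the ordered list of aliases from a hypothetical "transforms" key.
    static List<String> chain(Map<String, String> connectorConfig) {
        List<String> aliases = new ArrayList<>();
        String spec = connectorConfig.getOrDefault("transforms", "");
        for (String alias : spec.split(",")) {
            String trimmed = alias.trim();
            if (!trimmed.isEmpty()) {
                aliases.add(trimmed);
            }
        }
        return aliases;
    }

    // Returns the entries whose key starts with "<alias>.", prefix stripped.
    static Map<String, String> forAlias(String alias, Map<String, String> connectorConfig) {
        String prefix = alias + ".";
        Map<String, String> scoped = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : connectorConfig.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                scoped.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return scoped;
    }
}
```

With a connector config containing `transforms=mask, rename`, `mask.fields=ssn` and `rename.spec=a=>b`, `chain` returns the aliases in order and `forAlias("mask", ...)` returns only `fields=ssn`. Using aliases as prefixes is also what would let the same transformation class appear twice in one chain with different settings.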
> Additionally, I think we would probably want to allow for topic-specific
> overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you want
> certain transformations for one topic, but different ones for another...)
>



--
Thanks,
Ewen

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Shikhar Bhushan <sh...@confluent.io>.
Makes sense Ewen, I edited the KIP to include these criteria.

I'd like to start a voting thread soon unless anyone has additional points
for discussion.

On Fri, Dec 30, 2016 at 12:14 PM Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

On Thu, Dec 15, 2016 at 7:41 PM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> There is no decision being proposed on the final list of transformations
> that will ever be in Kafka :-) Just the initial set we should roll with.
>

I'd second this comment as well. I'm very wary of the slippery slope, which
is why I wasn't in favor of including any connectors except for very simple
demos.

But it might be useful to have some initial guidelines, and might even make
sense to include them in the KIP so they are easy for others to find. I
think both the examples Gwen gave are easily excluded with a simple rule:
SMTs that are shipped with Kafka should be general enough to apply to many
data sources & serialization formats. Email is a very specific type of data
(email headers and HL7 are pretty similar) and Avro is a specific
serialization format where, presumably, the Connect data type you'd have to
receive to do this transformation is just a byte array of the original Avro
file. In contrast, the included transformations in the current KIP are
*really* broadly applicable; apart from timestamps, I think they pretty
much all could potentially be applied to *any* stream of data.

I think the more interesting cases that we'll probably end up debating are
around serialization formats that "fit" within other connectors, in
particular I'm thinking of CSV and line-oriented JSON parsing. Individual
connectors may avoid this (or not be aware that the data has this
structure), but users will want that type of transformation to be easy and
baked in.

-Ewen


>
> On Thu, Dec 15, 2016 at 3:34 PM Gwen Shapira <gw...@confluent.io> wrote:
>
> You are absolutely right that the vast majority of NiFi's processors are
> not what we would consider SMT.
>
> I went over the list and I think they still contain just short of 50 legit
> SMTs:
> https://cwiki.apache.org/confluence/display/KAFKA/Analyzing+NiFi+Transformations
>
> You are right that ExtractHL7 is an extreme that clearly doesn't belong in
> Apache Kafka, but just before that we have ExtractAvroMetadata that may
> fit? and ExtractEmailHeaders doesn't sound totally outlandish either...
>
> Nothing in the baked-in list by Shikhar looks out of place. I am concerned
> about the slippery slope. Or the arbitrariness of the decision if we say that
> this list is final and nothing else will ever make it into Kafka.
>
> Gwen
>
> On Thu, Dec 15, 2016 at 3:00 PM, Ewen Cheslack-Postava <ew...@confluent.io>
> wrote:
>
> > I think there are a couple of factors that make transformations and
> > connectors different.
> >
> > First, NiFi's 150 processors is a bit misleading. In NiFi, processors cover
> > data sources, data sinks, serialization/deserialization, *and*
> > transformations. I haven't filtered the list to see how many fall into the
> > first 3 categories, but it's a *lot* of the processors they have.
> >
> > Second, since transformations only apply to a single message and I'd think
> > they generally shouldn't be interacting with external services (i.e. I
> > think trying to do enrichment in SMT is probably a bad idea), the scope of
> > possible transformations is reasonably limited and the transformations
> > themselves tend to be small and easily maintainable. I think this is a
> > dramatic difference from connectors, which are each substantial projects in
> > their own right.
> >
> > While I get the slippery slope argument re: including specific
> > transformations, I think we can come up with a reasonable policy (and via
> > KIPs we can, as a community, come to an agreement based purely on taste if
> > it comes down to that). In particular, I'd say keep the core general (i.e.
> > no domain-specific transformations/parsing like HL7), pure data
> > manipulation (i.e. no enrichment), and nothing that could just as well be
> > done as a converter/serializer/deserializer/source connector/sink
> > connector.
> >
> > I was very staunchly against including connectors (aside from a simple
> > example) directly in Kafka, so this may seem like a reversal of position.
> > But I think the % of use cases covered will look very different between
> > connectors and transformations. Sure, some connectors are very popular, and
> > more so right now because they are the most thoroughly developed, tested,
> > etc. But the top 3 most common transformations will probably be used across
> > all the top 20 most popular connectors. I have no doubt people will end up
> > writing custom ones (which is why it's nice to make them pluggable rather
> > than choosing a fixed set), but they'll either be very niche (like people
> > write custom connectors for their internal systems) or be more broadly
> > applicable but very domain specific such that they are easy to reject for
> > inclusion.
> >
> > @Gwen if we filtered the list of NiFi processors to ones that fit those
> > criteria, would that still be too long a list for your taste? Similarly,
> > let's say we were going to include some baked in; in that case, does
> > anything look out of place to you in the list Shikhar has included in the
> > KIP?
> >
> > -Ewen
> >
> > On Thu, Dec 15, 2016 at 2:01 PM, Gwen Shapira <gw...@confluent.io> wrote:
> >
> > > I agree about the ease of use in adding a small subset of built-in
> > > transformations.
> > >
> > > But the same thing is true for connectors - there are maybe 5 super popular
> > > OSS connectors and the rest is a very long tail. We drew the line at not
> > > adding any, because that's the easiest and because we did not want to turn
> > > Kafka into a collection of transformations.
> > >
> > > I really don't want to end up with 135 (or even 20) transformations in
> > > Kafka. So either we have a super-clear definition of what belongs and what
> > > doesn't - or we put in one minimal example and the rest goes into the
> > > ecosystem.
> > >
> > > We can also start by putting transformations on GitHub and just see if
> > > there is huge demand for them in Apache. It is easier to add stuff to the
> > > project later than to remove functionality.
> > >
> > >
> > >
> > > On Thu, Dec 15, 2016 at 11:59 AM, Shikhar Bhushan <shikhar@confluent.io>
> > > wrote:
> > >
> > > > I have updated KIP-66
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> > > > with the changes I proposed in the design.
> > > >
> > > > Gwen, I think the main downside to not including some transformations with
> > > > Kafka Connect is that it seems less user friendly if folks have to make
> > > > sure to have the right transformation(s) on the classpath as well, besides
> > > > their connector(s). Additionally by going in with a small set included, we
> > > > can encourage a consistent configuration and implementation style and
> > > > provide utilities for e.g. data transformations, which I expect we will
> > > > definitely need (discussed under 'Patterns for data transformations').
> > > >
> > > > It does get hard to draw the line once you go from 'none' to 'some'. To get
> > > > discussion going, if we get agreement on 'none' vs 'some', I added a table
> > > > under 'Bundled transformations' for transformations which I think are worth
> > > > including.
> > > >
> > > > For many of these, I have noticed their absence in the wild as a pain point --
> > > > TimestampRouter:
> > > > https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
> > > > Mask:
> > > > https://groups.google.com/d/msg/confluent-platform/3yHb8_mCReQ/sTQc3dNgBwAJ
> > > > Insert:
> > > > http://stackoverflow.com/questions/40664745/elasticsearch-connector-for-kafka-connect-offset-and-timestamp
> > > > RegexRouter:
> > > > https://groups.google.com/d/msg/confluent-platform/yEBwu1rGcs0/gIAhRp6kBwAJ
> > > > NumericCast:
> > > > https://github.com/confluentinc/kafka-connect-jdbc/issues/101#issuecomment-249096119
> > > > TimestampConverter:
> > > > https://groups.google.com/d/msg/confluent-platform/gGAOsw3Qeu4/8JCqdDhGBwAJ
> > > > ValueToKey: https://github.com/confluentinc/kafka-connect-jdbc/pull/166
> > > >
> > > > In other cases, their functionality is already being implemented by
> > > > connectors in divergent ways: RegexRouter, Insert, Replace, HoistToStruct,
> > > > ExtractFromStruct
> > > >
> > > > On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <gw...@confluent.io> wrote:
> > > >
> > > > I'm a bit concerned about adding transformations in Kafka. NiFi has 150
> > > > processors, presumably they are all useful for someone. I don't know if I'd
> > > > want all of that in Apache Kafka. What's the downside of keeping it out? Or
> > > > at least keeping the built-in set super minimal (Flume has like 3 built-in
> > > > interceptors)?
> > > >
> > > > Gwen
> > > >
> > > > On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <shikhar@confluent.io>
> > > > wrote:
> > > >
> > > > > With regard to a), just using `ConnectRecord` with `newRecord` as a new
> > > > > abstract method would be a fine choice. In prototyping, both options end up
> > > > > looking pretty similar (in terms of how transformations are implemented and
> > > > > the runtime initializes and uses them) and I'm starting to lean towards not
> > > > > adding a new interface into the mix.
> > > > >
> > > > > On b) I think we should include a small set of useful transformations with
> > > > > Connect, since they can be applicable across different connectors and we
> > > > > should encourage some standardization for common operations. I'll update
> > > > > KIP-66 soon including a spec of transformations that I believe are worth
> > > > > including.
> > > > >
> > > > > On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <ewen@confluent.io>
> > > > > wrote:
> > > > >
> > > > > If anyone has time to review here, it'd be great to get feedback. I'd
> > > > > imagine that the proposal itself won't be too controversial -- keeps
> > > > > transformations simple (by only allowing map/filter), doesn't affect the
> > > > > rest of the framework much, and fits in with general config structure we've
> > > > > used elsewhere (although ConfigDef could use some updates to make this
> > > > > easier...).
> > > > >
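As a rough illustration of what "only allowing map/filter" can mean for a chain of single-message transformations, here is a sketch. It assumes that a step signals "filter this record out" by returning null, which is an assumption made for this example and not something stated in the thread; `TransformChain` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class TransformChain<R> {
    private final List<UnaryOperator<R>> steps;

    TransformChain(List<UnaryOperator<R>> steps) {
        this.steps = new ArrayList<>(steps); // defensive copy
    }

    // Returns the transformed record, or null if some step filtered it out.
    R apply(R record) {
        R current = record;
        for (UnaryOperator<R> step : steps) {
            current = step.apply(current);
            if (current == null) {
                return null; // filtered: skip the remaining steps
            }
        }
        return current;
    }
}
```

Each step sees exactly one record and yields at most one, so there is no way to express joins, aggregation or fan-out, which keeps the feature well short of a stream-processing API.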
> > > > > I think the main open questions for me are:
> > > > >
> > > > > a) Is TransformableRecord worth it to avoid reimplementing small bits of
> > > > > code (it allows for a single implementation of the interface to trivially
> > > > > apply to both Source and SinkRecords). I think I prefer this, but it does
> > > > > come with some commitment to another interface on top of ConnectRecord. We
> > > > > could alternatively modify ConnectRecord which would require fewer changes.
> > > > > b) How do folks feel about built-in transformations and the set that are
> > > > > mentioned here? This brings us way back to the discussion of built-in
> > > > > connectors. Transformations, especially when intended to be lightweight and
> > > > > touch nothing besides the data already in the record, seem different from
> > > > > connectors -- there might be quite a few, but hopefully limited. Since we
> > > > > (hopefully) already factor out most serialization-specific stuff via
> > > > > Converters, I think we can keep this pretty limited. That said, I have no
> > > > > doubt some folks will (in my opinion) abuse this feature to do data
> > > > > enrichment by querying external systems, so building a bunch of
> > > > > transformations in could potentially open the floodgates, or at least make
> > > > > decisions about what is included vs what should be 3rd party muddy.
> > > > >
> > > > > -Ewen
> > > > >
> > > > >
> > > > > On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <shikhar@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I have another iteration at a proposal for this feature here:
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/
> > > > > > Connect+Transforms+-+Proposed+Design
> > > > > >
> > > > > > I'd welcome your feedback and comments.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Shikhar
> > > > > >
> > > > > > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <
> > > > ewen@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <
> > > > shikhar@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Hmm, operating on ConnectRecords probably doesn't work since
> > you
> > > > need
> > > > > > to
> > > > > > > > emit the right type of record, which might mean
instantiating
> a
> > > new
> > > > > > one.
> > > > > > > I
> > > > > > > > think that means we either need 2 methods, one for
> > SourceRecord,
> > > > one
> > > > > > for
> > > > > > > > SinkRecord, or we'd need to limit what parts of the message
> you
> > > can
> > > > > > > modify
> > > > > > > > (e.g. you can change the key/value via something like
> > > > > > > > transformKey(ConnectRecord) and
> transformValue(ConnectRecord),
> > > but
> > > > > > other
> > > > > > > > fields would remain the same and the fmwk would handle
> > allocating
> > > > new
> > > > > > > > Source/SinkRecords if needed)
> > > > > > > >
> > > > > > >
> > > > > > > Good point, perhaps we could add an abstract method on
> > > ConnectRecord
> > > > > that
> > > > > > > takes all the shared fields as parameters and the
> implementations
> > > > > return
> > > > > > a
> > > > > > > copy of the narrower SourceRecord/SinkRecord type as
> appropriate.
> > > > > > > Transformers would only operate on ConnectRecord rather than
> > caring
> > > > > about
> > > > > > > SourceRecord or SinkRecord (in theory they could
> instanceof/cast,
> > > but
> > > > > the
> > > > > > > API should discourage it)
> > > > > > >
> > > > > > >
> > > > > > > > Is there a use case for hanging on to the original? I can't
> > think
> > > > of
> > > > > a
> > > > > > > > transformation where you'd need to do that (or couldn't just
> > > order
> > > > > > things
> > > > > > > > differently so it isn't a problem).
> > > > > > >
> > > > > > >
> > > > > > > Yeah maybe this isn't really necessary. No strong preference
> > here.
> > > > > > >
> > > > > > > That said, I do worry a bit that farming too much stuff out to
> > > > > > transformers
> > > > > > > > can result in "programming via config", i.e. a lot of the
> > > > simplicity
> > > > > > you
> > > > > > > > get from Connect disappears in long config files.
> > Standardization
> > > > > would
> > > > > > > be
> > > > > > > > nice and might just avoid this (and doesn't cost that much
> > > > > implementing
> > > > > > > it
> > > > > > > > in each connector), and I'd personally prefer something a
bit
> > > less
> > > > > > > flexible
> > > > > > > > but consistent and easy to configure.
> > > > > > >
> > > > > > >
> > > > > > > Not sure what you're suggesting :-) Standardized config properties for
> > > > > > > a small set of transformations, leaving it up to connectors to integrate?
> > > > > >
> > > > > > I just mean that you get to the point where you're practically writing a
> > > > > > Kafka Streams application, you're just doing it through either an
> > > > > > incredibly convoluted set of transformers and configs, or a single
> > > > > > transformer with an incredibly convoluted set of configs. You basically get
> > > > > > to the point where your config is a mini DSL and you're not really saving
> > > > > > that much.
> > > > > >
> > > > > > The real question is how much we want to venture into the "T"
> part
> > of
> > > > > ETL.
> > > > > > I tend to favor minimizing how much we take on since the rest of
> > > > Connect
> > > > > > isn't designed for it, it's designed around the E & L parts.
> > > > > >
> > > > > > -Ewen
> > > > > >
> > > > > >
> > > > > > > Personally I'm skeptical of that level of flexibility in
> > > transformers
> > > > > --
> > > > > > > > it's getting awfully complex and certainly takes us pretty
far
> > > from
> > > > > > > "config
> > > > > > > > only" realtime data integration. It's not clear to me what
> the
> > > use
> > > > > > cases
> > > > > > > > are that aren't covered by a small set of common
> > transformations
> > > > that
> > > > > > can
> > > > > > > > be chained together (e.g. rename/remove fields, mask values,
> > and
> > > > > maybe
> > > > > > a
> > > > > > > > couple more).
> > > > > > > >
> > > > > > >
> > > > > > > I agree that we should have some standard transformations that
> we
> > > > ship
> > > > > > with
> > > > > > > connect that users would ideally lean towards for routine
> tasks.
> > > The
> > > > > ones
> > > > > > > you mention are some good candidates where I'd imagine can
> expose
> > > > > simple
> > > > > > > config, e.g.
> > > > > > >    transform.filter.whitelist=x,y,z # filter to a whitelist of
> > > > fields
> > > > > > >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > > > > >    topic.rename.replace=-/_
> > > > > > >    topic.rename.prefix=kafka_
> > > > > > > etc..
> > > > > > >
> > > > > > > However the ecosystem will invariably have more complex
> > > transformers
> > > > if
> > > > > > we
> > > > > > > make this pluggable. And because ETL is messy, that's probably
> a
> > > good
> > > > > > thing
> > > > > > > if folks are able to do their data munging orthogonally to
> > > > connectors,
> > > > > so
> > > > > > > that connectors can focus on the logic of how data should be
> > copied
> > > > > > from/to
> > > > > > > datastores and Kafka.
> > > > > > >
> > > > > > >
> > > > > > > > In any case, we'd probably also have to change configs of
> > > > connectors
> > > > > if
> > > > > > > we
> > > > > > > > allowed configs like that since presumably transformer
> configs
> > > will
> > > > > > just
> > > > > > > be
> > > > > > > > part of the connector config.
> > > > > > > >
> > > > > > >
> > > > > > > Yeah, haven't thought much about how all the configuration
> would
> > > tie
> > > > > > > together...
> > > > > > >
> > > > > > > I think we'd need the ability to:
> > > > > > > - spec transformer chain (fully-qualified class names? perhaps
> > > > special
> > > > > > > aliases for built-in ones? perhaps third-party fqcns can be
> > > assigned
> > > > > > > aliases by users in the chain spec, for easier configuration
> and
> > to
> > > > > > > uniquely identify a transformation when it occurs more than
one
> > > time
> > > > in
> > > > > a
> > > > > > > chain?)
> > > > > > > - configure each transformer -- all properties prefixed with
> that
> > > > > > > transformer's ID (fqcn / alias) get destined to it
> > > > > > >
> > > > > > > Additionally, I think we would probably want to allow for
> > > > > topic-specific
> > > > > > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962>
> > (e.g.
> > > > you
> > > > > > > want
> > > > > > > certain transformations for one topic, but different ones for
> > > > > another...)
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks,
> > > > > > Ewen
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *Gwen Shapira*
> > > > Product Manager | Confluent
> > > > 650.450.2760 | @gwenshap
> > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > > <http://www.confluent.io/blog>
> > > >
> > >
> > >
> > >
> > > --
> > > *Gwen Shapira*
> > > Product Manager | Confluent
> > > 650.450.2760 | @gwenshap
> > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > <http://www.confluent.io/blog>
> > >
> >
>
>
>
> --
> *Gwen Shapira*
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> <http://www.confluent.io/blog>
>

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
On Thu, Dec 15, 2016 at 7:41 PM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> There is no decision being proposed on the final list of transformations
> that will ever be in Kafka :-) Just the initial set we should roll with.
>

I'd second this comment as well. I'm very wary of the slippery slope, which
is why I wasn't in favor of including any connectors except for very simple
demos.

But it might be useful to have some initial guidelines, and might even make
sense to include them in the KIP so they are easy for others to find. I
think both the examples Gwen gave are easily excluded with a simple rule:
SMTs that are shipped with Kafka should be general enough to apply to many
data sources & serialization formats. Email is a very specific type of data
(email headers and HL7 are pretty similar) and Avro is a specific
serialization format where, presumably, the Connect data type you'd have to
receive to do this transformation is just a byte array of the original Avro
file. In contrast, the included transformations in the current KIP are
*really* broadly applicable; apart from timestamps, I think they pretty
much all could potentially be applied to *any* stream of data.

I think the more interesting cases that we'll probably end up debating are
around serialization formats that "fit" within other connectors, in
particular I'm thinking of CSV and line-oriented JSON parsing. Individual
connectors may avoid this (or not be aware that the data has this
structure), but users will want that type of transformation to be easy and
baked in.

-Ewen


>
> On Thu, Dec 15, 2016 at 3:34 PM Gwen Shapira <gw...@confluent.io> wrote:
>
> You are absolutely right that the vast majority of NiFi's processors are
> not what we would consider SMT.
>
> I went over the list and I think they still contain just short of 50 legit
> SMTs:
> https://cwiki.apache.org/confluence/display/KAFKA/Analyzing+NiFi+Transformations
>
> You are right that ExtractHL7 is an extreme that clearly doesn't belong in
> Apache Kafka, but just before that we have ExtractAvroMetadata that may
> fit? and ExtractEmailHeaders doesn't sound totally outlandish either...
>
> Nothing in the baked-in list by Shikhar looks out of place. I am concerned
> about the slippery slope. Or the arbitrariness of the decision if we say that
> this list is final and nothing else will ever make it into Kafka.
>
> Gwen
>
> On Thu, Dec 15, 2016 at 3:00 PM, Ewen Cheslack-Postava <ew...@confluent.io>
> wrote:
>
> > I think there are a couple of factors that make transformations and
> > connectors different.
> >
> > First, NiFi's 150 processors is a bit misleading. In NiFi, processors cover
> > data sources, data sinks, serialization/deserialization, *and*
> > transformations. I haven't filtered the list to see how many fall into the
> > first 3 categories, but it's a *lot* of the processors they have.
> >
> > Second, since transformations only apply to a single message and I'd think
> > they generally shouldn't be interacting with external services (i.e. I
> > think trying to do enrichment in SMT is probably a bad idea), the scope of
> > possible transformations is reasonably limited and the transformations
> > themselves tend to be small and easily maintainable. I think this is a
> > dramatic difference from connectors, which are each substantial projects in
> > their own right.
> >
> > While I get the slippery slope argument re: including specific
> > transformations, I think we can come up with a reasonable policy (and via
> > KIPs we can, as a community, come to an agreement based purely on taste if
> > it comes down to that). In particular, I'd say keep the core general (i.e.
> > no domain-specific transformations/parsing like HL7), pure data
> > manipulation (i.e. no enrichment), and nothing that could just as well be
> > done as a converter/serializer/deserializer/source connector/sink
> > connector.
> >
> > I was very staunchly against including connectors (aside from a simple
> > example) directly in Kafka, so this may seem like a reversal of position.
> > But I think the % of use cases covered will look very different between
> > connectors and transformations. Sure, some connectors are very popular, and
> > more so right now because they are the most thoroughly developed, tested,
> > etc. But the top 3 most common transformations will probably be used across
> > all the top 20 most popular connectors. I have no doubt people will end up
> > writing custom ones (which is why it's nice to make them pluggable rather
> > than choosing a fixed set), but they'll either be very niche (like people
> > write custom connectors for their internal systems) or be more broadly
> > applicable but very domain specific such that they are easy to reject for
> > inclusion.
> >
> > @Gwen if we filtered the list of NiFi processors to ones that fit those
> > criteria, would that still be too long a list for your taste? Similarly,
> > let's say we were going to include some baked in; in that case, does
> > anything look out of place to you in the list Shikhar has included in the
> > KIP?
> >
> > -Ewen
> >
> > On Thu, Dec 15, 2016 at 2:01 PM, Gwen Shapira <gw...@confluent.io> wrote:
> >
> > > I agree about the ease of use in adding a small subset of built-in
> > > transformations.
> > >
> > > But the same thing is true for connectors - there are maybe 5 super popular
> > > OSS connectors and the rest is a very long tail. We drew the line at not
> > > adding any, because that's the easiest and because we did not want to turn
> > > Kafka into a collection of transformations.
> > >
> > > I really don't want to end up with 135 (or even 20) transformations in
> > > Kafka. So either we have a super-clear definition of what belongs and what
> > > doesn't - or we put in one minimal example and the rest goes into the
> > > ecosystem.
> > >
> > > We can also start by putting transformations on GitHub and just see if
> > > there is huge demand for them in Apache. It is easier to add stuff to the
> > > project later than to remove functionality.
> > >
> > >
> > >
> > > On Thu, Dec 15, 2016 at 11:59 AM, Shikhar Bhushan <
> shikhar@confluent.io>
> > > wrote:
> > >
> > > > I have updated KIP-66
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> > > > with the changes I proposed in the design.
> > > >
> > > > Gwen, I think the main downside to not including some transformations with
> > > > Kafka Connect is that it seems less user friendly if folks have to make
> > > > sure to have the right transformation(s) on the classpath as well, besides
> > > > their connector(s). Additionally by going in with a small set included, we
> > > > can encourage a consistent configuration and implementation style and
> > > > provide utilities for e.g. data transformations, which I expect we will
> > > > definitely need (discussed under 'Patterns for data transformations').
> > > >
> > > > It does get hard to draw the line once you go from 'none' to 'some'. To get
> > > > discussion going, if we get agreement on 'none' vs 'some', I added a table
> > > > under 'Bundled transformations' for transformations which I think are worth
> > > > including.
> > > >
> > > > For many of these, I have noticed their absence in the wild as a pain point --
> > > > TimestampRouter:
> > > > https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
> > > > Mask:
> > > > https://groups.google.com/d/msg/confluent-platform/3yHb8_mCReQ/sTQc3dNgBwAJ
> > > > Insert:
> > > > http://stackoverflow.com/questions/40664745/elasticsearch-connector-for-kafka-connect-offset-and-timestamp
> > > > RegexRouter:
> > > > https://groups.google.com/d/msg/confluent-platform/yEBwu1rGcs0/gIAhRp6kBwAJ
> > > > NumericCast:
> > > > https://github.com/confluentinc/kafka-connect-jdbc/issues/101#issuecomment-249096119
> > > > TimestampConverter:
> > > > https://groups.google.com/d/msg/confluent-platform/gGAOsw3Qeu4/8JCqdDhGBwAJ
> > > > ValueToKey:
> > > > https://github.com/confluentinc/kafka-connect-jdbc/pull/166
> > > >
> > > > In other cases, their functionality is already being implemented by
> > > > connectors in divergent ways: RegexRouter, Insert, Replace, HoistToStruct,
> > > > ExtractFromStruct
> > > >
> > > > On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <gw...@confluent.io>
> > wrote:
> > > >
> > > > I'm a bit concerned about adding transformations in Kafka. NiFi has 150
> > > > processors, presumably they are all useful for someone. I don't know if I'd
> > > > want all of that in Apache Kafka. What's the downside of keeping it out? Or
> > > > at least keeping the built-in set super minimal (Flume has like 3 built-in
> > > > interceptors)?
> > > >
> > > > Gwen
> > > >
> > > > On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <
> shikhar@confluent.io
> > >
> > > > wrote:
> > > >
> > > > > With regard to a), just using `ConnectRecord` with `newRecord` as a
> > new
> > > > > abstract method would be a fine choice. In prototyping, both
> options
> > > end
> > > > up
> > > > > looking pretty similar (in terms of how transformations are
> > implemented
> > > > and
> > > > > the runtime initializes and uses them) and I'm starting to lean
> > towards
> > > > not
> > > > > adding a new interface into the mix.
> > > > >
> > > > > On b) I think we should include a small set of useful
> transformations
> > > > with
> > > > > Connect, since they can be applicable across different connectors
> and
> > > we
> > > > > should encourage some standardization for common operations. I'll
> > > update
> > > > > KIP-66 soon including a spec of transformations that I believe are
> > > worth
> > > > > including.
> > > > >
> > > > > On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <
> > > > ewen@confluent.io>
> > > > > wrote:
> > > > >
> > > > > If anyone has time to review here, it'd be great to get feedback.
> I'd
> > > > > imagine that the proposal itself won't be too controversial --
> keeps
> > > > > transformations simple (by only allowing map/filter), doesn't
> affect
> > > the
> > > > > rest of the framework much, and fits in with general config
> structure
> > > > we've
> > > > > used elsewhere (although ConfigDef could use some updates to make
> > this
> > > > > easier...).
> > > > >
> > > > > I think the main open questions for me are:
> > > > >
> > > > > a) Is TransformableRecord worth it to avoid reimplementing small
> bits
> > > of
> > > > > code (it allows for a single implementation of the interface to
> > > trivially
> > > > > apply to both Source and SinkRecords). I think I prefer this, but
> it
> > > does
> > > > > come with some commitment to another interface on top of
> > ConnectRecord.
> > > > We
> > > > > could alternatively modify ConnectRecord which would require fewer
> > > > changes.
> > > > > b) How do folks feel about built-in transformations and the set
> that
> > > are
> > > > > mentioned here? This brings us way back to the discussion of
> built-in
> > > > > connectors. Transformations, especially when intended to be
> > lightweight
> > > > and
> > > > > touch nothing besides the data already in the record, seem
> different
> > > from
> > > > > connectors -- there might be quite a few, but hopefully limited.
> > Since
> > > we
> > > > > (hopefully) already factor out most serialization-specific stuff
> via
> > > > > Converters, I think we can keep this pretty limited. That said, I
> > have
> > > no
> > > > > doubt some folks will (in my opinion) abuse this feature to do data
> > > > > enrichment by querying external systems, so building a bunch of
> > > > > transformations in could potentially open the floodgates, or at
> least
> > > > make
> > > > > decisions about what is included vs what should be 3rd party muddy.
> > > > >
> > > > > -Ewen
> > > > >
> > > > >
> > > > > On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <
> > shikhar@confluent.io
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I have another iteration at a proposal for this feature here:
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/
> > > > > > Connect+Transforms+-+Proposed+Design
> > > > > >
> > > > > > I'd welcome your feedback and comments.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Shikhar
> > > > > >
> > > > > > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <
> > > > ewen@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <
> > > > shikhar@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Hmm, operating on ConnectRecords probably doesn't work since
> > you
> > > > need
> > > > > > to
> > > > > > > > emit the right type of record, which might mean instantiating
> a
> > > new
> > > > > > one.
> > > > > > > I
> > > > > > > > think that means we either need 2 methods, one for
> > SourceRecord,
> > > > one
> > > > > > for
> > > > > > > > SinkRecord, or we'd need to limit what parts of the message
> you
> > > can
> > > > > > > modify
> > > > > > > > (e.g. you can change the key/value via something like
> > > > > > > > transformKey(ConnectRecord) and
> transformValue(ConnectRecord),
> > > but
> > > > > > other
> > > > > > > > fields would remain the same and the fmwk would handle
> > allocating
> > > > new
> > > > > > > > Source/SinkRecords if needed)
> > > > > > > >
> > > > > > >
> > > > > > > Good point, perhaps we could add an abstract method on
> > > ConnectRecord
> > > > > that
> > > > > > > takes all the shared fields as parameters and the
> implementations
> > > > > return
> > > > > > a
> > > > > > > copy of the narrower SourceRecord/SinkRecord type as
> appropriate.
> > > > > > > Transformers would only operate on ConnectRecord rather than
> > caring
> > > > > about
> > > > > > > SourceRecord or SinkRecord (in theory they could
> instanceof/cast,
> > > but
> > > > > the
> > > > > > > API should discourage it)
> > > > > > >
> > > > > > >
> > > > > > > > Is there a use case for hanging on to the original? I can't
> > think
> > > > of
> > > > > a
> > > > > > > > transformation where you'd need to do that (or couldn't just
> > > order
> > > > > > things
> > > > > > > > differently so it isn't a problem).
> > > > > > >
> > > > > > >
> > > > > > > Yeah maybe this isn't really necessary. No strong preference
> > here.
> > > > > > >
> > > > > > > That said, I do worry a bit that farming too much stuff out to
> > > > > > transformers
> > > > > > > > can result in "programming via config", i.e. a lot of the
> > > > simplicity
> > > > > > you
> > > > > > > > get from Connect disappears in long config files.
> > Standardization
> > > > > would
> > > > > > > be
> > > > > > > > nice and might just avoid this (and doesn't cost that much
> > > > > implementing
> > > > > > > it
> > > > > > > > in each connector), and I'd personally prefer something a bit
> > > less
> > > > > > > flexible
> > > > > > > > but consistent and easy to configure.
> > > > > > >
> > > > > > >
> > > > > > > Not sure what you're suggesting :-) Standardized config properties for
> > > > > > > a small set of transformations, leaving it up to connectors to integrate?
> > > > > > >
> > > > > >
> > > > > > I just mean that you get to the point where you're practically writing a
> > > > > > Kafka Streams application, you're just doing it through either an
> > > > > > incredibly convoluted set of transformers and configs, or a single
> > > > > > transformer with an incredibly convoluted set of configs. You basically
> > > > > > get to the point where your config is a mini DSL and you're not really
> > > > > > saving that much.
> > > > > >
> > > > > > The real question is how much we want to venture into the "T"
> part
> > of
> > > > > ETL.
> > > > > > I tend to favor minimizing how much we take on since the rest of
> > > > Connect
> > > > > > isn't designed for it, it's designed around the E & L parts.
> > > > > >
> > > > > > -Ewen
> > > > > >
> > > > > >
> > > > > > > > Personally I'm skeptical of that level of flexibility in transformers --
> > > > > > > > it's getting awfully complex and certainly takes us pretty far from
> > > > > > > > "config only" realtime data integration. It's not clear to me what the
> > > > > > > > use cases are that aren't covered by a small set of common
> > > > > > > > transformations that can be chained together (e.g. rename/remove
> > > > > > > > fields, mask values, and maybe a couple more).
> > > > > > > >
> > > > > > >
> > > > > > > I agree that we should have some standard transformations that
> we
> > > > ship
> > > > > > with
> > > > > > > connect that users would ideally lean towards for routine
> tasks.
> > > The
> > > > > ones
> > > > > > > you mention are some good candidates where I'd imagine can
> expose
> > > > > simple
> > > > > > > config, e.g.
> > > > > > >    transform.filter.whitelist=x,y,z # filter to a whitelist of fields
> > > > > > >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > > > > >    topic.rename.replace=-/_
> > > > > > >    topic.rename.prefix=kafka_
> > > > > > > etc..
> > > > > > >
> > > > > > > However the ecosystem will invariably have more complex
> > > transformers
> > > > if
> > > > > > we
> > > > > > > make this pluggable. And because ETL is messy, that's probably
> a
> > > good
> > > > > > thing
> > > > > > > if folks are able to do their data munging orthogonally to
> > > > connectors,
> > > > > so
> > > > > > > that connectors can focus on the logic of how data should be
> > copied
> > > > > > from/to
> > > > > > > datastores and Kafka.
> > > > > > >
> > > > > > >
> > > > > > > > In any case, we'd probably also have to change configs of
> > > > connectors
> > > > > if
> > > > > > > we
> > > > > > > > allowed configs like that since presumably transformer
> configs
> > > will
> > > > > > just
> > > > > > > be
> > > > > > > > part of the connector config.
> > > > > > > >
> > > > > > >
> > > > > > > Yeah, haven't thought much about how all the configuration
> would
> > > tie
> > > > > > > together...
> > > > > > >
> > > > > > > I think we'd need the ability to:
> > > > > > > - spec transformer chain (fully-qualified class names? perhaps
> > > > special
> > > > > > > aliases for built-in ones? perhaps third-party fqcns can be
> > > assigned
> > > > > > > aliases by users in the chain spec, for easier configuration
> and
> > to
> > > > > > > uniquely identify a transformation when it occurs more than one
> > > time
> > > > in
> > > > > a
> > > > > > > chain?)
> > > > > > > - configure each transformer -- all properties prefixed with
> that
> > > > > > > transformer's ID (fqcn / alias) get destined to it
> > > > > > >
> > > > > > > Additionally, I think we would probably want to allow for
> > > > > topic-specific
> > > > > > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962>
> > (e.g.
> > > > you
> > > > > > > want
> > > > > > > certain transformations for one topic, but different ones for
> > > > > another...)
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks,
> > > > > > Ewen
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *Gwen Shapira*
> > > > Product Manager | Confluent
> > > > 650.450.2760 | @gwenshap
> > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > > <http://www.confluent.io/blog>
> > > >
> > >
> > >
> > >
> > > --
> > > *Gwen Shapira*
> > > Product Manager | Confluent
> > > 650.450.2760 | @gwenshap
> > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > <http://www.confluent.io/blog>
> > >
> >
>
>
>
> --
> *Gwen Shapira*
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> <http://www.confluent.io/blog>
>
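
For readers skimming the quoted thread above: the `newRecord` idea (an abstract
method on ConnectRecord that takes the shared fields and returns a copy of the
narrower SourceRecord/SinkRecord type) might look roughly like the Java sketch
below. The classes are simplified, hypothetical stand-ins written for this
illustration, not the real Kafka Connect classes; the trimmed field sets, the
three-argument `newRecord` signature, and the `TopicPrefixer` transformation
are all assumptions.

```java
import java.util.Map;

// Hypothetical, simplified stand-in for the shared record base type.
abstract class ConnectRecord {
    final String topic;
    final Object key;
    final Object value;

    ConnectRecord(String topic, Object key, Object value) {
        this.topic = topic;
        this.key = key;
        this.value = value;
    }

    // Takes the shared fields; each subclass returns a copy of its own
    // concrete type and carries over its type-specific fields itself.
    abstract ConnectRecord newRecord(String topic, Object key, Object value);
}

// Stand-in for a source-side record with a source-specific field.
final class SourceRecord extends ConnectRecord {
    final Map<String, ?> sourceOffset;

    SourceRecord(Map<String, ?> sourceOffset, String topic, Object key, Object value) {
        super(topic, key, value);
        this.sourceOffset = sourceOffset;
    }

    @Override
    SourceRecord newRecord(String topic, Object key, Object value) {
        return new SourceRecord(sourceOffset, topic, key, value);
    }
}

// Stand-in for a sink-side record with a sink-specific field.
final class SinkRecord extends ConnectRecord {
    final long kafkaOffset;

    SinkRecord(String topic, Object key, Object value, long kafkaOffset) {
        super(topic, key, value);
        this.kafkaOffset = kafkaOffset;
    }

    @Override
    SinkRecord newRecord(String topic, Object key, Object value) {
        return new SinkRecord(topic, key, value, kafkaOffset);
    }
}

// Hypothetical transformation: it only sees ConnectRecord, never casts.
final class TopicPrefixer {
    private final String prefix;

    TopicPrefixer(String prefix) {
        this.prefix = prefix;
    }

    ConnectRecord apply(ConnectRecord record) {
        return record.newRecord(prefix + record.topic, record.key, record.value);
    }
}

class NewRecordSketch {
    public static void main(String[] args) {
        ConnectRecord out = new TopicPrefixer("kafka_")
                .apply(new SinkRecord("orders", "k", "v", 42L));
        System.out.println(out.getClass().getSimpleName() + " " + out.topic);
    }
}
```

In this sketch `TopicPrefixer` never checks or casts the record type, yet a
SinkRecord goes in and a SinkRecord comes out with its offset intact, which is
the property the thread is after.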

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Shikhar Bhushan <sh...@confluent.io>.
There is no decision being proposed on the final list of transformations
that will ever be in Kafka :-) Just the initial set we should roll with.

On Thu, Dec 15, 2016 at 3:34 PM Gwen Shapira <gw...@confluent.io> wrote:

You are absolutely right that the vast majority of NiFi's processors are
not what we would consider SMT.

I went over the list and I think they still contain just short of 50 legit
SMTs:
https://cwiki.apache.org/confluence/display/KAFKA/Analyzing+NiFi+Transformations

You are right that ExtractHL7 is an extreme that clearly doesn't belong in
Apache Kafka, but just before that we have ExtractAvroMetadata that may
fit? and ExtractEmailHeaders doesn't sound totally outlandish either...

Nothing in the baked-in list by Shikhar looks out of place. I am concerned
about the slippery slope. Or the arbitrariness of the decision if we say that
this list is final and nothing else will ever make it into Kafka.

Gwen

On Thu, Dec 15, 2016 at 3:00 PM, Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

> I think there are a couple of factors that make transformations and
> connectors different.
>
> First, NiFi's 150 processors is a bit misleading. In NiFi, processors cover
> data sources, data sinks, serialization/deserialization, *and*
> transformations. I haven't filtered the list to see how many fall into the
> first 3 categories, but it's a *lot* of the processors they have.
>
> Second, since transformations only apply to a single message and I'd think
> they generally shouldn't be interacting with external services (i.e. I
> think trying to do enrichment in SMT is probably a bad idea), the scope of
> possible transformations is reasonably limited and the transformations
> themselves tend to be small and easily maintainable. I think this is a
> dramatic difference from connectors, which are each substantial projects in
> their own right.
>
> While I get the slippery slope argument re: including specific
> transformations, I think we can come up with a reasonable policy (and via
> KIPs we can, as a community, come to an agreement based purely on taste if
> it comes down to that). In particular, I'd say keep the core general (i.e.
> no domain-specific transformations/parsing like HL7), pure data
> manipulation (i.e. no enrichment), and nothing that could just as well be
> done as a converter/serializer/deserializer/source connector/sink
> connector.
>
> I was very staunchly against including connectors (aside from a simple
> example) directly in Kafka, so this may seem like a reversal of position.
> But I think the % of use cases covered will look very different between
> connectors and transformations. Sure, some connectors are very popular, and
> moreso right now because they are the most thoroughly developed, tested,
> etc. But the top 3 most common transformations will probably be used across
> all the top 20 most popular connectors. I have no doubt people will end up
> writing custom ones (which is why it's nice to make them pluggable rather
> than choosing a fixed set), but they'll either be very niche (like people
> write custom connectors for their internal systems) or be more broadly
> applicable but very domain specific such that they are easy to reject for
> inclusion.
>
> @Gwen if we filtered the list of NiFi processors to ones that fit that
> criteria, would that still be too long a list for your taste? Similarly,
> let's say we were going to include some baked in; in that case, does
> anything look out of place to you in the list Shikhar has included in the
> KIP?
>
> -Ewen
>
> On Thu, Dec 15, 2016 at 2:01 PM, Gwen Shapira <gw...@confluent.io> wrote:
>
> > I agree about the ease of use in adding a small-subset of built-in
> > transformations.
> >
> > But the same thing is true for connectors - there are maybe 5 super popular
> > OSS connectors and the rest is a very long tail. We drew the line at not
> > adding any, because that's the easiest and because we did not want to turn
> > Kafka into a collection of transformations.
> >
> > I really don't want to end up with 135 (or even 20) transformations in
> > Kafka. So either we have a super-clear definition of what belongs and what
> > doesn't - or we put in one minimal example and the rest goes into the
> > ecosystem.
> >
> > We can also start by putting transformations on github and just see if
> > there is huge demand for them in Apache. It is easier to add stuff to the
> > project later than to remove functionality.
> >
> >
> >
> > On Thu, Dec 15, 2016 at 11:59 AM, Shikhar Bhushan <sh...@confluent.io>
> > wrote:
> >
> > > I have updated KIP-66
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> > > with the changes I proposed in the design.
> > >
> > > Gwen, I think the main downside to not including some transformations with
> > > Kafka Connect is that it seems less user friendly if folks have to make
> > > sure to have the right transformation(s) on the classpath as well, besides
> > > their connector(s). Additionally by going in with a small set included, we
> > > can encourage a consistent configuration and implementation style and
> > > provide utilities for e.g. data transformations, which I expect we will
> > > definitely need (discussed under 'Patterns for data transformations').
> > >
> > > It does get hard to draw the line once you go from 'none' to 'some'. To get
> > > discussion going, if we get agreement on 'none' vs 'some', I added a table
> > > under 'Bundled transformations' for transformations which I think are worth
> > > including.
> > >
> > > For many of these, I have noticed their absence in the wild as a pain point --
> > > TimestampRouter:
> > > https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
> > > Mask:
> > > https://groups.google.com/d/msg/confluent-platform/3yHb8_mCReQ/sTQc3dNgBwAJ
> > > Insert:
> > > http://stackoverflow.com/questions/40664745/elasticsearch-connector-for-kafka-connect-offset-and-timestamp
> > > RegexRouter:
> > > https://groups.google.com/d/msg/confluent-platform/yEBwu1rGcs0/gIAhRp6kBwAJ
> > > NumericCast:
> > > https://github.com/confluentinc/kafka-connect-jdbc/issues/101#issuecomment-249096119
> > > TimestampConverter:
> > > https://groups.google.com/d/msg/confluent-platform/gGAOsw3Qeu4/8JCqdDhGBwAJ
> > > ValueToKey:
> > > https://github.com/confluentinc/kafka-connect-jdbc/pull/166
> > >
> > > In other cases, their functionality is already being implemented by
> > > connectors in divergent ways: RegexRouter, Insert, Replace, HoistToStruct,
> > > ExtractFromStruct
> > >
> > > On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <gw...@confluent.io>
> wrote:
> > >
> > > I'm a bit concerned about adding transformations in Kafka. NiFi has 150
> > > processors, presumably they are all useful for someone. I don't know if I'd
> > > want all of that in Apache Kafka. What's the downside of keeping it out? Or
> > > at least keeping the built-in set super minimal (Flume has like 3 built-in
> > > interceptors)?
> > >
> > > Gwen
> > >
> > > On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <shikhar@confluent.io
> >
> > > wrote:
> > >
> > > > With regard to a), just using `ConnectRecord` with `newRecord` as a
> new
> > > > abstract method would be a fine choice. In prototyping, both options
> > end
> > > up
> > > > looking pretty similar (in terms of how transformations are
> implemented
> > > and
> > > > the runtime initializes and uses them) and I'm starting to lean
> towards
> > > not
> > > > adding a new interface into the mix.
> > > >
> > > > On b) I think we should include a small set of useful
transformations
> > > with
> > > > Connect, since they can be applicable across different connectors
and
> > we
> > > > should encourage some standardization for common operations. I'll
> > update
> > > > KIP-66 soon including a spec of transformations that I believe are
> > worth
> > > > including.
> > > >
> > > > On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <
> > > ewen@confluent.io>
> > > > wrote:
> > > >
> > > > If anyone has time to review here, it'd be great to get feedback.
I'd
> > > > imagine that the proposal itself won't be too controversial -- keeps
> > > > transformations simple (by only allowing map/filter), doesn't affect
> > the
> > > > rest of the framework much, and fits in with general config
structure
> > > we've
> > > > used elsewhere (although ConfigDef could use some updates to make
> this
> > > > easier...).
> > > >
> > > > I think the main open questions for me are:
> > > >
> > > > a) Is TransformableRecord worth it to avoid reimplementing small
bits
> > of
> > > > code (it allows for a single implementation of the interface to
> > trivially
> > > > apply to both Source and SinkRecords). I think I prefer this, but it
> > does
> > > > come with some commitment to another interface on top of
> ConnectRecord.
> > > We
> > > > could alternatively modify ConnectRecord which would require fewer
> > > changes.
> > > > b) How do folks feel about built-in transformations and the set that
> > are
> > > > mentioned here? This brings us way back to the discussion of
built-in
> > > > connectors. Transformations, especially when intended to be
> lightweight
> > > and
> > > > touch nothing besides the data already in the record, seem different
> > from
> > > > connectors -- there might be quite a few, but hopefully limited.
> Since
> > we
> > > > (hopefully) already factor out most serialization-specific stuff via
> > > > Converters, I think we can keep this pretty limited. That said, I
> have
> > no
> > > > doubt some folks will (in my opinion) abuse this feature to do data
> > > > enrichment by querying external systems, so building a bunch of
> > > > transformations in could potentially open the floodgates, or at
least
> > > make
> > > > decisions about what is included vs what should be 3rd party muddy.
> > > >
> > > > -Ewen
> > > >
> > > >
> > > > On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <
> shikhar@confluent.io
> > >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I have another iteration at a proposal for this feature here:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/
> > > > > Connect+Transforms+-+Proposed+Design
> > > > >
> > > > > I'd welcome your feedback and comments.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Shikhar
> > > > >
> > > > > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <
> > > ewen@confluent.io>
> > > > > wrote:
> > > > >
> > > > > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <
> > > shikhar@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > Hmm, operating on ConnectRecords probably doesn't work since
> you
> > > need
> > > > > to
> > > > > > > emit the right type of record, which might mean instantiating
a
> > new
> > > > > one.
> > > > > > I
> > > > > > > think that means we either need 2 methods, one for
> SourceRecord,
> > > one
> > > > > for
> > > > > > > SinkRecord, or we'd need to limit what parts of the message
you
> > can
> > > > > > modify
> > > > > > > (e.g. you can change the key/value via something like
> > > > > > > transformKey(ConnectRecord) and transformValue(ConnectRecord),
> > but
> > > > > other
> > > > > > > fields would remain the same and the fmwk would handle
> allocating
> > > new
> > > > > > > Source/SinkRecords if needed)
> > > > > > >
> > > > > >
> > > > > > Good point, perhaps we could add an abstract method on
> > ConnectRecord
> > > > that
> > > > > > takes all the shared fields as parameters and the
implementations
> > > > return
> > > > > a
> > > > > > copy of the narrower SourceRecord/SinkRecord type as
appropriate.
> > > > > > Transformers would only operate on ConnectRecord rather than
> caring
> > > > about
> > > > > > SourceRecord or SinkRecord (in theory they could
instanceof/cast,
> > but
> > > > the
> > > > > > API should discourage it)
> > > > > >
> > > > > >
> > > > > > > Is there a use case for hanging on to the original? I can't
> think
> > > of
> > > > a
> > > > > > > transformation where you'd need to do that (or couldn't just
> > order
> > > > > things
> > > > > > > differently so it isn't a problem).
> > > > > >
> > > > > >
> > > > > > Yeah maybe this isn't really necessary. No strong preference
> here.
> > > > > >
> > > > > > That said, I do worry a bit that farming too much stuff out to
> > > > > transformers
> > > > > > > can result in "programming via config", i.e. a lot of the
> > > simplicity
> > > > > you
> > > > > > > get from Connect disappears in long config files.
> Standardization
> > > > would
> > > > > > be
> > > > > > > nice and might just avoid this (and doesn't cost that much
> > > > implementing
> > > > > > it
> > > > > > > in each connector), and I'd personally prefer something a bit
> > less
> > > > > > flexible
> > > > > > > but consistent and easy to configure.
> > > > > >
> > > > > >
> > > > > > Not sure what you're suggesting :-) Standardized config properties for
> > > > > > a small set of transformations, leaving it up to connectors to integrate?
> > > > > >
> > > > >
> > > > > I just mean that you get to the point where you're practically writing a
> > > > > Kafka Streams application, you're just doing it through either an
> > > > > incredibly convoluted set of transformers and configs, or a single
> > > > > transformer with an incredibly convoluted set of configs. You basically
> > > > > get to the point where your config is a mini DSL and you're not really
> > > > > saving that much.
> > > > >
> > > > > The real question is how much we want to venture into the "T" part
> of
> > > > ETL.
> > > > > I tend to favor minimizing how much we take on since the rest of
> > > Connect
> > > > > isn't designed for it, it's designed around the E & L parts.
> > > > >
> > > > > -Ewen
> > > > >
> > > > >
> > > > > > > Personally I'm skeptical of that level of flexibility in transformers --
> > > > > > > it's getting awfully complex and certainly takes us pretty far from
> > > > > > > "config only" realtime data integration. It's not clear to me what the
> > > > > > > use cases are that aren't covered by a small set of common
> > > > > > > transformations that can be chained together (e.g. rename/remove
> > > > > > > fields, mask values, and maybe a couple more).
> > > > > > >
> > > > > >
> > > > > > I agree that we should have some standard transformations that
we
> > > ship
> > > > > with
> > > > > > connect that users would ideally lean towards for routine tasks.
> > The
> > > > ones
> > > > > > you mention are some good candidates where I'd imagine can
expose
> > > > simple
> > > > > > config, e.g.
> > > > > >    transform.filter.whitelist=x,y,z # filter to a whitelist of fields
> > > > > >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > > > >    topic.rename.replace=-/_
> > > > > >    topic.rename.prefix=kafka_
> > > > > > etc..
> > > > > >
> > > > > > However the ecosystem will invariably have more complex
> > transformers
> > > if
> > > > > we
> > > > > > make this pluggable. And because ETL is messy, that's probably a
> > good
> > > > > thing
> > > > > > if folks are able to do their data munging orthogonally to
> > > connectors,
> > > > so
> > > > > > that connectors can focus on the logic of how data should be
> copied
> > > > > from/to
> > > > > > datastores and Kafka.
> > > > > >
> > > > > >
> > > > > > > In any case, we'd probably also have to change configs of
> > > connectors
> > > > if
> > > > > > we
> > > > > > > allowed configs like that since presumably transformer configs
> > will
> > > > > just
> > > > > > be
> > > > > > > part of the connector config.
> > > > > > >
> > > > > >
> > > > > > Yeah, haven't thought much about how all the configuration would
> > tie
> > > > > > together...
> > > > > >
> > > > > > I think we'd need the ability to:
> > > > > > - spec transformer chain (fully-qualified class names? perhaps
> > > special
> > > > > > aliases for built-in ones? perhaps third-party fqcns can be
> > assigned
> > > > > > aliases by users in the chain spec, for easier configuration and
> to
> > > > > > uniquely identify a transformation when it occurs more than one
> > time
> > > in
> > > > a
> > > > > > chain?)
> > > > > > - configure each transformer -- all properties prefixed with
that
> > > > > > transformer's ID (fqcn / alias) get destined to it
> > > > > >
> > > > > > Additionally, I think we would probably want to allow for
> > > > topic-specific
> > > > > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962>
> (e.g.
> > > you
> > > > > > want
> > > > > > certain transformations for one topic, but different ones for
> > > > another...)
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Ewen
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > *Gwen Shapira*
> > > Product Manager | Confluent
> > > 650.450.2760 | @gwenshap
> > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > <http://www.confluent.io/blog>
> > >
> >
> >
> >
> > --
> > *Gwen Shapira*
> > Product Manager | Confluent
> > 650.450.2760 | @gwenshap
> > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > <http://www.confluent.io/blog>
> >
>



--
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
<http://www.confluent.io/blog>
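
The configuration idea in the quoted thread above (spec a transformer chain
using aliases, then hand every property prefixed with a transformer's alias to
that transformer) can be sketched in Java roughly as follows. The `transforms`
key, the `transforms.<alias>.` prefix convention, and the `com.example` class
names are hypothetical choices made for this sketch, not something the thread
settles on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TransformChainConfig {
    // Hypothetical key naming the ordered, comma-separated chain of aliases.
    static final String CHAIN_KEY = "transforms";

    // Splits a flat connector config into one sub-config per alias,
    // preserving the order in which aliases appear in the chain.
    static Map<String, Map<String, String>> split(Map<String, String> connectorConfig) {
        Map<String, Map<String, String>> perAlias = new LinkedHashMap<>();
        String chain = connectorConfig.getOrDefault(CHAIN_KEY, "");
        for (String raw : chain.split(",")) {
            String alias = raw.trim();
            if (alias.isEmpty()) {
                continue; // tolerate an empty chain or stray commas
            }
            String prefix = CHAIN_KEY + "." + alias + ".";
            Map<String, String> own = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : connectorConfig.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    // Strip the prefix so the transformer sees only its own keys.
                    own.put(e.getKey().substring(prefix.length()), e.getValue());
                }
            }
            perAlias.put(alias, own);
        }
        return perAlias;
    }

    public static void main(String[] args) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("name", "my-connector");
        config.put("transforms", "mask, route");
        config.put("transforms.mask.type", "com.example.Mask");
        config.put("transforms.mask.fields", "ssn");
        config.put("transforms.route.type", "com.example.RegexRouter");
        System.out.println(split(config));
    }
}
```

Using aliases rather than class names as the prefix is what would let the same
transformation appear more than once in a chain with different settings, which
is one of the concerns raised in the thread.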

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Gwen Shapira <gw...@confluent.io>.
You are absolutely right that the vast majority of NiFi's processors are
not what we would consider SMT.

I went over the list and I think they still contain just short of 50 legit
SMTs:
https://cwiki.apache.org/confluence/display/KAFKA/Analyzing+NiFi+Transformations

You are right that ExtractHL7 is an extreme that clearly doesn't belong in
Apache Kafka, but just before that we have ExtractAvroMetadata, which may
fit, and ExtractEmailHeaders doesn't sound totally outlandish either...

Nothing in the baked-in list by Shikhar looks out of place. I am concerned
about the slippery slope, or about the arbitrariness of the decision if we
say that this list is final and nothing else will ever make it into Kafka.

Gwen

On Thu, Dec 15, 2016 at 3:00 PM, Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

> I think there are a couple of factors that make transformations and
> connectors different.
>
> First, NiFi's 150 processors is a bit misleading. In NiFi, processors cover
> data sources, data sinks, serialization/deserialization, *and*
> transformations. I haven't filtered the list to see how many fall into the
> first 3 categories, but it's a *lot* of the processors they have.
>
> Second, since transformations only apply to a single message and I'd think
> they generally shouldn't be interacting with external services (i.e. I
> think trying to do enrichment in SMT is probably a bad idea), the scope of
> possible transformations is reasonably limited and the transformations
> themselves tend to be small and easily maintainable. I think this is a
> dramatic difference from connectors, which are each substantial projects in
> their own right.
>
> While I get the slippery slope argument re: including specific
> transformations, I think we can come up with a reasonable policy (and via
> KIPs we can, as a community, come to an agreement based purely on taste if
> it comes down to that). In particular, I'd say keep the core general (i.e.
> no domain-specific transformations/parsing like HL7), pure data
> manipulation (i.e. no enrichment), and nothing that could just as well be
> done as a converter/serializer/deserializer/source connector/sink
> connector.
>
> I was very staunchly against including connectors (aside from a simple
> example) directly in Kafka, so this may seem like a reversal of position.
> But I think the % of use cases covered will look very different between
> connectors and transformations. Sure, some connectors are very popular, and
> more so right now because they are the most thoroughly developed, tested,
> etc. But the top 3 most common transformations will probably be used across
> all the top 20 most popular connectors. I have no doubt people will end up
> writing custom ones (which is why it's nice to make them pluggable rather
> than choosing a fixed set), but they'll either be very niche (like people
> write custom connectors for their internal systems) or be more broadly
> applicable but very domain specific such that they are easy to reject for
> inclusion.
>
> @Gwen if we filtered the list of NiFi processors to ones that fit those
> criteria, would that still be too long a list for your taste? Similarly,
> let's say we were going to include some baked in; in that case, does
> anything look out of place to you in the list Shikhar has included in the
> KIP?
>
> -Ewen
>
> On Thu, Dec 15, 2016 at 2:01 PM, Gwen Shapira <gw...@confluent.io> wrote:
>
> > I agree about the ease of use in adding a small-subset of built-in
> > transformations.
> >
> > But the same thing is true for connectors - there are maybe 5 super
> popular
> > OSS connectors and the rest is a very long tail. We drew the line at not
> > adding any, because thats the easiest and because we did not want to turn
> > Kafka into a collection of transformations.
> >
> > I really don't want to end up with 135 (or even 20) transformations in
> > Kafka. So either we have a super-clear definition of what belongs and
> what
> > doesn't - or we put in one minimal example and the rest goes into the
> > ecosystem.
> >
> > We can also start by putting transformations on github and just see if
> > there is huge demand for them in Apache. It is easier to add stuff to the
> > project later than to remove functionality.
> >
> >
> >
> > On Thu, Dec 15, 2016 at 11:59 AM, Shikhar Bhushan <sh...@confluent.io>
> > wrote:
> >
> > > I have updated KIP-66
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 66%3A+Single+Message+Transforms+for+Kafka+Connect
> > > with
> > > the changes I proposed in the design.
> > >
> > > Gwen, I think the main downside to not including some transformations
> > with
> > > Kafka Connect is that it seems less user friendly if folks have to make
> > > sure to have the right transformation(s) on the classpath as well,
> > besides
> > > their connector(s). Additionally by going in with a small set included,
> > we
> > > can encourage a consistent configuration and implementation style and
> > > provide utilities for e.g. data transformations, which I expect we will
> > > definitely need (discussed under 'Patterns for data transformations').
> > >
> > > It does get hard to draw the line once you go from 'none' to 'some'. To
> > get
> > > discussion going, if we get agreement on 'none' vs 'some', I added a
> > table
> > > under 'Bundled transformations' for transformations which I think are
> > worth
> > > including.
> > >
> > > For many of these, I have noticed their absence in the wild as a pain
> > point
> > > --
> > > TimestampRouter:
> > > https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
> > > Mask:
> > > https://groups.google.com/d/msg/confluent-platform/3yHb8_
> > > mCReQ/sTQc3dNgBwAJ
> > > Insert:
> > > http://stackoverflow.com/questions/40664745/
> elasticsearch-connector-for-
> > > kafka-connect-offset-and-timestamp
> > > RegexRouter:
> > > https://groups.google.com/d/msg/confluent-platform/
> > > yEBwu1rGcs0/gIAhRp6kBwAJ
> > > NumericCast:
> > > https://github.com/confluentinc/kafka-connect-
> > > jdbc/issues/101#issuecomment-249096119
> > > TimestampConverter:
> > > https://groups.google.com/d/msg/confluent-platform/
> > > gGAOsw3Qeu4/8JCqdDhGBwAJ
> > > ValueToKey: https://github.com/confluentinc/kafka-connect-
> jdbc/pull/166
> > >
> > > In other cases, their functionality is already being implemented by
> > > connectors in divergent ways: RegexRouter, Insert, Replace,
> > HoistToStruct,
> > > ExtractFromStruct
> > >
> > > On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <gw...@confluent.io>
> wrote:
> > >
> > > I'm a bit concerned about adding transformations in Kafka. NiFi has 150
> > > processors, presumably they are all useful for someone. I don't know if
> > I'd
> > > want all of that in Apache Kafka. What's the downside of keeping it
> out?
> > Or
> > > at least keeping the built-in set super minimal (Flume has like 3
> > built-in
> > > interceptors)?
> > >
> > > Gwen
> > >
> > > On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <shikhar@confluent.io
> >
> > > wrote:
> > >
> > > > With regard to a), just using `ConnectRecord` with `newRecord` as a
> new
> > > > abstract method would be a fine choice. In prototyping, both options
> > end
> > > up
> > > > looking pretty similar (in terms of how transformations are
> implemented
> > > and
> > > > the runtime initializes and uses them) and I'm starting to lean
> towards
> > > not
> > > > adding a new interface into the mix.
> > > >
> > > > On b) I think we should include a small set of useful transformations
> > > with
> > > > Connect, since they can be applicable across different connectors and
> > we
> > > > should encourage some standardization for common operations. I'll
> > update
> > > > KIP-66 soon including a spec of transformations that I believe are
> > worth
> > > > including.
> > > >
> > > > On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <
> > > ewen@confluent.io>
> > > > wrote:
> > > >
> > > > If anyone has time to review here, it'd be great to get feedback. I'd
> > > > imagine that the proposal itself won't be too controversial -- keeps
> > > > transformations simple (by only allowing map/filter), doesn't affect
> > the
> > > > rest of the framework much, and fits in with general config structure
> > > we've
> > > > used elsewhere (although ConfigDef could use some updates to make
> this
> > > > easier...).
> > > >
> > > > I think the main open questions for me are:
> > > >
> > > > a) Is TransformableRecord worth it to avoid reimplementing small bits
> > of
> > > > code (it allows for a single implementation of the interface to
> > trivially
> > > > apply to both Source and SinkRecords). I think I prefer this, but it
> > does
> > > > come with some commitment to another interface on top of
> ConnectRecord.
> > > We
> > > > could alternatively modify ConnectRecord which would require fewer
> > > changes.
> > > > b) How do folks feel about built-in transformations and the set that
> > are
> > > > mentioned here? This brings us way back to the discussion of built-in
> > > > connectors. Transformations, especially when intended to be
> lightweight
> > > and
> > > > touch nothing besides the data already in the record, seem different
> > from
> > > > connectors -- there might be quite a few, but hopefully limited.
> Since
> > we
> > > > (hopefully) already factor out most serialization-specific stuff via
> > > > Converters, I think we can keep this pretty limited. That said, I
> have
> > no
> > > > doubt some folks will (in my opinion) abuse this feature to do data
> > > > enrichment by querying external systems, so building a bunch of
> > > > transformations in could potentially open the floodgates, or at least
> > > make
> > > > decisions about what is included vs what should be 3rd party muddy.
> > > >
> > > > -Ewen
> > > >
> > > >
> > > > On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <
> shikhar@confluent.io
> > >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I have another iteration at a proposal for this feature here:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/
> > > > > Connect+Transforms+-+Proposed+Design
> > > > >
> > > > > I'd welcome your feedback and comments.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Shikhar
> > > > >
> > > > > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <
> > > ewen@confluent.io>
> > > > > wrote:
> > > > >
> > > > > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <
> > > shikhar@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > Hmm, operating on ConnectRecords probably doesn't work since
> you
> > > need
> > > > > to
> > > > > > > emit the right type of record, which might mean instantiating a
> > new
> > > > > one.
> > > > > > I
> > > > > > > think that means we either need 2 methods, one for
> SourceRecord,
> > > one
> > > > > for
> > > > > > > SinkRecord, or we'd need to limit what parts of the message you
> > can
> > > > > > modify
> > > > > > > (e.g. you can change the key/value via something like
> > > > > > > transformKey(ConnectRecord) and transformValue(ConnectRecord),
> > but
> > > > > other
> > > > > > > fields would remain the same and the fmwk would handle
> allocating
> > > new
> > > > > > > Source/SinkRecords if needed)
> > > > > > >
> > > > > >
> > > > > > Good point, perhaps we could add an abstract method on
> > ConnectRecord
> > > > that
> > > > > > takes all the shared fields as parameters and the implementations
> > > > return
> > > > > a
> > > > > > copy of the narrower SourceRecord/SinkRecord type as appropriate.
> > > > > > Transformers would only operate on ConnectRecord rather than
> caring
> > > > about
> > > > > > SourceRecord or SinkRecord (in theory they could instanceof/cast,
> > but
> > > > the
> > > > > > API should discourage it)
> > > > > >
> > > > > >
> > > > > > > Is there a use case for hanging on to the original? I can't
> think
> > > of
> > > > a
> > > > > > > transformation where you'd need to do that (or couldn't just
> > order
> > > > > things
> > > > > > > differently so it isn't a problem).
> > > > > >
> > > > > >
> > > > > > Yeah maybe this isn't really necessary. No strong preference
> here.
> > > > > >
> > > > > > That said, I do worry a bit that farming too much stuff out to
> > > > > transformers
> > > > > > > can result in "programming via config", i.e. a lot of the
> > > simplicity
> > > > > you
> > > > > > > get from Connect disappears in long config files.
> Standardization
> > > > would
> > > > > > be
> > > > > > > nice and might just avoid this (and doesn't cost that much
> > > > implementing
> > > > > > it
> > > > > > > in each connector), and I'd personally prefer something a bit
> > less
> > > > > > flexible
> > > > > > > but consistent and easy to configure.
> > > > > >
> > > > > >
> > > > > > Not sure what the you're suggesting :-) Standardized config
> > > properties
> > > > > for
> > > > > > a small set of transformations, leaving it upto connectors to
> > > > integrate?
> > > > > >
> > > > >
> > > > > I just mean that you get to the point where you're practically
> > writing
> > > a
> > > > > Kafka Streams application, you're just doing it through either an
> > > > > incredibly convoluted set of transformers and configs, or a single
> > > > > transformer with incredibly convoluted set of configs. You
> basically
> > > get
> > > > to
> > > > > the point where you're config is a mini DSL and you're not really
> > > saving
> > > > > that much.
> > > > >
> > > > > The real question is how much we want to venture into the "T" part
> of
> > > > ETL.
> > > > > I tend to favor minimizing how much we take on since the rest of
> > > Connect
> > > > > isn't designed for it, it's designed around the E & L parts.
> > > > >
> > > > > -Ewen
> > > > >
> > > > >
> > > > > > Personally I'm skeptical of that level of flexibility in
> > transformers
> > > > --
> > > > > > > its getting awfully complex and certainly takes us pretty far
> > from
> > > > > > "config
> > > > > > > only" realtime data integration. It's not clear to me what the
> > use
> > > > > cases
> > > > > > > are that aren't covered by a small set of common
> transformations
> > > that
> > > > > can
> > > > > > > be chained together (e.g. rename/remove fields, mask values,
> and
> > > > maybe
> > > > > a
> > > > > > > couple more).
> > > > > > >
> > > > > >
> > > > > > I agree that we should have some standard transformations that we
> > > ship
> > > > > with
> > > > > > connect that users would ideally lean towards for routine tasks.
> > The
> > > > ones
> > > > > > you mention are some good candidates where I'd imagine can expose
> > > > simple
> > > > > > config, e.g.
> > > > > >    transform.filter.whitelist=x,y,z # filter to a whitelist of
> > > fields
> > > > > >    transfom.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > > > >    topic.rename.replace=-/_
> > > > > >    topic.rename.prefix=kafka_
> > > > > > etc..
> > > > > >
> > > > > > However the ecosystem will invariably have more complex
> > transformers
> > > if
> > > > > we
> > > > > > make this pluggable. And because ETL is messy, that's probably a
> > good
> > > > > thing
> > > > > > if folks are able to do their data munging orthogonally to
> > > connectors,
> > > > so
> > > > > > that connectors can focus on the logic of how data should be
> copied
> > > > > from/to
> > > > > > datastores and Kafka.
> > > > > >
> > > > > >
> > > > > > > In any case, we'd probably also have to change configs of
> > > connectors
> > > > if
> > > > > > we
> > > > > > > allowed configs like that since presumably transformer configs
> > will
> > > > > just
> > > > > > be
> > > > > > > part of the connector config.
> > > > > > >
> > > > > >
> > > > > > Yeah, haven't thought much about how all the configuration would
> > tie
> > > > > > together...
> > > > > >
> > > > > > I think we'd need the ability to:
> > > > > > - spec transformer chain (fully-qualified class names? perhaps
> > > special
> > > > > > aliases for built-in ones? perhaps third-party fqcns can be
> > assigned
> > > > > > aliases by users in the chain spec, for easier configuration and
> to
> > > > > > uniquely identify a transformation when it occurs more than one
> > time
> > > in
> > > > a
> > > > > > chain?)
> > > > > > - configure each transformer -- all properties prefixed with that
> > > > > > transformer's ID (fqcn / alias) get destined to it
> > > > > >
> > > > > > Additionally, I think we would probably want to allow for
> > > > topic-specific
> > > > > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962>
> (e.g.
> > > you
> > > > > > want
> > > > > > certain transformations for one topic, but different ones for
> > > > another...)
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Ewen
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > *Gwen Shapira*
> > > Product Manager | Confluent
> > > 650.450.2760 <(650)%20450-2760> | @gwenshap
> > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > <http://www.confluent.io/blog>
> > >
> >
> >
> >
> > --
> > *Gwen Shapira*
> > Product Manager | Confluent
> > 650.450.2760 | @gwenshap
> > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > <http://www.confluent.io/blog>
> >
>



-- 
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
<http://www.confluent.io/blog>

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
I think there are a couple of factors that make transformations and
connectors different.

First, NiFi's 150 processors is a bit misleading. In NiFi, processors cover
data sources, data sinks, serialization/deserialization, *and*
transformations. I haven't filtered the list to see how many fall into the
first 3 categories, but it's a *lot* of the processors they have.

Second, since transformations only apply to a single message and I'd think
they generally shouldn't be interacting with external services (i.e. I
think trying to do enrichment in SMT is probably a bad idea), the scope of
possible transformations is reasonably limited and the transformations
themselves tend to be small and easily maintainable. I think this is a
dramatic difference from connectors, which are each substantial projects in
their own right.
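
To make the "single message, no external services" idea concrete, a
transformation can be pictured as a pure function from one record to one
record, and a chain as plain function composition. The following is only a
rough Java sketch under that assumption: SimpleRecord, mask, prefixTopic and
applyChain are made-up names for illustration and are not part of Connect or
of the KIP.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a Connect record: a topic plus a flat value map.
final class SimpleRecord {
    final String topic;
    final Map<String, Object> value;

    SimpleRecord(String topic, Map<String, Object> value) {
        this.topic = topic;
        // Defensive copy so a transformation cannot mutate its input in place.
        this.value = Collections.unmodifiableMap(new HashMap<>(value));
    }
}

final class Transforms {
    private Transforms() {}

    // Replace the named fields with a fixed placeholder; other fields pass through.
    static UnaryOperator<SimpleRecord> mask(List<String> fields) {
        return rec -> {
            Map<String, Object> masked = new HashMap<>(rec.value);
            for (String f : fields) {
                if (masked.containsKey(f)) {
                    masked.put(f, "****");
                }
            }
            return new SimpleRecord(rec.topic, masked);
        };
    }

    // Route the record to a prefixed topic, leaving the value untouched.
    static UnaryOperator<SimpleRecord> prefixTopic(String prefix) {
        return rec -> new SimpleRecord(prefix + rec.topic, rec.value);
    }

    // Apply each transformation in order; each one sees only the current record.
    static SimpleRecord applyChain(List<UnaryOperator<SimpleRecord>> chain, SimpleRecord rec) {
        SimpleRecord current = rec;
        for (UnaryOperator<SimpleRecord> t : chain) {
            current = t.apply(current);
        }
        return current;
    }
}
```

Nothing here needs I/O, shared state or a client library, which is the point
of the comparison with connectors.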

While I get the slippery slope argument re: including specific
transformations, I think we can come up with a reasonable policy (and via
KIPs we can, as a community, come to an agreement based purely on taste if
it comes down to that). In particular, I'd say keep the core general (i.e.
no domain-specific transformations/parsing like HL7), pure data
manipulation (i.e. no enrichment), and nothing that could just as well be
done as a converter/serializer/deserializer/source connector/sink connector.

I was very staunchly against including connectors (aside from a simple
example) directly in Kafka, so this may seem like a reversal of position.
But I think the % of use cases covered will look very different between
connectors and transformations. Sure, some connectors are very popular, and
more so right now because they are the most thoroughly developed, tested,
etc. But the top 3 most common transformations will probably be used across
all the top 20 most popular connectors. I have no doubt people will end up
writing custom ones (which is why it's nice to make them pluggable rather
than choosing a fixed set), but they'll either be very niche (like people
write custom connectors for their internal systems) or be more broadly
applicable but very domain specific such that they are easy to reject for
inclusion.

@Gwen if we filtered the list of NiFi processors to ones that fit those
criteria, would that still be too long a list for your taste? Similarly,
let's say we were going to include some baked in; in that case, does
anything look out of place to you in the list Shikhar has included in the
KIP?

-Ewen

On Thu, Dec 15, 2016 at 2:01 PM, Gwen Shapira <gw...@confluent.io> wrote:

> I agree about the ease of use in adding a small subset of built-in
> transformations.
>
> But the same thing is true for connectors - there are maybe 5 super popular
> OSS connectors and the rest is a very long tail. We drew the line at not
> adding any, because that's the easiest and because we did not want to turn
> Kafka into a collection of transformations.
>
> I really don't want to end up with 135 (or even 20) transformations in
> Kafka. So either we have a super-clear definition of what belongs and what
> doesn't - or we put in one minimal example and the rest goes into the
> ecosystem.
>
> We can also start by putting transformations on github and just see if
> there is huge demand for them in Apache. It is easier to add stuff to the
> project later than to remove functionality.
>
>
>
> On Thu, Dec 15, 2016 at 11:59 AM, Shikhar Bhushan <sh...@confluent.io>
> wrote:
>
> > I have updated KIP-66
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> > with the changes I proposed in the design.
> >
> > Gwen, I think the main downside to not including some transformations with
> > Kafka Connect is that it seems less user friendly if folks have to make
> > sure to have the right transformation(s) on the classpath as well, besides
> > their connector(s). Additionally by going in with a small set included, we
> > can encourage a consistent configuration and implementation style and
> > provide utilities for e.g. data transformations, which I expect we will
> > definitely need (discussed under 'Patterns for data transformations').
> >
> > It does get hard to draw the line once you go from 'none' to 'some'. To get
> > discussion going, if we get agreement on 'none' vs 'some', I added a table
> > under 'Bundled transformations' for transformations which I think are worth
> > including.
> >
> > For many of these, I have noticed their absence in the wild as a pain point --
> > TimestampRouter:
> > https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
> > Mask:
> > https://groups.google.com/d/msg/confluent-platform/3yHb8_mCReQ/sTQc3dNgBwAJ
> > Insert:
> > http://stackoverflow.com/questions/40664745/elasticsearch-connector-for-kafka-connect-offset-and-timestamp
> > RegexRouter:
> > https://groups.google.com/d/msg/confluent-platform/yEBwu1rGcs0/gIAhRp6kBwAJ
> > NumericCast:
> > https://github.com/confluentinc/kafka-connect-jdbc/issues/101#issuecomment-249096119
> > TimestampConverter:
> > https://groups.google.com/d/msg/confluent-platform/gGAOsw3Qeu4/8JCqdDhGBwAJ
> > ValueToKey: https://github.com/confluentinc/kafka-connect-jdbc/pull/166
> >
> > In other cases, their functionality is already being implemented by
> > connectors in divergent ways: RegexRouter, Insert, Replace, HoistToStruct,
> > ExtractFromStruct
> >
> > On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <gw...@confluent.io> wrote:
> >
> > I'm a bit concerned about adding transformations in Kafka. NiFi has 150
> > processors, presumably they are all useful for someone. I don't know if
> I'd
> > want all of that in Apache Kafka. What's the downside of keeping it out?
> Or
> > at least keeping the built-in set super minimal (Flume has like 3
> built-in
> > interceptors)?
> >
> > Gwen
> >
> > On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <sh...@confluent.io>
> > wrote:
> >
> > > With regard to a), just using `ConnectRecord` with `newRecord` as a new
> > > abstract method would be a fine choice. In prototyping, both options
> end
> > up
> > > looking pretty similar (in terms of how transformations are implemented
> > and
> > > the runtime initializes and uses them) and I'm starting to lean towards
> > not
> > > adding a new interface into the mix.
> > >
> > > On b) I think we should include a small set of useful transformations
> > with
> > > Connect, since they can be applicable across different connectors and
> we
> > > should encourage some standardization for common operations. I'll
> update
> > > KIP-66 soon including a spec of transformations that I believe are
> worth
> > > including.
> > >
> > > On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <
> > ewen@confluent.io>
> > > wrote:
> > >
> > > If anyone has time to review here, it'd be great to get feedback. I'd
> > > imagine that the proposal itself won't be too controversial -- keeps
> > > transformations simple (by only allowing map/filter), doesn't affect
> the
> > > rest of the framework much, and fits in with general config structure
> > we've
> > > used elsewhere (although ConfigDef could use some updates to make this
> > > easier...).
> > >
> > > I think the main open questions for me are:
> > >
> > > a) Is TransformableRecord worth it to avoid reimplementing small bits
> of
> > > code (it allows for a single implementation of the interface to
> trivially
> > > apply to both Source and SinkRecords). I think I prefer this, but it
> does
> > > come with some commitment to another interface on top of ConnectRecord.
> > We
> > > could alternatively modify ConnectRecord which would require fewer
> > changes.
> > > b) How do folks feel about built-in transformations and the set that
> are
> > > mentioned here? This brings us way back to the discussion of built-in
> > > connectors. Transformations, especially when intended to be lightweight
> > and
> > > touch nothing besides the data already in the record, seem different
> from
> > > connectors -- there might be quite a few, but hopefully limited. Since
> we
> > > (hopefully) already factor out most serialization-specific stuff via
> > > Converters, I think we can keep this pretty limited. That said, I have
> no
> > > doubt some folks will (in my opinion) abuse this feature to do data
> > > enrichment by querying external systems, so building a bunch of
> > > transformations in could potentially open the floodgates, or at least
> > make
> > > decisions about what is included vs what should be 3rd party muddy.
> > >
> > > -Ewen
> > >
> > >
> > > On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <shikhar@confluent.io
> >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I have another iteration at a proposal for this feature here:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/
> > > > Connect+Transforms+-+Proposed+Design
> > > >
> > > > I'd welcome your feedback and comments.
> > > >
> > > > Thanks,
> > > >
> > > > Shikhar
> > > >
> > > > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <
> > ewen@confluent.io>
> > > > wrote:
> > > >
> > > > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <
> > shikhar@confluent.io>
> > > > wrote:
> > > >
> > > > > >
> > > > > >
> > > > > > Hmm, operating on ConnectRecords probably doesn't work since you
> > need
> > > > to
> > > > > > emit the right type of record, which might mean instantiating a
> new
> > > > one.
> > > > > I
> > > > > > think that means we either need 2 methods, one for SourceRecord,
> > one
> > > > for
> > > > > > SinkRecord, or we'd need to limit what parts of the message you
> can
> > > > > modify
> > > > > > (e.g. you can change the key/value via something like
> > > > > > transformKey(ConnectRecord) and transformValue(ConnectRecord),
> but
> > > > other
> > > > > > fields would remain the same and the fmwk would handle allocating
> > new
> > > > > > Source/SinkRecords if needed)
> > > > > >
> > > > >
> > > > > Good point, perhaps we could add an abstract method on
> ConnectRecord
> > > that
> > > > > takes all the shared fields as parameters and the implementations
> > > return
> > > > a
> > > > > copy of the narrower SourceRecord/SinkRecord type as appropriate.
> > > > > Transformers would only operate on ConnectRecord rather than caring
> > > about
> > > > > SourceRecord or SinkRecord (in theory they could instanceof/cast,
> but
> > > the
> > > > > API should discourage it)
> > > > >
> > > > >
> > > > > > Is there a use case for hanging on to the original? I can't think
> > of
> > > a
> > > > > > transformation where you'd need to do that (or couldn't just
> order
> > > > things
> > > > > > differently so it isn't a problem).
> > > > >
> > > > >
> > > > > Yeah maybe this isn't really necessary. No strong preference here.
> > > > >
> > > > > That said, I do worry a bit that farming too much stuff out to
> > > > transformers
> > > > > > can result in "programming via config", i.e. a lot of the
> > simplicity
> > > > you
> > > > > > get from Connect disappears in long config files. Standardization
> > > would
> > > > > be
> > > > > > nice and might just avoid this (and doesn't cost that much
> > > implementing
> > > > > it
> > > > > > in each connector), and I'd personally prefer something a bit
> less
> > > > > flexible
> > > > > > but consistent and easy to configure.
> > > > >
> > > > >
> > > > > Not sure what the you're suggesting :-) Standardized config
> > properties
> > > > for
> > > > > a small set of transformations, leaving it upto connectors to
> > > integrate?
> > > > >
> > > >
> > > > I just mean that you get to the point where you're practically
> writing
> > a
> > > > Kafka Streams application, you're just doing it through either an
> > > > incredibly convoluted set of transformers and configs, or a single
> > > > transformer with incredibly convoluted set of configs. You basically
> > get
> > > to
> > > > the point where you're config is a mini DSL and you're not really
> > saving
> > > > that much.
> > > >
> > > > The real question is how much we want to venture into the "T" part of
> > > ETL.
> > > > I tend to favor minimizing how much we take on since the rest of
> > Connect
> > > > isn't designed for it, it's designed around the E & L parts.
> > > >
> > > > -Ewen
> > > >
> > > >
> > > > > Personally I'm skeptical of that level of flexibility in
> transformers
> > > --
> > > > > > its getting awfully complex and certainly takes us pretty far
> from
> > > > > "config
> > > > > > only" realtime data integration. It's not clear to me what the
> use
> > > > cases
> > > > > > are that aren't covered by a small set of common transformations
> > that
> > > > can
> > > > > > be chained together (e.g. rename/remove fields, mask values, and
> > > maybe
> > > > a
> > > > > > couple more).
> > > > > >
> > > > >
> > > > > I agree that we should have some standard transformations that we
> > ship
> > > > with
> > > > > connect that users would ideally lean towards for routine tasks.
> The
> > > ones
> > > > > you mention are some good candidates where I'd imagine can expose
> > > simple
> > > > > config, e.g.
> > > > >    transform.filter.whitelist=x,y,z # filter to a whitelist of
> > fields
> > > > >    transfom.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > > >    topic.rename.replace=-/_
> > > > >    topic.rename.prefix=kafka_
> > > > > etc..
> > > > >
> > > > > However the ecosystem will invariably have more complex
> transformers
> > if
> > > > we
> > > > > make this pluggable. And because ETL is messy, that's probably a
> good
> > > > thing
> > > > > if folks are able to do their data munging orthogonally to
> > connectors,
> > > so
> > > > > that connectors can focus on the logic of how data should be copied
> > > > from/to
> > > > > datastores and Kafka.
> > > > >
> > > > >
> > > > > > In any case, we'd probably also have to change configs of
> > connectors
> > > if
> > > > > we
> > > > > > allowed configs like that since presumably transformer configs
> will
> > > > just
> > > > > be
> > > > > > part of the connector config.
> > > > > >
> > > > >
> > > > > Yeah, haven't thought much about how all the configuration would
> tie
> > > > > together...
> > > > >
> > > > > I think we'd need the ability to:
> > > > > - spec transformer chain (fully-qualified class names? perhaps
> > special
> > > > > aliases for built-in ones? perhaps third-party fqcns can be
> assigned
> > > > > aliases by users in the chain spec, for easier configuration and to
> > > > > uniquely identify a transformation when it occurs more than one
> time
> > in
> > > a
> > > > > chain?)
> > > > > - configure each transformer -- all properties prefixed with that
> > > > > transformer's ID (fqcn / alias) get destined to it
> > > > >
> > > > > Additionally, I think we would probably want to allow for
> > > topic-specific
> > > > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g.
> > you
> > > > > want
> > > > > certain transformations for one topic, but different ones for
> > > another...)
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > Ewen
> > > >
> > >
> >
> >
> >
> > --
> > *Gwen Shapira*
> > Product Manager | Confluent
> > 650.450.2760 <(650)%20450-2760> | @gwenshap
> > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > <http://www.confluent.io/blog>
> >
>
>
>
> --
> *Gwen Shapira*
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> <http://www.confluent.io/blog>
>

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Shikhar Bhushan <sh...@confluent.io>.
I think the tradeoffs for including connectors are different. Connectors
are comparatively larger in scope; they tend to come with their own set of
dependencies for the systems they need to talk to. Transformations as I
imagine them - at least the ones on the table in the wiki currently -
should be a single not-very-large class (or 3 when there are simple *Key
and *Value variants deriving from a base implementing the common
functionality), in some cases relying on common utilities for munging with
the Connect data API. Correspondingly, the maintenance burden is also
smaller.
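
As a rough illustration of the "one base plus *Key and *Value variants" shape
described above, here is a minimal sketch. Everything in it is hypothetical:
`SimpleRecord` is a cut-down stand-in for a Connect record, `UpperCase` is an
invented transformation, and an interface with a default method is used in
place of the base class mentioned in the text, purely to keep the sketch
short. It is not the API proposed in the KIP.

```java
import java.util.Locale;

// Hypothetical stand-in for a Connect record, reduced to three fields.
record SimpleRecord(String topic, Object key, Object value) {}

// Base of the transformation: holds the shared logic and leaves it to the
// two variants to say which part of the record is read and replaced.
interface UpperCase {
    Object select(SimpleRecord r);                         // pick the key or the value
    SimpleRecord replace(SimpleRecord r, Object updated);  // rebuild the record with the new part

    // Shared functionality: upper-case string data, pass anything else through unchanged.
    default SimpleRecord apply(SimpleRecord r) {
        Object part = select(r);
        if (!(part instanceof String s)) {
            return r;
        }
        return replace(r, s.toUpperCase(Locale.ROOT));
    }

    // *Key variant: operates on the record key.
    record Key() implements UpperCase {
        @Override
        public Object select(SimpleRecord r) {
            return r.key();
        }

        @Override
        public SimpleRecord replace(SimpleRecord r, Object updated) {
            return new SimpleRecord(r.topic(), updated, r.value());
        }
    }

    // *Value variant: operates on the record value.
    record Value() implements UpperCase {
        @Override
        public Object select(SimpleRecord r) {
            return r.value();
        }

        @Override
        public SimpleRecord replace(SimpleRecord r, Object updated) {
            return new SimpleRecord(r.topic(), r.key(), updated);
        }
    }
}
```

Usage would look like `new UpperCase.Value().apply(record)`; the point is only
that each variant is a few lines on top of the shared logic, which is where
the small maintenance burden comes from.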

It's true that it would probably be easier to add specific transformations
down the line than evolve/remove, but I have faith we can strike a good
balance in making the call on what to include from the start.

On "super-clear definition of what belongs and what doesn't":

How about: small and broadly applicable, configurable in an easily
understandable manner, no external dependencies, concrete use-case

On Thu, Dec 15, 2016 at 2:01 PM Gwen Shapira <gw...@confluent.io> wrote:

I agree about the ease of use in adding a small subset of built-in
transformations.

But the same thing is true for connectors - there are maybe 5 super popular
OSS connectors and the rest is a very long tail. We drew the line at not
adding any, because that's the easiest and because we did not want to turn
Kafka into a collection of transformations.

I really don't want to end up with 135 (or even 20) transformations in
Kafka. So either we have a super-clear definition of what belongs and what
doesn't - or we put in one minimal example and the rest goes into the
ecosystem.

We can also start by putting transformations on github and just see if
there is huge demand for them in Apache. It is easier to add stuff to the
project later than to remove functionality.



On Thu, Dec 15, 2016 at 11:59 AM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> I have updated KIP-66
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> with the changes I proposed in the design.
>
> Gwen, I think the main downside to not including some transformations with
> Kafka Connect is that it seems less user friendly if folks have to make
> sure to have the right transformation(s) on the classpath as well, besides
> their connector(s). Additionally by going in with a small set included, we
> can encourage a consistent configuration and implementation style and
> provide utilities for e.g. data transformations, which I expect we will
> definitely need (discussed under 'Patterns for data transformations').
>
> It does get hard to draw the line once you go from 'none' to 'some'. To get
> discussion going, if we get agreement on 'none' vs 'some', I added a table
> under 'Bundled transformations' for transformations which I think are worth
> including.
>
> For many of these, I have noticed their absence in the wild as a pain point
> --
> TimestampRouter:
> https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
> Mask:
> https://groups.google.com/d/msg/confluent-platform/3yHb8_mCReQ/sTQc3dNgBwAJ
> Insert:
> http://stackoverflow.com/questions/40664745/elasticsearch-connector-for-kafka-connect-offset-and-timestamp
> RegexRouter:
> https://groups.google.com/d/msg/confluent-platform/yEBwu1rGcs0/gIAhRp6kBwAJ
> NumericCast:
> https://github.com/confluentinc/kafka-connect-jdbc/issues/101#issuecomment-249096119
> TimestampConverter:
> https://groups.google.com/d/msg/confluent-platform/gGAOsw3Qeu4/8JCqdDhGBwAJ
> ValueToKey: https://github.com/confluentinc/kafka-connect-jdbc/pull/166
>
> In other cases, their functionality is already being implemented by
> connectors in divergent ways: RegexRouter, Insert, Replace, HoistToStruct,
> ExtractFromStruct
>
> On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <gw...@confluent.io> wrote:
>
> I'm a bit concerned about adding transformations in Kafka. NiFi has 150
> processors, presumably they are all useful for someone. I don't know if I'd
> want all of that in Apache Kafka. What's the downside of keeping it out? Or
> at least keeping the built-in set super minimal (Flume has like 3 built-in
> interceptors)?
>
> Gwen
>
> On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <sh...@confluent.io>
> wrote:
>
> > With regard to a), just using `ConnectRecord` with `newRecord` as a new
> > abstract method would be a fine choice. In prototyping, both options end up
> > looking pretty similar (in terms of how transformations are implemented and
> > the runtime initializes and uses them) and I'm starting to lean towards not
> > adding a new interface into the mix.
> >
> > On b) I think we should include a small set of useful transformations with
> > Connect, since they can be applicable across different connectors and we
> > should encourage some standardization for common operations. I'll update
> > KIP-66 soon including a spec of transformations that I believe are worth
> > including.
> >
> > On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <ewen@confluent.io>
> > wrote:
> >
> > If anyone has time to review here, it'd be great to get feedback. I'd
> > imagine that the proposal itself won't be too controversial -- keeps
> > transformations simple (by only allowing map/filter), doesn't affect the
> > rest of the framework much, and fits in with general config structure we've
> > used elsewhere (although ConfigDef could use some updates to make this
> > easier...).
> >
> > I think the main open questions for me are:
> >
> > a) Is TransformableRecord worth it to avoid reimplementing small bits of
> > code (it allows for a single implementation of the interface to trivially
> > apply to both Source and SinkRecords). I think I prefer this, but it does
> > come with some commitment to another interface on top of ConnectRecord. We
> > could alternatively modify ConnectRecord which would require fewer changes.
> > b) How do folks feel about built-in transformations and the set that are
> > mentioned here? This brings us way back to the discussion of built-in
> > connectors. Transformations, especially when intended to be lightweight and
> > touch nothing besides the data already in the record, seem different from
> > connectors -- there might be quite a few, but hopefully limited. Since we
> > (hopefully) already factor out most serialization-specific stuff via
> > Converters, I think we can keep this pretty limited. That said, I have no
> > doubt some folks will (in my opinion) abuse this feature to do data
> > enrichment by querying external systems, so building a bunch of
> > transformations in could potentially open the floodgates, or at least make
> > decisions about what is included vs what should be 3rd party muddy.
> >
> > -Ewen
> >
> >
> > On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <sh...@confluent.io>
> > wrote:
> >
> > > Hi all,
> > >
> > > I have another iteration at a proposal for this feature here:
> > > https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design
> > >
> > > I'd welcome your feedback and comments.
> > >
> > > Thanks,
> > >
> > > Shikhar
> > >
> > > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <
> ewen@confluent.io>
> > > wrote:
> > >
> > > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <
> shikhar@confluent.io>
> > > wrote:
> > >
> > > > >
> > > > >
> > > > > Hmm, operating on ConnectRecords probably doesn't work since you
> need
> > > to
> > > > > emit the right type of record, which might mean instantiating a
new
> > > one.
> > > > I
> > > > > think that means we either need 2 methods, one for SourceRecord,
> one
> > > for
> > > > > SinkRecord, or we'd need to limit what parts of the message you
can
> > > > modify
> > > > > (e.g. you can change the key/value via something like
> > > > > transformKey(ConnectRecord) and transformValue(ConnectRecord), but
> > > other
> > > > > fields would remain the same and the fmwk would handle allocating
> new
> > > > > Source/SinkRecords if needed)
> > > > >
> > > >
> > > > Good point, perhaps we could add an abstract method on ConnectRecord
> > that
> > > > takes all the shared fields as parameters and the implementations
> > return
> > > a
> > > > copy of the narrower SourceRecord/SinkRecord type as appropriate.
> > > > Transformers would only operate on ConnectRecord rather than caring
> > about
> > > > SourceRecord or SinkRecord (in theory they could instanceof/cast,
but
> > the
> > > > API should discourage it)
> > > >
> > > >
> > > > > Is there a use case for hanging on to the original? I can't think
> of
> > a
> > > > > transformation where you'd need to do that (or couldn't just order
> > > things
> > > > > differently so it isn't a problem).
> > > >
> > > >
> > > > Yeah maybe this isn't really necessary. No strong preference here.
> > > >
> > > > That said, I do worry a bit that farming too much stuff out to
> > > transformers
> > > > > can result in "programming via config", i.e. a lot of the
> simplicity
> > > you
> > > > > get from Connect disappears in long config files. Standardization
> > would
> > > > be
> > > > > nice and might just avoid this (and doesn't cost that much
> > implementing
> > > > it
> > > > > in each connector), and I'd personally prefer something a bit less
> > > > flexible
> > > > > but consistent and easy to configure.
> > > >
> > > >
> > > > Not sure what you're suggesting :-) Standardized config
> properties
> > > for
> > > > a small set of transformations, leaving it up to connectors to
> > integrate?
> > > >
> > >
> > > I just mean that you get to the point where you're practically writing
> a
> > > Kafka Streams application, you're just doing it through either an
> > > incredibly convoluted set of transformers and configs, or a single
> > > transformer with incredibly convoluted set of configs. You basically
> get
> > to
> > > the point where your config is a mini DSL and you're not really
> saving
> > > that much.
> > >
> > > The real question is how much we want to venture into the "T" part of
> > ETL.
> > > I tend to favor minimizing how much we take on since the rest of
> Connect
> > > isn't designed for it, it's designed around the E & L parts.
> > >
> > > -Ewen
> > >
> > >
> > > > Personally I'm skeptical of that level of flexibility in
transformers
> > --
> > > > > it's getting awfully complex and certainly takes us pretty far from
> > > > "config
> > > > > only" realtime data integration. It's not clear to me what the use
> > > cases
> > > > > are that aren't covered by a small set of common transformations
> that
> > > can
> > > > > be chained together (e.g. rename/remove fields, mask values, and
> > maybe
> > > a
> > > > > couple more).
> > > > >
> > > >
> > > > I agree that we should have some standard transformations that we
> ship
> > > with
> > > > connect that users would ideally lean towards for routine tasks. The
> > ones
> > > > you mention are some good candidates where I'd imagine can expose
> > simple
> > > > config, e.g.
> > > >    transform.filter.whitelist=x,y,z # filter to a whitelist of
> fields
> > > >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > >    topic.rename.replace=-/_
> > > >    topic.rename.prefix=kafka_
> > > > etc..
> > > >
> > > > However the ecosystem will invariably have more complex transformers
> if
> > > we
> > > > make this pluggable. And because ETL is messy, that's probably a
good
> > > thing
> > > > if folks are able to do their data munging orthogonally to
> connectors,
> > so
> > > > that connectors can focus on the logic of how data should be copied
> > > from/to
> > > > datastores and Kafka.
> > > >
> > > >
> > > > > In any case, we'd probably also have to change configs of
> connectors
> > if
> > > > we
> > > > > allowed configs like that since presumably transformer configs
will
> > > just
> > > > be
> > > > > part of the connector config.
> > > > >
> > > >
> > > > Yeah, haven't thought much about how all the configuration would tie
> > > > together...
> > > >
> > > > I think we'd need the ability to:
> > > > - spec transformer chain (fully-qualified class names? perhaps
> special
> > > > aliases for built-in ones? perhaps third-party fqcns can be assigned
> > > > aliases by users in the chain spec, for easier configuration and to
> > > > uniquely identify a transformation when it occurs more than one time
> in
> > a
> > > > chain?)
> > > > - configure each transformer -- all properties prefixed with that
> > > > transformer's ID (fqcn / alias) get destined to it
> > > >
> > > > Additionally, I think we would probably want to allow for
> > topic-specific
> > > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g.
> you
> > > > want
> > > > certain transformations for one topic, but different ones for
> > another...)
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Ewen
> > >
> >
>
>
>
> --
> *Gwen Shapira*
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> <http://www.confluent.io/blog>
>



--
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
<http://www.confluent.io/blog>

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Gwen Shapira <gw...@confluent.io>.
I agree about the ease of use in adding a small subset of built-in
transformations.

But the same thing is true for connectors - there are maybe 5 super popular
OSS connectors and the rest is a very long tail. We drew the line at not
adding any, because that's the easiest and because we did not want to turn
Kafka into a collection of transformations.

I really don't want to end up with 135 (or even 20) transformations in
Kafka. So either we have a super-clear definition of what belongs and what
doesn't - or we put in one minimal example and the rest goes into the
ecosystem.

We can also start by putting transformations on github and just see if
there is huge demand for them in Apache. It is easier to add stuff to the
project later than to remove functionality.



On Thu, Dec 15, 2016 at 11:59 AM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> I have updated KIP-66
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> with the changes I proposed in the design.
>
> Gwen, I think the main downside to not including some transformations with
> Kafka Connect is that it seems less user friendly if folks have to make
> sure to have the right transformation(s) on the classpath as well, besides
> their connector(s). Additionally by going in with a small set included, we
> can encourage a consistent configuration and implementation style and
> provide utilities for e.g. data transformations, which I expect we will
> definitely need (discussed under 'Patterns for data transformations').
>
> It does get hard to draw the line once you go from 'none' to 'some'. To get
> discussion going, if we get agreement on 'none' vs 'some', I added a table
> under 'Bundled transformations' for transformations which I think are worth
> including.
>
> For many of these, I have noticed their absence in the wild as a pain point
> --
> TimestampRouter:
> https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
> Mask:
> https://groups.google.com/d/msg/confluent-platform/3yHb8_mCReQ/sTQc3dNgBwAJ
> Insert:
> http://stackoverflow.com/questions/40664745/elasticsearch-connector-for-kafka-connect-offset-and-timestamp
> RegexRouter:
> https://groups.google.com/d/msg/confluent-platform/yEBwu1rGcs0/gIAhRp6kBwAJ
> NumericCast:
> https://github.com/confluentinc/kafka-connect-jdbc/issues/101#issuecomment-249096119
> TimestampConverter:
> https://groups.google.com/d/msg/confluent-platform/gGAOsw3Qeu4/8JCqdDhGBwAJ
> ValueToKey: https://github.com/confluentinc/kafka-connect-jdbc/pull/166
>
> In other cases, their functionality is already being implemented by
> connectors in divergent ways: RegexRouter, Insert, Replace, HoistToStruct,
> ExtractFromStruct
>
> On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <gw...@confluent.io> wrote:
>
> I'm a bit concerned about adding transformations in Kafka. NiFi has 150
> processors, presumably they are all useful for someone. I don't know if I'd
> want all of that in Apache Kafka. What's the downside of keeping it out? Or
> at least keeping the built-in set super minimal (Flume has like 3 built-in
> interceptors)?
>
> Gwen
>
> On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <sh...@confluent.io>
> wrote:
>
> > With regard to a), just using `ConnectRecord` with `newRecord` as a new
> > abstract method would be a fine choice. In prototyping, both options end up
> > looking pretty similar (in terms of how transformations are implemented and
> > the runtime initializes and uses them) and I'm starting to lean towards not
> > adding a new interface into the mix.
> >
> > On b) I think we should include a small set of useful transformations with
> > Connect, since they can be applicable across different connectors and we
> > should encourage some standardization for common operations. I'll update
> > KIP-66 soon including a spec of transformations that I believe are worth
> > including.
> >
> > On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <ewen@confluent.io>
> > wrote:
> >
> > If anyone has time to review here, it'd be great to get feedback. I'd
> > imagine that the proposal itself won't be too controversial -- keeps
> > transformations simple (by only allowing map/filter), doesn't affect the
> > rest of the framework much, and fits in with general config structure we've
> > used elsewhere (although ConfigDef could use some updates to make this
> > easier...).
> >
> > I think the main open questions for me are:
> >
> > a) Is TransformableRecord worth it to avoid reimplementing small bits of
> > code (it allows for a single implementation of the interface to trivially
> > apply to both Source and SinkRecords). I think I prefer this, but it does
> > come with some commitment to another interface on top of ConnectRecord. We
> > could alternatively modify ConnectRecord which would require fewer changes.
> > b) How do folks feel about built-in transformations and the set that are
> > mentioned here? This brings us way back to the discussion of built-in
> > connectors. Transformations, especially when intended to be lightweight and
> > touch nothing besides the data already in the record, seem different from
> > connectors -- there might be quite a few, but hopefully limited. Since we
> > (hopefully) already factor out most serialization-specific stuff via
> > Converters, I think we can keep this pretty limited. That said, I have no
> > doubt some folks will (in my opinion) abuse this feature to do data
> > enrichment by querying external systems, so building a bunch of
> > transformations in could potentially open the floodgates, or at least make
> > decisions about what is included vs what should be 3rd party muddy.
> >
> > -Ewen
> >
> >
> > On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <sh...@confluent.io>
> > wrote:
> >
> > > Hi all,
> > >
> > > I have another iteration at a proposal for this feature here:
> > > https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design
> > >
> > > I'd welcome your feedback and comments.
> > >
> > > Thanks,
> > >
> > > Shikhar
> > >
> > > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <
> ewen@confluent.io>
> > > wrote:
> > >
> > > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <
> shikhar@confluent.io>
> > > wrote:
> > >
> > > > >
> > > > >
> > > > > Hmm, operating on ConnectRecords probably doesn't work since you
> need
> > > to
> > > > > emit the right type of record, which might mean instantiating a new
> > > one.
> > > > I
> > > > > think that means we either need 2 methods, one for SourceRecord,
> one
> > > for
> > > > > SinkRecord, or we'd need to limit what parts of the message you can
> > > > modify
> > > > > (e.g. you can change the key/value via something like
> > > > > transformKey(ConnectRecord) and transformValue(ConnectRecord), but
> > > other
> > > > > fields would remain the same and the fmwk would handle allocating
> new
> > > > > Source/SinkRecords if needed)
> > > > >
> > > >
> > > > Good point, perhaps we could add an abstract method on ConnectRecord
> > that
> > > > takes all the shared fields as parameters and the implementations
> > return
> > > a
> > > > copy of the narrower SourceRecord/SinkRecord type as appropriate.
> > > > Transformers would only operate on ConnectRecord rather than caring
> > about
> > > > SourceRecord or SinkRecord (in theory they could instanceof/cast, but
> > the
> > > > API should discourage it)
> > > >
> > > >
> > > > > Is there a use case for hanging on to the original? I can't think
> of
> > a
> > > > > transformation where you'd need to do that (or couldn't just order
> > > things
> > > > > differently so it isn't a problem).
> > > >
> > > >
> > > > Yeah maybe this isn't really necessary. No strong preference here.
> > > >
> > > > That said, I do worry a bit that farming too much stuff out to
> > > transformers
> > > > > can result in "programming via config", i.e. a lot of the
> simplicity
> > > you
> > > > > get from Connect disappears in long config files. Standardization
> > would
> > > > be
> > > > > nice and might just avoid this (and doesn't cost that much
> > implementing
> > > > it
> > > > > in each connector), and I'd personally prefer something a bit less
> > > > flexible
> > > > > but consistent and easy to configure.
> > > >
> > > >
> > > > Not sure what you're suggesting :-) Standardized config
> properties
> > > for
> > > > a small set of transformations, leaving it up to connectors to
> > integrate?
> > > >
> > >
> > > I just mean that you get to the point where you're practically writing
> a
> > > Kafka Streams application, you're just doing it through either an
> > > incredibly convoluted set of transformers and configs, or a single
> > > transformer with incredibly convoluted set of configs. You basically
> get
> > to
> > > the point where your config is a mini DSL and you're not really
> saving
> > > that much.
> > >
> > > The real question is how much we want to venture into the "T" part of
> > ETL.
> > > I tend to favor minimizing how much we take on since the rest of
> Connect
> > > isn't designed for it, it's designed around the E & L parts.
> > >
> > > -Ewen
> > >
> > >
> > > > Personally I'm skeptical of that level of flexibility in transformers
> > --
> > > > > it's getting awfully complex and certainly takes us pretty far from
> > > > "config
> > > > > only" realtime data integration. It's not clear to me what the use
> > > cases
> > > > > are that aren't covered by a small set of common transformations
> that
> > > can
> > > > > be chained together (e.g. rename/remove fields, mask values, and
> > maybe
> > > a
> > > > > couple more).
> > > > >
> > > >
> > > > I agree that we should have some standard transformations that we
> ship
> > > with
> > > > connect that users would ideally lean towards for routine tasks. The
> > ones
> > > > you mention are some good candidates where I'd imagine can expose
> > simple
> > > > config, e.g.
> > > >    transform.filter.whitelist=x,y,z # filter to a whitelist of
> fields
> > > >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > >    topic.rename.replace=-/_
> > > >    topic.rename.prefix=kafka_
> > > > etc..
> > > >
> > > > However the ecosystem will invariably have more complex transformers
> if
> > > we
> > > > make this pluggable. And because ETL is messy, that's probably a good
> > > thing
> > > > if folks are able to do their data munging orthogonally to
> connectors,
> > so
> > > > that connectors can focus on the logic of how data should be copied
> > > from/to
> > > > datastores and Kafka.
> > > >
> > > >
> > > > > In any case, we'd probably also have to change configs of
> connectors
> > if
> > > > we
> > > > > allowed configs like that since presumably transformer configs will
> > > just
> > > > be
> > > > > part of the connector config.
> > > > >
> > > >
> > > > Yeah, haven't thought much about how all the configuration would tie
> > > > together...
> > > >
> > > > I think we'd need the ability to:
> > > > - spec transformer chain (fully-qualified class names? perhaps
> special
> > > > aliases for built-in ones? perhaps third-party fqcns can be assigned
> > > > aliases by users in the chain spec, for easier configuration and to
> > > > uniquely identify a transformation when it occurs more than one time
> in
> > a
> > > > chain?)
> > > > - configure each transformer -- all properties prefixed with that
> > > > transformer's ID (fqcn / alias) get destined to it
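
The prefix idea in the two bullets above can be sketched roughly as follows.
This is only an illustration under stated assumptions: the
`<alias>.<property>` key layout, the `TransformConfigs` name and the `split`
helper are all hypothetical and are not taken from the proposal or from any
Connect API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

interface TransformConfigs {
    // Assumed key layout: "<alias>.<property>" sends <property> to the
    // transformer registered under <alias>; keys matching no alias are ignored.
    static Map<String, Map<String, String>> split(List<String> chain,
                                                  Map<String, String> props) {
        Map<String, Map<String, String>> byAlias = new LinkedHashMap<>();
        for (String alias : chain) {
            String prefix = alias + ".";
            Map<String, String> own = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : props.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    // Strip the alias so the transformer sees plain property names.
                    own.put(e.getKey().substring(prefix.length()), e.getValue());
                }
            }
            byAlias.put(alias, own);
        }
        return byAlias;
    }
}
```

Because the alias rather than the class name is the key, the same
transformation class could appear twice in a chain under two aliases and
still receive separate settings.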
> > > >
> > > > Additionally, I think we would probably want to allow for
> > topic-specific
> > > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g.
> you
> > > > want
> > > > certain transformations for one topic, but different ones for
> > another...)
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Ewen
> > >
> >
>
>
>
> --
> *Gwen Shapira*
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> <http://www.confluent.io/blog>
>



-- 
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
<http://www.confluent.io/blog>

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Shikhar Bhushan <sh...@confluent.io>.
I have updated KIP-66
https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
with the changes I proposed in the design.

Gwen, I think the main downside to not including some transformations with
Kafka Connect is that it seems less user friendly if folks have to make
sure to have the right transformation(s) on the classpath as well, besides
their connector(s). Additionally by going in with a small set included, we
can encourage a consistent configuration and implementation style and
provide utilities for e.g. data transformations, which I expect we will
definitely need (discussed under 'Patterns for data transformations').

It does get hard to draw the line once you go from 'none' to 'some'. To get
discussion going, if we get agreement on 'none' vs 'some', I added a table
under 'Bundled transformations' for transformations which I think are worth
including.

For many of these, I have noticed their absence in the wild as a pain point
--
TimestampRouter:
https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
Mask:
https://groups.google.com/d/msg/confluent-platform/3yHb8_mCReQ/sTQc3dNgBwAJ
Insert:
http://stackoverflow.com/questions/40664745/elasticsearch-connector-for-kafka-connect-offset-and-timestamp
RegexRouter:
https://groups.google.com/d/msg/confluent-platform/yEBwu1rGcs0/gIAhRp6kBwAJ
NumericCast:
https://github.com/confluentinc/kafka-connect-jdbc/issues/101#issuecomment-249096119
TimestampConverter:
https://groups.google.com/d/msg/confluent-platform/gGAOsw3Qeu4/8JCqdDhGBwAJ
ValueToKey: https://github.com/confluentinc/kafka-connect-jdbc/pull/166

In other cases, their functionality is already being implemented by
connectors in divergent ways: RegexRouter, Insert, Replace, HoistToStruct,
ExtractFromStruct

On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <gw...@confluent.io> wrote:

I'm a bit concerned about adding transformations in Kafka. NiFi has 150
processors, presumably they are all useful for someone. I don't know if I'd
want all of that in Apache Kafka. What's the downside of keeping it out? Or
at least keeping the built-in set super minimal (Flume has like 3 built-in
interceptors)?

Gwen

On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> With regard to a), just using `ConnectRecord` with `newRecord` as a new
> abstract method would be a fine choice. In prototyping, both options end up
> looking pretty similar (in terms of how transformations are implemented and
> the runtime initializes and uses them) and I'm starting to lean towards not
> adding a new interface into the mix.
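
To make the `newRecord` option concrete, here is a minimal sketch of the
pattern. The types are hypothetical, cut-down stand-ins (`AnyRecord`,
`SourceRec`, `SinkRec`, `Transforms`), not the real Connect classes, and the
field lists and method signature are assumptions chosen for brevity.

```java
// Hypothetical shared view of a record: the fields every record type has,
// plus a factory method that returns a copy of the same concrete type.
interface AnyRecord {
    String topic();
    Object key();
    Object value();

    // Implementations carry over their own extra fields into the copy.
    AnyRecord newRecord(String topic, Object key, Object value);
}

// Stand-in for a source-side record with one source-only field.
record SourceRec(String topic, Object key, Object value, long sourceOffset)
        implements AnyRecord {
    @Override
    public AnyRecord newRecord(String topic, Object key, Object value) {
        return new SourceRec(topic, key, value, sourceOffset);
    }
}

// Stand-in for a sink-side record with one sink-only field.
record SinkRec(String topic, Object key, Object value, long kafkaOffset)
        implements AnyRecord {
    @Override
    public AnyRecord newRecord(String topic, Object key, Object value) {
        return new SinkRec(topic, key, value, kafkaOffset);
    }
}

interface Transforms {
    // Written once against the shared interface; no instanceof or cast needed.
    static AnyRecord prefixTopic(AnyRecord r, String prefix) {
        return r.newRecord(prefix + r.topic(), r.key(), r.value());
    }
}
```

The transformation never names `SourceRec` or `SinkRec`, yet the copy it
returns is of the right type with its type-specific field preserved, which is
the property the thread is after.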
>
> On b) I think we should include a small set of useful transformations with
> Connect, since they can be applicable across different connectors and we
> should encourage some standardization for common operations. I'll update
> KIP-66 soon including a spec of transformations that I believe are worth
> including.
>
> On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <ew...@confluent.io>
> wrote:
>
> If anyone has time to review here, it'd be great to get feedback. I'd
> imagine that the proposal itself won't be too controversial -- keeps
> transformations simple (by only allowing map/filter), doesn't affect the
> rest of the framework much, and fits in with general config structure we've
> used elsewhere (although ConfigDef could use some updates to make this
> easier...).
>
> I think the main open questions for me are:
>
> a) Is TransformableRecord worth it to avoid reimplementing small bits of
> code (it allows for a single implementation of the interface to trivially
> apply to both Source and SinkRecords). I think I prefer this, but it does
> come with some commitment to another interface on top of ConnectRecord. We
> could alternatively modify ConnectRecord which would require fewer changes.
> b) How do folks feel about built-in transformations and the set that are
> mentioned here? This brings us way back to the discussion of built-in
> connectors. Transformations, especially when intended to be lightweight and
> touch nothing besides the data already in the record, seem different from
> connectors -- there might be quite a few, but hopefully limited. Since we
> (hopefully) already factor out most serialization-specific stuff via
> Converters, I think we can keep this pretty limited. That said, I have no
> doubt some folks will (in my opinion) abuse this feature to do data
> enrichment by querying external systems, so building a bunch of
> transformations in could potentially open the floodgates, or at least make
> decisions about what is included vs what should be 3rd party muddy.
>
> -Ewen
>
>
> On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <sh...@confluent.io>
> wrote:
>
> > Hi all,
> >
> > I have another iteration at a proposal for this feature here:
> > https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design
> >
> > I'd welcome your feedback and comments.
> >
> > Thanks,
> >
> > Shikhar
> >
> > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <ew...@confluent.io>
> > wrote:
> >
> > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <sh...@confluent.io>
> > wrote:
> >
> > > >
> > > >
> > > > Hmm, operating on ConnectRecords probably doesn't work since you
need
> > to
> > > > emit the right type of record, which might mean instantiating a new
> > one.
> > > I
> > > > think that means we either need 2 methods, one for SourceRecord, one
> > for
> > > > SinkRecord, or we'd need to limit what parts of the message you can
> > > modify
> > > > (e.g. you can change the key/value via something like
> > > > transformKey(ConnectRecord) and transformValue(ConnectRecord), but
> > other
> > > > fields would remain the same and the fmwk would handle allocating
new
> > > > Source/SinkRecords if needed)
> > > >
> > >
> > > Good point, perhaps we could add an abstract method on ConnectRecord
> that
> > > takes all the shared fields as parameters and the implementations
> return
> > a
> > > copy of the narrower SourceRecord/SinkRecord type as appropriate.
> > > Transformers would only operate on ConnectRecord rather than caring
> about
> > > SourceRecord or SinkRecord (in theory they could instanceof/cast, but
> the
> > > API should discourage it)
> > >
> > >
> > > > Is there a use case for hanging on to the original? I can't think of
> a
> > > > transformation where you'd need to do that (or couldn't just order
> > things
> > > > differently so it isn't a problem).
> > >
> > >
> > > Yeah maybe this isn't really necessary. No strong preference here.
> > >
> > > That said, I do worry a bit that farming too much stuff out to
> > transformers
> > > > can result in "programming via config", i.e. a lot of the simplicity
> > you
> > > > get from Connect disappears in long config files. Standardization
> would
> > > be
> > > > nice and might just avoid this (and doesn't cost that much
> implementing
> > > it
> > > > in each connector), and I'd personally prefer something a bit less
> > > flexible
> > > > but consistent and easy to configure.
> > >
> > >
> > > Not sure what you're suggesting :-) Standardized config properties for
> > > a small set of transformations, leaving it up to connectors to integrate?
> > >
> >
> > I just mean that you get to the point where you're practically writing a
> > Kafka Streams application, you're just doing it through either an
> > incredibly convoluted set of transformers and configs, or a single
> > transformer with incredibly convoluted set of configs. You basically get to
> > the point where your config is a mini DSL and you're not really saving
> > that much.
> >
> > The real question is how much we want to venture into the "T" part of
> ETL.
> > I tend to favor minimizing how much we take on since the rest of Connect
> > isn't designed for it, it's designed around the E & L parts.
> >
> > -Ewen
> >
> >
> > > Personally I'm skeptical of that level of flexibility in transformers
> --
> > > > it's getting awfully complex and certainly takes us pretty far from
> > > "config
> > > > only" realtime data integration. It's not clear to me what the use
> > cases
> > > > are that aren't covered by a small set of common transformations
that
> > can
> > > > be chained together (e.g. rename/remove fields, mask values, and
> maybe
> > a
> > > > couple more).
> > > >
> > >
> > > I agree that we should have some standard transformations that we ship
> > with
> > > connect that users would ideally lean towards for routine tasks. The
> ones
> > > you mention are some good candidates where I'd imagine can expose
> simple
> > > config, e.g.
> > >    transform.filter.whitelist=x,y,z # filter to a whitelist of fields
> > >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > >    topic.rename.replace=-/_
> > >    topic.rename.prefix=kafka_
> > > etc..
> > >
> > > However the ecosystem will invariably have more complex transformers
if
> > we
> > > make this pluggable. And because ETL is messy, that's probably a good
> > thing
> > > if folks are able to do their data munging orthogonally to connectors,
> so
> > > that connectors can focus on the logic of how data should be copied
> > from/to
> > > datastores and Kafka.
> > >
> > >
> > > > In any case, we'd probably also have to change configs of connectors
> if
> > > we
> > > > allowed configs like that since presumably transformer configs will
> > just
> > > be
> > > > part of the connector config.
> > > >
> > >
> > > Yeah, haven't thought much about how all the configuration would tie
> > > together...
> > >
> > > I think we'd need the ability to:
> > > - spec transformer chain (fully-qualified class names? perhaps special
> > > aliases for built-in ones? perhaps third-party fqcns can be assigned
> > > aliases by users in the chain spec, for easier configuration and to
> > > uniquely identify a transformation when it occurs more than one time
in
> a
> > > chain?)
> > > - configure each transformer -- all properties prefixed with that
> > > transformer's ID (fqcn / alias) get destined to it
> > >
> > > Additionally, I think we would probably want to allow for
> topic-specific
> > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you
> > > want
> > > certain transformations for one topic, but different ones for
> another...)
> > >
> >
> >
> >
> > --
> > Thanks,
> > Ewen
> >
>



--
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
<http://www.confluent.io/blog>

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Gwen Shapira <gw...@confluent.io>.
I'm a bit concerned about adding transformations in Kafka. NiFi has 150
processors; presumably they are all useful for someone. I don't know if I'd
want all of that in Apache Kafka. What's the downside of keeping it out? Or
at least keeping the built-in set super minimal (Flume has like 3 built-in
interceptors)?

Gwen

On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> With regard to a), just using `ConnectRecord` with `newRecord` as a new
> abstract method would be a fine choice. In prototyping, both options end up
> looking pretty similar (in terms of how transformations are implemented and
> the runtime initializes and uses them) and I'm starting to lean towards not
> adding a new interface into the mix.
>
> On b) I think we should include a small set of useful transformations with
> Connect, since they can be applicable across different connectors and we
> should encourage some standardization for common operations. I'll update
> KIP-66 soon including a spec of transformations that I believe are worth
> including.
>
> On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <ew...@confluent.io>
> wrote:
>
> If anyone has time to review here, it'd be great to get feedback. I'd
> imagine that the proposal itself won't be too controversial -- keeps
> transformations simple (by only allowing map/filter), doesn't affect the
> rest of the framework much, and fits in with general config structure we've
> used elsewhere (although ConfigDef could use some updates to make this
> easier...).
>
> I think the main open questions for me are:
>
> a) Is TransformableRecord worth it to avoid reimplementing small bits of
> code (it allows for a single implementation of the interface to trivially
> apply to both Source and SinkRecords). I think I prefer this, but it does
> come with some commitment to another interface on top of ConnectRecord. We
> could alternatively modify ConnectRecord which would require fewer changes.
> b) How do folks feel about built-in transformations and the set that are
> mentioned here? This brings us way back to the discussion of built-in
> connectors. Transformations, especially when intended to be lightweight and
> touch nothing besides the data already in the record, seem different from
> connectors -- there might be quite a few, but hopefully limited. Since we
> (hopefully) already factor out most serialization-specific stuff via
> Converters, I think we can keep this pretty limited. That said, I have no
> doubt some folks will (in my opinion) abuse this feature to do data
> enrichment by querying external systems, so building a bunch of
> transformations in could potentially open the floodgates, or at least make
> decisions about what is included vs what should be 3rd party muddy.
>
> -Ewen
>
>
> On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <sh...@confluent.io>
> wrote:
>
> > Hi all,
> >
> > I have another iteration at a proposal for this feature here:
> > https://cwiki.apache.org/confluence/display/KAFKA/
> > Connect+Transforms+-+Proposed+Design
> >
> > I'd welcome your feedback and comments.
> >
> > Thanks,
> >
> > Shikhar
> >
> > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <ew...@confluent.io>
> > wrote:
> >
> > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <sh...@confluent.io>
> > wrote:
> >
> > > >
> > > >
> > > > Hmm, operating on ConnectRecords probably doesn't work since you need
> > to
> > > > emit the right type of record, which might mean instantiating a new
> > one.
> > > I
> > > > think that means we either need 2 methods, one for SourceRecord, one
> > for
> > > > SinkRecord, or we'd need to limit what parts of the message you can
> > > modify
> > > > (e.g. you can change the key/value via something like
> > > > transformKey(ConnectRecord) and transformValue(ConnectRecord), but
> > other
> > > > fields would remain the same and the fmwk would handle allocating new
> > > > Source/SinkRecords if needed)
> > > >
> > >
> > > Good point, perhaps we could add an abstract method on ConnectRecord
> that
> > > takes all the shared fields as parameters and the implementations
> return
> > a
> > > copy of the narrower SourceRecord/SinkRecord type as appropriate.
> > > Transformers would only operate on ConnectRecord rather than caring
> about
> > > SourceRecord or SinkRecord (in theory they could instanceof/cast, but
> the
> > > API should discourage it)
> > >
> > >
> > > > Is there a use case for hanging on to the original? I can't think of
> a
> > > > transformation where you'd need to do that (or couldn't just order
> > things
> > > > differently so it isn't a problem).
> > >
> > >
> > > Yeah maybe this isn't really necessary. No strong preference here.
> > >
> > > That said, I do worry a bit that farming too much stuff out to
> > transformers
> > > > can result in "programming via config", i.e. a lot of the simplicity
> > you
> > > > get from Connect disappears in long config files. Standardization
> would
> > > be
> > > > nice and might just avoid this (and doesn't cost that much
> implementing
> > > it
> > > > in each connector), and I'd personally prefer something a bit less
> > > flexible
> > > > but consistent and easy to configure.
> > >
> > >
> > > Not sure what you're suggesting :-) Standardized config properties for
> > > a small set of transformations, leaving it up to connectors to integrate?
> > >
> >
> > I just mean that you get to the point where you're practically writing a
> > Kafka Streams application, you're just doing it through either an
> > incredibly convoluted set of transformers and configs, or a single
> > transformer with incredibly convoluted set of configs. You basically get to
> > the point where your config is a mini DSL and you're not really saving
> > that much.
> >
> > The real question is how much we want to venture into the "T" part of
> ETL.
> > I tend to favor minimizing how much we take on since the rest of Connect
> > isn't designed for it, it's designed around the E & L parts.
> >
> > -Ewen
> >
> >
> > > Personally I'm skeptical of that level of flexibility in transformers
> --
> > > > it's getting awfully complex and certainly takes us pretty far from
> > > "config
> > > > only" realtime data integration. It's not clear to me what the use
> > cases
> > > > are that aren't covered by a small set of common transformations that
> > can
> > > > be chained together (e.g. rename/remove fields, mask values, and
> maybe
> > a
> > > > couple more).
> > > >
> > >
> > > I agree that we should have some standard transformations that we ship
> > with
> > > connect that users would ideally lean towards for routine tasks. The
> ones
> > > you mention are some good candidates where I'd imagine can expose
> simple
> > > config, e.g.
> > >    transform.filter.whitelist=x,y,z # filter to a whitelist of fields
> > >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > >    topic.rename.replace=-/_
> > >    topic.rename.prefix=kafka_
> > > etc..
> > >
> > > However the ecosystem will invariably have more complex transformers if
> > we
> > > make this pluggable. And because ETL is messy, that's probably a good
> > thing
> > > if folks are able to do their data munging orthogonally to connectors,
> so
> > > that connectors can focus on the logic of how data should be copied
> > from/to
> > > datastores and Kafka.
> > >
> > >
> > > > In any case, we'd probably also have to change configs of connectors
> if
> > > we
> > > > allowed configs like that since presumably transformer configs will
> > just
> > > be
> > > > part of the connector config.
> > > >
> > >
> > > Yeah, haven't thought much about how all the configuration would tie
> > > together...
> > >
> > > I think we'd need the ability to:
> > > - spec transformer chain (fully-qualified class names? perhaps special
> > > aliases for built-in ones? perhaps third-party fqcns can be assigned
> > > aliases by users in the chain spec, for easier configuration and to
> > > uniquely identify a transformation when it occurs more than one time in
> a
> > > chain?)
> > > - configure each transformer -- all properties prefixed with that
> > > transformer's ID (fqcn / alias) get destined to it
> > >
> > > Additionally, I think we would probably want to allow for
> topic-specific
> > > overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you
> > > want
> > > certain transformations for one topic, but different ones for
> another...)
> > >
> >
> >
> >
> > --
> > Thanks,
> > Ewen
> >
>



-- 
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
<http://www.confluent.io/blog>

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Shikhar Bhushan <sh...@confluent.io>.
With regard to a), just using `ConnectRecord` with `newRecord` as a new
abstract method would be a fine choice. In prototyping, both options end up
looking pretty similar (in terms of how transformations are implemented and
the runtime initializes and uses them) and I'm starting to lean towards not
adding a new interface into the mix.
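
As a rough illustration of the `newRecord` idea, here is a minimal sketch. The
classes below (`Rec`, `SourceRec`, `SinkRec`, `TopicPrefix`) are hypothetical,
simplified stand-ins invented for this example, not the actual Connect
classes; the real `ConnectRecord` carries more shared fields, and the final
signature may well differ. The sketch only shows how one abstract copy method
lets a transformation be written once against the base type while still
getting back the narrower record type.

```java
// Hypothetical, simplified stand-ins for Connect's record classes.
abstract class Rec {
    final String topic;
    final Object key;
    final Object value;

    Rec(String topic, Object key, Object value) {
        this.topic = topic;
        this.key = key;
        this.value = value;
    }

    // Takes the shared fields; each subclass returns a copy of its own type.
    abstract Rec newRecord(String topic, Object key, Object value);
}

final class SourceRec extends Rec {
    final String sourceOffset; // source-only field, carried over unchanged

    SourceRec(String topic, Object key, Object value, String sourceOffset) {
        super(topic, key, value);
        this.sourceOffset = sourceOffset;
    }

    @Override
    SourceRec newRecord(String topic, Object key, Object value) {
        return new SourceRec(topic, key, value, sourceOffset);
    }
}

final class SinkRec extends Rec {
    final long kafkaOffset; // sink-only field, carried over unchanged

    SinkRec(String topic, Object key, Object value, long kafkaOffset) {
        super(topic, key, value);
        this.kafkaOffset = kafkaOffset;
    }

    @Override
    SinkRec newRecord(String topic, Object key, Object value) {
        return new SinkRec(topic, key, value, kafkaOffset);
    }
}

// A transformation written once against the base type (hypothetical example).
final class TopicPrefix {
    private final String prefix;

    TopicPrefix(String prefix) {
        this.prefix = prefix;
    }

    Rec apply(Rec record) {
        return record.newRecord(prefix + record.topic, record.key, record.value);
    }
}

class NewRecordSketch {
    public static void main(String[] args) {
        Rec out = new TopicPrefix("kafka_").apply(new SinkRec("orders", "k", "v", 42L));
        System.out.println(out.getClass().getSimpleName() + " " + out.topic);
    }
}
```

The transformation never casts or uses instanceof; the record itself decides
which concrete type to allocate.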

On b) I think we should include a small set of useful transformations with
Connect, since they can be applicable across different connectors and we
should encourage some standardization for common operations. I'll update
KIP-66 soon including a spec of transformations that I believe are worth
including.

On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

If anyone has time to review here, it'd be great to get feedback. I'd
imagine that the proposal itself won't be too controversial -- keeps
transformations simple (by only allowing map/filter), doesn't affect the
rest of the framework much, and fits in with general config structure we've
used elsewhere (although ConfigDef could use some updates to make this
easier...).

I think the main open questions for me are:

a) Is TransformableRecord worth it to avoid reimplementing small bits of
code (it allows for a single implementation of the interface to trivially
apply to both Source and SinkRecords). I think I prefer this, but it does
come with some commitment to another interface on top of ConnectRecord. We
could alternatively modify ConnectRecord which would require fewer changes.
b) How do folks feel about built-in transformations and the set that are
mentioned here? This brings us way back to the discussion of built-in
connectors. Transformations, especially when intended to be lightweight and
touch nothing besides the data already in the record, seem different from
connectors -- there might be quite a few, but hopefully limited. Since we
(hopefully) already factor out most serialization-specific stuff via
Converters, I think we can keep this pretty limited. That said, I have no
doubt some folks will (in my opinion) abuse this feature to do data
enrichment by querying external systems, so building a bunch of
transformations in could potentially open the floodgates, or at least make
decisions about what is included vs what should be 3rd party muddy.

-Ewen


On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> Hi all,
>
> I have another iteration at a proposal for this feature here:
> https://cwiki.apache.org/confluence/display/KAFKA/
> Connect+Transforms+-+Proposed+Design
>
> I'd welcome your feedback and comments.
>
> Thanks,
>
> Shikhar
>
> On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <ew...@confluent.io>
> wrote:
>
> On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <sh...@confluent.io>
> wrote:
>
> > >
> > >
> > > Hmm, operating on ConnectRecords probably doesn't work since you need
> to
> > > emit the right type of record, which might mean instantiating a new
> one.
> > I
> > > think that means we either need 2 methods, one for SourceRecord, one
> for
> > > SinkRecord, or we'd need to limit what parts of the message you can
> > modify
> > > (e.g. you can change the key/value via something like
> > > transformKey(ConnectRecord) and transformValue(ConnectRecord), but
> other
> > > fields would remain the same and the fmwk would handle allocating new
> > > Source/SinkRecords if needed)
> > >
> >
> > Good point, perhaps we could add an abstract method on ConnectRecord that
> > takes all the shared fields as parameters and the implementations return a
> > copy of the narrower SourceRecord/SinkRecord type as appropriate.
> > Transformers would only operate on ConnectRecord rather than caring about
> > SourceRecord or SinkRecord (in theory they could instanceof/cast, but the
> > API should discourage it)
> >
> >
> > > Is there a use case for hanging on to the original? I can't think of a
> > > transformation where you'd need to do that (or couldn't just order
> things
> > > differently so it isn't a problem).
> >
> >
> > Yeah maybe this isn't really necessary. No strong preference here.
> >
> > That said, I do worry a bit that farming too much stuff out to
> transformers
> > > can result in "programming via config", i.e. a lot of the simplicity
> you
> > > get from Connect disappears in long config files. Standardization
would
> > be
> > > nice and might just avoid this (and doesn't cost that much
implementing
> > it
> > > in each connector), and I'd personally prefer something a bit less
> > flexible
> > > but consistent and easy to configure.
> >
> >
> > Not sure what you're suggesting :-) Standardized config properties for
> > a small set of transformations, leaving it up to connectors to integrate?
> >
>
> I just mean that you get to the point where you're practically writing a
> Kafka Streams application, you're just doing it through either an
> incredibly convoluted set of transformers and configs, or a single
> transformer with incredibly convoluted set of configs. You basically get to
> the point where your config is a mini DSL and you're not really saving
> that much.
>
> The real question is how much we want to venture into the "T" part of ETL.
> I tend to favor minimizing how much we take on since the rest of Connect
> isn't designed for it, it's designed around the E & L parts.
>
> -Ewen
>
>
> > Personally I'm skeptical of that level of flexibility in transformers --
> > > it's getting awfully complex and certainly takes us pretty far from
> > "config
> > > only" realtime data integration. It's not clear to me what the use
> cases
> > > are that aren't covered by a small set of common transformations that
> can
> > > be chained together (e.g. rename/remove fields, mask values, and maybe
> a
> > > couple more).
> > >
> >
> > I agree that we should have some standard transformations that we ship
> with
> > connect that users would ideally lean towards for routine tasks. The
ones
> > you mention are some good candidates where I'd imagine can expose simple
> > config, e.g.
> >    transform.filter.whitelist=x,y,z # filter to a whitelist of fields
> >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> >    topic.rename.replace=-/_
> >    topic.rename.prefix=kafka_
> > etc..
> >
> > However the ecosystem will invariably have more complex transformers if
> we
> > make this pluggable. And because ETL is messy, that's probably a good
> thing
> > if folks are able to do their data munging orthogonally to connectors,
so
> > that connectors can focus on the logic of how data should be copied
> from/to
> > datastores and Kafka.
> >
> >
> > > In any case, we'd probably also have to change configs of connectors
if
> > we
> > > allowed configs like that since presumably transformer configs will
> just
> > be
> > > part of the connector config.
> > >
> >
> > Yeah, haven't thought much about how all the configuration would tie
> > together...
> >
> > I think we'd need the ability to:
> > - spec transformer chain (fully-qualified class names? perhaps special
> > aliases for built-in ones? perhaps third-party fqcns can be assigned
> > aliases by users in the chain spec, for easier configuration and to
> > uniquely identify a transformation when it occurs more than one time in
a
> > chain?)
> > - configure each transformer -- all properties prefixed with that
> > transformer's ID (fqcn / alias) get destined to it
> >
> > Additionally, I think we would probably want to allow for topic-specific
> > overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you
> > want
> > certain transformations for one topic, but different ones for
another...)
> >
>
>
>
> --
> Thanks,
> Ewen
>

Re: [DISCUSS] KIP-66 Kafka Connect Transformers for messages

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
If anyone has time to review here, it'd be great to get feedback. I'd
imagine that the proposal itself won't be too controversial -- it keeps
transformations simple (by only allowing map/filter), doesn't affect the
rest of the framework much, and fits in with general config structure we've
used elsewhere (although ConfigDef could use some updates to make this
easier...).
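
One way to read "only allowing map/filter" is that each transformation takes
a single record and returns either a replacement record or nothing. The
sketch below assumes that a null result means "drop the record"; that
convention, and the names `applyChain` and `applyAll`, are assumptions made
for illustration and are not taken from the proposal. Plain strings stand in
for records to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class TransformChainSketch {
    // Applies each step in order; a null result drops the record and
    // short-circuits the rest of the chain (assumed convention).
    static <R> R applyChain(List<UnaryOperator<R>> chain, R record) {
        R current = record;
        for (UnaryOperator<R> step : chain) {
            if (current == null) {
                return null;
            }
            current = step.apply(current);
        }
        return current;
    }

    // Runs the chain over a batch, keeping only the records that survive.
    static <R> List<R> applyAll(List<UnaryOperator<R>> chain, List<R> records) {
        List<R> out = new ArrayList<>();
        for (R record : records) {
            R result = applyChain(chain, record);
            if (result != null) {
                out.add(result);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        chain.add(s -> s.isEmpty() ? null : s); // filter: drop empty records
        chain.add(s -> "kafka_" + s);           // map: rewrite the record

        List<String> input = new ArrayList<>();
        input.add("a");
        input.add("");
        input.add("b");
        System.out.println(applyAll(chain, input));
    }
}
```

Because every step is one-in, at-most-one-out, the runtime needs no
buffering or state, which is arguably what keeps the rest of the framework
unaffected.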

I think the main open questions for me are:

a) Is TransformableRecord worth it to avoid reimplementing small bits of
code (it allows for a single implementation of the interface to trivially
apply to both Source and SinkRecords). I think I prefer this, but it does
come with some commitment to another interface on top of ConnectRecord. We
could alternatively modify ConnectRecord which would require fewer changes.
b) How do folks feel about built-in transformations and the set that are
mentioned here? This brings us way back to the discussion of built-in
connectors. Transformations, especially when intended to be lightweight and
touch nothing besides the data already in the record, seem different from
connectors -- there might be quite a few, but hopefully limited. Since we
(hopefully) already factor out most serialization-specific stuff via
Converters, I think we can keep this pretty limited. That said, I have no
doubt some folks will (in my opinion) abuse this feature to do data
enrichment by querying external systems, so building a bunch of
transformations in could potentially open the floodgates, or at least make
decisions about what is included vs what should be 3rd party muddy.

-Ewen


On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <sh...@confluent.io>
wrote:

> Hi all,
>
> I have another iteration at a proposal for this feature here:
> https://cwiki.apache.org/confluence/display/KAFKA/
> Connect+Transforms+-+Proposed+Design
>
> I'd welcome your feedback and comments.
>
> Thanks,
>
> Shikhar
>
> On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <ew...@confluent.io>
> wrote:
>
> On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <sh...@confluent.io>
> wrote:
>
> > >
> > >
> > > Hmm, operating on ConnectRecords probably doesn't work since you need
> to
> > > emit the right type of record, which might mean instantiating a new
> one.
> > I
> > > think that means we either need 2 methods, one for SourceRecord, one
> for
> > > SinkRecord, or we'd need to limit what parts of the message you can
> > modify
> > > (e.g. you can change the key/value via something like
> > > transformKey(ConnectRecord) and transformValue(ConnectRecord), but
> other
> > > fields would remain the same and the fmwk would handle allocating new
> > > Source/SinkRecords if needed)
> > >
> >
> > Good point, perhaps we could add an abstract method on ConnectRecord that
> > takes all the shared fields as parameters and the implementations return
> a
> > copy of the narrower SourceRecord/SinkRecord type as appropriate.
> > Transformers would only operate on ConnectRecord rather than caring about
> > SourceRecord or SinkRecord (in theory they could instanceof/cast, but the
> > API should discourage it)
> >
> >
> > > Is there a use case for hanging on to the original? I can't think of a
> > > transformation where you'd need to do that (or couldn't just order
> things
> > > differently so it isn't a problem).
> >
> >
> > Yeah maybe this isn't really necessary. No strong preference here.
> >
> > That said, I do worry a bit that farming too much stuff out to
> transformers
> > > can result in "programming via config", i.e. a lot of the simplicity
> you
> > > get from Connect disappears in long config files. Standardization would
> > be
> > > nice and might just avoid this (and doesn't cost that much implementing
> > it
> > > in each connector), and I'd personally prefer something a bit less
> > flexible
> > > but consistent and easy to configure.
> >
> >
> > Not sure what you're suggesting :-) Standardized config properties for
> > a small set of transformations, leaving it up to connectors to integrate?
> >
>
> I just mean that you get to the point where you're practically writing a
> Kafka Streams application, you're just doing it through either an
> incredibly convoluted set of transformers and configs, or a single
> transformer with incredibly convoluted set of configs. You basically get to
> the point where your config is a mini DSL and you're not really saving
> that much.
>
> The real question is how much we want to venture into the "T" part of ETL.
> I tend to favor minimizing how much we take on since the rest of Connect
> isn't designed for it, it's designed around the E & L parts.
>
> -Ewen
>
>
> > Personally I'm skeptical of that level of flexibility in transformers --
> > > it's getting awfully complex and certainly takes us pretty far from
> > "config
> > > only" realtime data integration. It's not clear to me what the use
> cases
> > > are that aren't covered by a small set of common transformations that
> can
> > > be chained together (e.g. rename/remove fields, mask values, and maybe
> a
> > > couple more).
> > >
> >
> > I agree that we should have some standard transformations that we ship
> with
> > connect that users would ideally lean towards for routine tasks. The ones
> > you mention are some good candidates where I'd imagine can expose simple
> > config, e.g.
> >    transform.filter.whitelist=x,y,z # filter to a whitelist of fields
> >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> >    topic.rename.replace=-/_
> >    topic.rename.prefix=kafka_
> > etc..
> >
> > However the ecosystem will invariably have more complex transformers if
> we
> > make this pluggable. And because ETL is messy, that's probably a good
> thing
> > if folks are able to do their data munging orthogonally to connectors, so
> > that connectors can focus on the logic of how data should be copied
> from/to
> > datastores and Kafka.
> >
> >
> > > In any case, we'd probably also have to change configs of connectors if
> > we
> > > allowed configs like that since presumably transformer configs will
> just
> > be
> > > part of the connector config.
> > >
> >
> > Yeah, haven't thought much about how all the configuration would tie
> > together...
> >
> > I think we'd need the ability to:
> > - spec transformer chain (fully-qualified class names? perhaps special
> > aliases for built-in ones? perhaps third-party fqcns can be assigned
> > aliases by users in the chain spec, for easier configuration and to
> > uniquely identify a transformation when it occurs more than one time in a
> > chain?)
> > - configure each transformer -- all properties prefixed with that
> > transformer's ID (fqcn / alias) get destined to it
> >
> > Additionally, I think we would probably want to allow for topic-specific
> > overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you
> > want
> > certain transformations for one topic, but different ones for another...)
> >
>
>
>
> --
> Thanks,
> Ewen
>
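
The configuration idea quoted above (name a chain of transformers by alias,
then route every property prefixed with an alias to that transformer) could
be sketched roughly as follows. The key names used here (`transforms`,
`transforms.<alias>.`) and the helper `configFor` are hypothetical choices
for this example; the thread does not settle on a naming scheme.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrefixedConfigSketch {
    // Returns the entries whose key starts with "transforms.<alias>.",
    // with that prefix stripped (hypothetical key layout).
    static Map<String, String> configFor(String alias, Map<String, String> connectorConfig) {
        String prefix = "transforms." + alias + ".";
        Map<String, String> scoped = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : connectorConfig.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                scoped.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return scoped;
    }

    public static void main(String[] args) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("transforms", "rename,prefix"); // chain spec, in order
        config.put("transforms.rename.spec", "oldName1=>newName1");
        config.put("transforms.prefix.value", "kafka_");

        // Each alias in the chain sees only its own, un-prefixed properties.
        for (String alias : config.get("transforms").split(",")) {
            System.out.println(alias.trim() + " -> " + configFor(alias.trim(), config));
        }
    }
}
```

Using aliases rather than class names as the prefix also lets the same
transformer appear more than once in a chain with different settings.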