
Posted to users@kafka.apache.org by Koert Kuipers <ko...@tresata.com> on 2017/01/30 16:24:06 UTC

kafka connect architecture

i have been playing with kafka connect in standalone and distributed mode.

i like standalone because:
* i get to configure it using a file. this is easy for automated deployment
(chef, puppet, etc.). configuration using a rest api i find inconvenient.
* errors show up in log files instead of having to retrieve them using a
rest api. same argument as previous bullet point really. i know how to
automate log monitoring. rest api isnt great for this.
* isolation of connector classes. every connector has its own jvm. no jar
dependency hell.

i like distributed because:
* well its fault tolerant and can distribute workload

so this makes me wonder... how hard would it be to get the
"connect-standalone" setup where each connector has its own service(s),
configuration is done using files, and errors are written to logs, yet at
the same time i can spin up multiple services for a connector and they form
a group? and while we are at it also remove the rest api entirely, since i
dont need it, it poses a security risk, and it makes it hard to spin up
multiple connectors on same box. with such a setup i could simply deploy as
many services as i need for a connector, using either chef, or perhaps
slider on yarn, or whatever framework i need.

this is related to KAFKA-3815
<https://issues.apache.org/jira/browse/KAFKA-3815> which makes similar
arguments for container deployments

Re: kafka connect architecture

Posted by Stephane Maarek <st...@simplemachines.com.au>.

So a bit of feedback as well.

I wish Kafka Connect worked the following way (just a proposal):

You send a configuration which points to a class but also a version for
that class (connector). Kafka Connect then has some sort of capability to
pull that class from a dependency repository and isolate the connector in
its own JVM. This guarantees that configs run against the expected classes
and solves backward and forward compatibility issues. It also saves the
hassle of bundling jars with Connect.


Right now the major pain point (on top of the shared classpath) is that to
add a new connector I need to pause or delete any existing connectors,
include the new or updated connector in my Docker image for Connect, and
then do a rolling restart of all my Connect boxes. After that I push an
updated configuration if need be. That's too long and too manual. It also
impacts the pipeline with some downtime.


Another model worth looking at is the spark model. Submitting jars to a
cluster is simple from a deployment perspective, and the spark cluster
represents the API and infrastructure whereas the jar represents the config
and capabilities.

Let me know your thoughts; there may be disadvantages.

On 31 Jan. 2017 5:56 pm, "Ewen Cheslack-Postava" <ew...@confluent.io> wrote:

> On Mon, Jan 30, 2017 at 8:24 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> > i have been playing with kafka connect in standalone and distributed
> mode.
> >
> > i like standalone because:
> > * i get to configure it using a file. this is easy for automated
> deployment
> > (chef, puppet, etc.). configuration using a rest api i find inconvenient.
> >
>
> What exactly is inconvenient? The orchestration tools you mention all have
> built-in tooling to make REST requests. In fact, you could pretty easily
> take a config file you could use with standalone mode and convert it into
> the JSON payload for the REST API and simply make that request. If the
> connector already exists with the same config, it shouldn't have any effect
> on the cluster -- it's just a noop re-registration.
>
>
> > * errors show up in log files instead of having to retrieve them using a
> > rest api. same argument as previous bullet point really. i know how to
> > automate log monitoring. rest api isnt great for this.
> >
>
> If you run in distributed mode, you probably also want to collect log files
> somehow. The errors still show up in log files, they are just spread across
> multiple nodes so you may need to collect them to put them in a central
> location. (Hint: connect can do this :))
>
>
> > * isolation of connector classes. every connector has its own jvm. no jar
> > dependency hell.
> >
>
> Yup, this is definitely a pain point. We're looking into classpath
> isolation in a subsequent release (won't be in AK 0.10.2.0/CP 3.2.0, but I
> am hoping it will be in AK 0.10.3.0/CP3.3.0).
>
>
> >
> > i like distributed because:
> > * well its fault tolerant and can distribute workload
> >
> > so this makes me wonder... how hard would it be to get the
> > "connect-standalone" setup where each connector has its own service(s),
> > configuration is done using files, and errors are written to logs, yet at
> > the same time i can spin up multiple services for a connector and they
> form
> > a group? and while we are at it also remove the rest api entirely, since
> i
> > dont need it, it poses a security risk, and it makes it hard to spin up
> > multiple connectors on same box. with such a setup i could simply deploy
> as
> > many services as i need for a connector, using either chef, or perhaps
> > slider on yarn, or whatever framework i need.
> >
>
> A distributed mode driven by config files is possible and something that's
> been brought up before, although it does have some complicating factors. Doing
> a rolling bounce of such a service gets tricky in the face of failures as
> you might have old & new versions of the app starting simultaneously (i.e.
> it becomes difficult to figure out which config to trust).
>
> As to removing the REST API in some cases, I guess I could imagine doing
> it, but in practice you should probably just lock down access by never
> allowing access to that port. If you're worried about security, you should
> have all ports disabled by default; if you don't want to provide access to
> the REST API, simply don't enable access to it.
>
> -Ewen
>
>
> >
> > this is related to KAFKA-3815
> > <https://issues.apache.org/jira/browse/KAFKA-3815> which makes similar
> > arguments for container deployments
> >
>

Re: kafka connect architecture

Posted by Koert Kuipers <ko...@tresata.com>.

see inline.
best, koert

On Tue, Jan 31, 2017 at 1:56 AM, Ewen Cheslack-Postava <ew...@confluent.io>
wrote:

> On Mon, Jan 30, 2017 at 8:24 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> > i have been playing with kafka connect in standalone and distributed
> mode.
> >
> > i like standalone because:
> > * i get to configure it using a file. this is easy for automated
> deployment
> > (chef, puppet, etc.). configuration using a rest api i find inconvenient.
> >
>
> What exactly is inconvenient? The orchestration tools you mention all have
> built-in tooling to make REST requests. In fact, you could pretty easily
> take a config file you could use with standalone mode and convert it into
> the JSON payload for the REST API and simply make that request. If the
> connector already exists with the same config, it shouldn't have any effect
> on the cluster -- it's just a noop re-registration.
>

it is true that chef, for example, has some built-in support for REST, but
it is not nearly as well developed as its config file (template) framework.
i expect the same for other tools (but dont know). also, with files security
is already solved: a tool like chef runs as a user with admin privileges to
modify these files, and permissions can trivially be set so that only a
limited set of users can read them. all this is much harder with a REST API.



>
> > * errors show up in log files instead of having to retrieve them using a
> > rest api. same argument as previous bullet point really. i know how to
> > automate log monitoring. rest api isnt great for this.
> >
>
> If you run in distributed mode, you probably also want to collect log files
> somehow. The errors still show up in log files, they are just spread across
> multiple nodes so you may need to collect them to put them in a central
> location. (Hint: connect can do this :))
>
my experience so far with errors in connectors was that they did not show
up in the log of the distributed connect service. only by going to the rest
api endpoint for the status of the connector (GET /connectors/<name>/status)
could i get the error. perhaps i have to adjust my logging settings.
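
One way to bridge the gap described above is to poll the status endpoint and
write any failures into an ordinary log that existing log monitoring already
watches. The following is only a rough sketch: the function name
failure_log_lines and the log line layout are made up for illustration, and
it assumes the status document has the usual shape (a "connector" object and
a "tasks" list, each with "state", "worker_id" and, on failure, "trace").

```python
def failure_log_lines(status):
    """Return one log line per FAILED connector/task in a status document.

    `status` is the parsed JSON body of GET /connectors/<name>/status.
    The output format is an arbitrary choice for this sketch.
    """
    name = status.get("name")
    # Treat the connector itself and each task uniformly.
    entries = [("connector", status.get("connector", {}))]
    entries += [("task %s" % t.get("id"), t) for t in status.get("tasks", [])]

    lines = []
    for label, entry in entries:
        if entry.get("state") != "FAILED":
            continue
        # Keep only the first line of the stack trace for the log line.
        trace = entry.get("trace", "").splitlines()
        first = trace[0] if trace else "no trace reported"
        lines.append("%s %s FAILED on %s: %s"
                     % (name, label, entry.get("worker_id"), first))
    return lines
```

The status document could be fetched with something like
json.load(urllib.request.urlopen(url)), where the host and port of the
worker are deployment specific, and the returned lines handed to whatever
logger the monitoring setup already tails.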


>
> > * isolation of connector classes. every connector has its own jvm. no jar
> > dependency hell.
> >
>
> Yup, this is definitely a pain point. We're looking into classpath
> isolation in a subsequent release (won't be in AK 0.10.2.0/CP 3.2.0, but I
> am hoping it will be in AK 0.10.3.0/CP3.3.0).
>
>
> >
> > i like distributed because:
> > * well its fault tolerant and can distribute workload
> >
> > so this makes me wonder... how hard would it be to get the
> > "connect-standalone" setup where each connector has its own service(s),
> > configuration is done using files, and errors are written to logs, yet at
> > the same time i can spin up multiple services for a connector and they
> form
> > a group? and while we are at it also remove the rest api entirely, since
> i
> > dont need it, it poses a security risk, and it makes it hard to spin up
> > multiple connectors on same box. with such a setup i could simply deploy
> as
> > many services as i need for a connector, using either chef, or perhaps
> > slider on yarn, or whatever framework i need.
> >
>
> A distributed mode driven by config files is possible and something that's
> been brought up before, although it does have some complicating factors. Doing
> a rolling bounce of such a service gets tricky in the face of failures as
> you might have old & new versions of the app starting simultaneously (i.e.
> it becomes difficult to figure out which config to trust).
>

i didnt think about this too much. indeed my plan was to simply bounce all
the services for a particular connector at the same time, and accept
downtime for the given connector. i could do a rolling restart if i am ok
with a mix of old and new running at same time, which might be acceptable
for minor fixes.

how does kafka streams handle this?



>
> As to removing the REST API in some cases, I guess I could imagine doing
> it, but in practice you should probably just lock down access by never
> allowing access to that port. If you're worried about security, you should
> have all ports disabled by default; if you don't want to provide access to
> the REST API, simply don't enable access to it.
>
> -Ewen
>
>
> >
> > this is related to KAFKA-3815
> > <https://issues.apache.org/jira/browse/KAFKA-3815> which makes similar
> > arguments for container deployments
> >
>

Re: kafka connect architecture

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.

On Mon, Jan 30, 2017 at 8:24 AM, Koert Kuipers <ko...@tresata.com> wrote:

> i have been playing with kafka connect in standalone and distributed mode.
>
> i like standalone because:
> * i get to configure it using a file. this is easy for automated deployment
> (chef, puppet, etc.). configuration using a rest api i find inconvenient.
>

What exactly is inconvenient? The orchestration tools you mention all have
built-in tooling to make REST requests. In fact, you could pretty easily
take a config file you could use with standalone mode and convert it into
the JSON payload for the REST API and simply make that request. If the
connector already exists with the same config, it shouldn't have any effect
on the cluster -- it's just a noop re-registration.
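
A minimal sketch of that conversion, assuming a simple key=value properties
file with no line continuations or escape sequences. The function name
properties_to_connector_payload is made up for this example; the payload
shape (a "name" plus a "config" map) is the one accepted by POST /connectors.

```python
import json


def properties_to_connector_payload(text):
    """Build a REST payload dict from standalone-style connector properties.

    Only plain `key=value` lines are handled; blank lines and comment lines
    starting with '#' or '!' are skipped. Real .properties files allow more
    syntax than this sketch supports.
    """
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line; ignored in this sketch
        config[key.strip()] = value.strip()
    # The connector name is a required key in a standalone connector file.
    return {"name": config["name"], "config": config}


example = """\
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test
"""
payload_json = json.dumps(properties_to_connector_payload(example), indent=2)
```

The resulting payload_json could then be sent by whatever HTTP support the
orchestration tool provides, so the file stays the single source of truth
and the REST call becomes a mechanical last step.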

> * errors show up in log files instead of having to retrieve them using a
> rest api. same argument as previous bullet point really. i know how to
> automate log monitoring. rest api isnt great for this.
>

If you run in distributed mode, you probably also want to collect log files
somehow. The errors still show up in log files, they are just spread across
multiple nodes so you may need to collect them to put them in a central
location. (Hint: connect can do this :))

> * isolation of connector classes. every connector has its own jvm. no jar
> dependency hell.
>

Yup, this is definitely a pain point. We're looking into classpath
isolation in a subsequent release (won't be in AK 0.10.2.0/CP 3.2.0, but I
am hoping it will be in AK 0.10.3.0/CP3.3.0).

>
> i like distributed because:
> * well its fault tolerant and can distribute workload
>
> so this makes me wonder... how hard would it be to get the
> "connect-standalone" setup where each connector has its own service(s),
> configuration is done using files, and errors are written to logs, yet at
> the same time i can spin up multiple services for a connector and they form
> a group? and while we are at it also remove the rest api entirely, since i
> dont need it, it poses a security risk, and it makes it hard to spin up
> multiple connectors on same box. with such a setup i could simply deploy as
> many services as i need for a connector, using either chef, or perhaps
> slider on yarn, or whatever framework i need.
>

A distributed mode driven by config files is possible and something that's
been brought up before, although it does have some complicating factors. Doing
a rolling bounce of such a service gets tricky in the face of failures as
you might have old & new versions of the app starting simultaneously (i.e.
it becomes difficult to figure out which config to trust).

As to removing the REST API in some cases, I guess I could imagine doing
it, but in practice you should probably just lock down access by never
allowing access to that port. If you're worried about security, you should
have all ports disabled by default; if you don't want to provide access to
the REST API, simply don't enable access to it.

-Ewen

>
> this is related to KAFKA-3815
> <https://issues.apache.org/jira/browse/KAFKA-3815> which makes similar
> arguments for container deployments
>