Posted to users@kafka.apache.org by Phillip Mann <pm...@trulia.com> on 2017/01/27 18:49:08 UTC

Ideal value for Kafka Connect Distributed tasks.max configuration setting?

I am looking to productionize and deploy my Kafka Connect application. However, I have two questions about the tasks.max setting, which is required and of high importance, but the documentation is vague about what to actually set this value to.

My simplest question then is as follows: If I have a topic with n partitions that I wish to consume data from and write to some sink (in my case, I am writing to S3), what should I set tasks.max to? Should I set it to n? Should I set it to 2n? Intuitively it seems that I'd want to set the value to n and that's what I've been doing.

What if I change my Kafka topic and increase the number of partitions on it? If I set tasks.max to n, will I have to pause my Kafka connector and increase tasks.max? If I set it to 2n, will my connector automatically increase its parallelism?

Thanks for your help!

Re: Ideal value for Kafka Connect Distributed tasks.max configuration setting?

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
On Fri, Jan 27, 2017 at 10:49 AM, Phillip Mann <pm...@trulia.com> wrote:

> I am looking to productionize and deploy my Kafka Connect application.
> However, I have two questions about the tasks.max setting, which is
> required and of high importance, but the documentation is vague about
> what to actually set this value to.
>
> My simplest question then is as follows: If I have a topic with n
> partitions that I wish to consume data from and write to some sink (in my
> case, I am writing to S3), what should I set tasks.max to? Should I set it
> to n? Should I set it to 2n? Intuitively it seems that I'd want to set the
> value to n and that's what I've been doing.
>

For sink connectors, you cannot get any more parallelism than the # of
topic partitions. While you can set the tasks.max larger than that, it will
not help.

However, you don't need to set tasks.max to the number of topic partitions.
tasks.max maps directly to the # of threads you have processing data. If
the throughput for a topic is low, you might want to set this to a small
number, possibly even 1. If the throughput for each topic partition is
high, you might need n tasks just to keep up.
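That sizing logic can be sketched as a small helper. This is an illustrative heuristic, not anything from Kafka Connect itself; the per-task capacity figure is an assumption you would measure for your own connector and sink.

```python
import math

def choose_tasks_max(num_partitions: int,
                     topic_throughput_mbps: float,
                     per_task_capacity_mbps: float) -> int:
    """Suggest a tasks.max for a sink connector.

    Parallelism is capped at the partition count; below that cap, size
    the task count to the expected throughput. per_task_capacity_mbps is
    a measured assumption, not a Kafka-provided number.
    """
    needed = math.ceil(topic_throughput_mbps / per_task_capacity_mbps)
    return max(1, min(num_partitions, needed))

# Low-throughput topic: one task is plenty, even with 12 partitions.
print(choose_tasks_max(12, topic_throughput_mbps=5.0,
                       per_task_capacity_mbps=10.0))    # 1
# High-throughput topic: capped at the partition count.
print(choose_tasks_max(12, topic_throughput_mbps=500.0,
                       per_task_capacity_mbps=10.0))    # 12
```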

It's hard to give a definitive answer here because the answer is really
just that "it depends". You can do some amount of capacity planning ahead
of time, but the best approach is to monitor how far you are lagging behind
the input data. If there is lag, increase the # of tasks. If you're not
even utilizing the resources you provided, perhaps you can scale back.
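A minimal sketch of that monitor-then-adjust loop, assuming you already collect per-partition consumer lag (e.g. from the connector's consumer group). The thresholds are made-up illustrations to be tuned for your own SLAs:

```python
def scaling_hint(partition_lags: list[int],
                 current_tasks: int,
                 lag_threshold: int = 100_000) -> str:
    """Suggest a tasks.max adjustment from consumer-lag samples.

    partition_lags: one lag value (messages behind) per topic partition.
    lag_threshold: illustrative cutoff, not a Kafka default.
    """
    num_partitions = len(partition_lags)
    if max(partition_lags) > lag_threshold and current_tasks < num_partitions:
        return "increase"   # falling behind and partition headroom remains
    if sum(partition_lags) < lag_threshold // 10 and current_tasks > 1:
        return "decrease"   # barely any lag; likely over-provisioned
    return "hold"

print(scaling_hint([200_000, 5_000, 1_000], current_tasks=1))  # increase
print(scaling_hint([100, 200, 50], current_tasks=3))           # decrease
```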


>
> What if I change my Kafka topic and increase the number of partitions on
> it? If I set tasks.max to n, will I have to pause my Kafka connector and
> increase tasks.max? If I set it to 2n, will my connector automatically
> increase its parallelism?
>

You don't need to explicitly pause the connector. You can update the
configuration dynamically and the cluster will sort out the pause/restart
process automatically.
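Concretely, you can push a new config through the Connect REST API's `PUT /connectors/{name}/config` endpoint and let the cluster rebalance tasks on its own. A sketch with the standard library; the connector class, topic name, and values below are hypothetical examples:

```python
import json
import urllib.request

def update_connector_config(connect_url: str, connector: str,
                            config: dict) -> bytes:
    """PUT the full connector config to the Connect REST API.

    The worker cluster handles stopping and restarting tasks itself;
    no explicit pause call is needed.
    """
    req = urllib.request.Request(
        f"{connect_url}/connectors/{connector}/config",
        data=json.dumps(config).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Hypothetical payload bumping tasks.max after a partition increase.
new_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",     # hypothetical topic name
    "tasks.max": "8",       # raised to match the new partition count
}
# update_connector_config("http://localhost:8083", "s3-sink", new_config)
```

Note that all values in a connector config are strings, including tasks.max.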

-Ewen


>
> Thanks for your help!
>