Posted to users@kafka.apache.org by Barry Kaplan <bk...@memelet.com> on 2016/06/10 01:07:17 UTC

Rate that connect delivers messages

I am running a connect consumer that receives JSON records and indexes them
into elasticsearch. The consumer is pushing out 300 messages/s into a topic
with a single partition. The connect job is configured with 1 task. (This
is all for testing.)

What I see is that put is called about every 10s with about 1500 records.
It takes about 1.5 seconds of wall time to complete the indexing of those
records into elasticsearch. But then the task waits another 10s for the
next batch from kafka connect.

Is there some kind of consumer throttling happening? I cannot find any
settings that would tell connect to deliver messages faster or in larger
batches.

I can of course run with more partitions and more tasks, but still, kafka
connect should be able to deliver messages to the task orders of magnitude
faster than elasticsearch can index them.

Re: Rate that connect delivers messages

Posted by Barry Kaplan <bk...@memelet.com>.
Ok, definitely a dev box problem (network for sure). I moved the process
from my dev box to the mesos cluster and the delay between puts is now
60ms.

On Fri, Jun 10, 2016 at 10:32 AM, Barry Kaplan <bk...@memelet.com> wrote:

> Hmm, well CPU is pretty much zero. Heap is barely used. I even made the
> task put method be a noop other than to log time-since-last call. No
> change. With yourkit I see that ES has a thread that is sleeping, but it's
> in a monitor thread pool and clearly not blocking kafka. Anyway, I even
> removed all code except logging and still it takes 10s to deliver just 1500
> messages.
>
> I would say maybe it's the network, but the writing to ES is quite fast
> and the data coming in from kafka is exactly what goes out to ES.
>
> I'll keep investigating.
>

Re: Rate that connect delivers messages

Posted by Barry Kaplan <bk...@memelet.com>.
Hmm, well CPU is pretty much zero. Heap is barely used. I even made the
task put method be a noop other than to log time-since-last call. No
change. With yourkit I see that ES has a thread that is sleeping, but it's
in a monitor thread pool and clearly not blocking kafka. Anyway, I even
removed all code except logging and still it takes 10s to deliver just 1500
messages.

I would say maybe it's the network, but the writing to ES is quite fast and
the data coming in from kafka is exactly what goes out to ES.

I'll keep investigating.
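
For anyone wanting to repeat the "log time since last call" measurement
described above, here is a minimal sketch of the timing part only. The class
name PutIntervalTracker and its record method are hypothetical and not part of
Kafka Connect; the assumption is that a sink task would call record once at the
top of its put method and log the returned value.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: tracks the gap between successive calls, e.g. to a sink task's put(). */
public class PutIntervalTracker {
    private boolean seenCall = false; // false until the first call is recorded
    private long lastCallNanos;       // monotonic timestamp of the previous call

    /**
     * Records a call at the given monotonic timestamp (e.g. System.nanoTime())
     * and returns the gap since the previous call in milliseconds,
     * or -1 if this is the first call.
     */
    public long record(long nowNanos) {
        long gapMs = seenCall
                ? TimeUnit.NANOSECONDS.toMillis(nowNanos - lastCallNanos)
                : -1L;
        seenCall = true;
        lastCallNanos = nowNanos;
        return gapMs;
    }
}
```

Passing the timestamp in, rather than reading the clock inside the helper,
keeps it easy to check with fixed values. In a task it would be called as
record(System.nanoTime()), with the result logged next to the batch size.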

Re: Rate that connect delivers messages

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Barry,

It might help to know whether you're hitting a (single threaded) CPU limit
or if the bottleneck is elsewhere. Also, how large on average are the
messages you are consuming? There's nothing that'll force batching like
you're talking about. You can tweak any consumer settings via worker-level
config overrides (see
http://docs.confluent.io/3.0.0/connect/userguide.html#overriding-producer-consumer-settings)
if the defaults aren't working well for you for some reason. 10s sounds
quite long, so I suspect there's some other bottleneck or issue that's
causing it to take so long -- by default consumer fetch requests should
return immediately if any data is available, and even if you increase
fetch.min.bytes, the longest it waits by default is 500ms as defined by
fetch.max.wait.ms.

-Ewen
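
The worker-level overrides mentioned above can be sketched as a config
fragment. This assumes the "consumer." prefix convention described in the
linked user guide. The values shown are the stock consumer defaults, so the
fragment only illustrates where such overrides would go, not a recommended
tuning.

```properties
# Connect worker config (the worker .properties file); sketch only.
# Settings prefixed with "consumer." are passed to the consumers used by sink tasks.

# Minimum bytes the broker should accumulate before answering a fetch.
# 1 means "respond as soon as any data is available".
consumer.fetch.min.bytes=1

# Upper bound on how long the broker waits to satisfy fetch.min.bytes.
consumer.fetch.max.wait.ms=500
```

With these defaults a fetch should not wait anywhere near 10s, which is why a
delay of that size points at something other than consumer batching settings.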

On Thu, Jun 9, 2016 at 7:06 PM Barry Kaplan <bk...@memelet.com> wrote:

> I am running a connect consumer that receives JSON records and indexes them
> into elasticsearch. The consumer is pushing out 300 messages/s into a topic
> with a single partition. The connect job is configured with 1 task. (This
> is all for testing.)
>
> What I see is that put is called about every 10s with about 1500 records.
> It takes about 1.5 seconds of wall time to complete the indexing of those
> records into elasticsearch. But then the task waits another 10s for the
> next batch from kafka connect.
>
> Is there some kind of consumer throttling happening? I cannot find any
> settings that would tell connect to deliver messages faster or in larger
> batches.
>
> I can of course run with more partitions and more tasks, but still, kafka
> connect should be able to deliver messages to the task orders of magnitude
> faster than elasticsearch can index them.
>