You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by "James A. Robinson" <ji...@stanford.edu> on 2012/07/09 23:07:27 UTC

question about threaded vs. serial consumers

Hi folks,

Reading through the Java examples for Kafka consumers, it appeared to
me as though I have the following options when it comes to setting up
consumers for a topic with more than one partition:

  (a) Write a consumer that kicks off N threads, one thread per
      partition, to read messages from the broker in parallel.

  (b) Write a single threaded consumer and run N different JVMs, each
      reading one of N partitions.

  (c) Write a single threaded consumer that iterates through each
      partition, using a consumer.timeout.ms setting to trigger a
      timeout so that I don't block forever on a single partition that
      is not being written to.

Are there are options?  Is there a way to set up a non-blocking poll
of a KafkaStream to see whether or not it has a message available?

Does anyone actually do either (b) or (c), or is (a) really the only
realistic option?  Is it possible/likely that a topic with N
partitions might have one partition that is never updated?

Jim

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
James A. Robinson                       jim.robinson@stanford.edu
Stanford University HighWire Press      http://highwire.stanford.edu/
+1 650 7237294 (Work)                   +1 650 7259335 (Fax)

Re: question about threaded vs. serial consumers

Posted by Jay Kreps <ja...@gmail.com>.
That is correct, you have exactly those options.

People do both single threaded multi-jvm consumers, multi-threaded
multi-jvm consumers, and single-threaded single jvm consumers. There is not
particular advantage to one or the other unless the processors need to
share data of some sort (in which case threads may be more convenient).

We do not have the equivalent of the unix poll/select api over partitions
or topics. That would be a good thing for us to add.

Load balancing is up to the producer (messages are partitioned based on a
key or randomly), so whether it is possible for a partition to not get any
events is somewhat situation dependent.

-Jay

On Mon, Jul 9, 2012 at 2:07 PM, James A. Robinson <jim.robinson@stanford.edu
> wrote:

> Hi folks,
>
> Reading through the Java examples for Kafka consumers, it appeared to
> me as though I have the following options when it comes to setting up
> consumers for a topic with more than one partition:
>
>   (a) Write a consumer that kicks off N threads, one thread per
>       partition, to read messages from the broker in parallel.
>
>   (b) Write a single threaded consumer and run N different JVMs, each
>       reading one of N partitions.
>
>   (c) Write a single threaded consumer that iterates through each
>       partition, using a consumer.timeout.ms setting to trigger a
>       timeout so that I don't block forever on a single partition that
>       is not being written to.
>
> Are there are options?  Is there a way to set up a non-blocking poll
> of a KafkaStream to see whether or not it has a message available?
>
> Does anyone actually do either (b) or (c), or is (a) really the only
> realistic option?  Is it possible/likely that a topic with N
> partitions might have one partition that is never updated?
>
> Jim
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> James A. Robinson                       jim.robinson@stanford.edu
> Stanford University HighWire Press      http://highwire.stanford.edu/
> +1 650 7237294 (Work)                   +1 650 7259335 (Fax)
>

Re: question about threaded vs. serial consumers

Posted by Joel Koshy <jj...@gmail.com>.
Hi Jim,

If you use a single machine and configure your consumer with this
topic-count: "topic:N" you will get behavior (a). If N < available
partitions some threads will be allocated more partitions than the others.
If you use more than one machine (or JVMs) and have each of those consumers
configured with topic:X, each of your consumer instances will get that many
active threads (i.e., if there are enough partitions to share).

For (c): if a partition is not getting written to then yes the thread that
is allocated to it will block (until consumer.timeout.ms) - and it needs to
since new data may come to that partition at any time. BTW, if you are
asking about a single topic then this question would only arise if you are
using a custom partitioning scheme on the producer side since the default
partitioner (which is random) (should) ensure that all partitions would get
data eventually. However, if you are consuming multiple topics then it is
quite possible for some topics to not receive data (if their producers
stop).

Are there are options?  Is there a way to set up a non-blocking poll
> of a KafkaStream to see whether or not it has a message available?
>

You can use the offset shell to check the current available offsets
periodically - but that is generally unnecessary. The typical set up is to
just allocate a certain number of threads for consumption - if any of the
partitions don't have any more data then they just stay idle. Also, the
broker has a retention setting (defaults to a week I think) - so partitions
that don't receive data would expire eventually and trigger a rebalance on
the consumers.

Thanks,

Joel