Posted to users@kafka.apache.org by Rafi Shamim <ra...@knewton.com> on 2015/01/06 19:45:14 UTC

Offset management in multi-threaded high-level consumer

Hello,

I would like to write a multi-threaded consumer on top of the
high-level consumer in Kafka 0.8.1. I have found two approaches that
seem feasible while keeping the guarantee that messages in a partition
are processed in order. I would appreciate any feedback this list has.

Option 1
--------
- Create multiple threads, each with its own ConsumerConnector.
- Manually commit offsets in each thread after every N messages.
- This was discussed a bit on this list previously. See [1].
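
As a rough illustration of Option 1, the Python sketch below models one
worker thread per connector, with each thread committing after every N
processed messages. FakeConnector and its commit_offsets() method are
hypothetical stand-ins for a ConsumerConnector and its commit call, not
the Kafka API, and the in-memory lists stand in for partitions.

```python
import threading


class FakeConnector:
    """Hypothetical stand-in for one ConsumerConnector that owns one stream."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.position = 0      # number of messages handed out so far
        self.committed = 0     # offset recorded by the last commit
        self.commit_calls = 0  # how many commits were issued

    def __iter__(self):
        while self.position < len(self.messages):
            msg = self.messages[self.position]
            self.position += 1
            yield msg

    def commit_offsets(self):
        # Commits everything this connector has handed out so far.
        self.committed = self.position
        self.commit_calls += 1


def worker(connector, commit_every, processed):
    """Process messages in order and commit after every `commit_every`."""
    since_commit = 0
    for msg in connector:
        processed.append(msg)  # "process" the message before committing
        since_commit += 1
        if since_commit == commit_every:
            connector.commit_offsets()
            since_commit = 0
    if since_commit:
        connector.commit_offsets()  # commit the final partial batch


def run(partitions, commit_every):
    """Start one thread per connector and wait for all of them to finish."""
    connectors = [FakeConnector(p) for p in partitions]
    outputs = [[] for _ in partitions]
    threads = [
        threading.Thread(target=worker, args=(c, commit_every, out))
        for c, out in zip(connectors, outputs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return connectors, outputs


connectors, outputs = run([range(10), range(100, 107)], commit_every=3)
print([c.commit_calls for c in connectors])
```

Because each thread commits only its own connector, and only after it
has finished processing, a commit never covers a message that another
thread is still working on.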

### Questions
- Is there a problem with making multiple ConsumerConnectors per machine?
- What does it take for ZooKeeper to handle this much load? We have a
3-node ZooKeeper cluster with relatively small machines. (I expect the
topic will have about 40 messages per second. There will be 3 consumer
groups. That would be 120 commits per second at most, but I can reduce
the frequency of commits to make this lower.)
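
The estimate above can be worked through as follows. This is a minimal
sketch: the function name and the commit intervals of 10 and 100 are
illustrative assumptions, while the 40 messages per second and 3
consumer groups come from the figures above.

```python
def commits_per_second(messages_per_second, consumer_groups, commit_every):
    """Upper bound on offset commits per second across all consumer groups."""
    return messages_per_second * consumer_groups / commit_every


# Committing after every message gives the worst case quoted above.
print(commits_per_second(40, 3, commit_every=1))    # 120.0
# Committing less often reduces the load proportionally.
print(commits_per_second(40, 3, commit_every=10))   # 12.0
print(commits_per_second(40, 3, commit_every=100))  # 1.2
```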

### Extra info
Kafka 0.9 will have an entirely different commit API, which will allow
one connection to commit offsets per partition, but I can’t wait that
long. See [2].


Option 2
--------
- Create one ConsumerConnector, but ask for multiple streams in that
connection. Give each thread one stream.
- Since there is no way to commit offsets per stream right now, we
need to use autoCommit.
- This sacrifices the at-least-once processing guarantee, which would
be nice to have. See KAFKA-1612 [3].
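
To make the at-least-once concern concrete, here is a small Python
model of what a restart looks like under each commit style. It assumes
that autoCommit records the offset of the last message handed to a
thread, whether or not the thread has finished processing it. The
after_restart() helper and the message names are hypothetical.

```python
def after_restart(messages, processed_count, committed_offset):
    """Return (lost, redelivered) for a consumer restarting at committed_offset.

    processed_count:  how many messages were fully processed before the crash
    committed_offset: how many messages the last commit marked as consumed
    """
    # Committed but never processed: skipped on restart, so they are lost.
    lost = messages[processed_count:committed_offset]
    # Processed but not yet committed: delivered again on restart.
    redelivered = messages[committed_offset:processed_count]
    return lost, redelivered


messages = ["m0", "m1", "m2", "m3", "m4"]

# autoCommit fires after m2 was handed out but before it was processed.
print(after_restart(messages, processed_count=2, committed_offset=3))

# Manual commit after every 2 processed messages; crash after processing m2.
print(after_restart(messages, processed_count=3, committed_offset=2))
```

In the first case m2 is lost; in the second it is processed twice, which
is the behavior an at-least-once guarantee allows.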

### Extra info
- There was some discussion in KAFKA-966 about a markForCommit()
method so that autoCommit would preserve the at-least-once guarantee,
but it seems more likely that the consumer API will just be redesigned
to allow commits per partition instead. See [4].


So basically I'm wondering if option 1 is feasible. If not, I'll just
do option 2. Of course, let me know if I was mistaken about anything
or if there is another design which is better.

Thanks in advance.
Rafi

[1] http://mail-archives.apache.org/mod_mbox/kafka-users/201310.mbox/%3CFF142F6B499AE34CAED4D263F6CA32901D35AE89@EXTXMB19.nam.nsroot.net%3E
[2] https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite#ClientRewrite-ConsumerAPI
[3] https://issues.apache.org/jira/browse/KAFKA-1612
[4] https://issues.apache.org/jira/browse/KAFKA-966

Re: Offset management in multi-threaded high-level consumer

Posted by Jun Rao <ju...@confluent.io>.
There isn't much difference between options 1 and 2 in terms of the
offset commit overhead on ZooKeeper. In 0.8.2, we will have Kafka-based
offset management, which is much more efficient than committing to
ZooKeeper.

Thanks,

Jun
