Posted to users@kafka.apache.org by Ron Crocker <rc...@newrelic.com> on 2016/10/04 23:33:55 UTC

Applying "single queue multi server" semantics to a Kafka topic

TL;DR: Any experience with overlaying a “single queue multi server” facade onto a Kafka topic?

(I’m new to the list and I tried searching for an answer prior to spamming everyone with this idea - sorry if I missed the answer in that search —ron)

I'm curious if anyone has run into (and solved) a similar problem to one that I'm currently seeing. 

We are using Kafka in a very straightforward fashion: a pool of producers pumps messages into a topic (with, say, 40 partitions), distributing them uniformly across partitions, and a pool of (say 20) uniformly configured consumers reads them. All of the messages (and the work associated with processing each one) are independent and fungible: there is no need for a particular message to be processed by a particular consumer.

When everything is happy, these uniformly configured consumers process all the messages. When things are sad, and in particular when some subset of consumers is performing poorly, those consumers lag on their assigned partitions. And none of the other consumers, not even the ones that are still just chugging along, can help.

But none of this is Kafka’s fault - all of this is both well understood and intentional in the design of Kafka. The work is partitioned and it is the responsibility of the consumers to keep up. My concern, though, is that there isn’t very much I can do about it when it (it == laggy consumer) happens. In particular, I see a handful of actions that I could take:
a) I can add more consumers (to say 40, 1 per partition). But this is insufficient - it doesn’t resolve the problem when a consumer can’t keep up with 1 partition.
b) I can [statically] configure how many partitions to assign to a consumer. This is insufficient, same reason as above.
c) I can ensure that my consumers have sufficient capability to process at least 1 partition - this is a nice thought, but my consumers’ performance is often influenced by external factors (e.g., database slowdowns) that impact their ability to consume and are difficult to predict ahead of time; further, these external factors tend to impact certain consumers more than others.

One thought not on that list is to give Kafka a “single queue multi server” facade - each of the consumers requests work (from the topic, not a particular partition) at the rate at which they can process it, and more capable consumers will naturally process more things. If the system lags, we can add more consumers, even weak ones, to help get the work done. This would give Kafka the properties of a single M/M/m queueing network. (For the record, I prefer to model the general case of a Kafka-based system as a collection of m independent M/M/1 queueing networks, perhaps with arrival rates proportional to the number of partitions assigned to the consumers)
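
To make the queueing comparison concrete, here is a rough sketch that puts the two models side by side using the standard M/M/1 and M/M/m (Erlang C) mean response time formulas. The 100K/sec total rate and the 20 consumers come from the numbers in this post; the per-consumer service rate of 6K/sec is purely an assumption for illustration, and the function names are made up for this sketch.

```python
from math import factorial


def mm1_response_time(arrival_rate, service_rate):
    """Mean time in system (wait + service) for one M/M/1 queue."""
    if arrival_rate >= service_rate:
        return float("inf")  # the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


def erlang_c(servers, offered_load):
    """Probability that an arriving message has to wait in an M/M/m queue."""
    utilization = offered_load / servers
    if utilization >= 1.0:
        return 1.0
    waiting_term = offered_load ** servers / factorial(servers) / (1.0 - utilization)
    idle_terms = sum(offered_load ** k / factorial(k) for k in range(servers))
    return waiting_term / (idle_terms + waiting_term)


def mmm_response_time(arrival_rate, service_rate, servers):
    """Mean time in system for one shared queue drained by `servers` consumers."""
    offered_load = arrival_rate / service_rate
    if offered_load >= servers:
        return float("inf")
    mean_wait = erlang_c(servers, offered_load) / (servers * service_rate - arrival_rate)
    return mean_wait + 1.0 / service_rate


if __name__ == "__main__":
    total_rate = 100_000.0   # messages/sec across the whole topic (from the post)
    consumers = 20           # from the post
    service_rate = 6_000.0   # ASSUMED messages/sec a healthy consumer can process

    # Today: each consumer is its own M/M/1 queue fed an equal share of the traffic.
    print("independent M/M/1:", mm1_response_time(total_rate / consumers, service_rate))
    # Facade: one shared queue, every consumer pulls from it.
    print("shared M/M/m:     ", mmm_response_time(total_rate, service_rate, consumers))
    # One degraded consumer (ASSUMED to drop to 4K/sec) can no longer keep up alone.
    print("degraded M/M/1:   ", mm1_response_time(total_rate / consumers, 4_000.0))
```

The last line is the case I care about: in the partitioned model a single consumer that falls below its share of the arrival rate lags without bound, while in the shared model only the aggregate capacity matters.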

This, much like c) above, is a nice thought but is pretty challenging. This facade needs to process all the partitions and dole out the work as it’s requested. It needs to be able to manage offsets in an unnatural-to-Kafka way. It needs to handle the partitions somewhat uniformly as well - it can’t just pull work from one partition then the next partition (or can it?). It needs to be lightweight (in terms of overhead). And it needs to be robust.
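
Two of those requirements can be sketched without touching Kafka at all. The following is a minimal in-memory illustration, not a working facade: `round_robin` shows one way of pulling from the partitions somewhat uniformly, and `ContiguousOffsetTracker` shows the offset bookkeeping needed once messages from one partition finish out of order on different consumers. Both names are hypothetical; the only Kafka fact relied on is that a committed offset means "the next offset to read".

```python
from collections import deque


class ContiguousOffsetTracker:
    """Tracks out-of-order completions for a single partition.

    Only the contiguous prefix of completed offsets is safe to commit,
    otherwise a restart would skip messages that are still in flight.
    """

    def __init__(self, start_offset=0):
        self.next_to_commit = start_offset  # Kafka-style: next offset to read
        self._done = set()                  # completed offsets past the prefix

    def complete(self, offset):
        """Record a finished message and return the offset that is safe to commit."""
        self._done.add(offset)
        while self.next_to_commit in self._done:
            self._done.remove(self.next_to_commit)
            self.next_to_commit += 1
        return self.next_to_commit


def round_robin(partitions):
    """Yield (partition, offset, message), taking one message per partition in turn.

    `partitions` maps a partition id to a list of messages and stands in for
    whatever the facade has fetched from Kafka.
    """
    pending = {p: deque(enumerate(msgs)) for p, msgs in partitions.items() if msgs}
    while pending:
        for partition in list(pending):
            offset, message = pending[partition].popleft()
            yield partition, offset, message
            if not pending[partition]:
                del pending[partition]


if __name__ == "__main__":
    work = {0: ["a", "b", "c"], 1: ["x"]}
    for partition, offset, message in round_robin(work):
        print(partition, offset, message)

    tracker = ContiguousOffsetTracker()
    for finished in (1, 0, 3, 2):  # completions arriving out of order
        print("finished", finished, "-> safe to commit", tracker.complete(finished))
```

The tracker is where the "unnatural-to-Kafka" part shows up: a slow consumer holding offset 0 blocks the commit for everything behind it on that partition, even if faster consumers have already finished the later messages.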

All that aside, has anyone managed to make such a facade work and have it process at a sufficiently high rate (in my hypothetical above, I need 100K/second cumulative across those 40 partitions - about 2.5K/sec/partition or 5K/sec/consumer)? If you have thoughts about this, feel free to share. If you think this is silly, that I’m not Kafka-ing the right way, or that there’s a brutally obvious solution that I’ve just missed, share that too.

Thanks!

Ron C.