Posted to users@kafka.apache.org by S Ahmed <sa...@gmail.com> on 2013/01/14 20:15:13 UTC

Is this a good overview of kafka?

Just want to verify that I understand the various components correctly,
please chime in where appropriate :)

producer = puts messages on broker(s)
consumer = pulls messages off a broker

In terms of how messages are organized on a cluster of brokers, producers
put messages by providing a "topic".

At the broker side of things, messages are stored by topic but can also be
logically separated by a "partition", so that all messages for a
particular topic are directed to a particular broker.

On the consumer side, when you pull messages off, I know you can dedicate
a consumer (or group of consumers) to a particular partition somehow. But
what if you wanted to just randomly pull messages off?  Say I have 3
brokers, and 5 consumers.  How does the consumer know which broker to
connect to, and co-ordinate with the other consumers?

Is there a flow diagram for the above scenario? (or any other scenario so I
can understand how the communication takes place).

Re: Is this a good overview of kafka?

Posted by Felix GV <fe...@mate1inc.com>.
Sure, I'll try to give a better explanation :)

Little disclaimer though: My knowledge is based on my reading of the Kafka
design paper <http://kafka.apache.org/design.html> more than a year ago, so
right off the bat, it's possible that I may be forgetting or assuming
things which I shouldn't... Also, Kafka was pre-0.7 at the time, and we've
been running 0.7.0-ish in prod for a while now, so it's possible that some
of my understanding is outdated in the context of 0.7.2, and a fair bit of
things definitely changed in 0.8, but I don't know what changed well enough
to make informed statements about 0.8. All that to say
that you should take your version of Kafka into account. And it certainly
doesn't hurt to read the design paper either ;)

So, my understanding is that when a Kafka broker comes online:

   - The broker contacts the ZK ensemble and registers itself. It also
   registers partitions for each of the topics that exist in ZK (according
   to the settings in its own broker config file).
   - Producers are watching the online partitions in ZK, and when that set
   changes, ZK fires off an event to them so that they can update their
   partition count. This partition count is used as a modulo on the hash
   returned by the producer's partitioning function. So even if you have a
   custom partitioning function that deterministically gives out the same hash
   for a given bucket of messages, applying a different modulo to that hash
   will of course make the messages of that bucket go to a different
   partition (see the sketch right after this list). This is done so that all
   online partitions get to have some data.
   - Consumers are also watching the online partitions in ZK. When that set
   changes, ZK fires off an event to them, and they start re-balancing, so
   that the partitions are spread as fairly as possible among the consumers.
   In that process, partitions are assigned to consumers, and those
   partition assignments could (and may very well) be different from the ones
   that were in place before the re-balance.
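
To make the modulo point concrete, here is a tiny self-contained sketch in
Java (my own illustration, not actual Kafka code) of picking a partition from
a hash and the online-partition count, and of how the same key can land on a
different partition once that count changes:

// Illustrative only: mimics hash-then-modulo partition selection, not Kafka's classes.
public class PartitionSelectionDemo {

    // Deterministic "partitioning function": hash the key, then apply the
    // current online-partition count as a modulo.
    static int selectPartition(String key, int onlinePartitionCount) {
        int hash = key.hashCode();
        return (hash & 0x7fffffff) % onlinePartitionCount; // mask keeps the value non-negative
    }

    public static void main(String[] args) {
        String key = "user-42";
        // With 4 online partitions the key maps to one partition...
        System.out.println("4 partitions -> partition " + selectPartition(key, 4));
        // ...but if a broker joins or leaves and the count becomes 3,
        // the same key can map to a different partition.
        System.out.println("3 partitions -> partition " + selectPartition(key, 3));
    }
}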

When a Kafka broker goes offline, it also affects the online partition
count, so the producers will again send their messages to different
partitions (so that all messages have somewhere to go) and the consumers
will re-balance again (to prevent starving a consumer whose partitions
became unavailable).

When a consumer goes online:

   - The consumer registers itself in ZK using its consumer group.
   - If there are other consumers watching that consumer group, then they
   will get notified and a re-balance of the whole group will be triggered,
   just like in the above case.

When a consumer goes offline, a re-balance is triggered as well for the
same reasons.
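
If it helps to picture what a re-balance actually produces, here is a rough
sketch (my own simplification, not Kafka's real re-balance code) that spreads
a sorted list of partitions over a sorted list of consumer ids, which is
roughly what the high-level consumer does at every re-balance. The class and
method names are made up for illustration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a simplified, range-style spread of partitions over consumers.
public class RebalanceSketch {

    static Map<String, List<Integer>> assign(List<Integer> partitions, List<String> consumers) {
        List<Integer> parts = new ArrayList<>(partitions);
        List<String> members = new ArrayList<>(consumers);
        Collections.sort(parts);
        Collections.sort(members); // every consumer sorts the same way, so they all agree

        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        int perConsumer = parts.size() / members.size();
        int extras = parts.size() % members.size();
        int cursor = 0;
        for (int i = 0; i < members.size(); i++) {
            int count = perConsumer + (i < extras ? 1 : 0); // first 'extras' consumers get one more
            assignment.put(members.get(i), new ArrayList<>(parts.subList(cursor, cursor + count)));
            cursor += count;
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<Integer> partitions = List.of(0, 1, 2, 3, 4, 5);
        // Before: two consumers in the group.
        System.out.println(assign(partitions, List.of("consumer-a", "consumer-b")));
        // After a third consumer joins, the re-balance hands out different assignments.
        System.out.println(assign(partitions, List.of("consumer-a", "consumer-b", "consumer-c")));
    }
}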

In the case of consumers going online or offline, this does not change the
ordering guarantees within the partitions per se. BUT, if your consumers
were keeping any sort of internal state in relation to the ordered data
they were consuming, then that state won't be relevant anymore, because
they will start consuming from different partitions after the rebalance.
Depending on the type of processing you're doing, that may or may not break
the work your consumer is doing.

Thus, the only event that has no chance of affecting the stickiness of
a (data bucket ==> consumer process) mapping is producers going online or
offline. Broker changes definitely alter which message buckets go into which
partitions. Consumer changes don't affect the content of partitions, but
they do change which consumer is consuming which partition.

If ordering guarantees are important to you, then I guess the best thing to
do might be to add watches on the same type of stuff that triggers the
changes described above, and to act accordingly when those changes happen
(by flushing the internal state, restarting the consumers, rolling back the
ZK offset to some checkpoint in the past, or whatever else is relevant in
your use case...)
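
As a rough illustration of the "add watches" idea, here is a sketch using the
plain ZooKeeper client. The /consumers/<group>/ids path is an assumption
based on my memory of the 0.7-era ZK layout, and the group name is made up,
so verify both against what your version actually registers before relying on
this:

import java.io.IOException;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: react to consumer-group membership changes so you can flush
// per-partition state before the re-balance hands you different partitions.
public class GroupMembershipWatcher implements Watcher {

    // Assumed 0.7-era layout; verify against your Kafka version.
    private static final String GROUP_IDS_PATH = "/consumers/my-group/ids";

    private final ZooKeeper zk;

    GroupMembershipWatcher(String zkConnect) throws IOException {
        this.zk = new ZooKeeper(zkConnect, 30000, this);
    }

    void watchMembers() throws KeeperException, InterruptedException {
        // Watches are one-shot in ZooKeeper, so re-register on every call.
        List<String> members = zk.getChildren(GROUP_IDS_PATH, this);
        System.out.println("Current group members: " + members);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            // A consumer joined or left: a re-balance is coming, so flush/checkpoint
            // any internal state tied to the partitions you were consuming.
            System.out.println("Membership changed, flushing per-partition state...");
            try {
                watchMembers(); // re-register the watch
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new GroupMembershipWatcher("zk1:2181").watchMembers();
        Thread.sleep(Long.MAX_VALUE); // keep the process around to receive watch events
    }
}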

Hopefully that was clear (and accurate) enough...!

--
Felix


On Mon, Jan 14, 2013 at 9:38 PM, Stan Rosenberg <st...@gmail.com> wrote:

> Hi Felix,
>
> Would you mind elaborating on what you said regarding the ordering
> guarantees; inlined below.
>
> Thanks,
>
> stan
>
> On Mon, Jan 14, 2013 at 6:08 PM, Felix GV <fe...@mate1inc.com> wrote:
>
> >
> > For example if you partitioned using a User ID field within the messages,
> > you would be guaranteed that all messages pertaining to a certain user
> > would end up in the same partition, and that they would be correctly
> > ordered. You should be aware, however, that this guarantee is only
> > maintained as long as there are no consumer re-balances (which happen
> > when adding or removing a consumer or a broker).
> >
>
> Why would consumer re-balance or broker failure alter the above partition
> invariant?
>

Re: Is this a good overview of kafka?

Posted by Stan Rosenberg <st...@gmail.com>.
Hi Felix,

Would you mind elaborating on what you said regarding the ordering
guarantees; inlined below.

Thanks,

stan

On Mon, Jan 14, 2013 at 6:08 PM, Felix GV <fe...@mate1inc.com> wrote:

>
> For example if you partitioned using a User ID field within the messages,
> you would be
> guaranteed that all messages pertaining to a certain user would end up in
> the same partition, and that they would be correctly ordered. You should be
> aware, however, that this guarantee is only maintained as long as there are
> no consumer re-balances (which happen when adding or removing a consumer or
> a broker).
>

Why would consumer re-balance or broker failure alter the above partition
invariant?

Re: Is this a good overview of kafka?

Posted by Felix GV <fe...@mate1inc.com>.
Hello,

Your (non-question) statements seem mostly right to me. There is a bit of
confusion regarding your statement about partitions, however.

Partitions are primarily used to represent the smallest unit of
parallelism. If you need to split consumption among a pool of processes,
you need at least as many partitions as consuming processes, otherwise some
of those processes will receive nothing.
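
As a trivial worked example of that sizing rule (the numbers are just
placeholders, echoing the 3 vs 5 from your question):

// Illustrative only: with fewer partitions than consumers, some consumers sit idle.
public class PartitionSizingDemo {
    public static void main(String[] args) {
        int partitions = 3;
        int consumersInGroup = 5;
        int busyConsumers = Math.min(partitions, consumersInGroup);
        System.out.println(busyConsumers + " consumers receive messages, "
                + (consumersInGroup - busyConsumers) + " receive nothing");
    }
}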

Another property of partitions is that ordering is maintained within a
partition. If your use case requires it, you can implement a custom
partitioner so that a particular field within your produced messages
determines what partition the message is sent to. For example if you
partitioned using a User ID field within the messages, you would be
guaranteed that all messages pertaining to a certain user would end up in
the same partition, and that they would be correctly ordered. You should be
aware, however, that this guarantee is only maintained as long as there are
no consumer re-balances (which happen when adding or removing a consumer or
a broker).
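
Here is a minimal sketch of that idea, assuming a hash-mod partitioner keyed
on a user ID. It is not the literal Kafka Partitioner interface (whose exact
signature differs between versions, and which you would plug in through the
producer's partitioner config); it only shows the contract that the same key
always maps to the same partition for a fixed partition count:

// Illustrative only: the contract of a key-based partitioner, not Kafka's actual interface.
public class UserIdPartitioner {

    // Given a user ID and the number of partitions, always return the same
    // partition for the same user (as long as numPartitions does not change).
    public int partition(String userId, int numPartitions) {
        return (userId.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        UserIdPartitioner p = new UserIdPartitioner();
        // All messages for "user-42" go to the same partition, so they stay
        // ordered relative to each other within that partition.
        System.out.println(p.partition("user-42", 8));
        System.out.println(p.partition("user-42", 8)); // same result every time
    }
}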

Concerning your questions:

A consumer registers for topics, not for partitions, and it always
registers under the name of a consumer group. If there is only one consumer
registered for a given topic and consumer group, then that consumer will
receive messages from every available partition within that topic. If there
are multiple consumers registered under the same consumer group for a given
topic, then they will share that topic's available partitions among
themselves, which ensures that each partition is consumed by only one
consumer.

The high-level consumer uses Zookeeper to coordinate with the other
consumers and make sure that the partitions are appropriately assigned.
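
For completeness, this is roughly what consuming as part of a group looks
like with the Java high-level consumer. I'm writing this from memory of the
0.8-era API and property names (0.7 used e.g. "zk.connect" and "groupid"
instead), so treat the exact class names, properties and generics as
assumptions to check against your version:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

// Sketch of a high-level consumer: it registers for a topic under a consumer group,
// and the ZooKeeper-based coordination decides which partitions this process gets.
public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");   // 0.7-era name: "zk.connect"
        props.put("group.id", "my-consumer-group");   // 0.7-era name: "groupid"

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Ask for one stream for the topic; the partitions assigned to this
        // consumer are multiplexed onto that stream.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("my-topic", 1));

        ConsumerIterator<byte[], byte[]> it = streams.get("my-topic").get(0).iterator();
        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> record = it.next();
            System.out.println("partition " + record.partition()
                    + " offset " + record.offset()
                    + " payload bytes " + record.message().length);
        }
    }
}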

--
Felix


On Mon, Jan 14, 2013 at 2:15 PM, S Ahmed <sa...@gmail.com> wrote:

> Just want to verify that I understand the various components correctly,
> please chime in where appropriate :)
>
> producer = puts messages on broker(s)
> consumer = pulls messages off a broker
>
> In terms of how messages are organized on a cluster of brokers, producers
> put messages by providing a "topic".
>
> At the broker side of things, messages are stored by topic but can also be
> logically separated by a "partition", so that all messages for a
> particular topic are directed to a particular broker.
>
> On the consumer side, when you pull messages off, I know you can dedicate
> a consumer (or group of consumers) to a particular partition somehow. But
> what if you wanted to just randomly pull messages off?  Say I have 3
> brokers, and 5 consumers.  How does the consumer know which broker to
> connect to, and co-ordinate with the other consumers?
>
> Is there a flow diagram for the above scenario? (or any other scenario so I
> can understand how the communication takes place).
>