You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@kafka.apache.org by "cours.systeme@gmail.com" <co...@gmail.com> on 2017/11/22 20:22:00 UTC

parallel processing of records in a Kafka consumer

I am testing a KafkaConsumer. How can I modify it to process records in parallel?

Re: parallel processing of records in a Kafka consumer

Posted by "Matthias J. Sax" <ma...@confluent.io>.

If you use "consumer groups" it is ensured that a single partitions in
processed by one consumer (and one consumer can get multiple partitions
assigned).

Thus, this work out of the box and is easier to manager than
parallelizing record processing in the consumer. Also, this does not
work if you need to scale over multiple machines.

-Matthias

On 11/23/17 11:02 AM, cours.systeme@gmail.com wrote:
> 
> 
> On 2017-11-22 23:15, "Matthias J. Sax" <ma...@confluent.io> wrote: 
>> I KafkaConsumer itself should be use single threaded. If you want to
>> parallelize processing, each thread should have it's own KafkaConsumer
>> instance and all consumers should use the same `group.id` in their
>> configuration. Load will be shared over all running consumer
>> automatically for this case.
>>
>>
>> -Matthias
>>
>> On 11/22/17 12:22 PM, cours.systeme@gmail.com wrote:
>>> I am testing a KafkaConsumer. How can I modify it to process records in parallel?
>>>
>>
>> Thank you,
> The order of the records and their offsets are important in my program. In addition, I want to parallelize for faster processing.
> What is better idea: to parallelize the record processing in the consumer, or to read from a set of partitions with a set of consumers (each consumer reads from a partition)
>

Re: parallel processing of records in a Kafka consumer

Posted by "cours.systeme@gmail.com" <co...@gmail.com>.


On 2017-11-22 23:15, "Matthias J. Sax" <ma...@confluent.io> wrote: 
> I KafkaConsumer itself should be use single threaded. If you want to
> parallelize processing, each thread should have it's own KafkaConsumer
> instance and all consumers should use the same `group.id` in their
> configuration. Load will be shared over all running consumer
> automatically for this case.
> 
> 
> -Matthias
> 
> On 11/22/17 12:22 PM, cours.systeme@gmail.com wrote:
> > I am testing a KafkaConsumer. How can I modify it to process records in parallel?
> > 
> 
> Thank you,
The order of the records and their offsets are important in my program. In addition, I want to parallelize for faster processing.
What is better idea: to parallelize the record processing in the consumer, or to read from a set of partitions with a set of consumers (each consumer reads from a partition)

Re: parallel processing of records in a Kafka consumer

Posted by "Matthias J. Sax" <ma...@confluent.io>.

Your understanding is correct.

The simplest way to get more parallelism is to increase the number of
partitions. There is some overhead for this, but it not too much.

>> what you're writing is in sharp contrast with what I know...

I guess, this target other messaging system: Kafka has a different
design that other messaging system, because Kafka also targets other use
cases than messaging. It's a stream processing platform.

This design has many advantages (and depending on the use case maybe
also disadvantages -- it's always a trade off). For example, it allows
for very high throughput that messaging systems cannot achieve.

The other case for using multiple threads per consumer is hard to get
right -- not impossible, but you need to put a lot of custom code into
place to make this work correctly. The tricky part is committing of
offsets. In Kafka, you don't commit each message individually, but a
commit of offset X, implies that all messages up to X (excluding X) for
this partition got processed successfully (X is the next offset you want
to consume). Thus, if you "branch out" to different thread after the
consumer polled message, you cannot "randomly commit" but need to put
book keeping code to make sure to no commit a message that was not
processed yet.

This is also a design decision and allow for strict in-order message
delivery per partition. Again, a trade-off with advantages and
disadvantages.

Hope this helps :)

-Matthias

On 11/24/17 1:36 AM, Vincenzo D'Amore wrote:
> Hi Matthias,
> 
> what you're writing is in sharp contrast with what I know...
> 
> I read that: "Kafka consumers are typically part of a consumer group. When
> multiple consumers are subscribed to a topic and belong to the same
> consumer group, each consumer in the group will receive messages from a
> different subset of the partitions in the topic."
> 
> https://www.safaribooksonline.com/library/view/kafka-the-
> definitive/9781491936153/ch04.html
> 
> This means that if there are four partition, first consumer within a
> consumer group will read simultaneously from all the existing partitions.
> 
> If we add another consumer, the existing four partition will be divided
> between the two consumers.
> 
> If we have more consumers than partitions, exceeding consumers will remain
> idle.
> 
> https://gist.github.com/freedev/adc3e58789cc23d25d15a7d273535523
> 
> So, if I understood correctly, you cannot have multiple consumers *within
> the same consumer group* concurrently consuming the same partition.
> 
> To be clear, I'm genuinely interested in understanding if I correctly get
> how Kafka consumers works.
> Comments and suggestions are welcome :)
> 
> Best regards,
> Vincenzo
> 
> On Wed, Nov 22, 2017 at 11:15 PM, Matthias J. Sax <ma...@confluent.io>
> wrote:
> 
>> I KafkaConsumer itself should be use single threaded. If you want to
>> parallelize processing, each thread should have it's own KafkaConsumer
>> instance and all consumers should use the same `group.id` in their
>> configuration. Load will be shared over all running consumer
>> automatically for this case.
>>
>>
>> -Matthias
>>
>> On 11/22/17 12:22 PM, cours.systeme@gmail.com wrote:
>>> I am testing a KafkaConsumer. How can I modify it to process records in
>> parallel?
>>>
>>
>>
> 
>

Re: parallel processing of records in a Kafka consumer

Posted by Vincenzo D'Amore <v....@gmail.com>.

Hi Matthias,

what you're writing is in sharp contrast with what I know...

I read that: "Kafka consumers are typically part of a consumer group. When
multiple consumers are subscribed to a topic and belong to the same
consumer group, each consumer in the group will receive messages from a
different subset of the partitions in the topic."

https://www.safaribooksonline.com/library/view/kafka-the-
definitive/9781491936153/ch04.html

This means that if there are four partition, first consumer within a
consumer group will read simultaneously from all the existing partitions.

If we add another consumer, the existing four partition will be divided
between the two consumers.

If we have more consumers than partitions, exceeding consumers will remain
idle.

https://gist.github.com/freedev/adc3e58789cc23d25d15a7d273535523

So, if I understood correctly, you cannot have multiple consumers *within
the same consumer group* concurrently consuming the same partition.

To be clear, I'm genuinely interested in understanding if I correctly get
how Kafka consumers works.
Comments and suggestions are welcome :)

Best regards,
Vincenzo

On Wed, Nov 22, 2017 at 11:15 PM, Matthias J. Sax <ma...@confluent.io>
wrote:

> I KafkaConsumer itself should be use single threaded. If you want to
> parallelize processing, each thread should have it's own KafkaConsumer
> instance and all consumers should use the same `group.id` in their
> configuration. Load will be shared over all running consumer
> automatically for this case.
>
>
> -Matthias
>
> On 11/22/17 12:22 PM, cours.systeme@gmail.com wrote:
> > I am testing a KafkaConsumer. How can I modify it to process records in
> parallel?
> >
>
>

-- 
Vincenzo D'Amore

Re: parallel processing of records in a Kafka consumer

Posted by "Matthias J. Sax" <ma...@confluent.io>.

I KafkaConsumer itself should be use single threaded. If you want to
parallelize processing, each thread should have it's own KafkaConsumer
instance and all consumers should use the same `group.id` in their
configuration. Load will be shared over all running consumer
automatically for this case.

-Matthias

On 11/22/17 12:22 PM, cours.systeme@gmail.com wrote:
> I am testing a KafkaConsumer. How can I modify it to process records in parallel?
>