You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by "Murphy, Gerard" <ge...@sap.com> on 2019/06/13 16:48:31 UTC

Question: Kafka as a message queue for long running tasks

Hi,

I am wondering if there is something I am missing about my set up to facilitate long running jobs.

For my purposes it is ok to have `At most once` message delivery, this means it is not required to think about committing offsets (or at least it is ok to commit each message offset upon receiving it).

I have the following in order to achieve the competing consumer pattern:

  *   A topic
  *   X consumers in the same group
  *   P partitions in a topic (where P >= X always)

My problem is that I have messages that can take ~15 minutes (but this may fluctuate by up to 50% lets say) in order to process. In order to avoid consumers having their partition assignments revoked I have increased the value of `max.poll.interval.ms` to reflect this.
However this comes with some negative consequences:

  *   if some message exceeds this length of time then in a worst case scenario a the consumer processing this message will have to wait up to the value of `max.poll.interval.ms` for a rebalance
  *   if I need to scale and increase the number of consumers based on load then any new consumers might also have to wait the value of `max.poll.interval.ms` for a rebalance to occur in order to process any messages

As it stands at the moment I see that I can proceed as follows:

  *   Set `max.poll.interval.ms` to be a small value and accept that every consumer processing every message will time out and go through the process of having assignments revoked and waiting a small amount of time for a rebalance

However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this.
Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable.
I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.

I appreciate any advise that anybody can offer on this subject.

Regards,
Ger

Re: Question: Kafka as a message queue for long running tasks

Posted by Raman Gupta <ro...@gmail.com>.
Gerard, we have a similar use case we are using Kafka for, and are
setting max.poll.interval.ms to a large value in order to handle the
worst-case scenario.

Rebalancing is indeed a big problem with this approach (and not just
for "new" consumers as you mentioned -- adding consumers causes a
stop-the-world rebalance on all existing consumers as well). The
static consumer groups protocol introduced in Kafka 2.3 helps quite a
lot with the failed-and-restarting consumer case (though I ran into
several bugs with it and still have a stream that won't start in dev,
it seems to be working in prod now with patched Kafka brokers), but
does not handle scaling your consumers up (or down). Scaling consumers
is still a stop-the-world rebalance, and you can expect it to take at
least the max time any particular in-flight message takes to process
(if all goes well), because your consumers will essentially be sitting
around idle waiting for every other consumer to finish. Theoretically,
the incremental rebalancing capabilities coming out soon (KAFKA-8179)
should help resolve this.

The other issue you will run into with long processing is "hot
partitions/consumers" -- with a large backlog of such messages it is
inevitable that the consumers of some partitions will take a lot
longer to process than some other partitions, which means *some* of
your consumers will just be sitting idle, while others are lagging
*way* behind. There is no easy solution to this that I have found
because one partition is always processed by at most one Kafka
consumer. You'll have to work around this by reading a bunch of
messages, distributing their work yourself, tracking completions
outside of Kafka, and then committing offsets back to Kafka manually.

Honestly, because of these issues, right now I am seriously
considering a migration of our Kafka workloads to Apache Pulsar, which
supports the same semantics as Kafka, but also supports
work/message-queue style messaging *much* better, as it uses
per-message acknowledgement, and subscriptions are not limited to the
number of partitions i.e. you can have N partitions, and M+N
consumers.

Regards,
Raman


On Thu, Jun 13, 2019 at 1:54 PM Mark Anderson <ma...@gmail.com> wrote:
>
> We have a different use case where we stop consuming due to connection to
> an external system being down.
>
> In this case we sleep for the same period as our poll timeout would be and
> recommit the previous offset. This stops the consumer going stale and
> avoids increasing the max interval.
>
> Perhaps you could do something similar?
>
> Mark
>
>
> On Thu, 13 Jun 2019, 17:49 Murphy, Gerard, <ge...@sap.com> wrote:
>
> > Hi,
> >
> > I am wondering if there is something I am missing about my set up to
> > facilitate long running jobs.
> >
> > For my purposes it is ok to have `At most once` message delivery, this
> > means it is not required to think about committing offsets (or at least it
> > is ok to commit each message offset upon receiving it).
> >
> > I have the following in order to achieve the competing consumer pattern:
> >
> >   *   A topic
> >   *   X consumers in the same group
> >   *   P partitions in a topic (where P >= X always)
> >
> > My problem is that I have messages that can take ~15 minutes (but this may
> > fluctuate by up to 50% lets say) in order to process. In order to avoid
> > consumers having their partition assignments revoked I have increased the
> > value of `max.poll.interval.ms` to reflect this.
> > However this comes with some negative consequences:
> >
> >   *   if some message exceeds this length of time then in a worst case
> > scenario a the consumer processing this message will have to wait up to the
> > value of `max.poll.interval.ms` for a rebalance
> >   *   if I need to scale and increase the number of consumers based on
> > load then any new consumers might also have to wait the value of `
> > max.poll.interval.ms` for a rebalance to occur in order to process any
> > messages
> >
> > As it stands at the moment I see that I can proceed as follows:
> >
> >   *   Set `max.poll.interval.ms` to be a small value and accept that
> > every consumer processing every message will time out and go through the
> > process of having assignments revoked and waiting a small amount of time
> > for a rebalance
> >
> > However I do not like this, and am considering looking at alternative
> > technology for my message queue as I do not see any obvious way around this.
> > Admittedly I am new to Kafka, and it is just a gut feeling that the above
> > is not desirable.
> > I have used RabbitMQ in the past for these scenarios, however we need
> > Kafka in our architecture for other purposes at the moment and it would be
> > nice not to have to introduce another technology if Kafka can achieve this.
> >
> > I appreciate any advise that anybody can offer on this subject.
> >
> > Regards,
> > Ger
> >

Re: Question: Kafka as a message queue for long running tasks

Posted by Mark Anderson <ma...@gmail.com>.
We have a different use case where we stop consuming due to connection to
an external system being down.

In this case we sleep for the same period as our poll timeout would be and
recommit the previous offset. This stops the consumer going stale and
avoids increasing the max interval.

Perhaps you could do something similar?

Mark


On Thu, 13 Jun 2019, 17:49 Murphy, Gerard, <ge...@sap.com> wrote:

> Hi,
>
> I am wondering if there is something I am missing about my set up to
> facilitate long running jobs.
>
> For my purposes it is ok to have `At most once` message delivery, this
> means it is not required to think about committing offsets (or at least it
> is ok to commit each message offset upon receiving it).
>
> I have the following in order to achieve the competing consumer pattern:
>
>   *   A topic
>   *   X consumers in the same group
>   *   P partitions in a topic (where P >= X always)
>
> My problem is that I have messages that can take ~15 minutes (but this may
> fluctuate by up to 50% lets say) in order to process. In order to avoid
> consumers having their partition assignments revoked I have increased the
> value of `max.poll.interval.ms` to reflect this.
> However this comes with some negative consequences:
>
>   *   if some message exceeds this length of time then in a worst case
> scenario a the consumer processing this message will have to wait up to the
> value of `max.poll.interval.ms` for a rebalance
>   *   if I need to scale and increase the number of consumers based on
> load then any new consumers might also have to wait the value of `
> max.poll.interval.ms` for a rebalance to occur in order to process any
> messages
>
> As it stands at the moment I see that I can proceed as follows:
>
>   *   Set `max.poll.interval.ms` to be a small value and accept that
> every consumer processing every message will time out and go through the
> process of having assignments revoked and waiting a small amount of time
> for a rebalance
>
> However I do not like this, and am considering looking at alternative
> technology for my message queue as I do not see any obvious way around this.
> Admittedly I am new to Kafka, and it is just a gut feeling that the above
> is not desirable.
> I have used RabbitMQ in the past for these scenarios, however we need
> Kafka in our architecture for other purposes at the moment and it would be
> nice not to have to introduce another technology if Kafka can achieve this.
>
> I appreciate any advise that anybody can offer on this subject.
>
> Regards,
> Ger
>