Posted to users@kafka.apache.org by Ali Nazemian <al...@gmail.com> on 2020/05/07 04:38:22 UTC

Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Hi,

With the rise of Apache Kafka for event-driven architecture, one thing
that has become important is how to tune the Kafka consumer to manage
long-running jobs. The main issue arises when we set a relatively large
value for "max.poll.interval.ms". Setting this value will, of course,
resolve the issue of repetitive rebalances, but it creates another
operational issue. I am looking for some sort of golden strategy for
dealing with long-running jobs with Apache Kafka.
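
For context, the kind of consumer configuration I mean looks roughly like
this (the values are only illustrative, not what we actually run):

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("group.id", "long-running-jobs");
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
// Allow up to 30 minutes between poll() calls before the broker
// considers the consumer failed and rebalances. This stops the
// repetitive rebalances, but a genuinely hung consumer now holds
// its partitions for up to this long.
props.put("max.poll.interval.ms", "1800000");
// Fetch fewer records per poll so each batch finishes sooner.
props.put("max.poll.records", "10");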

If the consumer hangs for whatever reason, there is no easy way of getting
past that stage. It can easily block the pipeline, and you cannot do much
about it. Therefore, it occurred to me that I am probably missing something
here. What are the expectations? Is it not valid to use Apache Kafka for
long-lived jobs? Are there other parameters that need to be set, so that a
stuck consumer is simply a matter of misconfiguration?

I can see that a lot of similar issues have been raised regarding "the
consumer is stuck", and usually the answer has been "yeah, that's because
you have a long-running job, etc.". I have seen different suggestions:

- Avoid long-running jobs. Read the message, hand it off to another thread
and let the consumer move on. Obviously this can cause data loss, and that
would be a difficult problem to handle. It might be better to avoid using
Kafka in the first place for these types of requests.

- Avoid using Apache Kafka for long-running requests altogether.

- Workaround-based approaches: for example, if the consumer is blocked,
switch to another consumer group and set its offsets to the blocked
group's current values, etc.

There might be other suggestions I have missed here, but that is not the
point of this email. What I am looking for is the best practice for
dealing with long-running jobs with Apache Kafka. I cannot easily avoid
using Kafka, because it plays a critical part in our application and data
pipeline. On the other hand, we have had many challenges keeping
long-running jobs operationally stable. So I would appreciate it if
someone could help me understand what approach can be taken to deal with
these jobs when Apache Kafka is the message broker.

Thanks,
Ali

Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Jason Turim <ja...@signalvine.com>.
I am not sure off the top of my head, but since the method is on the
consumer, my intuition is that it would pause all the partitions the
consumer is reading from.  I think the best thing to do is write a little
test harness app to verify the behavior.
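
If it helps, the harness could be as small as this (untested sketch; it
assumes an already-subscribed consumer on a multi-partition topic named
"test-topic"):

import java.util.Collections;
import org.apache.kafka.common.TopicPartition;

// pause() takes an explicit collection of partitions, so it pauses
// exactly what you pass it -- everything the consumer owns, or a subset.
consumer.pause(consumer.assignment());   // pause all assigned partitions
System.out.println("paused: " + consumer.paused());

consumer.resume(consumer.assignment());  // undo
consumer.pause(Collections.singleton(new TopicPartition("test-topic", 0)));
System.out.println("paused: " + consumer.paused());  // only test-topic-0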


-- 
Jason Turim (he, him & his)
Vice President of Software Engineering
SignalVine Inc <http://www.signalvine.com>
(m) 415-407-6501

Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Ali Nazemian <al...@gmail.com>.
Hi Jason,

Thank you for the message. It seems quite interesting. Something I am not
sure about with "pause" and "resume" is that they work on a per-partition
basis. What will happen if multiple partitions are assigned to a single
consumer? For example, in the case where we have over-partitioned a Kafka
topic to reserve some capacity for scaling up in case of burst traffic.

Regards,
Ali


-- 
A.Nazemian

Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Jason Turim <ja...@signalvine.com>.
Hi Ali,

You may want to look at using the consumer pause/resume API.  It's a
mechanism that allows you to keep polling without retrieving new messages.

I employed this strategy to handle highly variable workloads by processing
them in a background thread.  First, pause the consumer when messages are
received.  Then start a background thread to process the data; meanwhile,
the primary thread continues to poll (it won't retrieve data while the
consumer is paused).  Finally, when the background processing is complete,
call resume on the consumer to retrieve the next batch of messages.
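
Roughly, the poll loop looked like this (a simplified, untested sketch:
process() stands in for the long-running work, and error handling and
rebalance listeners are omitted):

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerRecords;

ExecutorService worker = Executors.newSingleThreadExecutor();
Future<?> job = null;

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));

    if (!records.isEmpty()) {
        // Stop fetching while the long-running work is in flight.
        consumer.pause(consumer.assignment());
        job = worker.submit(() -> process(records));
    }

    if (job != null && job.isDone()) {
        // Work finished: commit progress and start fetching again.
        consumer.commitSync();
        consumer.resume(consumer.assignment());
        job = null;
    }
    // poll() keeps being called either way, so the consumer never
    // exceeds max.poll.interval.ms while the background job runs.
}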

More info here in the section "Detecting Consumer Failures" -
https://kafka.apache.org/22/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

hth,
jt


Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Chris Toomey <ct...@gmail.com>.
What exactly is your understanding of what's happening when you say "the
pipeline will be blocked by the group coordinator for up to
max.poll.interval.ms"? Please explain that.

There's no universal recipe for "long-running jobs"; there are just the
particular issues you might be encountering and suggested solutions to
those issues.

Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Ali Nazemian <al...@gmail.com>.
Hi Chris,

I am not sure where I said anything about "automatic partition
reassignment", but what I do know is that the side effect of increasing
"max.poll.interval.ms" is that, if the consumer hangs for whatever reason,
the pipeline will stay blocked for up to "max.poll.interval.ms" before the
group coordinator evicts the stuck consumer. So I am not sure if this is
because of the automatic partition assignment or something else. What I am
looking for is how to deal with long-running jobs in Apache Kafka.

Thanks,
Ali


-- 
A.Nazemian

Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Chris Toomey <ct...@gmail.com>.
I interpreted your post as saying "when our consumer gets stuck, Kafka's
automatic partition reassignment kicks in and that's problematic for us."
Hence I suggested not using the automatic partition assignment, which per
my interpretation would address your issue.

Chris


Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Ali Nazemian <al...@gmail.com>.
Thanks, Chris. So what causes the consumer to get stuck is a side effect
of the built-in partition assignment in Kafka, and by overriding that
behaviour I should be able to address the long-running job issue, is that
right? Can you please elaborate on this?

Regards,
Ali


-- 
A.Nazemian

Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Chris Toomey <ct...@gmail.com>.
You really have to decide what behavior you want when one of your
consumers gets "stuck". If you don't like the way the group protocol
dynamically manages topic partition assignments, or you can't figure out
an appropriate set of configuration settings that achieves your goal, you
can always elect not to use the group protocol and instead manage topic
partition assignment yourself. As I just replied to another post, there's
a nice writeup of this under "Manual Partition Assignment" in
https://kafka.apache.org/24/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
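
A bare-bones sketch of that approach (the topic name "jobs" is only for
illustration, and the consumer is assumed to be constructed already):

import java.time.Duration;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

// Take ownership of the topic's partitions directly: no group
// membership, no heartbeats, no automatic rebalancing, so a stuck
// process only ever affects the partitions you handed it.
List<TopicPartition> partitions = consumer.partitionsFor("jobs").stream()
        .map(p -> new TopicPartition(p.topic(), p.partition()))
        .collect(Collectors.toList());
consumer.assign(partitions);

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    // Process records; offset tracking is now your responsibility
    // (commitSync() still stores offsets under the configured group.id,
    // or you can keep positions in external storage).
}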

Chris



Re: Kafka long running job consumer config best practices and what to do to avoid stuck consumer

Posted by Ali Nazemian <al...@gmail.com>.
To help explain my case in more detail: the error I see constantly is the
consumer losing its heartbeat, and hence apparently the group gets
rebalanced, based on the log I can see on the Kafka side:

GroupCoordinator 11]: Member
consumer-3-f46e14b4-5998-4083-b7ec-bed4e3f374eb in group foo has failed,
removing it from the group

Thanks,
Ali


-- 
A.Nazemian