Posted to users@kafka.apache.org by Harald Kirsch <ha...@raytion.com> on 2016/12/06 15:18:20 UTC

Best approach to frequently restarting consumer process

We have consumer processes which need to restart frequently, say, every 
5 minutes. We have 10 of them so we are facing two restarts every minute 
on average.

1) It seems that nearly every time a consumer restarts, the group is 
rebalanced, even if the restart takes less than the heartbeat interval.

2) My guess is that the group manager just cannot know that the same 
consumer is knocking at the door again.

Are my suspicions (1) and (2) correct? Is there a chance to fix this 
such that a restart within the heartbeat interval does not lead to a 
re-balance? Would a well-defined client.id help?

Regards
Harald

Re: Best approach to frequently restarting consumer process

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.

Consumer groups aren't going to handle 'let it crash' particularly well
(and really any session-based services, but particularly consumer groups
since a single failure affects the entire group). That said, 'let it crash'
doesn't necessarily have to mean 'don't try to clean up at all'. The
consumer group will recover *much* more quickly if you make sure any crash
path includes a:

finally {
   consumer.close();
}

block to do some minimal cleanup. This will cause the consumer to make a
best effort to explicitly leave the group, allowing rebalancing to complete
after the rest of the members rejoin. If you don't do this, your rebalances
get much more expensive since the group coordinator needs to wait for the
session timeout. This will probably lead to noticeably longer pauses. The
one drawback to doing this today is that the close() can potentially block,
so it may not fail as fast as you want it to -- it would be good to get a
timeout-based close() implemented as well. That said, the LeaveGroup
request *is* best effort, so if the consumer was otherwise in a healthy
state, this should be very fast.
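
A slightly fuller sketch of the cleanup pattern described above. To keep it
runnable without the Kafka client on the classpath, GroupMember is a
hypothetical stand-in for the two consumer calls that matter here (poll and
close), and RecordingMember and CrashSafeLoop are illustrative names, not
Kafka APIs. With the real client, the same try/finally shape would wrap the
poll loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the parts of a consumer used here; not a Kafka API.
interface GroupMember {
    List<String> poll();   // fetch the next batch of records
    void close();          // best-effort attempt to leave the group
}

// Fake member that records the order of calls so the cleanup can be observed.
class RecordingMember implements GroupMember {
    final List<String> events = new ArrayList<>();
    private int polls = 0;

    @Override
    public List<String> poll() {
        polls++;
        events.add("poll");
        return List.of("record-" + polls);
    }

    @Override
    public void close() {
        events.add("close");
    }
}

class CrashSafeLoop {
    // Record handler that is allowed to fail.
    interface Handler {
        void handle(String record) throws Exception;
    }

    // Polls until the handler throws. close() runs on every exit path,
    // so the failure still propagates but the member leaves the group first.
    static void run(GroupMember member, Handler handler) throws Exception {
        try {
            while (true) {
                for (String record : member.poll()) {
                    handler.handle(record);
                }
            }
        } finally {
            member.close();
        }
    }
}
```

This only helps when the JVM gets a chance to unwind the stack. If the
process is killed externally (SIGKILL, for example), no finally block runs
and the coordinator still has to wait for the session timeout. Newer client
versions also offer a close() overload that takes a timeout, which addresses
the blocking concern; check the javadoc of the version in use.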

All this said, 'let it crash' isn't the same thing as 'constant crashes are
ok'. It's a fault recovery methodology, but crashing every 5 minutes isn't
what the telecom industry had in mind... If things are crashing that
frequently, there is likely a very common bug/memory leak/etc which can be
fixed to significantly reduce the frequency of crashes. Generally 'let it
crash' systems also provide a good way to collect debugging
information for exactly this purpose.

-Ewen

On Wed, Dec 7, 2016 at 1:38 AM, Harald Kirsch <ha...@raytion.com>
wrote:

> With 'restart' I mean a 'let it crash' setup (as promoted by Erlang and
> Akka, e.g. http://doc.akka.io/docs/akka/snapshot/intro/what-is-akka.html).
> The consumer gets in trouble due to an OOM or a runaway computation or
> whatever that we want to preempt somehow. It crashes or gets killed
> externally.
>
> So whether close() is called or not in the dying process, I don't know.
> But clearly the subscribe is called after a restart.
>
> I understand that we are out of luck with this. We would have to separate
> the crashing part out into a different operating system process, but must
> keep the consumer running all the time. :-(
>
> Thanks for the insight
> Harald
>
>
> On 06.12.2016 19:26, Gwen Shapira wrote:
>
>> Can you clarify what you mean by "restart"? If you call
>> consumer.close() and consumer.subscribe() you will definitely trigger
>> a rebalance.
>>
>> It doesn't matter if it's "same consumer knocking", we already
>> rebalance when you call consumer.close().
>>
>> Since we want both consumer.close() and consumer.subscribe() to cause
>> rebalance immediately (and not wait for heartbeat), I don't think
>> we'll be changing their behavior.
>>
>> Depending on why consumers need to restart, I'm wondering if you can
>> restart other threads in your application but keep the consumer up and
>> running to avoid the rebalances.

-- 
Thanks,
Ewen

Re: Best approach to frequently restarting consumer process

Posted by Harald Kirsch <ha...@raytion.com>.

With 'restart' I mean a 'let it crash' setup (as promoted by Erlang and 
Akka, e.g. 
http://doc.akka.io/docs/akka/snapshot/intro/what-is-akka.html). The 
consumer gets in trouble due to an OOM or a runaway computation or 
whatever that we want to preempt somehow. It crashes or gets killed 
externally.

So whether close() is called or not in the dying process, I don't know. 
But clearly the subscribe is called after a restart.

I understand that we are out of luck with this. We would have to 
separate the crashing part out into a different operating system 
process, but must keep the consumer running all the time. :-(

Thanks for the insight
Harald

On 06.12.2016 19:26, Gwen Shapira wrote:
> Can you clarify what you mean by "restart"? If you call
> consumer.close() and consumer.subscribe() you will definitely trigger
> a rebalance.
>
> It doesn't matter if it's "same consumer knocking", we already
> rebalance when you call consumer.close().
>
> Since we want both consumer.close() and consumer.subscribe() to cause
> rebalance immediately (and not wait for heartbeat), I don't think
> we'll be changing their behavior.
>
> Depending on why consumers need to restart, I'm wondering if you can
> restart other threads in your application but keep the consumer up and
> running to avoid the rebalances.

Re: Best approach to frequently restarting consumer process

Posted by Gwen Shapira <gw...@confluent.io>.

Can you clarify what you mean by "restart"? If you call
consumer.close() and consumer.subscribe() you will definitely trigger
a rebalance.

It doesn't matter if it's "same consumer knocking", we already
rebalance when you call consumer.close().

Since we want both consumer.close() and consumer.subscribe() to cause
rebalance immediately (and not wait for heartbeat), I don't think
we'll be changing their behavior.

Depending on why consumers need to restart, I'm wondering if you can
restart other threads in your application but keep the consumer up and
running to avoid the rebalances.
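
One way to picture this suggestion is the sketch below: the thread that owns
the consumer stays alive and hands each unit of work to a disposable worker
thread, which is replaced whenever the work fails or overruns a deadline.
IsolatedWorker, its method names, and the timeout values are all assumptions
made for illustration; nothing here is a Kafka API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs risky work off the consumer thread and
// replaces the worker thread when the work fails or takes too long.
class IsolatedWorker {
    private ExecutorService pool = Executors.newSingleThreadExecutor();
    int restarts = 0;

    // Returns true if the work finished in time; on failure or timeout the
    // worker is restarted and false is returned. The caller keeps running.
    boolean process(Callable<Void> work, long timeoutMillis) {
        Future<Void> future = pool.submit(work);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException | TimeoutException e) {
            future.cancel(true);   // interrupt the work if it is still running
            restart();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt flag
            future.cancel(true);
            return false;
        }
    }

    // Discards the current worker thread and starts a fresh one.
    private void restart() {
        pool.shutdownNow();
        pool = Executors.newSingleThreadExecutor();
        restarts++;
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

In a poll loop, the consumer thread would call process() for each record or
batch and decide what to do with a false result (skip, retry, or log). Two
limits are worth keeping in mind. Work that ignores interruption cannot be
stopped this way, and threads share one heap, so this does not contain an
OOM; that case still calls for a separate operating system process. The
timeout also needs to stay well within whatever poll and session limits the
consumer is configured with.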

-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap