
Posted to users@kafka.apache.org by Steven Wu <st...@gmail.com> on 2015/09/09 05:26:38 UTC

some producers stuck when one broker is bad

We have observed that some producer instances stopped sending traffic to
brokers because the memory buffer was full. Those producers got stuck in
this state permanently. Because we couldn't find out which broker was bad,
I did a rolling restart of all the brokers. After the bad broker got
bounced, those stuck producers recovered automatically.

I don't know the exact problem with that bad broker. It seems to me that
some ZK state was inconsistent.

I know the timeout fix from KAFKA-2120 can probably keep producers from
getting stuck permanently. Here are some additional questions.
1) Any suggestions on how to identify the bad broker(s)?
2) Why did bouncing the bad broker let the producers recover automatically
(without restarting the producers)?

producer: 0.8.2.1
broker: 0.8.2.1

Thanks,
Steven
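
For readers following along, here is a minimal sketch of the producer
settings this message touches on. It assumes the new Java producer
(org.apache.kafka.clients.producer.KafkaProducer); the thread does not say
which client was in use. The key names are the documented ones for 0.8.2
and for 0.9.0 (where the KAFKA-2120 timeout work landed), but every value
below is illustrative, not a recommendation. The class and method names are
hypothetical.

```java
import java.util.Properties;

final class ProducerFailFastConfig {

    // 0.8.2 new Java producer: fail fast instead of blocking when the
    // record buffer is exhausted. All values are illustrative assumptions.
    static Properties failFast082() {
        Properties p = new Properties();
        // Total bytes the producer may use to buffer unsent records.
        p.setProperty("buffer.memory", "33554432");
        // false: send() throws BufferExhaustedException when the buffer is
        // full, rather than blocking the calling thread.
        p.setProperty("block.on.buffer.full", "false");
        // Force a metadata refresh at least this often, even without errors.
        p.setProperty("metadata.max.age.ms", "60000");
        return p;
    }

    // 0.9.0 and later: timeouts introduced by the KAFKA-2120 work.
    static Properties withTimeouts090() {
        Properties p = new Properties();
        // How long the client waits for a response before failing a request.
        p.setProperty("request.timeout.ms", "30000");
        // Upper bound on how long send() may block on a full buffer.
        p.setProperty("max.block.ms", "10000");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(failFast082());
        System.out.println(withTimeouts090());
    }
}
```

Failing fast does not drain batches already queued for an unresponsive
broker; it only keeps application threads from blocking indefinitely.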

Re: some producers stuck when one broker is bad

Posted by Steven Wu <st...@gmail.com>.

I was doing a rolling bounce of all brokers. Immediately after the bad
broker was bounced, those stuck producers recovered.

On Fri, Sep 11, 2015 at 9:05 AM, Mayuresh Gharat
<gharatmayuresh15@gmail.com> wrote:

> So how did you detect that the broker is bad? If bouncing brokers solved
> the problem and you did not find anything unusual in the broker logs, it
> is likely that the process was up but was isolated from producer
> requests, and since the producer did not have a timeout, the producer
> buffer filled up.
>
> Thanks,
>
> Mayuresh

Re: some producers stuck when one broker is bad

Posted by Mayuresh Gharat <gh...@gmail.com>.

So how did you detect that the broker is bad? If bouncing brokers solved
the problem and you did not find anything unusual in the broker logs, it
is likely that the process was up but was isolated from producer
requests, and since the producer did not have a timeout, the producer
buffer filled up.
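
The mechanism described above can be sketched with a plain bounded queue.
This is not Kafka code: the queue stands in for the producer's record
buffer, nothing drains it (standing in for a broker that never responds),
and the names BoundedBufferDemo, sendAll and maxBlockMs are hypothetical.
With put() the caller would block forever once the queue is full; offer()
with a timeout gives up and surfaces the problem instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class BoundedBufferDemo {

    // Tries to enqueue `count` records, waiting at most maxBlockMs for space
    // on each one. Returns how many records were accepted before the first
    // timeout (or interruption).
    static int sendAll(BlockingQueue<String> buffer, int count, long maxBlockMs) {
        int accepted = 0;
        for (int i = 0; i < count; i++) {
            try {
                // offer() with a timeout returns false when no space frees up
                // in time; put() would block indefinitely here.
                if (!buffer.offer("record-" + i, maxBlockMs, TimeUnit.MILLISECONDS)) {
                    break;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Capacity 3 and no consumer: the "broker" never takes anything.
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(3);
        int accepted = sendAll(buffer, 10, 50);
        System.out.println("accepted " + accepted + " of 10 before timing out");
        // prints "accepted 3 of 10 before timing out"
    }
}
```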

Thanks,

Mayuresh


On Thu, Sep 10, 2015 at 11:20 PM, Steven Wu <st...@gmail.com> wrote:

> Frankly, I don't know exactly what went bad with that broker. The
> process is still up.



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: some producers stuck when one broker is bad

Posted by Steven Wu <st...@gmail.com>.

Frankly, I don't know exactly what went bad with that broker. The
process is still up.

On Wed, Sep 9, 2015 at 10:10 AM, Mayuresh Gharat
<gharatmayuresh15@gmail.com> wrote:

> 1) Any suggestions on how to identify the bad broker(s)?
> ---> At LinkedIn we have alerts, set up using our internal scripts, for
> detecting whether a broker has gone bad. We also check the
> under-replicated partitions, which can tell us which broker has gone bad.
> A broker going bad can mean different things: the broker is alive but not
> responding and is completely isolated, or the broker has gone down, etc.
> Can you tell us what you meant when you said your broker went bad?
>
> 2) Why did bouncing the bad broker let the producers recover automatically?
> ----> This is because, as you bounced the broker, the leaders for other
> partitions changed and the producer sent out a TopicMetadataRequest, which
> tells the producer who the new leaders for the partitions are, and the
> producer started sending messages to those brokers.
>
> KAFKA-2120 will handle all of this for you automatically.
>
> Thanks,
>
> Mayuresh

Re: some producers stuck when one broker is bad

Posted by Mayuresh Gharat <gh...@gmail.com>.

1) Any suggestions on how to identify the bad broker(s)?
---> At LinkedIn we have alerts, set up using our internal scripts, for
detecting whether a broker has gone bad. We also check the
under-replicated partitions, which can tell us which broker has gone bad.
A broker going bad can mean different things: the broker is alive but not
responding and is completely isolated, or the broker has gone down, etc.
Can you tell us what you meant when you said your broker went bad?
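
As an illustration of the under-replicated-partitions check, here is a
minimal sketch that counts, for each broker, how many partitions list it
as a replica while it is missing from the in-sync replica set (ISR). The
broker that is missing most often is the first suspect. This is not
LinkedIn's internal tooling; the class names and the snapshot data are
hypothetical. In practice the replica and ISR lists could be taken from
`kafka-topics.sh --describe --under-replicated-partitions`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UnderReplicatedSuspects {

    // Replica assignment and current ISR for one partition.
    static final class PartitionState {
        final String name;
        final List<Integer> replicas;
        final List<Integer> isr;

        PartitionState(String name, List<Integer> replicas, List<Integer> isr) {
            this.name = name;
            this.replicas = replicas;
            this.isr = isr;
        }
    }

    // Returns broker id -> number of partitions where that broker is an
    // assigned replica but is absent from the ISR.
    static Map<Integer, Integer> missingFromIsr(List<PartitionState> partitions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (PartitionState p : partitions) {
            for (Integer broker : p.replicas) {
                if (!p.isr.contains(broker)) {
                    counts.merge(broker, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up snapshot: broker 2 has dropped out of two ISRs.
        List<PartitionState> snapshot = Arrays.asList(
            new PartitionState("events-0", Arrays.asList(1, 2, 3), Arrays.asList(1, 3)),
            new PartitionState("events-1", Arrays.asList(2, 3, 1), Arrays.asList(3, 1)),
            new PartitionState("events-2", Arrays.asList(3, 1, 2), Arrays.asList(3, 1, 2)));
        System.out.println(missingFromIsr(snapshot)); // prints {2=2}
    }
}
```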

2) Why did bouncing the bad broker let the producers recover automatically?
----> This is because, as you bounced the broker, the leaders for other
partitions changed and the producer sent out a TopicMetadataRequest, which
tells the producer who the new leaders for the partitions are, and the
producer started sending messages to those brokers.

KAFKA-2120 will handle all of this for you automatically.
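
To make the metadata-refresh explanation above concrete, here is a toy
model of the routing step. It is a sketch, not the producer's actual
implementation: the class LeaderRoutingSketch and its methods are
hypothetical, and the broker ids are made up. The producer keeps a
partition-to-leader map, and a fresh metadata response replaces it, so
records for the same partition start going to a different broker.

```java
import java.util.HashMap;
import java.util.Map;

final class LeaderRoutingSketch {

    // Partition id -> broker id currently believed to be the leader.
    private Map<Integer, Integer> leaderByPartition = new HashMap<>();

    // Replaces the cached view with the leaders from a metadata response.
    void applyMetadata(Map<Integer, Integer> freshLeaders) {
        leaderByPartition = new HashMap<>(freshLeaders);
    }

    // Returns the broker that records for this partition should be sent to.
    int brokerFor(int partition) {
        Integer leader = leaderByPartition.get(partition);
        if (leader == null) {
            throw new IllegalStateException("no leader known for partition " + partition);
        }
        return leader;
    }

    public static void main(String[] args) {
        LeaderRoutingSketch router = new LeaderRoutingSketch();

        // Before the bounce: partition 0 is led by broker 2 (the bad one).
        Map<Integer, Integer> before = new HashMap<>();
        before.put(0, 2);
        router.applyMetadata(before);
        System.out.println("before bounce: partition 0 -> broker " + router.brokerFor(0));

        // After the bounce: leadership moved, and a metadata refresh says so.
        Map<Integer, Integer> after = new HashMap<>();
        after.put(0, 1);
        router.applyMetadata(after);
        System.out.println("after bounce: partition 0 -> broker " + router.brokerFor(0));
    }
}
```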

Thanks,

Mayuresh

On Tue, Sep 8, 2015 at 8:26 PM, Steven Wu <st...@gmail.com> wrote:

> We have observed that some producer instances stopped sending traffic to
> brokers because the memory buffer was full. Those producers got stuck in
> this state permanently. Because we couldn't find out which broker was bad,
> I did a rolling restart of all the brokers. After the bad broker got
> bounced, those stuck producers recovered automatically.
>
> I don't know the exact problem with that bad broker. It seems to me that
> some ZK state was inconsistent.
>
> I know the timeout fix from KAFKA-2120 can probably keep producers from
> getting stuck permanently. Here are some additional questions.
> 1) Any suggestions on how to identify the bad broker(s)?
> 2) Why did bouncing the bad broker let the producers recover automatically
> (without restarting the producers)?
>
> producer: 0.8.2.1
> broker: 0.8.2.1
>
> Thanks,
> Steven
>



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125