Posted to users@kafka.apache.org by Bart van Deenen <ba...@fastmail.fm> on 2019/10/18 07:15:48 UTC

Broker that stays outside of the ISR, how to recover

Hi all

We had a Kafka broker failure (too many open files, stupid), and now the partitions on that broker will no longer become part of the ISR set. It's been a few days (organizational issues), and we have significant amounts of data on the ISR partitions.

In order to make the partitions on the broker become part of the ISR set again, should I:

* increase `replica.lag.time.max.ms` on the broker to the number of ms that the partitions are behind? I can guesstimate the value at about 7 days, or should I measure it somehow?
* stop the broker, wipe files (which ones?) and then restart it? Should I also do stuff on ZooKeeper?

Is there any _official_ information on how to deal with this situation?

Thanks for helping!

Re: Broker that stays outside of the ISR, how to recover

Posted by Bart van Deenen <ba...@fastmail.fm>.

Hi all

Thanks for the help. Eventually restarting the broker a second time (days later) triggered a full repair, and the cluster is happy again. I don't know why the first restart didn't fix the issue.

Greetings

Bart

On Fri, Oct 18, 2019, at 16:09, M. Manna wrote:
> In addition to what Peter said, I would recommend that you stop the broker
> and delete all data logs (if your replication factor is set correctly). Upon
> restart, they’ll be recreated. This is of course a last resort, for when you
> cannot determine the root cause.
> 
> This measure works well for me with my k8s deployment, where a pod (broker)
> is killed and recreated upon liveness probe failure.
> 
> Thanks,
> 

Re: Broker that stays outside of the ISR, how to recover

Posted by "M. Manna" <ma...@gmail.com>.

In addition to what Peter said, I would recommend that you stop the broker
and delete all data logs (if your replication factor is set correctly). Upon
restart, they’ll be recreated. This is of course a last resort, for when you
cannot determine the root cause.

This measure works well for me with my k8s deployment, where a pod (broker)
is killed and recreated upon liveness probe failure.
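
One way to check the "replication factor is set correctly" condition before deleting anything is to confirm that every partition hosted on the affected broker still has an in-sync replica on some other broker. The sketch below is only an illustration: it assumes you have captured the text output of `kafka-topics.sh --describe` and that the per-partition lines carry `Topic:`, `Partition:`, `Leader:`, `Replicas:` and `Isr:` fields. The function names and the regular expression are hypothetical helpers, not part of Kafka.

```python
import re

# Matches the per-partition lines of `kafka-topics.sh --describe` output.
# The field layout is an assumption; adjust it if your version prints differently.
_PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*\S+\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def _parse_ids(csv):
    """Turn '1,2,3' into {1, 2, 3}; an empty string gives an empty set."""
    return {int(x) for x in csv.split(",") if x}


def partitions_at_risk(describe_output, broker_id):
    """Partitions hosted on broker_id with no in-sync replica on another broker.

    Wiping broker_id's data while any of these exist would risk data loss.
    """
    at_risk = []
    for line in describe_output.splitlines():
        m = _PARTITION_LINE.search(line)
        if not m:
            continue  # topic summary lines and anything unrecognised
        replicas = _parse_ids(m["replicas"])
        isr = _parse_ids(m["isr"])
        if broker_id in replicas and not (isr - {broker_id}):
            at_risk.append((m["topic"], int(m["partition"])))
    return at_risk
```

If the returned list is empty for the affected broker's id, every partition it hosts has at least one in-sync copy elsewhere to resync from.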

Thanks,


On Fri, 18 Oct 2019 at 10:06, Peter Bukowinski <pm...@gmail.com> wrote:

> Hi Bart,
>
> Before changing anything, I would verify whether or not the affected
> broker is trying to catch up. Have you looked at the broker’s log? Do you
> see any errors? Check your metrics or the partition directories themselves
> to see if data is flowing into the broker.
>
> If you do want to reset the broker to have it start a fresh resync, stop
> the kafka broker service/process, 'rm -rf /path/to/kafka-logs' — check the
> value of your log.dir or log.dirs property in your server.properties file
> for the path — and then start the service again. It should check in with
> zookeeper and then start following the topic partition leaders for all the
> topic partition replicas assigned to it.
>
> -- Peter

Re: Broker that stays outside of the ISR, how to recover

Posted by Peter Bukowinski <pm...@gmail.com>.

Hi Bart,

Before changing anything, I would verify whether or not the affected broker is trying to catch up. Have you looked at the broker’s log? Do you see any errors? Check your metrics or the partition directories themselves to see if data is flowing into the broker.
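
As a rough way to check the partition directories for incoming data, one could take two size snapshots a minute or so apart and compare them. This is a minimal sketch, assuming each partition lives in its own subdirectory of the broker's data directory; the function names are made up for illustration. Growth is only a hint, since retention can also shrink a directory while the replica is fetching.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Segments can be rolled or deleted while we walk; skip them.
                pass
    return total


def partition_sizes(log_dir):
    """Map each partition directory name to its current size in bytes."""
    sizes = {}
    for entry in os.scandir(log_dir):
        if entry.is_dir():
            sizes[entry.name] = dir_size_bytes(entry.path)
    return sizes


def growing_partitions(before, after):
    """Names of partition directories that grew between two snapshots."""
    return sorted(
        name for name, size in after.items() if size > before.get(name, 0)
    )
```

Call `partition_sizes()` on the data directory, wait, call it again, and pass both results to `growing_partitions()`. An empty result over several minutes on a busy cluster suggests the broker is not fetching at all.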

If you do want to reset the broker to have it start a fresh resync, stop the kafka broker service/process, 'rm -rf /path/to/kafka-logs' — check the value of your log.dir or log.dirs property in your server.properties file for the path — and then start the service again. It should check in with zookeeper and then start following the topic partition leaders for all the topic partition replicas assigned to it.
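
To avoid deleting the wrong path, the data directories can be read out of `server.properties` first. The sketch below assumes simple `key=value` lines (the full Java properties syntax allows more) and a hypothetical helper name. It relies on `log.dirs` being a comma-separated list that takes precedence over `log.dir`, with `/tmp/kafka-logs` as Kafka's default when neither is set.

```python
def kafka_log_dirs(properties_text):
    """Return the data directories named in server.properties content.

    Simplified parsing: only `key=value` lines are understood, and
    lines starting with '#' or '!' are treated as comments.
    """
    props = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    # log.dirs wins over log.dir; fall back to Kafka's documented default.
    value = props.get("log.dirs") or props.get("log.dir") or "/tmp/kafka-logs"
    return [d.strip() for d in value.split(",") if d.strip()]
```

Every directory returned would need to be cleared on the stopped broker, since partitions are spread across all of them.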

-- Peter

>> On Oct 18, 2019, at 12:16 AM, Bart van Deenen <ba...@fastmail.fm> wrote:
> Hi all
> 
> We had a Kafka broker failure (too many open files, stupid), and now the partitions on that broker will no longer become part of the ISR set. It's been a few days (organizational issues), and we have significant amounts of data on the ISR partitions.
> 
> In order to make the partitions on the broker become part of the ISR set again, should I:
> 
> * increase `replica.lag.time.max.ms` on the broker to the number of ms that the partitions are behind. I can guesstimate the value to about 7 days, or should I measure it somehow?
> * stop the broker and wipe files (which ones?) and then restart it. Should I also do stuff on zookeeper ?
> 
> Is there any _official_ information on how to deal with this situation?
> 
> Thanks for helping!