Posted to users@kafka.apache.org by Paul Mackles <pm...@adobe.com> on 2014/05/17 01:25:23 UTC

ISR not updating

Hi - We are running kafka_2.8.0-0.8.0-beta1 (we are a little behind in upgrading).

From what I can tell, connectivity to ZK was lost for a brief period. The cluster seemed to recover OK except that we now have 2 (out of 125) partitions where the ISR appears to be out of date. In other words, kafka-list-topic is showing only one replica in the ISR for the 2 partitions in question (there should be 3).

What's odd is that in looking at the log segments for those partitions on the file system, I can see that they are in fact getting updated and by all measures look to be in sync. I can also see that the brokers where the out-of-sync replicas reside are doing fine and leading other partitions like nothing ever happened. Based on that, it seems like the ISR in ZK is just out-of-date due to a botched recovery from the brief ZK outage.

Has anyone seen anything like this before? I saw this ticket which sounded similar:

https://issues.apache.org/jira/browse/KAFKA-948

Anyone have any suggestions for recovering from this state? I was thinking of running the preferred-replica-election tool next to see if that gets the ISRs in ZK back in sync.

After that, I guess the next step would be to bounce the kafka servers in question.
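
If we do end up running it, the invocation would be something like the
following (sketch only; zk1:2181 is a placeholder for our actual ZK
connect string, and the second affected partition would be listed
alongside t1/33):

cat > affected-partitions.json <<'EOF'
{"partitions": [{"topic": "t1", "partition": 33}]}
EOF

# limit the preferred replica election to just the affected partitions
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 \
  --path-to-json-file affected-partitions.json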

Thanks,
Paul


Re: ISR not updating

Posted by Jun Rao <ju...@gmail.com>.
Ok. That does indicate the ISR should include all replicas. Which version
of the ZK server are you using? Could you check the ZK server log to see
if the ISR is being updated?
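
One quick way to get the version, assuming the ZooKeeper four-letter-word
commands are enabled and zk1:2181 stands in for one of your ensemble
members:

# the first line of the output reports the ZooKeeper server version
echo srvr | nc zk1 2181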

Thanks,

Jun


On Mon, May 19, 2014 at 1:30 AM, Shone Sadler <sh...@gmail.com> wrote:

> The value of under replicated partitions is 0 across the cluster.
>
> Thanks,
> Shone

Re: ISR not updating

Posted by Shone Sadler <sh...@gmail.com>.
The value of under replicated partitions is 0 across the cluster.

Thanks,
Shone


On Mon, May 19, 2014 at 12:23 AM, Jun Rao <ju...@gmail.com> wrote:

> What's the value of under replicated partitions JMX in each broker?
>
> Thanks,
>
> Jun

Re: ISR not updating

Posted by Jun Rao <ju...@gmail.com>.
What's the value of under replicated partitions JMX in each broker?
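
The bean I have in mind is the ReplicaManager under-replicated-partitions
gauge. A sketch of reading it with a generic JMX client (jmxterm here,
assuming JMX is exposed on port 9999; the exact ObjectName may differ
slightly between 0.8 and later releases):

echo 'get -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions Value' \
  | java -jar jmxterm-uber.jar -l localhost:9999 -n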

Thanks,

Jun


On Sat, May 17, 2014 at 6:16 PM, Paul Mackles <pm...@adobe.com> wrote:

> Today we did a rolling restart of ZK. We also restarted the kafka
> controller and ISRs are still not being updated in ZK. Again, the cluster
> seems fine and the replicas in question do appear to be getting updated. I
> am guessing there must be some bad state persisted in ZK.

Re: ISR not updating

Posted by Paul Mackles <pm...@adobe.com>.
Restarting the partition leaders cleared things up. We were hesitant to do
that at first because it was unclear what would happen with the
availability of the partitions in question since there were no other
replicas in the ISR (at least according to ZK). From what we observed, the
partitions did remain available during the restart. In other words, the
replicas were in sync the whole time and it was really just a matter of
the ISRs in ZK being out-of-sync.
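
For anyone who hits the same thing: after bouncing each leader we simply
re-ran the listing to confirm all three replicas showed up in the ISR
again, along the lines of (zk1:2181 being a placeholder for the real
connect string):

bin/kafka-list-topic.sh --zookeeper zk1:2181 --topic t1
# expected once recovered: topic: t1 partition: 33 ... with all three replicas in the isr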

I am not sure if this is an issue in more recent versions.

Paul

On 5/17/14 9:16 PM, "Paul Mackles" <pm...@adobe.com> wrote:

>Today we did a rolling restart of ZK. We also restarted the kafka
>controller and ISRs are still not being updated in ZK. Again, the cluster
>seems fine and the replicas in question do appear to be getting updated. I
>am guessing there must be some bad state persisted in ZK.


Re: ISR not updating

Posted by Paul Mackles <pm...@adobe.com>.
Today we did a rolling restart of ZK. We also restarted the Kafka
controller, but ISRs are still not being updated in ZK. Again, the cluster
seems fine and the replicas in question do appear to be getting updated. I
am guessing there must be some bad state persisted in ZK.
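
For what it's worth, we identified which broker was the controller by
reading the /controller znode directly (a sketch, assuming zkCli.sh from
the ZK install and zk1:2181 as the connect string):

# prints something like {"version":1,"brokerid":N,"timestamp":"..."}; brokerid is the controller
bin/zkCli.sh -server zk1:2181 get /controller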

On 5/17/14 7:50 PM, "Shone Sadler" <sh...@gmail.com> wrote:

>Hi Jun,
>
>I work with Paul and am monitoring the cluster as well.   The status has
>not changed.
>
>When we execute kafka-list-topic we are seeing the following (showing one
>of two partitions having the problem)
>
>topic: t1 partition: 33 leader: 1 replicas: 1,2,3 isr: 1
>
>when inspecting the logs of leader: I do see a spurt of ISR
>shrinkage/expansion  around the time that the brokers were partitioned
>from
>ZK. But nothing past the last message "Cached zkVersion [17] not equal to
>that in zookeeper." from  yesterday.  There are not constant changes to
>the
>ISR list.
>
>Is there a way to force the leader to update ZK with the latest ISR list?
>
>Thanks,
>Shone


Re: ISR not updating

Posted by Shone Sadler <sh...@gmail.com>.
Hi Jun,

I work with Paul and am monitoring the cluster as well.   The status has
not changed.

When we execute kafka-list-topic we are seeing the following (showing one
of two partitions having the problem)

topic: t1 partition: 33 leader: 1 replicas: 1,2,3 isr: 1

When inspecting the logs of the leader, I do see a spurt of ISR
shrinkage/expansion around the time that the brokers were partitioned from
ZK, but nothing past the last message "Cached zkVersion [17] not equal to
that in zookeeper." from yesterday. There are no constant changes to the
ISR list.

Is there a way to force the leader to update ZK with the latest ISR list?

Thanks,
Shone

Logs:

cat server.log | grep "\[t1,33\]"

[2014-04-18 10:16:32,814] INFO [ReplicaFetcherManager on broker 1] Removing
fetcher for partition [t1,33] (kafka.server.ReplicaFetcherManager)
[2014-05-13 19:42:10,784] ERROR [KafkaApi-1] Error when processing fetch
request for partition [t1,33] offset 330118156 from consumer with
correlation id 0 (kafka.server.KafkaApis)
[2014-05-14 11:02:25,255] ERROR [KafkaApi-1] Error when processing fetch
request for partition [t1,33] offset 332896470 from consumer with
correlation id 0 (kafka.server.KafkaApis)
[2014-05-16 12:00:11,344] INFO Partition [t1,33] on broker 1: Shrinking ISR
for partition [t1,33] from 3,1,2 to 1 (kafka.cluster.Partition)
[2014-05-16 12:00:18,009] INFO Partition [t1,33] on broker 1: Cached
zkVersion [17] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
[2014-05-16 13:33:11,344] INFO Partition [t1,33] on broker 1: Shrinking ISR
for partition [t1,33] from 3,1,2 to 1 (kafka.cluster.Partition)
[2014-05-16 13:33:12,403] INFO Partition [t1,33] on broker 1: Cached
zkVersion [17] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
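
Given the zkVersion mismatch above, it may also be worth comparing what ZK
actually holds for the partition. A sketch of reading the state znode
(path per the 0.8 layout; zk1:2181 is a placeholder):

# prints JSON like {"controller_epoch":..,"leader":1,"version":1,"leader_epoch":..,"isr":[1]}
# followed by the znode stat; its dataVersion is what the cached zkVersion [17] is compared against
bin/zkCli.sh -server zk1:2181 get /brokers/topics/t1/partitions/33/state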


On Sat, May 17, 2014 at 11:44 AM, Jun Rao <ju...@gmail.com> wrote:

> Do you see constant ISR shrinking/expansion of those two partitions in the
> leader broker's log ?
>
> Thanks,
>
> Jun

Re: ISR not updating

Posted by Jun Rao <ju...@gmail.com>.
Do you see constant ISR shrinking/expansion of those two partitions in the
leader broker's log?
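
For example, something along these lines on the leader would show whether
the shrink/expand cycle is ongoing (message text per the 0.8 broker
logging):

# if the cycle were still happening, new entries would keep appearing here
grep -E "Shrinking ISR|Expanding ISR" server.log | tail -20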

Thanks,

Jun


On Fri, May 16, 2014 at 4:25 PM, Paul Mackles <pm...@adobe.com> wrote:

> Hi - We are running kafka_2.8.0-0.8.0-beta1 (we are a little behind in
> upgrading).
>
> From what I can tell, connectivity to ZK was lost for a brief period. The
> cluster seemed to recover OK except that we now have 2 (out of 125)
> partitions where the ISR appears to be out of date. In other words,
> kafka-list-topic is showing only one replica in the ISR for the 2
> partitions in question (there should be 3).
>
> What's odd is that in looking at the log segments for those partitions on
> the file system, I can see that they are in fact getting updated and by all
> measures look to be in sync. I can also see that the brokers where the
> out-of-sync replicas reside are doing fine and leading other partitions
> like nothing ever happened. Based on that, it seems like the ISR in ZK is
> just out-of-date due to a botched recovery from the brief ZK outage.
>
> Has anyone seen anything like this before? I saw this ticket which sounded
> similar:
>
> https://issues.apache.org/jira/browse/KAFKA-948
>
> Anyone have any suggestions for recovering from this state? I was thinking
> of running the preferred-replica-election tool next to see if that gets the
> ISRs in ZK back in sync.
>
> After that, I guess the next step would be to bounce the kafka servers in
> question.
>
> Thanks,
> Paul
>
>