Posted to users@kafka.apache.org by Radu Radutiu <rr...@gmail.com> on 2017/03/22 13:51:01 UTC

ISR churn

Hello,

Does anyone know how I can debug high ISR churn on the Kafka leader in a
cluster without traffic? I have 2 topics on a 4-node cluster (replication
factor 4 and 3), and both show constant changes in the number of in-sync
replicas:

[2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker
2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
(kafka.cluster.Partition)
[2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker
2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
(kafka.cluster.Partition)
[2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker
2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
(kafka.cluster.Partition)
[2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker
2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
(kafka.cluster.Partition)
[2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker
2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
(kafka.cluster.Partition)
[2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker
2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
(kafka.cluster.Partition)
[2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker
2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
(kafka.cluster.Partition)

and

[2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking
ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
[2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding
ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
[2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking
ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
[2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding
ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
[2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding
ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
[2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking
ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1 (kafka.cluster.Partition)
[2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding
ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
[2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking
ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2 (kafka.cluster.Partition)
[2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding
ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
(END)

I have tried increasing num.network.threads (now 8) and
num.replica.fetchers (now 2), but nothing has changed.

The kafka server config is:

default.replication.factor=4
log.retention.check.interval.ms=300000
log.retention.hours=168
log.roll.hours=24
log.segment.bytes=104857600
min.insync.replicas=2
num.io.threads=8
num.network.threads=15
num.partitions=1
num.recovery.threads.per.data.dir=1
num.replica.fetchers=2
offsets.topic.num.partitions=1
offsets.topic.replication.factor=3
replica.lag.time.max.ms=500
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
unclean.leader.election.enable=false
zookeeper.connection.timeout.ms=3000

Best regards,
Radu

Re: ISR churn

Posted by Jeff Widman <je...@netskope.com>.
To manually fail over the controller, just delete the /controller znode in
ZooKeeper.
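
Something along these lines with the zookeeper-shell that ships with Kafka
should do it (zk1:2181 below is just a placeholder for your ensemble):

bin/zookeeper-shell.sh zk1:2181
get /controller       (shows which broker id currently holds the controller)
delete /controller    (removes the ephemeral znode; the brokers then elect a new controller)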

On Wed, Mar 22, 2017 at 11:46 AM, Marcos Juarez <mj...@gmail.com> wrote:

> We're seeing the same exact pattern of ISR shrinking/resizing, mostly on
> partitions with the largest volume, with thousands of messages per second.
> It happens at least a hundred times a day in our production cluster. We do
> have hundreds of topics in our cluster, most of them with 20 or more
> partitions, but most of them see only a few hundred messages per minute.
>
> We're running Kafka 0.10.0.1, and we thought upgrading to the 0.10.1.1
> version would fix the issue, but we've already deployed that version to our
> staging cluster, and we're seeing the same problem.  We still haven't tried
> out the latest 0.10.2.0 version, but I don't see any evidence pointing to a
> fix for that.
>
> This ticket seems to have some similar details, but it doesn't seem like
> there has been follow-up, and there's no target release for fixing:
>
> https://issues.apache.org/jira/browse/KAFKA-4674
>
> Jun Ma, what exactly did you do to failover the controller to a new
> broker? If that works for you, I'd like to try it in our staging clusters.
>
> Thanks,
>
> Marcos Juarez
>
>
>
>
>
> On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj...@gmail.com> wrote:
>
>> I have similar issue with our cluster. We don’t know the root cause but
>> we have some interesting observation.
>>
>> 1. We do see correlation between ISR churn and fetcher connection
>> close/create.
>>
>> 2. We’ve tried to add a broker which doesn’t have any partitions on it
>> dedicate to the controller (rolling restart existing brokers and failover
>> the controller to the newly added broker), and that indeed eliminate the
>> random ISR churn. We have a cluster of 6 brokers (besides the dedicated
>> controller) and each one has about 300 partitions on it. I suspect that
>> kafka broker cannot handle running controller + 300 partitions.
>>
>> Anyway that’s so far what I got, I’d also like to know how to debug this.
>> We’re running kafka 0.9.0.1 with heap size 8G.
>>
>> Thanks,
>> Jun
>>
>> On Mar 22, 2017, at 7:06 AM, Manikumar <ma...@gmail.com> wrote:
>>
>> Any erros related to zookeeper seesion timeout? We can also check for
>> excesssive GC.
>> Some times this may due to forming multiple controllers due to soft
>> failures.
>> You can check ActiveControllerCount on brokers, only one broker in the
>> cluster should have 1.
>> Also check for network issues/partitions
>>
>> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rr...@gmail.com> wrote:
>>
>> Hello,
>>
>> Does anyone know how I can debug high ISR churn on the kafka leader on a
>> cluster without traffic? I have 2 topics on a 4 node cluster  (replica 4
>> and replica 3) and both show constant changes of the number of insync
>> replicas:
>>
>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker
>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
>> (kafka.cluster.Partition)
>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker
>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
>> (kafka.cluster.Partition)
>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker
>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
>> (kafka.cluster.Partition)
>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker
>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
>> (kafka.cluster.Partition)
>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker
>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
>> (kafka.cluster.Partition)
>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker
>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
>> (kafka.cluster.Partition)
>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker
>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
>> (kafka.cluster.Partition)
>>
>> and
>>
>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking
>> ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking
>> ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking
>> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
>> (kafka.cluster.Partition)
>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking
>> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
>> (kafka.cluster.Partition)
>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
>> (END)
>>
>> I have tried increasing the num.network.threads (now 8) and
>> num.replica.fetchers (now 2) but nothing has changed.
>>
>> The kafka server config is:
>>
>> default.replication.factor=4
>> log.retention.check.interval.ms=300000
>> log.retention.hours=168
>> log.roll.hours=24
>> log.segment.bytes=104857600
>> min.insync.replicas=2
>> num.io.threads=8
>> num.network.threads=15
>> num.partitions=1
>> num.recovery.threads.per.data.dir=1
>> num.replica.fetchers=2
>> offsets.topic.num.partitions=1
>> offsets.topic.replication.factor=3
>> replica.lag.time.max.ms=500
>> socket.receive.buffer.bytes=102400
>> socket.request.max.bytes=104857600
>> socket.send.buffer.bytes=102400
>> unclean.leader.election.enable=false
>> zookeeper.connection.timeout.ms=3000
>>
>> Best regards,
>> Radu
>>
>>
>>
>

Re: ISR churn

Posted by Radu Radutiu <rr...@gmail.com>.
I see no errors related to ZooKeeper. I searched all the logs (Kafka and
ZooKeeper) on all 4 servers for all entries in the minute around the ISR
change at 08:23:54. Here are the logs:

Node n1
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Broker 1 cached leader info
(LeaderAndIsrInfo:(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63),ReplicationFactor:4),AllReplicas:1,5,2,4)
for partition topic1-0 in response to UpdateMetadata request sent by
controller 4 epoch 64 with correlation id 456 (state.change.logger)

Node n2
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Broker 2 cached leader info
(LeaderAndIsrInfo:(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63),ReplicationFactor:4),AllReplicas:1,5,2,4)
for partition topic1-0 in response to UpdateMetadata request sent by
controller 4 epoch 64 with correlation id 490 (state.change.logger)

Node n4 (controller)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,150] TRACE
Controller 4 epoch 64 sending UpdateMetadata request
(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63) to brokers Set(1,
5, 2, 4) for partition topic1-0 (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Broker 4 cached leader info
(LeaderAndIsrInfo:(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63),ReplicationFactor:4),AllReplicas:1,5,2,4)
for partition topic1-0 in response to UpdateMetadata request sent by
controller 4 epoch 64 with correlation id 490 (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Controller 4 epoch 64 received response {error_code=0} for a request sent
to broker n5:9090 (id: 5 rack: null) (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Controller 4 epoch 64 received response {error_code=0} for a request sent
to broker n4:9090 (id: 4 rack: null) (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Controller 4 epoch 64 received response {error_code=0} for a request sent
to broker n2:9090 (id: 2 rack: null) (state.change.logger)
kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Controller 4 epoch 64 received response {error_code=0} for a request sent
to broker n1:9090 (id: 1 rack: null) (state.change.logger)
kafka_2.12-0.10.2.0/logs/controller.log:[2017-03-23 08:23:55,149] DEBUG
[IsrChangeNotificationListener on Controller 4]: ISR change notification
listener fired (kafka.controller.IsrChangeNotificationListener)
kafka_2.12-0.10.2.0/logs/controller.log:[2017-03-23 08:23:55,150] DEBUG
[IsrChangeNotificationListener on Controller 4]: Sending MetadataRequest to
Brokers:Vector(1, 2, 4, 5) for TopicAndPartitions:Set([topic1,0])
(kafka.controller.IsrChangeNotificationListener)
kafka_2.12-0.10.2.0/logs/controller.log:[2017-03-23 08:23:55,152] DEBUG
[IsrChangeNotificationListener on Controller 4]: ISR change notification
listener fired (kafka.controller.IsrChangeNotificationListener)

Node n5 (leader for topic1-0)
kafka_2.12-0.10.2.0/logs/server.log:[2017-03-23 08:23:54,896] INFO
Partition [topic1,0] on broker 5: Shrinking ISR for partition [topic1,0]
from 5,4,1,2 to 5,4,2 (kafka.cluster.Partition)
kafka_2.12-0.10.2.0/logs/server.log:[2017-03-23 08:23:54,898] INFO
Partition [topic1,0] on broker 5: Expanding ISR for partition topic1-0 from
5,4,2 to 5,4,2,1 (kafka.cluster.Partition)

kafka_2.12-0.10.2.0/logs/state-change.log:[2017-03-23 08:23:55,151] TRACE
Broker 5 cached leader info
(LeaderAndIsrInfo:(Leader:5,ISR:5,4,2,1,LeaderEpoch:56,ControllerEpoch:63),ReplicationFactor:4),AllReplicas:1,5,2,4)
for partition topic1-0 in response to UpdateMetadata request sent by
controller 4 epoch 64 with correlation id 490 (state.change.logger)

There are no log entries before "Shrinking ISR for partition [topic1,0]
from 5,4,1,2 to 5,4,2". The cluster is idle, there are no network problems,
and it has 4 Kafka nodes, 2 topics with 1 partition each, and 6 consumers,
with time synchronized across nodes. I find it strange that in my case the
time difference between shrinking and expanding the ISR is very small (3-4 ms).
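
(For reference, the search on each node was roughly

grep "2017-03-23 08:23:5" kafka_2.12-0.10.2.0/logs/*.log

i.e. every Kafka log file, limited to the seconds around the ISR change.)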

Best regards,
Radu

On Thu, Mar 23, 2017 at 12:18 AM, James Cheng <wu...@gmail.com> wrote:

> Marcos, Radu,
>
> Are either of you seeing messages saying "Cached zkVersion [...] not equal
> to that in zookeeper"? If so, you may be hitting
> https://issues.apache.org/jira/browse/KAFKA-3042 <
> https://issues.apache.org/jira/browse/KAFKA-3042>
>
> Not sure if that helps you, since I haven't been able it isolate a cause.
> I think it's related to flaky connections to the zookeeper cluster.
>
> -James
>
>
> > On Mar 22, 2017, at 11:46 AM, Marcos Juarez <mj...@gmail.com> wrote:
> >
> > We're seeing the same exact pattern of ISR shrinking/resizing, mostly on
> partitions with the largest volume, with thousands of messages per second.
> It happens at least a hundred times a day in our production cluster. We do
> have hundreds of topics in our cluster, most of them with 20 or more
> partitions, but most of them see only a few hundred messages per minute.
> >
> > We're running Kafka 0.10.0.1, and we thought upgrading to the 0.10.1.1
> version would fix the issue, but we've already deployed that version to our
> staging cluster, and we're seeing the same problem.  We still haven't tried
> out the latest 0.10.2.0 version, but I don't see any evidence pointing to a
> fix for that.
> >
> > This ticket seems to have some similar details, but it doesn't seem like
> there has been follow-up, and there's no target release for fixing:
> >
> > https://issues.apache.org/jira/browse/KAFKA-4674 <
> https://issues.apache.org/jira/browse/KAFKA-4674>
> >
> > Jun Ma, what exactly did you do to failover the controller to a new
> broker? If that works for you, I'd like to try it in our staging clusters.
> >
> > Thanks,
> >
> > Marcos Juarez
> >
> >
> >
> >
> >
> > On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj.saber1990@gmail.com
> <ma...@gmail.com>> wrote:
> > I have similar issue with our cluster. We don’t know the root cause but
> we have some interesting observation.
> >
> > 1. We do see correlation between ISR churn and fetcher connection
> close/create.
> >
> >
> > 2. We’ve tried to add a broker which doesn’t have any partitions on it
> dedicate to the controller (rolling restart existing brokers and failover
> the controller to the newly added broker), and that indeed eliminate the
> random ISR churn. We have a cluster of 6 brokers (besides the dedicated
> controller) and each one has about 300 partitions on it. I suspect that
> kafka broker cannot handle running controller + 300 partitions.
> >
> > Anyway that’s so far what I got, I’d also like to know how to debug this.
> > We’re running kafka 0.9.0.1 with heap size 8G.
> >
> > Thanks,
> > Jun
> >
> >> On Mar 22, 2017, at 7:06 AM, Manikumar <manikumar.reddy@gmail.com
> <ma...@gmail.com>> wrote:
> >>
> >> Any erros related to zookeeper seesion timeout? We can also check for
> >> excesssive GC.
> >> Some times this may due to forming multiple controllers due to soft
> >> failures.
> >> You can check ActiveControllerCount on brokers, only one broker in the
> >> cluster should have 1.
> >> Also check for network issues/partitions
> >>
> >> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rradutiu@gmail.com
> <ma...@gmail.com>> wrote:
> >>
> >>> Hello,
> >>>
> >>> Does anyone know how I can debug high ISR churn on the kafka leader on
> a
> >>> cluster without traffic? I have 2 topics on a 4 node cluster  (replica
> 4
> >>> and replica 3) and both show constant changes of the number of insync
> >>> replicas:
> >>>
> >>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on
> broker
> >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
> >>> (kafka.cluster.Partition)
> >>>
> >>> and
> >>>
> >>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5:
> Shrinking
> >>> ISR for partition [topic1,0] from 5,2,4 to 5,4
> (kafka.cluster.Partition)
> >>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> >>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5:
> Shrinking
> >>> ISR for partition [topic1,0] from 5,4,2 to 5,4
> (kafka.cluster.Partition)
> >>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> >>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1
> (kafka.cluster.Partition)
> >>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5:
> Shrinking
> >>> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2
> (kafka.cluster.Partition)
> >>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5:
> Shrinking
> >>> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
> >>> (kafka.cluster.Partition)
> >>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5:
> Expanding
> >>> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4
> (kafka.cluster.Partition)
> >>> (END)
> >>>
> >>> I have tried increasing the num.network.threads (now 8) and
> >>> num.replica.fetchers (now 2) but nothing has changed.
> >>>
> >>> The kafka server config is:
> >>>
> >>> default.replication.factor=4
> >>> log.retention.check.interval.ms <http://log.retention.check.
> interval.ms/>=300000
> >>> log.retention.hours=168
> >>> log.roll.hours=24
> >>> log.segment.bytes=104857600
> >>> min.insync.replicas=2
> >>> num.io.threads=8
> >>> num.network.threads=15
> >>> num.partitions=1
> >>> num.recovery.threads.per.data.dir=1
> >>> num.replica.fetchers=2
> >>> offsets.topic.num.partitions=1
> >>> offsets.topic.replication.factor=3
> >>> replica.lag.time.max.ms <http://replica.lag.time.max.ms/>=500
> >>> socket.receive.buffer.bytes=102400
> >>> socket.request.max.bytes=104857600
> >>> socket.send.buffer.bytes=102400
> >>> unclean.leader.election.enable=false
> >>> zookeeper.connection.timeout.ms <http://zookeeper.connection.
> timeout.ms/>=3000
> >>>
> >>> Best regards,
> >>> Radu
> >>>
> >
> >
>
>

Re: ISR churn

Posted by James Cheng <wu...@gmail.com>.
Marcos, Radu, 

Are either of you seeing messages saying "Cached zkVersion [...] not equal to that in zookeeper"? If so, you may be hitting https://issues.apache.org/jira/browse/KAFKA-3042 <https://issues.apache.org/jira/browse/KAFKA-3042>
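
(A quick way to check is to grep each broker's server.log for it, e.g.
grep "Cached zkVersion" /path/to/kafka/logs/server.log, adjusting the log
directory for your install.)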

Not sure if that helps you, since I haven't been able to isolate a cause. I think it's related to flaky connections to the ZooKeeper cluster.

-James


> On Mar 22, 2017, at 11:46 AM, Marcos Juarez <mj...@gmail.com> wrote:
> 
> We're seeing the same exact pattern of ISR shrinking/resizing, mostly on partitions with the largest volume, with thousands of messages per second.  It happens at least a hundred times a day in our production cluster. We do have hundreds of topics in our cluster, most of them with 20 or more partitions, but most of them see only a few hundred messages per minute.  
> 
> We're running Kafka 0.10.0.1, and we thought upgrading to the 0.10.1.1 version would fix the issue, but we've already deployed that version to our staging cluster, and we're seeing the same problem.  We still haven't tried out the latest 0.10.2.0 version, but I don't see any evidence pointing to a fix for that.
> 
> This ticket seems to have some similar details, but it doesn't seem like there has been follow-up, and there's no target release for fixing:
> 
> https://issues.apache.org/jira/browse/KAFKA-4674 <https://issues.apache.org/jira/browse/KAFKA-4674>
> 
> Jun Ma, what exactly did you do to failover the controller to a new broker? If that works for you, I'd like to try it in our staging clusters.
> 
> Thanks,
> 
> Marcos Juarez
> 
> 
> 
> 
> 
> On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj.saber1990@gmail.com <ma...@gmail.com>> wrote:
> I have similar issue with our cluster. We don’t know the root cause but we have some interesting observation.
> 
> 1. We do see correlation between ISR churn and fetcher connection close/create.
> 
> 
> 2. We’ve tried to add a broker which doesn’t have any partitions on it dedicate to the controller (rolling restart existing brokers and failover the controller to the newly added broker), and that indeed eliminate the random ISR churn. We have a cluster of 6 brokers (besides the dedicated controller) and each one has about 300 partitions on it. I suspect that kafka broker cannot handle running controller + 300 partitions.
> 
> Anyway that’s so far what I got, I’d also like to know how to debug this.
> We’re running kafka 0.9.0.1 with heap size 8G.
> 
> Thanks,
> Jun
> 
>> On Mar 22, 2017, at 7:06 AM, Manikumar <manikumar.reddy@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Any erros related to zookeeper seesion timeout? We can also check for
>> excesssive GC.
>> Some times this may due to forming multiple controllers due to soft
>> failures.
>> You can check ActiveControllerCount on brokers, only one broker in the
>> cluster should have 1.
>> Also check for network issues/partitions
>> 
>> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rradutiu@gmail.com <ma...@gmail.com>> wrote:
>> 
>>> Hello,
>>> 
>>> Does anyone know how I can debug high ISR churn on the kafka leader on a
>>> cluster without traffic? I have 2 topics on a 4 node cluster  (replica 4
>>> and replica 3) and both show constant changes of the number of insync
>>> replicas:
>>> 
>>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
>>> (kafka.cluster.Partition)
>>> 
>>> and
>>> 
>>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking
>>> ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
>>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking
>>> ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
>>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
>>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking
>>> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
>>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking
>>> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
>>> (END)
>>> 
>>> I have tried increasing the num.network.threads (now 8) and
>>> num.replica.fetchers (now 2) but nothing has changed.
>>> 
>>> The kafka server config is:
>>> 
>>> default.replication.factor=4
>>> log.retention.check.interval.ms <http://log.retention.check.interval.ms/>=300000
>>> log.retention.hours=168
>>> log.roll.hours=24
>>> log.segment.bytes=104857600
>>> min.insync.replicas=2
>>> num.io.threads=8
>>> num.network.threads=15
>>> num.partitions=1
>>> num.recovery.threads.per.data.dir=1
>>> num.replica.fetchers=2
>>> offsets.topic.num.partitions=1
>>> offsets.topic.replication.factor=3
>>> replica.lag.time.max.ms <http://replica.lag.time.max.ms/>=500
>>> socket.receive.buffer.bytes=102400
>>> socket.request.max.bytes=104857600
>>> socket.send.buffer.bytes=102400
>>> unclean.leader.election.enable=false
>>> zookeeper.connection.timeout.ms <http://zookeeper.connection.timeout.ms/>=3000
>>> 
>>> Best regards,
>>> Radu
>>> 
> 
> 


Re: ISR churn

Posted by David Garcia <da...@spiceworks.com>.
Sure…there are two types of purgatories: Consumer and Producer
Consumer purgatory (for a partition leader) is a queue for pending requests for data (i.e. polling by some client for the respective partition).  It’s basically a waiting area for poll requests.  Generally speaking, the more consumers there are, the larger this queue will grow…and the larger this queue, the longer consumers will wait for data. 

The producer purgatory is for write requests (for a partition leader).  In my experience, the producer purgatory is more of a bottleneck (because it has to write to disk…which is a slower operation).  Although Kafka delegates disk IO to the OS, I have found this purgatory size to be highly predictive of cluster stress.  WRT the ISR, if you relax the replication settings for your topics, this will obviously mitigate the issue (i.e. fewer synchronizations needed).  Network latency was also one of our biggest issues.  Have you checked that?
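
If you want to watch those, the purgatory sizes are exposed over JMX; on 0.9/0.10 brokers the mbeans should be (if I remember the names correctly):

kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce
kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch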

-David

On 3/22/17, 8:28 PM, "Jun MA" <mj...@gmail.com> wrote:

    Hi David,
    
    I checked our cluster, the producer purgatory size is under 3 mostly. But I’m not quite understand this metrics, could you please explain it a little bit?
    
    Thanks,
    Jun
    > On Mar 22, 2017, at 3:07 PM, David Garcia <da...@spiceworks.com> wrote:
    > 
    >  producer purgatory size
    
    


Re: ISR churn

Posted by Jun MA <mj...@gmail.com>.
Hi David,

I checked our cluster; the producer purgatory size is mostly under 3. But I don’t quite understand this metric, could you please explain it a little bit?

Thanks,
Jun
> On Mar 22, 2017, at 3:07 PM, David Garcia <da...@spiceworks.com> wrote:
> 
>  producer purgatory size


Re: ISR churn

Posted by David Garcia <da...@spiceworks.com>.
Look at producer purgatory size.  Anything greater than 10 is bad (from my experience).  Keeping it under 4 seemed to help us (i.e. if a broker is getting slammed with writes, use the rebalance tools or add a new broker).  Also check network latency and/or adjust the timeout for ISR checking.  If on AWS, make sure to enable “enhanced networking” (aka: networking that doesn’t suck).
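
The "timeout for ISR checking" is the replica.lag.time.max.ms broker setting: a follower that hasn't caught up to the leader's log end within that window gets dropped from the ISR.  If I remember correctly the default is 10000 ms, so in server.properties that would be:

replica.lag.time.max.ms=10000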

On 3/22/17, 3:39 PM, "Jun MA" <mj...@gmail.com> wrote:

    Let me know if this fix your issue! I’d really interesting to know based on what information should we decide to expand the cluster- bytes per seconds or number of partitions on each broker? And what is the limitation.
    
    
    > On Mar 22, 2017, at 11:46 AM, Marcos Juarez <mj...@gmail.com> wrote:
    > 
    > We're seeing the same exact pattern of ISR shrinking/resizing, mostly on partitions with the largest volume, with thousands of messages per second.  It happens at least a hundred times a day in our production cluster. We do have hundreds of topics in our cluster, most of them with 20 or more partitions, but most of them see only a few hundred messages per minute.  
    > 
    > We're running Kafka 0.10.0.1, and we thought upgrading to the 0.10.1.1 version would fix the issue, but we've already deployed that version to our staging cluster, and we're seeing the same problem.  We still haven't tried out the latest 0.10.2.0 version, but I don't see any evidence pointing to a fix for that.
    > 
    > This ticket seems to have some similar details, but it doesn't seem like there has been follow-up, and there's no target release for fixing:
    > 
    > https://issues.apache.org/jira/browse/KAFKA-4674 <https://issues.apache.org/jira/browse/KAFKA-4674>
    > 
    > Jun Ma, what exactly did you do to failover the controller to a new broker? If that works for you, I'd like to try it in our staging clusters.
    > 
    > Thanks,
    > 
    > Marcos Juarez
    > 
    > 
    > 
    > 
    > 
    > On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj.saber1990@gmail.com <ma...@gmail.com>> wrote:
    > I have similar issue with our cluster. We don’t know the root cause but we have some interesting observation.
    > 
    > 1. We do see correlation between ISR churn and fetcher connection close/create.
    > 
    > 
    > 2. We’ve tried to add a broker which doesn’t have any partitions on it dedicate to the controller (rolling restart existing brokers and failover the controller to the newly added broker), and that indeed eliminate the random ISR churn. We have a cluster of 6 brokers (besides the dedicated controller) and each one has about 300 partitions on it. I suspect that kafka broker cannot handle running controller + 300 partitions.
    > 
    > Anyway that’s so far what I got, I’d also like to know how to debug this.
    > We’re running kafka 0.9.0.1 with heap size 8G.
    > 
    > Thanks,
    > Jun
    > 
    >> On Mar 22, 2017, at 7:06 AM, Manikumar <manikumar.reddy@gmail.com <ma...@gmail.com>> wrote:
    >> 
    >> Any erros related to zookeeper seesion timeout? We can also check for
    >> excesssive GC.
    >> Some times this may due to forming multiple controllers due to soft
    >> failures.
    >> You can check ActiveControllerCount on brokers, only one broker in the
    >> cluster should have 1.
    >> Also check for network issues/partitions
    >> 
    >> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rradutiu@gmail.com <ma...@gmail.com>> wrote:
    >> 
    >>> Hello,
    >>> 
    >>> Does anyone know how I can debug high ISR churn on the kafka leader on a
    >>> cluster without traffic? I have 2 topics on a 4 node cluster  (replica 4
    >>> and replica 3) and both show constant changes of the number of insync
    >>> replicas:
    >>> 
    >>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker
    >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
    >>> (kafka.cluster.Partition)
    >>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker
    >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
    >>> (kafka.cluster.Partition)
    >>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker
    >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
    >>> (kafka.cluster.Partition)
    >>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker
    >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
    >>> (kafka.cluster.Partition)
    >>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker
    >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
    >>> (kafka.cluster.Partition)
    >>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker
    >>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
    >>> (kafka.cluster.Partition)
    >>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker
    >>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
    >>> (kafka.cluster.Partition)
    >>> 
    >>> and
    >>> 
    >>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking
    >>> ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
    >>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding
    >>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
    >>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking
    >>> ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
    >>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding
    >>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
    >>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding
    >>> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
    >>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking
    >>> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
    >>> (kafka.cluster.Partition)
    >>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding
    >>> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
    >>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking
    >>> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
    >>> (kafka.cluster.Partition)
    >>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding
    >>> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
    >>> (END)
    >>> 
    >>> I have tried increasing the num.network.threads (now 8) and
    >>> num.replica.fetchers (now 2) but nothing has changed.
    >>> 
    >>> The kafka server config is:
    >>> 
    >>> default.replication.factor=4
    >>> log.retention.check.interval.ms <http://log.retention.check.interval.ms/>=300000
    >>> log.retention.hours=168
    >>> log.roll.hours=24
    >>> log.segment.bytes=104857600
    >>> min.insync.replicas=2
    >>> num.io.threads=8
    >>> num.network.threads=15
    >>> num.partitions=1
    >>> num.recovery.threads.per.data.dir=1
    >>> num.replica.fetchers=2
    >>> offsets.topic.num.partitions=1
    >>> offsets.topic.replication.factor=3
    >>> replica.lag.time.max.ms <http://replica.lag.time.max.ms/>=500
    >>> socket.receive.buffer.bytes=102400
    >>> socket.request.max.bytes=104857600
    >>> socket.send.buffer.bytes=102400
    >>> unclean.leader.election.enable=false
    >>> zookeeper.connection.timeout.ms <http://zookeeper.connection.timeout.ms/>=3000
    >>> 
    >>> Best regards,
    >>> Radu
    >>> 
    > 
    > 
    
    


Re: ISR churn

Posted by Jun MA <mj...@gmail.com>.
Let me know if this fixes your issue! I’d be really interested to know what information we should base the decision to expand the cluster on: bytes per second, or the number of partitions on each broker? And what is the limit?


> On Mar 22, 2017, at 11:46 AM, Marcos Juarez <mj...@gmail.com> wrote:
> 
> We're seeing the same exact pattern of ISR shrinking/resizing, mostly on partitions with the largest volume, with thousands of messages per second.  It happens at least a hundred times a day in our production cluster. We do have hundreds of topics in our cluster, most of them with 20 or more partitions, but most of them see only a few hundred messages per minute.  
> 
> We're running Kafka 0.10.0.1, and we thought upgrading to the 0.10.1.1 version would fix the issue, but we've already deployed that version to our staging cluster, and we're seeing the same problem.  We still haven't tried out the latest 0.10.2.0 version, but I don't see any evidence pointing to a fix for that.
> 
> This ticket seems to have some similar details, but it doesn't seem like there has been follow-up, and there's no target release for fixing:
> 
> https://issues.apache.org/jira/browse/KAFKA-4674 <https://issues.apache.org/jira/browse/KAFKA-4674>
> 
> Jun Ma, what exactly did you do to failover the controller to a new broker? If that works for you, I'd like to try it in our staging clusters.
> 
> Thanks,
> 
> Marcos Juarez
> 
> 
> 
> 
> 
> On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj.saber1990@gmail.com <ma...@gmail.com>> wrote:
> I have similar issue with our cluster. We don’t know the root cause but we have some interesting observation.
> 
> 1. We do see correlation between ISR churn and fetcher connection close/create.
> 
> 
> 2. We’ve tried to add a broker which doesn’t have any partitions on it dedicate to the controller (rolling restart existing brokers and failover the controller to the newly added broker), and that indeed eliminate the random ISR churn. We have a cluster of 6 brokers (besides the dedicated controller) and each one has about 300 partitions on it. I suspect that kafka broker cannot handle running controller + 300 partitions.
> 
> Anyway that’s so far what I got, I’d also like to know how to debug this.
> We’re running kafka 0.9.0.1 with heap size 8G.
> 
> Thanks,
> Jun
> 
>> On Mar 22, 2017, at 7:06 AM, Manikumar <manikumar.reddy@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Any erros related to zookeeper seesion timeout? We can also check for
>> excesssive GC.
>> Some times this may due to forming multiple controllers due to soft
>> failures.
>> You can check ActiveControllerCount on brokers, only one broker in the
>> cluster should have 1.
>> Also check for network issues/partitions
>> 
>> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rradutiu@gmail.com <ma...@gmail.com>> wrote:
>> 
>>> Hello,
>>> 
>>> Does anyone know how I can debug high ISR churn on the kafka leader on a
>>> cluster without traffic? I have 2 topics on a 4 node cluster  (replica 4
>>> and replica 3) and both show constant changes of the number of insync
>>> replicas:
>>> 
>>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker
>>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
>>> (kafka.cluster.Partition)
>>> 
>>> and
>>> 
>>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking
>>> ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
>>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking
>>> ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
>>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
>>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking
>>> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
>>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking
>>> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
>>> (kafka.cluster.Partition)
>>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding
>>> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
>>> (END)
>>> 
>>> I have tried increasing the num.network.threads (now 8) and
>>> num.replica.fetchers (now 2) but nothing has changed.
>>> 
>>> The kafka server config is:
>>> 
>>> default.replication.factor=4
>>> log.retention.check.interval.ms <http://log.retention.check.interval.ms/>=300000
>>> log.retention.hours=168
>>> log.roll.hours=24
>>> log.segment.bytes=104857600
>>> min.insync.replicas=2
>>> num.io.threads=8
>>> num.network.threads=15
>>> num.partitions=1
>>> num.recovery.threads.per.data.dir=1
>>> num.replica.fetchers=2
>>> offsets.topic.num.partitions=1
>>> offsets.topic.replication.factor=3
>>> replica.lag.time.max.ms <http://replica.lag.time.max.ms/>=500
>>> socket.receive.buffer.bytes=102400
>>> socket.request.max.bytes=104857600
>>> socket.send.buffer.bytes=102400
>>> unclean.leader.election.enable=false
>>> zookeeper.connection.timeout.ms <http://zookeeper.connection.timeout.ms/>=3000
>>> 
>>> Best regards,
>>> Radu
>>> 
> 
> 


Re: ISR churn

Posted by Marcos Juarez <mj...@gmail.com>.
We're seeing the same exact pattern of ISR shrinking/resizing, mostly on
partitions with the largest volume, with thousands of messages per second.
It happens at least a hundred times a day in our production cluster. We do
have hundreds of topics in our cluster, most of them with 20 or more
partitions, but most of them see only a few hundred messages per minute.

We're running Kafka 0.10.0.1, and we thought upgrading to the 0.10.1.1
version would fix the issue, but we've already deployed that version to our
staging cluster, and we're seeing the same problem.  We still haven't tried
out the latest 0.10.2.0 version, but I don't see any evidence pointing to a
fix for that.

This ticket seems to have some similar details, but it doesn't seem like
there has been follow-up, and there's no target release for fixing:

https://issues.apache.org/jira/browse/KAFKA-4674

Jun Ma, what exactly did you do to failover the controller to a new broker?
If that works for you, I'd like to try it in our staging clusters.

Thanks,

Marcos Juarez





On Wed, Mar 22, 2017 at 11:55 AM, Jun MA <mj...@gmail.com> wrote:

> I have similar issue with our cluster. We don’t know the root cause but we
> have some interesting observation.
>
> 1. We do see correlation between ISR churn and fetcher connection
> close/create.
>
> 2. We’ve tried to add a broker which doesn’t have any partitions on it
> dedicate to the controller (rolling restart existing brokers and failover
> the controller to the newly added broker), and that indeed eliminate the
> random ISR churn. We have a cluster of 6 brokers (besides the dedicated
> controller) and each one has about 300 partitions on it. I suspect that
> kafka broker cannot handle running controller + 300 partitions.
>
> Anyway that’s so far what I got, I’d also like to know how to debug this.
> We’re running kafka 0.9.0.1 with heap size 8G.
>
> Thanks,
> Jun
>
> On Mar 22, 2017, at 7:06 AM, Manikumar <ma...@gmail.com> wrote:
>
> Any erros related to zookeeper seesion timeout? We can also check for
> excesssive GC.
> Some times this may due to forming multiple controllers due to soft
> failures.
> You can check ActiveControllerCount on brokers, only one broker in the
> cluster should have 1.
> Also check for network issues/partitions
>
> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rr...@gmail.com> wrote:
>
> Hello,
>
> Does anyone know how I can debug high ISR churn on the kafka leader on a
> cluster without traffic? I have 2 topics on a 4 node cluster  (replica 4
> and replica 3) and both show constant changes of the number of insync
> replicas:
>
> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker
> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
> (kafka.cluster.Partition)
> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker
> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
> (kafka.cluster.Partition)
> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker
> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
> (kafka.cluster.Partition)
> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker
> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
> (kafka.cluster.Partition)
> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker
> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
> (kafka.cluster.Partition)
> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker
> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
> (kafka.cluster.Partition)
> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker
> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
> (kafka.cluster.Partition)
>
> and
>
> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking
> ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking
> ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking
> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
> (kafka.cluster.Partition)
> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking
> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
> (kafka.cluster.Partition)
> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
> (END)
>
> I have tried increasing the num.network.threads (now 8) and
> num.replica.fetchers (now 2) but nothing has changed.
>
> The kafka server config is:
>
> default.replication.factor=4
> log.retention.check.interval.ms=300000
> log.retention.hours=168
> log.roll.hours=24
> log.segment.bytes=104857600
> min.insync.replicas=2
> num.io.threads=8
> num.network.threads=15
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> num.replica.fetchers=2
> offsets.topic.num.partitions=1
> offsets.topic.replication.factor=3
> replica.lag.time.max.ms=500
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> unclean.leader.election.enable=false
> zookeeper.connection.timeout.ms=3000
>
> Best regards,
> Radu
>
>
>

Re: ISR churn

Posted by Jun MA <mj...@gmail.com>.
I have a similar issue with our cluster. We don’t know the root cause, but we have some interesting observations.

1. We do see a correlation between ISR churn and fetcher connections being closed/created.


2. We’ve tried adding a broker which doesn’t have any partitions on it, dedicated to the controller (rolling restart the existing brokers and fail the controller over to the newly added broker), and that indeed eliminated the random ISR churn. We have a cluster of 6 brokers (besides the dedicated controller) and each one has about 300 partitions on it. I suspect a Kafka broker cannot handle running the controller plus 300 partitions.

Anyway, that’s what I’ve got so far; I’d also like to know how to debug this.
We’re running Kafka 0.9.0.1 with a heap size of 8G.

Thanks,
Jun
> On Mar 22, 2017, at 7:06 AM, Manikumar <ma...@gmail.com> wrote:
> 
> Any erros related to zookeeper seesion timeout? We can also check for
> excesssive GC.
> Some times this may due to forming multiple controllers due to soft
> failures.
> You can check ActiveControllerCount on brokers, only one broker in the
> cluster should have 1.
> Also check for network issues/partitions
> 
> On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rr...@gmail.com> wrote:
> 
>> Hello,
>> 
>> Does anyone know how I can debug high ISR churn on the kafka leader on a
>> cluster without traffic? I have 2 topics on a 4 node cluster  (replica 4
>> and replica 3) and both show constant changes of the number of insync
>> replicas:
>> 
>> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker
>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
>> (kafka.cluster.Partition)
>> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker
>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
>> (kafka.cluster.Partition)
>> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker
>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
>> (kafka.cluster.Partition)
>> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker
>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
>> (kafka.cluster.Partition)
>> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker
>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
>> (kafka.cluster.Partition)
>> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker
>> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
>> (kafka.cluster.Partition)
>> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker
>> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
>> (kafka.cluster.Partition)
>> 
>> and
>> 
>> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking
>> ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
>> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking
>> ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
>> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
>> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
>> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking
>> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
>> (kafka.cluster.Partition)
>> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
>> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking
>> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
>> (kafka.cluster.Partition)
>> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding
>> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
>> (END)
>> 
>> I have tried increasing the num.network.threads (now 8) and
>> num.replica.fetchers (now 2) but nothing has changed.
>> 
>> The kafka server config is:
>> 
>> default.replication.factor=4
>> log.retention.check.interval.ms=300000
>> log.retention.hours=168
>> log.roll.hours=24
>> log.segment.bytes=104857600
>> min.insync.replicas=2
>> num.io.threads=8
>> num.network.threads=15
>> num.partitions=1
>> num.recovery.threads.per.data.dir=1
>> num.replica.fetchers=2
>> offsets.topic.num.partitions=1
>> offsets.topic.replication.factor=3
>> replica.lag.time.max.ms=500
>> socket.receive.buffer.bytes=102400
>> socket.request.max.bytes=104857600
>> socket.send.buffer.bytes=102400
>> unclean.leader.election.enable=false
>> zookeeper.connection.timeout.ms=3000
>> 
>> Best regards,
>> Radu
>> 


Re: ISR churn

Posted by Manikumar <ma...@gmail.com>.
Any errors related to ZooKeeper session timeouts? We can also check for
excessive GC.
Sometimes this may be due to multiple controllers forming because of soft
failures.
You can check ActiveControllerCount on the brokers; only one broker in the
cluster should have a value of 1.
Also check for network issues/partitions.
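
If JMX is enabled on the brokers (JMX_PORT set when starting Kafka), one way
to read it is the JmxTool class that ships with Kafka, roughly like this
(broker-host:9999 is a placeholder for each broker's JMX endpoint):

bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi \
  --object-name kafka.controller:type=KafkaController,name=ActiveControllerCount

Run it against every broker; exactly one should report 1 and the rest 0.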

On Wed, Mar 22, 2017 at 7:21 PM, Radu Radutiu <rr...@gmail.com> wrote:

> Hello,
>
> Does anyone know how I can debug high ISR churn on the kafka leader on a
> cluster without traffic? I have 2 topics on a 4 node cluster  (replica 4
> and replica 3) and both show constant changes of the number of insync
> replicas:
>
> [2017-03-22 15:30:10,945] INFO Partition [__consumer_offsets,0] on broker
> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
> (kafka.cluster.Partition)
> [2017-03-22 15:31:41,193] INFO Partition [__consumer_offsets,0] on broker
> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,4,5 to 2,5
> (kafka.cluster.Partition)
> [2017-03-22 15:31:41,195] INFO Partition [__consumer_offsets,0] on broker
> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
> (kafka.cluster.Partition)
> [2017-03-22 15:35:03,443] INFO Partition [__consumer_offsets,0] on broker
> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,5
> (kafka.cluster.Partition)
> [2017-03-22 15:35:03,445] INFO Partition [__consumer_offsets,0] on broker
> 2: Expanding ISR for partition __consumer_offsets-0 from 2,5 to 2,5,4
> (kafka.cluster.Partition)
> [2017-03-22 15:37:01,443] INFO Partition [__consumer_offsets,0] on broker
> 2: Shrinking ISR for partition [__consumer_offsets,0] from 2,5,4 to 2,4
> (kafka.cluster.Partition)
> [2017-03-22 15:37:01,445] INFO Partition [__consumer_offsets,0] on broker
> 2: Expanding ISR for partition __consumer_offsets-0 from 2,4 to 2,4,5
> (kafka.cluster.Partition)
>
> and
>
> [2017-03-22 15:09:52,646] INFO Partition [topic1,0] on broker 5: Shrinking
> ISR for partition [topic1,0] from 5,2,4 to 5,4 (kafka.cluster.Partition)
> [2017-03-22 15:09:52,648] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> [2017-03-22 15:24:05,646] INFO Partition [topic1,0] on broker 5: Shrinking
> ISR for partition [topic1,0] from 5,4,2 to 5,4 (kafka.cluster.Partition)
> [2017-03-22 15:24:05,648] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,4 to 5,4,2 (kafka.cluster.Partition)
> [2017-03-22 15:26:49,599] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,4,2 to 5,4,2,1 (kafka.cluster.Partition)
> [2017-03-22 15:27:46,396] INFO Partition [topic1,0] on broker 5: Shrinking
> ISR for partition [topic1,0] from 5,4,2,1 to 5,4,1
> (kafka.cluster.Partition)
> [2017-03-22 15:27:46,398] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,4,1 to 5,4,1,2 (kafka.cluster.Partition)
> [2017-03-22 15:45:47,896] INFO Partition [topic1,0] on broker 5: Shrinking
> ISR for partition [topic1,0] from 5,4,1,2 to 5,1,2
> (kafka.cluster.Partition)
> [2017-03-22 15:45:47,898] INFO Partition [topic1,0] on broker 5: Expanding
> ISR for partition topic1-0 from 5,1,2 to 5,1,2,4 (kafka.cluster.Partition)
> (END)
>
> I have tried increasing the num.network.threads (now 8) and
> num.replica.fetchers (now 2) but nothing has changed.
>
> The kafka server config is:
>
> default.replication.factor=4
> log.retention.check.interval.ms=300000
> log.retention.hours=168
> log.roll.hours=24
> log.segment.bytes=104857600
> min.insync.replicas=2
> num.io.threads=8
> num.network.threads=15
> num.partitions=1
> num.recovery.threads.per.data.dir=1
> num.replica.fetchers=2
> offsets.topic.num.partitions=1
> offsets.topic.replication.factor=3
> replica.lag.time.max.ms=500
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> unclean.leader.election.enable=false
> zookeeper.connection.timeout.ms=3000
>
> Best regards,
> Radu
>