Posted to users@kafka.apache.org by Bob Cotton <bc...@rallydev.com> on 2015/06/17 23:38:44 UTC

Broker ISR confusion with soft-failed broker

3-node Kafka cluster: 0.8.2.1
3-node ZooKeeper ensemble: 3.4.6

We experienced a soft failure of one of our brokers (#2). The process
was still running, but it was not generating any logs and was not
responding to JMX queries, etc.

While this was occurring, several consumers were unable to read from
certain partitions, even though those partitions have a replication
factor of 3.

I have all the server, controller and state-change logs from the event, and
am trying to filter out the relevant information.

But this sequence of ISR events for one partition struck me (filtered to
just kafka.cluster.Partition). Other partitions show the same pattern.

"_time",sourcetype,host,"_raw"
"2015-06-05T07:31:47.878-0600","kafka_server_log","qd-kafka8-01","[2015-06-05
07:31:47,878] INFO Partition [birdseed-user-stream,5] on broker 1:
Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 3,1
(kafka.cluster.Partition)"
"2015-06-05T07:31:47.884-0600","kafka_server_log","qd-kafka8-01","[2015-06-05
07:31:47,884] INFO Partition [birdseed-user-stream,5] on broker 1: Cached
zkVersion [537] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)"
"2015-06-05T07:31:57.934-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:31:57,934] INFO Partition [birdseed-user-stream,5] on broker 3:
Shrinking ISR for partition [birdseed-user-stream,5] from 3,2 to 3
(kafka.cluster.Partition)"
"2015-06-05T07:31:59.454-0600","kafka_server_log","qd-kafka8-01","[2015-06-05
07:31:59,454] INFO Partition [birdseed-user-stream,5] on broker 1:
Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 1
(kafka.cluster.Partition)"
"2015-06-05T07:31:59.462-0600","kafka_server_log","qd-kafka8-01","[2015-06-05
07:31:59,462] INFO Partition [birdseed-user-stream,5] on broker 1: Cached
zkVersion [537] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)"
"2015-06-05T07:32:07.387-0600","kafka_server_log","qd-kafka8-01","[2015-06-05
07:32:07,387] INFO Partition [birdseed-user-stream,5] on broker 1:
Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 2,1
(kafka.cluster.Partition)"
"2015-06-05T07:32:07.397-0600","kafka_server_log","qd-kafka8-01","[2015-06-05
07:32:07,397] INFO Partition [birdseed-user-stream,5] on broker 1: Cached
zkVersion [537] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)"
"2015-06-05T07:32:33.619-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:32:33,619] INFO Partition [birdseed-user-stream,5] on broker 3:
Expanding ISR for partition [birdseed-user-stream,5] from 3 to 3,2
(kafka.cluster.Partition)"
"2015-06-05T07:33:12.038-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:33:12,038] INFO Partition [birdseed-user-stream,5] on broker 3:
Expanding ISR for partition [birdseed-user-stream,5] from 3,2 to 3,2,1
(kafka.cluster.Partition)"
"2015-06-05T07:33:38.959-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:33:38,959] INFO Partition [birdseed-user-stream,5] on broker 3:
Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 3,1
(kafka.cluster.Partition)"
"2015-06-05T07:34:47.266-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:34:47,266] INFO Partition [birdseed-user-stream,5] on broker 3:
Expanding ISR for partition [birdseed-user-stream,5] from 3,1 to 3,1,2
(kafka.cluster.Partition)"
"2015-06-05T07:37:00.584-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:37:00,584] INFO Partition [birdseed-user-stream,5] on broker 3:
Shrinking ISR for partition [birdseed-user-stream,5] from 3,1,2 to 3
(kafka.cluster.Partition)"
"2015-06-05T07:37:00.590-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:37:00,590] INFO Partition [birdseed-user-stream,5] on broker 3: Cached
zkVersion [543] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)"
"2015-06-05T07:37:09.801-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:37:09,801] INFO Partition [birdseed-user-stream,5] on broker 3:
Shrinking ISR for partition [birdseed-user-stream,5] from 3,1,2 to 3
(kafka.cluster.Partition)"
"2015-06-05T07:37:09.804-0600","kafka_server_log","qd-kafka8-03","[2015-06-05
07:37:09,804] INFO Partition [birdseed-user-stream,5] on broker 3: Cached
zkVersion [543] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)"

The last two entries repeated every 10 seconds until broker #2 was bounced.
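
In case it helps anyone doing the same filtering, here is a rough Python
sketch of how the ISR transitions in the excerpt above could be pulled out
of the raw log text. The regular expressions are only inferred from the
lines quoted in this message, not from the Kafka source, and the names
parse_isr_events and rejected_updates are made up for illustration. It also
assumes the entries are already sorted by time, so that a "Cached zkVersion"
entry directly follows the ISR change it refers to.

```python
import re

# Common prefix of the kafka.cluster.Partition entries quoted above.
# \s+ is used throughout because the quoted lines are wrapped by the mailer.
PREFIX = (
    r"\[(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2},\d{3})\]\s+INFO\s+"
    r"Partition\s+\[(?P<topic>[^,\]]+),(?P<partition>\d+)\]\s+"
    r"on\s+broker\s+(?P<broker>\d+):\s+"
)

# Either an ISR shrink/expand, or a skipped update due to a zkVersion mismatch.
EVENT = re.compile(
    PREFIX
    + r"(?:(?P<action>Shrinking|Expanding)\s+ISR\s+for\s+partition\s+\[[^\]]+\]\s+"
    r"from\s+(?P<old>[\d,]+)\s+to\s+(?P<new>[\d,]+)"
    r"|Cached\s+zkVersion\s+\[(?P<zk_version>\d+)\]\s+not\s+equal)"
)


def _ids(csv_ids):
    """Turn '3,2,1' into [3, 2, 1]."""
    return [int(x) for x in csv_ids.split(",") if x]


def parse_isr_events(raw_text):
    """Return one dict per ISR-related entry, in the order found in raw_text."""
    events = []
    for m in EVENT.finditer(raw_text):
        event = {
            "date": m.group("date"),
            "time": m.group("time"),
            "broker": int(m.group("broker")),
            "topic": m.group("topic"),
            "partition": int(m.group("partition")),
        }
        if m.group("action"):
            event["kind"] = m.group("action").lower()
            event["old_isr"] = _ids(m.group("old"))
            event["new_isr"] = _ids(m.group("new"))
        else:
            event["kind"] = "zk_version_mismatch"
            event["zk_version"] = int(m.group("zk_version"))
        events.append(event)
    return events


def rejected_updates(events):
    """ISR changes directly followed by a zkVersion mismatch on the same
    broker and partition, i.e. changes that were logged but then skipped."""
    rejected = []
    for current, following in zip(events, events[1:]):
        if (
            current["kind"] in ("shrinking", "expanding")
            and following["kind"] == "zk_version_mismatch"
            and following["broker"] == current["broker"]
            and following["topic"] == current["topic"]
            and following["partition"] == current["partition"]
        ):
            rejected.append(current)
    return rejected


# First two entries from the excerpt above, with the CSV wrapper removed.
SAMPLE = """[2015-06-05
07:31:47,878] INFO Partition [birdseed-user-stream,5] on broker 1:
Shrinking ISR for partition [birdseed-user-stream,5] from 3,2,1 to 3,1
(kafka.cluster.Partition)
[2015-06-05
07:31:47,884] INFO Partition [birdseed-user-stream,5] on broker 1: Cached
zkVersion [537] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
"""

if __name__ == "__main__":
    for event in rejected_updates(parse_isr_events(SAMPLE)):
        print(event["time"], "broker", event["broker"],
              event["old_isr"], "->", event["new_isr"], "(skipped)")
```

Run over the full server logs, something like this should make it easier to
see which broker's view of the ISR was actually written and which attempts
were skipped.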

We've seen this twice now in production, and unfortunately did not get a
jstack from the frozen VM.

What more information can I provide to help with this?

Thanks
Bob Cotton
Rally Software