Posted to users@kafka.apache.org by XiaoChuan Yu <xi...@kik.com> on 2017/11/08 17:27:20 UTC

Kafka 0.10.1.1 clients crash after node restart

Hi,

We ran into a problem where clients crash after we restarted a node
(broker 1001) by stopping it with kill -15 and then starting it again.
Two of the brokers, including 1001, also can't sync with each other.

Is this a known issue and if so, is it fixed in later versions?

Details:

We see logs similar to the following being spammed in 1001's log, for each
topic for which it is the leader:

[2017-11-08 16:13:14,880] ERROR [ReplicaFetcherThread-0-1002], Error for
partition [some-topic,58] to broker
1002:org.apache.kafka.common.errors.NotLeaderForPartitionException: This
server is not the leader for that topic-partition.
(kafka.server.ReplicaFetcherThread)

Looking at the topic metadata, we see that 1002 cannot seem to sync up with
1001, or the other way around:

Topic: x  Partition: 22   Leader: 1002    Replicas: 1002,1001,1006    Isr: 1002,1006
Topic: x  Partition: 30   Leader: 1001    Replicas: 1001,1002,1005    Isr: 1005,1001
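
To make the gap explicit, here is a minimal sketch that reads describe-style lines like the two above and lists which assigned replicas are missing from the ISR. The function name out_of_sync_replicas and the regular expression are illustrative assumptions, not part of any Kafka tooling, and the only input is the two partitions quoted in this message.

```python
import re

# The two partitions quoted in this message, one partition per line.
DESCRIBE_OUTPUT = """\
Topic: x  Partition: 22   Leader: 1002    Replicas: 1002,1001,1006    Isr: 1002,1006
Topic: x  Partition: 30   Leader: 1001    Replicas: 1001,1002,1005    Isr: 1005,1001
"""

# Hypothetical pattern for this line shape; \s+ also tolerates a line
# break before "Isr:" in case the mail client wrapped the output.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def out_of_sync_replicas(describe_text):
    """Return {(topic, partition): [replica ids that are not in the ISR]}."""
    missing = {}
    for match in LINE_RE.finditer(describe_text):
        replicas = [int(r) for r in match["replicas"].split(",") if r]
        isr = {int(r) for r in match["isr"].split(",") if r}
        # Keep the assignment order so the output mirrors the Replicas column.
        lagging = [r for r in replicas if r not in isr]
        if lagging:
            missing[(match["topic"], int(match["partition"]))] = lagging
    return missing


if __name__ == "__main__":
    for (topic, partition), lagging in sorted(out_of_sync_replicas(DESCRIBE_OUTPUT).items()):
        print(f"{topic}-{partition}: not in ISR: {lagging}")
```

For the two partitions shown, this reports 1001 as missing where 1002 leads (partition 22) and 1002 as missing where 1001 leads (partition 30), which is the mutual failure to sync described above.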

Producer settings are default.
On the client side I see these logs (we use Samza):

2017-11-08 10:54:55.749 WARN  o.a.k.c.producer.internals.Sender
[kafka-producer-network-thread | samza_producer] - Got error produce
response with correlation id 4788574 on topic-partition some-topic-8,
retrying (2147483646 attempts left). Error: NOT_LEADER_FOR_PARTITION
...
2017-11-08 10:55:28.187 WARN  o.a.k.c.producer.internals.Sender
[kafka-producer-network-thread | samza_producer-job] - Got error produce
response with correlation id 4787666 on topic-partition  some-topic-8,
retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
...
(these 2 log lines below are from Samza's Kafka client code)
2017-11-08 10:55:28.189 ERROR o.a.s.s.kafka.KafkaSystemProducer
[kafka-producer-network-thread | samza_producer-job] - Closing the producer
because of an exception in callback:
org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s)
for  some-topic-8 due to 44577 ms has passed since batch creation plus
linger time
2017-11-08 10:55:30.135 ERROR o.a.s.s.kafka.KafkaSystemProducer
[kafka-producer-network-thread | samza_producer-job] - Closing the producer
because of an exception in callback:
java.lang.IllegalStateException: Producer is closed forcefully.
        at
org.apache.kafka.clients.producer.internals.RecordAccumulator.abortBatches(RecordAccumulator.java:513)
        at
org.apache.kafka.clients.producer.internals.RecordAccumulator.abortIncompleteBatches(RecordAccumulator.java:493)
        at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:156)
        at java.lang.Thread.run(Thread.java:748)

We have 6 nodes running 0.10.1.1 with settings:
broker.id.generation.enable=true
delete.topic.enable=true
log.dirs=/data/kafka
num.partitions=60
default.replication.factor=3
min.insync.replicas=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/kafka1
zookeeper.connection.timeout.ms=6000
auto.create.topics.enable=true
broker.rack=us-east-1
num.io.threads=1
Brokers are running OpenJDK 1.8 with JVM settings copied from
https://kafka.apache.org/0101/documentation.html#java.

The client is using org.apache.kafka:kafka-clients:jar:0.10.1.1.
Producer and consumer settings are default. Topic configs are default.
The load is fairly low. There are 74 topics with 60 partitions each.

Thanks,
Xiaochuan Yu