Posted to users@kafka.apache.org by Sa Li <sa...@gmail.com> on 2015/01/13 19:03:38 UTC
network connection between kafka nodes

Hello, Kafka experts

I have a production cluster with three nodes (.100, .101, .102). I am
using a C# producer to publish data to the Kafka brokers. It worked for a
while, but then started losing its connection to two of the nodes in the
cluster. Here is the C# producer error:

[2015-01-13 01:49:49,786] ERROR
[ConsumerFetcherThread-console-consumer-52088_vagrant-ubuntu-trusty-64-1421113533029-20c40ebf-0-101],
Error for partition [PofApiTest77,5] to broker 101:class
kafka.common.NotLeaderForPartitionException
(kafka.consumer.ConsumerFetcherThread)

To reproduce this issue, I ran a producer test from a Vagrant box to send
data, and this is what I got:
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance \
  test-rep-three 50000000000 100 -1 acks=1 \
  bootstrap.servers=10.100.50.100:9092,10.100.50.101:9092,10.100.50.102:9092 \
  buffer.memory=67108864 batch.size=8196
.
.
.
536403 records sent, 107259.1 records/sec (10.23 MB/sec), 3993.0 ms avg
latency, 11306.0 max latency.
[2015-01-13 17:49:44,055] WARN Error in I/O with harmful-jar.master/10.100.50.102 (org.apache.kafka.common.network.Selector)
java.io.EOFException
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:242)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
        at java.lang.Thread.run(Thread.java:745)
[2015-01-13 17:49:44,059] WARN Error in I/O with harmful-jar.master/10.100.50.102 (org.apache.kafka.common.network.Selector)
java.io.EOFException
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:242)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
        at java.lang.Thread.run(Thread.java:745)


[2015-01-13 17:52:38,384] WARN Error in I/O with voluminous-mass.master/10.100.50.101 (org.apache.kafka.common.network.Selector)
java.io.EOFException
        at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:62)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:242)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:191)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
        at java.lang.Thread.run(Thread.java:745)
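
For anyone who wants to rule out basic reachability first, here is a minimal Python sketch that tries to open a TCP connection to each broker from the bootstrap.servers list above. The helper names (can_connect, check_brokers) and the 3-second timeout are my own assumptions, not part of any Kafka tooling. A successful connect only shows that the port accepts connections at that moment; it does not explain why a broker later closes an established connection, which is what the EOFException reports.

```python
import socket

# Broker addresses copied from the bootstrap.servers list in the command above.
BROKERS = [
    ("10.100.50.100", 9092),
    ("10.100.50.101", 9092),
    ("10.100.50.102", 9092),
]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_brokers(brokers, timeout=3.0):
    """Map each "host:port" string to whether a TCP connection succeeded."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in brokers
    }


# Example use, run from the machine that hosts the producer:
#   for addr, ok in check_brokers(BROKERS).items():
#       print(addr, "reachable" if ok else "UNREACHABLE")
```

Running the check in a loop while the producer test is active would show whether the brokers become unreachable at the TCP level around the time the warnings appear.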

It seems the connection was cut off. I tailed kafka/logs/state-change.log:

[2015-01-13 17:49:49,028] TRACE Broker 102 received LeaderAndIsr request
(LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:68,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,101,100)
correlation id 7 from controller 101 epoch 1781 for partition
[PofApiTest77,5] (state.change.logger)
[2015-01-13 17:49:49,030] TRACE Broker 102 handling LeaderAndIsr request
correlationId 7 from controller 101 epoch 1781 starting the become-leader
transition for partition [PofApiTest77,5] (state.change.logger)
[2015-01-13 17:49:49,032] TRACE Broker 102 stopped fetchers as part of
become-leader request from controller 101 epoch 1781 with correlation id 7
for partition [PofApiTest77,5] (state.change.logger)
[2015-01-13 17:49:49,040] TRACE Broker 102 completed LeaderAndIsr request
correlationId 7 from controller 101 epoch 1781 for the become-leader
transition for partition [PofApiTest77,5] (state.change.logger)
[2015-01-13 17:49:49,042] TRACE Broker 102 cached leader info
(LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:68,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,101,100)
for partition [PofApiTest77,5] in response to UpdateMetadata request sent
by controller 101 epoch 1781 with correlation id 7 (state.change.logger)
[2015-01-13 17:49:49,045] TRACE Broker 102 received LeaderAndIsr request
(LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:529,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,100,101)
correlation id 8 from controller 101 epoch 1781 for partition
[test-rep-three,5] (state.change.logger)
[2015-01-13 17:49:49,045] TRACE Broker 102 handling LeaderAndIsr request
correlationId 8 from controller 101 epoch 1781 starting the become-leader
transition for partition [test-rep-three,5] (state.change.logger)
[2015-01-13 17:49:49,048] TRACE Broker 102 stopped fetchers as part of
become-leader request from controller 101 epoch 1781 with correlation id 8
for partition [test-rep-three,5] (state.change.logger)
[2015-01-13 17:49:49,049] TRACE Broker 102 completed LeaderAndIsr request
correlationId 8 from controller 101 epoch 1781 for the become-leader
transition for partition [test-rep-three,5] (state.change.logger)
[2015-01-13 17:49:49,051] TRACE Broker 102 cached leader info
(LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:529,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,100,101)
for partition [test-rep-three,5] in response to UpdateMetadata request sent
by controller 101 epoch 1781 with correlation id 8 (state.change.logger)
[2015-01-13 17:49:49,053] TRACE Broker 102 received LeaderAndIsr request
(LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:528,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,101,100)
correlation id 9 from controller 101 epoch 1781 for partition
[test-rep-three,2] (state.change.logger)
[2015-01-13 17:49:49,053] TRACE Broker 102 handling LeaderAndIsr request
correlationId 9 from controller 101 epoch 1781 starting the become-leader
transition for partition [test-rep-three,2] (state.change.logger)
[2015-01-13 17:49:49,054] TRACE Broker 102 stopped fetchers as part of
become-leader request from controller 101 epoch 1781 with correlation id 9
for partition [test-rep-three,2] (state.change.logger)
[2015-01-13 17:49:49,055] TRACE Broker 102 completed LeaderAndIsr request
correlationId 9 from controller 101 epoch 1781 for the become-leader
transition for partition [test-rep-three,2] (state.change.logger)
[2015-01-13 17:49:49,057] TRACE Broker 102 cached leader info
(LeaderAndIsrInfo:(Leader:102,ISR:101,100,102,LeaderEpoch:528,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,101,100)
for partition [test-rep-three,2] in response to UpdateMetadata request sent
by controller 101 epoch 1781 with correlation id 9 (state.change.logger)
[2015-01-13 17:49:49,058] TRACE Broker 102 received LeaderAndIsr request
(LeaderAndIsrInfo:(Leader:102,ISR:100,101,102,LeaderEpoch:68,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,100,101)
correlation id 10 from controller 101 epoch 1781 for partition
[PofApiTest77,2] (state.change.logger)
[2015-01-13 17:49:49,058] TRACE Broker 102 handling LeaderAndIsr request
correlationId 10 from controller 101 epoch 1781 starting the become-leader
transition for partition [PofApiTest77,2] (state.change.logger)
[2015-01-13 17:49:49,058] TRACE Broker 102 stopped fetchers as part of
become-leader request from controller 101 epoch 1781 with correlation id 10
for partition [PofApiTest77,2] (state.change.logger)
[2015-01-13 17:49:49,059] TRACE Broker 102 completed LeaderAndIsr request
correlationId 10 from controller 101 epoch 1781 for the become-leader
transition for partition [PofApiTest77,2] (state.change.logger)
[2015-01-13 17:49:49,060] TRACE Broker 102 cached leader info
(LeaderAndIsrInfo:(Leader:102,ISR:100,101,102,LeaderEpoch:68,ControllerEpoch:1781),ReplicationFactor:3),AllReplicas:102,100,101)
for partition [PofApiTest77,2] in response to UpdateMetadata request sent
by controller 101 epoch 1781 with correlation id 10 (state.change.logger)
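
To make the entries above easier to read, here is a rough Python sketch that extracts the leader, ISR, and LeaderEpoch for each partition from the "received LeaderAndIsr request" lines. The regular expression is written against the log format shown in this message only, and the function name leader_changes is hypothetical; other Kafka versions may format these lines differently.

```python
import re
from collections import defaultdict

# Matches the "received LeaderAndIsr request" entries as they appear above.
# DOTALL lets the pattern span line breaks in case the log text was wrapped.
PATTERN = re.compile(
    r"Broker (?P<broker>\d+) received LeaderAndIsr request\s+"
    r"\(LeaderAndIsrInfo:\(Leader:(?P<leader>\d+),ISR:(?P<isr>[\d,]+),"
    r"LeaderEpoch:(?P<epoch>\d+),.*?"
    r"for partition\s+\[(?P<topic>[^,\]]+),(?P<partition>\d+)\]",
    re.DOTALL,
)


def leader_changes(log_text):
    """Group (leader, leader_epoch, isr) tuples by (topic, partition)."""
    result = defaultdict(list)
    for m in PATTERN.finditer(log_text):
        key = (m.group("topic"), int(m.group("partition")))
        isr = [int(b) for b in m.group("isr").split(",")]
        result[key].append((int(m.group("leader")), int(m.group("epoch")), isr))
    return dict(result)


# Example use against the real file:
#   with open("kafka/logs/state-change.log") as f:
#       for key, changes in leader_changes(f.read()).items():
#           print(key, changes)
```

Comparing the LeaderEpoch values per partition over time would show how often leadership has been moving between brokers.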


Has anyone seen a similar issue with lost network connections between nodes?

Thanks

-- 

Alec Li