Posted to jira@kafka.apache.org by "Rong Tang (JIRA)" <ji...@apache.org> on 2017/12/19 02:46:00 UTC
[jira] [Comment Edited] (KAFKA-6375) Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.

    [ https://issues.apache.org/jira/browse/KAFKA-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296063#comment-16296063 ] 

Rong Tang edited comment on KAFKA-6375 at 12/19/17 2:45 AM:
------------------------------------------------------------

[~huxi_2b] 

The LeaderAndIsr is not impacted, and the followers worked fine after I restarted the broker. The trouble is that the broker cannot fetch data from the leader.

Here is how the broker runs into trouble:

Just after the broker is started, no fetcher thread has been created yet.
First LeaderAndIsr request: make partitions (1, 2, 3) followers, let's say with leader epoch 1 for all of them. The replica fetcher thread fails to start, so the broker does not fetch data from the leader for any of these 3 followers. [The broker is already in an abnormal state.]

Second LeaderAndIsr request: make partition (2) a follower, with leader epoch still 1. The broker ignores this request because the leader epoch has not changed.

Third LeaderAndIsr request: make partition (3) a follower, with leader epoch 2. No exception happens this time, and the broker starts to fetch data for partition 3 from its leader. [I believe the replica fetcher thread is created successfully here.]

Now the broker has 3 followers, but it fetches data only for partition 3. It does not fetch data for partitions 1 and 2, so partitions 1 and 2 stay out of sync.
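
The sequence above can be modelled with a small Python sketch. This is a toy model for illustration, not Kafka's code: BrokerModel, FetcherStartError, and handle_leader_and_isr are hypothetical names. It assumes, based on the behaviour described above, that the broker records the leader epoch even when the fetcher fails to start, and that only the first start attempt fails.

```python
class FetcherStartError(Exception):
    """Stands in for the exception raised while creating the fetcher thread."""


class BrokerModel:
    """Toy model of a follower broker; not Kafka's actual implementation."""

    def __init__(self, fail_first_fetcher_start=True):
        self.leader_epoch = {}   # partition -> last leader epoch seen
        self.fetching = set()    # partitions that have a running fetcher
        self._fail_next = fail_first_fetcher_start

    def _start_fetcher(self, partitions):
        # Assumption: only the first attempt fails, as in the report.
        if self._fail_next:
            self._fail_next = False
            raise FetcherStartError("Unable to establish loopback connection")
        self.fetching.update(partitions)

    def handle_leader_and_isr(self, request):
        """request maps partition -> leader epoch; returns partitions acted on."""
        changed = [p for p, epoch in request.items()
                   if epoch > self.leader_epoch.get(p, -1)]
        # Assumption: the epoch is recorded before the fetcher is started,
        # so a failed start is never retried for the same epoch.
        for p in changed:
            self.leader_epoch[p] = request[p]
        if changed:
            try:
                self._start_fetcher(changed)
            except FetcherStartError:
                pass  # the partitions are left without a fetcher
        return changed


broker = BrokerModel()
broker.handle_leader_and_isr({1: 1, 2: 1, 3: 1})  # fetcher start fails
broker.handle_leader_and_isr({2: 1})               # ignored: epoch unchanged
broker.handle_leader_and_isr({3: 2})               # epoch advanced: fetcher starts

# Followers that have no fetcher and therefore never catch up.
stuck = sorted(set(broker.leader_epoch) - broker.fetching)
print(stuck)  # [1, 2]
```

In this model, partitions 1 and 2 remain without a fetcher until a request with a higher leader epoch arrives for them, which matches what I observed.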






> Follower replicas can never catch up to be ISR due to creating ReplicaFetcherThread failed.
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6375
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6375
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: Windows,  23 brokers KafkaCluster
>            Reporter: Rong Tang
>
> Hi, I ran into a case where, on one broker, the out-of-sync replicas never catch up.
> When the broker starts up, it receives LeaderAndIsr requests from the controller, which call createFetcherThread. The thread creation failed with the exceptions below.
> After that, there is no fetcher for these follower replicas, and they stay out of sync forever unless the broker later receives a LeaderAndIsr request with a higher leader epoch. The broker had 260 out of 330 replicas out of sync for one day, until I restarted it.
> Restarting the broker mitigates the issue.
> I have 2 questions.
> First, why did creating the new ReplicaFetcherThread fail?
> *Second, should Kafka do something to fail over, instead of leaving the broker in an abnormal state?*
> It is a 23-broker Kafka cluster running on Windows. Each broker has 330 replicas.
> [2017-12-13 16:29:21,317] ERROR Error on broker 1000 while processing LeaderAndIsr request with correlationId 1 received from controller 427703487 epoch 22 (state.change.logger)
> org.apache.kafka.common.KafkaException: java.io.IOException: Unable to establish loopback connection
> 	at org.apache.kafka.common.network.Selector.<init>(Selector.java:124)
> 	at kafka.server.ReplicaFetcherThread.<init>(ReplicaFetcherThread.scala:87)
> 	at kafka.server.ReplicaFetcherManager.createFetcherThread(ReplicaFetcherManager.scala:35)
> 	at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:83)
> 	at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:78)
> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> 	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
> 	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> 	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> 	at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:78)
> 	at kafka.server.ReplicaManager.makeFollowers(ReplicaManager.scala:869)
> 	at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:689)
> 	at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:149)
> 	at kafka.server.KafkaApis.handle(KafkaApis.scala:83)
> 	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Unable to establish loopback connection
> 	at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:94)
> 	at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:61)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:171)
> 	at sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:50)
> 	at java.nio.channels.Pipe.open(Pipe.java:155)
> 	at sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:127)
> 	at sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:44)
> 	at java.nio.channels.Selector.open(Selector.java:227)
> 	at org.apache.kafka.common.network.Selector.<init>(Selector.java:122)
> 	... 16 more
> Caused by: java.net.ConnectException: Connection timed out: connect
> 	at sun.nio.ch.Net.connect0(Native Method)
> 	at sun.nio.ch.Net.connect(Net.java:454)
> 	at sun.nio.ch.Net.connect(Net.java:446)
> 	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
> 	at java.nio.channels.SocketChannel.open(SocketChannel.java:189)
> 	at sun.nio.ch.PipeImpl$Initializer$LoopbackConnector.run(PipeImpl.java:127)
> 	at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:76)
> 	... 25 more



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)