Posted to dev@kafka.apache.org by "Arup Malakar (JIRA)" <ji...@apache.org> on 2014/05/22 22:49:03 UTC

[jira] [Created] (KAFKA-1466) Kafka server is hung after throwing "Attempt to swap the new high watermark file with the old one failed"

Arup Malakar created KAFKA-1466:
-----------------------------------

             Summary: Kafka server is hung after throwing "Attempt to swap the new high watermark file with the old one failed"
                 Key: KAFKA-1466
                 URL: https://issues.apache.org/jira/browse/KAFKA-1466
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Arup Malakar


We have a Kafka cluster of four nodes. The cluster went down after one of the nodes logged the FATAL error "Attempt to swap the new high watermark file with the old one failed". I saw the following messages in the log file of the failed node:
{code}
2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file with the old one failed
2014-05-21 23:19:44 INFO  [Thread-1]: KafkaServer:67 - [Kafka Server 4], Shutting down
2014-05-21 23:19:44 INFO  [Thread-1]: KafkaZooKeeper:67 - Closing zookeeper client...
2014-05-21 23:19:44 INFO  [ZkClient-EventThread-21-zoo-c2n1.us-east-1.ooyala.com,zoo-c2n2.us-east-1.ooyala.com,zoo-c2n3.us-east-1.ooyala.com,zoo-c2n4.us-east-1.ooyala.com,zoo-c2n5.us-east-1.ooyala.com]: ZkEventThread:82 - Terminate ZkClient event thread.
2014-05-21 23:19:44 INFO  [main-EventThread]: ClientCnxn:521 - EventThread shut down
2014-05-21 23:19:44 INFO  [Thread-1]: ZooKeeper:544 - Session: 0x1456b562865b172 closed
2014-05-21 23:19:44 INFO  [kafka-processor-9092-0]: Processor:67 - Closing socket connection to /10.245.173.136.
2014-05-21 23:19:44 INFO  [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutting down
2014-05-21 23:19:44 INFO  [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutdown completed
2014-05-21 23:19:44 INFO  [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutting down
2014-05-21 23:19:44 INFO  [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutted down completely
2014-05-21 23:19:44 INFO  [Thread-1]: KafkaScheduler:67 - Shutdown Kafka scheduler
2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaManager:67 - [Replica Manager on Broker 4]: Shut down
2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutting down
2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutting down
2014-05-21 23:19:45 INFO  [ReplicaFetcherThread-0-3]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Stopped
2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutdown completed
2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutting down
2014-05-21 23:19:45 INFO  [ReplicaFetcherThread-0-2]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Stopped
2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutdown completed
2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutdown completed
{code} 
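For context, the high watermark checkpoint in 0.8.x appears to persist offsets by writing to a temporary file and then renaming it over the old one; the FATAL above is logged when that rename fails. Below is a minimal sketch of that write-then-swap pattern, not Kafka's actual implementation; the helper and the file names are hypothetical:

{code}
import java.io.{BufferedWriter, File, FileWriter}

// Sketch of a write-then-swap checkpoint (hypothetical helper, assumed file names).
def writeHighWatermarks(dir: String, highWatermarks: Map[String, Long]): Unit = {
  val tmpFile    = new File(dir, "highwatermark-checkpoint.tmp") // assumed name
  val targetFile = new File(dir, "highwatermark-checkpoint")     // assumed name

  // 1. Write all high watermarks to the temporary file and flush it to disk.
  val writer = new BufferedWriter(new FileWriter(tmpFile))
  try {
    highWatermarks.foreach { case (topicPartition, hw) =>
      writer.write(s"$topicPartition $hw")
      writer.newLine()
    }
    writer.flush()
  } finally {
    writer.close()
  }

  // 2. Swap the temporary file into place. If the rename is rejected, the broker
  //    has no durable record of its high watermarks.
  if (!tmpFile.renameTo(targetFile))
    throw new java.io.IOException(
      s"Attempt to swap ${tmpFile.getAbsolutePath} with ${targetFile.getAbsolutePath} failed")
}
{code}

A rename like this can fail for filesystem-level reasons (permissions, a full disk, the file being held open), so checking the broker's data directory on the failed node might help narrow down the cause.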

I noticed that after this error was logged there were no further entries in the log file, but the process was still alive, so I assume it was hung.

The other nodes in the cluster were not able to recover from this error. The partitions owned by the failed node had their leader set to -1:

{code}
topic: test_topic	partition: 8	leader: -1	replicas: 4	isr:
{code}
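To confirm what the brokers see, the partition state can also be read straight from ZooKeeper; in the 0.8-era layout the leader is recorded under /brokers/topics/<topic>/partitions/<n>/state. The sketch below uses the plain ZooKeeper client; the host, port, and the assumption that this znode layout applies here are mine:

{code}
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

// Hedged sketch: read the partition state znode to see which broker (if any) is leader.
// Connection string and port are assumptions; the znode path follows the 0.8-era layout.
object CheckPartitionLeader {
  def main(args: Array[String]): Unit = {
    val zk = new ZooKeeper("zoo-c2n1.us-east-1.ooyala.com:2181", 30000, new Watcher {
      override def process(event: WatchedEvent): Unit = () // no-op watcher
    })
    try {
      val path = "/brokers/topics/test_topic/partitions/8/state"
      // The JSON payload contains "leader":-1 when no broker currently leads the partition.
      val state = new String(zk.getData(path, false, null), "UTF-8")
      println(s"$path -> $state")
    } finally {
      zk.close()
    }
  }
}
{code}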

Meanwhile, the other nodes were continuously logging the following error:
{code}
2014-05-22 20:03:28 ERROR [kafka-request-handler-7]: KafkaApis:102 - [KafkaApi-3] Error while fetching metadata for partition [test_topic,8]
kafka.common.LeaderNotAvailableException: Leader not available for partition [test_topic,8]
	at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:474)
	at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:462)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
	at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
	at scala.collection.immutable.List.foreach(List.scala:45)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
	at scala.collection.immutable.List.map(List.scala:45)
	at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:462)
	at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:458)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
	at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
	at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
	at scala.collection.immutable.HashSet.map(HashSet.scala:32)
	at kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:458)
	at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
	at java.lang.Thread.run(Thread.java:744)
{code}

I had to restart the failed Kafka node to recover the cluster. We expect the Kafka cluster to keep working even when a single node is down. Any clue as to what went wrong here?



