Posted to dev@kafka.apache.org by "Joel Koshy (JIRA)" <ji...@apache.org> on 2014/05/23 22:53:02 UTC

[jira] [Resolved] (KAFKA-1466) Kafka server is hung after throwing "Attempt to swap the new high watermark file with the old one failed"

     [ https://issues.apache.org/jira/browse/KAFKA-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Koshy resolved KAFKA-1466.
-------------------------------

    Resolution: Cannot Reproduce

Not sure if it would help, but perhaps you can include your file system, disk information, OS, JDK, etc. here as well. Were there any other weird errors in the logs preceding the fatal message? Anything odd in the kernel logs around that time? Also, the process should have exited - if it happens again and the PID is still around, can you take a thread dump and post that output?
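For reference, one way to capture that thread dump, assuming the JDK tools are on the path on the broker host:

{code}
# Replace <pid> with the hung broker's process id.
jstack -l <pid> > kafka-threaddump.txt

# Alternative if jstack cannot attach: SIGQUIT makes the JVM print the dump to the broker's stdout.
kill -3 <pid>
{code}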

Anyway, I'm closing this for now - unfortunately there's not enough information to debug further. If you see the issue again, please reopen with the above information.


> Kafka server is hung after throwing "Attempt to swap the new high watermark file with the old one failed"
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1466
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1466
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Arup Malakar
>
> We have a Kafka cluster of four nodes. The cluster went down after one of the nodes threw the following fatal error:
> 2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file with the old one failed
> I saw the following messages in the log file of the failed node:
> {code}
> 2014-05-21 23:19:44 FATAL [highwatermark-checkpoint-thread1]: HighwaterMarkCheckpoint:109 - Attempt to swap the new high watermark file with the old one failed
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaServer:67 - [Kafka Server 4], Shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaZooKeeper:67 - Closing zookeeper client...
> 2014-05-21 23:19:44 INFO  [ZkClient-EventThread-21-zoo-c2n1.us-east-1.ooyala.com,zoo-c2n2.us-east-1.ooyala.com,zoo-c2n3.us-east-1.ooyala.com,zoo-c2n4.us-east-1.ooyala.com,zoo-c2n5.us-east-1.ooyala.com]: ZkEventThread:82 - Terminate ZkClient event thread.
> 2014-05-21 23:19:44 INFO  [main-EventThread]: ClientCnxn:521 - EventThread shut down
> 2014-05-21 23:19:44 INFO  [Thread-1]: ZooKeeper:544 - Session: 0x1456b562865b172 closed
> 2014-05-21 23:19:44 INFO  [kafka-processor-9092-0]: Processor:67 - Closing socket connection to /10.245.173.136.
> 2014-05-21 23:19:44 INFO  [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: SocketServer:67 - [Socket Server on Broker 4], Shutdown completed
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutting down
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaRequestHandlerPool:67 - [Kafka Request Handler on Broker 4], shutted down completely
> 2014-05-21 23:19:44 INFO  [Thread-1]: KafkaScheduler:67 - Shutdown Kafka scheduler
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaManager:67 - [Replica Manager on Broker 4]: Shut down
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutting down
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutting down
> 2014-05-21 23:19:45 INFO  [ReplicaFetcherThread-0-3]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Stopped
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-3], Shutdown completed
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutting down
> 2014-05-21 23:19:45 INFO  [ReplicaFetcherThread-0-2]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Stopped
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherThread:67 - [ReplicaFetcherThread-0-2], Shutdown completed
> 2014-05-21 23:19:45 INFO  [Thread-1]: ReplicaFetcherManager:67 - [ReplicaFetcherManager on broker 4] shutdown completed
> {code} 
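> From the fatal message it looks like the checkpoint is written to a temporary file which is then renamed over the old high watermark file, and that the rename returned false. Below is a minimal sketch of that write-temp-then-rename pattern, only to illustrate the failure mode; it is not the actual Kafka source, and the file names in it are made up:
> {code}
> import java.io.{File, FileWriter}
>
> // Illustration of a write-temp-then-rename checkpoint; not the actual Kafka code.
> object HighWatermarkSwapSketch {
>   def write(dir: File, highWatermarks: Map[String, Long]): Unit = {
>     // Hypothetical file names, for illustration only.
>     val tmpFile = new File(dir, "highwatermark.checkpoint.tmp")
>     val hwFile  = new File(dir, "highwatermark.checkpoint")
>
>     // 1. Write the new high watermarks to the temporary file.
>     val writer = new FileWriter(tmpFile)
>     try {
>       highWatermarks.foreach { case (topicAndPartition, hw) =>
>         writer.write(topicAndPartition + " " + hw + "\n")
>       }
>       writer.flush()
>     } finally {
>       writer.close()
>     }
>
>     // 2. Swap it over the old checkpoint file. File.renameTo returns false (with no
>     //    exception) when this fails, e.g. on permission or file-system problems; that
>     //    is presumably the condition the broker logs as FATAL before shutting down.
>     if (!tmpFile.renameTo(hwFile))
>       sys.error("Attempt to swap the new high watermark file with the old one failed")
>   }
> }
> {code}
> Since renameTo reports only true/false and gives no reason, the file system or kernel logs from around that time are probably the only place that would show why the swap failed.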
> I noticed that after this error was logged there were no further entries in the log file, but the process was still alive, so I assume it was hung.
> The other nodes in the cluster were not able to recover from this error. The partitions owned by the failed node had their leader set to -1:
> {code}
> topic: test_topic	partition: 8	leader: -1	replicas: 4	isr:
> {code}
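> For reference, that partition state can be re-checked with the topic tool. A sketch of the command, assuming a release that ships kafka-topics.sh (on 0.8.0 the older kafka-list-topic.sh prints similar output), with <zookeeper-connect> standing in for the ZooKeeper connect string:
> {code}
> # Describe the topic to see leader, replicas and isr per partition.
> bin/kafka-topics.sh --describe --zookeeper <zookeeper-connect> --topic test_topic
> {code}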
> The other nodes were continuously logging the following errors:
> {code}
> 2014-05-22 20:03:28 ERROR [kafka-request-handler-7]: KafkaApis:102 - [KafkaApi-3] Error while fetching metadata for partition [test_topic,8]
> kafka.common.LeaderNotAvailableException: Leader not available for partition [test_topic,8]
> 	at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:474)
> 	at kafka.server.KafkaApis$$anonfun$17$$anonfun$20.apply(KafkaApis.scala:462)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
> 	at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
> 	at scala.collection.immutable.List.foreach(List.scala:45)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
> 	at scala.collection.immutable.List.map(List.scala:45)
> 	at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:462)
> 	at kafka.server.KafkaApis$$anonfun$17.apply(KafkaApis.scala:458)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
> 	at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
> 	at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
> 	at scala.collection.immutable.HashSet.map(HashSet.scala:32)
> 	at kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:458)
> 	at kafka.server.KafkaApis.handle(KafkaApis.scala:68)
> 	at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
> 	at java.lang.Thread.run(Thread.java:744)
> {code}
> I had to restart the failed Kafka node to recover the cluster. We expect the Kafka cluster to keep working even when a single node is down. Any clue what went wrong here?



--
This message was sent by Atlassian JIRA
(v6.2#6252)