Posted to users@kafka.apache.org by Pradeep Gollakota <pr...@gmail.com> on 2015/10/02 00:25:38 UTC

Kafka Recovery Errors

Hi All,

We’ve been dealing with a Kafka outage for a few days. In an attempt to
recover, we’ve shut down all of our producers and consumers, so the only
connections to/from the brokers are from other brokers and ZooKeeper.

The symptoms we’re seeing are:

   1. Uncaught exception on the kafka-network-threads. We’ve already
      filed a JIRA for this at
      https://issues.apache.org/jira/browse/KAFKA-2595. When this happens,
      the network threads die and the number of CLOSE_WAIT sockets on the
      broker steadily increases. A minimal sketch of the failing pattern,
      as we read the stack trace, follows the trace below.

   [2015-10-01 22:01:33,380] ERROR Uncaught exception in thread
'kafka-network-thread-9092-6': (kafka.utils.Utils$)
   java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:347)
        at scala.None$.get(Option.scala:345)
        at kafka.network.ConnectionQuotas.dec(SocketServer.scala:524)
        at kafka.network.AbstractServerThread.close(SocketServer.scala:165)
        at kafka.network.AbstractServerThread.close(SocketServer.scala:157)
        at kafka.network.Processor.close(SocketServer.scala:374)
        at kafka.network.Processor.maybeCloseOldestConnection(SocketServer.scala:500)
        at kafka.network.Processor.run(SocketServer.scala:361)
        at java.lang.Thread.run(Thread.java:745)
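
   Our reading of the trace: ConnectionQuotas.dec looks up a per-address
   connection count and calls .get on the Option without checking it.
   Below is a minimal sketch of that pattern as we understand it -- a
   hypothetical reconstruction, not the actual broker source:

      import java.net.InetAddress
      import scala.collection.mutable

      // Hypothetical reconstruction of the pattern behind the None.get;
      // not the real kafka.network.ConnectionQuotas.
      class ConnectionQuotasSketch {
        private val counts = mutable.Map.empty[InetAddress, Int]

        def inc(address: InetAddress): Unit = synchronized {
          counts(address) = counts.getOrElse(address, 0) + 1
        }

        def dec(address: InetAddress): Unit = synchronized {
          // None.get throws java.util.NoSuchElementException when the
          // address has no entry (never counted, or already removed).
          val count = counts.get(address).get
          if (count == 1) counts.remove(address)
          else counts(address) = count - 1
        }
      }

      object Repro extends App {
        val quotas = new ConnectionQuotasSketch
        // Decrementing an address that was never incremented reproduces
        // "java.util.NoSuchElementException: None.get".
        quotas.dec(InetAddress.getByName("10.21.220.32"))
      }

   Since the exception is uncaught, the Processor thread dies and its open
   sockets are never closed from the broker side, which would explain the
   accumulating CLOSE_WAITs.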

   2. Several different manifestations of SocketTimeoutException; a quick
      timeout sanity check follows the log lines below.

   [2015-10-01 20:10:53,770] INFO Reconnect due to socket error:
java.net.SocketTimeoutException (kafka.consumer.SimpleConsumer)
   [2015-10-01 20:37:10,839] WARN [Kafka Server 0], Error during
controlled shutdown, possibly because leader movement took longer than
the configured socket.timeout.ms: null (kafka.server.KafkaServer)
   [2015-09-30 21:51:27,989] WARN [ReplicaFetcherThread-1-2], Error in
fetch Name: FetchRequest; Version: 0; CorrelationId: 7; ClientId:
ReplicaFetcherThread-1-2; ReplicaId: 0; MaxWait: 120000 ms; MinBytes:
1024 bytes; RequestInfo: [lia.stage.raw_events,11] ->
PartitionFetchInfo(1105814736,52428800),[lia.stage.bundles,6] ->
PartitionFetchInfo(24395234,52428800),[lia.stage.bundles,14] ->
PartitionFetchInfo(14085756,52428800),[lia.stage.activities.anonymized.json,16]
-> PartitionFetchInfo(283942,52428800),[lia.stage.raw_events,3] ->
PartitionFetchInfo(181370702,52428800),[lia.stage.activities,16] ->
PartitionFetchInfo(283965,52428800),[lia.stage.activities.anonymized.json,8]
-> PartitionFetchInfo(284360,52428800),[lia.stage.activities.anonymized.json,0]
-> PartitionFetchInfo(284542,52428800),[lia.stage.activities,0] ->
PartitionFetchInfo(284560,52428800),[lia.stage.raw_events,19] ->
PartitionFetchInfo(277598692,52428800),[lia.stage.activities,8] ->
PartitionFetchInfo(284404,52428800). Possible cause:
java.net.SocketTimeoutException (kafka.server.ReplicaFetcherThread)
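
   One thing we noticed: the fetch requests above use MaxWait = 120000 ms,
   and the leader can legitimately hold a fetch for that long before
   replying. If replica.socket.timeout.ms is not larger than that, the
   fetcher can time out against a perfectly healthy leader. A trivial
   check (30000 ms is the assumed 0.8.x default; we haven’t confirmed our
   own override):

      // Not Kafka code; just the timeout arithmetic for the fetches above.
      object TimeoutCheck extends App {
        val maxWaitMs   = 120000 // MaxWait from the FetchRequest in the log
        val soTimeoutMs = 30000  // assumed replica.socket.timeout.ms default

        if (soTimeoutMs <= maxWaitMs)
          println(s"socket timeout ($soTimeoutMs ms) <= MaxWait ($maxWaitMs ms): " +
            "SocketTimeoutException is possible even with a healthy leader")
      }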

   3. IOExceptions

   [2015-10-01 16:53:58,337] ERROR Closing socket for /10.21.220.32
because of error (kafka.network.Processor)
   java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at kafka.utils.Utils$.read(Utils.scala:380)
        at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
        at kafka.network.Processor.read(SocketServer.scala:444)
        at kafka.network.Processor.run(SocketServer.scala:340)
        at java.lang.Thread.run(Thread.java:745)

   4. ClosedChannelExceptions

   [2015-10-01 21:53:14,164] INFO Reconnect due to socket error:
java.nio.channels.ClosedChannelException
(kafka.consumer.SimpleConsumer)
   [2015-10-01 21:53:16,498] WARN [ReplicaFetcherThread-3-2], Error in
fetch Name: FetchRequest; Version: 0; CorrelationId: 1; ClientId:
ReplicaFetcherThread-3-2; ReplicaId: 0; MaxWait: 120000 ms; MinBytes:
1024 bytes; RequestInfo: [lia.stage.bundles,12] ->
PartitionFetchInfo(13730894,52428800),[lia.stage.activities,10] ->
PartitionFetchInfo(284919,52428800),[lia.stage.activities.anonymized.json,18]
-> PartitionFetchInfo(285209,52428800),[lia.stage.activities,2] ->
PartitionFetchInfo(286064,52428800),[lia.stage.activities.anonymized.json,2]
-> PartitionFetchInfo(286023,52428800),[lia.stage.raw_events,13] ->
PartitionFetchInfo(216334494,52428800),[lia.stage.bundles,4] ->
PartitionFetchInfo(9912933,52428800),[lia.stage.activities.anonymized.json,10]
-> PartitionFetchInfo(284885,52428800),[lia.stage.raw_events,5] ->
PartitionFetchInfo(278656057,52428800),[lia.stage.activities,18] ->
PartitionFetchInfo(285242,52428800). Possible cause:
java.nio.channels.ClosedChannelException
(kafka.server.ReplicaFetcherThread)

   5. Another symptom we’re seeing is a lot of "recovering unflushed
      segment" messages at startup, even after a controlled shutdown (see
      the sketch below).
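
   For what it’s worth, our understanding (assuming 0.8.x behavior) is
   that a clean shutdown drops a ".kafka_cleanshutdown" marker into each
   log.dirs directory, and if the marker is missing at startup, every
   segment past the recovery checkpoint is re-validated. A quick check
   we’ve been running (the directory is a placeholder for our actual
   log.dirs):

      import java.io.File

      object CleanShutdownCheck extends App {
        // Placeholder path; substitute the broker's actual log.dirs.
        val logDirs = Seq("/data/kafka-logs")
        for (dir <- logDirs) {
          val marker = new File(dir, ".kafka_cleanshutdown")
          println(s"$dir clean-shutdown marker present: ${marker.exists}")
        }
      }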

I was hoping the Kafka gurus could shed some light on these issues and
recommend how to recover from this outage.

Thanks,
Pradeep