Posted to users@kafka.apache.org by Pradeep Gollakota <pr...@gmail.com> on 2015/10/02 00:25:38 UTC
Kafka Recovery Errors
Hi All,
We’ve been dealing with a Kafka outage for a few days. In an attempt to
recover, we’ve shut down all of our producers and consumers, so the only
connections we see to/from the brokers are from other brokers and ZooKeeper.
The symptoms we’re seeing are:
1. Uncaught exceptions on the kafka-network-threads. We’ve already filed a
JIRA for this at https://issues.apache.org/jira/browse/KAFKA-2595. When this
happens, the network threads die and the number of CLOSE_WAIT connections on
the broker steadily increases.
[2015-10-01 22:01:33,380] ERROR Uncaught exception in thread
'kafka-network-thread-9092-6': (kafka.utils.Utils$)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at kafka.network.ConnectionQuotas.dec(SocketServer.scala:524)
at kafka.network.AbstractServerThread.close(SocketServer.scala:165)
at kafka.network.AbstractServerThread.close(SocketServer.scala:157)
at kafka.network.Processor.close(SocketServer.scala:374)
at kafka.network.Processor.maybeCloseOldestConnection(SocketServer.scala:500)
at kafka.network.Processor.run(SocketServer.scala:361)
at java.lang.Thread.run(Thread.java:745)
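To quantify the CLOSE_WAIT buildup, something along these lines can be run on each broker (this assumes netstat is installed; `ss -tan` produces equivalent output on newer distributions):

```shell
# Rough count of broker-side sockets stuck in CLOSE_WAIT.
# The state is the last field of each netstat -tan output line.
netstat -tan 2>/dev/null | awk '$NF == "CLOSE_WAIT"' | wc -l
```

Watching this number over a few minutes should show whether the leak tracks the dying network threads.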
2. Several different manifestations of SocketTimeoutExceptions:
[2015-10-01 20:10:53,770] INFO Reconnect due to socket error:
java.net.SocketTimeoutException (kafka.consumer.SimpleConsumer)
[2015-10-01 20:37:10,839] WARN [Kafka Server 0], Error during
controlled shutdown, possibly because leader movement took longer than
the configured socket.timeout.ms: null (kafka.server.KafkaServer)
[2015-09-30 21:51:27,989] WARN [ReplicaFetcherThread-1-2], Error in
fetch Name: FetchRequest; Version: 0; CorrelationId: 7; ClientId:
ReplicaFetcherThread-1-2; ReplicaId: 0; MaxWait: 120000 ms; MinBytes:
1024 bytes; RequestInfo: [lia.stage.raw_events,11] ->
PartitionFetchInfo(1105814736,52428800),[lia.stage.bundles,6] ->
PartitionFetchInfo(24395234,52428800),[lia.stage.bundles,14] ->
PartitionFetchInfo(14085756,52428800),[lia.stage.activities.anonymized.json,16]
-> PartitionFetchInfo(283942,52428800),[lia.stage.raw_events,3] ->
PartitionFetchInfo(181370702,52428800),[lia.stage.activities,16] ->
PartitionFetchInfo(283965,52428800),[lia.stage.activities.anonymized.json,8]
-> PartitionFetchInfo(284360,52428800),[lia.stage.activities.anonymized.json,0]
-> PartitionFetchInfo(284542,52428800),[lia.stage.activities,0] ->
PartitionFetchInfo(284560,52428800),[lia.stage.raw_events,19] ->
PartitionFetchInfo(277598692,52428800),[lia.stage.activities,8] ->
PartitionFetchInfo(284404,52428800). Possible cause:
java.net.SocketTimeoutException (kafka.server.ReplicaFetcherThread)
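For reference, the broker-side settings that govern these timeouts live in server.properties. The names below are the 0.8.x-era settings and the values shown are the shipped defaults, so treat this as a checklist rather than a prescription; note in particular that the replica fetch MaxWait of 120000 ms in the logs above is larger than the default replica socket timeout, which the Kafka documentation warns against:

```properties
# Socket timeout used for controller-to-broker channels, including
# controlled shutdown (default 30000).
controller.socket.timeout.ms=30000

# Socket timeout for replica fetchers; the docs say this should be
# at least replica.fetch.wait.max.ms (default 30000).
replica.socket.timeout.ms=30000

# How many times controlled shutdown is retried, and the backoff
# between attempts (defaults 3 and 5000).
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
```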
3. IOExceptions:
[2015-10-01 16:53:58,337] ERROR Closing socket for /10.21.220.32
because of error (kafka.network.Processor)
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at kafka.utils.Utils$.read(Utils.scala:380)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Processor.read(SocketServer.scala:444)
at kafka.network.Processor.run(SocketServer.scala:340)
at java.lang.Thread.run(Thread.java:745)
4. ClosedChannelExceptions:
[2015-10-01 21:53:14,164] INFO Reconnect due to socket error:
java.nio.channels.ClosedChannelException
(kafka.consumer.SimpleConsumer)
[2015-10-01 21:53:16,498] WARN [ReplicaFetcherThread-3-2], Error in
fetch Name: FetchRequest; Version: 0; CorrelationId: 1; ClientId:
ReplicaFetcherThread-3-2; ReplicaId: 0; MaxWait: 120000 ms; MinBytes:
1024 bytes; RequestInfo: [lia.stage.bundles,12] ->
PartitionFetchInfo(13730894,52428800),[lia.stage.activities,10] ->
PartitionFetchInfo(284919,52428800),[lia.stage.activities.anonymized.json,18]
-> PartitionFetchInfo(285209,52428800),[lia.stage.activities,2] ->
PartitionFetchInfo(286064,52428800),[lia.stage.activities.anonymized.json,2]
-> PartitionFetchInfo(286023,52428800),[lia.stage.raw_events,13] ->
PartitionFetchInfo(216334494,52428800),[lia.stage.bundles,4] ->
PartitionFetchInfo(9912933,52428800),[lia.stage.activities.anonymized.json,10]
-> PartitionFetchInfo(284885,52428800),[lia.stage.raw_events,5] ->
PartitionFetchInfo(278656057,52428800),[lia.stage.activities,18] ->
PartitionFetchInfo(285242,52428800). Possible cause:
java.nio.channels.ClosedChannelException
(kafka.server.ReplicaFetcherThread)
5. Another symptom we’re seeing is a lot of unflushed segments being
recovered on restart, even after a controlled shutdown.
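On the unflushed-segment recovery: log recovery after a shutdown that Kafka considers unclean is single-threaded per data directory by default, which can make restarts very slow. If the brokers are on 0.8.2 or later, the following server.properties setting (the value of 8 is purely illustrative) can parallelize it:

```properties
# Threads per data directory used for log recovery at startup and
# flushing at shutdown (available from Kafka 0.8.2; default is 1).
num.recovery.threads.per.data.dir=8
```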
I was hoping the Kafka gurus could shed some light on these issues and
recommend how to recover from this outage.
Thanks,
Pradeep