You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by Arya Ketan <ke...@gmail.com> on 2014/04/10 05:50:51 UTC

Issue with kafka cluster with broker not able to send messages.

*Issue : *Kafka cluster goes to an un-responsive state after some time with
producers getting Socket time-outs on every request made.

*Kafka Version* - 0.8.1

*Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10240
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

*Broker Properties :*

3 topics, 5 partitions, 5  i/o , network threads.
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
zookeeper.connection.timeout.ms=1000000
No replication amongst brokers.

*Producer Properties * -
sync producer,
 required.ack = -1 ( Waits for ack from broker, most durable).
kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
(requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
(requested 102400).

*NO Consumers*

*QPS From producer to broker* ..
>From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.  Each
Message size is about 600 bytes

>From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
message size is about 1 kb.

*Exception seen on producer after 9-10 hours of seamless message send.*

 Exception
kafka.common.FailedToSendMessageException: Failed to send messages after 3
tries.

ERROR : kafka.producer.async.DefaultEventHandler - Failed to send requests
for topics user with correlation ids in [25617,25932]

 WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
kafka.producer.async.DefaultEventHandler - Failed to send producer request
with correlation id 25931 to broker 4 with data for partitions
[userservice,2]
java.net.SocketTimeoutException
        at
sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
        at
java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
        at kafka.utils.Utils$.read(Utils.scala:375)


*No of Open FDs in one of the broker machines*- 999

*Netstat of kafka broker :*

Lot  of sockets in CLOSE_WAIT with high receive queue nos.

Sample :
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
CLOSE_WAIT
tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
CLOSE_WAIT
tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
CLOSE_WAIT

*On producer boxes :*

Lot of FIN_WAIT 1 ,2 with high send queues.
tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
FIN_WAIT2
tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
FIN_WAIT2
tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
ESTABLISHED
tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
FIN_WAIT1

tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED

tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
FIN_WAIT2
tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
FIN_WAIT2
tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
FIN_WAIT1
tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
ESTABLISHED
tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
FIN_WAIT1


*Broker Logs : *
In server.log, mostly  INFO [2014-04-10 03:01:42,230]
[kafka-processor-9092-0][] kafka.network.Processor - Closing socket
connection to /10.65.40.137.

EXCEPT in 1 machine :

INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
kafka.network.Processor - Closing socket connection to /10.65.40.137.

No logs in-between. First fatal exception at 07.11.

FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
to highwatermark file:
java.io.FileNotFoundException:
/var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open
files)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
        at java.io.FileWriter.<init>(FileWriter.java:73)
        at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
        at
kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
        at
kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
        at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
        at
kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
        at
kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
        at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor -
Error in acceptor
java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at
sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
        at kafka.network.Acceptor.accept(SocketServer.scala:200)
        at kafka.network.Acceptor.run(SocketServer.scala:154)
        at java.lang.Thread.run(Thread.java:662)
ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
'kafka-recovery-point-checkpoint'
java.io.FileNotFoundException:
/var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
open files)


*No EXCEPTIONS in any other logs in any other broker machine..*

Attached is the thread dump from leader.

Re: Issue with kafka cluster with broker not able to send messages.

Posted by Arya Ketan <ke...@gmail.com>.
Hi Guozhang,

I just have 1 producer client per producer machine.
The producer is in a singleton scope.

Is there a possibility to close producer sockets by force or use a producer
socket pool??


--

Arya

Arya


On Thu, Apr 10, 2014 at 11:38 AM, Guozhang Wang <wa...@gmail.com> wrote:

> Hello Arya,
>
> The broker seems dead due to too many open file handlers, which are likely
> due to too many open sockets. Hhow many producer clientss do you have on
> these 5 machines, and could you check if there is any socket leak?
>
> Guozhang
>
>
> On Wed, Apr 9, 2014 at 8:50 PM, Arya Ketan <ke...@gmail.com> wrote:
>
> > *Issue : *Kafka cluster goes to an un-responsive state after some time
> > with producers getting Socket time-outs on every request made.
> >
> > *Kafka Version* - 0.8.1
> >
> > *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
> >
> > ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 16382
> > max locked memory       (kbytes, -l) 64
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 10240
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 8192
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) unlimited
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> > *Broker Properties :*
> >
> > 3 topics, 5 partitions, 5  i/o , network threads.
> > socket.send.buffer.bytes=1048576
> > socket.receive.buffer.bytes=1048576
> > socket.request.max.bytes=104857600
> > zookeeper.connection.timeout.ms=1000000
> > No replication amongst brokers.
> >
> > *Producer Properties * -
> > sync producer,
> >  required.ack = -1 ( Waits for ack from broker, most durable).
> > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > (requested 102400).
> >
> > *NO Consumers*
> >
> > *QPS From producer to broker* ..
> > From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.
>  Each
> > Message size is about 600 bytes
> >
> > From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> > message size is about 1 kb.
> >
> > *Exception seen on producer after 9-10 hours of seamless message send.*
> >
> >  Exception
> > kafka.common.FailedToSendMessageException: Failed to send messages after
> 3
> > tries.
> >
> > ERROR : kafka.producer.async.DefaultEventHandler - Failed to send
> requests
> > for topics user with correlation ids in [25617,25932]
> >
> >  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > kafka.producer.async.DefaultEventHandler - Failed to send producer
> request
> > with correlation id 25931 to broker 4 with data for partitions
> > [userservice,2]
> > java.net.SocketTimeoutException
> >         at
> > sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> >         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> >         at
> >
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> >         at kafka.utils.Utils$.read(Utils.scala:375)
> >
> >
> > *No of Open FDs in one of the broker machines*- 999
> >
> > *Netstat of kafka broker :*
> >
> > Lot  of sockets in CLOSE_WAIT with high receive queue nos.
> >
> > Sample :
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> > tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> > tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> > tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> > tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> > tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> > tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> > tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> > CLOSE_WAIT
> > tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> > CLOSE_WAIT
> >
> > *On producer boxes :*
> >
> > Lot of FIN_WAIT 1 ,2 with high send queues.
> > tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> > ESTABLISHED
> > tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> > FIN_WAIT1
> >
> > tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
> >
> > tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> > ESTABLISHED
> > tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> > FIN_WAIT1
> >
> >
> > *Broker Logs : *
> > In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> > [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> > connection to /10.65.40.137.
> >
> > EXCEPT in 1 machine :
> >
> > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> >
> > No logs in-between. First fatal exception at 07.11.
> >
> > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error
> writing
> > to highwatermark file:
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many
> open
> > files)
> >         at java.io.FileOutputStream.open(Native Method)
> >         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> >         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> >         at java.io.FileWriter.<init>(FileWriter.java:73)
> >         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> >         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> >         at
> >
> kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> >         at
> kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> >         at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >         at
> >
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >         at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >         at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor
> > - Error in acceptor
> > java.io.IOException: Too many open files
> >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> >         at
> >
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> >         at kafka.network.Acceptor.accept(SocketServer.scala:200)
> >         at kafka.network.Acceptor.run(SocketServer.scala:154)
> >         at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > 'kafka-recovery-point-checkpoint'
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> > open files)
> >
> >
> > *No EXCEPTIONS in any other logs in any other broker machine..*
> >
> > Attached is the thread dump from leader.
> >
>
>
>
> --
> -- Guozhang
>

Re: Issue with kafka cluster with broker not able to send messages.

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Arya,

The broker seems dead due to too many open file handlers, which are likely
due to too many open sockets. Hhow many producer clientss do you have on
these 5 machines, and could you check if there is any socket leak?

Guozhang


On Wed, Apr 9, 2014 at 8:50 PM, Arya Ketan <ke...@gmail.com> wrote:

> *Issue : *Kafka cluster goes to an un-responsive state after some time
> with producers getting Socket time-outs on every request made.
>
> *Kafka Version* - 0.8.1
>
> *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
>
> ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 16382
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 10240
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) unlimited
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> *Broker Properties :*
>
> 3 topics, 5 partitions, 5  i/o , network threads.
> socket.send.buffer.bytes=1048576
> socket.receive.buffer.bytes=1048576
> socket.request.max.bytes=104857600
> zookeeper.connection.timeout.ms=1000000
> No replication amongst brokers.
>
> *Producer Properties * -
> sync producer,
>  required.ack = -1 ( Waits for ack from broker, most durable).
> kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> (requested 102400).
>
> *NO Consumers*
>
> *QPS From producer to broker* ..
> From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.  Each
> Message size is about 600 bytes
>
> From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> message size is about 1 kb.
>
> *Exception seen on producer after 9-10 hours of seamless message send.*
>
>  Exception
> kafka.common.FailedToSendMessageException: Failed to send messages after 3
> tries.
>
> ERROR : kafka.producer.async.DefaultEventHandler - Failed to send requests
> for topics user with correlation ids in [25617,25932]
>
>  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> kafka.producer.async.DefaultEventHandler - Failed to send producer request
> with correlation id 25931 to broker 4 with data for partitions
> [userservice,2]
> java.net.SocketTimeoutException
>         at
> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>         at
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
>         at kafka.utils.Utils$.read(Utils.scala:375)
>
>
> *No of Open FDs in one of the broker machines*- 999
>
> *Netstat of kafka broker :*
>
> Lot  of sockets in CLOSE_WAIT with high receive queue nos.
>
> Sample :
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> CLOSE_WAIT
> tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> CLOSE_WAIT
>
> *On producer boxes :*
>
> Lot of FIN_WAIT 1 ,2 with high send queues.
> tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> ESTABLISHED
> tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> FIN_WAIT1
>
> tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
>
> tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> ESTABLISHED
> tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> FIN_WAIT1
>
>
> *Broker Logs : *
> In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> connection to /10.65.40.137.
>
> EXCEPT in 1 machine :
>
> INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> kafka.network.Processor - Closing socket connection to /10.65.40.137.
>
> No logs in-between. First fatal exception at 07.11.
>
> FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
> to highwatermark file:
> java.io.FileNotFoundException:
> /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open
> files)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
>         at java.io.FileWriter.<init>(FileWriter.java:73)
>         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
>         at
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
>         at
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
>         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
>         at
> kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
>         at
> kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
>         at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor
> - Error in acceptor
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
>         at kafka.network.Acceptor.accept(SocketServer.scala:200)
>         at kafka.network.Acceptor.run(SocketServer.scala:154)
>         at java.lang.Thread.run(Thread.java:662)
> ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> 'kafka-recovery-point-checkpoint'
> java.io.FileNotFoundException:
> /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> open files)
>
>
> *No EXCEPTIONS in any other logs in any other broker machine..*
>
> Attached is the thread dump from leader.
>



-- 
-- Guozhang

Re: Issue with kafka cluster with broker not able to send messages.

Posted by Joel Koshy <jj...@gmail.com>.
Interesting - can you share those thread dumps?

Also, your process file handle limit is extremely low (and different
from what seems to be the default for your user-id from your earlier
message). You should bump that up to a couple of ten thousand at
least.

Joel

On Fri, Apr 11, 2014 at 08:13:34AM +0530, Arya Ketan wrote:
> By the way, we may have found the issue ..
> 
> 
> On going through the thread dump, we found that 4-5 threads were blocked on
> log4j.CallAppenders and 2-3 threads were in IN_NATIVE state while trying to
> write logs to disk. The network threads were there-fore blocked on log4j
> threads, thus hanging the kafka broker.
> 
> Also in logging, in our log4j file, we have a console appender.. which
> writes to STDOUT and STDERR,  we suspected that while writing to stdout and
> stderr, the threads are getting blocked. So, we removed the console
> appender and started writing to a file. Its stable for quite some time. (
> Fingers crossed).
> 
> 
> *Process limits : *
> Max cpu time              unlimited            unlimited
> seconds
> Max file size             unlimited            unlimited
> bytes
> Max data size             unlimited            unlimited
> bytes
> Max stack size            10485760             unlimited
> bytes
> Max core file size        0                    unlimited
> bytes
> Max resident set          unlimited            unlimited
> bytes
> Max processes             1031009              1031009
> processes
> Max open files            1024                 1024
> files
> Max locked memory         65536                65536
> bytes
> Max address space         unlimited            unlimited
> bytes
> Max file locks            unlimited            unlimited
> locks
> Max pending signals       1031009              1031009
> signals
> Max msgqueue size         819200               819200
> bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us
> 
> 
> 
> 
> Arya
> 
> 
> On Fri, Apr 11, 2014 at 5:50 AM, Joel Koshy <jj...@gmail.com> wrote:
> 
> > When you see this happening (on broker 4 in this instance), can you
> > confirm the Kafka process handle limit?
> >
> > cat /proc/<pid>/limits
> >
> > On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote:
> > > *Issue : *Kafka cluster goes to an un-responsive state after some time
> > with
> > > producers getting Socket time-outs on every request made.
> > >
> > > *Kafka Version* - 0.8.1
> > >
> > > *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
> > >
> > > ulimit -a
> > > core file size          (blocks, -c) 0
> > > data seg size           (kbytes, -d) unlimited
> > > scheduling priority             (-e) 0
> > > file size               (blocks, -f) unlimited
> > > pending signals                 (-i) 16382
> > > max locked memory       (kbytes, -l) 64
> > > max memory size         (kbytes, -m) unlimited
> > > open files                      (-n) 10240
> > > pipe size            (512 bytes, -p) 8
> > > POSIX message queues     (bytes, -q) 819200
> > > real-time priority              (-r) 0
> > > stack size              (kbytes, -s) 8192
> > > cpu time               (seconds, -t) unlimited
> > > max user processes              (-u) unlimited
> > > virtual memory          (kbytes, -v) unlimited
> > > file locks                      (-x) unlimited
> > >
> > > *Broker Properties :*
> > >
> > > 3 topics, 5 partitions, 5  i/o , network threads.
> > > socket.send.buffer.bytes=1048576
> > > socket.receive.buffer.bytes=1048576
> > > socket.request.max.bytes=104857600
> > > zookeeper.connection.timeout.ms=1000000
> > > No replication amongst brokers.
> > >
> > > *Producer Properties * -
> > > sync producer,
> > >  required.ack = -1 ( Waits for ack from broker, most durable).
> > > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > > (requested 102400).
> > >
> > > *NO Consumers*
> > >
> > > *QPS From producer to broker* ..
> > > From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.
> >  Each
> > > Message size is about 600 bytes
> > >
> > > From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> > > message size is about 1 kb.
> > >
> > > *Exception seen on producer after 9-10 hours of seamless message send.*
> > >
> > >  Exception
> > > kafka.common.FailedToSendMessageException: Failed to send messages after
> > 3
> > > tries.
> > >
> > > ERROR : kafka.producer.async.DefaultEventHandler - Failed to send
> > requests
> > > for topics user with correlation ids in [25617,25932]
> > >
> > >  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > > kafka.producer.async.DefaultEventHandler - Failed to send producer
> > request
> > > with correlation id 25931 to broker 4 with data for partitions
> > > [userservice,2]
> > > java.net.SocketTimeoutException
> > >         at
> > > sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> > >         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> > >         at
> > >
> > java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> > >         at kafka.utils.Utils$.read(Utils.scala:375)
> > >
> > >
> > > *No of Open FDs in one of the broker machines*- 999
> > >
> > > *Netstat of kafka broker :*
> > >
> > > Lot  of sockets in CLOSE_WAIT with high receive queue nos.
> > >
> > > Sample :
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> > > tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> > > CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> > > tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> > > tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> > > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> > > tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> > > CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> > > tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> > > CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> > > tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> > > CLOSE_WAIT
> > > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> > > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> > > tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> > > CLOSE_WAIT
> > > tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> > > tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> > > CLOSE_WAIT
> > >
> > > *On producer boxes :*
> > >
> > > Lot of FIN_WAIT 1 ,2 with high send queues.
> > > tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> > > FIN_WAIT2
> > > tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> > > FIN_WAIT2
> > > tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> > > ESTABLISHED
> > > tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > >
> > > tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
> > >
> > > tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> > > FIN_WAIT2
> > > tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> > > FIN_WAIT2
> > > tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > > tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> > > ESTABLISHED
> > > tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> > > FIN_WAIT1
> > >
> > >
> > > *Broker Logs : *
> > > In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> > > [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> > > connection to /10.65.40.137.
> > >
> > > EXCEPT in 1 machine :
> > >
> > > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> > >
> > > No logs in-between. First fatal exception at 07.11.
> > >
> > > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error
> > writing
> > > to highwatermark file:
> > > java.io.FileNotFoundException:
> > > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many
> > open
> > > files)
> > >         at java.io.FileOutputStream.open(Native Method)
> > >         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> > >         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> > >         at java.io.FileWriter.<init>(FileWriter.java:73)
> > >         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> > >         at
> > >
> > kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> > >         at
> > >
> > kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> > >         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> > >         at
> > >
> > kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> > >         at
> > >
> > kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> > >         at
> > kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> > >         at
> > > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> > >         at
> > >
> > java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> > >         at
> > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> > >         at
> > >
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> > >         at
> > >
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> > >         at
> > >
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> > >         at
> > >
> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > >         at
> > >
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > >         at java.lang.Thread.run(Thread.java:662)
> > > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][]
> > kafka.network.Acceptor -
> > > Error in acceptor
> > > java.io.IOException: Too many open files
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > >         at
> > >
> > sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> > >         at kafka.network.Acceptor.accept(SocketServer.scala:200)
> > >         at kafka.network.Acceptor.run(SocketServer.scala:154)
> > >         at java.lang.Thread.run(Thread.java:662)
> > > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > > 'kafka-recovery-point-checkpoint'
> > > java.io.FileNotFoundException:
> > > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> > > open files)
> > >
> > >
> > > *No EXCEPTIONS in any other logs in any other broker machine..*
> > >
> > > Attached is the thread dump from leader.
> >
> >
> >


Re: Issue with kafka cluster with broker not able to send messages.

Posted by Arya Ketan <ke...@gmail.com>.
By the way, we may have found the issue ..


On going through the thread dump, we found that 4-5 threads were blocked on
log4j.CallAppenders and 2-3 threads were in IN_NATIVE state while trying to
write logs to disk. The network threads were there-fore blocked on log4j
threads, thus hanging the kafka broker.

Also in logging, in our log4j file, we have a console appender.. which
writes to STDOUT and STDERR,  we suspected that while writing to stdout and
stderr, the threads are getting blocked. So, we removed the console
appender and started writing to a file. Its stable for quite some time. (
Fingers crossed).


*Process limits : *
Max cpu time              unlimited            unlimited
seconds
Max file size             unlimited            unlimited
bytes
Max data size             unlimited            unlimited
bytes
Max stack size            10485760             unlimited
bytes
Max core file size        0                    unlimited
bytes
Max resident set          unlimited            unlimited
bytes
Max processes             1031009              1031009
processes
Max open files            1024                 1024
files
Max locked memory         65536                65536
bytes
Max address space         unlimited            unlimited
bytes
Max file locks            unlimited            unlimited
locks
Max pending signals       1031009              1031009
signals
Max msgqueue size         819200               819200
bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us




Arya


On Fri, Apr 11, 2014 at 5:50 AM, Joel Koshy <jj...@gmail.com> wrote:

> When you see this happening (on broker 4 in this instance), can you
> confirm the Kafka process handle limit?
>
> cat /proc/<pid>/limits
>
> On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote:
> > *Issue : *Kafka cluster goes to an un-responsive state after some time
> with
> > producers getting Socket time-outs on every request made.
> >
> > *Kafka Version* - 0.8.1
> >
> > *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
> >
> > ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 16382
> > max locked memory       (kbytes, -l) 64
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 10240
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 8192
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) unlimited
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> > *Broker Properties :*
> >
> > 3 topics, 5 partitions, 5  i/o , network threads.
> > socket.send.buffer.bytes=1048576
> > socket.receive.buffer.bytes=1048576
> > socket.request.max.bytes=104857600
> > zookeeper.connection.timeout.ms=1000000
> > No replication amongst brokers.
> >
> > *Producer Properties * -
> > sync producer,
> >  required.ack = -1 ( Waits for ack from broker, most durable).
> > kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> > (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> > (requested 102400).
> >
> > *NO Consumers*
> >
> > *QPS From producer to broker* ..
> > From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.
>  Each
> > Message size is about 600 bytes
> >
> > From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> > message size is about 1 kb.
> >
> > *Exception seen on producer after 9-10 hours of seamless message send.*
> >
> >  Exception
> > kafka.common.FailedToSendMessageException: Failed to send messages after
> 3
> > tries.
> >
> > ERROR : kafka.producer.async.DefaultEventHandler - Failed to send
> requests
> > for topics user with correlation ids in [25617,25932]
> >
> >  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> > kafka.producer.async.DefaultEventHandler - Failed to send producer
> request
> > with correlation id 25931 to broker 4 with data for partitions
> > [userservice,2]
> > java.net.SocketTimeoutException
> >         at
> > sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
> >         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
> >         at
> >
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
> >         at kafka.utils.Utils$.read(Utils.scala:375)
> >
> >
> > *No of Open FDs in one of the broker machines*- 999
> >
> > *Netstat of kafka broker :*
> >
> > Lot  of sockets in CLOSE_WAIT with high receive queue nos.
> >
> > Sample :
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> > tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> > tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> > tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> > tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> > tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> > tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> > CLOSE_WAIT
> > tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> > tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> > tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> > CLOSE_WAIT
> > tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> > tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> > CLOSE_WAIT
> >
> > *On producer boxes :*
> >
> > Lot of FIN_WAIT 1 ,2 with high send queues.
> > tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> > ESTABLISHED
> > tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> > FIN_WAIT1
> >
> > tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
> >
> > tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> > FIN_WAIT2
> > tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> > FIN_WAIT1
> > tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> > ESTABLISHED
> > tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> > FIN_WAIT1
> >
> >
> > *Broker Logs : *
> > In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> > [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> > connection to /10.65.40.137.
> >
> > EXCEPT in 1 machine :
> >
> > INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> > kafka.network.Processor - Closing socket connection to /10.65.40.137.
> >
> > No logs in-between. First fatal exception at 07.11.
> >
> > FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> > kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error
> writing
> > to highwatermark file:
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many
> open
> > files)
> >         at java.io.FileOutputStream.open(Native Method)
> >         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
> >         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
> >         at java.io.FileWriter.<init>(FileWriter.java:73)
> >         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
> >         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> >         at
> >
> kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
> >         at
> >
> kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
> >         at
> kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
> >         at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >         at
> >
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> >         at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> >         at
> >
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >         at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][]
> kafka.network.Acceptor -
> > Error in acceptor
> > java.io.IOException: Too many open files
> >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> >         at
> >
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
> >         at kafka.network.Acceptor.accept(SocketServer.scala:200)
> >         at kafka.network.Acceptor.run(SocketServer.scala:154)
> >         at java.lang.Thread.run(Thread.java:662)
> > ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> > kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> > 'kafka-recovery-point-checkpoint'
> > java.io.FileNotFoundException:
> > /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> > open files)
> >
> >
> > *No EXCEPTIONS in any other logs in any other broker machine..*
> >
> > Attached is the thread dump from leader.
>
>
>

Re: Issue with kafka cluster with broker not able to send messages.

Posted by Joel Koshy <jj...@gmail.com>.
When you see this happening (on broker 4 in this instance), can you
confirm the Kafka process handle limit?

cat /proc/<pid>/limits

On Thu, Apr 10, 2014 at 09:20:51AM +0530, Arya Ketan wrote:
> *Issue : *Kafka cluster goes to an un-responsive state after some time with
> producers getting Socket time-outs on every request made.
> 
> *Kafka Version* - 0.8.1
> 
> *Machines* :  VMs , 2 cores 8gb RAM, linux , 3 node cluster.
> 
> ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 16382
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 10240
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) unlimited
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 
> *Broker Properties :*
> 
> 3 topics, 5 partitions, 5  i/o , network threads.
> socket.send.buffer.bytes=1048576
> socket.receive.buffer.bytes=1048576
> socket.request.max.bytes=104857600
> zookeeper.connection.timeout.ms=1000000
> No replication amongst brokers.
> 
> *Producer Properties * -
> sync producer,
>  required.ack = -1 ( Waits for ack from broker, most durable).
> kafka.network.BlockingChannel - Created socket with SO_TIMEOUT = 10000
> (requested 10000), SO_RCVBUF = 43690 (requested -1), SO_SNDBUF = 102400
> (requested 102400).
> 
> *NO Consumers*
> 
> *QPS From producer to broker* ..
> From 4 boxes, every 10-15 seconds, a batch of 200 messages are sent.  Each
> Message size is about 600 bytes
> 
> From 1 box, every 2-3 seconds , a batch of 400 messages are sent. Each
> message size is about 1 kb.
> 
> *Exception seen on producer after 9-10 hours of seamless message send.*
> 
>  Exception
> kafka.common.FailedToSendMessageException: Failed to send messages after 3
> tries.
> 
> ERROR : kafka.producer.async.DefaultEventHandler - Failed to send requests
> for topics user with correlation ids in [25617,25932]
> 
>  WARN [2014-04-10 02:45:49,183] [IrisQueue-writer][]
> kafka.producer.async.DefaultEventHandler - Failed to send producer request
> with correlation id 25931 to broker 4 with data for partitions
> [userservice,2]
> java.net.SocketTimeoutException
>         at
> sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
>         at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
>         at
> java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
>         at kafka.utils.Utils$.read(Utils.scala:375)
> 
> 
> *No of Open FDs in one of the broker machines*- 999
> 
> *Netstat of kafka broker :*
> 
> Lot  of sockets in CLOSE_WAIT with high receive queue nos.
> 
> Sample :
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43628 CLOSE_WAIT
> tcp6  198549      0 cp-analytics-kafka:9092 w3-web2.:52064 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:60668 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc4.w3-ops-:44909
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:44970 CLOSE_WAIT
> tcp6  209647      0 cp-analytics-kafka:9092 w3-web2.:54775 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:35412 CLOSE_WAIT
> tcp6  200731      0 cp-analytics-kafka:9092 w3-web2.:43435 CLOSE_WAIT
> tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:51673 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:59574 CLOSE_WAIT
> tcp6   34133      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:36767
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41807 CLOSE_WAIT
> tcp6  208136      0 cp-analytics-kafka:9092 w3-web2.:33401 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:43154
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:42944 CLOSE_WAIT
> tcp6  199587      0 cp-analytics-kafka:9092 w3-web2.:47651 CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:41392 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc6.w3-ops-:39080
> CLOSE_WAIT
> tcp6      45      0 cp-analytics-kafka:9092 w3-web2.:43379 CLOSE_WAIT
> tcp6  196401      0 cp-analytics-kafka:9092 w3-web2.:49943 CLOSE_WAIT
> tcp6  186606      0 cp-analytics-kafka:9092 w3-web2.:40921 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc5.w3-ops-:32971
> CLOSE_WAIT
> tcp6  196856      0 cp-analytics-kafka:9092 w3-web2.:38064 CLOSE_WAIT
> tcp6      44      0 cp-analytics-kafka:9092 user-svc3.w3-ops-:46069
> CLOSE_WAIT
> 
> *On producer boxes :*
> 
> Lot of FIN_WAIT 1 ,2 with high send queues.
> tcp6       0     44 user-svc6.w3.nm.f:38114 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:37938 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc6.w3.nm.f:59956 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0      0 user-svc6.w3.nm.f:60047 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0  60308 user-svc6.w3.nm.f:38010 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:33129 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc6.w3.nm.f:60143 cp-analytics-kafka:9092
> ESTABLISHED
> tcp6       0  32819 user-svc6.w3.nm.f:33198 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc6.w3.nm.f:33314 cp-analytics-kafka:9092
> FIN_WAIT1
> 
> tcp6       0 192584 w3-web2.:37797 cp-analytics-kafka:9092 ESTABLISHED
> 
> tcp6       0      0 user-svc5.w3.nm.f:37790 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0  63018 user-svc5.w3.nm.f:46990 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  63018 user-svc5.w3.nm.f:47506 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0      0 user-svc5.w3.nm.f:37928 cp-analytics-kafka:9092
> FIN_WAIT2
> tcp6       0     44 user-svc5.w3.nm.f:40756 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0     44 user-svc5.w3.nm.f:47403 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  28894 user-svc5.w3.nm.f:40319 cp-analytics-kafka:9092
> FIN_WAIT1
> tcp6       0  28893 user-svc5.w3.nm.f:40921 cp-analytics-kafka:9092
> ESTABLISHED
> tcp6       0     44 user-svc5.w3.nm.f:46920 cp-analytics-kafka:9092
> FIN_WAIT1
> 
> 
> *Broker Logs : *
> In server.log, mostly  INFO [2014-04-10 03:01:42,230]
> [kafka-processor-9092-0][] kafka.network.Processor - Closing socket
> connection to /10.65.40.137.
> 
> EXCEPT in 1 machine :
> 
> INFO [2014-04-10 02:29:35,492] [kafka-processor-9092-2][]
> kafka.network.Processor - Closing socket connection to /10.65.40.137.
> 
> No logs in-between. First fatal exception at 07.11.
> 
> FATAL [2014-04-10 07:11:55,472] [kafka-scheduler-0][]
> kafka.server.ReplicaManager - [Replica Manager on Broker 4]: Error writing
> to highwatermark file:
> java.io.FileNotFoundException:
> /var/lib/fk-3p-kafka/logs/replication-offset-checkpoint.tmp (Too many open
> files)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:145)
>         at java.io.FileWriter.<init>(FileWriter.java:73)
>         at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
>         at
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:447)
>         at
> kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$1.apply(ReplicaManager.scala:444)
>         at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
>         at
> kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scala:444)
>         at
> kafka.server.ReplicaManager$$anonfun$startHighWaterMarksCheckPointThread$1.apply$mcV$sp(ReplicaManager.scala:94)
>         at kafka.utils.KafkaScheduler$$anon$1.run(KafkaScheduler.scala:100)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> ERROR [2014-04-10 07:11:58,592] [kafka-acceptor][] kafka.network.Acceptor -
> Error in acceptor
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
>         at kafka.network.Acceptor.accept(SocketServer.scala:200)
>         at kafka.network.Acceptor.run(SocketServer.scala:154)
>         at java.lang.Thread.run(Thread.java:662)
> ERROR [2014-04-10 07:12:07,726] [kafka-scheduler-3][]
> kafka.utils.KafkaScheduler - Uncaught exception in scheduled task
> 'kafka-recovery-point-checkpoint'
> java.io.FileNotFoundException:
> /var/lib/fk-3p-kafka/logs/recovery-point-offset-checkpoint.tmp (Too many
> open files)
> 
> 
> *No EXCEPTIONS in any other logs in any other broker machine..*
> 
> Attached is the thread dump from leader.