You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by Aaron Rankin <aa...@sproutsocial.com> on 2012/06/09 17:34:14 UTC

Kafka process dies sporadically

Hi,

We're testing Kafka and have found that the process dies often. There's little to no indication of why. We're running the latest code from Git, which we built using the instructions there. We're also running Zookeeper 3.3.5. Our setup has three brokers, producers running on the same network and consumers in another data center, a 30ms Internet ping away. 

Does anyone have some intuition about why this is happening?

The only stack trace we're seeing is coming from Zookeeper:

1193285089 [CommitProcessor:2] ERROR org.apache.zookeeper.server.NIOServerCnxn  - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
        at org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1545)
	at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:115)
	at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:87)
	at org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:577)
	at org.apache.zookeeper.server.DataTree.killSession(DataTree.java:829)
        at org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:804)
        at org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:328)
        at org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:715)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:107)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)


Also, we constantly are seeing these in the logs:

1193365748 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  org.apache.zookeeper.server.NIOServerCnxn  - Closed socket connection for client /127.0.0.1:53426 (no session established for client)
1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO  org.apache.zookeeper.server.NIOServerCnxn  - Accepted socket connection from /127.0.0.1:53428
1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN  org.apache.zookeeper.server.NIOServerCnxn  - EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket




Re: Kafka process dies sporadically

Posted by Jun Rao <ju...@gmail.com>.
It's likely that the larger buffer and fetch size triggered long GCs, which
caused ZK session to timeout. You may want to do a bit GC tuning.

Thanks,

Jun

On Mon, Jun 11, 2012 at 10:12 AM, Aaron Rankin <aa...@sproutsocial.com>wrote:

> This might help to explain the root cause. I found that two consumer
> parameters may be correlated with the broker issues. Our setup is
> inter-data-center and so we followed some of the advice on the mirroring
> wiki page,
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+mirroring. In
> particular, we increased "socket.buffersize" to 655360 and "fetch.size" to
> 3072000 on our consumers. Since disabling those two parameters, allowing
> the defaults to take effect, our brokers haven't died once in well over a
> day. Prior, they were dying every hour.
>
>
> On Jun 11, 2012, at 10:28 AM, Aaron Rankin wrote:
>
> > Jun,
> >
> > I was using the github mirror, which appears to be active. The last
> commit there is the same as with the Apache Git mirror
> (2a59ad76c657e4aad8ee6ca67078f49d2f6017c9).
> >
> >
> > Aaron
> >
> >
> > On Jun 11, 2012, at 12:05 AM, Jun Rao wrote:
> >
> >> Aaron,
> >>
> >> Which Git did you try, github or the Apache git mirror? Kafka has moved
> to
> >> Apache. So please try the 0.7 release in Apache.
> >>
> >> The error you saw are from ZK. Do you see lots of ZK session expiration
> in
> >> your log?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Sat, Jun 9, 2012 at 8:34 AM, Aaron Rankin <aa...@sproutsocial.com>
> wrote:
> >>
> >>> Hi,
> >>>
> >>> We're testing Kafka and have found that the process dies often. There's
> >>> little to no indication of why. We're running the latest code from Git,
> >>> which we built using the instructions there. We're also running
> Zookeeper
> >>> 3.3.5. Our setup has three brokers, producers running on the same
> network
> >>> and consumers in another data center, a 30ms Internet ping away.
> >>>
> >>> Does anyone have some intuition about why this is happening?
> >>>
> >>> The only stack trace we're seeing is coming from Zookeeper:
> >>>
> >>> 1193285089 [CommitProcessor:2] ERROR
> >>> org.apache.zookeeper.server.NIOServerCnxn  - Unexpected Exception:
> >>> java.nio.channels.CancelledKeyException
> >>>      at
> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
> >>>      at
> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
> >>>      at
> >>>
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
> >>>      at
> >>>
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
> >>>      at
> >>>
> org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1545)
> >>>      at
> >>>
> org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:115)
> >>>      at
> >>>
> org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:87)
> >>>      at
> >>> org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:577)
> >>>      at
> >>> org.apache.zookeeper.server.DataTree.killSession(DataTree.java:829)
> >>>      at
> >>> org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:804)
> >>>      at
> >>> org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:328)
> >>>      at
> >>>
> org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:715)
> >>>      at
> >>>
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:107)
> >>>      at
> >>>
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
> >>>
> >>>
> >>> Also, we constantly are seeing these in the logs:
> >>>
> >>> 1193365748 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
> >>> org.apache.zookeeper.server.NIOServerCnxn  - Closed socket connection
> for
> >>> client /127.0.0.1:53426 (no session established for client)
> >>> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
> >>> org.apache.zookeeper.server.NIOServerCnxn  - Accepted socket connection
> >>> from /127.0.0.1:53428
> >>> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN
> >>> org.apache.zookeeper.server.NIOServerCnxn  - EndOfStreamException:
> Unable
> >>> to read additional data from client sessionid 0x0, likely client has
> closed
> >>> socket
> >>>
> >>>
> >>>
> >>>
> >
>
>

Re: Kafka process dies sporadically

Posted by Aaron Rankin <aa...@sproutsocial.com>.
This might help to explain the root cause. I found that two consumer parameters may be correlated with the broker issues. Our setup is inter-data-center and so we followed some of the advice on the mirroring wiki page, https://cwiki.apache.org/confluence/display/KAFKA/Kafka+mirroring. In particular, we increased "socket.buffersize" to 655360 and "fetch.size" to 3072000 on our consumers. Since disabling those two parameters, allowing the defaults to take effect, our brokers haven't died once in well over a day. Prior, they were dying every hour.


On Jun 11, 2012, at 10:28 AM, Aaron Rankin wrote:

> Jun,
> 
> I was using the github mirror, which appears to be active. The last commit there is the same as with the Apache Git mirror (2a59ad76c657e4aad8ee6ca67078f49d2f6017c9).
> 
> 
> Aaron
> 
> 
> On Jun 11, 2012, at 12:05 AM, Jun Rao wrote:
> 
>> Aaron,
>> 
>> Which Git did you try, github or the Apache git mirror? Kafka has moved to
>> Apache. So please try the 0.7 release in Apache.
>> 
>> The error you saw are from ZK. Do you see lots of ZK session expiration in
>> your log?
>> 
>> Thanks,
>> 
>> Jun
>> 
>> On Sat, Jun 9, 2012 at 8:34 AM, Aaron Rankin <aa...@sproutsocial.com> wrote:
>> 
>>> Hi,
>>> 
>>> We're testing Kafka and have found that the process dies often. There's
>>> little to no indication of why. We're running the latest code from Git,
>>> which we built using the instructions there. We're also running Zookeeper
>>> 3.3.5. Our setup has three brokers, producers running on the same network
>>> and consumers in another data center, a 30ms Internet ping away.
>>> 
>>> Does anyone have some intuition about why this is happening?
>>> 
>>> The only stack trace we're seeing is coming from Zookeeper:
>>> 
>>> 1193285089 [CommitProcessor:2] ERROR
>>> org.apache.zookeeper.server.NIOServerCnxn  - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>      at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>>>      at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>>>      at
>>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>>>      at
>>> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>>>      at
>>> org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1545)
>>>      at
>>> org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:115)
>>>      at
>>> org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:87)
>>>      at
>>> org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:577)
>>>      at
>>> org.apache.zookeeper.server.DataTree.killSession(DataTree.java:829)
>>>      at
>>> org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:804)
>>>      at
>>> org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:328)
>>>      at
>>> org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:715)
>>>      at
>>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:107)
>>>      at
>>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>>> 
>>> 
>>> Also, we constantly are seeing these in the logs:
>>> 
>>> 1193365748 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
>>> org.apache.zookeeper.server.NIOServerCnxn  - Closed socket connection for
>>> client /127.0.0.1:53426 (no session established for client)
>>> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
>>> org.apache.zookeeper.server.NIOServerCnxn  - Accepted socket connection
>>> from /127.0.0.1:53428
>>> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN
>>> org.apache.zookeeper.server.NIOServerCnxn  - EndOfStreamException: Unable
>>> to read additional data from client sessionid 0x0, likely client has closed
>>> socket
>>> 
>>> 
>>> 
>>> 
> 


Re: Kafka process dies sporadically

Posted by Aaron Rankin <aa...@sproutsocial.com>.
Jun,

I was using the github mirror, which appears to be active. The last commit there is the same as with the Apache Git mirror (2a59ad76c657e4aad8ee6ca67078f49d2f6017c9).


Aaron


On Jun 11, 2012, at 12:05 AM, Jun Rao wrote:

> Aaron,
> 
> Which Git did you try, github or the Apache git mirror? Kafka has moved to
> Apache. So please try the 0.7 release in Apache.
> 
> The error you saw are from ZK. Do you see lots of ZK session expiration in
> your log?
> 
> Thanks,
> 
> Jun
> 
> On Sat, Jun 9, 2012 at 8:34 AM, Aaron Rankin <aa...@sproutsocial.com> wrote:
> 
>> Hi,
>> 
>> We're testing Kafka and have found that the process dies often. There's
>> little to no indication of why. We're running the latest code from Git,
>> which we built using the instructions there. We're also running Zookeeper
>> 3.3.5. Our setup has three brokers, producers running on the same network
>> and consumers in another data center, a 30ms Internet ping away.
>> 
>> Does anyone have some intuition about why this is happening?
>> 
>> The only stack trace we're seeing is coming from Zookeeper:
>> 
>> 1193285089 [CommitProcessor:2] ERROR
>> org.apache.zookeeper.server.NIOServerCnxn  - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>       at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>>       at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>>       at
>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>>       at
>> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>>       at
>> org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1545)
>>       at
>> org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:115)
>>       at
>> org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:87)
>>       at
>> org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:577)
>>       at
>> org.apache.zookeeper.server.DataTree.killSession(DataTree.java:829)
>>       at
>> org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:804)
>>       at
>> org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:328)
>>       at
>> org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:715)
>>       at
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:107)
>>       at
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>> 
>> 
>> Also, we constantly are seeing these in the logs:
>> 
>> 1193365748 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
>> org.apache.zookeeper.server.NIOServerCnxn  - Closed socket connection for
>> client /127.0.0.1:53426 (no session established for client)
>> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
>> org.apache.zookeeper.server.NIOServerCnxn  - Accepted socket connection
>> from /127.0.0.1:53428
>> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN
>> org.apache.zookeeper.server.NIOServerCnxn  - EndOfStreamException: Unable
>> to read additional data from client sessionid 0x0, likely client has closed
>> socket
>> 
>> 
>> 
>> 


Re: Kafka process dies sporadically

Posted by Jun Rao <ju...@gmail.com>.
Aaron,

Which Git did you try, github or the Apache git mirror? Kafka has moved to
Apache. So please try the 0.7 release in Apache.

The error you saw are from ZK. Do you see lots of ZK session expiration in
your log?

Thanks,

Jun

On Sat, Jun 9, 2012 at 8:34 AM, Aaron Rankin <aa...@sproutsocial.com> wrote:

> Hi,
>
> We're testing Kafka and have found that the process dies often. There's
> little to no indication of why. We're running the latest code from Git,
> which we built using the instructions there. We're also running Zookeeper
> 3.3.5. Our setup has three brokers, producers running on the same network
> and consumers in another data center, a 30ms Internet ping away.
>
> Does anyone have some intuition about why this is happening?
>
> The only stack trace we're seeing is coming from Zookeeper:
>
> 1193285089 [CommitProcessor:2] ERROR
> org.apache.zookeeper.server.NIOServerCnxn  - Unexpected Exception:
> java.nio.channels.CancelledKeyException
>        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>        at
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>        at
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>        at
> org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1545)
>        at
> org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:115)
>        at
> org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:87)
>        at
> org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:577)
>        at
> org.apache.zookeeper.server.DataTree.killSession(DataTree.java:829)
>        at
> org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:804)
>        at
> org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:328)
>        at
> org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:715)
>        at
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:107)
>        at
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>
>
> Also, we constantly are seeing these in the logs:
>
> 1193365748 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
>  org.apache.zookeeper.server.NIOServerCnxn  - Closed socket connection for
> client /127.0.0.1:53426 (no session established for client)
> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
>  org.apache.zookeeper.server.NIOServerCnxn  - Accepted socket connection
> from /127.0.0.1:53428
> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN
>  org.apache.zookeeper.server.NIOServerCnxn  - EndOfStreamException: Unable
> to read additional data from client sessionid 0x0, likely client has closed
> socket
>
>
>
>