You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by Mike Schilli <m...@perlmeister.com> on 2011/10/12 09:04:58 UTC

Total Outage of all ZooKeeper Instances

Hi ZooKeeper folks,

we just had an outage with our application using ZooKeeper the 
other day and are investigating how it happened.

Our setup is 3 servers, each of them running

    * the application
    * a local ZooKeeper instance

All 3 ZooKeeper instances form a quorum. Each application talks to the
local ZooKeeper instance only.

Looking through the logs of our applications on the 3 different servers, 
it turns out that *all three* ZooKeeper instances returned -7 to their 
applications during the outage. The applications tried to throw away 
their handle and reconnect, but none of them was able to recover.

Anyone got any ideas on how to prevent this kind of problem from
happening?

Thanks for any help.

-- 
-- Mike

Mike Schilli
m@perlmeister.com

Re: Total Outage of all ZooKeeper Instances

Posted by Mike Schilli <m...@perlmeister.com>.
On Wed, 12 Oct 2011, Patrick Hunt wrote:

> We use a handful of exit(#) codes, none of which are -7 though. (we
> don't use 7 either).

We're using the Perl wrapper (Net::ZooKeeper from CPAN) which translates

     [ -7,   'ZOPERATIONTIMEOUT',        'Operation timeout', ],

> connect. Also more details on your setup (for example what version of
> ZK are you using :-) ) would be helpful.

We're using 3.3.3_0.

> My suggestion is that you create a jira for this, and attach all the
> server logs and 1 or more client logs during the time in question.

Will do.

> What did you do to resolve the issue?

After we shut down all three application servers, ZK recovered after
a while.

-- 
-- Mike

Mike Schilli
m@perlmeister.com

>
> Patrick
>
> On Wed, Oct 12, 2011 at 8:46 AM, Mike Schilli <m...@perlmeister.com> wrote:
>> On Wed, 12 Oct 2011, Ted Dunning wrote:
>>
>>> Without a bit of log info, I don't think we would have any idea what
>>> the problem was other than the -7 return code.
>>>
>>> Can you give some specifics about what the ZK instances were saying?
>>
>> Sure, this is what I found in the zookeeper logs around the same time:
>>
>> 2011-10-06 21:14:32.322544500 WARN
>>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
>> EndOfStreamExcepti
>> on: Unable to read additional data from client sessionid 0x3324ffac32a0012,
>> likely client has closed socket
>> 2011-10-06 21:14:32.322860500 INFO
>>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
>> socket con
>> nection for client /127.0.0.1:44431 which had sessionid 0x3324ffac32a0012
>> 2011-10-06 21:14:32.829464500 WARN
>>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
>> EndOfStreamExcepti
>> on: Unable to read additional data from client sessionid 0x3324ffac32a0011,
>> likely client has closed socket
>> 2011-10-06 21:14:32.829468500 INFO
>>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
>> socket con
>> nection for client /127.0.0.1:44435 which had sessionid 0x3324ffac32a0011
>>
>> Followed by a number of client reconnect attempts:
>>
>> 2011-10-06 21:14:34.956557500 INFO
>>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client
>> attempting to renew session 0x3324ffac32a000f at /127.0.0.1:45019
>> 2011-10-06 21:14:34.956559500 INFO
>>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating
>> client: 230
>> 334469384044559
>> 2011-10-06 21:14:34.957578500 INFO
>>  [QuorumPeer:/0.0.0.0:2181:NIOServerCnxn@1580] - Established session
>> 0x3324ffac32a0
>> 00f with negotiated timeout 10000 for client /127.0.0.1:45019
>> 2011-10-06 21:14:35.199455500 WARN
>>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
>> EndOfStreamExcepti
>> on: Unable to read additional data from client sessionid 0x3324ffac32a000e,
>> likely client has closed socket
>> 2011-10-06 21:14:35.199457500 INFO
>>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
>> socket con
>> nection for client /127.0.0.1:44434 which had sessionid 0x3324ffac32a000e
>> 2011-10-06 21:14:35.577628500 ERROR [CommitProcessor:3:NIOServerCnxn@445] -
>> Unexpected Exception: 2011-10-06 21:14:35.577629500
>> java.nio.channels.CancelledKeyException
>> 2011-10-06 21:14:35.577631500   at
>> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>> 2011-10-06 21:14:35.577632500   at
>> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>> 2011-10-06 21:14:35.577633500   at
>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>> 2011-10-06 21:14:35.577634500   at
>> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>> 2011-10-06 21:14:35.577641500   at
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
>> sor.java:367)
>> 2011-10-06 21:14:35.577642500   at
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>> 2011-10-06 21:14:35.577797500 ERROR [CommitProcessor:3:NIOServerCnxn@445] -
>> Unexpected Exception: 2011-10-06 21:14:35.577798500
>> java.nio.channels.CancelledKeyException
>> 2011-10-06 21:14:35.577800500   at
>> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>> 2011-10-06 21:14:35.577801500   at
>> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>> 2011-10-06 21:14:35.577802500   at
>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>> 2011-10-06 21:14:35.577804500   at
>> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>> 2011-10-06 21:14:35.577830500   at
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
>> sor.java:171)
>> 2011-10-06 21:14:35.577831500   at
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>>
>>
>> --
>> -- Mike
>>
>> Mike Schilli
>> m@perlmeister.com
>>
>

Re: Total Outage of all ZooKeeper Instances

Posted by Ted Dunning <te...@gmail.com>.
Were you providing all three servers in the connect string.

On Wed, Oct 12, 2011 at 5:13 PM, Patrick Hunt <ph...@apache.org> wrote:

> It's not clear to me from your description if the servers could form a
> quorum or not, and if they could then why couldn't the clients
> connect. Also more details on your setup (for example what version of
> ZK are you using :-) ) would be helpful.
>

Re: Total Outage of all ZooKeeper Instances

Posted by Patrick Hunt <ph...@apache.org>.
We use a handful of exit(#) codes, none of which are -7 though. (we
don't use 7 either). Perhaps this is some issue with the wrapper? Can
you take a look? Are you loading any other code into the JVM that
could account for this exit code?

It's not clear to me from your description if the servers could form a
quorum or not, and if they could then why couldn't the clients
connect. Also more details on your setup (for example what version of
ZK are you using :-) ) would be helpful.

My suggestion is that you create a jira for this, and attach all the
server logs and 1 or more client logs during the time in question.

What did you do to resolve the issue?

Patrick

On Wed, Oct 12, 2011 at 8:46 AM, Mike Schilli <m...@perlmeister.com> wrote:
> On Wed, 12 Oct 2011, Ted Dunning wrote:
>
>> Without a bit of log info, I don't think we would have any idea what
>> the problem was other than the -7 return code.
>>
>> Can you give some specifics about what the ZK instances were saying?
>
> Sure, this is what I found in the zookeeper logs around the same time:
>
> 2011-10-06 21:14:32.322544500 WARN
>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
> EndOfStreamExcepti
> on: Unable to read additional data from client sessionid 0x3324ffac32a0012,
> likely client has closed socket
> 2011-10-06 21:14:32.322860500 INFO
>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
> socket con
> nection for client /127.0.0.1:44431 which had sessionid 0x3324ffac32a0012
> 2011-10-06 21:14:32.829464500 WARN
>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
> EndOfStreamExcepti
> on: Unable to read additional data from client sessionid 0x3324ffac32a0011,
> likely client has closed socket
> 2011-10-06 21:14:32.829468500 INFO
>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
> socket con
> nection for client /127.0.0.1:44435 which had sessionid 0x3324ffac32a0011
>
> Followed by a number of client reconnect attempts:
>
> 2011-10-06 21:14:34.956557500 INFO
>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client
> attempting to renew session 0x3324ffac32a000f at /127.0.0.1:45019
> 2011-10-06 21:14:34.956559500 INFO
>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating
> client: 230
> 334469384044559
> 2011-10-06 21:14:34.957578500 INFO
>  [QuorumPeer:/0.0.0.0:2181:NIOServerCnxn@1580] - Established session
> 0x3324ffac32a0
> 00f with negotiated timeout 10000 for client /127.0.0.1:45019
> 2011-10-06 21:14:35.199455500 WARN
>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
> EndOfStreamExcepti
> on: Unable to read additional data from client sessionid 0x3324ffac32a000e,
> likely client has closed socket
> 2011-10-06 21:14:35.199457500 INFO
>  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
> socket con
> nection for client /127.0.0.1:44434 which had sessionid 0x3324ffac32a000e
> 2011-10-06 21:14:35.577628500 ERROR [CommitProcessor:3:NIOServerCnxn@445] -
> Unexpected Exception: 2011-10-06 21:14:35.577629500
> java.nio.channels.CancelledKeyException
> 2011-10-06 21:14:35.577631500   at
> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
> 2011-10-06 21:14:35.577632500   at
> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
> 2011-10-06 21:14:35.577633500   at
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
> 2011-10-06 21:14:35.577634500   at
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
> 2011-10-06 21:14:35.577641500   at
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
> sor.java:367)
> 2011-10-06 21:14:35.577642500   at
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
> 2011-10-06 21:14:35.577797500 ERROR [CommitProcessor:3:NIOServerCnxn@445] -
> Unexpected Exception: 2011-10-06 21:14:35.577798500
> java.nio.channels.CancelledKeyException
> 2011-10-06 21:14:35.577800500   at
> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
> 2011-10-06 21:14:35.577801500   at
> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
> 2011-10-06 21:14:35.577802500   at
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
> 2011-10-06 21:14:35.577804500   at
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
> 2011-10-06 21:14:35.577830500   at
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
> sor.java:171)
> 2011-10-06 21:14:35.577831500   at
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>
>
> --
> -- Mike
>
> Mike Schilli
> m@perlmeister.com
>

Re: Total Outage of all ZooKeeper Instances

Posted by Mike Schilli <m...@perlmeister.com>.
On Wed, 12 Oct 2011, Ted Dunning wrote:

> Without a bit of log info, I don't think we would have any idea what
> the problem was other than the -7 return code.
> 
> Can you give some specifics about what the ZK instances were saying?

Sure, this is what I found in the zookeeper logs around the same time:

2011-10-06 21:14:32.322544500 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamExcepti
on: Unable to read additional data from client sessionid 0x3324ffac32a0012, likely client has closed socket
2011-10-06 21:14:32.322860500 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket con
nection for client /127.0.0.1:44431 which had sessionid 0x3324ffac32a0012
2011-10-06 21:14:32.829464500 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamExcepti
on: Unable to read additional data from client sessionid 0x3324ffac32a0011, likely client has closed socket
2011-10-06 21:14:32.829468500 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket con
nection for client /127.0.0.1:44435 which had sessionid 0x3324ffac32a0011

Followed by a number of client reconnect attempts:

2011-10-06 21:14:34.956557500 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting 
to renew session 0x3324ffac32a000f at /127.0.0.1:45019
2011-10-06 21:14:34.956559500 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating client: 230
334469384044559
2011-10-06 21:14:34.957578500 INFO  [QuorumPeer:/0.0.0.0:2181:NIOServerCnxn@1580] - Established session 0x3324ffac32a0
00f with negotiated timeout 10000 for client /127.0.0.1:45019
2011-10-06 21:14:35.199455500 WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamExcepti
on: Unable to read additional data from client sessionid 0x3324ffac32a000e, likely client has closed socket
2011-10-06 21:14:35.199457500 INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket con
nection for client /127.0.0.1:44434 which had sessionid 0x3324ffac32a000e
2011-10-06 21:14:35.577628500 ERROR [CommitProcessor:3:NIOServerCnxn@445] - Unexpected Exception: 
2011-10-06 21:14:35.577629500 java.nio.channels.CancelledKeyException
2011-10-06 21:14:35.577631500   at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
2011-10-06 21:14:35.577632500   at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
2011-10-06 21:14:35.577633500   at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
2011-10-06 21:14:35.577634500   at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
2011-10-06 21:14:35.577641500   at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
sor.java:367)
2011-10-06 21:14:35.577642500   at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
2011-10-06 21:14:35.577797500 ERROR [CommitProcessor:3:NIOServerCnxn@445] - Unexpected Exception: 
2011-10-06 21:14:35.577798500 java.nio.channels.CancelledKeyException
2011-10-06 21:14:35.577800500   at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
2011-10-06 21:14:35.577801500   at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
2011-10-06 21:14:35.577802500   at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
2011-10-06 21:14:35.577804500   at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
2011-10-06 21:14:35.577830500   at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
sor.java:171)
2011-10-06 21:14:35.577831500   at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)


-- 
-- Mike

Mike Schilli
m@perlmeister.com

Re: Total Outage of all ZooKeeper Instances

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Oct 12, 2011 at 7:04 AM, Mike Schilli <m...@perlmeister.com> wrote:

> Anyone got any ideas on how to prevent this kind of problem from
> happening?
>

Without a bit of log info, I don't think we would have any idea what the
problem was other than the -7 return code.

Can you give some specifics about what the ZK instances were saying?