You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by Mike Schilli <m...@perlmeister.com> on 2011/10/12 09:04:58 UTC
Total Outage of all ZooKeeper Instances
Hi ZooKeeper folks,
we just had an outage with our application using ZooKeeper the
other day and are investigating how it happened.
Our setup is 3 servers, each of them running
* the application
* a local ZooKeeper instance
All 3 ZooKeeper instances form a quorum. Each application talks to the
local ZooKeeper instance only.
Looking through the logs of our applications on the 3 different servers,
it turns out that *all three* ZooKeeper instances returned -7 to their
applications during the outage. The applications tried to throw away
their handle and reconnect, but none of them was able to recover.
Anyone got any ideas on how to prevent this kind of problem from
happening?
Thanks for any help.
--
-- Mike
Mike Schilli
m@perlmeister.com
Re: Total Outage of all ZooKeeper Instances
Posted by Mike Schilli <m...@perlmeister.com>.
On Wed, 12 Oct 2011, Patrick Hunt wrote:
> We use a handful of exit(#) codes, none of which are -7 though. (we
> don't use 7 either).
We're using the Perl wrapper (Net::ZooKeeper from CPAN) which translates
[ -7, 'ZOPERATIONTIMEOUT', 'Operation timeout', ],
> connect. Also more details on your setup (for example what version of
> ZK are you using :-) ) would be helpful.
We're using 3.3.3_0.
> My suggestion is that you create a jira for this, and attach all the
> server logs and 1 or more client logs during the time in question.
Will do.
> What did you do to resolve the issue?
After we shut down all three application servers, ZK recovered after
a while.
--
-- Mike
Mike Schilli
m@perlmeister.com
>
> Patrick
>
> On Wed, Oct 12, 2011 at 8:46 AM, Mike Schilli <m...@perlmeister.com> wrote:
>> On Wed, 12 Oct 2011, Ted Dunning wrote:
>>
>>> Without a bit of log info, I don't think we would have any idea what
>>> the problem was other than the -7 return code.
>>>
>>> Can you give some specifics about what the ZK instances were saying?
>>
>> Sure, this is what I found in the zookeeper logs around the same time:
>>
>> 2011-10-06 21:14:32.322544500 WARN
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
>> EndOfStreamExcepti
>> on: Unable to read additional data from client sessionid 0x3324ffac32a0012,
>> likely client has closed socket
>> 2011-10-06 21:14:32.322860500 INFO
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
>> socket con
>> nection for client /127.0.0.1:44431 which had sessionid 0x3324ffac32a0012
>> 2011-10-06 21:14:32.829464500 WARN
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
>> EndOfStreamExcepti
>> on: Unable to read additional data from client sessionid 0x3324ffac32a0011,
>> likely client has closed socket
>> 2011-10-06 21:14:32.829468500 INFO
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
>> socket con
>> nection for client /127.0.0.1:44435 which had sessionid 0x3324ffac32a0011
>>
>> Followed by a number of client reconnect attempts:
>>
>> 2011-10-06 21:14:34.956557500 INFO
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client
>> attempting to renew session 0x3324ffac32a000f at /127.0.0.1:45019
>> 2011-10-06 21:14:34.956559500 INFO
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating
>> client: 230
>> 334469384044559
>> 2011-10-06 21:14:34.957578500 INFO
>> [QuorumPeer:/0.0.0.0:2181:NIOServerCnxn@1580] - Established session
>> 0x3324ffac32a0
>> 00f with negotiated timeout 10000 for client /127.0.0.1:45019
>> 2011-10-06 21:14:35.199455500 WARN
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
>> EndOfStreamExcepti
>> on: Unable to read additional data from client sessionid 0x3324ffac32a000e,
>> likely client has closed socket
>> 2011-10-06 21:14:35.199457500 INFO
>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
>> socket con
>> nection for client /127.0.0.1:44434 which had sessionid 0x3324ffac32a000e
>> 2011-10-06 21:14:35.577628500 ERROR [CommitProcessor:3:NIOServerCnxn@445] -
>> Unexpected Exception: 2011-10-06 21:14:35.577629500
>> java.nio.channels.CancelledKeyException
>> 2011-10-06 21:14:35.577631500 at
>> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>> 2011-10-06 21:14:35.577632500 at
>> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>> 2011-10-06 21:14:35.577633500 at
>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>> 2011-10-06 21:14:35.577634500 at
>> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>> 2011-10-06 21:14:35.577641500 at
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
>> sor.java:367)
>> 2011-10-06 21:14:35.577642500 at
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>> 2011-10-06 21:14:35.577797500 ERROR [CommitProcessor:3:NIOServerCnxn@445] -
>> Unexpected Exception: 2011-10-06 21:14:35.577798500
>> java.nio.channels.CancelledKeyException
>> 2011-10-06 21:14:35.577800500 at
>> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>> 2011-10-06 21:14:35.577801500 at
>> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>> 2011-10-06 21:14:35.577802500 at
>> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>> 2011-10-06 21:14:35.577804500 at
>> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>> 2011-10-06 21:14:35.577830500 at
>> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
>> sor.java:171)
>> 2011-10-06 21:14:35.577831500 at
>> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>>
>>
>> --
>> -- Mike
>>
>> Mike Schilli
>> m@perlmeister.com
>>
>
Re: Total Outage of all ZooKeeper Instances
Posted by Ted Dunning <te...@gmail.com>.
Were you providing all three servers in the connect string.
On Wed, Oct 12, 2011 at 5:13 PM, Patrick Hunt <ph...@apache.org> wrote:
> It's not clear to me from your description if the servers could form a
> quorum or not, and if they could then why couldn't the clients
> connect. Also more details on your setup (for example what version of
> ZK are you using :-) ) would be helpful.
>
Re: Total Outage of all ZooKeeper Instances
Posted by Patrick Hunt <ph...@apache.org>.
We use a handful of exit(#) codes, none of which are -7 though. (we
don't use 7 either). Perhaps this is some issue with the wrapper? Can
you take a look? Are you loading any other code into the JVM that
could account for this exit code?
It's not clear to me from your description if the servers could form a
quorum or not, and if they could then why couldn't the clients
connect. Also more details on your setup (for example what version of
ZK are you using :-) ) would be helpful.
My suggestion is that you create a jira for this, and attach all the
server logs and 1 or more client logs during the time in question.
What did you do to resolve the issue?
Patrick
On Wed, Oct 12, 2011 at 8:46 AM, Mike Schilli <m...@perlmeister.com> wrote:
> On Wed, 12 Oct 2011, Ted Dunning wrote:
>
>> Without a bit of log info, I don't think we would have any idea what
>> the problem was other than the -7 return code.
>>
>> Can you give some specifics about what the ZK instances were saying?
>
> Sure, this is what I found in the zookeeper logs around the same time:
>
> 2011-10-06 21:14:32.322544500 WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
> EndOfStreamExcepti
> on: Unable to read additional data from client sessionid 0x3324ffac32a0012,
> likely client has closed socket
> 2011-10-06 21:14:32.322860500 INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
> socket con
> nection for client /127.0.0.1:44431 which had sessionid 0x3324ffac32a0012
> 2011-10-06 21:14:32.829464500 WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
> EndOfStreamExcepti
> on: Unable to read additional data from client sessionid 0x3324ffac32a0011,
> likely client has closed socket
> 2011-10-06 21:14:32.829468500 INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
> socket con
> nection for client /127.0.0.1:44435 which had sessionid 0x3324ffac32a0011
>
> Followed by a number of client reconnect attempts:
>
> 2011-10-06 21:14:34.956557500 INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client
> attempting to renew session 0x3324ffac32a000f at /127.0.0.1:45019
> 2011-10-06 21:14:34.956559500 INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating
> client: 230
> 334469384044559
> 2011-10-06 21:14:34.957578500 INFO
> [QuorumPeer:/0.0.0.0:2181:NIOServerCnxn@1580] - Established session
> 0x3324ffac32a0
> 00f with negotiated timeout 10000 for client /127.0.0.1:45019
> 2011-10-06 21:14:35.199455500 WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] -
> EndOfStreamExcepti
> on: Unable to read additional data from client sessionid 0x3324ffac32a000e,
> likely client has closed socket
> 2011-10-06 21:14:35.199457500 INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
> socket con
> nection for client /127.0.0.1:44434 which had sessionid 0x3324ffac32a000e
> 2011-10-06 21:14:35.577628500 ERROR [CommitProcessor:3:NIOServerCnxn@445] -
> Unexpected Exception: 2011-10-06 21:14:35.577629500
> java.nio.channels.CancelledKeyException
> 2011-10-06 21:14:35.577631500 at
> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
> 2011-10-06 21:14:35.577632500 at
> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
> 2011-10-06 21:14:35.577633500 at
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
> 2011-10-06 21:14:35.577634500 at
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
> 2011-10-06 21:14:35.577641500 at
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
> sor.java:367)
> 2011-10-06 21:14:35.577642500 at
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
> 2011-10-06 21:14:35.577797500 ERROR [CommitProcessor:3:NIOServerCnxn@445] -
> Unexpected Exception: 2011-10-06 21:14:35.577798500
> java.nio.channels.CancelledKeyException
> 2011-10-06 21:14:35.577800500 at
> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
> 2011-10-06 21:14:35.577801500 at
> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
> 2011-10-06 21:14:35.577802500 at
> org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
> 2011-10-06 21:14:35.577804500 at
> org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
> 2011-10-06 21:14:35.577830500 at
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
> sor.java:171)
> 2011-10-06 21:14:35.577831500 at
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>
>
> --
> -- Mike
>
> Mike Schilli
> m@perlmeister.com
>
Re: Total Outage of all ZooKeeper Instances
Posted by Mike Schilli <m...@perlmeister.com>.
On Wed, 12 Oct 2011, Ted Dunning wrote:
> Without a bit of log info, I don't think we would have any idea what
> the problem was other than the -7 return code.
>
> Can you give some specifics about what the ZK instances were saying?
Sure, this is what I found in the zookeeper logs around the same time:
2011-10-06 21:14:32.322544500 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamExcepti
on: Unable to read additional data from client sessionid 0x3324ffac32a0012, likely client has closed socket
2011-10-06 21:14:32.322860500 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket con
nection for client /127.0.0.1:44431 which had sessionid 0x3324ffac32a0012
2011-10-06 21:14:32.829464500 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamExcepti
on: Unable to read additional data from client sessionid 0x3324ffac32a0011, likely client has closed socket
2011-10-06 21:14:32.829468500 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket con
nection for client /127.0.0.1:44435 which had sessionid 0x3324ffac32a0011
Followed by a number of client reconnect attempts:
2011-10-06 21:14:34.956557500 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting
to renew session 0x3324ffac32a000f at /127.0.0.1:45019
2011-10-06 21:14:34.956559500 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating client: 230
334469384044559
2011-10-06 21:14:34.957578500 INFO [QuorumPeer:/0.0.0.0:2181:NIOServerCnxn@1580] - Established session 0x3324ffac32a0
00f with negotiated timeout 10000 for client /127.0.0.1:45019
2011-10-06 21:14:35.199455500 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamExcepti
on: Unable to read additional data from client sessionid 0x3324ffac32a000e, likely client has closed socket
2011-10-06 21:14:35.199457500 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket con
nection for client /127.0.0.1:44434 which had sessionid 0x3324ffac32a000e
2011-10-06 21:14:35.577628500 ERROR [CommitProcessor:3:NIOServerCnxn@445] - Unexpected Exception:
2011-10-06 21:14:35.577629500 java.nio.channels.CancelledKeyException
2011-10-06 21:14:35.577631500 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
2011-10-06 21:14:35.577632500 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
2011-10-06 21:14:35.577633500 at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
2011-10-06 21:14:35.577634500 at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
2011-10-06 21:14:35.577641500 at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
sor.java:367)
2011-10-06 21:14:35.577642500 at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
2011-10-06 21:14:35.577797500 ERROR [CommitProcessor:3:NIOServerCnxn@445] - Unexpected Exception:
2011-10-06 21:14:35.577798500 java.nio.channels.CancelledKeyException
2011-10-06 21:14:35.577800500 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
2011-10-06 21:14:35.577801500 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
2011-10-06 21:14:35.577802500 at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
2011-10-06 21:14:35.577804500 at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
2011-10-06 21:14:35.577830500 at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProces
sor.java:171)
2011-10-06 21:14:35.577831500 at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
--
-- Mike
Mike Schilli
m@perlmeister.com
Re: Total Outage of all ZooKeeper Instances
Posted by Ted Dunning <te...@gmail.com>.
On Wed, Oct 12, 2011 at 7:04 AM, Mike Schilli <m...@perlmeister.com> wrote:
> Anyone got any ideas on how to prevent this kind of problem from
> happening?
>
Without a bit of log info, I don't think we would have any idea what the
problem was other than the -7 return code.
Can you give some specifics about what the ZK instances were saying?