You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by "mostolog@gmail.com" <mo...@gmail.com> on 2017/07/14 11:27:02 UTC
intermitent failures led to node restart
Hi
Using 3.5.3-beta, from time to time, our zookeeper ensemble leader
commits suicide:
Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,797]
ERROR Unexpected exception causing shutdown while sock still open
(org.apache.zookeeper.server.quorum.LearnerHandler)
Jul 13 12:40:32 host zookeeper[28511]:
java.net.SocketTimeoutException: Read timed out
Jul 13 12:40:32 host zookeeper[28511]: #011at
java.net.SocketInputStream.socketRead0(Native Method)
Jul 13 12:40:32 host zookeeper[28511]: #011at
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
Jul 13 12:40:32 host zookeeper[28511]: #011at
java.net.SocketInputStream.read(SocketInputStream.java:171)
Jul 13 12:40:32 host zookeeper[28511]: #011at
java.net.SocketInputStream.read(SocketInputStream.java:141)
Jul 13 12:40:32 host zookeeper[28511]: #011at
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
Jul 13 12:40:32 host zookeeper[28511]: #011at
java.io.BufferedInputStream.read(BufferedInputStream.java:265)
Jul 13 12:40:32 host zookeeper[28511]: #011at
java.io.DataInputStream.readInt(DataInputStream.java:387)
Jul 13 12:40:32 host zookeeper[28511]: #011at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
Jul 13 12:40:32 host zookeeper[28511]: #011at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
Jul 13 12:40:32 host zookeeper[28511]: #011at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
Jul 13 12:40:32 host zookeeper[28511]: #011at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,844]
WARN ******* GOODBYE /10.0.0.11:42816 ********
(org.apache.zookeeper.server.quorum.LearnerHandler)
Another ensemble node also complains (*cause or effect? notice
miliseconds!*):
Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,500]
WARN Exception when following the leader
(org.apache.zookeeper.server.quorum.Learner)
Jul 13 12:40:32 host12 zookeeper[26293]: java.net.SocketException:
Broken pipe (Write failed)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
java.net.SocketOutputStream.socketWrite0(Native Method)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
java.net.SocketOutputStream.write(SocketOutputStream.java:155)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:141)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:620)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:118)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
Jul 13 12:40:32 host12 zookeeper[26293]: #011at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,567]
WARN PeerState set to LOOKING
(org.apache.zookeeper.server.quorum.QuorumPeer)
while the third node still continues living happily.
Ideas?
Re: intermitent failures led to node restart
Posted by "mostolog@gmail.com" <mo...@gmail.com>.
We are using latest docker release, so I'm not thinking on network issues
Cluster, however, handles a lot of traffic, so perhaps node is "too
busy" to answer within time. Is this possible?
I'll have an eye on syncLimit.
Thanks.
PS: Any other replies are welcome too!
On 17/07/17 21:40, Abraham Fine wrote:
> Hello-
>
> It looks like there is a timeout during a ping. This is strange because
> pings don't need to transfer much data.
>
> Does the issue always occur with the same learner? Is it possible that
> there is a network issue here?
>
> You also may want to consider playing with the syncLimit.
>
> Thanks,
> Abe
>
> On Mon, Jul 17, 2017, at 01:30, mostolog@gmail.com wrote:
>> ping!
>>
>>
>> On 14/07/17 13:27, mostolog@gmail.com wrote:
>>> Hi
>>>
>>> Using 3.5.3-beta, from time to time, our zookeeper ensemble leader
>>> commits suicide:
>>>
>>> Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,797]
>>> ERROR Unexpected exception causing shutdown while sock still open
>>> (org.apache.zookeeper.server.quorum.LearnerHandler)
>>> Jul 13 12:40:32 host zookeeper[28511]:
>>> java.net.SocketTimeoutException: Read timed out
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> java.net.SocketInputStream.socketRead0(Native Method)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> java.net.SocketInputStream.read(SocketInputStream.java:171)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> java.net.SocketInputStream.read(SocketInputStream.java:141)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> java.io.DataInputStream.readInt(DataInputStream.java:387)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>>> Jul 13 12:40:32 host zookeeper[28511]: #011at
>>> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
>>> Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,844]
>>> WARN ******* GOODBYE /10.0.0.11:42816 ********
>>> (org.apache.zookeeper.server.quorum.LearnerHandler)
>>>
>>> Another ensemble node also complains (*cause or effect? notice
>>> miliseconds!*):
>>>
>>> Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,500]
>>> WARN Exception when following the leader
>>> (org.apache.zookeeper.server.quorum.Learner)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: java.net.SocketException:
>>> Broken pipe (Write failed)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> java.net.SocketOutputStream.socketWrite0(Native Method)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:141)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:620)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:118)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
>>> Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,567]
>>> WARN PeerState set to LOOKING
>>> (org.apache.zookeeper.server.quorum.QuorumPeer)
>>>
>>> while the third node still continues living happily.
>>>
>>> Ideas?
>>>
>>>
>>>
Re: intermitent failures led to node restart
Posted by Abraham Fine <af...@apache.org>.
Hello-
It looks like there is a timeout during a ping. This is strange because
pings don't need to transfer much data.
Does the issue always occur with the same learner? Is it possible that
there is a network issue here?
You also may want to consider playing with the syncLimit.
Thanks,
Abe
On Mon, Jul 17, 2017, at 01:30, mostolog@gmail.com wrote:
> ping!
>
>
> On 14/07/17 13:27, mostolog@gmail.com wrote:
> >
> > Hi
> >
> > Using 3.5.3-beta, from time to time, our zookeeper ensemble leader
> > commits suicide:
> >
> > Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,797]
> > ERROR Unexpected exception causing shutdown while sock still open
> > (org.apache.zookeeper.server.quorum.LearnerHandler)
> > Jul 13 12:40:32 host zookeeper[28511]:
> > java.net.SocketTimeoutException: Read timed out
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > java.net.SocketInputStream.socketRead0(Native Method)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > java.net.SocketInputStream.read(SocketInputStream.java:171)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > java.net.SocketInputStream.read(SocketInputStream.java:141)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > java.io.DataInputStream.readInt(DataInputStream.java:387)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> > Jul 13 12:40:32 host zookeeper[28511]: #011at
> > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
> > Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,844]
> > WARN ******* GOODBYE /10.0.0.11:42816 ********
> > (org.apache.zookeeper.server.quorum.LearnerHandler)
> >
> > Another ensemble node also complains (*cause or effect? notice
> > miliseconds!*):
> >
> > Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,500]
> > WARN Exception when following the leader
> > (org.apache.zookeeper.server.quorum.Learner)
> > Jul 13 12:40:32 host12 zookeeper[26293]: java.net.SocketException:
> > Broken pipe (Write failed)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > java.net.SocketOutputStream.socketWrite0(Native Method)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:141)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:620)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:118)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
> > Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> > Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,567]
> > WARN PeerState set to LOOKING
> > (org.apache.zookeeper.server.quorum.QuorumPeer)
> >
> > while the third node still continues living happily.
> >
> > Ideas?
> >
> >
> >
>
Re: intermitent failures led to node restart
Posted by "mostolog@gmail.com" <mo...@gmail.com>.
ping!
On 14/07/17 13:27, mostolog@gmail.com wrote:
>
> Hi
>
> Using 3.5.3-beta, from time to time, our zookeeper ensemble leader
> commits suicide:
>
> Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,797]
> ERROR Unexpected exception causing shutdown while sock still open
> (org.apache.zookeeper.server.quorum.LearnerHandler)
> Jul 13 12:40:32 host zookeeper[28511]:
> java.net.SocketTimeoutException: Read timed out
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> java.net.SocketInputStream.socketRead0(Native Method)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> java.net.SocketInputStream.read(SocketInputStream.java:171)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> java.net.SocketInputStream.read(SocketInputStream.java:141)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> java.io.DataInputStream.readInt(DataInputStream.java:387)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> Jul 13 12:40:32 host zookeeper[28511]: #011at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
> Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,844]
> WARN ******* GOODBYE /10.0.0.11:42816 ********
> (org.apache.zookeeper.server.quorum.LearnerHandler)
>
> Another ensemble node also complains (*cause or effect? notice
> miliseconds!*):
>
> Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,500]
> WARN Exception when following the leader
> (org.apache.zookeeper.server.quorum.Learner)
> Jul 13 12:40:32 host12 zookeeper[26293]: java.net.SocketException:
> Broken pipe (Write failed)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> java.net.SocketOutputStream.socketWrite0(Native Method)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:141)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:620)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:118)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
> Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,567]
> WARN PeerState set to LOOKING
> (org.apache.zookeeper.server.quorum.QuorumPeer)
>
> while the third node still continues living happily.
>
> Ideas?
>
>
>