You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by "mostolog@gmail.com" <mo...@gmail.com> on 2017/07/14 11:27:02 UTC

intermitent failures led to node restart

Hi

Using 3.5.3-beta, from time to time, our zookeeper ensemble leader 
commits suicide:

    Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,797]
    ERROR Unexpected exception causing shutdown while sock still open
    (org.apache.zookeeper.server.quorum.LearnerHandler)
    Jul 13 12:40:32 host zookeeper[28511]:
    java.net.SocketTimeoutException: Read timed out
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    java.net.SocketInputStream.socketRead0(Native Method)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    java.net.SocketInputStream.read(SocketInputStream.java:171)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    java.io.DataInputStream.readInt(DataInputStream.java:387)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    Jul 13 12:40:32 host zookeeper[28511]: #011at
    org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
    Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,844]
    WARN ******* GOODBYE /10.0.0.11:42816 ********
    (org.apache.zookeeper.server.quorum.LearnerHandler)

Another ensemble node also complains (*cause or effect? notice 
miliseconds!*):

    Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,500]
    WARN Exception when following the leader
    (org.apache.zookeeper.server.quorum.Learner)
    Jul 13 12:40:32 host12 zookeeper[26293]: java.net.SocketException:
    Broken pipe (Write failed)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    java.net.SocketOutputStream.socketWrite0(Native Method)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:141)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:620)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:118)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
    Jul 13 12:40:32 host12 zookeeper[26293]: #011at
    org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
    Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,567]
    WARN PeerState set to LOOKING
    (org.apache.zookeeper.server.quorum.QuorumPeer)

while the third node still continues living happily.

Ideas?




Re: intermitent failures led to node restart

Posted by "mostolog@gmail.com" <mo...@gmail.com>.
We are using latest docker release, so I'm not thinking on network issues

Cluster, however, handles a lot of traffic, so perhaps node is "too 
busy" to answer within time. Is this possible?

I'll have an eye on syncLimit.

Thanks.

PS: Any other replies are welcome too!


On 17/07/17 21:40, Abraham Fine wrote:
> Hello-
>
> It looks like there is a timeout during a ping. This is strange because
> pings don't need to transfer much data.
>
> Does the issue always occur with the same learner? Is it possible that
> there is a network issue here?
>
> You also may want to consider playing with the syncLimit.
>
> Thanks,
> Abe
>
> On Mon, Jul 17, 2017, at 01:30, mostolog@gmail.com wrote:
>> ping!
>>
>>
>> On 14/07/17 13:27, mostolog@gmail.com wrote:
>>> Hi
>>>
>>> Using 3.5.3-beta, from time to time, our zookeeper ensemble leader
>>> commits suicide:
>>>
>>>      Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,797]
>>>      ERROR Unexpected exception causing shutdown while sock still open
>>>      (org.apache.zookeeper.server.quorum.LearnerHandler)
>>>      Jul 13 12:40:32 host zookeeper[28511]:
>>>      java.net.SocketTimeoutException: Read timed out
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      java.net.SocketInputStream.socketRead0(Native Method)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      java.net.SocketInputStream.read(SocketInputStream.java:171)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      java.net.SocketInputStream.read(SocketInputStream.java:141)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      java.io.DataInputStream.readInt(DataInputStream.java:387)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>>>      Jul 13 12:40:32 host zookeeper[28511]: #011at
>>>      org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
>>>      Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,844]
>>>      WARN ******* GOODBYE /10.0.0.11:42816 ********
>>>      (org.apache.zookeeper.server.quorum.LearnerHandler)
>>>
>>> Another ensemble node also complains (*cause or effect? notice
>>> miliseconds!*):
>>>
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,500]
>>>      WARN Exception when following the leader
>>>      (org.apache.zookeeper.server.quorum.Learner)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: java.net.SocketException:
>>>      Broken pipe (Write failed)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      java.net.SocketOutputStream.socketWrite0(Native Method)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:141)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:620)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:118)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>>>      org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
>>>      Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,567]
>>>      WARN PeerState set to LOOKING
>>>      (org.apache.zookeeper.server.quorum.QuorumPeer)
>>>
>>> while the third node still continues living happily.
>>>
>>> Ideas?
>>>
>>>
>>>


Re: intermitent failures led to node restart

Posted by Abraham Fine <af...@apache.org>.
Hello-

It looks like there is a timeout during a ping. This is strange because
pings don't need to transfer much data. 

Does the issue always occur with the same learner? Is it possible that
there is a network issue here?

You also may want to consider playing with the syncLimit.

Thanks,
Abe

On Mon, Jul 17, 2017, at 01:30, mostolog@gmail.com wrote:
> ping!
> 
> 
> On 14/07/17 13:27, mostolog@gmail.com wrote:
> >
> > Hi
> >
> > Using 3.5.3-beta, from time to time, our zookeeper ensemble leader 
> > commits suicide:
> >
> >     Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,797]
> >     ERROR Unexpected exception causing shutdown while sock still open
> >     (org.apache.zookeeper.server.quorum.LearnerHandler)
> >     Jul 13 12:40:32 host zookeeper[28511]:
> >     java.net.SocketTimeoutException: Read timed out
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     java.net.SocketInputStream.socketRead0(Native Method)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     java.net.SocketInputStream.read(SocketInputStream.java:171)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     java.net.SocketInputStream.read(SocketInputStream.java:141)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     java.io.DataInputStream.readInt(DataInputStream.java:387)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> >     Jul 13 12:40:32 host zookeeper[28511]: #011at
> >     org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
> >     Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,844]
> >     WARN ******* GOODBYE /10.0.0.11:42816 ********
> >     (org.apache.zookeeper.server.quorum.LearnerHandler)
> >
> > Another ensemble node also complains (*cause or effect? notice 
> > miliseconds!*):
> >
> >     Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,500]
> >     WARN Exception when following the leader
> >     (org.apache.zookeeper.server.quorum.Learner)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: java.net.SocketException:
> >     Broken pipe (Write failed)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     java.net.SocketOutputStream.socketWrite0(Native Method)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:141)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:620)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:118)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
> >     org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> >     Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,567]
> >     WARN PeerState set to LOOKING
> >     (org.apache.zookeeper.server.quorum.QuorumPeer)
> >
> > while the third node still continues living happily.
> >
> > Ideas?
> >
> >
> >
> 

Re: intermitent failures led to node restart

Posted by "mostolog@gmail.com" <mo...@gmail.com>.
ping!


On 14/07/17 13:27, mostolog@gmail.com wrote:
>
> Hi
>
> Using 3.5.3-beta, from time to time, our zookeeper ensemble leader 
> commits suicide:
>
>     Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,797]
>     ERROR Unexpected exception causing shutdown while sock still open
>     (org.apache.zookeeper.server.quorum.LearnerHandler)
>     Jul 13 12:40:32 host zookeeper[28511]:
>     java.net.SocketTimeoutException: Read timed out
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     java.net.SocketInputStream.socketRead0(Native Method)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     java.net.SocketInputStream.read(SocketInputStream.java:171)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     java.net.SocketInputStream.read(SocketInputStream.java:141)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     java.io.DataInputStream.readInt(DataInputStream.java:387)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>     Jul 13 12:40:32 host zookeeper[28511]: #011at
>     org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
>     Jul 13 12:40:32 host zookeeper[28511]: [2017-07-13 12:40:32,844]
>     WARN ******* GOODBYE /10.0.0.11:42816 ********
>     (org.apache.zookeeper.server.quorum.LearnerHandler)
>
> Another ensemble node also complains (*cause or effect? notice 
> miliseconds!*):
>
>     Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,500]
>     WARN Exception when following the leader
>     (org.apache.zookeeper.server.quorum.Learner)
>     Jul 13 12:40:32 host12 zookeeper[26293]: java.net.SocketException:
>     Broken pipe (Write failed)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     java.net.SocketOutputStream.socketWrite0(Native Method)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:141)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:620)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:118)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
>     Jul 13 12:40:32 host12 zookeeper[26293]: #011at
>     org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
>     Jul 13 12:40:32 host12 zookeeper[26293]: [2017-07-13 12:40:32,567]
>     WARN PeerState set to LOOKING
>     (org.apache.zookeeper.server.quorum.QuorumPeer)
>
> while the third node still continues living happily.
>
> Ideas?
>
>
>