Posted to user@zookeeper.apache.org by "R. Tyler Croy" <ty...@monkeypox.org> on 2015/01/14 17:32:18 UTC

Large number of errored transactions and periodic node disconnects on 3.4.6

A colleague and I are still trying to fully understand the behavior we're
seeing from one of our Zookeeper clusters running 3.4.6.

Some log snippets are included further down, but the symptom we're seeing is
that nodes in the cluster periodically detach themselves entirely from the
rest of the cluster. The interval is not consistent, however; in at least one
known case we've seen multiple nodes detach around the same time, causing
leader re-election challenges (putting it kindly!).

Our investigation has turned up a large number of errored transactions
(txntype: -1) in our trace logs with "n/a" request paths. To give an idea of
the scale, here's a summary from ~20 minutes of logs:

-> # grep txntype zookeeper.log* | perl -pe 's/.*txntype/txntype/' | sort | uniq -c | sort -nr
70585 txntype:-1 reqpath:n/a
4860 txntype:5 reqpath:n/a
7 txntype:-11 reqpath:n/a
7 txntype:-10 reqpath:n/a
[..snip..]

The number of session create/close transactions looks about right, as does the
number of 'set data' transactions; however, the "reqpath" being "n/a" does not
look entirely correct.
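
(For reference, we believe these txntype values correspond to the request
opcodes defined in org.apache.zookeeper.ZooDefs.OpCode. A minimal sketch of the
mapping, assuming the stock 3.4.x constants are on the classpath; the class
name is only for illustration:)

    import org.apache.zookeeper.ZooDefs.OpCode;

    public class TxnTypes {
        public static void main(String[] args) {
            // The txntype values from the grep summary above
            System.out.println("error:         " + OpCode.error);         // -1
            System.out.println("setData:       " + OpCode.setData);       //  5
            System.out.println("createSession: " + OpCode.createSession); // -10
            System.out.println("closeSession:  " + OpCode.closeSession);  // -11
        }
    }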


We're only using Zookeeper to hold offsets for normal Kafka high-level
consumers and Storm's KafkaSpout consumers, and I'm relatively confident that
those consumers are behaving correctly.


We have also come across a number of ostensibly resolved bugs in JIRA that, to
our untrained eyes, look similar to what we're seeing:

    "Zookeeper 3.3.4 loses ephemeral nodes under stress"
        <https://issues.apache.org/jira/browse/ZOOKEEPER-1740>
    "the threads number of a zookeeper is increased all the time"
        <https://issues.apache.org/jira/browse/ZOOKEEPER-939> (not seeing the
        thread growth mentioned here, just log similarities)


The node-detaching issue is not something we've been able to recreate on our
own, so I don't have a reliable reproduction case. This is ZK 3.4.6 on OpenJDK
1.7.0u67 running on Ubuntu 12.04; other than that, I'm happy to provide any
information that might help determine the cause of our problems!


Cheers (log snippets to follow)


Error logs from an incident where a node detached itself:

    2015-01-02 20:10:37,576 [myid:4] - ERROR [LearnerHandler-/10.96.100.69:54260:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
    2015-01-02 20:10:37,577 [myid:4] - ERROR [LearnerHandler-/10.96.100.90:43220:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
    2015-01-02 20:10:37,577 [myid:4] - WARN  [LearnerHandler-/10.96.100.69:54260:LearnerHandler@646] - ******* GOODBYE /10.96.100.69:54260 ********
    2015-01-02 20:10:37,576 [myid:4] - ERROR [LearnerHandler-/10.96.100.136:48175:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
    2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.96.14.103:47982:LearnerHandler@646] - ******* GOODBYE /10.96.14.103:47982 ********
    2015-01-02 20:10:37,577 [myid:4] - ERROR [LearnerHandler-/10.84.11.131:47798:LearnerHandler@633] - Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
    2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.84.11.131:47798:LearnerHandler@646] - ******* GOODBYE /10.84.11.131:47798 ********
    2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.96.100.136:48175:LearnerHandler@646] - ******* GOODBYE /10.96.100.136:48175 ********



Some trace logs from "normal operation" that looked suspect to us:


    2015-01-12 14:05:12,753 [myid:1] - TRACE [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooTrace@71] - loadData — session in archive: 381306787892823513 with timeout: 6000
    2015-01-12 14:05:12,797 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] - Snapshotting: 0xb3ea0ee8f4db to /srv/zookeeper/snapshots/version-2/snapshot.b3ea0ee8f4db
    2015-01-12 14:05:12,822 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@374] - Got zxid 0xb3ea0ee8f4dc expected 0x1
    2015-01-12 14:05:12,833 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,833 [myid:1] - DEBUG [CommitProcessor:1:FinalRequestProcessor@88] - Processing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,838 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,838 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] - Ignoring processTxn failure hdr: -1 : error: -110
    2015-01-12 14:05:12,838 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:FinalRequestProcessor@88] - Processing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,839 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] - Ignoring processTxn failure hdr: -1 : error: -110
    2015-01-12 14:05:12,839 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4840 zxid:0xb3ea0ee8f4de txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,839 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] - Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4843 zxid:0xb3ea0ee8f4df txntype:-1 reqpath:n/a
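
(For what it's worth, the "error: -110" in those DataTree lines looks like a
KeeperException result code. A minimal sketch to decode it, assuming the 3.4.x
client classes are on the classpath; the class name is only for illustration:)

    import org.apache.zookeeper.KeeperException;

    public class DecodeResultCode {
        public static void main(String[] args) {
            // -110 is the value logged as "error: -110" in the trace above
            System.out.println(KeeperException.Code.get(-110)); // prints NODEEXISTS
        }
    }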

Re: Large number of errored transactions and periodic node disconnects on 3.4.6

Posted by ma...@g2.nctu.edu.tw.
Thanks for your kind reply, but I can't fully understand your discussion.
Basically I have two devices (an iPhone 5 and an iPad Air), but sometimes I
also play with other iOS devices. Anyway, the iPhone 5 would be my
first-priority device. Can you restore my character status to the state it was
in before your server maintenance?
(I tried deleting the app and reinstalling it, but I still can't connect to
the server.)
Thanks for your help~

Sent from my iPhone



Re: Large number of errored transactions and periodic node disconnects on 3.4.6

Posted by "R. Tyler Croy" <ty...@monkeypox.org>.
Replying to the list to make sure that Ben's responses get archived below

On Wed, 14 Jan 2015, Ben Smith wrote:

> On Wed, Jan 14, 2015 at 1:28 PM, Flavio Junqueira <fp...@yahoo.com>
> wrote:
> 
> > Do you have anything running along with the ZK servers on the same
> > machine?
> 
> 
> No, the zookeeper machines are 'dedicated'. The only other stuff running is
> infrastructure related (metrics and log collection agents that incur no/low
> load).
> 
> 
> > Is that a virtualized environment?
> 
> 
> Yes, this is in the 'cloud' (: They are on AWS c3.large instances. The
> instance CPU and Memory resources are really low.
> 
> 
> > I'm wondering if a concurrent workload is causing servers to lag behind
> > and drop the ensemble. Also, how many disk devices are you using? is there
> > a dedicated device for the txn log?
> >
> 
> They are only using one device each; it is not dedicated, but it also isn't
> anywhere close to saturation, using iostat to validate svctm:
> 
> https://gist.github.com/polynomial/2dc6aefefd7fa31afc73
> 
> Here is the config if that helps too:
> 
> $ echo 'conf' | nc localhost 2181
> clientPort=2181
> dataDir=/path/to/zookeeper/snapshots/version-2
> dataLogDir=/path/to/zookeeper/transactions/version-2
> tickTime=2000
> maxClientCnxns=60
> minSessionTimeout=4000
> maxSessionTimeout=40000
> serverId=4
> initLimit=10
> syncLimit=5
> electionAlg=3
> electionPort=3888
> quorumPort=2888
> peerType=0
> 
> -ben

Re: Large number of errored transactions and periodic node disconnects on 3.4.6

Posted by Ben Smith <po...@gmail.com>.
On Wed, Jan 14, 2015 at 1:28 PM, Flavio Junqueira <fp...@yahoo.com>
wrote:

> Do you have anything running along with the ZK servers on the same
> machine?


No, the zookeeper machines are 'dedicated'. The only other stuff running is
infrastructure related (metrics and log collection agents that incur no/low
load).


> Is that a virtualized environment?


Yes, this is in the 'cloud' (: They are on AWS c3.large instances. The
instance CPU and Memory resources are really low.


> I'm wondering if a concurrent workload is causing servers to lag behind
> and drop the ensemble. Also, how many disk devices are you using? is there
> a dedicated device for the txn log?
>

They are only using one device each; it is not dedicated, but it also isn't
anywhere close to saturation, using iostat to validate svctm:

https://gist.github.com/polynomial/2dc6aefefd7fa31afc73

Here is the config if that helps too:

$ echo 'conf' | nc localhost 2181
clientPort=2181
dataDir=/path/to/zookeeper/snapshots/version-2
dataLogDir=/path/to/zookeeper/transactions/version-2
tickTime=2000
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
serverId=4
initLimit=10
syncLimit=5
electionAlg=3
electionPort=3888
quorumPort=2888
peerType=0

-ben
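
For context, the quorum windows implied by that config work out as sketched
below (a hypothetical, self-contained snippet). Our assumption, worth verifying
against LearnerHandler in 3.4.x, is that the leader derives the read timeout on
its follower sockets from these values, which would line up with the
"Read timed out" errors above if a follower goes quiet for roughly that long:

    public class QuorumTimeouts {
        public static void main(String[] args) {
            // Values taken from the 'conf' output above
            int tickTime  = 2000; // ms
            int initLimit = 10;   // ticks allowed for followers to connect and sync
            int syncLimit = 5;    // ticks allowed for followers to stay in sync

            System.out.println("initLimit window: " + (tickTime * initLimit) + " ms"); // 20000 ms
            System.out.println("syncLimit window: " + (tickTime * syncLimit) + " ms"); // 10000 ms
        }
    }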



Re: Large number of errored transactions and periodic node disconnects on 3.4.6

Posted by Flavio Junqueira <fp...@yahoo.com.INVALID>.
Do you have anything running along with the ZK servers on the same machine? Is that a virtualized environment? I'm wondering if a concurrent workload is causing servers to lag behind and drop the ensemble. Also, how many disk devices are you using? Is there a dedicated device for the txn log?

-Flavio


Re: Large number of errored transactions and periodic node disconnects on 3.4.6

Posted by "R. Tyler Croy" <ty...@monkeypox.org>.
(replies inline)

On Wed, 14 Jan 2015, Flavio Junqueira wrote:

> I'm not sure what could be causing the communication between servers to fail.
> Are you sure this is a ZK problem? Could it be a problem with your setup?


To be honest, we're not certain of much at this point. We've run some network
stress tests and haven't seen any disruptions at layers 1-3 of the network.
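
To be concrete, the most basic version of that kind of check is just a port
probe between the peers on the quorum and election ports; a minimal sketch
(the hostnames and the conventional 2888/3888 ports below are placeholders for
whatever is actually in zoo.cfg):

    # Illustrative only: probe the quorum (2888) and election (3888) ports
    # toward every peer listed in zoo.cfg, from each server in turn.
    for host in zk1 zk2 zk3 zk4 zk5; do
        nc -vz "$host" 2888
        nc -vz "$host" 3888
    done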

I'm not aware of any Zookeeper server-side setting that would result in the
behavior we're seeing, but I'm happy to try out some configuration changes if
somebody believes they would either help alleviate the problem or help pin it
down more precisely.
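
To make that offer concrete, here's an illustrative zoo.cfg excerpt with the
quorum timing knobs I'd expect to matter (values and hostnames are examples,
not our production config); if I'm reading the 3.4 code right, the leader's
read timeout on a learner connection works out to roughly tickTime * syncLimit,
which would line up with the LearnerHandler "Read timed out" errors if a
follower ever stalls longer than that:

    # Illustrative zoo.cfg excerpt -- example values, not our production config
    tickTime=2000       # base time unit in milliseconds
    initLimit=10        # ticks a follower may take to connect and sync with the leader
    syncLimit=5         # ticks a follower may lag behind the leader; the leader's
                        # read timeout per learner is roughly tickTime * syncLimit
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888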




>      On Wednesday, January 14, 2015 4:33 PM, R. Tyler Croy <ty...@monkeypox.org> wrote:
> 
> [..snip..]

Re: Large number of errored transactions and periodic node disconnects on 3.4.6

Posted by Flavio Junqueira <fp...@yahoo.com.INVALID>.
Hi there,
I'm not sure what could be causing the communication between servers to fail. Are you sure this is a ZK problem? Could it be a problem with your setup?
-Flavio 

     On Wednesday, January 14, 2015 4:33 PM, R. Tyler Croy <ty...@monkeypox.org> wrote:

[..snip..]