You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@zookeeper.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2010/09/28 01:42:33 UTC

[jira] Created: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

QuorumCnxManager$SendWorker grows without bounds
------------------------------------------------

                 Key: ZOOKEEPER-880
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
             Project: Zookeeper
          Issue Type: Bug
    Affects Versions: 3.2.2
            Reporter: Jean-Daniel Cryans
         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack

We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:

{noformat}
tickTime=3000
dataDir=/somewhere_thats_not_tmp
clientPort=2181
initLimit=10
syncLimit=5
server.0=sv4borg9:2888:3888
server.1=sv4borg10:2888:3888
server.2=sv4borg11:2888:3888
server.3=sv4borg12:2888:3888
server.4=sv4borg13:2888:3888
{noformat}

The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Vishal K (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917685#action_12917685 ] 

Vishal K commented on ZOOKEEPER-880:
------------------------------------

While debugging for https://issues.apache.org/jira/browse/ZOOKEEPER-822 I found that senderWorkerMap would not have an entry for a server, but there will be a RecvWorker and SendWorker thread running for the server. In my case, this was seen when the leader died (i.e., during leader election). However, I think this can happen when a peer disconnects from another peer. The cause was incorrect handling of add/remove of entries from senderWorkerMap, which is exposed due to race conditions in QuorumCnxManager. There is a patch available for ZOOKEEPER-822.

I am not sure if the ZOOKEEPER-822 is causing trouble here as well. I just wanted to point out the possibility.


> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915838#action_12915838 ] 

Patrick Hunt commented on ZOOKEEPER-880:
----------------------------------------

JD tried moving the datadirectory to another disk (new datadir), that didn't help, same problem. Also note: the snapshot file is ~2mb in size.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916546#action_12916546 ] 

Patrick Hunt commented on ZOOKEEPER-880:
----------------------------------------

This issue sounds very similar to ZOOKEEPER-883 -- they are also using Nagios to monitor the election port in that case as well.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915553#action_12915553 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
----------------------------------------------

Also we checked dmesg, syslog, df, ifconfig, ethtool, and ifstat for anything unusual and nothing obvious comes out.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated ZOOKEEPER-880:
-----------------------------------------

    Attachment: TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz

Here's a new run, at TRACE-level, starting from a fresh log and a cleaned up dataDir on a different disk.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Flavio Junqueira (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915625#action_12915625 ] 

Flavio Junqueira commented on ZOOKEEPER-880:
--------------------------------------------

J-D, Has it happened just once or it is reproducible? Does it also happen with 3.3?

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915832#action_12915832 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
----------------------------------------------

bq. to be overly clear - this is happening on just 1 server, the other servers on the cluster are not seeing this, is that right?

Yes, sv4borg9.

bq. any insight on GC and JVM activity. Are there significant pauses on the GC, or perhaps swapping of that jvm? How active is the JVM? How active (cpu) are the other processes on this host? You mentioned they are using 50% disk, what about cpu?

No swapping, GC activity is normal as far as I can tell by the GC log, 1 active CPU for that process according to top (the rest of the cpus are idle most of the time).  

bq. If I understood correctly the JVM hosting the ZK server is hosting other code as well, is that right? You mentioned something about hbase managing the ZK server, could you elaborate on that as well?

That machine is also the Namenode, JobTracker and HBase master (all in their own JVMs). The only thing special is that the quorum peers are started by HBase.

bq. Is there a way you could move the ZK datadir on that host to an unused spindle and see if that helps at all?

I'll look into that.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915634#action_12915634 ] 

Patrick Hunt commented on ZOOKEEPER-880:
----------------------------------------

bq. one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads 

to be overly clear - this is happening on just 1 server, the other servers on the cluster are not seeing this, is that right?

any insight on GC and JVM activity. Are there significant pauses on the GC, or perhaps swapping of that jvm?  How active is the JVM? How active (cpu) are the other processes on this host? You mentioned they are using 50% disk, what about cpu? 

Is there a way you could move the ZK datadir on that host to an unused spindle and see if that helps at all?

If I understood correctly the JVM hosting the ZK server is hosting other code as well, is that right? You mentioned something about hbase managing the ZK server, could you elaborate on that as well?



> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915849#action_12915849 ] 

Benjamin Reed commented on ZOOKEEPER-880:
-----------------------------------------

is there an easy way to reproduce this?

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916544#action_12916544 ] 

Patrick Hunt commented on ZOOKEEPER-880:
----------------------------------------

I tried reproducing this with 5 servers on my laptop and using check_tcp that I compiled. I can't get it to happen. However I do notice a large number of connections (from nagios to election port) in the SYN_RECV state (some in close_wait as well).

JD - can you turn off nagios, restart the effected server (clear the log4j logs as well), then look to see if the problem is still occuring? update this jira with the logs from the "bad" server and one other (at least) would be helpful. Thanks!

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915627#action_12915627 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
----------------------------------------------

It's currently happening (currently at 344 threads and growing, was at 76 when I created this jira), it's still there when I bounce the peer, and it does happen on 3.3.1 too.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915850#action_12915850 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
----------------------------------------------

bq. is there an easy way to reproduce this?

Unfortunately none I can see... we have 5 clusters that use the same hardware and ZK configurations and we only find this issue on this cluster, on this specific node, although all the other nodes of that cluster have the same InterruptedExceptions (but aren't leaking SendWorkers).

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack, TRACE-hbase-hadoop-zookeeper-sv4borg9.log.gz
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated ZOOKEEPER-880:
-----------------------------------------

    Attachment: hbase-hadoop-zookeeper-sv4borg9.log.gz
                jstack
                hbase-hadoop-zookeeper-sv4borg12.log.gz

Attaching the logs of the problematic server (sv4borg9) that I restarted this afternoon and the logs from one of the other servers from that point. Also attaching the jstack of first server.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-880) QuorumCnxManager$SendWorker grows without bounds

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915559#action_12915559 ] 

Jean-Daniel Cryans commented on ZOOKEEPER-880:
----------------------------------------------

Oh but it looks like the process on sv4borg9 is generating a lot of IO (50% on one CPU on average according to top). Seems to be mostly writes.

> QuorumCnxManager$SendWorker grows without bounds
> ------------------------------------------------
>
>                 Key: ZOOKEEPER-880
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-880
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>            Reporter: Jean-Daniel Cryans
>         Attachments: hbase-hadoop-zookeeper-sv4borg12.log.gz, hbase-hadoop-zookeeper-sv4borg9.log.gz, jstack
>
>
> We're seeing an issue where one server in the ensemble has a steady growing number of QuorumCnxManager$SendWorker threads up to a point where the OS runs out of native threads, and at the same time we see a lot of exceptions in the logs.  This is on 3.2.2 and our config looks like:
> {noformat}
> tickTime=3000
> dataDir=/somewhere_thats_not_tmp
> clientPort=2181
> initLimit=10
> syncLimit=5
> server.0=sv4borg9:2888:3888
> server.1=sv4borg10:2888:3888
> server.2=sv4borg11:2888:3888
> server.3=sv4borg12:2888:3888
> server.4=sv4borg13:2888:3888
> {noformat}
> The issue is on the first server. I'm going to attach threads dumps and logs in moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.