You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@zookeeper.apache.org by "Qian Ye (JIRA)" <ji...@apache.org> on 2010/02/02 03:58:20 UTC

[jira] Created: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Too many CLOSE_WAIT socket state on a server
--------------------------------------------

                 Key: ZOOKEEPER-662
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
             Project: Zookeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.2.1
         Environment: Linux 2.6.9
            Reporter: Qian Ye
             Fix For: 3.3.0


I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg

======
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=5
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=2
# the directory where the snapshot is stored.
dataDir=./data/
# the port at which the clients will connect
clientPort=8181

# zookeeper cluster list
server.100=10.23.253.43:8887:8888
server.101=10.23.150.29:8887:8888
server.102=10.23.247.141:8887:8888
server.200=10.65.20.68:8887:8888
server.201=10.65.27.21:8887:8888
=====

Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Qian Ye (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838157#action_12838157 ] 

Qian Ye commented on ZOOKEEPER-662:
-----------------------------------

Thx Patrick, this situation might be consequence of a network switch adjustment. The effect of the adjustment, as I know, is that two Zookeeper servers lost connection to the other three Zookeeper servers. This connection loss last about several minutes. I have tried to reproduce it, but haven't succeeded yet. I would put an eye on this issue, and let you know if I got any more information about this. Thank you.

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Qian Ye (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828916#action_12828916 ] 

Qian Ye commented on ZOOKEEPER-662:
-----------------------------------

I'm using the c client, and there is also a monitor process using "echo stat | nc zookeeper 8181" every 20 seconds to get the status of the servers.  If the monitor process failed to get a valid reply, it would send a sms alarm to my cell phone. When the problem happened, I received such an alarm. It said connection refused.  I haven't found the backlog for the client port in the source code. If it used the default value 128, then so many CLOSE_WAIT states would prevent the kernel from accepting new connection, right?

P.S. I cannot tell why the client keep reconnect with the same error, I will take a look at it and append more information if I can find something.


> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829164#action_12829164 ] 

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

Please look into why your c client 10.81.14.8  is failing, this might be useful information for me wrt reproducing the issue you are seeing. Can you provide some detail on the environment the c client is running? (os, 32vs64, etc...) and what the issue is that it is seeing? What is that client attempting to do? etc....

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829162#action_12829162 ] 

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

Qian, if you look at the logs you can see both of these clients, the client I mentioned in earlier comment, also the "stat" client:

2010-02-01 06:24:49,783 - INFO  [NIOServerCxn.Factory:8181:NIOServerCnxn@698] - Processing stat command from /10.65.7.48:48413
2010-02-01 06:24:49,783 - WARN  [NIOServerCxn.Factory:8181:NIOServerCnxn@494] - Exception causing close of session 0x0 due to java.io.IOException: Responded to info probe

(really the second line should not be a warn, this is improved in 3.3.0 codebase).

>From the logs I don't see anything to indicate a problem though. I'm wondering if there is some timing problem in either our c or java networking code (also you are using linux 2.6.9 which is older kernel, I'm wondering if perhaps the timing our app sees is different).

One thing about the 4 letter words (like stat). In some cases I've seen the response from the 4letter word be truncated. Perhaps this caused your monitoring app to fail? You might add some diags to your monitor app to debug this sort of thing.

What I mean is, you request a "stat" and the client sees some of the response, but not all of the response. I'm not sure why this is, but
it may have something to do with either the way nc works (I always use nc for this) or the way the server works - in the sense that
the server pushes the response text onto the wire and then closes the connection. Perhaps in some cases the socket close causes the client
to not see all the response? Is that possible in tcp close?


> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Qian Ye (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837696#action_12837696 ] 

Qian Ye commented on ZOOKEEPER-662:
-----------------------------------

This has not happened again yet. Is there anything we can do to find the reason? When this kind of thing occurred, it really put our system in risk.

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828806#action_12828806 ] 

Mahadev konar commented on ZOOKEEPER-662:
-----------------------------------------

Qian,
 even if there are 120 CLOSE_WAIT sockets, the number of file descriptors limit is usually much bigger than this. You should not be having a problem with clients trying to connect to the server. How do you know the server.200 wasnt accepting client connections?

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836065#action_12836065 ] 

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

Ah, looking back at the code I see we do have a linger timeout of 2 seconds on those sockets... however it shouldn't result in so many waiting sockets (unless you are trying a large number of connections per second, which it doesn't seem to be the case here)

Have you seen this happen again? Or just that one time?

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Qian Ye (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qian Ye updated ZOOKEEPER-662:
------------------------------

    Attachment: zookeeper.log.2010020106
                zookeeper.log.2010020105

related log to this issue

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Qian Ye (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829920#action_12829920 ] 

Qian Ye commented on ZOOKEEPER-662:
-----------------------------------

Hi Patrick, the c clients all run in a Linux environment, the kernels are 2.6.9. Some of the servers are 32 bit machines and some of them are 64 bits. It seems that the client on the server 10.81.14.81 has some problem, which caused the client to fail frequently. Because there is a monitor app which can restart the c client when it failed, the client on 10.81.14.81 keep restarting and connecting to the zookeeper servers frequently.  

You mentioned that some of the response for request "stat" didn't reach the client, it looks like the behaviors of TCP connection with SO_LINER option on. In this kind of situation, the server only put the response on the wire and close, however, the response package may be discarded, and the TCP/IP stack wouldn't re-send the response. Is it the scenario we met here?

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828743#action_12828743 ] 

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

Qian, in looking at your log I see the following. 

2010-02-01 05:40:54,488 - INFO  [NIOServerCxn.Factory:8181:NIOServerCnxn@583] - Connected to /10.81.14.81:16629 lastZxid 0
2010-02-01 05:40:54,488 - INFO  [NIOServerCxn.Factory:8181:NIOServerCnxn@615] - Creating new session 0xc8265060cea96b73
2010-02-01 05:40:54,489 - INFO  [CommitProcessor:0:NIOServerCnxn@964] - Finished init of 0xc8265060cea96b73 valid:true
2010-02-01 05:40:59,627 - WARN  [NIOServerCxn.Factory:8181:NIOServerCnxn@494] - Exception causing close of session 0xc8265060cea96b73 due to java.io.IOException: Read error

can you tell if this client is crashing or having some issue? I see it reconnect very 15seconds or so with this same error.

I don't see what you described though - this client is stil able (just before your restart) to establish a session (create a connection). Were
you able to tell which client (was there one in particular?) where the CLOSE_WAIT was happening? Could you tell if it was the
ZK client or the 4letterword client? (do you have the netstat available?)


> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837999#action_12837999 ] 

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

If you do capture netstat capture it a few times with gaps between. Similar to looking for a thread issue by taking a few spaced thread dumps, in netstat we can see if the state is changing at all.


> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829171#action_12829171 ] 

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

According to this page http://www.freesoft.org/CIE/Course/Section4/11.htm closing the conn should be fine (which is expected). The weird thing
though is that I've definitely seen this issue were stat returns some of the data, but the results seem to be truncated. I wonder if it's more an issue
with nc then... thoughts?

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828674#action_12828674 ] 

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

What client type are you using? Java or C? both?
Are you using any 4letter words? Or tools that use 4letter words?


> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (ZOOKEEPER-662) Too many CLOSE_WAIT socket state on a server

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837994#action_12837994 ] 

Patrick Hunt commented on ZOOKEEPER-662:
----------------------------------------

I agree, this is a potentially serious issue. Unfortunately though, based on the information we have I don't see how I can provide more insight. Also take into account that we have many users in similar situation, however this is the first we've heard of this type of issue, ever. (not that that diminishes your issue) So I just don't have that much to go on.

I would suggest that you check your monitoring script and ensure it handles all error cases, such as failing to connect to the server, or getting a partial response due to things like the linger issue.

Also ensure that you can capture the server/client logs if this does happen again. If it does happen capture the full/detailed netstat (netstat -a I guess) so that we can get detailed information.

You might also make sure to save the transactional logs if this happens again. Not the log4j logs, but the transaction logs that are kept in the datadir. Those can actually be scanned and we can see what was going on (changes to znodes as well as session info).

Can you think of anything else that would help here? Have you been able to reproduce the problem? Have you tried reproducing it and can't? That's all I can think of currently.

> Too many CLOSE_WAIT socket state on a server
> --------------------------------------------
>
>                 Key: ZOOKEEPER-662
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-662
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.1
>         Environment: Linux 2.6.9
>            Reporter: Qian Ye
>             Fix For: 3.3.0
>
>         Attachments: zookeeper.log.2010020105, zookeeper.log.2010020106
>
>
> I have a zookeeper cluster with 5 servers, zookeeper version 3.2.1, here is the content in the configure file, zoo.cfg
> ======
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial 
> # synchronization phase can take
> initLimit=5
> # The number of ticks that can pass between 
> # sending a request and getting an acknowledgement
> syncLimit=2
> # the directory where the snapshot is stored.
> dataDir=./data/
> # the port at which the clients will connect
> clientPort=8181
> # zookeeper cluster list
> server.100=10.23.253.43:8887:8888
> server.101=10.23.150.29:8887:8888
> server.102=10.23.247.141:8887:8888
> server.200=10.65.20.68:8887:8888
> server.201=10.65.27.21:8887:8888
> =====
> Before the problem happened, the server.200 was the leader. Yesterday morning, I found the there were many sockets with the state of CLOSE_WAIT on the clientPort (8181),  the total was over about 120. Because of these CLOSE_WAIT, the server.200 could not accept more connections from the clients. The only thing I can do under this situation is restart the server.200, at about 2010-02-01 06:06:35. The related log is attached to the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.