Posted to dev@zookeeper.apache.org by "guoping.gp (JIRA)" <ji...@apache.org> on 2016/10/14 05:42:20 UTC
[jira] [Comment Edited] (ZOOKEEPER-2615) Zookeeper server holds onto dead/expired session ids in the watch data structures
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574296#comment-15574296 ]
guoping.gp edited comment on ZOOKEEPER-2615 at 10/14/16 5:41 AM:
-----------------------------------------------------------------
REPRODUCE TEST CASE STEP BY STEP
====================================
1). start a ZooKeeper cluster, say server1(localhost:6181), server2(localhost:4181), server3(localhost:2181);
let's assume server2 is currently the LEADER.
2). use zkCli.sh to create a persistent node /curator_exists_watch
3). use Curator Framework 2.9.1 to write a simple Java client application, as follows:
*****
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.api.CuratorWatcher;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.WatchedEvent;

public class ExistingWatchDemo {

    public static void main(String[] args) throws Exception {
        CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder();
        builder = builder.connectString("localhost:2181,localhost:6181")
                .sessionTimeoutMs(15000)
                .connectionTimeoutMs(10000)
                .retryPolicy(new ExponentialBackoffRetry(5000, Integer.MAX_VALUE));

        final CuratorFramework zkClient = builder.build();
        zkClient.start();

        // Re-register the exists watch each time it fires, since watches are one-shot.
        CuratorWatcher watcher = new CuratorWatcher() {
            public void process(WatchedEvent event) throws Exception {
                System.out.println("received event " + event);
                System.out.println("watch again ...");
                zkClient.checkExists().usingWatcher(this).forPath("/curator_exists_watch");
            }
        };
        zkClient.checkExists().usingWatcher(watcher).forPath("/curator_exists_watch");

        Thread.sleep(600 * 1000);
    }
}
*****
To make the reproduction more convenient, we put only 2 of the 3 servers in the connection string: server1 (FOLLOWING) and server3 (FOLLOWING).
4). set breakpoints at 2 places:
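The watcher in the demo re-registers itself inside process() because ZooKeeper watches are one-shot: once a watch fires, the server forgets it. A minimal stand-alone model of that semantic (a toy illustration, not ZooKeeper code):

```java
import java.util.*;
import java.util.function.Consumer;

// Toy model of one-shot watch semantics: firing an event removes the
// watchers for that path, so a client wanting continuous notifications
// must re-register from inside its own callback, as the demo above does.
public class OneShotWatchModel {
    static Map<String, List<Consumer<String>>> watches = new HashMap<>();

    static void watch(String path, Consumer<String> watcher) {
        watches.computeIfAbsent(path, k -> new ArrayList<>()).add(watcher);
    }

    static void fire(String path, String event) {
        // One-shot: the watcher list is removed before delivery.
        List<Consumer<String>> current = watches.remove(path);
        if (current != null) current.forEach(w -> w.accept(event));
    }

    public static void main(String[] args) {
        @SuppressWarnings("unchecked")
        Consumer<String>[] self = new Consumer[1];
        self[0] = event -> {
            System.out.println("received " + event + ", watch again ...");
            watch("/curator_exists_watch", self[0]); // re-register, like the Curator demo
        };
        watch("/curator_exists_watch", self[0]);
        fire("/curator_exists_watch", "NodeCreated");
        fire("/curator_exists_watch", "NodeDeleted"); // still delivered, thanks to re-registration
    }
}
```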
FinalRequestProcessor.java: line 303 : zks.getZKDatabase().setWatches(relativeZxid,
NIOServerCnxn.java: line 966: synchronized(factory.cnxns){
5). run the client app ExistingWatchDemo; let's say the client connects to server3 (2181)
6). use the IDE to open a remote debug session to the other server in the connection string, here server1 (6181)
7). now kill -STOP server3 (2181); the client app will find the server hung and reconnect to server1 (6181). We can confirm
the reconnect in the client's log:
---------
2016-10-14 13:31:21,505 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread@1096] - Client session timed out, have not heard from server in 3754ms for sessionid 0x357c1a4462b000e, closing socket connection and attempting reconnect
2016-10-14 13:31:22,389 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:6181. Will not attempt to authenticate using SASL (unknown error)
2016-10-14 13:31:22,392 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@852] - Socket connection established to localhost/0:0:0:0:0:0:0:1:6181, initiating session
2016-10-14 13:31:22,392 [myid:] - DEBUG [main-SendThread(localhost:6181):ClientCnxn$SendThread@892] - Session establishment request sent on localhost/0:0:0:0:0:0:0:1:6181
2016-10-14 13:31:22,407 [myid:] - INFO [main-SendThread(localhost:6181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:6181, sessionid = 0x357c1a4462b000e, negotiated timeout = 15000
---------
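A side note on reading the log: in ZooKeeper 3.4 the high byte of a session id is the myid of the server that originally created the session (see SessionTrackerImpl.initializeNextSession), which is why 0x357c1a4462b000e starts with 3 — it was created via server3, the server the client first connected to. A quick check:

```java
// Extract the originating server's myid from a ZooKeeper 3.4 session id.
// Assumption: the 3.4 session id layout (myid in the top 8 bits), per
// SessionTrackerImpl.initializeNextSession.
public class SessionIdOrigin {
    static long serverId(long sessionId) {
        return sessionId >>> 56; // high 8 bits = myid of the creating server
    }

    public static void main(String[] args) {
        long sid = 0x357c1a4462b000eL; // session id from the log above
        System.out.println("created by server with myid=" + serverId(sid)); // prints 3
    }
}
```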
the reconnect will hit the breakpoint at FinalRequestProcessor.java line 303, as follows:
Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in FinalRequestProcessor))
FinalRequestProcessor.processRequest(Request) line: 303
CommitProcessor.run() line: 74
8). now kill the client app's process with kill -9 <pid of client app>; this makes the other breakpoint hit, as follows:
Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended (breakpoint at line 966 in NIOServerCnxn))
NIOServerCnxn.close() line: 966
NIOServerCnxn.doIO(SelectionKey) line: 360
NIOServerCnxnFactory.run() line: 208
Thread.run() line: 745
9). now resume from the breakpoint (NIOServerCnxn:966) hit in step 8; after that, resume from the breakpoint (FinalRequestProcessor:303) hit in step 7
10). now use wchp and cons to check for the leaked watch, as shown below:
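The resume order in step 9 is the heart of the bug: the connection close (which removes the session's watches) completes before the suspended setWatches() call (which re-registers them), so the re-registered watches belong to a session that is already dead. A toy model of that interleaving, with simplified stand-ins for the real server structures:

```java
import java.util.*;

// Toy model of the race (not the real server code): close() runs first and
// removes the session's watches, then the suspended setWatches() from the
// reconnect re-adds them for the now-dead session, leaving a leaked watch.
public class WatchLeakModel {
    // path -> session ids watching it (simplified stand-in for the watch manager)
    static Map<String, Set<Long>> watchTable = new HashMap<>();
    static Set<Long> liveSessions = new HashSet<>();

    // Models FinalRequestProcessor's watch re-registration on reconnect; in
    // the buggy interleaving nothing re-checks that the session is still live.
    static void setWatches(long sessionId, List<String> paths) {
        for (String p : paths)
            watchTable.computeIfAbsent(p, k -> new HashSet<>()).add(sessionId);
    }

    // Models NIOServerCnxn.close(): removes only the watches present *now*.
    static void close(long sessionId) {
        liveSessions.remove(sessionId);
        for (Set<Long> watchers : watchTable.values()) watchers.remove(sessionId);
    }

    public static void main(String[] args) {
        long sid = 0x357c1a4462b000eL;
        liveSessions.add(sid);
        close(sid);                                              // step 9a: resume NIOServerCnxn:966
        setWatches(sid, Arrays.asList("/curator_exists_watch")); // step 9b: resume FinalRequestProcessor:303
        System.out.println("dead session still watching: "
                + watchTable.get("/curator_exists_watch").contains(sid)); // the leak
    }
}
```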
pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181
/curator_exists_watch
0x357c1a4462b000e
pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181
/0:0:0:0:0:0:0:1:61702[0](queued=0,recved=1,sent=0)
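The leak shows up exactly as above: wchp reports a session holding a watch, while cons shows no live connection with that sid. A hedged sketch of automating that cross-check, assuming the four-letter output formats shown above (they may vary across versions):

```java
import java.util.*;
import java.util.regex.*;

// Sketch: find session ids that hold watches (from wchp output) but have no
// live connection (no sid=0x... entry in cons output). The parsing patterns
// are based on the sample output above, not on a documented format.
public class LeakedWatchFinder {
    static Set<Long> sessionsInWchp(String wchp) {
        Set<Long> out = new HashSet<>();
        Matcher m = Pattern.compile("0x([0-9a-fA-F]+)").matcher(wchp);
        while (m.find()) out.add(Long.parseUnsignedLong(m.group(1), 16));
        return out;
    }

    static Set<Long> sessionsInCons(String cons) {
        Set<Long> out = new HashSet<>();
        Matcher m = Pattern.compile("sid=0x([0-9a-fA-F]+)").matcher(cons);
        while (m.find()) out.add(Long.parseUnsignedLong(m.group(1), 16));
        return out;
    }

    public static void main(String[] args) {
        // Sample output captured in step 10 above.
        String wchp = "/curator_exists_watch\n\t0x357c1a4462b000e\n";
        String cons = "/0:0:0:0:0:0:0:1:61702[0](queued=0,recved=1,sent=0)\n";
        Set<Long> leaked = sessionsInWchp(wchp);
        leaked.removeAll(sessionsInCons(cons));
        System.out.println("leaked watch sessions: " + leaked);
    }
}
```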
> Zookeeper server holds onto dead/expired session ids in the watch data structures
> ---------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2615
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2615
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: guoping.gp
>
> The same issue (https://issues.apache.org/jira/browse/ZOOKEEPER-1382) can still be reproduced even with ZooKeeper 3.4.6.
> This issue caused our production ZooKeeper cluster to leak about 1 million watches; after restarting the servers one by one, the watch count decreased to only about 40,000.
> I can reproduce the issue on my Mac; here it is:
> ------------------------------------------------------------------------
> pguodeMacBook-Air:bin pguo$ echo srvr | nc localhost 6181
> Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
> Latency min/avg/max: 0/1156/128513
> Received: 539
> Sent: 531
> Connections: 1
> Outstanding: 0
> Zxid: 0x100000037
> Mode: follower
> Node count: 5
> ------------------------
> pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181
> /127.0.0.1:55759[1](queued=0,recved=5,sent=5,sid=0x157be2732d0000e,lop=PING,est=1476372631116,to=15000,lcxid=0x1,lzxid=0xffffffffffffffff,lresp=1476372646260,llat=8,minlat=0,avglat=6,maxlat=17)
> /0:0:0:0:0:0:0:1:55767[0](queued=0,recved=1,sent=0)
> ------------------------
> pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181
> /curator_exists_watch
> 0x357be48e4d90007
> 0x357be48e4d90009
> 0x157be2732d0000e
> As the four-letter command reports above show, 0x357be48e4d90007 and 0x357be48e4d90009 were leaked after those two sessions expired
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)