Posted to dev@zookeeper.apache.org by "guoping.gp (JIRA)" <ji...@apache.org> on 2016/10/14 05:42:20 UTC

[jira] [Comment Edited] (ZOOKEEPER-2615) Zookeeper server holds onto dead/expired session ids in the watch data structures

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574296#comment-15574296 ] 

guoping.gp edited comment on ZOOKEEPER-2615 at 10/14/16 5:41 AM:
-----------------------------------------------------------------

REPRODUCTION TEST CASE, STEP BY STEP
====================================
1).  Start a ZooKeeper cluster, say server1 (localhost:6181), server2 (localhost:4181), and server3 (localhost:2181); a sample config sketch follows.
    Let's assume server2 is currently the LEADER.
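
For reference, here is what server3's conf/zoo.cfg might look like. Only the client port comes from this test case; the data directory and the quorum/election ports are illustrative assumptions.

*****

# Hypothetical conf/zoo.cfg for server3 (clientPort 2181).
# dataDir and the quorum/election ports below are assumptions;
# each server's dataDir also needs a myid file with its server number.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper/server3
clientPort=2181
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

*****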

2). Use zkCli.sh to create a persistent node /curator_exists_watch (e.g. create /curator_exists_watch "").

3). Use Curator Framework 2.9.1 to write a simple Java client application, as follows:

*****

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.api.CuratorWatcher;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.WatchedEvent;

public class ExistingWatchDemo {

    public static void main(String[] args) throws Exception {

        CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder()
                .connectString("localhost:2181,localhost:6181")
                .sessionTimeoutMs(15000)
                .connectionTimeoutMs(10000)
                .retryPolicy(new ExponentialBackoffRetry(5000, Integer.MAX_VALUE));

        final CuratorFramework zkClient = builder.build();
        zkClient.start();

        CuratorWatcher watcher = new CuratorWatcher() {
            public void process(WatchedEvent event) throws Exception {
                System.out.println("received event " + event);
                System.out.println("watch again ...");
                // ZooKeeper watches are one-shot, so re-register on every event.
                zkClient.checkExists().usingWatcher(this).forPath("/curator_exists_watch");
            }
        };

        // Set an exists watch on the node created in step 2.
        zkClient.checkExists().usingWatcher(watcher).forPath("/curator_exists_watch");

        // Keep the process alive so the session and its watch stay registered.
        Thread.sleep(600 * 1000);
    }
}

*****

To make this easier to reproduce, we put only two of the three servers in the connection string: server1 (FOLLOWING) and server3 (FOLLOWING).

4).  Set breakpoints at two places (sketches of the surrounding server code follow below):
 
    FinalRequestProcessor.java, line 303:      zks.getZKDatabase().setWatches(relativeZxid,
    NIOServerCnxn.java, line 966:              synchronized(factory.cnxns){

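For orientation, the first breakpoint sits in the OpCode.setWatches handling of FinalRequestProcessor, which re-registers a reconnecting client's watches on the new server and keys them to the new connection. Roughly (paraphrased from the 3.4.6 source, details elided, not verbatim):

*****

// Paraphrased from FinalRequestProcessor.processRequest() in 3.4.6:
case OpCode.setWatches: {
    SetWatches setWatches = new SetWatches();
    ByteBufferInputStream.byteBuffer2Record(request.request, setWatches);
    long relativeZxid = setWatches.getRelativeZxid();
    zks.getZKDatabase().setWatches(relativeZxid,    // <-- breakpoint, line ~303
            setWatches.getDataWatches(),
            setWatches.getExistWatches(),
            setWatches.getChildWatches(),
            cnxn);                                  // watches keyed to this connection
    break;
}

*****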

5).  Run the client app ExistingWatchDemo.java; let's say the client connects to server3 (2181).

6). From the IDE, start a remote debug session against the other server in the connection string, here server1 (6181).

7). Now kill -STOP the server3 (2181) process (SIGSTOP freezes it without closing its sockets). The client app will find the server unresponsive and reconnect to server1 (6181); we can confirm the reconnect in the client's log:

---------
2016-10-14 13:31:21,505 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1096] - Client session timed out, have not heard from server in 3754ms for sessionid 0x357c1a4462b000e, closing socket connection and attempting reconnect
2016-10-14 13:31:22,389 [myid:] - INFO  [main-SendThread(localhost:6181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:6181. Will not attempt to authenticate using SASL (unknown error)
2016-10-14 13:31:22,392 [myid:] - INFO  [main-SendThread(localhost:6181):ClientCnxn$SendThread@852] - Socket connection established to localhost/0:0:0:0:0:0:0:1:6181, initiating session
2016-10-14 13:31:22,392 [myid:] - DEBUG [main-SendThread(localhost:6181):ClientCnxn$SendThread@892] - Session establishment request sent on localhost/0:0:0:0:0:0:0:1:6181
2016-10-14 13:31:22,407 [myid:] - INFO  [main-SendThread(localhost:6181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:6181, sessionid = 0x357c1a4462b000e, negotiated timeout = 15000
---------

The reconnect will hit the breakpoint at FinalRequestProcessor.java line 303, as follows:

Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in FinalRequestProcessor))	
	FinalRequestProcessor.processRequest(Request) line: 303	
	CommitProcessor.run() line: 74	

8). Now kill the client app's process (kill -9 <pid of client app>); this makes the other breakpoint hit, as follows:
 
Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended (breakpoint at line 966 in NIOServerCnxn))	
	NIOServerCnxn.close() line: 966	
	NIOServerCnxn.doIO(SelectionKey) line: 360	
	NIOServerCnxnFactory.run() line: 208	
	Thread.run() line: 745	
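
The second breakpoint is inside NIOServerCnxn.close(). The point that matters here is that close() removes the connection and, through removeCnxn(), drops that connection's watches from the server's watch tables. Roughly (paraphrased from the 3.4.6 source, details elided, not verbatim):

*****

// Paraphrased from NIOServerCnxn.close() in 3.4.6:
public void close() {
    synchronized (factory.cnxns) {      // <-- breakpoint, line ~966
        if (!factory.cnxns.remove(this)) {
            return;                     // already closed
        }
        if (zkServer != null) {
            zkServer.removeCnxn(this);  // drops this connection's watches
        }
        closeSock();
    }
}

*****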

9).  Now resume the breakpoint (NIOServerCnxn:966) from step 8 first; after that, resume the breakpoint (FinalRequestProcessor:303) from step 7. In this order, close() removes the dead connection and its watches first, and the suspended setWatches then re-registers a watch for the already-closed connection; a minimal model of this interleaving follows.
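
To make the interleaving concrete, below is a minimal, self-contained Java model of the race. The map and method names are hypothetical stand-ins for the server's watch table, the setWatches handling, and close()/removeCnxn(); it is a sketch of the interleaving, not ZooKeeper code.

*****

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical model of the race between setWatches and close();
// the map stands in for the server's path -> watcher-session table.
public class WatchLeakRace {

    static final Map<String, Set<String>> watchTable = new ConcurrentHashMap<>();

    // Plays the role of FinalRequestProcessor handling setWatches.
    static void setWatches(String path, String sessionId) {
        watchTable.computeIfAbsent(path, p -> ConcurrentHashMap.newKeySet())
                  .add(sessionId);
    }

    // Plays the role of NIOServerCnxn.close() -> removeCnxn().
    static void closeConnection(String sessionId) {
        watchTable.values().forEach(watchers -> watchers.remove(sessionId));
    }

    public static void main(String[] args) throws Exception {
        String sessionId = "0x357c1a4462b000e";
        CountDownLatch closed = new CountDownLatch(1);

        // "CommitProcessor" thread: held at the breakpoint until close() has run.
        Thread commitProcessor = new Thread(() -> {
            try {
                closed.await();               // simulates resuming step 7's breakpoint last
            } catch (InterruptedException e) {
                return;
            }
            setWatches("/curator_exists_watch", sessionId);
        });

        // "NIO" thread: closes the dead client's connection first (step 9's order).
        Thread nioThread = new Thread(() -> {
            closeConnection(sessionId);
            closed.countDown();
        });

        commitProcessor.start();
        nioThread.start();
        commitProcessor.join();
        nioThread.join();

        // The watch for the dead session survives: this is the leak wchp shows.
        System.out.println("leaked watches: " + watchTable);
    }
}

*****

Running this prints the dead session id still registered under /curator_exists_watch, which is exactly what wchp reports in step 10.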

10). Now use the wchp and cons four-letter commands to check for the leaked watch. As the output below shows, wchp still lists session 0x357c1a4462b000e as watching /curator_exists_watch, while cons shows no connection for that session any more:

pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181
/curator_exists_watch
       	0x357c1a4462b000e

pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181
 /0:0:0:0:0:0:0:1:61702[0](queued=0,recved=1,sent=0)
 



> Zookeeper server holds onto dead/expired session ids in the watch data structures
> ---------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2615
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2615
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: guoping.gp
>
> The same issue (https://issues.apache.org/jira/browse/ZOOKEEPER-1382) can still be reproduced even with ZooKeeper 3.4.6.
> This issue caused our production ZooKeeper cluster to leak about 1 million watches; after restarting the servers one by one, the watch count decreased to only about 40,000.
> I can reproduce the issue on my Mac; here it is:
> ------------------------------------------------------------------------
> pguodeMacBook-Air:bin pguo$ echo srvr | nc localhost 6181
> Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
> Latency min/avg/max: 0/1156/128513
> Received: 539
> Sent: 531
> Connections: 1
> Outstanding: 0
> Zxid: 0x100000037
> Mode: follower
> Node count: 5
> ------------------------
> pguodeMacBook-Air:bin pguo$ echo cons | nc localhost 6181
>  /127.0.0.1:55759[1](queued=0,recved=5,sent=5,sid=0x157be2732d0000e,lop=PING,est=1476372631116,to=15000,lcxid=0x1,lzxid=0xffffffffffffffff,lresp=1476372646260,llat=8,minlat=0,avglat=6,maxlat=17)
>  /0:0:0:0:0:0:0:1:55767[0](queued=0,recved=1,sent=0)
> ------------------------
> pguodeMacBook-Air:bin pguo$ echo wchp | nc localhost 6181
> /curator_exists_watch
>        	0x357be48e4d90007
>        	0x357be48e4d90009
>        	0x157be2732d0000e
> As the four-letter-word reports above show, 0x357be48e4d90007 and 0x357be48e4d90009 are leaked: those two sessions have already expired, and only 0x157be2732d0000e matches a live connection in the cons output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)