Posted to user@curator.apache.org by Joe Naegele <jn...@grierforensics.com> on 2018/03/28 20:14:25 UTC

Unexpected session expired event

Good afternoon,

I’m experiencing an odd issue with ZooKeeper/Curator. I’m using ZooKeeper 3.5.2-alpha and Curator 4.0.0. My issue may be a ZooKeeper problem, but I want to understand whether Curator is involved, or how Curator might avoid the problem.

I’ve configured my client (framework) as follows:

CuratorFramework client = CuratorFrameworkFactory.builder()
      .connectString("127.0.0.1:2181")
      .sessionTimeoutMs(120000)
      .connectionTimeoutMs(30000)
      .retryPolicy(new ExponentialBackoffRetry(2000, 8))
      .build();
client.start();

My ZooKeeper setup is a single standalone server with a “default” configuration. Two client programs run on the same host and use a variety of recipes, including GroupMember, SharedCount, DistributedAtomicInteger, and, most importantly, an InterProcessReadWriteLock. One of the client programs performs hours of work while the other waits on the lock. While doing work, the program often reports the following message:

06:23:56.867 WARN  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 32147ms for sessionid 0x1092fb2fc3000a9

and ZooKeeper immediately afterward reports:

06:23:56,910 [myid:] - WARN  [NIOServerCnxn@365] - Unable to read additional data from client sessionid 0x1092fb2fc3000a9, likely client has closed socket
06:23:56,911 [myid:] - INFO  [MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=StandaloneServer_port2181,name1=Connections,name2=127.0.0.1,name3=0x1092fb2fc3000a9]
06:23:56,912 [myid:] - INFO  [NIOServerCnxn@607] - Closed socket connection for client /127.0.0.1:47312 which had sessionid 0x1092fb2fc3000a9
06:23:58,447 [myid:] - INFO  [NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:55396
06:23:58,448 [myid:] - INFO  [ZooKeeperServer@969] - Client attempting to renew session 0x1092fb2fc3000a9 at /127.0.0.1:55396
06:23:58,449 [myid:] - INFO  [ZooKeeperServer@678] - Established session 0x1092fb2fc3000a9 with negotiated timeout 40000 for client /127.0.0.1:55396
06:23:58,450 [myid:] - INFO  [PrepRequestProcessor@841] - Got user-level KeeperException when processing sessionid:0x1092fb2fc3000a9 type:create2 cxid:0x9fb zxid:0x4c602ea txntype:-1 reqpath:n/a Error Path:/xxx/AAAClient-5532fe53-f746-4f28-a09e-1dce6d908885 Error:KeeperErrorCode = NodeExists for /xxx/AAAClient-5532fe53-f746-4f28-a09e-1dce6d908885

So it looks like Curator correctly reestablishes a connection with the ZooKeeper server and the previously-established GroupMember recipe is kept intact.

But eventually the ZooKeeper server expires the session while Curator attempts to reconnect and renew it:

06:45:48,331 [myid:] - WARN  [NIOServerCnxn@365] - Unable to read additional data from client sessionid 0x1092fb2fc3000a9, likely client has closed socket
06:45:48,346 [myid:] - INFO  [MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=StandaloneServer_port2181,name1=Connections,name2=127.0.0.1,name3=0x1092fb2fc3000a9]
06:45:48,347 [myid:] - INFO  [NIOServerCnxn@607] - Closed socket connection for client /127.0.0.1:55396 which had sessionid 0x1092fb2fc3000a9
06:45:49,483 [myid:] - INFO  [ZooKeeperServer@384] - Expiring session 0x1092fb2fc3000a9, timeout of 40000ms exceeded
06:45:49,484 [myid:] - INFO  [PrepRequestProcessor@647] - Processed session termination for sessionid: 0x1092fb2fc3000a9
06:45:49,502 [myid:] - INFO  [PrepRequestProcessor@841] - Got user-level KeeperException when processing sessionid:0x1092fb2fc3000aa type:create cxid:0xf6 zxid:0x4c602ee txntype:-1 reqpath:n/a Error Path:/xxx/BBBClient-f3ac6d3b-2b4e-4992-abd3-49f9f6b77e3a Error:KeeperErrorCode = NodeExists for /xxx/BBBClient-f3ac6d3b-2b4e-4992-abd3-49f9f6b77e3a
06:45:49,839 [myid:] - INFO  [NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:37844
06:45:49,839 [myid:] - INFO  [ZooKeeperServer@969] - Client attempting to renew session 0x1092fb2fc3000a9 at /127.0.0.1:37844

The client program reports:

06:45:49.840 WARN  org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x1092fb2fc3000a9 has expired
06:45:49.840 WARN  org.apache.curator.ConnectionState - Session expired event received

And ZooKeeper reports:

06:45:49,840 [myid:] - INFO  [ZooKeeperServer@686] - Invalid session 0x1092fb2fc3000a9 for client /127.0.0.1:37844, probably expired
06:45:49,840 [myid:] - INFO  [NIOServerCnxn@607] - Closed socket connection for client /127.0.0.1:37844 which had sessionid 0x1092fb2fc3000a9
06:45:49,859 [myid:] - INFO  [NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:37862
06:45:49,859 [myid:] - INFO  [ZooKeeperServer@964] - Client attempting to establish new session at /127.0.0.1:37862
06:45:49,867 [myid:] - INFO  [ZooKeeperServer@678] - Established session 0x1092fb2fc3000ab with negotiated timeout 40000 for client /127.0.0.1:37862

And, critically, when the new session is established, the InterProcessReadWriteLock is released by the establishing client, so the other program acquires the lock.


  1.  Can I prevent this behavior so the other client doesn’t acquire the lock on session loss?
  2.  Are there any hints above as to what is causing the session loss? Why is my client program closing the ZK socket? Why doesn’t it hear from the server for so long?
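For context on (1), the working client's use of the lock has roughly this shape. This is a minimal sketch only; doWork() and the "/xxx/lock" path are placeholders, not my actual code:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
import org.apache.curator.framework.state.ConnectionState;

// Acquire the write lock, then watch connection state while the long job runs.
InterProcessReadWriteLock rwLock = new InterProcessReadWriteLock(client, "/xxx/lock");
rwLock.writeLock().acquire();
client.getConnectionStateListenable().addListener((CuratorFramework c, ConnectionState state) -> {
    if (state == ConnectionState.SUSPENDED || state == ConnectionState.LOST) {
        // The lock's ephemeral node may be gone; ownership is no longer guaranteed.
        System.err.println("Connection " + state + ": lock may have been lost");
    }
});
try {
    doWork();   // hours of work (placeholder)
} finally {
    rwLock.writeLock().release();
}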

Thanks in advance!

Joe Naegele
Grier Forensics

Re: Unexpected session expired event

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
We'd need to see the ZooKeeper logs to know why you're losing your session. Also, you may have more luck on the ZooKeeper mailing list.

-Jordan

Re: Unexpected session expired event

Posted by Cameron McKenzie <mc...@gmail.com>.
hey Joe,
Curator just wraps the ZooKeeper client. The ZooKeeper client is responsible for sending heartbeats etc., which it does in its own thread. If you're running hundreds of threads within your VM, it's possible that that client thread is getting starved; it's difficult to know. One generic way to check is to measure how late a fixed-rate task fires in your JVM, as in the sketch below.
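Something like this (completely illustrative, plain JDK, not Curator API) can hint at thread starvation or long pauses; the 500ms threshold is an arbitrary assumption:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Log whenever a 1-second tick fires late; a big lag suggests the JVM's threads
// were starved or paused long enough to also threaten ZK heartbeats.
ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
AtomicLong last = new AtomicLong(System.nanoTime());
ses.scheduleAtFixedRate(() -> {
    long now = System.nanoTime();
    long lagMs = TimeUnit.NANOSECONDS.toMillis(now - last.getAndSet(now)) - 1000;
    if (lagMs > 500) {
        System.err.println("Tick fired " + lagMs + "ms late; threads may be starved or paused");
    }
}, 1, 1, TimeUnit.SECONDS);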

cheers

RE: Unexpected session expired event

Posted by Joe Naegele <jn...@grierforensics.com>.
Cameron,

Thanks, this is very helpful.

My program is not experiencing extreme GC events, and it doesn’t appear the ZK server is either (as expected). I have a feeling the connection loss is due to the client being starved of CPU time. My main program spawns hundreds of threads with different roles, then waits on them in succession, so the main thread effectively sleeps (BLOCKED) for most of its lifetime. I’m curious though, how does the Curator framework send/receive heartbeats once started? I had assumed it has its own thread pool and runs “in the background”, regardless of what my main thread and any other threads do.

I follow your suggestion to read/write additional state for tracking persistent lock ownership. Sounds reasonable.

Thanks again
---
Joe Naegele
Grier Forensics

Re: Unexpected session expired event

Posted by Cameron McKenzie <mc...@gmail.com>.
hey Joe,
The session timeout that you've outlined in the Curator code is 120 seconds, but the negotiated timeout is 40 seconds. This is presumably because the tick time on your ZooKeeper server is set to 2 seconds (the maximum session timeout is 20 * configured tick time).
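If you want the server to honor the 120 second request, you'd need to raise that cap in zoo.cfg, something along these lines (values illustrative):

# zoo.cfg -- by default minSessionTimeout = 2 * tickTime and
# maxSessionTimeout = 20 * tickTime (here, 4s and 40s)
tickTime=2000
maxSessionTimeout=120000   # lift the 40s cap so a 120s client request isn't clamped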

As to why you're seeing the connection loss, it's very difficult to know. It could be that ZooKeeper or your clients are being starved of CPU time, meaning that heartbeats aren't being sent / received. Or, it could be that you're getting 'stop the world' GC events in the JVM.

In regards to the lock acquisition, if you're trying to ensure that the second client doesn't run until the first client completes its work (rather than just until the first client no longer holds the lock), then you probably need to write some state to ZK to communicate this. The state used by the locks is ephemeral, so it will disappear if the lock holder's session is closed; see the sketch below.
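A minimal sketch of that idea; the marker path and the polling loop are illustrative, not a prescribed recipe:

import org.apache.zookeeper.CreateMode;

// First client: publish a durable "done" marker when the work finishes.
// PERSISTENT znodes survive session expiry, unlike the lock's ephemeral nodes.
client.create()
      .creatingParentsIfNeeded()
      .withMode(CreateMode.PERSISTENT)
      .forPath("/xxx/work-complete");

// Second client: don't start until the marker exists (polling for brevity;
// checkExists().usingWatcher(...) would avoid the busy wait).
while (client.checkExists().forPath("/xxx/work-complete") == null) {
    Thread.sleep(1000);
}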
cheers



On Thu, Mar 29, 2018 at 7:14 AM, Joe Naegele <jn...@grierforensics.com>
wrote:

> Good afternoon,
>
>
>
> I’m experiencing an odd issue with ZooKeeper/Curator. I’m using ZooKeeper
> 3.5.2-alpha and Curator 4.0.0. My issue may be a ZooKeeper problem, but I
> want to understand whether Curator is involved, or how Curator might avoid
> the problem.
>
>
>
> I’ve configured my client (framework) as follows:
>
>
>
> client = CuratorFrameworkFactory.builder()
>
>       .connectString(”127.0.0.1:2181”)
>
>       .sessionTimeoutMs(120000)
>
>       .connectionTimeoutMs(30000)
>
>       .retryPolicy(new ExponentialBackoffRetry(2000, 8))
>
>       .build()
>
> client.start()
>
>
>
> My ZooKeeper server is just one server with a “default” configuration.
> Two client programs run on the same server and use a variety of recipes,
> including GroupMember, SharedCount, DistributedAtomicInteger, and, most
> importantly, an InterProcessReadWriteLock. One of the client programs
> performs hours of work while the other waits on the lock. While doing work,
> the program often reports the following message:
>
>
>
> 06:23:56.867 WARN  org.apache.zookeeper.ClientCnxn - Client session timed
> out, have not heard from server in 32147ms for sessionid 0x1092fb2fc3000a9
>
>
>
> and ZooKeeper immediately afterward reports:
>
>
>
> 06:23:56,910 [myid:] - WARN  [NIOServerCnxn@365] - Unable to read
> additional data from client sessionid 0x1092fb2fc3000a9, likely client has
> closed socket
>
> 06:23:56,911 [myid:] - INFO  [MBeanRegistry@128] - Unregister MBean
> [org.apache.ZooKeeperService:name0=StandaloneServer_
> port2181,name1=Connections,name2=127.0.0.1,name3=0x1092fb2fc3000a9]
>
> 06:23:56,912 [myid:] - INFO  [NIOServerCnxn@607] - Closed socket
> connection for client /127.0.0.1:47312 which had sessionid
> 0x1092fb2fc3000a9
>
> 06:23:58,447 [myid:] - INFO  [NIOServerCxnFactory.AcceptThread:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted
> socket connection from /127.0.0.1:55396
>
> 06:23:58,448 [myid:] - INFO  [ZooKeeperServer@969] - Client attempting to
> renew session 0x1092fb2fc3000a9 at /127.0.0.1:55396
>
> 06:23:58,449 [myid:] - INFO  [ZooKeeperServer@678] - Established session
> 0x1092fb2fc3000a9 with negotiated timeout 40000 for client /
> 127.0.0.1:55396
>
> 06:23:58,450 [myid:] - INFO  [PrepRequestProcessor@841] - Got user-level
> KeeperException when processing sessionid:0x1092fb2fc3000a9 type:create2
> cxid:0x9fb zxid:0x4c602ea txntype:-1 reqpath:n/a Error
> Path:/xxx/AAAClient-5532fe53-f746-4f28-a09e-1dce6d908885
> Error:KeeperErrorCode = NodeExists for /xxx/AAAClient-5532fe53-f746-
> 4f28-a09e-1dce6d908885
>
>
>
> So it looks like Curator correctly reestablishes a connection with the
> ZooKeeper server and the previously-established GroupMember recipe is kept
> intact.
>
>
>
> But eventually, the ZooKeeper server expires the session, Curator attempts
> to reconnect and renew the session:
>
>
>
> 06:45:48,331 [myid:] - WARN  [NIOServerCnxn@365] - Unable to read
> additional data from client sessionid 0x1092fb2fc3000a9, likely client has
> closed socket
>
> 06:45:48,346 [myid:] - INFO  [MBeanRegistry@128] - Unregister MBean
> [org.apache.ZooKeeperService:name0=StandaloneServer_
> port2181,name1=Connections,name2=127.0.0.1,name3=0x1092fb2fc3000a9]
>
> 06:45:48,347 [myid:] - INFO  [NIOServerCnxn@607] - Closed socket
> connection for client /127.0.0.1:55396 which had sessionid
> 0x1092fb2fc3000a9
>
> 06:45:49,483 [myid:] - INFO  [ZooKeeperServer@384] - Expiring session
> 0x1092fb2fc3000a9, timeout of 40000ms exceeded
>
> 06:45:49,484 [myid:] - INFO  [PrepRequestProcessor@647] - Processed
> session termination for sessionid: 0x1092fb2fc3000a9
>
> 06:45:49,502 [myid:] - INFO  [PrepRequestProcessor@841] - Got user-level
> KeeperException when processing sessionid:0x1092fb2fc3000aa type:create
> cxid:0xf6 zxid:0x4c602ee txntype:-1 reqpath:n/a Error
> Path:/xxx/BBBClient-f3ac6d3b-2b4e-4992-abd3-49f9f6b77e3a
> Error:KeeperErrorCode = NodeExists for /xxx/BBBClient-f3ac6d3b-2b4e-
> 4992-abd3-49f9f6b77e3a
>
> 06:45:49,839 [myid:] - INFO  [NIOServerCnxnFactory$AcceptThread@296] -
> Accepted socket connection from /127.0.0.1:37844
>
> 06:45:49,839 [myid:] - INFO  [ZooKeeperServer@969] - Client attempting to
> renew session 0x1092fb2fc3000a9 at /127.0.0.1:37844
>
>
>
> The client program reports:
>
>
>
> 06:45:49.840 WARN  org.apache.zookeeper.ClientCnxn - Unable to reconnect
> to ZooKeeper service, session 0x1092fb2fc3000a9 has expired
>
> 06:45:49.840 WARN  org.apache.curator.ConnectionState - Session expired
> event received
>
>
>
> And ZooKeeper reports:
>
>
>
> 06:45:49,840 [myid:] - INFO  [ZooKeeperServer@686] - Invalid session
> 0x1092fb2fc3000a9 for client /127.0.0.1:37844, probably expired
>
> 06:45:49,840 [myid:] - INFO  [NIOServerCnxn@607] - Closed socket
> connection for client /127.0.0.1:37844 which had sessionid
> 0x1092fb2fc3000a9
>
> 06:45:49,859 [myid:] - INFO  [NIOServerCnxnFactory$AcceptThread@296] -
> Accepted socket connection from /127.0.0.1:37862
>
> 06:45:49,859 [myid:] - INFO  [ZooKeeperServer@964] - Client attempting to
> establish new session at /127.0.0.1:37862
>
> 06:45:49,867 [myid:] - INFO  [ZooKeeperServer@678] - Established session
> 0x1092fb2fc3000ab with negotiated timeout 40000 for client /
> 127.0.0.1:37862
>
>
>
> And, critically, when the new session is established, the
> InterProcessReadWriteLock is released by the establishing client, so the
> other program acquires the lock.
>
>
>
>    1. Can I prevent this behavior so the other client doesn’t acquire the
>    lock on session loss?
>    2. Are there any hints above as to what is causing the session loss?
>    Why is my client program closing the ZK socket? Why doesn’t it hear from
>    the server for so long?
>
>
>
> Thanks in advance!
>
>
>
> Joe Naegele
>
> Grier Forensics
>