Posted to dev@zookeeper.apache.org by "Benjamin Reed (JIRA)" <ji...@apache.org> on 2016/09/24 09:37:20 UTC

[jira] [Resolved] (ZOOKEEPER-2600) dangling ephemerals on overloaded server with local sessions

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed resolved ZOOKEEPER-2600.
--------------------------------------
    Resolution: Cannot Reproduce

> dangling ephemerals on overloaded server with local sessions
> ------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2600
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2600
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>            Reporter: Benjamin Reed
>
> we had the following strange production bug:
> there was an ephemeral znode for a session that was no longer active.  it happened even in the absence of failures.
> we are running with local sessions enabled and slightly different logic than the open source zookeeper, but code inspection shows that the problem is also in open source.
> the triggering condition was server overload. we had a traffic burst and were seeing commit latencies of over 30 seconds.
> after digging through the logs and code, we found that the create session txn for the session owning the ephemeral node was generated (in the PrepRequestProcessor) at 11:23:04 and committed at 11:23:38 (the "Adding global session" message is output in the commit processor). it took 34 seconds to commit the createSession, and during that time the session expired. due to the delays, it appears that the interleaving was as follows:
> 1) create session hits prep request processor and create session txn generated 11:23:04
> 2) time passes as the create session is going through zab
> 3) the session expires, close session is generated, and close session txn generated 11:23:23
> 4) the create session gets committed and the session gets re-added to the sessionTracker 11:23:38
> 5) the create ephemeral node hits prep request processor and a create txn generated 11:23:40
> 6) the close session gets committed (all ephemeral nodes for the session are deleted) and the session is deleted from sessionTracker
> 7) the create ephemeral node gets committed
> the root cause seems to be that the global sessions are managed by both the PrepRequestProcessor and the CommitProcessor. also, with local session upgrading, we can have changes in flight before our session commits. i think there are probably two places to fix:
> 1) changes to session tracker should not happen in prep request processor.
> 2) we should not have requests in flight while create session is in progress. there are two options to prevent this:
> a) when a create session is generated in makeUpgradeRequest, we need to start queuing the requests from the clients and only submit them once the create session is committed
> b) the client should explicitly detect that it needs to change from local session to global session and explicitly open a global session and get the commit before it sends an ephemeral create request
> option 2a) is a more transparent fix, but architecturally and in the long term i think 2b) might be better.
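
As a rough illustration of the interleaving in steps 1) through 7) of the quoted report, the Java sketch below replays the commit order against a toy model. Every class and method name is made up for this sketch and none is a ZooKeeper class. It also assumes that the prep stage only checks whether the session is currently tracked, and that committing a close session deletes only the ephemerals that exist at that moment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: every name here is hypothetical, none is a ZooKeeper class.
class DanglingEphemeralSketch {
    // Stand-in for the session tracker: ids of sessions currently considered live.
    private final Set<Long> liveSessions = new HashSet<>();
    // Stand-in for the data tree's ephemerals: path -> owning session id.
    private final Map<String, Long> ephemeralOwners = new HashMap<>();

    // Prep stage (assumption): an ephemeral create is accepted if the session looks live.
    boolean prepCreateEphemeral(long sessionId) {
        return liveSessions.contains(sessionId);
    }

    // Commit of create session: the session is (re-)added to the tracker.
    void commitCreateSession(long sessionId) {
        liveSessions.add(sessionId);
    }

    // Commit of close session: delete the ephemerals owned right now, then drop the session.
    void commitCloseSession(long sessionId) {
        ephemeralOwners.values().removeIf(owner -> owner == sessionId);
        liveSessions.remove(sessionId);
    }

    // Commit of an ephemeral create: the node is written without re-checking the session.
    void commitCreateEphemeral(String path, long sessionId) {
        ephemeralOwners.put(path, sessionId);
    }

    // Ephemerals whose owning session is no longer tracked.
    List<String> danglingEphemerals() {
        List<String> dangling = new ArrayList<>();
        for (Map.Entry<String, Long> entry : ephemeralOwners.entrySet()) {
            if (!liveSessions.contains(entry.getValue())) {
                dangling.add(entry.getKey());
            }
        }
        return dangling;
    }

    // Replays the reported steps in order and returns what is left dangling.
    static List<String> replayReportedInterleaving() {
        DanglingEphemeralSketch server = new DanglingEphemeralSketch();
        long sessionId = 42L; // arbitrary example id
        // Steps 1-3: create session and close session txns are generated; nothing is committed yet.
        server.commitCreateSession(sessionId);                     // step 4
        boolean accepted = server.prepCreateEphemeral(sessionId);  // step 5
        server.commitCloseSession(sessionId);                      // step 6
        if (accepted) {
            server.commitCreateEphemeral("/example-ephemeral", sessionId); // step 7
        }
        return server.danglingEphemerals();
    }

    public static void main(String[] args) {
        System.out.println(replayReportedInterleaving()); // prints [/example-ephemeral]
    }
}
```

In this model the ephemeral create passes the prep check because the late create session commit has just re-added the session, and it commits after the close session has already cleaned up, so nothing ever deletes the node. The sketch only reproduces the ordering from the report; it does not evaluate fix 1) or options 2a) and 2b).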


