Posted to issues@zookeeper.apache.org by "Lea Morschel (Jira)" <ji...@apache.org> on 2020/08/17 09:33:00 UTC

[jira] [Commented] (ZOOKEEPER-3890) Ephemeral node not deleted after session is gone, then elected as leader

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178869#comment-17178869 ] 

Lea Morschel commented on ZOOKEEPER-3890:
-----------------------------------------

Sorry for taking so long to answer, and thank you for investigating!

We observed this issue with an embedded ZooKeeper. After further investigation I discovered that it includes workarounds for issues https://issues.apache.org/jira/browse/ZOOKEEPER-2812 and https://issues.apache.org/jira/browse/ZOOKEEPER-2810 that resulted in the SessionTracker sometimes being passed an empty HashMap instead of the correct sessionsWithTimeouts mapping on startup. This was a mistake on our side and is now fixed. Our recent transition from ZooKeeper 3.4 to 3.5 might have played a role as well.
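For illustration, a minimal sketch of this kind of mistake, assuming an embedded server that overrides {{createSessionTracker()}} (the subclass below is hypothetical; {{SessionTrackerImpl}} and {{ZKDatabase.getSessionWithTimeOuts()}} are from the 3.5.x API):
{code:java}
import org.apache.zookeeper.server.SessionTrackerImpl;
import org.apache.zookeeper.server.ZooKeeperServer;

// Hypothetical embedded subclass, only to illustrate the mistake.
public class EmbeddedZooKeeperServer extends ZooKeeperServer {

    @Override
    protected void createSessionTracker() {
        // Buggy workaround: passing an empty map means sessions restored
        // from the snapshot/txn log are never tracked, so they never
        // expire and their ephemeral nodes are never deleted:
        //   new SessionTrackerImpl(this, new ConcurrentHashMap<>(), ...)

        // Fix: hand the tracker the sessions restored into the database.
        sessionTracker = new SessionTrackerImpl(this,
                getZKDatabase().getSessionWithTimeOuts(),
                tickTime, 1 /* server id, 1 = standalone default */,
                getZooKeeperServerListener());
    }
}{code}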
With a standalone ZooKeeper server, or even a cluster, we have had sporadic reports of issues that were probably also related to stale ephemeral nodes persisting. However, I have been unable to reproduce the reported problem with a standalone ZooKeeper server of version 3.5.7 and have to conclude that at least the easily reproducible scenario described in this issue does not apply there; I have not yet found another such scenario.

Therefore I am sorry for having bothered you prematurely; it seems to have been mainly a problem on our side. I will close this issue and keep watching for these types of problems in case we do observe them again at some point!

Just some final words on your observations: the lines
{code:java}
Ignoring processTxn failure hdr: -1, error: -110, path: null{code}
and
{code:java}
EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.72{code}
still show up on startup, but do not seem related to the now-fixed problem in our embedded ZooKeeper instance. (For what it's worth, error -110 corresponds to {{KeeperException.Code.NODEEXISTS}}, which the server deliberately ignores while replaying the transaction log over a snapshot.)
The line
{code:java}
ZKShutdownHandler is not registered, so ZooKeeper server won't take any action on ERROR or SHUTDOWN server state changes{code}
shows up in the logs because of the high log level (DEBUG) and because a {{ZKShutdownHandler}} is not, or may not be able to be, registered if the user creates a {{ZooKeeperServer}} object outside of {{ZooKeeperServerMain.runFromConfig}}.
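For context, a minimal sketch of such an embedded startup, assuming the 3.5.x API (the directory and port are placeholders); since no {{ZKShutdownHandler}} is registered on this path, the quoted DEBUG message appears:
{code:java}
import java.io.File;
import java.net.InetSocketAddress;
import org.apache.zookeeper.server.ServerCnxnFactory;
import org.apache.zookeeper.server.ZooKeeperServer;
import org.apache.zookeeper.server.persistence.FileTxnSnapLog;

public class EmbeddedStartup {
    public static void main(String[] args) throws Exception {
        // Placeholder data/snapshot directory.
        File dir = new File("/my/path");
        FileTxnSnapLog txnLog = new FileTxnSnapLog(dir, dir);

        // Creating the server directly, instead of going through
        // ZooKeeperServerMain.runFromConfig, leaves no ZKShutdownHandler
        // registered, which triggers the quoted DEBUG message.
        ZooKeeperServer zks = new ZooKeeperServer(txnLog, 2000);

        ServerCnxnFactory cnxnFactory =
                ServerCnxnFactory.createFactory(new InetSocketAddress(2181), 60);
        cnxnFactory.startup(zks);
    }
}{code}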
The observed errors (3.), however, indeed seem to have been related to the described problem and are now gone.

Thank you again!

> Ephemeral node not deleted after session is gone, then elected as leader
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3890
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3890
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.14, 3.5.7
>            Reporter: Lea Morschel
>            Priority: Major
>         Attachments: cmdline-feedback.txt, zkLogsAndSnapshots.tar.xz
>
>
> When a ZooKeeper client session disappears, the associated ephemeral node that is used for leader election is occasionally not deleted and persists (indefinitely, it seems).
>  A leader election process may select such a stale node as the leader. In a scenario where a redundant service takes action upon acquiring leadership through a ZooKeeper election, this leads to none of the services being active while the stale ephemeral node holds the leadership.
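> For reference, a minimal sketch of the ephemeral-node election pattern in question (the {{/election}} path and connect string are placeholders, and the parent node is assumed to exist):
> {code:java}
> import java.util.Collections;
> import java.util.List;
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
> 
> public class ElectionSketch {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
> 
>         // Each participant creates an ephemeral sequential node, which
>         // should disappear automatically once the owning session expires.
>         String me = zk.create("/election/candidate-", new byte[0],
>                 ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
> 
>         // The participant with the lowest sequence number leads. A stale
>         // ephemeral node whose session never expires wins indefinitely.
>         List<String> children = zk.getChildren("/election", false);
>         Collections.sort(children);
>         System.out.println(me + " leader=" + me.endsWith(children.get(0)));
>     }
> }{code}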
> One of the scenarios that creates such a stale ephemeral node can be triggered by force-killing the ZooKeeper server ({{kill -9 <pid>}}) as well as the client: after the server is restarted, it recreates the session on its side, even though the actual client session is gone. The node then persists even across regular restarts. Unlike for an active session, no pings are received from its owner session, yet the session never expires. This scenario involves a single ZooKeeper server, but the problem has also been observed in a cluster of three.
> When the ephemeral node is first persisted after restarting (and every restart thereafter), the following is observable in the ZooKeeper server logs. The scenario involves a local ZooKeeper server (version 3.5.7) and a single leader election participant.
> {code:java}
> Opening datadir:/my/path snapDir:/my/path
> zookeeper.snapshot.trust.empty : true
> tickTime set to 2000
> minSessionTimeout set to 4000
> maxSessionTimeout set to 40000
> zookeeper.snapshotSizeFactor = 0.33
> Reading snapshot /my/path/version-2/snapshot.71
> Created new input stream /my/path/version-2/log.4b
> Created new input archive /my/path/version-2/log.4b
> EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.4b
> Created new input stream /my/path/version-2/log.72
> Created new input archive /my/path/version-2/log.72
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> Ignoring processTxn failure hdr: -1 : error: -110
> Ignoring processTxn failure hdr: -1, error: -110, path: null
> EOF exception java.io.EOFException: Failed to read /my/path/version-2/log.72
> Snapshotting: 0x8b to /my/path/version-2/snapshot.8b
> ZKShutdownHandler is not registered, so ZooKeeper server won't take any action on ERROR or SHUTDOWN server state changes
> autopurge.snapRetainCount set to 3
> autopurge.purgeInterval set to 3{code}
> Could this problem be solved by ZooKeeper checking the sessions for each participating node before starting a leader election?
> So far only manual intervention (removing the stale ephemeral node) seems to "fix" the issue temporarily.


