You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@zookeeper.apache.org by "Håkon Hallingstad (Jira)" <ji...@apache.org> on 2022/05/12 07:25:00 UTC

[jira] [Commented] (ZOOKEEPER-4541) Ephemeral znode owned by closed session visible in 1 of 3 servers

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535922#comment-17535922 ] 

Håkon Hallingstad commented on ZOOKEEPER-4541:
----------------------------------------------

Server 1 was stopped after the session was closed at 17:18:43, and started again. On start it logged digest errors:
 
{code}
2022-05-10 17:18:58.624 I - cfg1 ACL digest algorithm is: SHA1 (DigestAuthenticationProvider)
2022-05-10 17:18:58.625 I - cfg1 zookeeper.DigestAuthenticationProvider.enabled = true (DigestAuthenticationProvider)
2022-05-10 17:18:58.710 I - cfg1 zookeeper.digest.enabled = true (ZooKeeperServer)
2022-05-10 17:18:58.725 I - cfg1 Reading snapshot /opt/vespa/var/zookeeper/version-2/snapshot.a1a00008bef.gz (FileSnap)
2022-05-10 17:18:58.753 I - cfg1 The digest in the snapshot has digest version of 2, , with zxid as 0xa1b00000006, and digest value as 2313248497541 (DataTree)
2022-05-10 17:18:59.842 W - cfg1 Message:Digests are not matching. Value is Zxid. Value:11154030067713 (DataTree)
2022-05-10 17:18:59.844 E - cfg1 First digest mismatch on txn: 216180518738395136,2361,11154030067713,1652203051156,15
                                 , '/provision/v1/locks/maintenanceJobLocks/InfrastructureProvisioner/_c_50be1b68-5822-496a-8820-99e3dc4890a1-lock-0000337756,#31302e3231352e3137392e3239,v\{s{31,s{'world,'anyone}}},T,337757
                                 , expected digest is 2,2649457361965
                                 , actual digest is 2649853538874,  (DataTree)
2022-05-10 17:18:59.847 I - cfg1 111604 txns loaded in 987 ms (FileTxnSnapLog)
2022-05-10 17:18:59.847 I - cfg1 Snapshot loaded in 1123 ms, highest zxid is 0xa2500000117, digest is 2645998505798 (ZKDatabase)
2022-05-10 17:34:00.004 W - cfg1 [2771 times] Message: Digests are not matching. Value is Zxid. Last value:11154030070498 (DataTree)
{code}


> Ephemeral znode owned by closed session visible in 1 of 3 servers
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4541
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4541
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>            Reporter: Håkon Hallingstad
>            Priority: Major
>
> We have a ZooKeeper 3.7.0 ensemble with servers 1-3. Using zkCli.sh we saw the following znode on server 1:
> {code:java}
> stat /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> cZxid = 0xa240000381f
> ctime = Tue May 10 17:17:27 UTC 2022
> mZxid = 0xa240000381f
> mtime = Tue May 10 17:17:27 UTC 2022
> pZxid = 0xa240000381f
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x20006d5d6a10000
> dataLength = 14
> numChildren = 0
> {code}
> This znode was absent on server 2 and 3. A delete on the node failed everywhere.
> {code:java}
> delete /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> Node does not exist: /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}
> This makes sense as a mutable operation goes to the leader, which was server 3 and where the node is absent. Restarting server 1 did not fix the issue. Stopping server 1, removing the zookeeper database and version-2 directory, and starting the server fixed the issue.
> Session 0x20006d5d6a10000 was created by a ZooKeeper client and initially connected to server 2. The client later closed the session at around the same time as the then-leader server 2 was stopped:
> {code:java}
> 2022-05-10 17:17:28.141 I - cfg2 Submitting global closeSession request for session 0x20006d5d6a10000 (ZooKeeperServer)
> 2022-05-10 17:17:28.145 I - cfg2 Session: 0x20006d5d6a10000 closed (ZooKeeper)
> {code}
> That both were closed/shutdown at the same time is something we do on our side. Perhaps there is a race in the handling of closing of sessions and the shutdown of the leader?
> We did a {{dump}} of server 1-3, and server 1 had two sessions that did not exist in server 2 and 3, and there was one ephemeral node reported on 1 that was not reported on server 2 and 3. The ephemeral owner matched one of the two extraneous sessions.  The dump from server 1 with the extra entries:
> {code:java}
> Global Sessions(8):
> ...
> 0x20006d5d6a10000 120000ms
> 0x2004360ecca0000 120000ms
> ...
> Sessions with Ephemerals (4):
> 0x20006d5d6a10000:
> /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)