You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Neha Narkhede (JIRA)" <ji...@apache.org> on 2012/04/26 21:30:30 UTC

[jira] [Commented] (ZOOKEEPER-1457) Ephemeral node deleted for unexpired sessions

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262900#comment-13262900 ] 

Neha Narkhede commented on ZOOKEEPER-1457:
------------------------------------------

I've attempted to serialize the sequence of events according to relevant data from txn and log4j logs below. 

Zookeeper cluster size is 5 servers - srvr-84.prod, srvr-85.prod, srvr-88.prod, srvr-89.prod, srvr-90.prod
Leader zookeeper is srvr-90.prod 
Session A = 0x9367a5bd54c9078 
Session B = 0x8367a5bd75ea01b 
Session C = 0x8367a5bd75e9dd5 
The znode being created = /consumers/kafka-DR/ids/kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76 
Zookeeper session timeout is 6 seconds 

1. Original session A, 0x9367a5bd54c9078 expired at 20:33:09 with zxid 0x31189e02d8 on the leader. 

2. Leader processed session termination for 0x9367a5bd54c9078 at 20:33.09.000. 

3. Leader realizes client has closed socket, so it closes NIOServerCnxn for 0x9367a5bd54c9078 at 20:33:09.009 

4. Kafka client gets Expired state at 20:33:09.161 

5. Kafka client starts re-registering its node in ZK at 20:33.09.209 with socket connection established to srvr-89.prod 

6. Client session 0x0 times out several times, reconnects to another server and repeats session establishment request. 

7. On the leader's txn log, session B 0x8367a5bd75ea01b is created at 20:33.17- 
1. time:4/23/12 20:33:17 PM PDT session:0x8367a5bd75ea01b cxid:0x0 zxid:0x31189e1985 type:createSession timeOut:6000 

8. On zk server srvr-89, session B 0x8367a5bd75ea01b is established - 
1. zk.log4j.2012-04-23-srvr-89.gz: 1. 2012-04-23 20:33:17,987 - INFO [CommitProcessor:8:NIOServerCnxn@1580] - Established session 0x8367a5bd75ea01b with negotiated timeout 6000 for client /172.17.87.173:52865 
2. 2012/04/23 20:33:18.022 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (SyncConnected) 

9. At 20:33:18.022 on Kafka client, session is established with session id 0x8367a5bd75ea01b 
1. 2012/04/23 20:33:18.022 INFO [ClientCnxn] [main-SendThread(srvr-89.prod:12913)] [kafka] Session establishment complete on server srvr-89.prod/172.17.229.105:12913, sessionid = 0x8367a5bd75ea01b, negotiated timeout = 6000 

10. Kafka zookeeper client fails to create znode at 20:33.18.358 with Session B due to NodeExists exception 
1. 2012/04/23 20:33:18.358 INFO [ZkUtils$] [ZkClient-EventThread-81-srvr-85.prod:12913,srvr-88.prod:12913,srvr-84.prod:12913,srvr-89.prod:12913,srvr-90.prod:12913/kafka] [kafka] /consumers/kafka-DR/ids/kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76 exists with value { "firehoseActivity": 1,"firehoseUpdates": 1,"GenericSearchEvent": 1,"firehoseShares": 1,"ProfileViewEvent": 1 } during connection loss; this is ok

11. Kafka client gets Disconnected and SyncConnected with session B at 20:33:26 
1. 2012/04/23 20:33:26.669 INFO [ClientCnxn] [main-SendThread(srvr-89.prod:12913)] [kafka] Client session timed out, have not heard from server in 4000ms for sessionid 0x8367a5bd75ea01b, closing socket connection and attempting reconnect 
2. 2012/04/23 20:33:26.769 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected) 
3. 2012/04/23 20:33:27.471 INFO [ClientCnxn] [main-SendThread(srvr-89.prod:12913)] [kafka] Opening socket connection to server srvr-88.prod/172.17.229.104:12913 
4. 2012/04/23 20:33:27.531 INFO [ClientCnxn] [main-SendThread(srvr-88.prod:12913)] [kafka] Socket connection established to srvr-88.prod/172.17.229.104:12913, initiating session 
5. 2012/04/23 20:33:28.419 INFO [ClientCnxn] [main-SendThread(srvr-88.prod:12913)] [kafka] Session establishment complete on server srvr-88.prod/172.17.229.104:12913, sessionid = 0x8367a5bd75ea01b, negotiated timeout = 6000 

12. Kafka client runs into NoNode exception at 20:33:30 2012/04/23 20:33:30.538 INFO [ZookeeperConsumerConnector] [kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76_watcher_executor] [kafka] exception during rebalance 

At around the same time, the session C (0x8367a5bd75e9dd5) gets established and also creates the node and later expires. 

1. On leader srvr-90, session C gets created 1. time:4/23/12 20:33:09 PM PDT session:0x8367a5bd75e9dd5 cxid:0x0 zxid:0x31189e0709 type:createSession timeOut:6000 

2. On server srvr-89, session gets created at 20:33.13.930 
1. zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:13,930 - INFO [CommitProcessor:8:NIOServerCnxn@1580] - Established session 0x8367a5bd75e9dd5 with negotiated timeout 6000 for client /172.17.87.173:52802 
2. zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:17,667 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x8367a5bd75e9dd5, likely client has closed socket 
3. zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:17,668 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed socket connection for client /172.17.87.173:52802 which had sessionid 0x8367a5bd75e9dd5 

3. Session C creates the znode - 
1. time:4/23/12 20:33:17 PM PDT session:0x8367a5bd75e9dd5 cxid:0x1 zxid:0x31189e1464 type:create acl:[31,s{'world,'anyone}] ephemeral:true path:/kafka/consumers/kafka-DR/ids/kafka-DR_ech3-app0193.prod-1334094226781-6a7f1b76 data:7B202266697265686F73654163746976697479223A20312C2266697265686F736555706461746573223A20312C2247656E657269635365617263684576656E74223A20312C2266697265686F7365536861726573223A20312C2250726F66696C65566965774576656E74223A2031207D 
2. Session B gets closed on the leader

4. Session C expires on leader time:4/23/12 20:33:29 PM PDT session:0x8367a5bd75e9dd5 cxid:0x0 zxid:0x31189e27d0 type:closeSession 

Since the leader processes create session and create znode for Session C first, shouldn't it be the session id that gets returned to the client as create session response ? Does this sound like a bug ?
                
> Ephemeral node deleted for unexpired sessions
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1457
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1457
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.4
>            Reporter: Neha Narkhede
>
> This week, we saw a potential bug with zookeeper 3.3.4. In an attempt to adding a separate disk for zookeeper transaction logs, our SysOps team threw new disks at all the zookeeper servers in our production cluster at around the same time. Right after this, we saw degraded performance on our zookeeper cluster. And yes, I agree that this degraded behavior is expected and we could've done a better job and upgraded one server at a time. Al though, the observed impact was that ephemeral nodes got deleted without session expiration on the zookeeper clients. 
> Let me try and describe what I've observed from the Kafka and ZK server logs - Kafka client has a session established with ZK, say Session A, that it has been using successfully. At the time of the degraded ZK performance issue, Session A expires. Kafka's ZkClient tries to establish another session with ZK. After 9 seconds, it establishes a session, say Session B and tries to use it for creating a znode. This operation fails with a NodeExists error since another session, say session C, has created that znode. This is considered OK since ZkClient retries an operation transparently if it gets disconnected and sometimes you can get NodeExists. But then later, session C expires and hence the ephemeral node is deleted from ZK. This leads to unexpected errors in Kafka since its session, Session B, is still valid and hence it expects the znode to be there. The issue is that session C was established, created the znode and expired, without the zookeeper client on Kafka ever knowing about it. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira