You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by Neha Narkhede <ne...@gmail.com> on 2012/04/26 20:17:54 UTC

Ephemeral node deleted for unexpired sessions in Zookeeper 3.3.4

Hi,

This week, we saw a potential bug with zookeeper 3.3.4. In an attempt to
adding a separate disk for zookeeper transaction logs, our SysOps team
threw new disks at all the zookeeper servers in our production cluster at
around the same time. Right after this, we saw degraded performance on our
zookeeper cluster. And yes, I agree that this degraded behavior is expected
and we could've done a better job and upgraded one server at a time. Al
though, the observed impact was that ephemeral nodes got deleted without
session expiration on the zookeeper clients.

Let me try and describe what I've observed from the Kafka and ZK server
logs -

Kafka client has a session established with ZK, say Session A, that it has
been using successfully. At the time of the degraded ZK performance issue,
Session A expires. Kafka's ZkClient tries to establish another session with
ZK. After 9 seconds, it establishes a session, say Session B and tries to
use it for creating a znode. This operation fails with a NodeExists error
since another session, say session C, has created that znode. This is
considered OK since ZkClient retries an operation transparently if it gets
 disconnected and sometimes you can get NodeExists. But then later, session
C expires and hence the ephemeral node is deleted from ZK. This leads to
unexpected errors in Kafka since its session, Session B, is still valid and
hence it expects the znode to be there. The issue is that session C was
established, created the znode and expired, without the zookeeper client on
Kafka ever knowing about it.

I've attempted to serialize the sequence of events according to relevant
data from txn and log4j logs below.

Leader zookeeper is srvr-90.prod

Session A = 0x9367a5bd54c9078
Session B = 0x8367a5bd75ea01b
Session C = 0x8367a5bd75e9dd5
The znode being created =
 /consumers/kafka-DR/ids/kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76
zookeeper session timeout is 6 seconds


   1.

   Original session A, 0x9367a5bd54c9078 expired at 20:33:09 with zxid
   0x31189e02d8 on the leader.
   2.

   Leader processed session termination for 0x9367a5bd54c9078 at
   20:33.09.000.
   3.

   Leader realizes client has closed socket, so it closes NIOServerCnxn for
   0x9367a5bd54c9078 at 20:33:09.009
   4.

   Kafka client gets Expired state at 20:33:09.161
   5.

   Kafka client starts re-registering its node in ZK at 20:33.09.209 with
   socket connection established to srvr-89.prod
   6.

   Client session 0x0 times out several times, reconnects to another server
   and repeats session establishment request.
   7.

   On the leader's txn log, session B 0x8367a5bd75ea01b is created at
   20:33.17-
   1.

      time:4/23/12 20:33:17 PM PDT session:0x8367a5bd75ea01b cxid:0x0
      zxid:0x31189e1985 type:createSession timeOut:6000
       8.

   On zk server srvr-89, session B 0x8367a5bd75ea01b is established -
   1.

      zk.log4j.2012-04-23-srvr-89.gz:
      1.

         2012-04-23 20:33:17,987 - INFO
          [CommitProcessor:8:NIOServerCnxn@1580] - Established session
         0x8367a5bd75ea01b with negotiated timeout 6000 for client /
         172.17.87.173:52865
         2.

         2012/04/23 20:33:18.022 INFO [ZkClient] [main-EventThread] [kafka]
         zookeeper state changed (SyncConnected)
          9.

   At 20:33:18.022 on Kafka client, session is established with session id
   0x8367a5bd75ea01b
   1.

      2012/04/23 20:33:18.022 INFO [ClientCnxn]
      [main-SendThread(srvr-89.prod:12913)] [kafka] Session establishment
      complete on server srvr-89.prod/172.17.229.105:12913, sessionid =
      0x8367a5bd75ea01b, negotiated timeout = 6000
       10.

   Kafka zookeeper client fails to create znode at 20:33.18.358 with
   Session B due to NodeExists exception
   1.

      2012/04/23 20:33:18.358 INFO [ZkUtils$]
      [ZkClient-EventThread-81-srvr-85.prod:12913,srvr-88.prod:12913,srvr-84.prod:12913,srvr-89.prod:12913,srvr-90.prod:12913/kafka]
      [kafka]
      /consumers/kafka-DR/ids/kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76
      exists with value { "firehoseActivity": 1,"firehoseUpdates":
      1,"GenericSearchEvent": 1,"firehoseShares": 1,"ProfileViewEvent": 1 }
      during connection loss; this is ok
       11.

   Kafka client gets Disconnected and SyncConnected with session B at
   20:33:26
   1.

      2012/04/23 20:33:26.669 INFO [ClientCnxn]
      [main-SendThread(srvr-89.prod:12913)] [kafka] Client session timed out,
      have not heard from server in 4000ms for sessionid 0x8367a5bd75ea01b,
      closing socket connection and attempting reconnect
      2.

      2012/04/23 20:33:26.769 INFO [ZkClient] [main-EventThread] [kafka]
      zookeeper state changed (Disconnected)
      3.

      2012/04/23 20:33:27.471 INFO [ClientCnxn]
      [main-SendThread(srvr-89.prod:12913)] [kafka] Opening socket
connection to
      server srvr-88.prod/172.17.229.104:12913
      4.

      2012/04/23 20:33:27.531 INFO [ClientCnxn]
      [main-SendThread(srvr-88.prod:12913)] [kafka] Socket connection
established
      to srvr-88.prod/172.17.229.104:12913, initiating session
      5.

      2012/04/23 20:33:28.419 INFO [ClientCnxn]
      [main-SendThread(srvr-88.prod:12913)] [kafka] Session establishment
      complete on server srvr-88.prod/172.17.229.104:12913, sessionid =
      0x8367a5bd75ea01b, negotiated timeout = 6000
       12.

   Kafka client runs into NoNode exception at 20:33:30
   2012/04/23 20:33:30.538 INFO [ZookeeperConsumerConnector]
   [kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76_watcher_executor] [kafka]
   exception during rebalance

At around the same time, the session C (0x8367a5bd75e9dd5) gets established
and also creates the node and later expires.

   1.

   On leader srvr-90, session C gets created
   1.

      time:4/23/12 20:33:09 PM PDT session:0x8367a5bd75e9dd5 cxid:0x0
      zxid:0x31189e0709 type:createSession timeOut:6000
       2.

   On server srvr-89, session gets created at 20:33.13.930
   1.

      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:13,930 - INFO
       [CommitProcessor:8:NIOServerCnxn@1580] - Established session
      0x8367a5bd75e9dd5 with negotiated timeout 6000 for client /
      172.17.87.173:52802
      2.

      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:17,667 - WARN
       [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@634] -
      EndOfStreamException: Unable to read additional data from client
sessionid
      0x8367a5bd75e9dd5, likely client has closed socket
      3.

      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:17,668 - INFO
       [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] -
      Closed socket connection for client /172.17.87.173:52802 which had
      sessionid 0x8367a5bd75e9dd5
       3.

   3. Session B creates the znode -
   1.

      time:4/23/12 20:33:17 PM PDT session:0x8367a5bd75e9dd5 cxid:0x1
      zxid:0x31189e1464 type:create acl:[31,s{'world,'anyone}] ephemeral:true
      path:/kafka/consumers/kafka-DR/ids/kafka-DR_ech3-app0193.prod-1334094226781-6a7f1b76
      data:7B202266697265686F73654163746976697479223A20312C2266697265686F736555706461746573223A20312C2247656E657269635365617263684576656E74223A20312C2266697265686F7365536861726573223A20312C2250726F66696C65566965774576656E74223A2031207D

      2.

      Session B gets closed on the leader
       4.

   Session C expires on leader
   1.

      time:4/23/12 20:33:29 PM PDT session:0x8367a5bd75e9dd5 cxid:0x0
      zxid:0x31189e27d0 type:closeSession


Since the leader processes create session and create znode for Session C
first, shouldn't it be the session id that gets returned to the client as
create session response ?
Does this sound like a bug ?

*TL;DR*

Ephemeral nodes got deleted for unexpired sessions on ZK 3.3.4.
The kafka and zookeeper server log4j logs (timestamps are in UTC) as well
as the leader's txn log are
here.<http://people.apache.org/%7Enehanarkhede/kafka-misc/zookeeper-outage-2012-04-23/kafka-zk-bug-2012-04-23.tar.gz>

Will appreciate any help with debugging this.

Thanks,
Neha

Re: Ephemeral node deleted for unexpired sessions in Zookeeper 3.3.4

Posted by Patrick Hunt <ph...@apache.org>.
Please create a jira for this, attach the logs, config, etc... Thanks.

Patrick

On Thu, Apr 26, 2012 at 11:17 AM, Neha Narkhede <ne...@gmail.com> wrote:
> Hi,
>
> This week, we saw a potential bug with zookeeper 3.3.4. In an attempt to
> adding a separate disk for zookeeper transaction logs, our SysOps team
> threw new disks at all the zookeeper servers in our production cluster at
> around the same time. Right after this, we saw degraded performance on our
> zookeeper cluster. And yes, I agree that this degraded behavior is expected
> and we could've done a better job and upgraded one server at a time. Al
> though, the observed impact was that ephemeral nodes got deleted without
> session expiration on the zookeeper clients.
>
> Let me try and describe what I've observed from the Kafka and ZK server
> logs -
>
> Kafka client has a session established with ZK, say Session A, that it has
> been using successfully. At the time of the degraded ZK performance issue,
> Session A expires. Kafka's ZkClient tries to establish another session with
> ZK. After 9 seconds, it establishes a session, say Session B and tries to
> use it for creating a znode. This operation fails with a NodeExists error
> since another session, say session C, has created that znode. This is
> considered OK since ZkClient retries an operation transparently if it gets
>  disconnected and sometimes you can get NodeExists. But then later, session
> C expires and hence the ephemeral node is deleted from ZK. This leads to
> unexpected errors in Kafka since its session, Session B, is still valid and
> hence it expects the znode to be there. The issue is that session C was
> established, created the znode and expired, without the zookeeper client on
> Kafka ever knowing about it.
>
> I've attempted to serialize the sequence of events according to relevant
> data from txn and log4j logs below.
>
> Leader zookeeper is srvr-90.prod
>
> Session A = 0x9367a5bd54c9078
> Session B = 0x8367a5bd75ea01b
> Session C = 0x8367a5bd75e9dd5
> The znode being created =
>  /consumers/kafka-DR/ids/kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76
> zookeeper session timeout is 6 seconds
>
>
>   1.
>
>   Original session A, 0x9367a5bd54c9078 expired at 20:33:09 with zxid
>   0x31189e02d8 on the leader.
>   2.
>
>   Leader processed session termination for 0x9367a5bd54c9078 at
>   20:33.09.000.
>   3.
>
>   Leader realizes client has closed socket, so it closes NIOServerCnxn for
>   0x9367a5bd54c9078 at 20:33:09.009
>   4.
>
>   Kafka client gets Expired state at 20:33:09.161
>   5.
>
>   Kafka client starts re-registering its node in ZK at 20:33.09.209 with
>   socket connection established to srvr-89.prod
>   6.
>
>   Client session 0x0 times out several times, reconnects to another server
>   and repeats session establishment request.
>   7.
>
>   On the leader's txn log, session B 0x8367a5bd75ea01b is created at
>   20:33.17-
>   1.
>
>      time:4/23/12 20:33:17 PM PDT session:0x8367a5bd75ea01b cxid:0x0
>      zxid:0x31189e1985 type:createSession timeOut:6000
>       8.
>
>   On zk server srvr-89, session B 0x8367a5bd75ea01b is established -
>   1.
>
>      zk.log4j.2012-04-23-srvr-89.gz:
>      1.
>
>         2012-04-23 20:33:17,987 - INFO
>          [CommitProcessor:8:NIOServerCnxn@1580] - Established session
>         0x8367a5bd75ea01b with negotiated timeout 6000 for client /
>         172.17.87.173:52865
>         2.
>
>         2012/04/23 20:33:18.022 INFO [ZkClient] [main-EventThread] [kafka]
>         zookeeper state changed (SyncConnected)
>          9.
>
>   At 20:33:18.022 on Kafka client, session is established with session id
>   0x8367a5bd75ea01b
>   1.
>
>      2012/04/23 20:33:18.022 INFO [ClientCnxn]
>      [main-SendThread(srvr-89.prod:12913)] [kafka] Session establishment
>      complete on server srvr-89.prod/172.17.229.105:12913, sessionid =
>      0x8367a5bd75ea01b, negotiated timeout = 6000
>       10.
>
>   Kafka zookeeper client fails to create znode at 20:33.18.358 with
>   Session B due to NodeExists exception
>   1.
>
>      2012/04/23 20:33:18.358 INFO [ZkUtils$]
>      [ZkClient-EventThread-81-srvr-85.prod:12913,srvr-88.prod:12913,srvr-84.prod:12913,srvr-89.prod:12913,srvr-90.prod:12913/kafka]
>      [kafka]
>      /consumers/kafka-DR/ids/kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76
>      exists with value { "firehoseActivity": 1,"firehoseUpdates":
>      1,"GenericSearchEvent": 1,"firehoseShares": 1,"ProfileViewEvent": 1 }
>      during connection loss; this is ok
>       11.
>
>   Kafka client gets Disconnected and SyncConnected with session B at
>   20:33:26
>   1.
>
>      2012/04/23 20:33:26.669 INFO [ClientCnxn]
>      [main-SendThread(srvr-89.prod:12913)] [kafka] Client session timed out,
>      have not heard from server in 4000ms for sessionid 0x8367a5bd75ea01b,
>      closing socket connection and attempting reconnect
>      2.
>
>      2012/04/23 20:33:26.769 INFO [ZkClient] [main-EventThread] [kafka]
>      zookeeper state changed (Disconnected)
>      3.
>
>      2012/04/23 20:33:27.471 INFO [ClientCnxn]
>      [main-SendThread(srvr-89.prod:12913)] [kafka] Opening socket
> connection to
>      server srvr-88.prod/172.17.229.104:12913
>      4.
>
>      2012/04/23 20:33:27.531 INFO [ClientCnxn]
>      [main-SendThread(srvr-88.prod:12913)] [kafka] Socket connection
> established
>      to srvr-88.prod/172.17.229.104:12913, initiating session
>      5.
>
>      2012/04/23 20:33:28.419 INFO [ClientCnxn]
>      [main-SendThread(srvr-88.prod:12913)] [kafka] Session establishment
>      complete on server srvr-88.prod/172.17.229.104:12913, sessionid =
>      0x8367a5bd75ea01b, negotiated timeout = 6000
>       12.
>
>   Kafka client runs into NoNode exception at 20:33:30
>   2012/04/23 20:33:30.538 INFO [ZookeeperConsumerConnector]
>   [kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76_watcher_executor] [kafka]
>   exception during rebalance
>
> At around the same time, the session C (0x8367a5bd75e9dd5) gets established
> and also creates the node and later expires.
>
>   1.
>
>   On leader srvr-90, session C gets created
>   1.
>
>      time:4/23/12 20:33:09 PM PDT session:0x8367a5bd75e9dd5 cxid:0x0
>      zxid:0x31189e0709 type:createSession timeOut:6000
>       2.
>
>   On server srvr-89, session gets created at 20:33.13.930
>   1.
>
>      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:13,930 - INFO
>       [CommitProcessor:8:NIOServerCnxn@1580] - Established session
>      0x8367a5bd75e9dd5 with negotiated timeout 6000 for client /
>      172.17.87.173:52802
>      2.
>
>      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:17,667 - WARN
>       [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@634] -
>      EndOfStreamException: Unable to read additional data from client
> sessionid
>      0x8367a5bd75e9dd5, likely client has closed socket
>      3.
>
>      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:17,668 - INFO
>       [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] -
>      Closed socket connection for client /172.17.87.173:52802 which had
>      sessionid 0x8367a5bd75e9dd5
>       3.
>
>   3. Session B creates the znode -
>   1.
>
>      time:4/23/12 20:33:17 PM PDT session:0x8367a5bd75e9dd5 cxid:0x1
>      zxid:0x31189e1464 type:create acl:[31,s{'world,'anyone}] ephemeral:true
>      path:/kafka/consumers/kafka-DR/ids/kafka-DR_ech3-app0193.prod-1334094226781-6a7f1b76
>      data:7B202266697265686F73654163746976697479223A20312C2266697265686F736555706461746573223A20312C2247656E657269635365617263684576656E74223A20312C2266697265686F7365536861726573223A20312C2250726F66696C65566965774576656E74223A2031207D
>
>      2.
>
>      Session B gets closed on the leader
>       4.
>
>   Session C expires on leader
>   1.
>
>      time:4/23/12 20:33:29 PM PDT session:0x8367a5bd75e9dd5 cxid:0x0
>      zxid:0x31189e27d0 type:closeSession
>
>
> Since the leader processes create session and create znode for Session C
> first, shouldn't it be the session id that gets returned to the client as
> create session response ?
> Does this sound like a bug ?
>
> *TL;DR*
>
> Ephemeral nodes got deleted for unexpired sessions on ZK 3.3.4.
> The kafka and zookeeper server log4j logs (timestamps are in UTC) as well
> as the leader's txn log are
> here.<http://people.apache.org/%7Enehanarkhede/kafka-misc/zookeeper-outage-2012-04-23/kafka-zk-bug-2012-04-23.tar.gz>
>
> Will appreciate any help with debugging this.
>
> Thanks,
> Neha

Re: Ephemeral node deleted for unexpired sessions in Zookeeper 3.3.4

Posted by Patrick Hunt <ph...@apache.org>.
Please create a jira for this, attach the logs, config, etc... Thanks.

Patrick

On Thu, Apr 26, 2012 at 11:17 AM, Neha Narkhede <ne...@gmail.com> wrote:
> Hi,
>
> This week, we saw a potential bug with zookeeper 3.3.4. In an attempt to
> adding a separate disk for zookeeper transaction logs, our SysOps team
> threw new disks at all the zookeeper servers in our production cluster at
> around the same time. Right after this, we saw degraded performance on our
> zookeeper cluster. And yes, I agree that this degraded behavior is expected
> and we could've done a better job and upgraded one server at a time. Al
> though, the observed impact was that ephemeral nodes got deleted without
> session expiration on the zookeeper clients.
>
> Let me try and describe what I've observed from the Kafka and ZK server
> logs -
>
> Kafka client has a session established with ZK, say Session A, that it has
> been using successfully. At the time of the degraded ZK performance issue,
> Session A expires. Kafka's ZkClient tries to establish another session with
> ZK. After 9 seconds, it establishes a session, say Session B and tries to
> use it for creating a znode. This operation fails with a NodeExists error
> since another session, say session C, has created that znode. This is
> considered OK since ZkClient retries an operation transparently if it gets
>  disconnected and sometimes you can get NodeExists. But then later, session
> C expires and hence the ephemeral node is deleted from ZK. This leads to
> unexpected errors in Kafka since its session, Session B, is still valid and
> hence it expects the znode to be there. The issue is that session C was
> established, created the znode and expired, without the zookeeper client on
> Kafka ever knowing about it.
>
> I've attempted to serialize the sequence of events according to relevant
> data from txn and log4j logs below.
>
> Leader zookeeper is srvr-90.prod
>
> Session A = 0x9367a5bd54c9078
> Session B = 0x8367a5bd75ea01b
> Session C = 0x8367a5bd75e9dd5
> The znode being created =
>  /consumers/kafka-DR/ids/kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76
> zookeeper session timeout is 6 seconds
>
>
>   1.
>
>   Original session A, 0x9367a5bd54c9078 expired at 20:33:09 with zxid
>   0x31189e02d8 on the leader.
>   2.
>
>   Leader processed session termination for 0x9367a5bd54c9078 at
>   20:33.09.000.
>   3.
>
>   Leader realizes client has closed socket, so it closes NIOServerCnxn for
>   0x9367a5bd54c9078 at 20:33:09.009
>   4.
>
>   Kafka client gets Expired state at 20:33:09.161
>   5.
>
>   Kafka client starts re-registering its node in ZK at 20:33.09.209 with
>   socket connection established to srvr-89.prod
>   6.
>
>   Client session 0x0 times out several times, reconnects to another server
>   and repeats session establishment request.
>   7.
>
>   On the leader's txn log, session B 0x8367a5bd75ea01b is created at
>   20:33.17-
>   1.
>
>      time:4/23/12 20:33:17 PM PDT session:0x8367a5bd75ea01b cxid:0x0
>      zxid:0x31189e1985 type:createSession timeOut:6000
>       8.
>
>   On zk server srvr-89, session B 0x8367a5bd75ea01b is established -
>   1.
>
>      zk.log4j.2012-04-23-srvr-89.gz:
>      1.
>
>         2012-04-23 20:33:17,987 - INFO
>          [CommitProcessor:8:NIOServerCnxn@1580] - Established session
>         0x8367a5bd75ea01b with negotiated timeout 6000 for client /
>         172.17.87.173:52865
>         2.
>
>         2012/04/23 20:33:18.022 INFO [ZkClient] [main-EventThread] [kafka]
>         zookeeper state changed (SyncConnected)
>          9.
>
>   At 20:33:18.022 on Kafka client, session is established with session id
>   0x8367a5bd75ea01b
>   1.
>
>      2012/04/23 20:33:18.022 INFO [ClientCnxn]
>      [main-SendThread(srvr-89.prod:12913)] [kafka] Session establishment
>      complete on server srvr-89.prod/172.17.229.105:12913, sessionid =
>      0x8367a5bd75ea01b, negotiated timeout = 6000
>       10.
>
>   Kafka zookeeper client fails to create znode at 20:33.18.358 with
>   Session B due to NodeExists exception
>   1.
>
>      2012/04/23 20:33:18.358 INFO [ZkUtils$]
>      [ZkClient-EventThread-81-srvr-85.prod:12913,srvr-88.prod:12913,srvr-84.prod:12913,srvr-89.prod:12913,srvr-90.prod:12913/kafka]
>      [kafka]
>      /consumers/kafka-DR/ids/kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76
>      exists with value { "firehoseActivity": 1,"firehoseUpdates":
>      1,"GenericSearchEvent": 1,"firehoseShares": 1,"ProfileViewEvent": 1 }
>      during connection loss; this is ok
>       11.
>
>   Kafka client gets Disconnected and SyncConnected with session B at
>   20:33:26
>   1.
>
>      2012/04/23 20:33:26.669 INFO [ClientCnxn]
>      [main-SendThread(srvr-89.prod:12913)] [kafka] Client session timed out,
>      have not heard from server in 4000ms for sessionid 0x8367a5bd75ea01b,
>      closing socket connection and attempting reconnect
>      2.
>
>      2012/04/23 20:33:26.769 INFO [ZkClient] [main-EventThread] [kafka]
>      zookeeper state changed (Disconnected)
>      3.
>
>      2012/04/23 20:33:27.471 INFO [ClientCnxn]
>      [main-SendThread(srvr-89.prod:12913)] [kafka] Opening socket
> connection to
>      server srvr-88.prod/172.17.229.104:12913
>      4.
>
>      2012/04/23 20:33:27.531 INFO [ClientCnxn]
>      [main-SendThread(srvr-88.prod:12913)] [kafka] Socket connection
> established
>      to srvr-88.prod/172.17.229.104:12913, initiating session
>      5.
>
>      2012/04/23 20:33:28.419 INFO [ClientCnxn]
>      [main-SendThread(srvr-88.prod:12913)] [kafka] Session establishment
>      complete on server srvr-88.prod/172.17.229.104:12913, sessionid =
>      0x8367a5bd75ea01b, negotiated timeout = 6000
>       12.
>
>   Kafka client runs into NoNode exception at 20:33:30
>   2012/04/23 20:33:30.538 INFO [ZookeeperConsumerConnector]
>   [kafka-DR_srvr-0193.prod-1334094226781-6a7f1b76_watcher_executor] [kafka]
>   exception during rebalance
>
> At around the same time, the session C (0x8367a5bd75e9dd5) gets established
> and also creates the node and later expires.
>
>   1.
>
>   On leader srvr-90, session C gets created
>   1.
>
>      time:4/23/12 20:33:09 PM PDT session:0x8367a5bd75e9dd5 cxid:0x0
>      zxid:0x31189e0709 type:createSession timeOut:6000
>       2.
>
>   On server srvr-89, session gets created at 20:33.13.930
>   1.
>
>      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:13,930 - INFO
>       [CommitProcessor:8:NIOServerCnxn@1580] - Established session
>      0x8367a5bd75e9dd5 with negotiated timeout 6000 for client /
>      172.17.87.173:52802
>      2.
>
>      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:17,667 - WARN
>       [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@634] -
>      EndOfStreamException: Unable to read additional data from client
> sessionid
>      0x8367a5bd75e9dd5, likely client has closed socket
>      3.
>
>      zk.log4j.2012-04-23-srvr-89.gz:2012-04-23 20:33:17,668 - INFO
>       [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] -
>      Closed socket connection for client /172.17.87.173:52802 which had
>      sessionid 0x8367a5bd75e9dd5
>       3.
>
>   3. Session B creates the znode -
>   1.
>
>      time:4/23/12 20:33:17 PM PDT session:0x8367a5bd75e9dd5 cxid:0x1
>      zxid:0x31189e1464 type:create acl:[31,s{'world,'anyone}] ephemeral:true
>      path:/kafka/consumers/kafka-DR/ids/kafka-DR_ech3-app0193.prod-1334094226781-6a7f1b76
>      data:7B202266697265686F73654163746976697479223A20312C2266697265686F736555706461746573223A20312C2247656E657269635365617263684576656E74223A20312C2266697265686F7365536861726573223A20312C2250726F66696C65566965774576656E74223A2031207D
>
>      2.
>
>      Session B gets closed on the leader
>       4.
>
>   Session C expires on leader
>   1.
>
>      time:4/23/12 20:33:29 PM PDT session:0x8367a5bd75e9dd5 cxid:0x0
>      zxid:0x31189e27d0 type:closeSession
>
>
> Since the leader processes create session and create znode for Session C
> first, shouldn't it be the session id that gets returned to the client as
> create session response ?
> Does this sound like a bug ?
>
> *TL;DR*
>
> Ephemeral nodes got deleted for unexpired sessions on ZK 3.3.4.
> The kafka and zookeeper server log4j logs (timestamps are in UTC) as well
> as the leader's txn log are
> here.<http://people.apache.org/%7Enehanarkhede/kafka-misc/zookeeper-outage-2012-04-23/kafka-zk-bug-2012-04-23.tar.gz>
>
> Will appreciate any help with debugging this.
>
> Thanks,
> Neha