Posted to user@storm.apache.org by Sajith <sa...@gmail.com> on 2014/05/20 16:02:26 UTC

Supervisor getting repeatedly disconnected from ZooKeeper

Hi all,

In my topology, I have observed that one of the supervisor machines gets
repeatedly disconnected from ZooKeeper, and it prints the following error:

EndOfStreamException: Unable to read additional data from client sessionid
0x146193a4b70073d, likely client has closed socket
    at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:662)
2014-05-20 06:51:20,631 [myid:] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
client /204.13.85.2:37938 which had sessionid 0x146193a4b70073d
2014-05-20 06:51:20,631 [myid:] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x146193a4b700741, likely client has closed socket
    at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:662)
2014-05-20 06:51:20,632 [myid:] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
client /204.13.85.2:37942 which had sessionid 0x146193a4b700741
2014-05-20 06:51:20,634 [myid:] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception

Along with the above ZooKeeper errors, the following is printed in the
supervisor log:

2014-05-20 06:59:33 b.s.d.supervisor [INFO]
dfa06019-0c29-4782-94da-c37fcc75243d still hasn't started
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Worker
dfa06019-0c29-4782-94da-c37fcc75243d failed to start
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Worker
4677c74c-8239-4cd3-8ff7-c95c3724e40e failed to start
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Worker
39c70558-c144-4da6-b685-841d7a531ec0 failed to start
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Worker
983c05ff-107e-483c-97e6-bb5c309606ec failed to start
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 39c70558-c144-4da6-b685-841d7a531ec0. Current supervisor time:
1400594374. State: :not-started, Heartbeat: nil
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down
5dd6583a-a5a4-4d76-8797-e885eacdf18f:39c70558-c144-4da6-b685-841d7a531ec0
2014-05-20 06:59:34 b.s.util [INFO] Error when trying to kill 25682.
Process is probably already dead.
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shut down
5dd6583a-a5a4-4d76-8797-e885eacdf18f:39c70558-c144-4da6-b685-841d7a531ec0
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 983c05ff-107e-483c-97e6-bb5c309606ec. Current supervisor time:
1400594374. State: :not-started, Heartbeat: nil
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down
5dd6583a-a5a4-4d76-8797-e885eacdf18f:983c05ff-107e-483c-97e6-bb5c309606ec
2014-05-20 06:59:34 b.s.util [INFO] Error when trying to kill 25684.
Process is probably already dead.
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shut down
5dd6583a-a5a4-4d76-8797-e885eacdf18f:983c05ff-107e-483c-97e6-bb5c309606ec
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 4677c74c-8239-4cd3-8ff7-c95c3724e40e. Current supervisor time:
1400594374. State: :not-started, Heartbeat: nil
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down
5dd6583a-a5a4-4d76-8797-e885eacdf18f:4677c74c-8239-4cd3-8ff7-c95c3724e40e
2014-05-20 06:59:34 b.s.util [INFO] Error when trying to kill 25680.
Process is probably already dead.
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shut down
5dd6583a-a5a4-4d76-8797-e885eacdf18f:4677c74c-8239-4cd3-8ff7-c95c3724e40e
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down and clearing
state for id dfa06019-0c29-4782-94da-c37fcc75243d. Current supervisor time:
1400594374. State: :not-started, Heartbeat: nil
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down
5dd6583a-a5a4-4d76-8797-e885eacdf18f:dfa06019-0c29-4782-94da-c37fcc75243d
2014-05-20 06:59:34 b.s.util [INFO] Error when trying to kill 25679.
Process is probably already dead.
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shut down
5dd6583a-a5a4-4d76-8797-e885eacdf18f:dfa06019-0c29-4782-94da-c37fcc75243d
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Launching worker with
assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
"LatencyMeasureTopology-17-1400594223", :executors [[4 4]]} for this
supervisor 5dd6583a-a5a4-4d76-8797-e885eacdf18f on port 6700 with id
05c9e509-c29f-4310-b959-02f083224518

What's going wrong here? I feel like this is a heartbeat expiry issue. If
so, what are the parameters that I should tweak to avoid it?
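
For reference, these are the storm.yaml timeout settings I think are related
to the ZooKeeper session and worker/supervisor heartbeats (the values below
are just the defaults as I understand them, not tuned recommendations):

# ZooKeeper client session/connection timeouts used by the Storm daemons
storm.zookeeper.session.timeout: 20000      # ms
storm.zookeeper.connection.timeout: 15000   # ms
storm.zookeeper.retry.times: 5
storm.zookeeper.retry.interval: 1000        # ms

# How long the supervisor waits for a newly launched worker to heartbeat
supervisor.worker.start.timeout.secs: 120
# How long the supervisor tolerates a missing worker heartbeat before restarting it
supervisor.worker.timeout.secs: 30

# How long nimbus tolerates missing supervisor/task heartbeats
nimbus.supervisor.timeout.secs: 60
nimbus.task.timeout.secs: 30

Should I be increasing supervisor.worker.start.timeout.secs on that machine,
or is the ZooKeeper session timeout the more likely culprit?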

Thanks,
Sajith.