Posted to user@storm.apache.org by Michael Dev <mi...@outlook.com> on 2014/05/23 22:38:07 UTC

Workers constantly restarted due to session timeout

Hi all,

We are seeing our workers constantly being killed by Storm, with the following logs:
worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}

Eventually Storm decides to kill the worker and restart it, as you can see in the supervisor log. We theorize that the Zookeeper heartbeat thread is being starved by the very high CPU load on the machine (near 100%).

I have increased the connection timeouts in the storm.yaml config file, yet Storm seems to continue to use some unknown value for the client session timeout in the message above:
storm.zookeeper.connection.timeout: 300000
storm.zookeeper.session.timeout: 300000

1) Which timeout config governs the timeout in the message above?
2) Is it expected for Storm to be unable to keep up with its heartbeat threads under high CPU, or is our theory incorrect?

Thanks,
Michael

Re: Workers constantly restarted due to session timeout

Posted by Michael Rose <mi...@fullcontact.com>.
Do you have GC logging turned on? With a 60GB heap I could pretty easily
see stop-the-world GCs taking longer than the session timeout.
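
If it isn't on already, something along these lines in storm.yaml will write a per-worker GC log you can line up against the restarts (flag names are for the HotSpot JVMs of that era, the log path is just an example, and %ID% should be substituted with the worker port if I remember right -- double-check on 0.9.0.1):

worker.childopts: "-Xmx60g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/storm/gc-worker-%ID%.log"

Any full GC that runs longer than the ZK session timeout will show up there as a single multi-second pause.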

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
michael@fullcontact.com


On Thu, May 29, 2014 at 10:45 AM, Michael Dev <mi...@outlook.com>
wrote:

>  Derek,
>
> We are currently running with -Xmx60G and only about 20-30G of that has
> been observed to be used. I'm still observing workers restarted every 2
> minutes.
>
> What timeout is relevant to increase for the heartbeats in question? Is it
> be a config on the Zookeeper side we can increase to make our topology more
> resilient to these restarts?
>
> Michael
>
> > Date: Fri, 23 May 2014 15:50:50 -0500
> > From: derekd@yahoo-inc.com
> > To: user@storm.incubator.apache.org
> > Subject: Re: Workers constantly restarted due to session timeout
>
> >
> > > 2) Is this expected behavior for Storm to be unable to keep up with
> heartbeat threads under high CPU or is our theory incorrect?
> >
> > Check your JVM max heap size (-Xmx). If you use too much, the JVM will
> garbage-collect, and that will stop everything--including the thread whose
> job it is to do the heartbeating.
> >
> >
> >
> > --
> > Derek
> >
> > On 5/23/14, 15:38, Michael Dev wrote:
> > > Hi all,
> > >
> > > We are seeing our workers constantly being killed by Storm with to the
> following logs:
> > > worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed
> out, have not heard from the server in 28105ms for sessionid
> 0x14619bf2f4e0109, closing socket and attempting reconnect
> > > supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and
> clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current
> supervisor time: 1400876250. State: :disallowed, Heartbeat:
> #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249,
> :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> > >
> > > Eventually Storm decides to just kill the worker and restart it as you
> see in the supervisor log. We theorize this is the Zookeeper heartbeat
> thread and it is being choked out due to very high CPU load on the machine
> (near 100%).
> > >
> > > I have increased the connection timeouts in the storm.yaml config file
> yet Storm seems to continue to use some unknown value for the above client
> session timeout messages:
> > > storm.zookeeper.connection.timeout: 300000
> > > storm.zookeeper.session.timeout: 300000
> > >
> > > 1) What timeout config is appropriate for the above timeout message?
> > > 2) Is this expected behavior for Storm to be unable to keep up with
> heartbeat threads under high CPU or is our theory incorrect?
> > >
> > > Thanks,
> > > Michael
> > >
> > >
>

RE: Workers constantly restarted due to session timeout

Posted by Michael Dev <mi...@outlook.com>.
Thanks again for all the advice.

Turns out the issue was related to GC. The long pauses in the Uptime column were directly correlated with full GC runs pausing all threads. We had written a spout that caches input tuples locally while waiting for the cluster to invoke ISpout.nextTuple(). The topology simply wasn't operating fast enough, so the spout's local cache would grow unbounded until memory usage hit whatever -Xmx was (30, 60, 100 GB), resulting in painfully long stop-the-world GC runs. We were falling short of keeping up with the spout's input by around 100-200k tuples every 30 seconds.
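
For anyone who hits the same thing, the general pattern to avoid it is to bound the spout's local cache so the producer blocks (or drops) instead of letting the heap fill. A rough sketch, with made-up names and sizes rather than our actual spout:

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BoundedCacheSpout extends BaseRichSpout {
    // Hard cap on locally cached tuples; the feeding thread blocks in put()
    // instead of growing the cache (and the heap) without limit.
    private final BlockingQueue<Values> cache = new LinkedBlockingQueue<Values>(10000);
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // ... start the reader thread here; it calls cache.put(new Values(record)) ...
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; emit at most one cached tuple per call.
        Values v = cache.poll();
        if (v != null) {
            collector.emit(v);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record"));
    }
}

With a bound in place the worst case is the spout falling behind its source, rather than a 60-100 GB heap and multi-minute stop-the-world collections.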

Unfortunately this means we'll need to reevaluate the inefficient sections of our topology (barring more hardware), but at least the issue has been identified.

Thanks,
Michael

> Date: Tue, 3 Jun 2014 14:06:38 -0500
> From: derekd@yahoo-inc.com
> To: user@storm.incubator.apache.org
> Subject: Re: Workers constantly restarted due to session timeout
> 
> > 1) Is it appropriate to run Zookeeper in parallel on the same node with the storm services?
> 
> I recommend separate, and even then to ZK storage to a path on its own disk device if possible.  ZK is a bottleneck for storm, and when it is too slow lots of bad things can happen.
> 
> Some folks use shared hosts (with or without VMs) in which to run ZK.  In those situations, VMs or processes owned by other users doing unrelated things can put load on the disk, and that will dramatically slow down ZK.
> 
> 
> > 2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its client. Should we downgrade our installation?
> 
> I am not sure about that, since we've been running with ZK 3.4.5 in storm (and on the server).  It might work very well, but I have not tried it.  I do not remember if anyone on this list has identified any issues with this 3.3.3 + 3.4.5 combo.
> 
> 
> One setting we changed to dramatically improve performance with ZK was setting the system property '-Dzookeeper.forceSync=no' on the server.
> 
> Normally, ZK will sync to disk on every write, and that causes two seeks: one for the data and one for the data log.  It gets really expensive with all of the workers heartbeating in through ZK.  Be warned that with only on ZK server, an outage could leave you in an inconsistent state.
> 
> You might check to see if the ZK server is keeping up.  There are tools like iotop that can give information about disk load.
> 
> -- 
> Derek
> 
> On 6/3/14, 13:14, Michael Dev wrote:
> >
> >
> >
> > Thank you Derek for the explanation between :disallowed and :timed-out. That was extremely helpful in understanding what decisions Storm is making. I increased the timeouts for both messages to 5 minutes and returned the zookeeper session timeouts to their default values. This made it plain to see periods in time where the "Uptime" column for the busiest component's Worker would not update (1-2 minutes, potentially never resulting in a worker restart).
> >
> > ZK logs report constant disconnects and reconnects while the "Uptime" is not updating:
> > 16:28:30,440 - INFO NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:54004 which has sessionid 0x1464f1fddc1018f
> > 16:31:18,364 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from /10.49.21.151:34419
> > 16.31:18,365 - WARN ZookeeperServer@793 - Connection request from old client /10.49.21.151:34419; will be dropped if server is in r-o mode
> > 16:31:18,365 - INFO ZookeeperServer@832 - Client attempting to renew session 0x264f1fddc4021e at /10.49.21.151:34419
> > 16:31:18,365 - INFO Learner@107 - Revalidating client: 0x264f1fddc4021e
> > 16:31:18,366 - INFO ZooKeeperServer@588 - Invalid session 0x264f1fddc4021e for client /10.49.21.151:34419, probably expired
> > 16:31:18,366 - NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:34419 which had sessionid 0x264f1fddc4021e
> > 16:31:18,378 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from  /10.49.21.151:34420
> > 16:31:18,391 - WARN ZookeeperServer@793 - Connection request from old
> > client /10.49.21.151:34420; will be dropped if server is in r-o mode
> > 16:31:18,392 - INFO ZookeeperServer@839 - Client attempting to establish new session at /10.49.21.151:34420
> > 16:31:18,394 - INFO ZookeeperServer@595 - Established session 0x1464fafddc10218 with negotiated timeout 20000 for client /10.49.21.151:34420
> > 16.31.44,002 - INFO NIOServerCnxn@1001 - Closed socket connection for /10.49.21.151:34420 which had sessionid 0x1464fafddc10218
> > 16.32.48,055 - INFO NIOServerCxnFactory@197 - Accepted socket connection from /10.49.21.151:34432
> > 16:32:48,056 - WARN ZookeeperServer@793 - Connection request from old
> > client /10.49.21.151:34432; will be dropped if server is in r-o mode
> > 16.32.48,056 - INFO ZookeeperServer@832 - Client attempting to renew session 0x2464fafddc4021f at /10.49.21.151:34432
> > 16:32:48,056 - INFO Learner@107 - Revalidating client: 0x2464fafddc4021f
> > 16:32:48,057 - INFO ZooKeeperServer@588 - Invalid session 0x2464fafddc4021f for client /10.49.21.151:34432, probably expired
> > 16:32:48,057 - NIOServerCnxn@1001 - Closed socket connection for client
> > /10.49.21.151:34432 which had sessionid 0x2464fafddc4021f
> > ...etc until Storm has had enough and restarts the worker resulting in this
> > 16:47:20,706 - NIOServerCnxn@349 - Caught end of stream exception
> > EndOfStreamException: Unable to read additional data from client sessionid 0x3464f20777e01cf, likely client has closed socket
> >     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
> >     at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
> >     at java.langThread.run(Thread.java:745)
> >
> > 1) Is it appropriate to run Zookeeper in parallel on the same node with the storm services?
> > 2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its client. Should we downgrade our installation?
> >
> >> Date: Sat, 31 May 2014 13:50:57 -0500
> >> From: derekd@yahoo-inc.com
> >> To: user@storm.incubator.apache.org
> >> Subject: Re: Workers constantly restarted due to session timeout
> >>
> >>> Are you certain that nimbus.task.timeout.secs is the correct config?
> >>
> >> That config controls the length of time before nimbus thinks a worker has timed out.
> >>
> >> https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L369-L372
> >>
> >> Its default is 30 seconds.
> >>
> >> https://github.com/apache/incubator-storm/blob/master/conf/defaults.yaml#L45
> >>
> >>
> >>> storm.zookeeper.connection.timeout: 300000
> >>> storm.zookeeper.session.timeout: 300000
> >>
> >> So these will make the situation worse while workers losing connections to ZK, since it will cause the workers to wait longer before reconnecting.  They could wait until nimbus thinks the worker is dead before trying to reconnect.
> >>
> >>
> >>>>>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> >>
> >> Here if the "State" is ":disallowed", then that means it is Nimbus that de-scheduled the worker on that node--very probably in this case because it thought it was dead.  When the supervisor sees this, it will kill the worker.  (A state of ":timed-out" means instead that the worker did not heartbeat to its supervisor in time.)
> >>
> >> If the CPU load on the worker was high enough to prevent heartbeats, then I would expect to see :timed-out state above instead of :disallowed.  The reason is that the worker has only 5 seconds to do those heartbeats, while it has 30 seconds to heartbeat to nimbus (via ZK).  (More often what happens to cause this is memory has run out and garbage collection stops everything just long enough.)
> >>
> >> The real question is why connections from the worker to ZK are timing out in the first place.
> >>
> >> What about the ZK servers?  Sometimes ZooKeeper servers cannot keep up, and that causes pretty severe problems with timeouts.
> >> --
> >> Derek
> >>
> >> On 5/30/14, 17:51, Michael Dev wrote:
> >>>
> >>>
> >>>
> >>> Michael R,
> >>>
> >>> We don't have GC logging enabled yet. I lean towards agreeing with Derek that I don't think it's the issue but I will take a look at logging on Monday just to verify.
> >>>
> >>> Derek D,
> >>>
> >>> Are you certain that nimbus.task.timeout.secs is the correct config? Tracing through the github code it would seem to me that this is used as the timeout value when making a Thrift connection to the Nimbus node. I thought the logs indicated the timeout was occurring in the session connection to zookeeper as evidenced by ClientCxn being a Zookeeper class.
> >>>
> >>> I discovered that we were running with the default maxSessionTimeout zookeeper config of 40 seconds. This would explain why our storm config of 5 minutes was not being picked up (but obviously not the root problem nor why timeout messages report 14-20 second timeout values). Typically we saw losses in connection occur when our cluster becomes super busy with a burst of data pushing workers to near 100% CPU. I'm testing the following configs over the weekend to see if they at least allow us to prevent chronic worker restarting during the brief high CPU periods.
> >>>
> >>> Our current setup is as follows:
> >>> Storm 0.9.0.1
> >>> 3 Storm node cluster
> >>> 1 Supervisor running per Storm node
> >>> 1-3 topologies deployed on the Storm cluster (depends on dev/prod/etc systems)
> >>> 3 Workers per topology
> >>> Variable number of executors per component depending on how slow that component is. Example file i/o has many executors (say 12) while in memory validation has only 3 executors. Always maintaining a multiple of the number of workers for even distribution.
> >>> Kyro serialization with Java Serialization failover disabled to ensure we're using 100% kryo between bolts.
> >>>
> >>> zoo.cfg
> >>> tickTime=2000
> >>> dataDir=/srv/zookeeper/data
> >>> clientPort=2182
> >>> initLimit=5
> >>> syncLimit=2
> >>> skipACL=true
> >>> maxClientCnxns=1000
> >>> maxSessionTimeout=300000
> >>> server.1=node1
> >>> server.2=node2
> >>> server.3=node3
> >>>
> >>> storm yaml
> >>> storm.zookeeper.port: 2182
> >>> storm.local.dir=/srv/storm/data
> >>> nimbus.host: "node1"
> >>> storm.zookeeper.servers:
> >>>    - "node1"
> >>>    - "node2"
> >>>    - "node3"
> >>> supervisor.slot.ports:
> >>>    - 6700
> >>>    - 6701
> >>>    - 6702
> >>>    - 6703
> >>>    - 6704
> >>> java.library.path: "/usr/lib:/srv/storm/lib"
> >>> #Storm 0.9 netty support
> >>> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
> >>> storm.messaging.netty.server_worker_threads: 1
> >>> storm.messaging.netty.client_worker_threads: 1
> >>> storm.messaging.netty.buffer_size: 5242880
> >>> storm.messaging.netty.max_retries: 100
> >>> storm.messaging.netty.max_wait_ms: 1000
> >>> storm.messaging.netty.min_wait_ms: 100
> >>> # Timeout band-aids in testing
> >>> topology.receiver.buffer.size: 2
> >>> storm.zookeeper.connection.timeout: 300000
> >>> storm.zookeeper.session.timeout: 300000
> >>>
> >>>
> >>>> Date: Thu, 29 May 2014 12:56:19 -0500
> >>>> From: derekd@yahoo-inc.com
> >>>> To: user@storm.incubator.apache.org
> >>>> Subject: Re: Workers constantly restarted due to session timeout
> >>>>
> >>>> OK, so GC is probably not the issue.
> >>>>
> >>>>
> >>>> Specifically, this is a connection timeout to ZK from the worker, and it is resulting in nimbus removing it from the assignments for that node.  In turn, the supervisor reads the schedule and shoots the worker because it is no longer scheduled to be running.
> >>>>
> >>>>
> >>>> The relevant config is nimbus.task.timeout.secs, and I think the default is 30s.  What you could try is to make nimbus timeout longer than storm.zookeeper.session.timeout.  This would allow the ZK connections to timeout and get a heartbeat in before nimbus decides they have timed out.
> >>>>
> >>>>
> >>>> But the real question is why are the ZK sessions timing out at all?
> >>>>
> >>>> Do you see this on several workers on that node?  What about the supervisor?  What about other nodes?  What do the ZK logs say?
> >>>>
> >>>>
> >>>> --
> >>>> Derek
> >>>>
> >>>> On 5/29/14, 11:45, Michael Dev wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> Derek,
> >>>>>
> >>>>> We are currently running with -Xmx60G and only about 20-30G of that has been observed to be used. I'm still observing workers restarted every 2 minutes.
> >>>>>
> >>>>> What timeout is relevant to increase for the heartbeats in question? Is it be a config on the Zookeeper side we can increase to make our topology more resilient to these restarts?
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>>> Date: Fri, 23 May 2014 15:50:50 -0500
> >>>>>> From: derekd@yahoo-inc.com
> >>>>>> To: user@storm.incubator.apache.org
> >>>>>> Subject: Re: Workers constantly restarted due to session timeout
> >>>>>>
> >>>>>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> >>>>>>
> >>>>>> Check your JVM max heap size (-Xmx).  If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Derek
> >>>>>>
> >>>>>> On 5/23/14, 15:38, Michael Dev wrote:
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> We are seeing our workers constantly being killed by Storm with to the following logs:
> >>>>>>> worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
> >>>>>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> >>>>>>>
> >>>>>>> Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
> >>>>>>>
> >>>>>>> I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
> >>>>>>> storm.zookeeper.connection.timeout: 300000
> >>>>>>> storm.zookeeper.session.timeout: 300000
> >>>>>>>
> >>>>>>> 1) What timeout config is appropriate for the above timeout  message?
> >>>>>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Michael
> >>>>>>>      		 	   		
> >>>>>>>
> >>>>>
> >>>>>     		 	   		
> >>>>>
> >>>
> >>>    		 	   		
> >>>
> >
> >   		 	   		
> >

Re: Workers constantly restarted due to session timeout

Posted by Derek Dagit <de...@yahoo-inc.com>.
> 1) Is it appropriate to run Zookeeper in parallel on the same node with the storm services?

I recommend running them separately, and even then pointing ZK storage to a path on its own disk device if possible.  ZK is a bottleneck for Storm, and when it is too slow lots of bad things can happen.

Some folks use shared hosts (with or without VMs) in which to run ZK.  In those situations, VMs or processes owned by other users doing unrelated things can put load on the disk, and that will dramatically slow down ZK.


> 2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its client. Should we downgrade our installation?

I am not sure about that, since we've been running with the ZK 3.4.5 client in storm (and 3.4.5 on the server).  The 3.3.3 + 3.4.5 combo might work very well, but I have not tried it, and I do not remember anyone on this list identifying any issues with it.


One setting we changed to dramatically improve performance with ZK was setting the system property '-Dzookeeper.forceSync=no' on the server.

Normally, ZK will sync to disk on every write, and that causes two seeks: one for the data and one for the data log.  It gets really expensive with all of the workers heartbeating in through ZK.  Be warned that with only one ZK server, an outage could leave you in an inconsistent state.
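
The usual place to set it (assuming a stock zkServer.sh) is the JVM flags picked up by zkEnv.sh, e.g. by creating conf/java.env, which zkEnv.sh sources if it exists:

export JVMFLAGS="$JVMFLAGS -Dzookeeper.forceSync=no"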

You might check to see if the ZK server is keeping up.  There are tools like iotop that can give information about disk load.
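
The four-letter commands are also a quick way to check from the ZK side (using your clientPort of 2182):

echo stat | nc node1 2182
echo mntr | nc node1 2182    # mntr is available in 3.4+

Watch the latency numbers and the outstanding request count there while the workers are busy.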

-- 
Derek

On 6/3/14, 13:14, Michael Dev wrote:
>
>
>
> Thank you Derek for the explanation between :disallowed and :timed-out. That was extremely helpful in understanding what decisions Storm is making. I increased the timeouts for both messages to 5 minutes and returned the zookeeper session timeouts to their default values. This made it plain to see periods in time where the "Uptime" column for the busiest component's Worker would not update (1-2 minutes, potentially never resulting in a worker restart).
>
> ZK logs report constant disconnects and reconnects while the "Uptime" is not updating:
> 16:28:30,440 - INFO NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:54004 which has sessionid 0x1464f1fddc1018f
> 16:31:18,364 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from /10.49.21.151:34419
> 16.31:18,365 - WARN ZookeeperServer@793 - Connection request from old client /10.49.21.151:34419; will be dropped if server is in r-o mode
> 16:31:18,365 - INFO ZookeeperServer@832 - Client attempting to renew session 0x264f1fddc4021e at /10.49.21.151:34419
> 16:31:18,365 - INFO Learner@107 - Revalidating client: 0x264f1fddc4021e
> 16:31:18,366 - INFO ZooKeeperServer@588 - Invalid session 0x264f1fddc4021e for client /10.49.21.151:34419, probably expired
> 16:31:18,366 - NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:34419 which had sessionid 0x264f1fddc4021e
> 16:31:18,378 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from  /10.49.21.151:34420
> 16:31:18,391 - WARN ZookeeperServer@793 - Connection request from old
> client /10.49.21.151:34420; will be dropped if server is in r-o mode
> 16:31:18,392 - INFO ZookeeperServer@839 - Client attempting to establish new session at /10.49.21.151:34420
> 16:31:18,394 - INFO ZookeeperServer@595 - Established session 0x1464fafddc10218 with negotiated timeout 20000 for client /10.49.21.151:34420
> 16.31.44,002 - INFO NIOServerCnxn@1001 - Closed socket connection for /10.49.21.151:34420 which had sessionid 0x1464fafddc10218
> 16.32.48,055 - INFO NIOServerCxnFactory@197 - Accepted socket connection from /10.49.21.151:34432
> 16:32:48,056 - WARN ZookeeperServer@793 - Connection request from old
> client /10.49.21.151:34432; will be dropped if server is in r-o mode
> 16.32.48,056 - INFO ZookeeperServer@832 - Client attempting to renew session 0x2464fafddc4021f at /10.49.21.151:34432
> 16:32:48,056 - INFO Learner@107 - Revalidating client: 0x2464fafddc4021f
> 16:32:48,057 - INFO ZooKeeperServer@588 - Invalid session 0x2464fafddc4021f for client /10.49.21.151:34432, probably expired
> 16:32:48,057 - NIOServerCnxn@1001 - Closed socket connection for client
> /10.49.21.151:34432 which had sessionid 0x2464fafddc4021f
> ...etc until Storm has had enough and restarts the worker resulting in this
> 16:47:20,706 - NIOServerCnxn@349 - Caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 0x3464f20777e01cf, likely client has closed socket
>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>     at java.langThread.run(Thread.java:745)
>
> 1) Is it appropriate to run Zookeeper in parallel on the same node with the storm services?
> 2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its client. Should we downgrade our installation?
>
>> Date: Sat, 31 May 2014 13:50:57 -0500
>> From: derekd@yahoo-inc.com
>> To: user@storm.incubator.apache.org
>> Subject: Re: Workers constantly restarted due to session timeout
>>
>>> Are you certain that nimbus.task.timeout.secs is the correct config?
>>
>> That config controls the length of time before nimbus thinks a worker has timed out.
>>
>> https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L369-L372
>>
>> Its default is 30 seconds.
>>
>> https://github.com/apache/incubator-storm/blob/master/conf/defaults.yaml#L45
>>
>>
>>> storm.zookeeper.connection.timeout: 300000
>>> storm.zookeeper.session.timeout: 300000
>>
>> So these will make the situation worse while workers losing connections to ZK, since it will cause the workers to wait longer before reconnecting.  They could wait until nimbus thinks the worker is dead before trying to reconnect.
>>
>>
>>>>>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
>>
>> Here if the "State" is ":disallowed", then that means it is Nimbus that de-scheduled the worker on that node--very probably in this case because it thought it was dead.  When the supervisor sees this, it will kill the worker.  (A state of ":timed-out" means instead that the worker did not heartbeat to its supervisor in time.)
>>
>> If the CPU load on the worker was high enough to prevent heartbeats, then I would expect to see :timed-out state above instead of :disallowed.  The reason is that the worker has only 5 seconds to do those heartbeats, while it has 30 seconds to heartbeat to nimbus (via ZK).  (More often what happens to cause this is memory has run out and garbage collection stops everything just long enough.)
>>
>> The real question is why connections from the worker to ZK are timing out in the first place.
>>
>> What about the ZK servers?  Sometimes ZooKeeper servers cannot keep up, and that causes pretty severe problems with timeouts.
>> --
>> Derek
>>
>> On 5/30/14, 17:51, Michael Dev wrote:
>>>
>>>
>>>
>>> Michael R,
>>>
>>> We don't have GC logging enabled yet. I lean towards agreeing with Derek that I don't think it's the issue but I will take a look at logging on Monday just to verify.
>>>
>>> Derek D,
>>>
>>> Are you certain that nimbus.task.timeout.secs is the correct config? Tracing through the github code it would seem to me that this is used as the timeout value when making a Thrift connection to the Nimbus node. I thought the logs indicated the timeout was occurring in the session connection to zookeeper as evidenced by ClientCxn being a Zookeeper class.
>>>
>>> I discovered that we were running with the default maxSessionTimeout zookeeper config of 40 seconds. This would explain why our storm config of 5 minutes was not being picked up (but obviously not the root problem nor why timeout messages report 14-20 second timeout values). Typically we saw losses in connection occur when our cluster becomes super busy with a burst of data pushing workers to near 100% CPU. I'm testing the following configs over the weekend to see if they at least allow us to prevent chronic worker restarting during the brief high CPU periods.
>>>
>>> Our current setup is as follows:
>>> Storm 0.9.0.1
>>> 3 Storm node cluster
>>> 1 Supervisor running per Storm node
>>> 1-3 topologies deployed on the Storm cluster (depends on dev/prod/etc systems)
>>> 3 Workers per topology
>>> Variable number of executors per component depending on how slow that component is. Example file i/o has many executors (say 12) while in memory validation has only 3 executors. Always maintaining a multiple of the number of workers for even distribution.
>>> Kyro serialization with Java Serialization failover disabled to ensure we're using 100% kryo between bolts.
>>>
>>> zoo.cfg
>>> tickTime=2000
>>> dataDir=/srv/zookeeper/data
>>> clientPort=2182
>>> initLimit=5
>>> syncLimit=2
>>> skipACL=true
>>> maxClientCnxns=1000
>>> maxSessionTimeout=300000
>>> server.1=node1
>>> server.2=node2
>>> server.3=node3
>>>
>>> storm yaml
>>> storm.zookeeper.port: 2182
>>> storm.local.dir=/srv/storm/data
>>> nimbus.host: "node1"
>>> storm.zookeeper.servers:
>>>    - "node1"
>>>    - "node2"
>>>    - "node3"
>>> supervisor.slot.ports:
>>>    - 6700
>>>    - 6701
>>>    - 6702
>>>    - 6703
>>>    - 6704
>>> java.library.path: "/usr/lib:/srv/storm/lib"
>>> #Storm 0.9 netty support
>>> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>> storm.messaging.netty.server_worker_threads: 1
>>> storm.messaging.netty.client_worker_threads: 1
>>> storm.messaging.netty.buffer_size: 5242880
>>> storm.messaging.netty.max_retries: 100
>>> storm.messaging.netty.max_wait_ms: 1000
>>> storm.messaging.netty.min_wait_ms: 100
>>> # Timeout band-aids in testing
>>> topology.receiver.buffer.size: 2
>>> storm.zookeeper.connection.timeout: 300000
>>> storm.zookeeper.session.timeout: 300000
>>>
>>>
>>>> Date: Thu, 29 May 2014 12:56:19 -0500
>>>> From: derekd@yahoo-inc.com
>>>> To: user@storm.incubator.apache.org
>>>> Subject: Re: Workers constantly restarted due to session timeout
>>>>
>>>> OK, so GC is probably not the issue.
>>>>
>>>>
>>>> Specifically, this is a connection timeout to ZK from the worker, and it is resulting in nimbus removing it from the assignments for that node.  In turn, the supervisor reads the schedule and shoots the worker because it is no longer scheduled to be running.
>>>>
>>>>
>>>> The relevant config is nimbus.task.timeout.secs, and I think the default is 30s.  What you could try is to make nimbus timeout longer than storm.zookeeper.session.timeout.  This would allow the ZK connections to timeout and get a heartbeat in before nimbus decides they have timed out.
>>>>
>>>>
>>>> But the real question is why are the ZK sessions timing out at all?
>>>>
>>>> Do you see this on several workers on that node?  What about the supervisor?  What about other nodes?  What do the ZK logs say?
>>>>
>>>>
>>>> --
>>>> Derek
>>>>
>>>> On 5/29/14, 11:45, Michael Dev wrote:
>>>>>
>>>>>
>>>>>
>>>>> Derek,
>>>>>
>>>>> We are currently running with -Xmx60G and only about 20-30G of that has been observed to be used. I'm still observing workers restarted every 2 minutes.
>>>>>
>>>>> What timeout is relevant to increase for the heartbeats in question? Is it be a config on the Zookeeper side we can increase to make our topology more resilient to these restarts?
>>>>>
>>>>> Michael
>>>>>
>>>>>> Date: Fri, 23 May 2014 15:50:50 -0500
>>>>>> From: derekd@yahoo-inc.com
>>>>>> To: user@storm.incubator.apache.org
>>>>>> Subject: Re: Workers constantly restarted due to session timeout
>>>>>>
>>>>>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
>>>>>>
>>>>>> Check your JVM max heap size (-Xmx).  If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Derek
>>>>>>
>>>>>> On 5/23/14, 15:38, Michael Dev wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We are seeing our workers constantly being killed by Storm with to the following logs:
>>>>>>> worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
>>>>>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
>>>>>>>
>>>>>>> Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
>>>>>>>
>>>>>>> I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
>>>>>>> storm.zookeeper.connection.timeout: 300000
>>>>>>> storm.zookeeper.session.timeout: 300000
>>>>>>>
>>>>>>> 1) What timeout config is appropriate for the above timeout  message?
>>>>>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Michael
>>>>>>>      		 	   		
>>>>>>>
>>>>>
>>>>>     		 	   		
>>>>>
>>>
>>>    		 	   		
>>>
>
>   		 	   		
>

RE: Workers constantly restarted due to session timeout

Posted by Michael Dev <mi...@outlook.com>.


Thank you Derek for the explanation of the difference between :disallowed and :timed-out. That was extremely helpful in understanding what decisions Storm is making. I increased the timeouts for both messages to 5 minutes and returned the zookeeper session timeouts to their default values. This made it plain to see periods where the "Uptime" column for the busiest component's worker would not update (1-2 minutes, potentially never resulting in a worker restart).
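
(The two knobs I mean, if I have the names right from defaults.yaml, are:

nimbus.task.timeout.secs: 300
supervisor.worker.timeout.secs: 300

i.e. the nimbus-side timeout behind :disallowed and the supervisor-side timeout behind :timed-out.)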

ZK logs report constant disconnects and reconnects while the "Uptime" is not updating:
16:28:30,440 - INFO NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:54004 which has sessionid 0x1464f1fddc1018f
16:31:18,364 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from /10.49.21.151:34419
16:31:18,365 - WARN ZookeeperServer@793 - Connection request from old client /10.49.21.151:34419; will be dropped if server is in r-o mode
16:31:18,365 - INFO ZookeeperServer@832 - Client attempting to renew session 0x264f1fddc4021e at /10.49.21.151:34419
16:31:18,365 - INFO Learner@107 - Revalidating client: 0x264f1fddc4021e 
16:31:18,366 - INFO ZooKeeperServer@588 - Invalid session 0x264f1fddc4021e for client /10.49.21.151:34419, probably expired
16:31:18,366 - NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:34419 which had sessionid 0x264f1fddc4021e 
16:31:18,378 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from  /10.49.21.151:34420
16:31:18,391 - WARN ZookeeperServer@793 - Connection request from old client /10.49.21.151:34420; will be dropped if server is in r-o mode
16:31:18,392 - INFO ZookeeperServer@839 - Client attempting to establish new session at /10.49.21.151:34420
16:31:18,394 - INFO ZookeeperServer@595 - Established session 0x1464fafddc10218 with negotiated timeout 20000 for client /10.49.21.151:34420
16:31:44,002 - INFO NIOServerCnxn@1001 - Closed socket connection for /10.49.21.151:34420 which had sessionid 0x1464fafddc10218
16:32:48,055 - INFO NIOServerCnxnFactory@197 - Accepted socket connection from /10.49.21.151:34432
16:32:48,056 - WARN ZookeeperServer@793 - Connection request from old client /10.49.21.151:34432; will be dropped if server is in r-o mode
16:32:48,056 - INFO ZookeeperServer@832 - Client attempting to renew session 0x2464fafddc4021f at /10.49.21.151:34432
16:32:48,056 - INFO Learner@107 - Revalidating client: 0x2464fafddc4021f 
16:32:48,057 - INFO ZooKeeperServer@588 - Invalid session 0x2464fafddc4021f for client /10.49.21.151:34432, probably expired
16:32:48,057 - NIOServerCnxn@1001 - Closed socket connection for client /10.49.21.151:34432 which had sessionid 0x2464fafddc4021f
...etc until Storm has had enough and restarts the worker resulting in this
16:47:20,706 - NIOServerCnxn@349 - Caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x3464f20777e01cf, likely client has closed socket
   at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
   at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
   at java.lang.Thread.run(Thread.java:745)

1) Is it appropriate to run Zookeeper in parallel on the same node with the storm services?
2) We have zookeeper 3.4.5 installed. I see Storm uses zookeeper-3.3.3 as its client. Should we downgrade our installation?

> Date: Sat, 31 May 2014 13:50:57 -0500
> From: derekd@yahoo-inc.com
> To: user@storm.incubator.apache.org
> Subject: Re: Workers constantly restarted due to session timeout
> 
> > Are you certain that nimbus.task.timeout.secs is the correct config?
> 
> That config controls the length of time before nimbus thinks a worker has timed out.
> 
> https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L369-L372
> 
> Its default is 30 seconds.
> 
> https://github.com/apache/incubator-storm/blob/master/conf/defaults.yaml#L45
> 
> 
> > storm.zookeeper.connection.timeout: 300000
> > storm.zookeeper.session.timeout: 300000
> 
> So these will make the situation worse while workers losing connections to ZK, since it will cause the workers to wait longer before reconnecting.  They could wait until nimbus thinks the worker is dead before trying to reconnect.
> 
> 
> >>>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> 
> Here if the "State" is ":disallowed", then that means it is Nimbus that de-scheduled the worker on that node--very probably in this case because it thought it was dead.  When the supervisor sees this, it will kill the worker.  (A state of ":timed-out" means instead that the worker did not heartbeat to its supervisor in time.)
> 
> If the CPU load on the worker was high enough to prevent heartbeats, then I would expect to see :timed-out state above instead of :disallowed.  The reason is that the worker has only 5 seconds to do those heartbeats, while it has 30 seconds to heartbeat to nimbus (via ZK).  (More often what happens to cause this is memory has run out and garbage collection stops everything just long enough.)
> 
> The real question is why connections from the worker to ZK are timing out in the first place.
> 
> What about the ZK servers?  Sometimes ZooKeeper servers cannot keep up, and that causes pretty severe problems with timeouts.
> -- 
> Derek
> 
> On 5/30/14, 17:51, Michael Dev wrote:
> >
> >
> >
> > Michael R,
> >
> > We don't have GC logging enabled yet. I lean towards agreeing with Derek that I don't think it's the issue but I will take a look at logging on Monday just to verify.
> >
> > Derek D,
> >
> > Are you certain that nimbus.task.timeout.secs is the correct config? Tracing through the github code it would seem to me that this is used as the timeout value when making a Thrift connection to the Nimbus node. I thought the logs indicated the timeout was occurring in the session connection to zookeeper as evidenced by ClientCxn being a Zookeeper class.
> >
> > I discovered that we were running with the default maxSessionTimeout zookeeper config of 40 seconds. This would explain why our storm config of 5 minutes was not being picked up (but obviously not the root problem nor why timeout messages report 14-20 second timeout values). Typically we saw losses in connection occur when our cluster becomes super busy with a burst of data pushing workers to near 100% CPU. I'm testing the following configs over the weekend to see if they at least allow us to prevent chronic worker restarting during the brief high CPU periods.
> >
> > Our current setup is as follows:
> > Storm 0.9.0.1
> > 3 Storm node cluster
> > 1 Supervisor running per Storm node
> > 1-3 topologies deployed on the Storm cluster (depends on dev/prod/etc systems)
> > 3 Workers per topology
> > Variable number of executors per component depending on how slow that component is. Example file i/o has many executors (say 12) while in memory validation has only 3 executors. Always maintaining a multiple of the number of workers for even distribution.
> > Kyro serialization with Java Serialization failover disabled to ensure we're using 100% kryo between bolts.
> >
> > zoo.cfg
> > tickTime=2000
> > dataDir=/srv/zookeeper/data
> > clientPort=2182
> > initLimit=5
> > syncLimit=2
> > skipACL=true
> > maxClientCnxns=1000
> > maxSessionTimeout=300000
> > server.1=node1
> > server.2=node2
> > server.3=node3
> >
> > storm yaml
> > storm.zookeeper.port: 2182
> > storm.local.dir=/srv/storm/data
> > nimbus.host: "node1"
> > storm.zookeeper.servers:
> >   - "node1"
> >   - "node2"
> >   - "node3"
> > supervisor.slot.ports:
> >   - 6700
> >   - 6701
> >   - 6702
> >   - 6703
> >   - 6704
> > java.library.path: "/usr/lib:/srv/storm/lib"
> > #Storm 0.9 netty support
> > storm.messaging.transport: "backtype.storm.messaging.netty.Context"
> > storm.messaging.netty.server_worker_threads: 1
> > storm.messaging.netty.client_worker_threads: 1
> > storm.messaging.netty.buffer_size: 5242880
> > storm.messaging.netty.max_retries: 100
> > storm.messaging.netty.max_wait_ms: 1000
> > storm.messaging.netty.min_wait_ms: 100
> > # Timeout band-aids in testing
> > topology.receiver.buffer.size: 2
> > storm.zookeeper.connection.timeout: 300000
> > storm.zookeeper.session.timeout: 300000
> >
> >
> >> Date: Thu, 29 May 2014 12:56:19 -0500
> >> From: derekd@yahoo-inc.com
> >> To: user@storm.incubator.apache.org
> >> Subject: Re: Workers constantly restarted due to session timeout
> >>
> >> OK, so GC is probably not the issue.
> >>
> >>
> >> Specifically, this is a connection timeout to ZK from the worker, and it is resulting in nimbus removing it from the assignments for that node.  In turn, the supervisor reads the schedule and shoots the worker because it is no longer scheduled to be running.
> >>
> >>
> >> The relevant config is nimbus.task.timeout.secs, and I think the default is 30s.  What you could try is to make nimbus timeout longer than storm.zookeeper.session.timeout.  This would allow the ZK connections to timeout and get a heartbeat in before nimbus decides they have timed out.
> >>
> >>
> >> But the real question is why are the ZK sessions timing out at all?
> >>
> >> Do you see this on several workers on that node?  What about the supervisor?  What about other nodes?  What do the ZK logs say?
> >>
> >>
> >> --
> >> Derek
> >>
> >> On 5/29/14, 11:45, Michael Dev wrote:
> >>>
> >>>
> >>>
> >>> Derek,
> >>>
> >>> We are currently running with -Xmx60G and only about 20-30G of that has been observed to be used. I'm still observing workers restarted every 2 minutes.
> >>>
> >>> What timeout is relevant to increase for the heartbeats in question? Is it be a config on the Zookeeper side we can increase to make our topology more resilient to these restarts?
> >>>
> >>> Michael
> >>>
> >>>> Date: Fri, 23 May 2014 15:50:50 -0500
> >>>> From: derekd@yahoo-inc.com
> >>>> To: user@storm.incubator.apache.org
> >>>> Subject: Re: Workers constantly restarted due to session timeout
> >>>>
> >>>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> >>>>
> >>>> Check your JVM max heap size (-Xmx).  If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Derek
> >>>>
> >>>> On 5/23/14, 15:38, Michael Dev wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> We are seeing our workers constantly being killed by Storm with to the following logs:
> >>>>> worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
> >>>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> >>>>>
> >>>>> Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
> >>>>>
> >>>>> I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
> >>>>> storm.zookeeper.connection.timeout: 300000
> >>>>> storm.zookeeper.session.timeout: 300000
> >>>>>
> >>>>> 1) What timeout config is appropriate for the above timeout  message?
> >>>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> >>>>>
> >>>>> Thanks,
> >>>>> Michael
> >>>>>     		 	   		
> >>>>>
> >>>
> >>>    		 	   		
> >>>
> >
> >   		 	   		
> >


Re: Workers constantly restarted due to session timeout

Posted by Derek Dagit <de...@yahoo-inc.com>.
> Are you certain that nimbus.task.timeout.secs is the correct config?

That config controls the length of time before nimbus thinks a worker has timed out.

https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L369-L372

Its default is 30 seconds.

https://github.com/apache/incubator-storm/blob/master/conf/defaults.yaml#L45
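
That line in defaults.yaml is simply:

nimbus.task.timeout.secs: 30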


> storm.zookeeper.connection.timeout: 300000
> storm.zookeeper.session.timeout: 300000

So these will make the situation worse while workers are losing connections to ZK, since they will cause the workers to wait longer before reconnecting.  They could wait until nimbus thinks the worker is dead before trying to reconnect.
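
In other words, keep the ZK session timeout comfortably below the nimbus timeout, along these lines (illustrative values, not tuned for your load):

storm.zookeeper.session.timeout: 20000   # milliseconds
nimbus.task.timeout.secs: 30             # seconds

That way a worker whose ZK session drops still has time to reconnect and get a heartbeat in before nimbus writes it off.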


>>>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}

Here if the "State" is ":disallowed", then that means it is Nimbus that de-scheduled the worker on that node--very probably in this case because it thought it was dead.  When the supervisor sees this, it will kill the worker.  (A state of ":timed-out" means instead that the worker did not heartbeat to its supervisor in time.)

If the CPU load on the worker was high enough to prevent heartbeats, then I would expect to see :timed-out state above instead of :disallowed.  The reason is that the worker has only 5 seconds to do those heartbeats, while it has 30 seconds to heartbeat to nimbus (via ZK).  (More often what happens to cause this is memory has run out and garbage collection stops everything just long enough.)

The real question is why connections from the worker to ZK are timing out in the first place.

What about the ZK servers?  Sometimes ZooKeeper servers cannot keep up, and that causes pretty severe problems with timeouts.
-- 
Derek

On 5/30/14, 17:51, Michael Dev wrote:
>
>
>
> Michael R,
>
> We don't have GC logging enabled yet. I lean towards agreeing with Derek that I don't think it's the issue but I will take a look at logging on Monday just to verify.
>
> Derek D,
>
> Are you certain that nimbus.task.timeout.secs is the correct config? Tracing through the github code it would seem to me that this is used as the timeout value when making a Thrift connection to the Nimbus node. I thought the logs indicated the timeout was occurring in the session connection to zookeeper as evidenced by ClientCxn being a Zookeeper class.
>
> I discovered that we were running with the default maxSessionTimeout zookeeper config of 40 seconds. This would explain why our storm config of 5 minutes was not being picked up (but obviously not the root problem nor why timeout messages report 14-20 second timeout values). Typically we saw losses in connection occur when our cluster becomes super busy with a burst of data pushing workers to near 100% CPU. I'm testing the following configs over the weekend to see if they at least allow us to prevent chronic worker restarting during the brief high CPU periods.
>
> Our current setup is as follows:
> Storm 0.9.0.1
> 3 Storm node cluster
> 1 Supervisor running per Storm node
> 1-3 topologies deployed on the Storm cluster (depends on dev/prod/etc systems)
> 3 Workers per topology
> Variable number of executors per component depending on how slow that component is. Example file i/o has many executors (say 12) while in memory validation has only 3 executors. Always maintaining a multiple of the number of workers for even distribution.
> Kyro serialization with Java Serialization failover disabled to ensure we're using 100% kryo between bolts.
>
> zoo.cfg
> tickTime=2000
> dataDir=/srv/zookeeper/data
> clientPort=2182
> initLimit=5
> syncLimit=2
> skipACL=true
> maxClientCnxns=1000
> maxSessionTimeout=300000
> server.1=node1
> server.2=node2
> server.3=node3
>
> storm yaml
> storm.zookeeper.port: 2182
> storm.local.dir=/srv/storm/data
> nimbus.host: "node1"
> storm.zookeeper.servers:
>   - "node1"
>   - "node2"
>   - "node3"
> supervisor.slot.ports:
>   - 6700
>   - 6701
>   - 6702
>   - 6703
>   - 6704
> java.library.path: "/usr/lib:/srv/storm/lib"
> #Storm 0.9 netty support
> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
> storm.messaging.netty.server_worker_threads: 1
> storm.messaging.netty.client_worker_threads: 1
> storm.messaging.netty.buffer_size: 5242880
> storm.messaging.netty.max_retries: 100
> storm.messaging.netty.max_wait_ms: 1000
> storm.messaging.netty.min_wait_ms: 100
> # Timeout band-aids in testing
> topology.receiver.buffer.size: 2
> storm.zookeeper.connection.timeout: 300000
> storm.zookeeper.session.timeout: 300000
>
>
>> Date: Thu, 29 May 2014 12:56:19 -0500
>> From: derekd@yahoo-inc.com
>> To: user@storm.incubator.apache.org
>> Subject: Re: Workers constantly restarted due to session timeout
>>
>> OK, so GC is probably not the issue.
>>
>>
>> Specifically, this is a connection timeout to ZK from the worker, and it is resulting in nimbus removing it from the assignments for that node.  In turn, the supervisor reads the schedule and shoots the worker because it is no longer scheduled to be running.
>>
>>
>> The relevant config is nimbus.task.timeout.secs, and I think the default is 30s.  What you could try is to make nimbus timeout longer than storm.zookeeper.session.timeout.  This would allow the ZK connections to timeout and get a heartbeat in before nimbus decides they have timed out.
>>
>>
>> But the real question is why are the ZK sessions timing out at all?
>>
>> Do you see this on several workers on that node?  What about the supervisor?  What about other nodes?  What do the ZK logs say?
>>
>>
>> --
>> Derek
>>
>> On 5/29/14, 11:45, Michael Dev wrote:
>>>
>>>
>>>
>>> Derek,
>>>
>>> We are currently running with -Xmx60G and only about 20-30G of that has been observed to be used. I'm still observing workers restarted every 2 minutes.
>>>
>>> What timeout is relevant to increase for the heartbeats in question? Is it be a config on the Zookeeper side we can increase to make our topology more resilient to these restarts?
>>>
>>> Michael
>>>
>>>> Date: Fri, 23 May 2014 15:50:50 -0500
>>>> From: derekd@yahoo-inc.com
>>>> To: user@storm.incubator.apache.org
>>>> Subject: Re: Workers constantly restarted due to session timeout
>>>>
>>>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
>>>>
>>>> Check your JVM max heap size (-Xmx).  If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
>>>>
>>>>
>>>>
>>>> --
>>>> Derek
>>>>
>>>> On 5/23/14, 15:38, Michael Dev wrote:
>>>>> Hi all,
>>>>>
>>>>> We are seeing our workers constantly being killed by Storm with to the following logs:
>>>>> worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
>>>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
>>>>>
>>>>> Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
>>>>>
>>>>> I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
>>>>> storm.zookeeper.connection.timeout: 300000
>>>>> storm.zookeeper.session.timeout: 300000
>>>>>
>>>>> 1) What timeout config is appropriate for the above timeout  message?
>>>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
>>>>>
>>>>> Thanks,
>>>>> Michael
>>>>>     		 	   		
>>>>>
>>>
>>>    		 	   		
>>>
>
>   		 	   		
>

RE: Workers constantly restarted due to session timeout

Posted by Michael Dev <mi...@outlook.com>.


Michael R,

We don't have GC logging enabled yet. I lean towards agreeing with Derek that I don't think it's the issue but I will take a look at logging on Monday just to verify.

Derek D,

Are you certain that nimbus.task.timeout.secs is the correct config? Tracing through the github code it would seem to me that this is used as the timeout value when making a Thrift connection to the Nimbus node. I thought the logs indicated the timeout was occurring in the session connection to zookeeper as evidenced by ClientCxn being a Zookeeper class.

I discovered that we were running with zookeeper's default maxSessionTimeout of 40 seconds. This would explain why our storm config of 5 minutes was not being picked up (though obviously not the root problem, nor why the timeout messages report 14-20 second values). Typically we saw connection losses when our cluster became very busy with a burst of data, pushing workers to near 100% CPU. I'm testing the following configs over the weekend to see if they at least prevent chronic worker restarts during these brief high-CPU periods.
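
(For reference: ZooKeeper's maxSessionTimeout defaults to 20 * tickTime, so with tickTime=2000 that is 20 * 2000 ms = 40000 ms = 40 seconds, which is where the 40-second cap comes from; minSessionTimeout defaults to 2 * tickTime.)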

Our current setup is as follows:
Storm 0.9.0.1
3 Storm node cluster
1 Supervisor running per Storm node
1-3 topologies deployed on the Storm cluster (depends on dev/prod/etc systems)
3 Workers per topology
Variable number of executors per component depending on how slow that component is; for example, file I/O has many executors (say 12) while in-memory validation has only 3. We always maintain a multiple of the number of workers for even distribution.
Kryo serialization with Java Serialization fallback disabled to ensure we're using 100% Kryo between bolts.

zoo.cfg
tickTime=2000
dataDir=/srv/zookeeper/data
clientPort=2182
initLimit=5
syncLimit=2
skipACL=true
maxClientCnxns=1000
maxSessionTimeout=300000
server.1=node1
server.2=node2
server.3=node3

storm yaml
storm.zookeeper.port: 2182
storm.local.dir=/srv/storm/data
nimbus.host: "node1"
storm.zookeeper.servers:
 - "node1"
 - "node2"
 - "node3"
supervisor.slot.ports:
 - 6700
 - 6701
 - 6702
 - 6703
 - 6704
java.library.path: "/usr/lib:/srv/storm/lib"
#Storm 0.9 netty support
storm.messaging.transport: "backtype.storm.messaging.netty.Context" 
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1 
storm.messaging.netty.buffer_size: 5242880 
storm.messaging.netty.max_retries: 100 
storm.messaging.netty.max_wait_ms: 1000 
storm.messaging.netty.min_wait_ms: 100 
# Timeout band-aids in testing
topology.receiver.buffer.size: 2
storm.zookeeper.connection.timeout: 300000
storm.zookeeper.session.timeout: 300000


> Date: Thu, 29 May 2014 12:56:19 -0500
> From: derekd@yahoo-inc.com
> To: user@storm.incubator.apache.org
> Subject: Re: Workers constantly restarted due to session timeout
> 
> OK, so GC is probably not the issue.
> 
> 
> Specifically, this is a connection timeout to ZK from the worker, and it is resulting in nimbus removing it from the assignments for that node.  In turn, the supervisor reads the schedule and shoots the worker because it is no longer scheduled to be running.
> 
> 
> The relevant config is nimbus.task.timeout.secs, and I think the default is 30s.  What you could try is to make nimbus timeout longer than storm.zookeeper.session.timeout.  This would allow the ZK connections to timeout and get a heartbeat in before nimbus decides they have timed out.
> 
> 
> But the real question is why are the ZK sessions timing out at all?
> 
> Do you see this on several workers on that node?  What about the supervisor?  What about other nodes?  What do the ZK logs say?
> 
> 
> -- 
> Derek
> 
> On 5/29/14, 11:45, Michael Dev wrote:
> >
> >
> >
> > Derek,
> >
> > We are currently running with -Xmx60G and only about 20-30G of that has been observed to be used. I'm still observing workers restarted every 2 minutes.
> >
> > What timeout is relevant to increase for the heartbeats in question? Is it be a config on the Zookeeper side we can increase to make our topology more resilient to these restarts?
> >
> > Michael
> >
> >> Date: Fri, 23 May 2014 15:50:50 -0500
> >> From: derekd@yahoo-inc.com
> >> To: user@storm.incubator.apache.org
> >> Subject: Re: Workers constantly restarted due to session timeout
> >>
> >>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> >>
> >> Check your JVM max heap size (-Xmx).  If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
> >>
> >>
> >>
> >> --
> >> Derek
> >>
> >> On 5/23/14, 15:38, Michael Dev wrote:
> >>> Hi all,
> >>>
> >>> We are seeing our workers constantly being killed by Storm with to the following logs:
> >>> worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
> >>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> >>>
> >>> Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
> >>>
> >>> I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
> >>> storm.zookeeper.connection.timeout: 300000
> >>> storm.zookeeper.session.timeout: 300000
> >>>
> >>> 1) What timeout config is appropriate for the above timeout  message?
> >>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> >>>
> >>> Thanks,
> >>> Michael

Re: Workers constantly restarted due to session timeout

Posted by Derek Dagit <de...@yahoo-inc.com>.
OK, so GC is probably not the issue.


Specifically, this is a connection timeout to ZK from the worker, and it is resulting in nimbus removing it from the assignments for that node.  In turn, the supervisor reads the schedule and shoots the worker because it is no longer scheduled to be running.


The relevant config is nimbus.task.timeout.secs, and I think the default is 30s.  What you could try is to make the nimbus timeout longer than storm.zookeeper.session.timeout.  This would allow the ZK connections to time out and get a heartbeat in before nimbus decides they have timed out.
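
Something like this in storm.yaml would express that relationship (numbers purely illustrative, not a recommendation):

# let the ZK session give up and reconnect before nimbus declares the worker dead
storm.zookeeper.session.timeout: 30000   # milliseconds
nimbus.task.timeout.secs: 60             # seconds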


But the real question is why are the ZK sessions timing out at all?

Do you see this on several workers on that node?  What about the supervisor?  What about other nodes?  What do the ZK logs say?


-- 
Derek

On 5/29/14, 11:45, Michael Dev wrote:
>
>
>
> Derek,
>
> We are currently running with -Xmx60G and only about 20-30G of that has been observed to be used. I'm still observing workers restarted every 2 minutes.
>
> What timeout is relevant to increase for the heartbeats in question? Is it be a config on the Zookeeper side we can increase to make our topology more resilient to these restarts?
>
> Michael
>
>> Date: Fri, 23 May 2014 15:50:50 -0500
>> From: derekd@yahoo-inc.com
>> To: user@storm.incubator.apache.org
>> Subject: Re: Workers constantly restarted due to session timeout
>>
>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
>>
>> Check your JVM max heap size (-Xmx).  If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
>>
>>
>>
>> --
>> Derek
>>
>> On 5/23/14, 15:38, Michael Dev wrote:
>>> Hi all,
>>>
>>> We are seeing our workers constantly being killed by Storm with to the following logs:
>>> worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
>>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
>>>
>>> Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
>>>
>>> I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
>>> storm.zookeeper.connection.timeout: 300000
>>> storm.zookeeper.session.timeout: 300000
>>>
>>> 1) What timeout config is appropriate for the above timeout  message?
>>> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
>>>
>>> Thanks,
>>> Michael
>>>    		 	   		
>>>
>
>   		 	   		
>

Re: Workers constantly restarted due to session timeout

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
Can you share your storm config and version?

> On May 29, 2014, at 12:45 PM, Michael Dev <mi...@outlook.com> wrote:
> 
> Derek,
> 
> We are currently running with -Xmx60G and only about 20-30G of that has been observed to be used. I'm still observing workers restarted every 2 minutes.
> 
> What timeout is relevant to increase for the heartbeats in question? Is it be a config on the Zookeeper side we can increase to make our topology more resilient to these restarts?
> 
> Michael
> 
> > Date: Fri, 23 May 2014 15:50:50 -0500
> > From: derekd@yahoo-inc.com
> > To: user@storm.incubator.apache.org
> > Subject: Re: Workers constantly restarted due to session timeout
> > 
> > > 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> > 
> > Check your JVM max heap size (-Xmx). If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
> > 
> > 
> > 
> > -- 
> > Derek
> > 
> > On 5/23/14, 15:38, Michael Dev wrote:
> > > Hi all,
> > >
> > > We are seeing our workers constantly being killed by Storm with to the following logs:
> > > worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
> > > supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> > >
> > > Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
> > >
> > > I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
> > > storm.zookeeper.connection.timeout: 300000
> > > storm.zookeeper.session.timeout: 300000
> > >
> > > 1) What timeout config is appropriate for the above timeout message?
> > > 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> > >
> > > Thanks,
> > > Michael
> > > 
> > >

RE: Workers constantly restarted due to session timeout

Posted by Michael Dev <mi...@outlook.com>.


Derek,

We are currently running with -Xmx60G and only about 20-30G of that has been observed to be used. I'm still observing workers restarted every 2 minutes.

What timeout is relevant to increase for the heartbeats in question? Is there a config on the ZooKeeper side we can increase to make our topology more resilient to these restarts?

Michael

> Date: Fri, 23 May 2014 15:50:50 -0500
> From: derekd@yahoo-inc.com
> To: user@storm.incubator.apache.org
> Subject: Re: Workers constantly restarted due to session timeout
> 
> > 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> 
> Check your JVM max heap size (-Xmx).  If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
> 
> 
> 
> -- 
> Derek
> 
> On 5/23/14, 15:38, Michael Dev wrote:
> > Hi all,
> >
> > We are seeing our workers constantly being killed by Storm with to the following logs:
> > worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
> > supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> >
> > Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
> >
> > I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
> > storm.zookeeper.connection.timeout: 300000
> > storm.zookeeper.session.timeout: 300000
> >
> > 1) What timeout config is appropriate for the above timeout  message?
> > 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
> >
> > Thanks,
> > Michael

Re: Workers constantly restarted due to session timeout

Posted by Derek Dagit <de...@yahoo-inc.com>.
> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?

Check your JVM max heap size (-Xmx).  If you use too much, the JVM will garbage-collect, and that will stop everything--including the thread whose job it is to do the heartbeating.
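
For what it's worth, the worker JVM gets its heap from worker.childopts in storm.yaml (the default is -Xmx768m, if I remember right), so that is the setting to check, e.g.:

worker.childopts: "-Xmx2g"   # illustrative value only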



-- 
Derek

On 5/23/14, 15:38, Michael Dev wrote:
> Hi all,
>
> We are seeing our workers constantly being killed by Storm with to the following logs:
> worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
>
> Eventually Storm decides to just kill the worker and restart it as you see in the supervisor log. We theorize this is the Zookeeper heartbeat thread and it is being choked out due to very high CPU load on the machine (near 100%).
>
> I have increased the connection timeouts in the storm.yaml config file yet Storm seems to continue to use some unknown value for the above client session timeout messages:
> storm.zookeeper.connection.timeout: 300000
> storm.zookeeper.session.timeout: 300000
>
> 1) What timeout config is appropriate for the above timeout  message?
> 2) Is this expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU or is our theory incorrect?
>
> Thanks,
> Michael