Posted to user@storm.apache.org by Erik Weathers <ew...@groupon.com> on 2016/05/01 03:22:20 UTC

Re: How Does Nimbus Decide to Restart Topology?

(I don't know anything about the pacemaker service introduced in storm 1.0,
so this statement is pre-1.0).

The executor threads within the worker processes write heartbeats to
ZooKeeper.  If they aren't successfully heartbeating then it could be many
things:

1. ZK too busy? (Seems unlikely)
2. Network too busy? (Seems unlikely)
3. Worker process died due to exception (this is almost always what we see)
4. Worker process hung (e.g., doing GC). (This would usually first be caught
by the supervisor on that host, since it checks a local heartbeat file that
the worker normally writes to every second -- if the heartbeat doesn't get
refreshed before the timeout, then the supervisor kills the worker process
with State being :timed-out (or :time-out, something like that). This of
course depends on the various timeout config values you have on the worker
and nimbus hosts; see the sketch of that check below.)
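
For illustration, here is a minimal sketch of that staleness check: it just
compares the modification time of each worker's local heartbeat files
against a timeout. The directory layout (storm.local.dir/workers/<worker-id>/heartbeats)
and the 30-second timeout are assumptions that can differ by version and
config, so treat it as a debugging aid, not the supervisor's actual code:

import java.io.File;

// Sketch only: flags workers whose local heartbeat files are older than a
// timeout. The workers-dir layout and the 30s value are assumptions (compare
// supervisor.worker.timeout.secs); adjust for your installation.
public class WorkerHeartbeatStalenessCheck {
    public static void main(String[] args) {
        File workersDir = new File(args.length > 0 ? args[0] : "/var/storm/workers");
        long timeoutMs = 30_000L;
        File[] workers = workersDir.listFiles(File::isDirectory);
        if (workers == null) {
            System.err.println("no worker dirs under " + workersDir);
            return;
        }
        long now = System.currentTimeMillis();
        for (File worker : workers) {
            File[] hbFiles = new File(worker, "heartbeats").listFiles();
            long newest = 0L;
            if (hbFiles != null) {
                for (File f : hbFiles) {
                    newest = Math.max(newest, f.lastModified());
                }
            }
            long ageMs = now - newest;
            System.out.printf("%s: last heartbeat %d ms ago%s%n",
                    worker.getName(), ageMs, ageMs > timeoutMs ? " <-- stale" : "");
        }
    }
}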

- Erik

On Saturday, April 30, 2016, Kevin Conaway <ke...@gmail.com>
wrote:

> We are using Storm 0.10 and we noticed that Nimbus decided to restart our
> topology.  From researching past threads it seems like this is related to
> not receiving heartbeats from the supervisors but I'm unsure if this was
> the case.  Our topology was mostly idle at the time that the restart was
> triggered.
>
> We have a 5 node Zookeeper (3.4.5) ensemble.  On one of the ZK nodes, I
> saw the following messages at the time of the restart:
>
> 2016-04-30 01:33:46,001 [myid:4] - INFO
>  [SessionTracker:ZooKeeperServer@325] - Expiring session
> 0x45453e198e8007f, timeout of 20000ms exceeded
> 2016-04-30 01:33:46,003 [myid:4] - INFO
>  [SessionTracker:ZooKeeperServer@325] - Expiring session
> 0x25453e1c2640085, timeout of 20000ms exceeded
> 2016-04-30 01:33:46,003 [myid:4] - INFO
>  [SessionTracker:ZooKeeperServer@325] - Expiring session
> 0x45453e198e80076, timeout of 20000ms exceeded
> 2016-04-30 01:33:48,003 [myid:4] - INFO
>  [SessionTracker:ZooKeeperServer@325] - Expiring session
> 0x35453e1a529008b, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,001 [myid:4] - INFO
>  [SessionTracker:ZooKeeperServer@325] - Expiring session
> 0x15453e198d10084, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,002 [myid:4] - INFO
>  [SessionTracker:ZooKeeperServer@325] - Expiring session
> 0x35453e1a5290090, timeout of 20000ms exceeded
> 2016-04-30 01:33:50,002 [myid:4] - INFO
>  [SessionTracker:ZooKeeperServer@325] - Expiring session
> 0x15453e198d1008e, timeout of 20000ms exceeded
>
> In the nimbus log, there was the following log message:
>
> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8] not
> alive
>
> Shortly thereafter, the supervisors started restarting the workers.  The
>  following log message was in the supervisor log:
>
> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and clearing
> state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current supervisor time:
> 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040,
> :storm-id "<topology>", :executors [[111 111] [75 75] [51 51] [3 3] [39 39]
> [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87] [171 171]
> [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>
> Previous threads have suggested that this was due to heavy GC causing the
> heartbeats not to reach Zookeeper but the topology was idle at this time so
> I don't think GC was the culprit.  The GC par new time was about 50ms on
> each node (as reported to Graphite).
>
> Thoughts on what could have happened here and how to debug further?
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>

Re: How Does Nimbus Decide to Restart Topology?

Posted by Kevin Conaway <ke...@gmail.com>.
Looking at the storm worker thread dumps, it looks like the bolt/spout
components are all waiting for the topology to become active (including the
__system and __acker components):

"Thread-37-__acker" #70 prio=5 os_prio=0 tid=0x00007f6feca57000 nid=0x3a8d
waiting on condition [0x00007f6f54ef2000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713)
at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477)
at clojure.lang.AFn.run(AFn.java:22)
at java.lang.Thread.run(Thread.java:745)

...

"Thread-35-my-bolt" #68 prio=5 os_prio=0 tid=0x00007f6feca53000 nid=0x3a8b
waiting on condition [0x00007f6f550f4000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713)
at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477)
at clojure.lang.AFn.run(AFn.java:22)
at java.lang.Thread.run(Thread.java:745)

...
"Thread-17-__system" #50 prio=5 os_prio=0 tid=0x00007f6feca31800 nid=0x3a79
waiting on condition [0x00007f6f563a7000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713)
at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477)
at clojure.lang.AFn.run(AFn.java:22)
at java.lang.Thread.run(Thread.java:745)

Looking in the Storm code, I see that they are sleeping here:

(while (not @(:storm-active-atom executor-data))
  (Thread/sleep 100))

At what point does the topology become "active"?


On Mon, May 2, 2016 at 11:38 AM, Kevin Conaway <ke...@gmail.com>
wrote:

> Some more interesting info.
>
> As I mentioned in the last post, we have 6 workers.  Each worker has 2
> slots.
>
> Only one of the supervisors received a state :timed-out message.  After it
> received the message, it restarted both workers successfully.
>
> The other 5 supervisors all experienced the same issue as I mentioned in
> the last post.  The workers were launched twice.  Whats interesting is that
> both workers on those 5 nodes came up the second time in a hung state, they
> all had a lot of ERROR messages from the netty client about not being able
> to connect to the other nodes:
>
> 2016-04-30 01:34:59.758 b.s.m.n.Client [ERROR] connection attempt 26 to
> Netty-Client-StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700
> failed: java.net.ConnectException: Connection refused:
> StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700
> It looks like the bolts and spouts never got fully prepared (as evidenced
> by lack of log messages indicating so)
>
> On the one supervisor which _did_ receive the timeout message, the 2
> workers came up just fine.
>
> On Mon, May 2, 2016 at 11:21 AM, Kevin Conaway <ke...@gmail.com>
> wrote:
>
>> Unfortunately we're not capturing disk i/o in our metrics, I can look in
>> to doing that for next time.
>>
>> We're not capturing GC logs, we are using the graphite storm metric
>> consumer to push metrics to graphite, one of which is the GC time from the
>> default GC mxbean.
>>
>> > I'm assuming you're saying that multiple workers had state :timed-out
>> at once?
>>
>> We have 6 workers.  Only one had state :timed-out, the others had state
>> :disallowed.  Looking at one of the supervisors as an example, it looks
>> like it received multiple requests to reschedule the worker which caused
>> the worker to be launched multiple times.  Is this normal?
>>
>> 2016-04-30 01:34:00.872 b.s.d.supervisor [INFO] Shutting down and
>> clearing state for id 589b0ed5-c4e9-422f-a6c5-5d65145915f7. Current
>> supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs
>> 1461980040, :storm-id "<topology>", :executors [[124 124] [64 64] [196 196]
>> [40 40] [28 28] [184 184] [100 100] [-1 -1] [172 172] [16 16] [52 52] [148
>> 148] [136 136] [112 112] [76 76] [88 88] [160 160] [4 4]], :port 6700}
>>
>> 2016-04-30 01:34:00.874 b.s.d.supervisor [INFO] Shutting down
>> 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:589b0ed5-c4e9-422f-a6c5-5d65145915f7
>>
>> 2016-04-30 01:34:02.013 b.s.d.supervisor [INFO] Shutting down and
>> clearing state for id 460a7d66-06de-4ca5-9140-7d46dcdea841. Current
>> supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs
>> 1461980040, :storm-id "<topology>", :executors [[178 178] [58 58] [190 190]
>> [118 118] [22 22] [142 142] [-1 -1] [166 166] [106 106] [70 70] [10 10] [46
>> 46] [82 82] [154 154] [94 94] [34 34] [130 130]], :port 6701}
>>
>> 2016-04-30 01:34:02.014 b.s.d.supervisor [INFO] Shutting down
>> 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:460a7d66-06de-4ca5-9140-7d46dcdea841
>>
>> 2016-04-30 01:34:03.095 b.s.d.supervisor [INFO] Launching worker with
>> assignment {:storm-id "<topology>", :executors [[3 3] [33 33] [103 103]
>> [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [153
>> 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113
>> 113]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port
>> 6700 with id 567a18fd-33d1-49b6-a3f4-ace65641bd67
>>
>> 2016-04-30 01:34:03.122 b.s.d.supervisor [INFO] Launching worker with
>> assignment {:storm-id "<topology>", :executors [[8 8] [188 188] [68 68]
>> [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48
>> 48] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78
>> 78]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6701
>> with id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>>
>>
>> 2016-04-30 01:34:41.322 b.s.d.supervisor [INFO] Shutting down and
>> clearing state for id 567a18fd-33d1-49b6-a3f4-ace65641bd67. Current
>> supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs
>> 1461980080, :storm-id "<topology>", :executors [[3 3] [33 33] [103 103]
>> [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [-1 -1]
>> [153 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183]
>> [113 113]], :port 6700}
>>
>> 2016-04-30 01:34:41.323 b.s.d.supervisor [INFO] Shutting down
>> 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:567a18fd-33d1-49b6-a3f4-ace65641bd67
>>
>> 2016-04-30 01:34:42.353 b.s.d.supervisor [INFO] Shutting down and
>> clearing state for id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6. Current
>> supervisor time: 1461980081. State: :disallowed, Heartbeat: {:time-secs
>> 1461980080, :storm-id "<topology>", :executors [[8 8] [188 188] [68 68]
>> [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48
>> 48] [-1 -1] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168
>> 168] [78 78]], :port 6701}
>>
>> 2016-04-30 01:34:42.354 b.s.d.supervisor [INFO] Shutting down
>> 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>>
>> On Sun, May 1, 2016 at 5:52 PM, Erik Weathers <ew...@groupon.com>
>> wrote:
>>
>>> Maybe disk I/O was high?  Are you capturing GC logs to disk in unique
>>> files (you can sub in the PID and timestamp into the GC log filename)?  I
>>> know you believe it's not responsible, but it's the only thing I've ever
>>> found to be responsible thus far.  (Except for a problem in storm 0.9.3
>>> with netty that has since been fixed -- we worked around that by
>>> downgrading to zero-MQ.)  You might try monitoring the heartbeat files
>>> written by the workers to watch for the file creation to be happening less
>>> frequently than once per second.
>>>
>>> > all of the worker sessions expired at the same time
>>>
>>> I'm assuming you're saying that multiple workers had state :timed-out at
>>> once?  Was that on the same host?  If the state is :disallowed, that is
>>> perfectly normal when the reassignment happens, as I described earlier.
>>>
>>> - Erik
>>>
>>> On Sunday, May 1, 2016, Kevin Conaway <ke...@gmail.com> wrote:
>>>
>>>> Any tips on where to continue investigating or other metrics to capture?
>>>>
>>>> As i mentioned before, the topology was mostly idle. Low cpu usage, low
>>>> gc time (cms parnew), stable heap, no eth errors. Its hard to see why all
>>>> of the worker sessions expired at the same time
>>>>
>>>> On Sunday, May 1, 2016, Erik Weathers <ew...@groupon.com> wrote:
>>>>
>>>>> To be clear, the supervisor wasn't shutting  itself down, it was
>>>>> killing the worker process.
>>>>>
>>>>> Also for clarity, the :disallowed state simply means a new
>>>>> task->worker assignment has been computed by the nimbus and so the
>>>>> supervisor has noticed from the ZK state that the currently running worker
>>>>> is no longer supposed to be running.
>>>>>
>>>>> Regarding the :timed-out state it *definitely* means the worker was
>>>>> hung and couldn't write to the local heartbeat file within the timeout (30
>>>>> secs by default).  As for *why*... That is hard to determine.
>>>>>
>>>>> - Erik
>>>>>
>>>>> On Sunday, May 1, 2016, Kevin Conaway <ke...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Erik.  We're using Storm 0.10 so Pacemaker doesn't come in to
>>>>>> play here.
>>>>>>
>>>>>> 3. Worker process died due to exception (this is almost always what
>>>>>> we see)
>>>>>> 4. Worker process hung (eg doing GC).
>>>>>>
>>>>>> I don't think its either of these, TBH.  There were no abnormal
>>>>>> terminations in the logs (or in the Storm error logs). I have the GC
>>>>>> metrics that were reported to graphite and the GC time was fairly low
>>>>>> (~50ms).
>>>>>>
>>>>>> My feeling is that it is Zookeeper related and not worker related
>>>>>> because it appears that ZK also expired sessions from the supervisor nodes
>>>>>> as well:
>>>>>>
>>>>>> I saw the following messages in the log of one of the ZK nodes:
>>>>>>
>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>
>>>>>> 2016-04-30 01:33:46,005 [myid:4] - INFO  [ProcessThread(sid:4
>>>>>> cport:-1)::PrepRequestProcessor@476] - Processed session termination
>>>>>> for sessionid: 0x45453e198e80076
>>>>>>
>>>>>> 2016-04-30 01:33:46,009 [myid:4] - INFO
>>>>>> [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection
>>>>>> for client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>>>>
>>>>>> And some corresponding log messages on one of the supervisor log
>>>>>> files (for the same ZK session ID).  It looks like this occurred right as
>>>>>> the supervisor was shutting down due to Nimbus restarting it:
>>>>>>
>>>>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to
>>>>>> read additional data from server sessionid 0x45453e198e80076, likely server
>>>>>> has closed socket, closing socket connection and attempting reconnect
>>>>>>
>>>>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and
>>>>>> clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current
>>>>>> supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs
>>>>>> 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98
>>>>>> 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74]
>>>>>> [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>>>>
>>>>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down
>>>>>> 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>>
>>>>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user
>>>>>> 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>>
>>>>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for
>>>>>> execution of cleanup threads on worker.
>>>>>>
>>>>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager
>>>>>> [INFO] State change: SUSPENDED
>>>>>>
>>>>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event
>>>>>> :disconnected::none: with disconnected Zookeeper.
>>>>>>
>>>>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening
>>>>>> socket connection to server <zk host>:2181. Will not attempt to
>>>>>> authenticate using SASL (unknown error)
>>>>>>
>>>>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket
>>>>>> connection established to <zk host>:2181, initiating session
>>>>>>
>>>>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to
>>>>>> reconnect to ZooKeeper service, session 0x45453e198e80076 has expired,
>>>>>> closing socket connection
>>>>>>
>>>>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager
>>>>>> [INFO] State change: LOST
>>>>>>
>>>>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event
>>>>>> :expired::none: with disconnected Zookeeper.
>>>>>>
>>>>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session
>>>>>> expired event received
>>>>>>
>>>>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <eweathers@groupon.com
>>>>>> > wrote:
>>>>>>
>>>>>>> (I don't know anything about the pacemaker service introduced in
>>>>>>> storm 1.0, so this statement is pre-1.0).
>>>>>>>
>>>>>>> The executor threads within the worker processes write heartbeats to
>>>>>>> ZooKeeper.  If they aren't successfully heartbeating then it could be many
>>>>>>> things:
>>>>>>>
>>>>>>> 1. ZK too busy? (Seems unlikely)
>>>>>>> 2. Network too busy? (Seems unlikely)
>>>>>>> 3. Worker process died due to exception (this is almost always what
>>>>>>> we see)
>>>>>>> 4. Worker process hung (eg doing GC). (This would usually first be
>>>>>>> caught by the supervisor on that host since it checks a local hearbeat file
>>>>>>> that the worker normally writes to every second -- if the heartbeat doesn't
>>>>>>> get refreshed before the timeout then the supervisor kills the worker
>>>>>>> process with State being :timed-out (or :time-out, something like that).
>>>>>>> This of course depends on the various timeout config values you have on the
>>>>>>> worker and nimbus hosts.)
>>>>>>>
>>>>>>> - Erik
>>>>>>>
>>>>>>>
>>>>>>> On Saturday, April 30, 2016, Kevin Conaway <
>>>>>>> kevin.a.conaway@gmail.com> wrote:
>>>>>>>
>>>>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to
>>>>>>>> restart our topology.  From researching past threads it seems like this is
>>>>>>>> related to not receiving heartbeats from the supervisors but I'm unsure if
>>>>>>>> this was the case.  Our topology was mostly idle at the time that the
>>>>>>>> restart was triggered.
>>>>>>>>
>>>>>>>> We have a 5 node Zookeeper (3.4.5) ensemble.  On one of the ZK
>>>>>>>> nodes, I saw the following messages at the time of the restart:
>>>>>>>>
>>>>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO
>>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>>> 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>>> 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO
>>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>>> 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO
>>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>>> 0x15453e198d10084, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>>> 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>>> 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>>>>
>>>>>>>> In the nimbus log, there was the following log message:
>>>>>>>>
>>>>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8
>>>>>>>> 8] not alive
>>>>>>>>
>>>>>>>> Shortly thereafter, the supervisors started restarting the
>>>>>>>> workers.  The  following log message was in the supervisor log:
>>>>>>>>
>>>>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and
>>>>>>>> clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current
>>>>>>>> supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs
>>>>>>>> 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51]
>>>>>>>> [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87]
>>>>>>>> [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>>>>
>>>>>>>> Previous threads have suggested that this was due to heavy GC
>>>>>>>> causing the heartbeats not to reach Zookeeper but the topology was idle at
>>>>>>>> this time so I don't think GC was the culprit.  The GC par new time was
>>>>>>>> about 50ms on each node (as reported to Graphite).
>>>>>>>>
>>>>>>>> Thoughts on what could have happened here and how to debug further?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Kevin Conaway
>>>>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>>>>> https://github.com/kevinconaway
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Kevin Conaway
>>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>>> https://github.com/kevinconaway
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Kevin Conaway
>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>> https://github.com/kevinconaway
>>>>
>>>>
>>
>>
>> --
>> Kevin Conaway
>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>> https://github.com/kevinconaway
>>
>
>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>



-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: How Does Nimbus Decide to Restart Topology?

Posted by Kevin Conaway <ke...@gmail.com>.
Some more interesting info.

As I mentioned in the last post, we have 6 workers.  Each worker has 2
slots.

Only one of the supervisors received a state :timed-out message.  After it
received the message, it restarted both workers successfully.

The other 5 supervisors all experienced the same issue as I mentioned in
the last post.  The workers were launched twice.  What's interesting is that
both workers on those 5 nodes came up the second time in a hung state; they
all had a lot of ERROR messages from the netty client about not being able
to connect to the other nodes:

2016-04-30 01:34:59.758 b.s.m.n.Client [ERROR] connection attempt 26 to
Netty-Client-StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700
failed: java.net.ConnectException: Connection refused:
StormWorker-111549876-5-117071490.<fqdn>/10.226.77.191:6700

It looks like the bolts and spouts never got fully prepared (as evidenced
by the lack of log messages indicating so).

On the one supervisor which _did_ receive the timeout message, the 2
workers came up just fine.
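
(A quick, hedged way to confirm what those "Connection refused" errors
imply -- i.e. that nothing was listening on the peer worker's Netty port at
that moment -- is a plain socket probe. The host and port below are
placeholders taken from the log line above:)

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch only: probe a worker's Netty port. "Connection refused" from
// b.s.m.n.Client generally means no process is bound to that port on the
// target host (worker not started yet, or already dead).
public class WorkerPortProbe {
    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "10.226.77.191";  // placeholder
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6700;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 2000);
            System.out.println(host + ":" + port + " is accepting connections");
        } catch (IOException e) {
            System.out.println(host + ":" + port + " not reachable: " + e.getMessage());
        }
    }
}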

On Mon, May 2, 2016 at 11:21 AM, Kevin Conaway <ke...@gmail.com>
wrote:

> Unfortunately we're not capturing disk i/o in our metrics, I can look in
> to doing that for next time.
>
> We're not capturing GC logs, we are using the graphite storm metric
> consumer to push metrics to graphite, one of which is the GC time from the
> default GC mxbean.
>
> > I'm assuming you're saying that multiple workers had state :timed-out at
> once?
>
> We have 6 workers.  Only one had state :timed-out, the others had state
> :disallowed.  Looking at one of the supervisors as an example, it looks
> like it received multiple requests to reschedule the worker which caused
> the worker to be launched multiple times.  Is this normal?
>
> 2016-04-30 01:34:00.872 b.s.d.supervisor [INFO] Shutting down and clearing
> state for id 589b0ed5-c4e9-422f-a6c5-5d65145915f7. Current supervisor time:
> 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040,
> :storm-id "<topology>", :executors [[124 124] [64 64] [196 196] [40 40] [28
> 28] [184 184] [100 100] [-1 -1] [172 172] [16 16] [52 52] [148 148] [136
> 136] [112 112] [76 76] [88 88] [160 160] [4 4]], :port 6700}
>
> 2016-04-30 01:34:00.874 b.s.d.supervisor [INFO] Shutting down
> 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:589b0ed5-c4e9-422f-a6c5-5d65145915f7
>
> 2016-04-30 01:34:02.013 b.s.d.supervisor [INFO] Shutting down and clearing
> state for id 460a7d66-06de-4ca5-9140-7d46dcdea841. Current supervisor time:
> 1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040,
> :storm-id "<topology>", :executors [[178 178] [58 58] [190 190] [118 118]
> [22 22] [142 142] [-1 -1] [166 166] [106 106] [70 70] [10 10] [46 46] [82
> 82] [154 154] [94 94] [34 34] [130 130]], :port 6701}
>
> 2016-04-30 01:34:02.014 b.s.d.supervisor [INFO] Shutting down
> 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:460a7d66-06de-4ca5-9140-7d46dcdea841
>
> 2016-04-30 01:34:03.095 b.s.d.supervisor [INFO] Launching worker with
> assignment {:storm-id "<topology>", :executors [[3 3] [33 33] [103 103]
> [163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [153
> 153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113
> 113]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port
> 6700 with id 567a18fd-33d1-49b6-a3f4-ace65641bd67
>
> 2016-04-30 01:34:03.122 b.s.d.supervisor [INFO] Launching worker with
> assignment {:storm-id "<topology>", :executors [[8 8] [188 188] [68 68]
> [198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48
> 48] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78
> 78]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6701
> with id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>
>
> 2016-04-30 01:34:41.322 b.s.d.supervisor [INFO] Shutting down and clearing
> state for id 567a18fd-33d1-49b6-a3f4-ace65641bd67. Current supervisor time:
> 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080,
> :storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53
> 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [-1 -1] [153 153] [13
> 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]],
> :port 6700}
>
> 2016-04-30 01:34:41.323 b.s.d.supervisor [INFO] Shutting down
> 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:567a18fd-33d1-49b6-a3f4-ace65641bd67
>
> 2016-04-30 01:34:42.353 b.s.d.supervisor [INFO] Shutting down and clearing
> state for id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6. Current supervisor time:
> 1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080,
> :storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178
> 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [-1 -1] [148
> 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]],
> :port 6701}
>
> 2016-04-30 01:34:42.354 b.s.d.supervisor [INFO] Shutting down
> 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:9fcc869d-08d7-44ec-bde2-bf9ed86403e6
>
> On Sun, May 1, 2016 at 5:52 PM, Erik Weathers <ew...@groupon.com>
> wrote:
>
>> Maybe disk I/O was high?  Are you capturing GC logs to disk in unique
>> files (you can sub in the PID and timestamp into the GC log filename)?  I
>> know you believe it's not responsible, but it's the only thing I've ever
>> found to be responsible thus far.  (Except for a problem in storm 0.9.3
>> with netty that has since been fixed -- we worked around that by
>> downgrading to zero-MQ.)  You might try monitoring the heartbeat files
>> written by the workers to watch for the file creation to be happening less
>> frequently than once per second.
>>
>> > all of the worker sessions expired at the same time
>>
>> I'm assuming you're saying that multiple workers had state :timed-out at
>> once?  Was that on the same host?  If the state is :disallowed, that is
>> perfectly normal when the reassignment happens, as I described earlier.
>>
>> - Erik
>>
>> On Sunday, May 1, 2016, Kevin Conaway <ke...@gmail.com> wrote:
>>
>>> Any tips on where to continue investigating or other metrics to capture?
>>>
>>> As i mentioned before, the topology was mostly idle. Low cpu usage, low
>>> gc time (cms parnew), stable heap, no eth errors. Its hard to see why all
>>> of the worker sessions expired at the same time
>>>
>>> On Sunday, May 1, 2016, Erik Weathers <ew...@groupon.com> wrote:
>>>
>>>> To be clear, the supervisor wasn't shutting  itself down, it was
>>>> killing the worker process.
>>>>
>>>> Also for clarity, the :disallowed state simply means a new
>>>> task->worker assignment has been computed by the nimbus and so the
>>>> supervisor has noticed from the ZK state that the currently running worker
>>>> is no longer supposed to be running.
>>>>
>>>> Regarding the :timed-out state it *definitely* means the worker was
>>>> hung and couldn't write to the local heartbeat file within the timeout (30
>>>> secs by default).  As for *why*... That is hard to determine.
>>>>
>>>> - Erik
>>>>
>>>> On Sunday, May 1, 2016, Kevin Conaway <ke...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Erik.  We're using Storm 0.10 so Pacemaker doesn't come in to
>>>>> play here.
>>>>>
>>>>> 3. Worker process died due to exception (this is almost always what we
>>>>> see)
>>>>> 4. Worker process hung (eg doing GC).
>>>>>
>>>>> I don't think its either of these, TBH.  There were no abnormal
>>>>> terminations in the logs (or in the Storm error logs). I have the GC
>>>>> metrics that were reported to graphite and the GC time was fairly low
>>>>> (~50ms).
>>>>>
>>>>> My feeling is that it is Zookeeper related and not worker related
>>>>> because it appears that ZK also expired sessions from the supervisor nodes
>>>>> as well:
>>>>>
>>>>> I saw the following messages in the log of one of the ZK nodes:
>>>>>
>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>
>>>>> 2016-04-30 01:33:46,005 [myid:4] - INFO  [ProcessThread(sid:4
>>>>> cport:-1)::PrepRequestProcessor@476] - Processed session termination
>>>>> for sessionid: 0x45453e198e80076
>>>>>
>>>>> 2016-04-30 01:33:46,009 [myid:4] - INFO
>>>>> [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for
>>>>> client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>>>
>>>>> And some corresponding log messages on one of the supervisor log files
>>>>> (for the same ZK session ID).  It looks like this occurred right as the
>>>>> supervisor was shutting down due to Nimbus restarting it:
>>>>>
>>>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read
>>>>> additional data from server sessionid 0x45453e198e80076, likely server has
>>>>> closed socket, closing socket connection and attempting reconnect
>>>>>
>>>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and
>>>>> clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current
>>>>> supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs
>>>>> 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98
>>>>> 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74]
>>>>> [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>>>
>>>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down
>>>>> 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>
>>>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user
>>>>> 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>>
>>>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for
>>>>> execution of cleanup threads on worker.
>>>>>
>>>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager
>>>>> [INFO] State change: SUSPENDED
>>>>>
>>>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event
>>>>> :disconnected::none: with disconnected Zookeeper.
>>>>>
>>>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
>>>>> connection to server <zk host>:2181. Will not attempt to authenticate using
>>>>> SASL (unknown error)
>>>>>
>>>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket
>>>>> connection established to <zk host>:2181, initiating session
>>>>>
>>>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to
>>>>> reconnect to ZooKeeper service, session 0x45453e198e80076 has expired,
>>>>> closing socket connection
>>>>>
>>>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager
>>>>> [INFO] State change: LOST
>>>>>
>>>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event
>>>>> :expired::none: with disconnected Zookeeper.
>>>>>
>>>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session
>>>>> expired event received
>>>>>
>>>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <ew...@groupon.com>
>>>>> wrote:
>>>>>
>>>>>> (I don't know anything about the pacemaker service introduced in
>>>>>> storm 1.0, so this statement is pre-1.0).
>>>>>>
>>>>>> The executor threads within the worker processes write heartbeats to
>>>>>> ZooKeeper.  If they aren't successfully heartbeating then it could be many
>>>>>> things:
>>>>>>
>>>>>> 1. ZK too busy? (Seems unlikely)
>>>>>> 2. Network too busy? (Seems unlikely)
>>>>>> 3. Worker process died due to exception (this is almost always what
>>>>>> we see)
>>>>>> 4. Worker process hung (eg doing GC). (This would usually first be
>>>>>> caught by the supervisor on that host since it checks a local hearbeat file
>>>>>> that the worker normally writes to every second -- if the heartbeat doesn't
>>>>>> get refreshed before the timeout then the supervisor kills the worker
>>>>>> process with State being :timed-out (or :time-out, something like that).
>>>>>> This of course depends on the various timeout config values you have on the
>>>>>> worker and nimbus hosts.)
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>>
>>>>>> On Saturday, April 30, 2016, Kevin Conaway <ke...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to
>>>>>>> restart our topology.  From researching past threads it seems like this is
>>>>>>> related to not receiving heartbeats from the supervisors but I'm unsure if
>>>>>>> this was the case.  Our topology was mostly idle at the time that the
>>>>>>> restart was triggered.
>>>>>>>
>>>>>>> We have a 5 node Zookeeper (3.4.5) ensemble.  On one of the ZK
>>>>>>> nodes, I saw the following messages at the time of the restart:
>>>>>>>
>>>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO
>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>> 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>> 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO
>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>> 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO
>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>> 0x15453e198d10084, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>> 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>>> 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>>>
>>>>>>> In the nimbus log, there was the following log message:
>>>>>>>
>>>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8
>>>>>>> 8] not alive
>>>>>>>
>>>>>>> Shortly thereafter, the supervisors started restarting the workers.
>>>>>>> The  following log message was in the supervisor log:
>>>>>>>
>>>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and
>>>>>>> clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current
>>>>>>> supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs
>>>>>>> 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51]
>>>>>>> [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87]
>>>>>>> [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>>>
>>>>>>> Previous threads have suggested that this was due to heavy GC
>>>>>>> causing the heartbeats not to reach Zookeeper but the topology was idle at
>>>>>>> this time so I don't think GC was the culprit.  The GC par new time was
>>>>>>> about 50ms on each node (as reported to Graphite).
>>>>>>>
>>>>>>> Thoughts on what could have happened here and how to debug further?
>>>>>>>
>>>>>>> --
>>>>>>> Kevin Conaway
>>>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>>>> https://github.com/kevinconaway
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Kevin Conaway
>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>> https://github.com/kevinconaway
>>>>>
>>>>
>>>
>>> --
>>> Kevin Conaway
>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>> https://github.com/kevinconaway
>>>
>>>
>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>



-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: How Does Nimbus Decide to Restart Topology?

Posted by Kevin Conaway <ke...@gmail.com>.
Unfortunately we're not capturing disk I/O in our metrics; I can look into
doing that for next time.

We're not capturing GC logs; we are using the Graphite Storm metrics
consumer to push metrics to Graphite, one of which is the GC time from the
default GC MXBean.
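
For reference, those numbers come from the default GC MXBeans; a minimal
sketch of reading them directly (they are cumulative since JVM start, so a
reporter would normally turn them into per-interval deltas):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch only: dump cumulative GC count and time per collector, the same
// source the default GC metrics are derived from.
public class GcMxBeanDump {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}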

> I'm assuming you're saying that multiple workers had state :timed-out at
> once?

We have 6 workers.  Only one had state :timed-out; the others had state
:disallowed.  Looking at one of the supervisors as an example, it looks
like it received multiple requests to reschedule the worker, which caused
the worker to be launched multiple times.  Is this normal?

2016-04-30 01:34:00.872 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 589b0ed5-c4e9-422f-a6c5-5d65145915f7. Current supervisor time:
1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040,
:storm-id "<topology>", :executors [[124 124] [64 64] [196 196] [40 40] [28
28] [184 184] [100 100] [-1 -1] [172 172] [16 16] [52 52] [148 148] [136
136] [112 112] [76 76] [88 88] [160 160] [4 4]], :port 6700}

2016-04-30 01:34:00.874 b.s.d.supervisor [INFO] Shutting down
3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:589b0ed5-c4e9-422f-a6c5-5d65145915f7

2016-04-30 01:34:02.013 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 460a7d66-06de-4ca5-9140-7d46dcdea841. Current supervisor time:
1461980040. State: :disallowed, Heartbeat: {:time-secs 1461980040,
:storm-id "<topology>", :executors [[178 178] [58 58] [190 190] [118 118]
[22 22] [142 142] [-1 -1] [166 166] [106 106] [70 70] [10 10] [46 46] [82
82] [154 154] [94 94] [34 34] [130 130]], :port 6701}

2016-04-30 01:34:02.014 b.s.d.supervisor [INFO] Shutting down
3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:460a7d66-06de-4ca5-9140-7d46dcdea841

2016-04-30 01:34:03.095 b.s.d.supervisor [INFO] Launching worker with
assignment {:storm-id "<topology>", :executors [[3 3] [33 33] [103 103]
[163 163] [53 53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [153
153] [13 13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113
113]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port
6700 with id 567a18fd-33d1-49b6-a3f4-ace65641bd67

2016-04-30 01:34:03.122 b.s.d.supervisor [INFO] Launching worker with
assignment {:storm-id "<topology>", :executors [[8 8] [188 188] [68 68]
[198 198] [178 178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48
48] [148 148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78
78]]} for this supervisor 3c373bc0-e8d5-4c3c-960d-53dfeb27fc86 on port 6701
with id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6


2016-04-30 01:34:41.322 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 567a18fd-33d1-49b6-a3f4-ace65641bd67. Current supervisor time:
1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080,
:storm-id "<topology>", :executors [[3 3] [33 33] [103 103] [163 163] [53
53] [73 73] [123 123] [43 43] [63 63] [23 23] [93 93] [-1 -1] [153 153] [13
13] [193 193] [143 143] [83 83] [173 173] [133 133] [183 183] [113 113]],
:port 6700}

2016-04-30 01:34:41.323 b.s.d.supervisor [INFO] Shutting down
3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:567a18fd-33d1-49b6-a3f4-ace65641bd67

2016-04-30 01:34:42.353 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 9fcc869d-08d7-44ec-bde2-bf9ed86403e6. Current supervisor time:
1461980081. State: :disallowed, Heartbeat: {:time-secs 1461980080,
:storm-id "<topology>", :executors [[8 8] [188 188] [68 68] [198 198] [178
178] [58 58] [118 118] [18 18] [28 28] [38 38] [98 98] [48 48] [-1 -1] [148
148] [158 158] [128 128] [88 88] [138 138] [108 108] [168 168] [78 78]],
:port 6701}

2016-04-30 01:34:42.354 b.s.d.supervisor [INFO] Shutting down
3c373bc0-e8d5-4c3c-960d-53dfeb27fc86:9fcc869d-08d7-44ec-bde2-bf9ed86403e6

On Sun, May 1, 2016 at 5:52 PM, Erik Weathers <ew...@groupon.com> wrote:

> Maybe disk I/O was high?  Are you capturing GC logs to disk in unique
> files (you can sub in the PID and timestamp into the GC log filename)?  I
> know you believe it's not responsible, but it's the only thing I've ever
> found to be responsible thus far.  (Except for a problem in storm 0.9.3
> with netty that has since been fixed -- we worked around that by
> downgrading to zero-MQ.)  You might try monitoring the heartbeat files
> written by the workers to watch for the file creation to be happening less
> frequently than once per second.
>
> > all of the worker sessions expired at the same time
>
> I'm assuming you're saying that multiple workers had state :timed-out at
> once?  Was that on the same host?  If the state is :disallowed, that is
> perfectly normal when the reassignment happens, as I described earlier.
>
> - Erik
>
> On Sunday, May 1, 2016, Kevin Conaway <ke...@gmail.com> wrote:
>
>> Any tips on where to continue investigating or other metrics to capture?
>>
>> As i mentioned before, the topology was mostly idle. Low cpu usage, low
>> gc time (cms parnew), stable heap, no eth errors. Its hard to see why all
>> of the worker sessions expired at the same time
>>
>> On Sunday, May 1, 2016, Erik Weathers <ew...@groupon.com> wrote:
>>
>>> To be clear, the supervisor wasn't shutting  itself down, it was killing
>>> the worker process.
>>>
>>> Also for clarity, the :disallowed state simply means a new
>>> task->worker assignment has been computed by the nimbus and so the
>>> supervisor has noticed from the ZK state that the currently running worker
>>> is no longer supposed to be running.
>>>
>>> Regarding the :timed-out state it *definitely* means the worker was
>>> hung and couldn't write to the local heartbeat file within the timeout (30
>>> secs by default).  As for *why*... That is hard to determine.
>>>
>>> - Erik
>>>
>>> On Sunday, May 1, 2016, Kevin Conaway <ke...@gmail.com> wrote:
>>>
>>>> Thanks Erik.  We're using Storm 0.10 so Pacemaker doesn't come in to
>>>> play here.
>>>>
>>>> 3. Worker process died due to exception (this is almost always what we
>>>> see)
>>>> 4. Worker process hung (eg doing GC).
>>>>
>>>> I don't think its either of these, TBH.  There were no abnormal
>>>> terminations in the logs (or in the Storm error logs). I have the GC
>>>> metrics that were reported to graphite and the GC time was fairly low
>>>> (~50ms).
>>>>
>>>> My feeling is that it is Zookeeper related and not worker related
>>>> because it appears that ZK also expired sessions from the supervisor nodes
>>>> as well:
>>>>
>>>> I saw the following messages in the log of one of the ZK nodes:
>>>>
>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>> [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>
>>>> 2016-04-30 01:33:46,005 [myid:4] - INFO  [ProcessThread(sid:4
>>>> cport:-1)::PrepRequestProcessor@476] - Processed session termination
>>>> for sessionid: 0x45453e198e80076
>>>>
>>>> 2016-04-30 01:33:46,009 [myid:4] - INFO
>>>> [CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for
>>>> client /<ip>:44500 which had sessionid 0x45453e198e80076
>>>>
>>>> And some corresponding log messages on one of the supervisor log files
>>>> (for the same ZK session ID).  It looks like this occurred right as the
>>>> supervisor was shutting down due to Nimbus restarting it:
>>>>
>>>> 2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read
>>>> additional data from server sessionid 0x45453e198e80076, likely server has
>>>> closed socket, closing socket connection and attempting reconnect
>>>>
>>>> 2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and
>>>> clearing state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current
>>>> supervisor time: 1461980064. State: :timed-out, Heartbeat: {:time-secs
>>>> 1461980008, :storm-id "<topology>", :executors [[62 62] [2 2] [38 38] [98
>>>> 98] [50 50] [-1 -1] [170 170] [86 86] [194 194] [158 158] [122 122] [74 74]
>>>> [26 26] [146 146] [110 110] [14 14] [182 182] [134 134]], :port 6700}
>>>>
>>>> 2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down
>>>> 30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>
>>>> 2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user
>>>> 4e6934f9-88e4-4506-9d6e-25db72be941b
>>>>
>>>> 2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for
>>>> execution of cleanup threads on worker.
>>>>
>>>> 2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
>>>> State change: SUSPENDED
>>>>
>>>> 2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event
>>>> :disconnected::none: with disconnected Zookeeper.
>>>>
>>>> 2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
>>>> connection to server <zk host>:2181. Will not attempt to authenticate using
>>>> SASL (unknown error)
>>>>
>>>> 2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket
>>>> connection established to <zk host>:2181, initiating session
>>>>
>>>> 2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to
>>>> reconnect to ZooKeeper service, session 0x45453e198e80076 has expired,
>>>> closing socket connection
>>>>
>>>> 2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
>>>> State change: LOST
>>>>
>>>> 2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event
>>>> :expired::none: with disconnected Zookeeper.
>>>>
>>>> 2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session
>>>> expired event received
>>>>
>>>> On Sat, Apr 30, 2016 at 9:22 PM, Erik Weathers <ew...@groupon.com>
>>>> wrote:
>>>>
>>>>> (I don't know anything about the pacemaker service introduced in storm
>>>>> 1.0, so this statement is pre-1.0).
>>>>>
>>>>> The executor threads within the worker processes write heartbeats to
>>>>> ZooKeeper.  If they aren't successfully heartbeating then it could be many
>>>>> things:
>>>>>
>>>>> 1. ZK too busy? (Seems unlikely)
>>>>> 2. Network too busy? (Seems unlikely)
>>>>> 3. Worker process died due to exception (this is almost always what we
>>>>> see)
>>>>> 4. Worker process hung (eg doing GC). (This would usually first be
>>>>> caught by the supervisor on that host since it checks a local hearbeat file
>>>>> that the worker normally writes to every second -- if the heartbeat doesn't
>>>>> get refreshed before the timeout then the supervisor kills the worker
>>>>> process with State being :timed-out (or :time-out, something like that).
>>>>> This of course depends on the various timeout config values you have on the
>>>>> worker and nimbus hosts.)
>>>>>
>>>>> - Erik
>>>>>
>>>>>
>>>>> On Saturday, April 30, 2016, Kevin Conaway <ke...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We are using Storm 0.10 and we noticed that Nimbus decided to restart
>>>>>> our topology.  From researching past threads it seems like this is related
>>>>>> to not receiving heartbeats from the supervisors but I'm unsure if this was
>>>>>> the case.  Our topology was mostly idle at the time that the restart was
>>>>>> triggered.
>>>>>>
>>>>>> We have a 5 node Zookeeper (3.4.5) ensemble.  On one of the ZK nodes,
>>>>>> I saw the following messages at the time of the restart:
>>>>>>
>>>>>> 2016-04-30 01:33:46,001 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x45453e198e8007f, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x25453e1c2640085, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:46,003 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x45453e198e80076, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:48,003 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x35453e1a529008b, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:50,001 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x15453e198d10084, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x35453e1a5290090, timeout of 20000ms exceeded
>>>>>> 2016-04-30 01:33:50,002 [myid:4] - INFO
>>>>>>  [SessionTracker:ZooKeeperServer@325] - Expiring session
>>>>>> 0x15453e198d1008e, timeout of 20000ms exceeded
>>>>>>
>>>>>> In the nimbus log, there was the following log message:
>>>>>>
>>>>>> 2016-04-30 01:34:00.734 b.s.d.nimbus [INFO] Executor <topology>:[8 8]
>>>>>> not alive
>>>>>>
>>>>>> Shortly thereafter, the supervisors started restarting the workers.
>>>>>> The  following log message was in the supervisor log:
>>>>>>
>>>>>> 2016-04-30 01:34:00.855 b.s.d.supervisor [INFO] Shutting down and
>>>>>> clearing state for id 10ed4848-05f7-48e5-bf2a-736d12f208ed. Current
>>>>>> supervisor time: 1461980040. State: :disallowed, Heartbeat: {:time-secs
>>>>>> 1461980040, :storm-id "<topology>", :executors [[111 111] [75 75] [51 51]
>>>>>> [3 3] [39 39] [159 159] [123 123] [63 63] [-1 -1] [147 147] [27 27] [87 87]
>>>>>> [171 171] [195 195] [135 135] [15 15] [99 99] [183 183]], :port 6700}
>>>>>>
>>>>>> Previous threads have suggested that this was due to heavy GC causing
>>>>>> the heartbeats not to reach Zookeeper but the topology was idle at this
>>>>>> time so I don't think GC was the culprit.  The GC par new time was about
>>>>>> 50ms on each node (as reported to Graphite).
>>>>>>
>>>>>> Thoughts on what could have happened here and how to debug further?
>>>>>>
>>>>>> --
>>>>>> Kevin Conaway
>>>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>>>> https://github.com/kevinconaway
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kevin Conaway
>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>> https://github.com/kevinconaway
>>>>
>>>
>>
>> --
>> Kevin Conaway
>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>> https://github.com/kevinconaway
>>
>>


-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: How Does Nimbus Decide to Restart Topology?

Posted by Erik Weathers <ew...@groupon.com>.
Maybe disk I/O was high?  Are you capturing GC logs to disk in unique files
(you can substitute the PID and timestamp into the GC log filename)?  I know
you believe it's not responsible, but it's the only thing I've ever found
to be responsible thus far.  (Except for a problem in storm 0.9.3 with
netty that has since been fixed -- we worked around that by downgrading to
ZeroMQ.)  You might try monitoring the heartbeat files written by the
workers to watch for the file creation happening less frequently than once
per second.
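
(For anyone following along, one way to do that -- assuming a HotSpot JVM
that honors %p/%t substitution in -Xloggc, and merging these with whatever
you already have in worker.childopts; the path is a placeholder:)

-Xloggc:/var/log/storm/worker-gc-%p-%t.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

Each worker launch then writes its own file (PID plus JVM start timestamp)
instead of overwriting the previous run's log.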

> all of the worker sessions expired at the same time

I'm assuming you're saying that multiple workers had state :timed-out at
once?  Was that on the same host?  If the state is :disallowed, that is
perfectly normal when the reassignment happens, as I described earlier.

- Erik

On Sunday, May 1, 2016, Kevin Conaway <ke...@gmail.com> wrote:

> Any tips on where to continue investigating or other metrics to capture?
>
> As i mentioned before, the topology was mostly idle. Low cpu usage, low gc
> time (cms parnew), stable heap, no eth errors. Its hard to see why all of
> the worker sessions expired at the same time
>
> On Sunday, May 1, 2016, Erik Weathers <eweathers@groupon.com> wrote:
>
>> To be clear, the supervisor wasn't shutting  itself down, it was killing
>> the worker process.
>>
>> Also for clarity, the :disallowed state simply means a new
>> task->worker assignment has been computed by the nimbus and so the
>> supervisor has noticed from the ZK state that the currently running worker
>> is no longer supposed to be running.
>>
>> Regarding the :timed-out state it *definitely* means the worker was hung
>> and couldn't write to the local heartbeat file within the timeout (30 secs
>> by default).  As for *why*... That is hard to determine.
>>
>> - Erik
>>
>> On Sunday, May 1, 2016, Kevin Conaway <ke...@gmail.com> wrote:
>>

Re: How Does Nimbus Decide to Restart Topology?

Posted by Kevin Conaway <ke...@gmail.com>.
Any tips on where to continue investigating or other metrics to capture?

As I mentioned before, the topology was mostly idle: low CPU usage, low GC
time (CMS ParNew), stable heap, no eth errors. It's hard to see why all of
the worker sessions expired at the same time.
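
One more metric that might be worth capturing is each ZooKeeper server's own
view of latency and outstanding requests around the incident. Here is a
minimal sketch (the host and port are placeholders) that pulls those
counters with the "mntr" four-letter command, which ZooKeeper 3.4.x
supports:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkMntr {
    // Sends the "mntr" four-letter command to a ZooKeeper server and prints
    // the server-side counters it returns (zk_avg_latency, zk_max_latency,
    // zk_outstanding_requests, zk_num_alive_connections, ...).
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "zkhost";   // placeholder
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write("mntr".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.US_ASCII));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // e.g. "zk_avg_latency 0"
            }
        }
    }
}

Graphing zk_avg_latency / zk_max_latency next to the Storm GC metrics would
at least show whether the ZK servers themselves were slow at the moment the
sessions expired.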


-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway

Re: How Does Nimbus Decide to Restart Topology?

Posted by Erik Weathers <ew...@groupon.com>.
To be clear, the supervisor wasn't shutting itself down; it was killing
the worker process.

Also for clarity, the :disallowed state simply means that a new
task->worker assignment has been computed by Nimbus, so the supervisor has
noticed from the ZK state that the currently running worker is no longer
supposed to be running.

Regarding the :timed-out state, it *definitely* means the worker was hung
and couldn't write to its local heartbeat file within the timeout (30 secs
by default). As for *why*... that is hard to determine.

- Erik
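
For reference, the timeouts in play here map to Storm config keys. Below is
a minimal sketch (Storm 0.10 / backtype.storm API; the values shown are
what I believe are the stock defaults, not recommendations) spelling out
which knob governs which heartbeat:

import backtype.storm.Config;
import java.util.HashMap;
import java.util.Map;

public class HeartbeatTimeouts {
    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        // Window in which a worker must refresh its local heartbeat file
        // before the supervisor kills it with State :timed-out.
        conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 30);
        // Age at which Nimbus considers an executor's ZK heartbeat stale,
        // logs "not alive" and schedules a reassignment.
        conf.put(Config.NIMBUS_TASK_TIMEOUT_SECS, 30);
        // ZooKeeper client session timeout used by the Storm daemons and
        // workers; matches the "timeout of 20000ms exceeded" lines in the
        // ZK server logs above.
        conf.put(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT, 20000);
        conf.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}

As far as I know these are cluster-level settings (storm.yaml on the
nimbus, supervisor and worker hosts), so raising them means restarting the
daemons rather than just resubmitting the topology.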


Re: How Does Nimbus Decide to Restart Topology?

Posted by Kevin Conaway <ke...@gmail.com>.
Thanks Erik.  We're using Storm 0.10 so Pacemaker doesn't come into play
here.

3. Worker process died due to exception (this is almost always what we see)
4. Worker process hung (eg doing GC).

I don't think it's either of these, TBH.  There were no abnormal
terminations in the logs (or in the Storm error logs). I have the GC
metrics that were reported to Graphite, and the GC time was fairly low
(~50ms).

My feeling is that it is ZooKeeper related and not worker related, because
it appears that ZK also expired sessions from the supervisor nodes.

I saw the following messages in the log of one of the ZK nodes:

2016-04-30 01:33:46,003 [myid:4] - INFO  [SessionTracker:ZooKeeperServer@325]
- Expiring session 0x45453e198e80076, timeout of 20000ms exceeded

2016-04-30 01:33:46,005 [myid:4] - INFO  [ProcessThread(sid:4
cport:-1)::PrepRequestProcessor@476] - Processed session termination for
sessionid: 0x45453e198e80076

2016-04-30 01:33:46,009 [myid:4] - INFO
[CommitProcessor:4:NIOServerCnxn@1001] - Closed socket connection for
client /<ip>:44500 which had sessionid 0x45453e198e80076

And here are some corresponding log messages from one of the supervisor log
files (for the same ZK session ID). It looks like this occurred right as
the supervisor was shutting down due to Nimbus restarting it:

2016-04-30 01:34:24.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to read
additional data from server sessionid 0x45453e198e80076, likely server has
closed socket, closing socket connection and attempting reconnect

2016-04-30 01:34:24.594 b.s.d.supervisor [INFO] Shutting down and clearing
state for id 4e6934f9-88e4-4506-9d6e-25db72be941b. Current supervisor time:
1461980064. State: :timed-out, Heartbeat: {:time-secs 1461980008, :storm-id
"<topology>", :executors [[62 62] [2 2] [38 38] [98 98] [50 50] [-1 -1]
[170 170] [86 86] [194 194] [158 158] [122 122] [74 74] [26 26] [146 146]
[110 110] [14 14] [182 182] [134 134]], :port 6700}

2016-04-30 01:34:24.596 b.s.d.supervisor [INFO] Shutting down
30a7e66c-ff0c-419a-9f50-eb44dc6b1ceb:4e6934f9-88e4-4506-9d6e-25db72be941b

2016-04-30 01:34:24.599 b.s.config [INFO] GET worker-user
4e6934f9-88e4-4506-9d6e-25db72be941b

2016-04-30 01:34:24.632 b.s.d.supervisor [INFO] Sleep 1 seconds for
execution of cleanup threads on worker.

2016-04-30 01:34:24.637 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
State change: SUSPENDED

2016-04-30 01:34:24.652 b.s.cluster [WARN] Received event
:disconnected::none: with disconnected Zookeeper.

2016-04-30 01:34:31.121 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket
connection to server <zk host>:2181. Will not attempt to authenticate using
SASL (unknown error)

2016-04-30 01:34:31.150 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection
established to <zk host>:2181, initiating session

2016-04-30 01:34:31.198 o.a.s.s.o.a.z.ClientCnxn [INFO] Unable to reconnect
to ZooKeeper service, session 0x45453e198e80076 has expired, closing socket
connection

2016-04-30 01:34:31.199 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO]
State change: LOST

2016-04-30 01:34:31.203 b.s.cluster [WARN] Received event :expired::none:
with disconnected Zookeeper.

2016-04-30 01:34:31.203 o.a.s.s.o.a.c.ConnectionState [WARN] Session
expired event received
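
For what it's worth, the SUSPENDED and LOST lines above come from Curator's
connection-state machine: SUSPENDED means the TCP connection to ZK dropped
but the session may still be alive, while LOST here follows the
session-expired event. A minimal sketch with plain (unshaded) Curator that
surfaces the same transitions; the connect string is a placeholder:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConnectionStateDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zkhost:2181",    // placeholder connect string
                20000,            // session timeout, matching the logs above
                15000,            // connection timeout
                new ExponentialBackoffRetry(1000, 5));
        // Prints CONNECTED, then SUSPENDED on a dropped connection and LOST
        // once the session is gone (the same states logged by the
        // supervisor's shaded Curator above).
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                System.out.println("State change: " + newState);
            }
        });
        client.start();
        Thread.sleep(60000);   // watch for transitions for a minute
        client.close();
    }
}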



-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway