Posted to dev@storm.apache.org by "Paul Poulosky (JIRA)" <ji...@apache.org> on 2016/07/19 15:14:20 UTC
[jira] [Created] (STORM-1984) Race during rebalance
Paul Poulosky created STORM-1984:
------------------------------------
Summary: Race during rebalance
Key: STORM-1984
URL: https://issues.apache.org/jira/browse/STORM-1984
Project: Apache Storm
Issue Type: Bug
Components: storm-core
Affects Versions: 1.0.0
Reporter: Paul Poulosky
We have been seeing a Storm cluster get into a restart loop because of bad topology state saved in ZooKeeper (ZK).
On startup, we are seeing a rebalance timer being set with a time value of nil.
This rebalance is triggered by the startup state transition here:
https://github.com/apache/storm/blob/master/storm-core/src/clj/org/apache/storm/daemon/nimbus.clj#L330-L336
The problem is that topology-action-options is nil in storm-base.
(I added a temporary debug print)
2016-07-19 14:41:56.604 b.s.d.nimbus [INFO] In state-transitions #backtype.storm.daemon.common.StormBase{:storm-name "test1", :launch-time-secs 1468879726, :status {:type :rebalancing}, :num-workers 3, :component->executors {"__system" 0, "__acker" 3, "exclaim2" 2, "exclaim1" 3, "word" 10}, :owner "hadoopqa", :topology-action-options nil, :prev-status {:type :active}}
This happens if nimbus crashes during the rebalancing state, before the scheduler can reschedule the topology and return it to active or inactive, but after topology-action-options in storm-base was set to nil here:
https://github.com/apache/storm/blob/master/storm-core/src/clj/org/apache/storm/daemon/nimbus.clj#L292-L299
We then get into a state where nimbus, if supervised, crashes repeatedly on startup.
We should stop setting the topology action options to nil in do-rebalance, and/or ignore the rebalance on startup if the delay can't be read.
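The second option (ignore the rebalance on startup when the delay can't be read) could look roughly like the sketch below. It is written in Python only to illustrate the guard and is not the nimbus.clj code. The function names, the dict layout, and the "delay-secs" key are assumptions made for this example, and what nimbus should do after skipping the rebalance is left open.

```python
from typing import Optional, Tuple


def read_rebalance_delay(storm_base: dict) -> Optional[int]:
    """Return the rebalance delay in seconds, or None if it cannot be read.

    Assumption: the delay lives under topology-action-options as
    "delay-secs"; the key name is illustrative only.
    """
    options = storm_base.get("topology-action-options") or {}
    delay = options.get("delay-secs")
    # Reject missing, non-integer, or negative values.
    if isinstance(delay, int) and not isinstance(delay, bool) and delay >= 0:
        return delay
    return None


def startup_transition(storm_base: dict) -> Tuple[str, Optional[int]]:
    """Decide what to do on startup for a topology (hypothetical helper)."""
    status = (storm_base.get("status") or {}).get("type")
    if status != "rebalancing":
        return ("noop", None)
    delay = read_rebalance_delay(storm_base)
    if delay is None:
        # The options were already cleared before the crash: do not set a
        # timer with a nil delay; skip the rebalance instead of failing.
        return ("skip", None)
    return ("schedule", delay)


# Mirrors the state in the debug print above: rebalancing, options nil.
broken = {"status": {"type": "rebalancing"}, "topology-action-options": None}
print(startup_transition(broken))
```

With the guard in place, the broken state from the log is skipped rather than passed to the timer with a nil delay.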
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)