Posted to dev@storm.apache.org by "Paul Poulosky (JIRA)" <ji...@apache.org> on 2016/07/19 15:14:20 UTC

[jira] [Created] (STORM-1984) Race during rebalance

Paul Poulosky created STORM-1984:
------------------------------------

             Summary: Race during rebalance
                 Key: STORM-1984
                 URL: https://issues.apache.org/jira/browse/STORM-1984
             Project: Apache Storm
          Issue Type: Bug
          Components: storm-core
    Affects Versions: 1.0.0
            Reporter: Paul Poulosky


We have been seeing a Storm cluster get into a restart loop because of bad topology state saved in ZooKeeper (ZK).

On startup, we are seeing a rebalance timer being set with a delay value of nil.

This rebalance is triggered during a startup state transition here:

https://github.com/apache/storm/blob/master/storm-core/src/clj/org/apache/storm/daemon/nimbus.clj#L330-L336

The problem is that topology-action-options is nil in storm-base.  

(I added a temporary debug print)

2016-07-19 14:41:56.604 b.s.d.nimbus [INFO] In state-transitions #backtype.storm.daemon.common.StormBase{:storm-name "test1", :launch-time-secs 1468879726, :status {:type :rebalancing}, :num-workers 3, :component->executors {"__system" 0, "__acker" 3, "exclaim2" 2, "exclaim1" 3, "word" 10}, :owner "hadoopqa", :topology-action-options nil, :prev-status {:type :active}}

If nimbus happens to crash while the topology is in the rebalancing state, before the scheduler can reschedule the topology and return it to active or inactive, but after topology-action-options in storm-base was set to nil here:

https://github.com/apache/storm/blob/master/storm-core/src/clj/org/apache/storm/daemon/nimbus.clj#L292-L299

Then we get into a state where nimbus, if supervised, will crash repeatedly on startup.

We should stop setting topology-action-options to nil in do-rebalance, and/or ignore the rebalance on startup if the delay can't be read.
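
A minimal sketch of the second option, written in Python rather than nimbus's Clojure and meant only as an illustration. The function names, the "delay-secs" key, and the dict layout are assumptions modeled on the StormBase record in the debug print, not the actual nimbus code. Falling back to prev-status when the delay is unreadable is also an assumption about what "ignore the rebalance" could mean.

```python
def rebalance_delay(storm_base):
    """Return the saved rebalance delay in seconds, or None if it can't be read."""
    # topology-action-options may be None if it was cleared before a crash.
    options = storm_base.get("topology-action-options") or {}
    delay = options.get("delay-secs")  # hypothetical key name
    if isinstance(delay, int) and delay >= 0:
        return delay
    return None


def startup_transition(storm_base):
    """Decide what to do on startup for a topology's saved state."""
    if storm_base["status"]["type"] != "rebalancing":
        return ("no-op", None)
    delay = rebalance_delay(storm_base)
    if delay is None:
        # No usable delay: skip the timer instead of scheduling with nil,
        # and fall back to the status the topology had before the rebalance.
        return ("restore", storm_base["prev-status"]["type"])
    return ("schedule-rebalance", delay)


# Saved state shaped like the debug print: options already cleared.
bad_state = {
    "status": {"type": "rebalancing"},
    "topology-action-options": None,
    "prev-status": {"type": "active"},
}

# Saved state where the options survived the crash.
good_state = {
    "status": {"type": "rebalancing"},
    "topology-action-options": {"delay-secs": 10},
    "prev-status": {"type": "active"},
}

print(startup_transition(bad_state))
print(startup_transition(good_state))
```

With a guard like this, a saved state with nil options no longer reaches the timer at all, so a supervised nimbus would not hit the same failure on every restart.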



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)