Posted to user@storm.apache.org by Oliver Hall <ol...@metabroadcast.com> on 2014/04/11 11:03:47 UTC

Trident batches timing out on 0.9.0.1

Hi,

We've been running Storm for a while, and recently decided to upgrade
from 0.8 to 0.9.0.1. We managed to get our staging infrastructure
working (one 'master' node running nimbus, drpc, ui and zookeeper, and
one 'worker' node running the supervisor), and moved on to upgrading
production (a cluster of three nodes: one 'master' node and two
'worker' nodes). However, our production cluster refused to run a
topology correctly: Trident batches would be formed from incoming
events, then failed due to timing out without ever being processed.
After some experimentation, we managed to replicate this on our
staging cluster by introducing a second 'worker' node.
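
Our real topology is more involved, but a stripped-down Trident
topology of the same shape shows the moving parts: a batch spout, a
persistent aggregation, and the message timeout that governs when a
batch is failed. (This is only an illustrative sketch: the class,
stream and topology names are made up, the FixedBatchSpout merely
stands in for our real event spout, and the timeout value is just an
example.)

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;

    public class TimeoutRepro {
        public static void main(String[] args) throws Exception {
            // Stand-in spout; our real spout reads incoming events,
            // but the timeout behaviour is the same with this one.
            FixedBatchSpout spout = new FixedBatchSpout(
                    new Fields("event"), 3,
                    new Values("a"), new Values("b"), new Values("c"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("events", spout)
                    .groupBy(new Fields("event"))
                    .persistentAggregate(new MemoryMapState.Factory(),
                            new Count(), new Fields("count"));

            Config conf = new Config();
            conf.setNumWorkers(2);           // one slot on each worker node
            conf.setMessageTimeoutSecs(30);  // batches not fully acked
                                             // within this window are failed
            StormSubmitter.submitTopology("timeout-repro", conf,
                    topology.build());
        }
    }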

We're completely stumped, as there are no errors in the logs, and no
differences that we can detect between the existing (working) 'worker'
node and the freshly built one. We've tried various arrangements of
'worker' nodes in the cluster, and found the following behaviour:

1. New cluster with the master node and one worker node, worker1:
correct behaviour. Batches processed.
2. Cluster with the master node, worker1 disabled, and a new worker
node, worker2, added: all batches time out.
3. Cluster with the master node and both worker1 and worker2 running:
all batches time out on both workers.
4. Cluster with the master node, worker2 disabled and worker1 enabled:
correct behaviour. Batches processed.

In addition to these four isolated scenarios, we've tried stopping
and starting the supervisor process on each of the workers to
transition between every one of the above scenarios. It is always the
case that as soon as worker1 is not the only node running the
supervisor process, the topology starts failing. Even with just
worker2 running the supervisor, the topology fails to work. As soon as
worker1 becomes the only active worker node in the cluster, regardless
of the previous arrangement, everything starts running OK.

During the failure state, there is no indication in the logs of
anything amiss: no stack traces, no apparent errors. The only output
is from our spout, which logs when batches are created and again when
they time out.

It's worth noting at this point that we use Puppet to manage our node
configuration, and that our nodes are AWS instances. So when I say
that worker2 is identical to worker1, I mean that they were
bootstrapped from the same config and should therefore be identical.
I can also rule out a single 'dodgy' host being responsible for the
new worker node's behaviour: I've been fighting this for over a week,
and have in that time bootstrapped and terminated various new nodes.
All have behaved identically.
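
For what it's worth, the Storm-relevant part of that shared config is
a storm.yaml along the following lines on every node (the hostnames
and paths here are placeholders rather than our real values):

    storm.zookeeper.servers:
        - "master"
    nimbus.host: "master"
    drpc.servers:
        - "master"
    storm.local.dir: "/mnt/storm"
    supervisor.slots.ports:
        - 6700
        - 6701
        - 6702
        - 6703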