You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@storm.apache.org by Nicolas Bär <ni...@gmail.com> on 2014/07/26 01:46:37 UTC

Worker failure leads to 10-15s topology freeze

Hi All

I'm currently investigating the case of worker node failures, where a
failure leads to both the worker process and supervisor beeing killed. The
documentation states the following: "What happens when a node dies? The
tasks assigned to that machine will time-out and Nimbus will reassign those
tasks to other machines."
The documentation somewhat implies that only tasks that were currently
running on the failed node will be started elsewhere on the cluster. But
during my tests I see quite a different behavior. It looks like each time a
node fails or is added again to the cluster the topology will freeze for
about 10 to 15 seconds (the spouts/bolts do not emit/receive any events).
During this time all tasks are reshuffled and may start on different
machines. After reshuffling the spouts/bolts resume to emit/receive data
without any problems. I'm working on test data to eliminate any external
data source problems from the tests.

Is this the design of storm to handle node failures or cluster growth? And
if not, what am I missing here? Is this related to the configuration option
`nimbus.reassign`?

I tried to keep heartbeats and timeouts very low, so a failure is detected
faster:
nimbus.task.timeout.secs: 2
nimbus.supervisor.timeout.secs: 2
supervisor.worker.start.timeout.secs: 5
supervisor.worker.timeout.secs: 2
supervisor.monitor.frequency.secs: 2
supervisor.heartbeat.frequency.secs: 1

And I defined the following properties for my topology:
conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);
conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 2000);
conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 1);

Any ideas on how to reduce the freeze of the topology during node failures
would be highly appreciated.

Best
Nicolas