Posted to user@storm.apache.org by Dennis Ignacio <de...@gmail.com> on 2014/11/24 21:37:51 UTC

What causes workers to stop processing?

Greetings,

I've tried various storm.yaml and ZooKeeper parameter changes to find out
what causes random workers in our cluster to die, but have not had any luck.
Sometimes I get a clean test run, but most of the time a worker just stops
for no apparent reason.

The supervisor.log file says it is shutting down the worker, but doesn't
indicate why. Sometimes the worker's state is "timed-out", and other times it
is "disallowed". When I do see the disallowed state, I also see a new worker
process spawned.

Are there any specific heartbeat or timeout parameters, or combinations of
parameters, I should pay particular attention to?

*Cluster details:*
Storm - 0.9.2-incubating
Zookeeper - 3.4.5
Ubuntu 12.04


Thanks,
Dennis

Re: What causes workers to stop processing?

Posted by Andrey Yegorov <an...@gmail.com>.
On Mon, Nov 24, 2014 at 12:37 PM, Dennis Ignacio <de...@gmail.com>
wrote:

> Are there any specific heartbeat or timeout parameters, or combinations of
> parameters, I should pay particular attention to?
>

I do not remember the parameter names off the top of my head, but you can
increase the timeouts for the nimbus-worker ping, the ZooKeeper session, and
Netty. Also monitor the CPU and memory usage of the workers.
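
For reference, the storm.yaml keys behind those knobs (names as they appear
in the 0.9.x defaults.yaml; the values below are only illustrative, so check
the defaults shipped with your release) look roughly like this:

  # heartbeat timeouts: how long Nimbus / the supervisor wait for a worker
  # heartbeat before declaring it dead (defaults are around 30 seconds)
  nimbus.task.timeout.secs: 60
  supervisor.worker.timeout.secs: 60
  # ZooKeeper client session and connection timeouts, in milliseconds
  storm.zookeeper.session.timeout: 30000
  storm.zookeeper.connection.timeout: 20000
  # Netty retry/back-off for worker-to-worker messaging
  storm.messaging.netty.max_retries: 60
  storm.messaging.netty.max_wait_ms: 2000

If I remember the supervisor code right, "timed-out" in supervisor.log means
the worker's heartbeat was older than supervisor.worker.timeout.secs, so
raising that (and the nimbus timeout) above your worst-case GC pauses is a
reasonable first experiment.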

Nimbus can decide to re-assign bolts to other slots if the ping times out;
this can happen because of e.g. high CPU usage on the supervisor machines, or
because of general ZooKeeper issues such as
https://issues.apache.org/jira/browse/ZOOKEEPER-885#comment-13749751
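
One caveat if you do raise storm.zookeeper.session.timeout: the ZooKeeper
server clamps client session timeouts to a range derived from tickTime
(by default 2*tickTime to 20*tickTime), so a longer session may also require
something like this in zoo.cfg (illustrative values only):

  tickTime=2000
  minSessionTimeout=4000
  maxSessionTimeout=60000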

----------
Andrey Yegorov