Posted to user@storm.apache.org by Marc Vaillant <va...@animetrics.com> on 2014/10/04 00:04:16 UTC

Who is responsible for restarting failing executors/tasks?

Consider a bolt that is shuffle grouped, with say a parallelism of 3 and
1 task per executor.  Suppose that one of the tasks gets into a
non-responsive state (hangs, swallows tuples, etc.), so that any tuple
sent to it ultimately fails.  What mechanism in Storm monitors the
health of that executor, so that it can be torn down and a new one
spawned?  We seem to be having exactly this problem: one of the 3
executors gets into a state where it fails to process tuples, but Storm
doesn't do anything about it.  After the executor is stuck in a failing
state, Storm continues to send tuples to it according to the shuffle
grouping paradigm, so on average about 1/3 of the tuples fail.
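For reference, here is a minimal sketch of the wiring described above, assuming the standard TopologyBuilder API (the spout/bolt names and classes are placeholders, not our actual code):

```java
import backtype.storm.topology.TopologyBuilder;

// Sketch of the setup in question: one spout feeding a bolt with a
// parallelism hint of 3, shuffle grouped, and 1 task per executor
// (the default when setNumTasks matches the parallelism hint).
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new MySpout());      // hypothetical spout
builder.setBolt("bolt", new MyBolt(), 3)       // 3 executors
       .setNumTasks(3)                         // 3 tasks -> 1 task per executor
       .shuffleGrouping("spout");              // tuples spread evenly across the 3
```

With this layout, if one of the three executors stops processing, the shuffle grouping keeps routing roughly a third of the tuples to it, which matches the failure rate we observe.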

Thanks,
Marc