Posted to dev@storm.apache.org by abe oppenheim <ab...@gmail.com> on 2015/10/02 21:13:45 UTC

Executors Constantly Dying

Hi,

I'm seeing weird behavior in my topologies and was hoping for some advice
on how to troubleshoot the issue.

This behavior occurs throughout my topology, but it is easiest to explain
it as the behavior of one bolt. This bolt has 20 executors. When I submit
the topology, the executors are evenly split between 2 hosts. The executors
on one host seem stable, but the Uptime for the executors on the other host
never grows above roughly 10 minutes; they are constantly being re-prepared.

I don't know what this is symptomatic of or how to diagnose it. All the
Executors have the same Uptime, so I assume this indicates that their
Worker is dying.

Any advice on how to troubleshoot this? Possibly a way to tap into the
Worker lifecycle so I can confirm it is dying every few minutes? Possibly
an explanation of why a Worker would die so consistently, and suggestions
about how to approach this?
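
One minimal way to confirm this (just a Java sketch, not something from the thread; the class name is made up) is to log a few identifying details from the bolt's prepare() method, since prepare() runs once per executor each time its worker JVM starts:

    // Hypothetical bolt that only adds lifecycle logging; swap in real processing.
    import java.lang.management.ManagementFactory;
    import java.util.Map;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    public class LifecycleLoggingBolt extends BaseRichBolt {
        private static final Logger LOG = LoggerFactory.getLogger(LifecycleLoggingBolt.class);
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // getName() returns "pid@hostname" of the worker JVM; if a given task id
            // logs a new pid every few minutes, its worker process is being relaunched.
            LOG.info("prepare() in worker JVM {} (slot port {}) for task {}",
                    ManagementFactory.getRuntimeMXBean().getName(),
                    context.getThisWorkerPort(),
                    context.getThisTaskId());
        }

        @Override
        public void execute(Tuple tuple) {
            collector.ack(tuple); // real bolt logic would go here
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // this sketch emits nothing
        }
    }

Comparing those log lines across the two hosts should show whether only one host's workers are restarting.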

Also, any input on how "bad" this is? My topology still processes stuff,
but I assume this constant recreation of Executors has a significant
performance impact?

thanks,
Abe

Re: Executors Constantly Dying

Posted by Abe Oppenheim <ab...@gmail.com>.
Thanks, this is very helpful advice.



Re: Executors Constantly Dying

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.
Please check the supervisor log on that node, and also check the worker log for the worker. If the supervisor prints out a message about ":disallowed", then nimbus rescheduled the worker some place else. If it prints out a message about the worker timing out, then the worker was not responding and the supervisor relaunched it, thinking it was dead. There are usually two causes for this:

1) The worker really was dead, and you will probably see a log message in the worker log with the stack trace of the exception that killed it.
2) GC was going crazy on that worker and it didn't get enough time to actually heartbeat.

If it is the latter, you really are going to need to do some profiling. You can test this by increasing the heap size and seeing if that fixes it, or preferably by shutting off your supervisor and attaching a debugger or taking a heap dump to see where the memory is being used. If you have a memory leak, increasing the heap size will not fix it.
 - Bobby 
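
To run the heap-size test above without touching cluster-wide settings, one option (a Java sketch assuming the 0.9/0.10-era backtype.storm API; the -Xmx value is only a placeholder) is to pass larger worker JVM options through the topology configuration and turn on GC logging at the same time, so the worker log shows whether long GC pauses line up with the missed heartbeats:

    // Sketch: per-topology worker JVM options with a larger heap and GC logging.
    import backtype.storm.Config;

    public class WorkerHeapTestConfig {
        public static Config build() {
            Config conf = new Config();
            // topology.worker.childopts is added to the worker launch command for this
            // topology only; with HotSpot the last -Xmx on the command line wins.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                    "-Xmx2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
            return conf;
        }
    }

If the executors stay up with the larger heap, memory pressure is the likely cause; if they keep dying at the same rate, look for an exception stack trace in the worker log instead. As noted above, if there is a leak, the bigger heap only delays the problem.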

