Posted to user@storm.apache.org by Fang Chen <fc...@gmail.com> on 2015/05/29 03:47:29 UTC

Re: Supervisor repeatedly killing worker

Did you find any solutions to this?

I ran into exactly the same situation with 0.9.4.  I have a test Kafka
topic with around 10m tuples, and the supervisor started killing its worker:
the first time after about 15 minutes, then again after 2 minutes or
6 minutes, even though I had set supervisor.worker.start.timeout.secs and
supervisor.worker.timeout.secs to 40 minutes each.

I then ran another experiment: right after the topology started running, I
killed all supervisors, and the workers finished all tuples without
issues.

On Thu, Apr 16, 2015 at 11:56 AM, Grant Overby (groverby) <
groverby@cisco.com> wrote:

>    I’m not, and if I had to guess I’d say it’s likely something is going
> wrong with the heartbeats, but how can I go about finding out?
>
>
>
>
>   From: Paul Poulosky <pp...@yahoo-inc.com>
> Reply-To: "user@storm.apache.org" <us...@storm.apache.org>, Paul Poulosky <
> ppoulosk@yahoo-inc.com>
> Date: Thursday, April 16, 2015 at 2:38 PM
> To: "user@storm.apache.org" <us...@storm.apache.org>
> Subject: Re: Supervisor repeatedly killing worker
>
>  10.0.1.5
>

Re: Supervisor repeatedly killing worker

Posted by Jeffery Maass <ma...@gmail.com>.
Set logging to info level.  The logs state the reason every time a worker
is killed.  Sorry, I don't have any examples....

You have to look at 3 logs:
* nimbus - will say that it is killing a task/executor.  As I recall, you
have to work out yourself which supervisor that task/executor belongs to.
* supervisor - will say that it is killing a worker and will list the
status of the worker.  As I recall there are 2 statuses.  I can't remember
what they are called.. :(
** One status means that the supervisor hasn't heard from the worker.  The
log will list 2 times in milliseconds.  When you subtract one from the
other, you will see that the difference matches or exceeds the setting
supervisor.worker.timeout.secs
** The other status means that nimbus told the supervisor to kill the
worker holding the task.  The supervisor then kills the worker process.
* worker - if it is killed by a supervisor, the worker log will say nothing
about it.  The supervisor simply issues a kill, so the worker log
just ends.  If the worker has decided to shut itself down, the logs will
say why.  A worker might shut itself down if, say, there was an unhandled
exception somewhere.
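
The subtraction described for the first supervisor status can be sketched
as below.  This is only an illustration, not Storm's own code: the function
names, the unit_per_sec parameter and the sample numbers are all made up, and
you should check in your own log whether the two times are in seconds or
milliseconds.

```python
# Hypothetical helpers for reading a supervisor log by hand.
# Nothing here is part of Storm; names and sample values are assumptions.

def heartbeat_gap(supervisor_time, last_heartbeat):
    """How long the worker has been silent, in the same unit as the inputs."""
    return supervisor_time - last_heartbeat


def looks_timed_out(supervisor_time, last_heartbeat, timeout_secs, unit_per_sec=1):
    """True when the gap between the two logged times exceeds the timeout.

    unit_per_sec is 1 if the logged times are in seconds and 1000 if they
    are in milliseconds.
    """
    gap_secs = heartbeat_gap(supervisor_time, last_heartbeat) / unit_per_sec
    return gap_secs > timeout_secs


# Made-up times in milliseconds, checked against a 30 second timeout.
print(looks_timed_out(1_000_031_000, 1_000_000_000, 30, unit_per_sec=1000))
print(looks_timed_out(1_000_010_000, 1_000_000_000, 30, unit_per_sec=1000))
```

If the gap you compute is well below supervisor.worker.timeout.secs, the
kill was probably not a heartbeat timeout, and the nimbus log is the next
place to look.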

from https://github.com/apache/storm/blob/master/conf/defaults.yaml :
      nimbus.task.timeout.secs: 30
      nimbus.supervisor.timeout.secs: 60
      #how long supervisor will wait to ensure that a worker process is started
      supervisor.worker.start.timeout.secs: 120
      #how long between heartbeats until supervisor considers that worker dead and tries to restart it
      supervisor.worker.timeout.secs: 30
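
To see which of these timeouts are still at their defaults after you
override some of them, a small sketch like the one below may help.  It is
not how Storm loads its configuration; effective_timeouts is a hypothetical
helper, and the overrides are the 40 minute values reported earlier in this
thread.

```python
# Defaults as quoted above from conf/defaults.yaml.
DEFAULTS = {
    "nimbus.task.timeout.secs": 30,
    "nimbus.supervisor.timeout.secs": 60,
    "supervisor.worker.start.timeout.secs": 120,
    "supervisor.worker.timeout.secs": 30,
}


def effective_timeouts(overrides):
    """Overlay overrides on the defaults; keys not listed above are rejected."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unexpected keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# 40 minutes each, expressed in seconds because the settings end in .secs.
forty_minutes = 40 * 60
conf = effective_timeouts({
    "supervisor.worker.start.timeout.secs": forty_minutes,
    "supervisor.worker.timeout.secs": forty_minutes,
})
for key in sorted(conf):
    print(key, conf[key])
```

Note that raising only the two supervisor settings leaves both nimbus
timeouts at their defaults, which may matter if the kill is being requested
by nimbus rather than decided by the supervisor.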


Thank you for your time!

+++++++++++++++++++++
Jeff Maass <ma...@gmail.com>
linkedin.com/in/jeffmaass
stackoverflow.com/users/373418/maassql
+++++++++++++++++++++

