You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@storm.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2017/01/06 07:26:58 UTC

[jira] [Commented] (STORM-2176) Workers do not shutdown cleanly and worker hooks don't run when a topology is killed

    [ https://issues.apache.org/jira/browse/STORM-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803845#comment-15803845 ] 

Jungtaek Lim commented on STORM-2176:
-------------------------------------

Investigated this and got some informations:

1. As Taylor stated, give same seconds to supervisor waiting time and worker waiting time doesn't make sense. Supervisor should wait longer. If not there's no reason for worker to kill itself.

2. For normal topology, after adjusting seconds I got error code 143 (128 + 15), which seems to be expected.

3. If worker has one or more multi-lang tasks, for now they are having high chance to raise NoOutputException in progress of shutting down (yes we should fix that, but not considered as critical for me), and when it calls System.exit() to give up, worker is going into deadlock since shutdown hook is already running.

https://dzone.com/articles/know-jvm-series-2-shutdown

{quote}
5. Once shutdown sequence starts, it can be stopped by Runtime.halt() only.

Once the shutdown sequence starts, only Runtime.halt() (which forcefully terminates the JVM) can stop the execution of the shutdown sequence (except for external influences such as SIGKILL). This means that calling System.exit() with in a Shutdown Hook will not work. Actually, if you call System.exit() with in a Shutdown Hook, the VM will get stuck, and you have to terminate the process forcefully.
{quote}

4. Why log is not written after shutdown is starting is not relevant to this issue.
Log4j2 also registers shutdown hook to close its resource, and after hook is run we're unable to log.
(I'm not sure it is only related to async logger.)
Given that JVM runs shutdown hooks concurrently without guaranteeing the sequence, eventually log is not written to anywhere.
Log4j2 provides configuration to disable its shutdown hook: add attribute shutdownHook="disable" to the configuration tag.

> Workers do not shutdown cleanly and worker hooks don't run when a topology is killed
> ------------------------------------------------------------------------------------
>
>                 Key: STORM-2176
>                 URL: https://issues.apache.org/jira/browse/STORM-2176
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1, 1.0.2
>            Reporter: P. Taylor Goetz
>            Priority: Critical
>
> This appears to have been introduced in the 1.0.0 release. The issues does not seem to affect 0.10.2.
> When a topology is killed and workers receive the notification to shutdown, they do not shutdown cleanly, so worker hooks never get invoked.
> When a worker shuts down cleanly, the worker logs should contain entries such as the following:
> {code}
> 2016-10-28 18:52:06.273 b.s.d.worker [INFO] Shut down transfer thread
> 2016-10-28 18:52:06.279 b.s.d.worker [INFO] Shutting down default resources
> 2016-10-28 18:52:06.287 b.s.d.worker [INFO] Shut down default resources
> 2016-10-28 18:52:06.351 b.s.d.worker [INFO] Disconnecting from storm cluster state context
> 2016-10-28 18:52:06.359 b.s.d.worker [INFO] Shut down worker exclaim-1-1477680593 61bddd66-0fda-4556-b742-4b63f0df6fc1 6700
> {code}
> In the 1.0.x line of releases (and presumably 1.x, though I haven't checked) this does not happen -- the worker shutdown process appears to get stuck shutting down executors (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/daemon/worker.clj#L666), no further log messages are seen in the worker log, and worker hooks do not run.
> There are two properties that affect how workers exit. The first is the configuration property {{supervisor.worker.shutdown.sleep.secs}}, which defaults to 1 second. This corresponds to how long the supervisor will wait for a worker to exit gracefully before forcibly killing it with {{kill -9}}. When this happens the supervisor will log that the worker terminated with exit code 137 (128 + 9).
> The second property is a hard-coded 1 second delay (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/util.clj#L463) added as a shutdown hook that will call {{Runtime.halt()}} if the delay is exceeded. When this happens, the supervisor will log that the worker terminated with exit code 20 (hard-coded).
> Side Note: The hardcoded halt delay in worker.clj and the default value for {{supervisor.worker.shutdown.sleep.secs}} both being 1 second should probably be changed since it creates a race to see whether the supervisor delay or the worker delay wins.
> To test this, I set {{supervisor.worker.shutdown.sleep.secs}} to 15 to allow plenty of time for the worker to exit gracefully, and deployed and killed a topology. In this case the supervisor consistently reported exit code 20 for the worker, indicating the hard-coded shutdown hook caused the worker to exit.
> I thought the hard-coded 1 second shutdown hook delay might not be long enough for the worker to shutdown cleanly. To test that hypothesis, I changed the hard-code delay to 10 seconds, leaving {{supervisor.worker.shutdown.sleep.secs}} at 15 seconds. Again supervisor reported an exit code of 20 for the worker, and there were no log messages indicating the worker had exited cleanly and that the worker hook had run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)