You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@storm.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2017/01/20 01:25:26 UTC

[jira] [Resolved] (STORM-2176) Workers do not shutdown cleanly and worker hooks don't run when a topology is killed

     [ https://issues.apache.org/jira/browse/STORM-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved STORM-2176.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 1.0.3
                   1.1.0
                   2.0.0

Merged into master, 1.x, 1.0.x branches.

> Workers do not shutdown cleanly and worker hooks don't run when a topology is killed
> ------------------------------------------------------------------------------------
>
>                 Key: STORM-2176
>                 URL: https://issues.apache.org/jira/browse/STORM-2176
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1, 1.0.2
>            Reporter: P. Taylor Goetz
>            Assignee: Jungtaek Lim
>            Priority: Critical
>             Fix For: 2.0.0, 1.1.0, 1.0.3
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> This appears to have been introduced in the 1.0.0 release. The issues does not seem to affect 0.10.2.
> When a topology is killed and workers receive the notification to shutdown, they do not shutdown cleanly, so worker hooks never get invoked.
> When a worker shuts down cleanly, the worker logs should contain entries such as the following:
> {code}
> 2016-10-28 18:52:06.273 b.s.d.worker [INFO] Shut down transfer thread
> 2016-10-28 18:52:06.279 b.s.d.worker [INFO] Shutting down default resources
> 2016-10-28 18:52:06.287 b.s.d.worker [INFO] Shut down default resources
> 2016-10-28 18:52:06.351 b.s.d.worker [INFO] Disconnecting from storm cluster state context
> 2016-10-28 18:52:06.359 b.s.d.worker [INFO] Shut down worker exclaim-1-1477680593 61bddd66-0fda-4556-b742-4b63f0df6fc1 6700
> {code}
> In the 1.0.x line of releases (and presumably 1.x, though I haven't checked) this does not happen -- the worker shutdown process appears to get stuck shutting down executors (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/daemon/worker.clj#L666), no further log messages are seen in the worker log, and worker hooks do not run.
> There are two properties that affect how workers exit. The first is the configuration property {{supervisor.worker.shutdown.sleep.secs}}, which defaults to 1 second. This corresponds to how long the supervisor will wait for a worker to exit gracefully before forcibly killing it with {{kill -9}}. When this happens the supervisor will log that the worker terminated with exit code 137 (128 + 9).
> The second property is a hard-coded 1 second delay (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/util.clj#L463) added as a shutdown hook that will call {{Runtime.halt()}} if the delay is exceeded. When this happens, the supervisor will log that the worker terminated with exit code 20 (hard-coded).
> Side Note: The hardcoded halt delay in worker.clj and the default value for {{supervisor.worker.shutdown.sleep.secs}} both being 1 second should probably be changed since it creates a race to see whether the supervisor delay or the worker delay wins.
> To test this, I set {{supervisor.worker.shutdown.sleep.secs}} to 15 to allow plenty of time for the worker to exit gracefully, and deployed and killed a topology. In this case the supervisor consistently reported exit code 20 for the worker, indicating the hard-coded shutdown hook caused the worker to exit.
> I thought the hard-coded 1 second shutdown hook delay might not be long enough for the worker to shutdown cleanly. To test that hypothesis, I changed the hard-code delay to 10 seconds, leaving {{supervisor.worker.shutdown.sleep.secs}} at 15 seconds. Again supervisor reported an exit code of 20 for the worker, and there were no log messages indicating the worker had exited cleanly and that the worker hook had run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)