
Posted to dev@storm.apache.org by "Michael Noll (JIRA)" <ji...@apache.org> on 2014/12/17 17:48:14 UTC

[jira] [Commented] (STORM-112) Race condition between Topology Kill and Worker Timeout can crash supervisor

    [ https://issues.apache.org/jira/browse/STORM-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250102#comment-14250102 ] 

Michael Noll commented on STORM-112:
------------------------------------

I think I can confirm this is still affecting Storm as of 0.9.2.

It may also be triggered by killing a topology with a short kill wait time (say, 0-5 seconds) and then resubmitting the same topology immediately, or within a few seconds of killing the previously running instance.

> Race condition between Topology Kill and Worker Timeout can crash supervisor
> ----------------------------------------------------------------------------
>
>                 Key: STORM-112
>                 URL: https://issues.apache.org/jira/browse/STORM-112
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: James Xu
>
> Recently during testing on a single node cluster we saw a supervisor crash when a topology was killed. The supervisor came back up and recovered, so it was not that big of a deal, but when we dug into it, it appears that there is a race.
> https://github.com/nathanmarz/storm/issues/656
> When a topology is killed, the local assignments are reset and stormconf.ser is deleted right away. At the same time, however, sync-process may already be running with old state indicating that a worker timed out and needs to be relaunched. launch-worker then tries to read the topology conf, which has already been deleted, and crashes.
> The following is a sanitized version of the supervisor log that shows this happening.
> https://gist.github.com/revans2/6282830
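
To make the interleaving described above concrete, here is a minimal, deterministic sketch in Python. It is not Storm's supervisor code (which is written in Clojure) and it is not the fix that was applied upstream. The function names, the dict used for local assignments, and the directory layout are all hypothetical stand-ins; only the file name stormconf.ser and the order of events come from the issue description. The "guarded" variant shows one possible mitigation, assumed for illustration: re-check the current assignments and tolerate a missing conf file.

```python
import os
import tempfile

CONF_NAME = "stormconf.ser"  # file name taken from the issue description


def write_conf(topo_dir):
    # Hypothetical setup step: put a dummy topology conf on disk.
    os.makedirs(topo_dir, exist_ok=True)
    with open(os.path.join(topo_dir, CONF_NAME), "wb") as f:
        f.write(b"conf")


def kill_topology(assignments, topo_id, topo_dir):
    # As described in the issue: reset the local assignment,
    # then delete the conf file right away.
    assignments.pop(topo_id, None)
    os.remove(os.path.join(topo_dir, CONF_NAME))


def launch_worker(topo_dir):
    # Stand-in for launch-worker: it has to read the topology conf.
    with open(os.path.join(topo_dir, CONF_NAME), "rb") as f:
        return f.read()


def launch_worker_guarded(assignments, topo_id, topo_dir):
    # Assumed mitigation (not necessarily what Storm does): consult the
    # current assignments and treat a missing conf as "topology is gone".
    if topo_id not in assignments:
        return None
    try:
        return launch_worker(topo_dir)
    except FileNotFoundError:
        return None


def simulate(guarded):
    with tempfile.TemporaryDirectory() as root:
        topo_id = "topo-1"
        topo_dir = os.path.join(root, topo_id)
        write_conf(topo_dir)
        assignments = {topo_id: "worker-slot"}

        # sync-process takes its view of the world before the kill...
        stale_view = dict(assignments)
        # ...then the kill resets assignments and deletes the conf...
        kill_topology(assignments, topo_id, topo_dir)
        # ...and sync-process relaunches "timed out" workers from old state.
        try:
            for tid in stale_view:
                if guarded:
                    launch_worker_guarded(assignments, tid, topo_dir)
                else:
                    launch_worker(topo_dir)
        except FileNotFoundError:
            return "crashed"
        return "survived"


if __name__ == "__main__":
    print("unguarded:", simulate(guarded=False))
    print("guarded:  ", simulate(guarded=True))
```

The sketch forces the bad ordering sequentially rather than relying on real threads, so the outcome is reproducible. In a real supervisor the same ordering would only occur when the kill lands between sync-process reading its state and launch-worker reading the conf, which is why the crash shows up intermittently.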


