Posted to dev@storm.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/03/13 20:06:33 UTC

[jira] [Commented] (STORM-956) When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

    [ https://issues.apache.org/jira/browse/STORM-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192484#comment-15192484 ] 

ASF GitHub Bot commented on STORM-956:
--------------------------------------

GitHub user srdo opened a pull request:

    https://github.com/apache/storm/pull/1209

    STORM-956: When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

    The previous PR at https://github.com/apache/storm/pull/647 doesn't look active anymore. Having Storm tell you which components are backing up would still be a nice feature to have.
    
    I've taken a look at implementing the suggestions from the previous PR, but I have a few questions.
    
    The previous discussion seemed to point toward shutting down the worker when an executor hangs. I'm guessing there's no clean way to restart only the hanging executors? Is it sufficient to call shutdown on the worker object from do-executor-heartbeats?
    
    I'm not really sure what Constants/SYSTEM_EXECUTOR_ID is for. Should it be ignored when checking for hanging executors?
    
    I'm hoping to add the ZooKeeper/metrics logging and shutdown functionality soon if this PR looks like it's going in the right direction.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srdo/storm STORM-956

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1209
    
----
commit c0d1c4ef6ae0d1e144f5af85174d68d5a93eb06a
Author: chuanlei <ni...@126.com>
Date:   2015-07-22T07:37:28Z

    stop worker heartbeat, when the executor threads hang-on

commit 16980a3e4e015865348afee7661157cc9a21525a
Author: chuanlei <ni...@gmail.com>
Date:   2015-07-22T08:55:39Z

    add the setup-check! to mk-threads

commit 9884c578fe8fa85197b1e5d4118598425160bb3f
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T14:57:27Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 9dd030396b0d921f25c5269e17c58b649387211d
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T18:58:29Z

    STORM-956: Add support for warning about hanging executors

----


> When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat
> -----------------------------------------------------------------------------------------
>
>                 Key: STORM-956
>                 URL: https://issues.apache.org/jira/browse/STORM-956
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Chuanlei Ni
>            Assignee: Chuanlei Ni
>            Priority: Minor
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> Sometimes the work threads produced by mk-threads in executor.clj hang on external resources or for other unknown reasons. This makes the worker stop processing tuples. I think it is better to kill the worker to resolve the hang. I plan to:
> 1. Like `setup-ticks`, send a system-tick to the receive-queue.
> 2. Have tuple-action-fn handle this system-tick and record in executor-data the time at which the tick was processed.
> 3. When the worker does its local heartbeat, check the time the executor wrote to executor-data. If that time is too far in the past (for example, more than 3 minutes ago), the worker skips the heartbeat, so the supervisor can deal with the problem.
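
The three-step plan quoted above could be sketched roughly as follows. This is only an illustration in Java, not the code in the pull request and not Storm's actual Clojure implementation. The class name ExecutorHangDetector and the methods recordTick, isHanging and shouldHeartbeat are hypothetical and do not exist in Storm. Time is passed in explicitly so the logic can be exercised without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of the hang check proposed in STORM-956.
 * None of these names are part of Storm; they are assumptions for illustration.
 */
final class ExecutorHangDetector {
    // Last time (in ms) each executor handled a system tick (step 2 of the plan).
    private final Map<String, Long> lastTickMs = new ConcurrentHashMap<>();
    // How stale a tick may be before the executor is treated as hanging.
    private final long timeoutMs;

    ExecutorHangDetector(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called from the executor thread whenever it processes a system tick. */
    void recordTick(String executorId, long nowMs) {
        lastTickMs.put(executorId, nowMs);
    }

    /** True if the executor has recorded a tick, but not within the timeout. */
    boolean isHanging(String executorId, long nowMs) {
        Long last = lastTickMs.get(executorId);
        return last != null && nowMs - last > timeoutMs;
    }

    /** Step 3: the worker heartbeats only if no known executor is hanging. */
    boolean shouldHeartbeat(long nowMs) {
        for (String executorId : lastTickMs.keySet()) {
            if (isHanging(executorId, nowMs)) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch an executor that has never recorded a tick is not treated as hanging, which avoids false positives during startup. Whether that is the right choice, and whether something like the system executor should be excluded from the check, is left open here just as it is in the PR description.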


