You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by srdo <gi...@git.apache.org> on 2016/03/13 20:06:32 UTC

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

GitHub user srdo opened a pull request:

    https://github.com/apache/storm/pull/1209

    STORM-956: When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

    The previous PR at https://github.com/apache/storm/pull/647 doesn't look active anymore. Having Storm tell you which components are backing up would still be a nice feature to have.
    
    I've taken a look at implementing the suggestions from the previous PR, but I have a few questions.
    
    The previous discussion seemed to point toward shutting down the worker when an executor is hanging. I'm guessing there's no nice way to just restart the hanging executors? Is it sufficient to call shutdown on the worker object from do-executor-heartbeats?
    
    I'm not really sure what Constants/SYSTEM_EXECUTOR_ID is for? Should it be ignored when checking for hanging executors?
    
    I'm hoping to add the zookeeper/metrics logging and shutdown functionality soon if this PR looks like it's going in the right direction.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srdo/storm STORM-956

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1209
    
----
commit c0d1c4ef6ae0d1e144f5af85174d68d5a93eb06a
Author: chuanlei <ni...@126.com>
Date:   2015-07-22T07:37:28Z

    stop worker heartbeat, when the executor threads hang-on

commit 16980a3e4e015865348afee7661157cc9a21525a
Author: chuanlei <ni...@gmail.com>
Date:   2015-07-22T08:55:39Z

    add the setup-check! to mk-threads

commit 9884c578fe8fa85197b1e5d4118598425160bb3f
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T14:57:27Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 9dd030396b0d921f25c5269e17c58b649387211d
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T18:58:29Z

    STORM-956: Add support for warning about hanging executors

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by hustfxj <gi...@git.apache.org>.
Github user hustfxj commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-196213692
  
    It looks to good.  I also hope we should see this done through both the metrics system and through writing an error into zookeeper that would show up on the UI for the component that is stuck as @revans2  @bastiliu  said. Then let users manually see what is happening.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
GitHub user srdo reopened a pull request:

    https://github.com/apache/storm/pull/1209

    STORM-956: When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

    The previous PR at https://github.com/apache/storm/pull/647 doesn't look active anymore. Having Storm tell you which components are backing up would still be a nice feature to have.
    
    I've taken a look at implementing the suggestions from the previous PR, but I have a few questions.
    
    The previous discussion seemed to point toward shutting down the worker when an executor is hanging. I'm guessing there's no nice way to just restart the hanging executors? Is it sufficient to call shutdown on the worker object from do-executor-heartbeats?
    
    I'm not really sure what Constants/SYSTEM_EXECUTOR_ID is for? Should it be ignored when checking for hanging executors?
    
    I'm hoping to add the zookeeper/metrics logging and shutdown functionality soon if the idea of this PR is sound.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srdo/storm STORM-956

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1209
    
----
commit c0d1c4ef6ae0d1e144f5af85174d68d5a93eb06a
Author: chuanlei <ni...@126.com>
Date:   2015-07-22T07:37:28Z

    stop worker heartbeat, when the executor threads hang-on

commit 16980a3e4e015865348afee7661157cc9a21525a
Author: chuanlei <ni...@gmail.com>
Date:   2015-07-22T08:55:39Z

    add the setup-check! to mk-threads

commit 9884c578fe8fa85197b1e5d4118598425160bb3f
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T14:57:27Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 9dd030396b0d921f25c5269e17c58b649387211d
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T18:58:29Z

    STORM-956: Add support for warning about hanging executors

commit af0a56df27f6d4765dd868cf85c0633832cd8a72
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-14T20:45:12Z

    STORM-956: Put hang check in its own function, added worker shutdown call, scheduled hang check interval to match lowest configured timeout.

commit 9bb475213b18d1bbac9277f04dba8381e7a2fa2a
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-15T19:29:00Z

    STORM-956: Log error in Zookeeper when executor is hanging

commit 1a7fd227eade0c205acc1a23aa80ce3e3b845818
Author: Stig Døssing <sd...@it-minds.dk>
Date:   2016-03-18T12:03:21Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 159a169e2cdf475fb69a3895fc354b9729d0bb6f
Author: Stig Døssing <sd...@it-minds.dk>
Date:   2016-03-20T09:14:46Z

    STORM-956: Added support for extending hang timeout via outputcollectors. Added tests for zk error logging, per-component configuration, disabling hang checks and hang checks warning and shutting down worker properly.

commit 5000a78fa5b7a43f49e4dbdec7ccf7f87714cb70
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-21T14:55:49Z

    Add comment to Config about disabling hang checking

commit 3396fdc48647a49b1c44d59c3cbe09c098376c4a
Author: Stig Rohde Døssing <st...@gmail.com>
Date:   2016-03-24T21:04:19Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 76b090746baca87719de5fdbeedbb4a8a7f75aed
Author: Stig Rohde Døssing <sd...@it-minds.dk>
Date:   2016-04-04T13:14:03Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit b6f963387b6ef4d9153481dbd1faf502bbabecf5
Author: Stig Rohde Døssing <sd...@it-minds.dk>
Date:   2016-04-07T07:32:34Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit bb2585f1db71e7523b5a62572b540e4773805a53
Author: Stig Rohde Døssing <sd...@it-minds.dk>
Date:   2016-04-08T10:49:38Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit f717555bf4e3c4c8bf2ba0639694c4486ceb4e73
Author: Stig Rohde Døssing <sd...@it-minds.dk>
Date:   2016-04-08T12:37:46Z

    STORM-956: Remove automatic notifyNotHanging from outputcollector methods

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #1209: STORM-956: When the execute() or nextTuple() hang on exte...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the issue:

    https://github.com/apache/storm/pull/1209
  
    I think this feature is too fiddly for the user, and it doesn't really seem like there's any demand for it. Closing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo closed the pull request at:

    https://github.com/apache/storm/pull/1209


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-196988005
  
    I'll add some tests that errors are going into Zookeeper when executors hang soon/that workers are getting killed only if the option is enabled, and will take a look at adding a function to OutputCollector for manually updating the last hang check timestamp. Could you elaborate on what you'd like this code to do regarding metrics? Is it just a counter of how many times a hanging executor has been found?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-199348649
  
    Sorry, pressed the close button by accident.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-207432109
  
    @hustfxj Please take another look when you have time, I think this feature is done :)
    
    The test failure seems to occur intermittently in storm-kafka, looks unrelated. It ran for me locally with all-tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo closed the pull request at:

    https://github.com/apache/storm/pull/1209


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-198878063
  
    Will reopen this when the PR is closer to done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-196028832
  
    I don't want this merged yet, just posted to get feedback :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
GitHub user srdo reopened a pull request:

    https://github.com/apache/storm/pull/1209

    STORM-956: When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

    The previous PR at https://github.com/apache/storm/pull/647 doesn't look active anymore. Having Storm tell you which components are backing up would still be a nice feature to have.
    
    I've taken a look at implementing the suggestions from the previous PR, but I have a few questions.
    
    The previous discussion seemed to point toward shutting down the worker when an executor is hanging. I'm guessing there's no nice way to just restart the hanging executors? Is it sufficient to call shutdown on the worker object from do-executor-heartbeats?
    
    I'm not really sure what Constants/SYSTEM_EXECUTOR_ID is for? Should it be ignored when checking for hanging executors?
    
    I'm hoping to add the zookeeper/metrics logging and shutdown functionality soon if the idea of this PR is sound.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srdo/storm STORM-956

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1209
    
----
commit c0d1c4ef6ae0d1e144f5af85174d68d5a93eb06a
Author: chuanlei <ni...@126.com>
Date:   2015-07-22T07:37:28Z

    stop worker heartbeat, when the executor threads hang-on

commit 16980a3e4e015865348afee7661157cc9a21525a
Author: chuanlei <ni...@gmail.com>
Date:   2015-07-22T08:55:39Z

    add the setup-check! to mk-threads

commit 9884c578fe8fa85197b1e5d4118598425160bb3f
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T14:57:27Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 9dd030396b0d921f25c5269e17c58b649387211d
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T18:58:29Z

    STORM-956: Add support for warning about hanging executors

commit af0a56df27f6d4765dd868cf85c0633832cd8a72
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-14T20:45:12Z

    STORM-956: Put hang check in its own function, added worker shutdown call, scheduled hang check interval to match lowest configured timeout.

commit 9bb475213b18d1bbac9277f04dba8381e7a2fa2a
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-15T19:29:00Z

    STORM-956: Log error in Zookeeper when executor is hanging

commit 1a7fd227eade0c205acc1a23aa80ce3e3b845818
Author: Stig Døssing <sd...@it-minds.dk>
Date:   2016-03-18T12:03:21Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 159a169e2cdf475fb69a3895fc354b9729d0bb6f
Author: Stig Døssing <sd...@it-minds.dk>
Date:   2016-03-20T09:14:46Z

    STORM-956: Added support for extending hang timeout via outputcollectors. Added tests for zk error logging, per-component configuration, disabling hang checks and hang checks warning and shutting down worker properly.

commit 5000a78fa5b7a43f49e4dbdec7ccf7f87714cb70
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-21T14:55:49Z

    Add comment to Config about disabling hang checking

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by hustfxj <gi...@git.apache.org>.
Github user hustfxj commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-196220384
  
    Spout itself emits messages by SpoutOutputCollector 's emit().  If lots of messages failed, then acker will trigger SpoutOutputCollector emits those failed messages. It may happen dead lock. Because down bolts may slow to handle messsages and it will block emit(),  then spout/acker thread will block.  Thus others messages which is send by those can't be handled by acker. So the bolts will block. The scene may be called "loop dead lock".  I want say that this PR is sound to this scene. Because It can make us find the dead lock in time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-199348274
  
    The test failure looks to be the same as on master https://travis-ci.org/apache/storm/jobs/117200843. It ran locally for me with all-tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
GitHub user srdo reopened a pull request:

    https://github.com/apache/storm/pull/1209

    STORM-956: When the execute() or nextTuple() hang on external resources, stop the Worker's heartbeat

    The previous PR at https://github.com/apache/storm/pull/647 doesn't look active anymore. Having Storm tell you which components are backing up would still be a nice feature to have.
    
    I've taken a look at implementing the suggestions from the previous PR, but I have a few questions.
    
    The previous discussion seemed to point toward shutting down the worker when an executor is hanging. I'm guessing there's no nice way to just restart the hanging executors? Is it sufficient to call shutdown on the worker object from do-executor-heartbeats?
    
    I'm not really sure what Constants/SYSTEM_EXECUTOR_ID is for? Should it be ignored when checking for hanging executors?
    
    I'm hoping to add the zookeeper/metrics logging and shutdown functionality soon if the idea of this PR is sound.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srdo/storm STORM-956

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1209
    
----
commit c0d1c4ef6ae0d1e144f5af85174d68d5a93eb06a
Author: chuanlei <ni...@126.com>
Date:   2015-07-22T07:37:28Z

    stop worker heartbeat, when the executor threads hang-on

commit 16980a3e4e015865348afee7661157cc9a21525a
Author: chuanlei <ni...@gmail.com>
Date:   2015-07-22T08:55:39Z

    add the setup-check! to mk-threads

commit 9884c578fe8fa85197b1e5d4118598425160bb3f
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T14:57:27Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 9dd030396b0d921f25c5269e17c58b649387211d
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-13T18:58:29Z

    STORM-956: Add support for warning about hanging executors

commit af0a56df27f6d4765dd868cf85c0633832cd8a72
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-14T20:45:12Z

    STORM-956: Put hang check in its own function, added worker shutdown call, scheduled hang check interval to match lowest configured timeout.

commit 9bb475213b18d1bbac9277f04dba8381e7a2fa2a
Author: Stig Døssing <st...@gmail.com>
Date:   2016-03-15T19:29:00Z

    STORM-956: Log error in Zookeeper when executor is hanging

commit 1a7fd227eade0c205acc1a23aa80ce3e3b845818
Author: Stig Døssing <sd...@it-minds.dk>
Date:   2016-03-18T12:03:21Z

    Merge branch 'master' of https://github.com/apache/storm into STORM-956

commit 159a169e2cdf475fb69a3895fc354b9729d0bb6f
Author: Stig Døssing <sd...@it-minds.dk>
Date:   2016-03-20T09:14:46Z

    STORM-956: Added support for extending hang timeout via outputcollectors. Added tests for zk error logging, per-component configuration, disabling hang checks and hang checks warning and shutting down worker properly.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request #1209: STORM-956: When the execute() or nextTuple() hang ...

Posted by srdo <gi...@git.apache.org>.
Github user srdo closed the pull request at:

    https://github.com/apache/storm/pull/1209


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo closed the pull request at:

    https://github.com/apache/storm/pull/1209


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-199321366
  
    The hang checks should now support writing errors to Zookeeper, extending the timeout by interacting with an OutputCollector, setting different time limits per component, and disabling the checks entirely by setting the timelimit/check frequency to null. I took a quick look at the metrics system, but can't really see a nice way of logging to it if we're potentially shutting down the worker when this system is triggered.
    
    I'm not sure the automatic/manual hang timeout resets are really necessary on SpoutOutputCollector, since I don't see a case where a user would want to hang in nextTuple while still emitting tuples. Let me know if they should be removed.
    
    I think this PR is ready for re-review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-956: When the execute() or nextTuple() h...

Posted by hustfxj <gi...@git.apache.org>.
Github user hustfxj commented on the pull request:

    https://github.com/apache/storm/pull/1209#issuecomment-197155218
  
    @srdo  I mean that we can report errors into Zookeeper whether the option is enabled. And the metrisc is not only record the counter, but also the timeout time. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---