You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by d2r <gi...@git.apache.org> on 2015/01/22 18:18:38 UTC

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

GitHub user d2r opened a pull request:

    https://github.com/apache/storm/pull/392

    [STORM-636] Faster, optional retrieval of last component error

    We want to speed up interactions with the topology that call getTopologyInfo, including the Topology Page in the UI.
    
    Before change:
    - Errors written to /errors/topo-id/comp-name/eN, where N is a sequence number.
    - getTopologyInfo grabs all /errors/topo-id/*/* errors from ZK
    
    After this change:
    - Errors written to /errors/topo-id/comp-name/eN, where N is a sequence number. (unchanged)
    - Errors also written to /errors/topo-id/comp-name-last-error (this means there is an extra ZK write per error)
    - New Nimbus thrift interface method `getTopologyInfoWithOpts`
      - Only option right now is how many errors to retrieve from ZK: 0, 1, or all.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/d2r/storm storm-636-ui-errors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #392
    
----
commit 1cfa190f2efb06f8798984b43dec801e5ff20ad5
Author: Derek Dagit <de...@yahoo-inc.com>
Date:   2015-01-22T16:46:03Z

    Faster, optional retrieval of last component error

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the pull request:

    https://github.com/apache/storm/pull/392#issuecomment-71706395
  
    I did not know we only stored 10 errors, makes sense to not go with pagination if its hardcoded to 10. I will take a look at the complete PR today.
    
    Not to derail the discussion but personally, I would much rather not store errors in zk at all if its just for rendering the errors in UI. If the spouts/bolts could just store this in memory with some expiration that should suffice and we could expose  an API at worker layer to get this information directly from it. If the host dies you lose some errors but that does not seem like a big deal. The only downside will be ui would now have to make requests against worker hosts to get erros but that seems ok to me, you would also get parallelism as all these worker calls can be made in parallel. I haven't thought this through completely and its probably much more work but I would love to hear your opinion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/392


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/392#issuecomment-71864014
  
    I agree that ZK is not the ideal place to storm most things, but it is by far the most convenient.  As such unless it is shown that it is causing a significant load on ZK I would rather leave it there until we can find/build a better place to put it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by d2r <gi...@git.apache.org>.
Github user d2r commented on a diff in the pull request:

    https://github.com/apache/storm/pull/392#discussion_r23627700
  
    --- Diff: storm-core/src/storm.thrift ---
    @@ -243,6 +243,16 @@ struct SubmitOptions {
       2: optional Credentials creds;
     }
     
    +enum NumErrorsChoice {
    +  ALL,
    +  NONE,
    +  ONE
    +}
    +
    +struct GetInfoOptions {
    +  1: optional NumErrorsChoice num_err_choice;
    --- End diff --
    
    I hadn't thought about it that way, but I'm open to it.  Could you give an example of what you mean?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by d2r <gi...@git.apache.org>.
Github user d2r commented on the pull request:

    https://github.com/apache/storm/pull/392#issuecomment-71865651
  
    Agreed, I would like a review from @Parth-Brahmbhatt first if possible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on a diff in the pull request:

    https://github.com/apache/storm/pull/392#discussion_r23393694
  
    --- Diff: storm-core/src/storm.thrift ---
    @@ -243,6 +243,16 @@ struct SubmitOptions {
       2: optional Credentials creds;
     }
     
    +enum NumErrorsChoice {
    +  ALL,
    +  NONE,
    +  ONE
    +}
    +
    +struct GetInfoOptions {
    +  1: optional NumErrorsChoice num_err_choice;
    --- End diff --
    
    Instead of an Enum don't you think a pagination struct with a start and end would be more flexible? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by d2r <gi...@git.apache.org>.
Github user d2r commented on the pull request:

    https://github.com/apache/storm/pull/392#issuecomment-71059798
  
    For testing this I did the following:
    
    I modified the ExclamationTopology as follows:
    - bolts report a new RuntimeException  via reportError on each call to execute
    - changed topology.max.error.report.per.interval: 100 (default 10), so each executor/thread will write up to 100 errors to ZK every 10 seconds.  This is a quick-and-dirty way of increasing write load on ZK for testing.
    - Replicated the Exclamation bolt an additional ~450 times, so there are many more bolts in the topology.
    
    Procedure:
    - Launch the topology in Inactive state, wait for it to get assigned and stable
    - go to the topology page and activate the topology (at this point the load time is on the order of seconds)
    - Refresh the topology until I see fresh errors reported for each bolt
    - Wait a a minute
    - Refresh the page and record load times from the Firefox network inspector.
    - Repeat 4 more times
    
    Results:
      Before patch (median about a minute):
      - 57s
      - 1m13s
      - 55s
      - 51s
      - 1m34s
    
      With patch (median about 10s)
      - 12s
      - 6s
      - 8s
      - 13s
      - 10s



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/392#issuecomment-71864756
  
    The code looks fine to me +1.  Although I want to wait for @Parth-Brahmbhatt to finish his review before merging anything in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by d2r <gi...@git.apache.org>.
Github user d2r commented on the pull request:

    https://github.com/apache/storm/pull/392#issuecomment-71714975
  
    > Not to derail the discussion but personally, I would much rather not store errors in zk at all if its just for rendering the errors in UI.  If the spouts/bolts could just store this in memory with some expiration that should suffice and we could expose an API at worker layer to get this information directly from it. If the host dies you lose some errors but that does not seem like a big deal. The only downside will be ui would now have to make requests against worker hosts to get erros but that seems ok to me, you would also get parallelism as all these worker calls can be made in parallel. I haven't thought this through completely and its probably much more work but I would love to hear your opinion.
    
    Yeah, we were thinking about distributing things this way too.  We figured that the bigger problem is the heartbeats, and if we could get an improvement with less effort here, it would be worth it.  It would be a much bigger change to distribute the errors out of ZK, yet maybe it is not a bad idea.  (Also, I think it is good to persist the errors anyway, not just in memory.  Users would like to see errors on the UI even if there was some issue that brought the supervisor down—like a rolling upgrade of the cluster.)  Maybe we could file a JIRA for better gathering of errors.
    
    This change was intended to be small in scope and just give a way to get errors more efficiently when a topology has many, many components.  It was prompted by seeing topology page load times of minutes from one of our customers.  Plus, this may be less of a problem once heartbeats (and their metrics) are no longer getting sent around, but still it may not a bad idea to use a more distributed model like you suggest.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the pull request:

    https://github.com/apache/storm/pull/392#issuecomment-71878844
  
    I am +1 too, the code looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-636] Faster, optional retrieval of last...

Posted by d2r <gi...@git.apache.org>.
Github user d2r commented on a diff in the pull request:

    https://github.com/apache/storm/pull/392#discussion_r23628156
  
    --- Diff: storm-core/src/storm.thrift ---
    @@ -243,6 +243,16 @@ struct SubmitOptions {
       2: optional Credentials creds;
     }
     
    +enum NumErrorsChoice {
    +  ALL,
    +  NONE,
    +  ONE
    +}
    +
    +struct GetInfoOptions {
    +  1: optional NumErrorsChoice num_err_choice;
    --- End diff --
    
    I should add, the current ZK storage of errors isn't very flexible.  We store at most 10 errors per component (hard-coded).  With only 10 errors to work with, I am not sure how much more flexibility we get by doing pagination.  I guess this hard-coded limit of 10 is why I did not pursue pagination—it is just easier to give all 10 errors in the case that we want more than 1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---