You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by eshioji <gi...@git.apache.org> on 2015/05/31 11:02:21 UTC

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

GitHub user eshioji opened a pull request:

    https://github.com/apache/storm/pull/568

    [STORM-763] nimbus reassigned worker A to another machine, but other worker's netty client can't connect to the new worker A

    (This PR builds on [STORM-839](https://github.com/apache/storm/pull/566), so it includes its changes)
    
    While I'm not 100% certain if it's the same issue, I also saw these messages logged a lot (hundreds per second per machine).
    
    Upon investigation, I found that after firing a connection request, `Client.Connector` waits synchronously for the connection. Because there is only one scheduler thread per worker by default, this means only one connection request per worker process can be serviced at a time. This becomes a problem when connections are lost frequently for whatever reason, and is especially exacerbated when each connection attempt takes time.
    
    Commit `aa5c2d71` changes the code so that the `Connector` fires & forgets the connection request, and lets Netty's callback thread handle the rest. There is no blocking or expensive operation done here, so using Netty's thread should be fine.
    
    This change did decrease the amount of ERROR logs in my topology, but I still saw a fair amount. Upon investigation, I found that this is simply because each call to `send` produces an ERROR log. So `ad8112d10` changes the code so that a connection error is exactly logged twice; once when the connection error is detected, and the second time when the re-connection is concluded (either a successful re-connect, or permanent failure).
    
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/eshioji/storm STORM-763

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/568.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #568
    
----
commit ed8ab3ec194f19c75fc2f5c000609204f04b50e8
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-28T19:42:05Z

    Simplified the flow and removed the lock that was causing the deadlock

commit 91b8eb3840432e47b79f40abebec8304627732a8
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-28T19:46:17Z

    Bump version

commit b7d84bdc7fd3de34f45a94131cdbb6bfbd3763dc
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-28T21:27:31Z

    Remove background flushing because it doesn't seem necessary. Netty's Channel queues up written data on an unbounded buffer. The background flushing seems to have been added to avoid this, but in practice it was probably doing it anyways because flushMessages(), which is called by send() doesn't check for isWritable. Moreover, queuing on an unbounded buffer seems fine because back pressure is provided by MAX_PENDING_TUPLE. If OOME occurs due to this buffer overflowing, it seems reasonable that one has to reduce MAX_PENDING_TUPLE, rather than Storm trying to cope with it by dropping messages.

commit 679e42bc1e38f51c2759667b03cb45322c6a793b
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-28T21:31:35Z

    Change to a SNAPSHOT version for deployment purposes

commit 27a92e2aa3488c0203f500306e0583ff9e7e1e82
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-29T09:32:16Z

    Remove (now) dead comment and code

commit 09bf6e1b5d9d351f2a60cd9a32e0239752cf437a
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-29T10:23:46Z

    Merge branch '0.9.x-branch' into STORM-839
    
    Conflicts:
    	examples/storm-starter/pom.xml
    	external/storm-hbase/pom.xml
    	external/storm-hdfs/pom.xml
    	external/storm-kafka/pom.xml
    	pom.xml
    	storm-buildtools/maven-shade-clojure-transformer/pom.xml
    	storm-core/pom.xml
    	storm-dist/binary/pom.xml
    	storm-dist/source/pom.xml

commit fdb394c158ccd87c0a06c060239766d666a8db5a
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-29T10:28:20Z

    Accidentally committed generated file

commit 36eff0a409336d9247ce2e96b52fcf9630a446fb
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-29T10:30:38Z

    Remove dead method

commit 884f496dcd06bc9413c04fdf730aca2cfb4239c6
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-29T10:32:39Z

    Remove comment line I forgot to remove

commit aa5c2d719bb3913285d4274cfcf8364df958b1ff
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-29T13:40:47Z

    Do not block in Connector. This task runs on a single (by default) thread that is shared among all Clients. If the task blocks, other reconnection requests can't be processed, resulting in a lot of messages being dropped. By not blocking, the thread should be able to service reconnection requests a lot quicker.

commit ad8112d10d662ae81498d11f78a602b97243a142
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-30T23:54:31Z

    Log error message for dropping messages only once per connection error (logging it everytime on send was flooding the log).

commit ee4e94a01c29caacea480b12d57a7d77174bb9be
Author: Enno Shioji <es...@gmail.com>
Date:   2015-05-30T23:54:52Z

    Remove obsolete metrics

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by miguno <gi...@git.apache.org>.
Github user miguno commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-108339300
  
    And like Bobby already said:  many thanks for your continued work and improvements, @eshioji!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-117169045
  
    The main thing is around merge conflicts.  0.9.x is very different from 0.10.x and master, so when I go the cherry-pick there are a lot of conflicts.  0.10.x and master are still very close so I think just one pull request should be enough between the two.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-107654049
  
    @eshioji Sounds great. The metric looks good.  Just from looking at the code it looks good, but I do want to play around with it a bit before giving it a final +1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by eshioji <gi...@git.apache.org>.
Github user eshioji commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-108978724
  
    @revans2 I was able to bring back the performance where it was, if not a bit higher:
    
    |     | 0.9.5-SNAPSHOT | STORM-763 |
    |-----|----------------|-----------|
    | 10  | 23.8           | 23.9      |
    | 1K  | 78.6           | 82.5      |
    | 2K  | 75.6           | 74.5      |
    | 10K | 71.1           | 72.9      |
    
    The change brings back the batching of in-between calls. However, it does it a bit differently; instead of using a background process, it uses Netty's Channel interest notification callback. This should have an added benefit in theory, because pending messages are immediately flushed as space becomes available, rather than potentially waiting up to FLUSH_CHECK_INTERVAL_MS. I also brought back @miguno 's graceful shutdown. Let me know if this is good. 
    
    One question; I noticed that you guys plan to make 0.9.5 the last 0.9.x release. This PR is against 0.9.x, but should I rather open it against 0.10.x (or master)?
    
    PS
    @miguno No worries at all, I'm glad I can give back, however small it may be :)
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by eshioji <gi...@git.apache.org>.
Github user eshioji commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-112825846
  
    Hi @revans2 , did you have time to look at the results above?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-115709421
  
    @eshioji Sorry I took so long to respond.  A bunch of arbitrary end of the quarter fires broke out that sucked up all of my time :).
    
    I am +1 for merging this in, but I want to see a corresponding pull request for master too.  I cannot pull code just into 0.9.x without going into 0.10.x and master too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/568


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-107517980
  
    Do you have any performance numbers around this change?  This is on the critical path, and I want to verify that there has been no regression around this.
    
    I am also a bit concerned about losing the pending metric.  We really need a good way to know if this connection is getting backed-up.  Meaning the network connection is slow but still up or is just simply saturated.  We really need an equivalent of pending for us to know how much data Netty has queued up to send.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-108366003
  
    @eshioji I think getting this into a 0.9.x release is doable, especially if the performance issues can be addressed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-107976283
  
    @eshioji,
    
    I ran some speed of light [tests](https://github.com/yahoo/storm-perf-test) on my macbook pro, that was running all of the processes, and had 4 slots configured.  I ran with 4 spouts, 8 bolts, acking enabled and I varied the maxSpoutPending with 10, 1000, and 2000.  I calculated the average number of messages per millisecond.
    
    maxSpoutPending | 0.11.0-SNAPSHOT | STORM-763 | diff
    ----------------|---------|--------------------------|-------
    10 | 28.6491 | 26.9551 | -1.694 (-5.9%)
    1000 | 45.2541 | 43.0215 | -2.2326 (-4.9%)
    2000 | 42.7067 | 40.5542 | -2.1525 (-5.0%)
    
    As you can see this change is having an impact of about -5% on the performance in this one use case.  I do like very much that the code is smaller and STORM-763 seems significant enough that I am inclined to let this in as is, but I really would like to have you take a crack at possibly doing some optimizations to the code.  I'm not sure but my guess is that it is the 64k high water mark, which is switching the channel to no longer be writable, or the fact that we do not batch in-between calls to send.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by knusbaum <gi...@git.apache.org>.
Github user knusbaum commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-121344016
  
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by eshioji <gi...@git.apache.org>.
Github user eshioji commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-116872325
  
    @revans2 No worries! 
    
    Just to clarify, do I create a pull request for 0.10.x AND master?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by miguno <gi...@git.apache.org>.
Github user miguno commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-108338846
  
    > @eshioji wrote:
    > Also I have a question, maybe @miguno could help; I've removed the graceful shutdown which tries to flush all pending message before the Client is closed, mostly to make it easier for me to fix the deadlock. However now I'm worried I might have removed something significant. Do you think I should bring it back?
    
    IIRC the graceful shutdown was primarily (but not exclusively) for non-acking topologies to minimize any potential data loss.  I think it would be preferable if we'd continue to allow for graceful shutdowns, if possible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by eshioji <gi...@git.apache.org>.
Github user eshioji commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-117171697
  
    @revans2 I see, OK, I'll ping you again when they are ready.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by eshioji <gi...@git.apache.org>.
Github user eshioji commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-107822705
  
    @revans2 Sure, take your time and let me know if you find something to change!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by eshioji <gi...@git.apache.org>.
Github user eshioji commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-108263716
  
    @revans2 Thanks for the performance testing, I could replicate very similar results on our machine. I'm trying a few things, hopefully I can get it back to where it was.
    
    In case you haven't noticed, this change includes fix for [STORM-839](https://github.com/apache/storm/pull/566) which seems pretty critical (I actually encountered a deadlock on my live cluster). Do you think the patch should be applied to 0.9.x if the performance concern is alleviated? Or are you more thinking of 0.11.x? (I noticed you tested with 0.11)
    
    Also I have a question, maybe @miguno could help; I've removed the graceful shutdown which tries to flush all pending message before the Client is closed, mostly to make it easier for me to fix the deadlock. However now I'm worried I might have removed something significant. Do you think I should bring it back?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-121338790
  
    +1 the code compiles and the tests all pass


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

Posted by eshioji <gi...@git.apache.org>.
Github user eshioji commented on the pull request:

    https://github.com/apache/storm/pull/568#issuecomment-107578216
  
    @revans2 Re: pending metric, I see. I removed it thinking that it exclusively had to do with the `pendingMessage` field, but now I realise it was actually tracking messages inside Netty's internal buffer, too. I brought it back; should be equivalent to the old metrics.
    
    Re: performance, I couldn't see any performance difference on my test cluster (5 node, processes around 60K tuples per sec) under "normal" conditions. With `v0.9.4` it would start to go haywire after several hours and start to consume lots of CPU and a lot of tuples start to fail. That symptom seems to have been alleviated with the patch. However, this topology feeds off a live data stream so admittedly it's not a strict comparison (because the volume of data is not completely the same from run to run). Is there a standard way you guys use to verify performance regression?
    
    
      


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---