You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by tedxia <gi...@git.apache.org> on 2015/07/17 06:03:17 UTC

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

GitHub user tedxia opened a pull request:

    https://github.com/apache/storm/pull/639

    STORM-946: We should remove Closed Client form cached-node+port->socket in worker

    Patch for [STORM-946](https://issues.apache.org/jira/browse/STORM-946)
    
    The client may be Closed status after reconnect failed, and we will remove closed client from Context to escape memory leak.
    But there is also reference for the closed Client in cached-node+port->socket in worker, for this reason we should also remove closed Client from cached-node+port->socket.
    Meanwhile there is another reason for us to do so. Think about this situation: worker A connect to worker B1 B2, but for some reason worker B1 B2 died at the same, then nimbus reschedule worker B1 B1. And new B1 B2 may partly rescheduled at the some host:port as old B1 B2, that is (old B1: host1+port1, old B2: host2+port2, new B1: host2+port2, new B2: host3+port3). Worker A realized worker B1 B2 died and start reconnect to worker B1 B2, but before new worker B1 and old B2 have the same host+port, and by the current logic, we will remove old B1 Client and and create new Client for new worker B2, and do nothing to old B2 and new B1 because they have the same host+port. This will result the topology stop processing tuples. Once we remove closed Client from cached-node+port->socket before refresh-connections, this will not happen again.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tedxia/storm ted-remove-closed-socket-from-cached-node+port-socket

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/639.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #639
    
----
commit 28c75bd3e6d9925c71acb8878c1d3786abfa0ba2
Author: xiajun <xi...@xiaomi.com>
Date:   2015-07-17T03:59:28Z

    STORM-946: We should remove Closed Client form cached-node+port->socket in worker

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by caofangkun <gi...@git.apache.org>.
Github user caofangkun commented on a diff in the pull request:

    https://github.com/apache/storm/pull/639#discussion_r34865678
  
    --- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---
    @@ -288,7 +288,18 @@
                                           (filter-key (complement (-> worker :task-ids set))))
                   needed-connections (-> needed-assignment vals set)
                   needed-tasks (-> needed-assignment keys)
    -              
    +
    +              closed-connections (into [] (for [[node+port socket] @(:cached-node+port->socket worker)]
    +                                            (if (.isClosed socket) node+port)))
    --- End diff --
    
    -1 
    
    We need take attention to unneeded connections but not closed connections ,
    sometimes closed connections need to be reconnected by [retry policy](https://github.com/apache/storm/blob/master/storm-core/src/jvm/backtype/storm/messaging/netty/Client.java#L138) to make sure reconnect losted connections.
    [refresh-connections](https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/worker.clj#L313) will try remove unneeded connections
    
    by the way:
     
    ```  
    java.lang.IllegalArgumentException: No matching field found: isClosed for class  backtype.storm.messaging.netty.Client 
    ```
    May be you should use [client.status()](https://github.com/apache/storm/blob/master/storm-core/src/jvm/backtype/storm/messaging/netty/Client.java#L191) somewhere when you need .



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by kevinconaway <gi...@git.apache.org>.
Github user kevinconaway commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-219860409
  
    Actually it looks like this issue
    
    >There is another more serious issue that lead to huge problems in one of out topologies whenever a worker crashed due to some exception.
    If worker A sucessfully connects to worker B for the first time during startup but worker B closes the connection for some reason before the :worker-active-flag is set to true (here https://github.com/apache/storm/blob/v0.9.6/storm-core/src/clj/backtype/storm/daemon/worker.clj#L356), there will be no further reconnect attempts, since no messages will be processed and neither send() nor flushMessages() will ever be called.
    
    may be fixed by STORM-1609 with the addition of the client keepalive TimerTask


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by kevinconaway <gi...@git.apache.org>.
Github user kevinconaway commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-219828158
  
    @revans2 was there ever an issue filed about this? We are seeing the same behavior in Storm 0.10.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by vesense <gi...@git.apache.org>.
Github user vesense commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-122184898
  
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by tedxia <gi...@git.apache.org>.
Github user tedxia commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-122206400
  
    Will you have a look at the description of this pull request, it describe the situation.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm issue #639: STORM-946: We should remove Closed Client form cached-node...

Posted by nicom <gi...@git.apache.org>.
Github user nicom commented on the issue:

    https://github.com/apache/storm/pull/639
  
    > may be fixed by STORM-1609 with the addition of the client keepalive TimerTask
    
    I can confirm that it is


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by tedxia <gi...@git.apache.org>.
Github user tedxia commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-122199889
  
    Yes we should use client.status(),  and I am sorry for this error, I will add new patch to fix this.
    
    @caofangkun Will you have a look at Client, once client's status change to Closed, it will never have change to change to other state, that is to say reconnect won't be called after client closed.
    
    The problem happened in our product cluster when some net error happen, but after net recover, the topology can not recover. After I remove closed client first, this problem never come again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by nicom <gi...@git.apache.org>.
Github user nicom commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-203982738
  
    @caofangkun 
    
    Sorry for commenting on such an old pull request, but we recently also encountered the problem this patch is trying to address.
    
    > Does have any situation will call (.close socket) but not removed from cached-node+port->socket ?
    
    Indeed there is one other situation where the close() method is called:
    https://github.com/apache/storm/blob/v0.9.6/storm-core/src/jvm/backtype/storm/messaging/netty/Client.java#L505
    
    This only happens if the number of reconnects reaches the number set in the 'storm.messaging.netty.max_retries' config setting (I believe the default is 300). But the problem is aggravated by the fact, that the `connectionAttempts` counter is never reset. So if a topology runs for a long enough time this might become a problem.
    Actually this issue has been fixed in 0.10.0
    
    There is another more serious issue that lead to huge problem in one of out topologies whenever a worker crashed due to some exception.
    If worker A sucessfully connects to worker B for the first time during startup but worker B closes the connection for some reason before the `:worker-active-flag` is set to true (here https://github.com/apache/storm/blob/v0.9.6/storm-core/src/clj/backtype/storm/daemon/worker.clj#L356), there will be no further reconnect attempts, since no messages will be processed and neither `send()` nor `flushMessages()` will ever be called.
    
    In our case this frequenty happened because nimbus often rescheduled the whole toplogy right around the same time when the supervisor restarted the failed worker, leading to a race condition. This might have to do with the fact that `nimbus.task.timeout.secs` and `supervisor.worker.timeout.secs` have the same value by default.
    
    Tedxia's patch won't fix this problem I belive, since the status returned by `status()` will be `ConnectionWithStatus$Status/Connecting` in this situation.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by caofangkun <gi...@git.apache.org>.
Github user caofangkun commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-122203636
  
    @tedxia 
    I could only find  two situations ( am I right?) will change ```client.status``` to ```closed``` by call ```(.close socket)```. and do not need reconnect at all
    1:  when [ reresh-connections](https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/worker.clj#L314) remove unneeded connections  and wil remove connecion from ``` cached-node+port->socket ```
    2: when [shutdown](https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/worker.clj#L462) worker 
    
    Does have any  situation will call  ```(.close socket)``` but not removed from  ``` cached-node+port->socket ``` ?
    
    
    
    
    
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] storm pull request: STORM-946: We should remove Closed Client form...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-204058115
  
    @nicom 
    
    Please file a JIRA at https://issues.apache.org/jira under the STORM project describing what you described here.  We ran into a similar issue, but I would have to check on the progress of that fix/etc, and if it could be back ported to 0.9.x


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---