You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@zookeeper.apache.org by wangchaod <gi...@git.apache.org> on 2018/08/29 13:53:55 UTC

[GitHub] zookeeper pull request #611: ZOOKEEPER-3131

GitHub user wangchaod opened a pull request:

    https://github.com/apache/zookeeper/pull/611

    ZOOKEEPER-3131

    org.apache.zookeeper.server.WatchManager resource leak

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangchaod/zookeeper master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/611.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #611
    
----
commit 36f1e786689f66d1044d6de96e8063dd6e828477
Author: wangchaod <wa...@...>
Date:   2018-08-29T13:50:51Z

    ZOOKEEPER-3131
    org.apache.zookeeper.server.WatchManager resource leak

----


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @lvfangmin Let me try to wrap what has been discussed with this issue so far.
    - You're saying that #611 is not needed, because this PR fixes it and don't need to maintain `watch2Paths` explicitly,
    - If that's true and this patch targets Netty only, which patch has fixed the NIO side?
    - One of our customers reported inconsistent watch reporting in 4lw commands which I successfully fixed with #611 . Are sure it's still not needed given that it should be only a reporting issue?
    (`wchc` command uses `watch2paths` for reporting, so if it still contains watchers which are not effective anymore, the report is meaningless and might be misleading) 


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by lvfangmin <gi...@git.apache.org>.

Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @anmolnar here are some clarifications based on your comments:
    
    * #611 is not needed anymore because the issue @wangchaod reported here is due to the 'leaking' watcher entry in watch2Paths after the cnxn is actually closed. And this patch fixed it by removing it from watch2Paths when closing.
    * NIO is already doing the right job, which will remove itself from watch2Paths when the cnxn is closed.  (this patch is actually following what we're doing in NIO).
    * Leaving an empty watcher entry in watch2paths shouldn't result in inconsistent watches. One thing I can think of which might cause the watch inconsistent is due to the race condition of closing cnxn and the on flying add watch request in pipeline, which should be fixed it in the #590 with the isDeadWatch check.


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by wangchaod <gi...@git.apache.org>.

Github user wangchaod commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @lvfangmin OK thanks


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by wangchaod <gi...@git.apache.org>.

Github user wangchaod commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @enixon In my running environment, the temporarily empty state of its watch set will not pass soon. So then, the empty watcher increases to more than 10 thousand, and the OutOfMemoryError occurred in the end.   Is that because the corresponding ServerCnxn is not closed?
    Plus, the jvm parameter: -Xmx512m


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by lvfangmin <gi...@git.apache.org>.

Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @hanm based on @wangchaod's description in the JIRA, #612 should fix the exact issue.
    
    @anmolnar as I commented in the JIRA, not removing the empty map is optimized from performance perspective, like not need to remove/add the entry for the same connection again after the watch triggered and client reset the watch, which will reduce the GC overhead, also it reduces the lock contention. 
    
    Since the maximum number of map entries in watch2Paths won't larger than the total connections number on that server, leave an empty map entry there won't cost too much memory overhead. 
    
    Without the fix in #612, the map will grow infinitely as the ZK runs and client reconnect going on, which will cause the OOM problem.
    
    This is actually a critical bug for NettyServerCnxn implementation, wondering why we didn't saw other people report this. 
    
    Internally, we're using NIO, but we're planning to move onto Netty to get the TLS support. So recently, I'm working on improving the Netty implementation, adding features missing compared to NIO, for example, the maximum cnxns per IP limit support, the ideal cnxn expirer, the session map to improve the performance, etc, I'll send out the diffs, please help review.


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @wangchaod can we move on?


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by hanm <gi...@git.apache.org>.

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    #612 is now merged. @wangchaod Do you know what was the root cause of your OOM issue? Were you using Netty instead of NIO? Would #612 fix your case? I am curious if #612 does not fix your case because that implies that you get a lot of connections with empty watchers and I am expecting you reach connection limit even before those watchers can cause OOM.


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by lvfangmin <gi...@git.apache.org>.

Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @wangchaod just send another PR for removing the watcher when closing netty cnxn: https://github.com/apache/zookeeper/pull/612.
    
    As I mentioned, this will also be fixed in PR of ZOOKEEPER-1177, but let's get this fixed separately.


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2221/



---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by enixon <gi...@git.apache.org>.

Github user enixon commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    I see. It depends on what you are optimizing. The Watcher will be removed from watch2Paths when the ServerCnxn for it is closed. Keeping a set in the map for it strikes me as an optimization based on the heuristic that clients that create one watch are likely to make others and that the empty state of their watches is a temporary one. This is vulnerable to creep if connections only make single watches during their lifetime and do not renew when they're fired. 
    
    My inclination is to keep the current behavior. Far more often, a client workload requires re-establishing a watch after it has fired and the temporarily empty state of its watch set will soon pass. Aggressively cleaning up the map just makes more work in recreating entries and allocating new HashSets.


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @enixon I think it leaks the map anyway, because when the new Watcher is added, new Watcher instance gets created which will be added to Map anyway and the empty one remains forever.
    
    @wangchaod Would you please consider adding a quick unit test for the patch. Otherwise, it looks good.


---

[GitHub] zookeeper issue #611: ZOOKEEPER-3131

Posted by lvfangmin <gi...@git.apache.org>.

Github user lvfangmin commented on the issue:

    https://github.com/apache/zookeeper/pull/611
  
    @wangchaod it would be great if you can confirm this is fixed, so that we can abandon the PR.


---