Posted to dev@zookeeper.apache.org by "Vishal Kathuria (JIRA)" <ji...@apache.org> on 2011/09/10 01:00:09 UTC

[jira] [Created] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Enabling a large number of watches for a large number of clients
----------------------------------------------------------------

                 Key: ZOOKEEPER-1177
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1177
             Project: ZooKeeper
          Issue Type: Improvement
          Components: server
    Affects Versions: 3.3.3
            Reporter: Vishal Kathuria
            Assignee: Vishal Kathuria
             Fix For: 3.5.0


In my ZooKeeper deployment, I see the watch manager consuming several GB of memory, so I dug a bit deeper.

In the scenario I am testing, I have 10K clients connected to an observer. There are about 20K znodes in ZooKeeper, each about 1K, so about 20M of data in total.
Each client fetches and puts watches on all the znodes. That is 200 million watches.

It seems a single watch takes about 100 bytes. I am currently at 14,528,037 watches and, according to the YourKit profiler, WatchManager already holds 1.2 GB. This is not going to work, as it might end up needing 20 GB of RAM just for the watches.

So we need a more compact way of storing watches. Here are the possible solutions.
1. Use a bitmap instead of the current hashmap. In this approach, each znode gets a unique id when it is created. For every session, we keep a bitmap that indicates the set of znodes the session is watching. A bitmap, assuming 100K znodes, would be about 12K. For 10K sessions, we could track all watches in about 120M instead of 20G. (A minimal sketch of this idea follows after this list.)
2. This second idea is based on the observation that clients watch znodes in sets (for example, all znodes under a folder). Multiple clients watch the same set, and the total number of sets is a couple of orders of magnitude smaller than the total number of znodes. In my scenario there are about 100 sets. So instead of tracking watches at the znode level, track them at the set level. It may mean that get also needs to be implemented at the set level. With this, we could store the watches in about 100M.
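To make option 1 concrete, here is a minimal sketch of per-session bitmaps. The session id type, the dense znode id, and the class/method names are illustrative assumptions, not an actual implementation:

{code}
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch only: one BitSet of znode ids per session.
// With 100K znodes a BitSet is ~12.5KB, so 10K sessions land near the
// 12K/120M estimate above.
class SessionBitmapWatches {
    private final Map<Long, BitSet> watchesBySession = new HashMap<Long, BitSet>();

    // Record that the session watches the znode with the given dense id.
    synchronized void addWatch(long sessionId, int znodeId) {
        BitSet bits = watchesBySession.get(sessionId);
        if (bits == null) {
            bits = new BitSet();
            watchesBySession.put(sessionId, bits);
        }
        bits.set(znodeId);
    }

    // True if the session currently holds a watch on the znode.
    synchronized boolean hasWatch(long sessionId, int znodeId) {
        BitSet bits = watchesBySession.get(sessionId);
        return bits != null && bits.get(znodeId);
    }

    // Drop all watches belonging to a closed session.
    synchronized void removeSession(long sessionId) {
        watchesBySession.remove(sessionId);
    }
}
{code}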


Are there any other suggested solutions?

Thanks

 



[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177742#comment-13177742 ] 

Patrick Hunt commented on ZOOKEEPER-1177:
-----------------------------------------

Good point. The performance is terrible but not entirely for the reason you mentioned (although that's contributing).

With 10k watchers and 10k un-watched paths, all of the paths being triggered, the NEW case runs in ~6 sec, while the old runs in ~50 ms.

The problem is the loop itself. I tried re-running the test with all of the locking turned off (including making the table a HashMap rather than a ConcurrentHashMap), and while it does run faster, it still takes ~3 sec. I even turned off the entire "if (watchedPaths.get(pathId))" conditional/block and it runs about the same (~3 sec).

Doesn't seem like we should commit this with such poor performance for the common case. Perhaps we need a hybrid approach? I don't see how that could easily be done though.

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Zhihong Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180610#comment-13180610 ] 

Zhihong Yu commented on ZOOKEEPER-1177:
---------------------------------------

Patrick mentioned the lack of a SparseBitSet.
I found this post: http://blog.rapleaf.com/dev/2010/12/17/memory-efficient-sparse-bitsets/ which leads to https://github.com/lemire/javaewah
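As a rough, hypothetical illustration of why a compressed bitmap helps here, assuming javaewah's EWAHCompressedBitmap with its set-bits-in-increasing-order contract; the ids and numbers are made up:

{code}
import java.util.BitSet;
import com.googlecode.javaewah.EWAHCompressedBitmap;

// Sketch only: a session watching three znodes whose ids are large and
// sparse (e.g. czxid-derived) pays for the whole prefix in java.util.BitSet,
// while a run-length-encoded bitmap stays tiny.
public class SparseWatchBitmapDemo {
    public static void main(String[] args) {
        int[] watchedIds = { 1000000, 5000000, 9999999 };

        BitSet plain = new BitSet();
        EWAHCompressedBitmap compressed = new EWAHCompressedBitmap();
        for (int id : watchedIds) {
            plain.set(id);
            compressed.set(id); // EWAH requires ids in increasing order
        }

        // Roughly 10M bits (~1.25MB) backing just three watches.
        System.out.println("BitSet bytes (approx): " + plain.size() / 8);
        // The compressed representation is only a few dozen bytes here.
        System.out.println("EWAH bytes           : " + compressed.sizeInBytes());
    }
}
{code}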

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vishal Kathuria (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103183#comment-13103183 ] 

Vishal Kathuria commented on ZOOKEEPER-1177:
--------------------------------------------

Thanks for reviewing and for your comments, Mahadev. I was leaning towards 1 as well.

There is one issue I am worried about if we use czxid as the index into the bitmap: it will keep growing over time, since recently created nodes will have the latest zxid as their value. The bitmap will become sparse and we will need some sort of bitmap compression scheme.

Bitmap compression would increase the cost of a bit lookup.

I had another thought on allocating an id for a znode without changing the storage format.

We could allocate the node id for each znode as we deserialize the data tree during restore at startup. All these ids would be sequential, so the maximum node id would be the same as the total number of nodes.

For new nodes created after startup, we can keep allocating node ids from the same counter.

The only other thing we will have to design for is reusing node ids as znodes are deleted and created; this should be easy by keeping a bitmap of free node id values.
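As a rough sketch of that allocation scheme (the class and field names are only for illustration, not actual code):

{code}
import java.util.BitSet;

// Sketch only: ids are handed out sequentially while deserializing the data
// tree at startup, and ids freed by deleted znodes are recycled from a
// free-id bitmap before a fresh id is minted.
class ZnodeIdAllocator {
    private final BitSet freeIds = new BitSet(); // set bit => id is free
    private int nextFreshId = 0;                 // next never-used id

    // Allocate an id, preferring one released by a deleted znode.
    synchronized int allocate() {
        int recycled = freeIds.nextSetBit(0);
        if (recycled >= 0) {
            freeIds.clear(recycled);
            return recycled;
        }
        return nextFreshId++;
    }

    // Return an id to the pool when its znode is deleted.
    synchronized void release(int id) {
        freeIds.set(id);
    }
}
{code}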

Thanks


[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vikas Mehta (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179323#comment-13179323 ] 

Vikas Mehta commented on ZOOKEEPER-1177:
----------------------------------------

That was a big miss on my part for the non-watched case. Thanks for the performance analysis and for catching this issue.

I did attempt to keep the two-way hashmap in one of the implementations, and at the time the locking in WatchManager was still very coarse (especially having to synchronize on the path when adding a watch, which added a lot of contention because all the watchers set their watches on the path again after the watch was triggered).

With the current locking approach, we can try the two-way hashmap again, taking the lock on the pathID only when updating watches on that path (the pathID-to-watchers map entry). I will try to have the new patch with two-way locks ready in the next few days.
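As a rough illustration of a two-way map with per-entry locking along those lines (all names and types are assumptions for the sketch, not the patch's actual classes):

{code}
import java.util.BitSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch only: both directions keyed by dense integer ids; each update
// synchronizes on the single map entry it touches rather than on the
// whole WatchManager.
class TwoWayWatchIndex {
    private final ConcurrentMap<Integer, BitSet> watchersByPath =
            new ConcurrentHashMap<Integer, BitSet>(); // pathId -> session ids
    private final ConcurrentMap<Integer, BitSet> pathsBySession =
            new ConcurrentHashMap<Integer, BitSet>(); // sessionId -> path ids

    void addWatch(int pathId, int sessionId) {
        BitSet watchers = entryFor(watchersByPath, pathId);
        synchronized (watchers) {   // lock only this path's entry
            watchers.set(sessionId);
        }
        BitSet paths = entryFor(pathsBySession, sessionId);
        synchronized (paths) {      // lock only this session's entry
            paths.set(pathId);
        }
    }

    // Sessions to notify when the path changes; watches are one-shot, so the
    // entry is removed. (An addWatch racing with this removal is the kind of
    // window a real implementation still has to close.)
    BitSet triggerWatch(int pathId) {
        BitSet watchers = watchersByPath.remove(pathId);
        if (watchers == null) {
            return new BitSet();
        }
        synchronized (watchers) {
            return (BitSet) watchers.clone();
        }
    }

    private static BitSet entryFor(ConcurrentMap<Integer, BitSet> map, int key) {
        BitSet existing = map.get(key);
        if (existing != null) {
            return existing;
        }
        BitSet fresh = new BitSet();
        BitSet raced = map.putIfAbsent(key, fresh);
        return raced != null ? raced : fresh;
    }
}
{code}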

Patrick, can you send me your perf tests so that I can compare results?

Thanks,
Vikas

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102492#comment-13102492 ] 

Mahadev konar commented on ZOOKEEPER-1177:
------------------------------------------

Vishal,
 I like idea 1). Each znode has a txn id with which it is created, which is already unique per znode (look at czxid in the stat). We can use this to uniquely identify the znode. Doing 1) would be pretty easy.


[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150984#comment-13150984 ] 

Hadoop QA commented on ZOOKEEPER-1177:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12503831/Zookeeper-after-resolving-merge-conflicts.patch
  against trunk revision 1202360.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/791//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/791//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/791//console

This message is automatically generated.

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151453#comment-13151453 ] 

Hadoop QA commented on ZOOKEEPER-1177:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12503927/ZooKeeper-with-fix-for-findbugs-warning.patch
  against trunk revision 1202557.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/792//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/792//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/792//console

This message is automatically generated.

[jira] [Updated] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-1177:
------------------------------------

    Attachment: ZOOKEEPER-1177.patch

Vikas, here's the updated patch with my perf tests. The easiest way I've found to run them is:

bq. ant -Dtest.junit.maxmem=2g -Dtest.output=yes -Dtestcase=WatchManagerPerf clean test-core-java

You mentioned earlier seeing contention issues with multiple reader/writer threads. I think you should augment my *Perf test with some additional multi-threaded tests so that we can all see the issues/results. Thanks!

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179668#comment-13179668 ] 

Patrick Hunt commented on ZOOKEEPER-1177:
-----------------------------------------

Vikas, can you incorporate Ben's idea as well: leave the old implementation around, add yours as a new implementation, with a new interface supported by both. That will simplify testing and perf verification. A system property that allows switching the implementation would be great and could be used by the tests. I'll upload my perf test today.

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177533#comment-13177533 ] 

Hadoop QA commented on ZOOKEEPER-1177:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508907/ZOOKEEPER-1177.patch
  against trunk revision 1225352.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 2 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/870//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/870//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/870//console

This message is automatically generated.

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178958#comment-13178958 ] 

Patrick Hunt commented on ZOOKEEPER-1177:
-----------------------------------------

@daniel -- I was thinking the same thing over the weekend. We'd probably want to refactor a bit while we're doing it.

@ben sounds reasonable, especially if we want to enable performance comparison. We should also look at the locking overhead that you highlighted earlier.

I was also planning to contribute my perf tests. I'll work on this further unless someone else wants to.

vishal - have you made any other changes to this code that are not yet contributed, or found any issues?

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177492#comment-13177492 ] 

Patrick Hunt commented on ZOOKEEPER-1177:
-----------------------------------------

In reviewing this code I noticed a couple of concerns:

1) It's unfortunate there is no SparseBitSet; in the worst-case scenario we might end up wasting memory due to the id assignment. Say you had 1k sessions and 1m znodes. If the first session set watches on all 1m znodes and the other 999 sessions each set a watch on znode 1m (the one with the highest id), those 999 sessions would have essentially empty bitsets, save for the very last bit (#1m), yet each bitset would still have to span the full id range. Not sure if this is a real concern in practice though.

2) More concerning is removeWatcher: it removes the bitset, but not the associated entries in path2Id/id2Path. Granted, cleanup is an expensive operation in this case; however, without it we'll end up "leaking" path2Id/id2Path mappings in some cases.

1 doesn't seem like a blocker given the benefits. Perhaps 2 is not a concern either, given that in most cases the set of paths is bounded, and if a znode is deleted the watch manager will clean up at that time.

Thoughts?
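One possible shape for the cleanup in 2), sketched with a per-path watcher count; path2Id/id2Path come from the discussion above, while the count map and method name are assumptions:

{code}
import java.util.HashMap;
import java.util.Map;

// Sketch only: when the last watcher on a path id goes away, drop the
// path2Id/id2Path entries too, so the id maps cannot grow without bound.
class WatchIdCleanup {
    private final Map<String, Integer> path2Id = new HashMap<String, Integer>();
    private final Map<Integer, String> id2Path = new HashMap<Integer, String>();
    private final Map<Integer, Integer> watcherCount = new HashMap<Integer, Integer>();

    // Called once per path bit cleared while a watcher's bitset is removed.
    synchronized void onWatchRemoved(int pathId) {
        Integer count = watcherCount.get(pathId);
        if (count == null) {
            return;                       // nothing tracked for this id
        }
        if (count <= 1) {
            watcherCount.remove(pathId);  // last watcher gone: reclaim both maps
            String path = id2Path.remove(pathId);
            if (path != null) {
                path2Id.remove(path);
            }
        } else {
            watcherCount.put(pathId, count - 1);
        }
    }
}
{code}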

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131738#comment-13131738 ] 

Patrick Hunt commented on ZOOKEEPER-1177:
-----------------------------------------

This is awesome work guys, thanks.

Question: in your testing, what is the impact of triggering a large number of watches on overall operation latency?

bq. Each client fetches and puts watches on all the znodes.

Say I delete a znode in your example; that will trigger 10k notifications to be sent (one to each session). What is the impact on the latency of this request (the delete), both with and without this patch?


Based on the investigations you've been doing, should we have concerns about overall service availability due to large numbers of watches being triggered concurrently?

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Benjamin Reed (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177701#comment-13177701 ] 

Benjamin Reed commented on ZOOKEEPER-1177:
------------------------------------------

I'm curious about the effect this has on the overall performance of non-watched operations. If you have 10K watchers, then you are going to acquire and release 10k locks in removeWatches for every write operation, correct?

[jira] [Updated] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vikas Mehta (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikas Mehta updated ZOOKEEPER-1177:
-----------------------------------

    Attachment: Zookeeper-after-resolving-merge-conflicts.patch

Patch after incorporating review comments and resolving merge conflicts.

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vikas Mehta (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134579#comment-13134579 ] 

Vikas Mehta commented on ZOOKEEPER-1177:
----------------------------------------

Hi Patrick,

   Sorry for the late response. Let me know if you were looking for something different than what I answered:

1) in your testing what is the impact of triggering a large number of watches on overall operation latency?

[vikas] Without this change, with large number of watches, zookeeper would run out of memory storing all the watches. With this change, it is now bound by the network bandwidth. One difference between the current and the old implementation is that now triggerWatch() loops through all the watchers to find that watches it needs to trigger compared to using the reverse map in the previous version to prevent this scan for slight benefit in cases when number of watches per path is much smaller compared to number of watchers.

2) Say I delete a znode in your example, that will trigger 10k notifications to be sent (one to each session) - what is the impact on the latency of this request (the delete), both with and without this patch?

[vikas] Without the patch, as mentioned above, we are not able to run ZooKeeper with this many watches. If we do not have too many watchers (or watches overall), the impact of this change is a linear scan of the watchers to identify the watches that need to be triggered for the update/delete node operation.

3) Subsequent to the investigations you've been doing, should we have concerns on overall service availability due to large numbers of watches being triggered concurrently?

[vikas] We are thinking of implementing some throttling on the server (and maybe later on the client side as well) to prevent deterioration in ZooKeeper performance or availability.

Thanks,
Vikas
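
To make the trade-off above concrete, here is a minimal sketch of the per-watcher bitmap layout and the linear triggerWatch() scan Vikas describes. Class, field and method names are hypothetical; this is an illustration, not the attached patch.

{noformat}
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: each path gets an int id, each watcher keeps a BitSet of
// the path ids it watches, and triggerWatch() scans every watcher's BitSet.
class BitmapWatchSketch {
    private final Map<String, Integer> pathIds = new HashMap<String, Integer>();
    private final Map<Object, BitSet> watchedPaths = new HashMap<Object, BitSet>();

    synchronized void addWatch(String path, Object watcher) {
        Integer id = pathIds.get(path);
        if (id == null) {
            id = Integer.valueOf(pathIds.size());
            pathIds.put(path, id);
        }
        BitSet bits = watchedPaths.get(watcher);
        if (bits == null) {
            bits = new BitSet();
            watchedPaths.put(watcher, bits);
        }
        bits.set(id.intValue());
    }

    synchronized Set<Object> triggerWatch(String path) {
        Set<Object> triggered = new HashSet<Object>();
        Integer id = pathIds.get(path);
        if (id == null) {
            return triggered;
        }
        // The linear scan discussed above: O(number of watchers) per trigger,
        // but only one bit of memory per (watcher, path) pair.
        for (Map.Entry<Object, BitSet> e : watchedPaths.entrySet()) {
            if (e.getValue().get(id.intValue())) {
                e.getValue().clear(id.intValue()); // watches fire only once
                triggered.add(e.getKey());
            }
        }
        return triggered;
    }
}
{noformat}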
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Daniel Gómez Ferro (Commented JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178837#comment-13178837 ] 

Daniel Gómez Ferro commented on ZOOKEEPER-1177:
-----------------------------------------------

I think we could use the same trick to keep a reverse index, that is, mapping each Watcher to a unique integer id and keeping a BitSet per watched path. We would double the memory usage, but I think it's worth it.
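
A rough sketch of that reverse index, with hypothetical names (illustrative only, not code from any attached patch): watchers get integer ids and each watched path keeps a BitSet of watcher ids, so a trigger becomes a per-path lookup instead of a scan over all watchers.

{noformat}
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the reverse index trades roughly 2x the memory of the
// forward-only bitmaps for O(1) lookup of the watchers on a triggered path.
class ReverseIndexSketch {
    private final Map<Object, Integer> watcherIds = new HashMap<Object, Integer>();
    private final Map<String, BitSet> watchersByPath = new HashMap<String, BitSet>();

    synchronized void addWatch(String path, Object watcher) {
        Integer id = watcherIds.get(watcher);
        if (id == null) {
            id = Integer.valueOf(watcherIds.size());
            watcherIds.put(watcher, id);
        }
        BitSet bits = watchersByPath.get(path);
        if (bits == null) {
            bits = new BitSet();
            watchersByPath.put(path, bits);
        }
        bits.set(id.intValue());
    }

    // Returns the ids of the watchers to notify for this path, or null if none.
    synchronized BitSet triggerWatch(String path) {
        return watchersByPath.remove(path);
    }
}
{noformat}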
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180952#comment-13180952 ] 

Patrick Hunt commented on ZOOKEEPER-1177:
-----------------------------------------

Zhihong - I searched around a bit but didn't find that, I'll take a look (it's ASL licensed), thanks!
                

[jira] [Updated] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-1177:
------------------------------------

    Attachment: ZOOKEEPER-1177.patch

Added a couple of minor tweaks to the patch (final fields and such).
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140598#comment-13140598 ] 

Patrick Hunt commented on ZOOKEEPER-1177:
-----------------------------------------

Hi Vikas, yes, you can submit at any time. That just marks the issue as something that should be reviewed by a committer, typically for commit to svn. Alternately you can indicate in your submit message that the patch is a draft and you'd like initial feedback, etc. The main reason for marking it as submitted is to get the QA bot to test the patch and to highlight to a committer that you'd like the patch to be committed. Alternately you could work with the community on the mailing list (referring to this jira/patch); that happens sometimes when people are in the very early stages (say the patch doesn't even compile yet).
                

[jira] [Updated] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vikas Mehta (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikas Mehta updated ZOOKEEPER-1177:
-----------------------------------

    Attachment: ZooKeeper-with-fix-for-findbugs-warning.patch

Fixed the findbugs warning.
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Camille Fournier (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127699#comment-13127699 ] 

Camille Fournier commented on ZOOKEEPER-1177:
---------------------------------------------

This one probably deserves a reviewboard review, if you would please create one. At a glance, my eyebrows rise at comments like: // HACK: read the pathID without holding the read lock
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Benjamin Reed (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178851#comment-13178851 ] 

Benjamin Reed commented on ZOOKEEPER-1177:
------------------------------------------

Perhaps we should make WatchManager an interface so that we can select alternate implementations at runtime. That way we could also put more experimental implementations in without risking default installations.
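
One possible shape for such an interface is sketched below; the method set is inferred from how the existing watch manager is exercised elsewhere in this thread (add, remove, trigger, size) and is only a sketch, not a committed design.

{noformat}
import java.util.Set;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;

// Sketch of a pluggable watch manager; the server could select an
// implementation (hash-map based, bitmap based, experimental) at startup.
interface WatchManager {
    void addWatch(String path, Watcher watcher);
    void removeWatcher(Watcher watcher);
    Set<Watcher> triggerWatch(String path, EventType type);
    int size();
}
{noformat}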
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vikas Mehta (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140547#comment-13140547 ] 

Vikas Mehta commented on ZOOKEEPER-1177:
----------------------------------------

Is it ok for me to submit this patch?

Thanks,
Vikas
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vishal Kathuria (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178975#comment-13178975 ] 

Vishal Kathuria commented on ZOOKEEPER-1177:
--------------------------------------------

Thanks for doing such a detailed performance analysis on this, Patrick. There are no other changes to this code that we haven't yet contributed.

Re @Daniel's suggestion - as far as I recall, Vikas tried that, but we had to abandon it due to locking contention in our version of ZK that has multi-threaded reads. Vikas: if you have the details from that trial, could you please share them?
                

[jira] [Updated] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vikas Mehta (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikas Mehta updated ZOOKEEPER-1177:
-----------------------------------

    Attachment: ZooKeeper.patch

Please review the attached patch to optimize the memory needed by WatchManager. This patch also optimizes the locking in WatchManager.

With this optimization, memory consumed by WatchManager is under 100M for 200M watches.
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Vikas Mehta (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128570#comment-13128570 ] 

Vikas Mehta commented on ZOOKEEPER-1177:
----------------------------------------

Created the reviewboard review:

https://reviews.apache.org/r/2399/

Also, I removed the unit test HACK comment, because it doesn't affect the test or the WatchManager changes in this case. I was a bit unhappy because in ZooKeeper this function would always be entered while holding the read lock, but the unit test doesn't hold it (and holding the read lock is not required for this particular test).
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176831#comment-13176831 ] 

Hadoop QA commented on ZOOKEEPER-1177:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12503927/ZooKeeper-with-fix-for-findbugs-warning.patch
  against trunk revision 1225200.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/866//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/866//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/866//console

This message is automatically generated.
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179879#comment-13179879 ] 

Hadoop QA commented on ZOOKEEPER-1177:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509448/ZOOKEEPER-1177.patch
  against trunk revision 1227000.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/876//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/876//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/876//console

This message is automatically generated.
                

[jira] [Commented] (ZOOKEEPER-1177) Enabling a large number of watches for a large number of clients

Posted by "Patrick Hunt (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177485#comment-13177485 ] 

Patrick Hunt commented on ZOOKEEPER-1177:
-----------------------------------------

I ran some ghetto performance numbers against this patch on trunk (NEW) vs without (OLD)

I modified testSizeInBytes to create 10k watchers and 1k paths, with each watcher watching all the paths - 10m watches in total. (OLD failed with 10k/10k, even at 2g, while NEW ran fine with 512m)

{noformat}
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode)

ant -Dtest.junit.maxmem=2g -Dtest.output=yes -Dtestcase=WatchManagerTest clean test-core-java

add - add 10m watches
size - run size() on the manager
dump - dump the watches to /dev/null (bypath and byid)
trigger - trigger the 10m watches

the numbers settled down to something like this after letting the VM warm up:

NEW
    [junit] 1753ms to add
    [junit] size:10000000
    [junit] 1ms to size
    [junit] 3424ms to dumpwatches true
    [junit] 3066ms to dumpwatches false
    [junit] 2318ms to trigger
OLD
    [junit] 9736ms to add
    [junit] size:10000000
    [junit] 0ms to size
    [junit] 5615ms to dumpwatches true
    [junit] 3035ms to dumpwatches false
    [junit] 5530ms to trigger

notice:
add - ~5 times faster
size - approx the same, even though NEW is scanning all bitsets
dump - faster for bypath, about the same for byid
trigger - ~2 times faster
{noformat}

here are the numbers with 1k watchers and 10k paths

{noformat}
NEW
    [junit] 1219ms to add
    [junit] size:10000000
    [junit] 0ms to size
    [junit] 3527ms to dumpwatches true
    [junit] 3680ms to dumpwatches false
    [junit] 1426ms to trigger
OLD
    [junit] 7020ms to add
    [junit] size:10000000
    [junit] 1ms to size
    [junit] 3585ms to dumpwatches true
    [junit] 3251ms to dumpwatches false
    [junit] 2843ms to trigger

Both OLD and NEW do better in this case than in the 10k/1k case. NEW is still significantly ahead of OLD.

{noformat}
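
For reference, a rough, self-contained approximation of the add/trigger timing loop implied by the legend above; it reuses the illustrative BitmapWatchSketch shown earlier in this thread rather than the real WatchManager, so absolute numbers will differ.

{noformat}
// Hypothetical harness, not the modified WatchManagerTest itself.
public class WatchBenchSketch {
    public static void main(String[] args) {
        Object[] watchers = new Object[10000];            // 10k watchers
        for (int w = 0; w < watchers.length; w++) {
            watchers[w] = new Object();
        }
        BitmapWatchSketch manager = new BitmapWatchSketch();

        long start = System.currentTimeMillis();
        for (int w = 0; w < watchers.length; w++) {
            for (int p = 0; p < 1000; p++) {              // 1k paths -> 10m watches
                manager.addWatch("/path" + p, watchers[w]);
            }
        }
        System.out.println((System.currentTimeMillis() - start) + "ms to add");

        start = System.currentTimeMillis();
        for (int p = 0; p < 1000; p++) {
            manager.triggerWatch("/path" + p);
        }
        System.out.println((System.currentTimeMillis() - start) + "ms to trigger");
    }
}
{noformat}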

                