Posted to user@couchdb.apache.org by Sivan Greenberg <si...@omniqueue.com> on 2010/07/28 10:24:14 UTC

beam CPU hog

Hi all,

 I have two Python scripts that use CouchDBKit[0] to do their work.
(This is actually part of a generic setup for automatic shared PHP
session storage with fault tolerance and replication, which will
probably be released as FOSS once it is of satisfactory quality).

 One of the scripts implements a conflict resolution strategy; the
other is a very simple test for the process. It seems that CouchDB is
not holding up to the "load", where the load is created by something
like:
for i in $(seq 1 100); do ./test_ConflictResolver.py ; done

Since this is executing in a sequential manner, I did not expect it to
cause any load issues, so I'm a bit surprised.

I am putting the necessary files to run this online[1]; hopefully
some CouchDB experts can comment on it.

To test, just extract the files into the same folder and change the
log path in configuration.py.

Please contact me about any issues running the test code; I hope for
a quick fix, as I want to test this in production.

-Sivan

P.S. Please feel free to comment on the code in general, as any
improvement I can introduce will be blessed! :=)

[0]: http://couchdbkit.org/
[1]: http://omniqueue.com/relaxession/
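
[Editor's note: the actual resolver lives at [1]; as a rough
illustration of the kind of strategy such a resolver might implement,
here is a minimal last-write-wins pick among conflicting session
revisions. The `timestamp` field and both helper names are
hypothetical, not taken from the actual scripts.]

```python
# Minimal last-write-wins sketch: given the bodies of a doc's current
# winning revision plus its _conflicts revisions, pick the one with
# the newest "timestamp" field. A resolver would then delete the
# losing revisions so _conflicts disappears from the doc.
# All field and function names here are hypothetical.

def pick_winner(revisions):
    """revisions: list of doc dicts, each with _rev and a timestamp."""
    return max(revisions, key=lambda doc: doc["timestamp"])

def losing_revs(revisions, winner):
    """_rev values a resolver would delete to clear _conflicts."""
    return [doc["_rev"] for doc in revisions if doc["_rev"] != winner["_rev"]]
```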

Re: beam CPU hog

Posted by Sivan Greenberg <si...@omniqueue.com>.
On Thu, Jul 29, 2010 at 1:22 PM, Sivan Greenberg <si...@omniqueue.com> wrote:
> Given that, and in view of the fault tolerance CouchDB provides, does
> one need to check whether a node was brought up and wait for all the
> changes to be processed, or is it something more in my resolution
> code, in which I have to wait until the current chunk of resolutions
> is done?

Actually, thinking of it now, this does not make sense. It would mean
dealing with the new conflicts that arise while the node is catching
up, like an endless loop (new conflicts can arrive between the moment
it finishes the chunk and the moment it announces it is ready for
writes again).

Maybe there's some way to know that a db is in sync, either completely
or within a preset delta, and only then let it be used for reads and
writes of sessions?
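
[Editor's note: one rough way to approximate "in sync within a preset
delta" (an assumption, not something CouchDB exposes as a single call)
is to compare the source db's update_seq with the sequence the
replication has checkpointed through.]

```python
# Sketch: decide whether a node may serve sessions, based on how far
# its replication checkpoint lags the source's update_seq. The two
# sequence numbers would come from GET /{db} ("update_seq") and from
# the replication status ("source_last_seq"); this shows only the
# comparison itself.

def in_sync(source_update_seq, replicated_through_seq, max_lag=0):
    lag = source_update_seq - replicated_through_seq
    return lag <= max_lag

# max_lag=0 demands complete sync; a small positive delta tolerates a
# few unreplicated updates before the node is put back into rotation.
```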

-Sivan

Re: beam CPU hog

Posted by Sivan Greenberg <si...@omniqueue.com>.
On Thu, Jul 29, 2010 at 12:15 AM, Randall Leeds <ra...@gmail.com> wrote:
> It seems like with your deployment you would rather be using
> continuous replication since your data is server-side and your servers
> are generally connected. It might be harder to generate conflicts in a
> test scenario this way because you have to "beat" the replication to
> the other copy.

I realize that; however, to make the test rigorous, as if the node had
just come back after a crash and conflicts had arisen while it was
down, I started replicating just after the conflicting saves were
done, to simulate that. But come to think of it, I don't think there
are many solutions, or that it is even possible, for one node to
quickly resolve all the conflicts it accumulated from being down for,
say, an hour. It is simply impossible given that the other nodes have
accumulated a substantial backlog.

Given that, and in view of the fault tolerance CouchDB provides, does
one need to check whether a node was brought up and wait for all the
changes to be processed, or is it something more in my resolution
code, in which I have to wait until the current chunk of resolutions
is done?

>
> The result of using continuous replication would mean that you will
> get update conflicts more often than conflicting revisions. Your
> client can handle this by fetching the document and deciding (based on
> the timestamp) whether it should clobber the current value.

I am going to test this now (which is actually the healthy production
state of my setup), but as I said, you cannot guarantee replication
will work all the time, as we're dealing with a "cluster of unreliable
commodity hardware" (tm) :-)


I'm starting to fear the rabbit hole deepens....=)

Any ideas how to approach this if my approach is even half wrong?


-Sivan

Re: beam CPU hog

Posted by Randall Leeds <ra...@gmail.com>.
On Wed, Jul 28, 2010 at 02:55, Sivan Greenberg <si...@omniqueue.com> wrote:
> Another odd thing: I can't figure out why, after 100-300 back-to-back
> runs of the script, from some point onward it fails in the setUp
> method when trying to save the doc, without any apparent failure in
> runTest that could cause tearDown not to run and leave a residual doc
> in the db, causing the version conflict...

Not sure about this one.

> On Wed, Jul 28, 2010 at 12:26 PM, Sivan Greenberg <si...@omniqueue.com> wrote:
>> On Wed, Jul 28, 2010 at 12:02 PM, Randall Leeds <ra...@gmail.com> wrote:
>>>
>>> 2) Give us more info about your problem with 'load'. You really
>>> shouldn't care about the cpu load. How long your test takes is much
>>> more important. If you're getting a decent number of operations/second
>>> and your cpu is pinned you should be thrilled.
>>
>> The problem is that the servers designated to host CouchDB are very
>> strong servers (8 gigs of RAM, lots of CPU cores), but they also run
>> a couple of other things, like an http server and possibly a couple
>> more services. So when the CPU is hogged, performance of the web
>> apps is affected.
>>
>> However, I would not care too much about this to start with if
>> CouchDB's performance actually provided the result. The idea is to
>> have conflicts resolved in real or near-real time, to be as coherent
>> as possible with the winning doc or latest version of a shopping
>> cart. Right now, when the db is a bit big (less than 1GB still),
>> operations take so long that the expected outcome, conflicts cleared
>> (e.g. _conflicts goes away from the doc obj when asked for with
>> conflicts=true), does not happen; when simulating a user action that
>> triggers fetching his session details, he gets the wrong version...

It seems like with your deployment you would rather be using
continuous replication since your data is server-side and your servers
are generally connected. It might be harder to generate conflicts in a
test scenario this way because you have to "beat" the replication to
the other copy.

The result of using continuous replication would mean that you will
get update conflicts more often than conflicting revisions. Your
client can handle this by fetching the document and deciding (based on
the timestamp) whether it should clobber the current value.
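
[Editor's note: the fetch-and-decide step described above could look
roughly like this. The `timestamp` field is a hypothetical application
field, not a CouchDB one, and the helper name is invented.]

```python
# Sketch of handling an update conflict (HTTP 409): re-fetch the
# current doc and only overwrite it if our copy is newer by timestamp.

def resolve_update_conflict(current_doc, our_doc):
    """Return the doc that should end up stored, stamped with the
    current winning _rev so a retried save will succeed."""
    if our_doc["timestamp"] > current_doc["timestamp"]:
        merged = dict(our_doc)
        merged["_rev"] = current_doc["_rev"]  # retry against latest rev
        return merged
    return current_doc  # ours is stale; keep what's already there
```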

Randall

Re: beam CPU hog

Posted by Sivan Greenberg <si...@omniqueue.com>.
Another odd thing: I can't figure out why, after 100-300 back-to-back
runs of the script, from some point onward it fails in the setUp
method when trying to save the doc, without any apparent failure in
runTest that could cause tearDown not to run and leave a residual doc
in the db, causing the version conflict...

Sivan

On Wed, Jul 28, 2010 at 12:26 PM, Sivan Greenberg <si...@omniqueue.com> wrote:
> On Wed, Jul 28, 2010 at 12:02 PM, Randall Leeds <ra...@gmail.com> wrote:
>> 1) Spinning up a replication means a bunch of HTTP requests. The fact
>> that the requests are local only means you're not seeing network
>> latency and your cpu is more pinned than in the real world. You could
>> try creating 100 documents first and causing conflicts on all of them
>> in a single replication or replace your urls with bare db names (i.e.,
>> just 'session_store_rep' and 'session_store') to get local replication
>> which bypasses the HTTP layer. The latter option will also reduce the
>> amount of json encoding/decoding you're doing.
>
> I actually want the test to be as close to real-world conditions as
> possible, so bypassing it feels like running the test less
> rigorously. Please correct me if I'm wrong, as there's a chance I
> haven't gotten to the bottom of this yet. I did change the target to
> not use http, as this is how it will be in the real deployment;
> thanks for noticing that :)
>
>>
>> 2) Give us more info about your problem with 'load'. You really
>> shouldn't care about the cpu load. How long your test takes is much
>> more important. If you're getting a decent number of operations/second
>> and your cpu is pinned you should be thrilled.
>
> The problem is that the servers designated to host CouchDB are very
> strong servers (8 gigs of RAM, lots of CPU cores), but they also run
> a couple of other things, like an http server and possibly a couple
> more services. So when the CPU is hogged, performance of the web apps
> is affected.
>
> However, I would not care too much about this to start with if
> CouchDB's performance actually provided the result. The idea is to
> have conflicts resolved in real or near-real time, to be as coherent
> as possible with the winning doc or latest version of a shopping
> cart. Right now, when the db is a bit big (less than 1GB still),
> operations take so long that the expected outcome, conflicts cleared
> (e.g. _conflicts goes away from the doc obj when asked for with
> conflicts=true), does not happen; when simulating a user action that
> triggers fetching his session details, he gets the wrong version...
>
> For the record, no, I can't use sticky sessions :) (this has come up
> once or twice)
>
>> Hoping this helps you out :)
>
> This is a start :-)
>
> Many thanks so far!
>
> Sivan
>

Re: beam CPU hog

Posted by Sivan Greenberg <si...@omniqueue.com>.
On Wed, Jul 28, 2010 at 12:02 PM, Randall Leeds <ra...@gmail.com> wrote:
> 1) Spinning up a replication means a bunch of HTTP requests. The fact
> that the requests are local only means you're not seeing network
> latency and your cpu is more pinned than in the real world. You could
> try creating 100 documents first and causing conflicts on all of them
> in a single replication or replace your urls with bare db names (i.e.,
> just 'session_store_rep' and 'session_store') to get local replication
> which bypasses the HTTP layer. The latter option will also reduce the
> amount of json encoding/decoding you're doing.

I actually want the test to be as close to real-world conditions as
possible, so bypassing it feels like running the test less rigorously.
Please correct me if I'm wrong, as there's a chance I haven't gotten
to the bottom of this yet. I did change the target to not use http, as
this is how it will be in the real deployment; thanks for noticing
that :)

>
> 2) Give us more info about your problem with 'load'. You really
> shouldn't care about the cpu load. How long your test takes is much
> more important. If you're getting a decent number of operations/second
> and your cpu is pinned you should be thrilled.

The problem is that the servers designated to host CouchDB are very
strong servers (8 gigs of RAM, lots of CPU cores), but they also run a
couple of other things, like an http server and possibly a couple more
services. So when the CPU is hogged, performance of the web apps is
affected.

However, I would not care too much about this to start with if
CouchDB's performance actually provided the result. The idea is to
have conflicts resolved in real or near-real time, to be as coherent
as possible with the winning doc or latest version of a shopping cart.
Right now, when the db is a bit big (less than 1GB still), operations
take so long that the expected outcome, conflicts cleared (e.g.
_conflicts goes away from the doc obj when asked for with
conflicts=true), does not happen; when simulating a user action that
triggers fetching his session details, he gets the wrong version...

For the record, no, I can't use sticky sessions :) (this has come up
once or twice)

> Hoping this helps you out :)

This is a start :-)

Many thanks so far!

Sivan

Re: beam CPU hog

Posted by Randall Leeds <ra...@gmail.com>.
On Wed, Jul 28, 2010 at 01:41, Sivan Greenberg <si...@omniqueue.com> wrote:
> Just another note: the problem seems to grow larger as the database
> size expands. I am going to time each operation now to see if I can
> find a specific culprit.

I looked at your code and I've got a couple things for you to try.

1) Spinning up a replication means a bunch of HTTP requests. The fact
that the requests are local only means you're not seeing network
latency and your cpu is more pinned than in the real world. You could
try creating 100 documents first and causing conflicts on all of them
in a single replication or replace your urls with bare db names (i.e.,
just 'session_store_rep' and 'session_store') to get local replication
which bypasses the HTTP layer. The latter option will also reduce the
amount of json encoding/decoding you're doing.
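
[Editor's note: the difference is just in the _replicate body; bare db
names trigger local (non-HTTP) replication on that server, while full
URLs go through the HTTP stack. The helper below is invented for
illustration; the _replicate semantics are CouchDB's.]

```python
# Build a POST /_replicate body. With bare db names ("session_store")
# CouchDB 1.x replicates locally, bypassing its own HTTP layer; with
# full URLs it goes through HTTP even for databases on the same node.

import json

def replicate_body(source, target, continuous=False):
    body = {"source": source, "target": target}
    if continuous:
        body["continuous"] = True
    return json.dumps(body)
```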

2) Give us more info about your problem with 'load'. You really
shouldn't care about the cpu load. How long your test takes is much
more important. If you're getting a decent number of operations/second
and your cpu is pinned you should be thrilled. Imagine if you were
encoding an audio file and your cpu wasn't at 100%. You would be mad
because it would take longer than it has to. In general, if you can be
cpu bound you're doing things right as long as things are humming
along quickly. It means couchdb is fulfilling your request as quickly
as possible and neither the network nor the disk is a bottleneck.

Hoping this helps you out :)

Randall

Re: beam CPU hog

Posted by Sivan Greenberg <si...@omniqueue.com>.
Just another note: the problem seems to grow larger as the database
size expands. I am going to time each operation now to see if I can
find a specific culprit.

Sivan

On Wed, Jul 28, 2010 at 11:24 AM, Sivan Greenberg <si...@omniqueue.com> wrote:
> Hi all,
>
>  I have two Python scripts that use CouchDBKit[0] to do their work.
> (This is actually part of a generic setup for automatic shared PHP
> session storage with fault tolerance and replication, which will
> probably be released as FOSS once it is of satisfactory quality).
>
>  One of the scripts implements a conflict resolution strategy; the
> other is a very simple test for the process. It seems that CouchDB is
> not holding up to the "load", where the load is created by something
> like:
> for i in $(seq 1 100); do ./test_ConflictResolver.py ; done
>
> Since this is executing in a sequential manner, I did not expect it to
> cause any load issues, so I'm a bit surprised.
>
> I am putting the necessary files to run this online[1]; hopefully
> some CouchDB experts can comment on it.
>
> To test, just extract the files into the same folder and change the
> log path in configuration.py.
>
> Please contact me about any issues running the test code; I hope for
> a quick fix, as I want to test this in production.
>
> -Sivan
>
> P.S. Please feel free to comment on the code in general, as any
> improvement I can introduce will be blessed! :=)
>
> [0]: http://couchdbkit.org/
> [1]: http://omniqueue.com/relaxession/
>