Posted to user@couchdb.apache.org by Marcus Persson Lindqvist <ma...@gmail.com> on 2009/06/14 23:03:21 UTC

_multiple_ databases memory profile

Greetings all!

In short - what factors are involved in memory consumption for CouchDB with a
large (x * 1000+) number of databases? Any hints welcome.



I've recently started to dig CouchDB a lot and am using it as the primary
storage of a backend-type application, to much success and relaxation. It
really saves a lot of pain not having to care much about the details of a
repository.

Now, however, my application is growing in data and I'm looking for some
pointers on what to expect in terms of memory consumption (my primary
bottleneck).

The data is highly segmented - I'm using about 4 different "classes" of
documents from X different "sources" (X is currently 200 but might grow to
2000 or more), none of which need to know about the others. Hoping to keep
the btrees small, I figured I would use a separate database for each
class/source combination, yielding 800 DBs at the moment.

And kudos to Couch for making it a breeze to implement; it was really nice
and smooth.

But now I'm starting to see some memory consumption growth and I'm looking
for pointers on how to think about this. What mechanisms actually consume
memory? What should one avoid? Is it better to use fewer databases from this
point of view?

What would be a reasonable memory footprint, and how does one calculate it?
Currently it consumes about 300 MB.

Each database is really just a pet store. I need to extract documents in
order. That's it. I'm currently doing this with a simple view. (Is there any
"trivial" built-in way of getting documents in reversed insertion order, btw?)

And yeah, the load for most databases is really low, so insert/output
performance could be compromised for lower memory consumption.

Any hints, tips or experiences?

Marcus

Re: _multiple_ databases memory profile

Posted by Chris Anderson <jc...@apache.org>.
On Sun, Jun 14, 2009 at 2:03 PM, Marcus Persson
Lindqvist<ma...@gmail.com> wrote:
> [...]
> Each database is really just a pet store. I need to extract documents in
> order. That's it. I'm currently doing this with a simple view. (Is there any
> "trivial" built-in way of getting documents in reversed insertion order, btw?)
>

If you really just want to pull them out in the order you put them in,
have a look at the _all_docs_by_seq view:

http://127.0.0.1:5984/test_suite_db/_all_docs_by_seq?descending=true

Do note that if you update a document after inserting it, it will move
to a new position in the list.
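
For completeness, here's a minimal sketch of reading that list from a client
script. Everything outside the URL above is my own assumption: the database
name, the Python client code, and that plain descending=true is all you need.

    # Sketch only: list doc ids in reverse insertion/update order via
    # _all_docs_by_seq (each row's key is the database update sequence).
    import json
    import urllib.request

    COUCH = "http://127.0.0.1:5984"   # assumed local server from the URL above
    DB = "test_suite_db"              # assumed database name from the URL above

    def ids_by_seq_descending():
        url = "%s/%s/_all_docs_by_seq?descending=true" % (COUCH, DB)
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        # Rows arrive newest-first; a doc that was updated later moves to the
        # front of this list, as noted above.
        return [row["id"] for row in body["rows"]]

    if __name__ == "__main__":
        print(ids_by_seq_descending())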




-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: _multiple_ databases memory profile

Posted by Paul Davis <pa...@gmail.com>.
Marcus,

Just to be sure: what follows from the max_open_dbs setting Jan was talking
about is that once you hit that limit, your memory usage should be
semi-constant. The memory used by databases would then be semi-indirectly
tunable via that parameter. As always, I would
urge you to create some benchmark scripts and report your findings
back to the list or in a blog post. We're always looking for more
benchmarks.
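
To make that concrete, here is the rough shape such a script could take. This
is only a sketch under my own assumptions (a local CouchDB on port 5984 with
no admin credentials required, and a Linux box where the Erlang VM shows up
as a "beam" process for pgrep/ps); all names and sizes in it are made up:

    # Benchmark sketch: create many small databases, write a few docs into
    # each, and report the CouchDB (beam) process's resident memory before
    # and after.
    import json
    import subprocess
    import urllib.request

    COUCH = "http://127.0.0.1:5984"   # assumed local, unauthenticated server
    NUM_DBS = 200                     # placeholder sizes; scale toward 800+
    DOCS_PER_DB = 10

    def request(method, path, body=None):
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(COUCH + path, data=data, method=method,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def beam_rss_kb():
        # Resident set size of the Erlang VM in kB (Linux-specific).
        pid = subprocess.check_output(["pgrep", "-f", "beam"]).split()[0].decode()
        return int(subprocess.check_output(["ps", "-o", "rss=", "-p", pid]))

    if __name__ == "__main__":
        print("beam RSS before: %d kB" % beam_rss_kb())
        for i in range(NUM_DBS):
            db = "/bench_db_%04d" % i
            request("PUT", db)   # create the database (412 error if it exists)
            for j in range(DOCS_PER_DB):
                request("POST", db, {"class": "bench", "n": j})
        print("beam RSS after:  %d kB" % beam_rss_kb())

Repeating the RSS reading with different NUM_DBS values (and with a server
restart in between) should show whether memory tracks the number of open
databases or the total database count.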

There are methods for both reporting memory and running garbage
collection in Erlang, but nothing that's exposed by CouchDB to
clients. You can check couch_util:should_flush/1 for examples of both
calls. Adding a memory usage stat to the stats API probably wouldn't
be out of the question. My (very limited) understanding of the garbage
collection scheme in Erlang is that there's no real global garbage
collection, so there's not really a clear way to offer some sort of
aggressiveness setting unless there are native system-wide parameters
I haven't seen (and I haven't looked for such a thing).

HTH,
Paul Davis

On Wed, Jun 17, 2009 at 9:40 AM, Marcus Persson
Lindqvist<ma...@gmail.com> wrote:
> Thanks for your answers; it's frustrating to not have a general idea.
> Are there any tweaks for reducing memory consumption, like setting GC
> aggressiveness or something? Can the VM output its memory status somehow?
> [...]

Re: _multiple_ databases memory profile

Posted by Marcus Persson Lindqvist <ma...@gmail.com>.
Thanks for your answers; it's frustrating to not have a general idea.
Are there any tweaks for reducing memory consumption, like setting GC
aggressiveness or something? Can the VM output its memory status somehow?

On Tue, Jun 16, 2009 at 3:40 PM, Jan Lehnardt <ja...@apache.org> wrote:

> [...]

Re: _multiple_ databases memory profile

Posted by Jan Lehnardt <ja...@apache.org>.
On 14 Jun 2009, at 23:03, Marcus Persson Lindqvist wrote:

> Greetings all!
>
> In short - what factors are involved in memory consumption for CouchDB with a
> large (x * 1000+) number of databases? Any hints welcome.

Each database requires a file handle and at least one Erlang process to
be open and used. Views add more file handles and Erlang processes.
Both file handles and processes are cheap (processes even more so than
file handles).

CouchDB has a max_open_dbs setting that controls the number of
databases that are open at any time. It is an LRU cache, so unused
databases drop out of that cache as new ones are opened. CouchDB
has been tested with ~1 000 000 databases in total and 20 000 open
databases at any time.

You may need to raise system limits to accommodate a large number
of file handles and you might want to increase the max_open_dbs
setting.
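
For reference, that would be a one-line change in the ini files plus a higher
file descriptor limit (e.g. via ulimit -n) for the Erlang VM. I believe the
key is spelled max_dbs_open in the [couchdb] section, but check the
default.ini that ships with your build; the value below is only a placeholder:

    [couchdb]
    ; pick something above the number of databases you expect to be
    ; in active use at the same time
    max_dbs_open = 5000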

There is also a small write buffer for each db that gets flushed every
second. Its size and the flush interval can be configured on a per-server
basis.

Cheers
Jan
--

