Posted to user@couchdb.apache.org by Zdravko Gligic <zg...@gmail.com> on 2011/06/30 01:59:11 UTC

Frugal Erlang vs Resources Hungry CouchDB

Hi Folks,

In many places I have read how Erlang runs on small devices and how
(as a result) it is very frugal with resources.  I think that I have
read that or at least something to that effect.  However, none of that
seems to apply to CouchDB.

I believe that I read somewhere that shortening key names can
significantly reduce disk usage - as in, cutting it in half or
less.  However, when I asked about it on #couchdb, a very smart person
stated point blank (with a bit of attitude or maybe just conviction)
that if I was worried about disk then I should not be using CouchDB.
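
As a rough illustration of the key-name effect, one can compare the encoded size of two documents that differ only in their field names. This is a sketch under assumptions: the documents and the helper `encoded_size` are hypothetical, and CouchDB's on-disk encoding is not this JSON text, so the real ratio will differ. It only shows that field names are repeated in every document and can dominate small ones.

```python
import json

# Hypothetical documents: identical values, long vs. short field names.
long_keys = {
    "customer_first_name": "Ada",
    "customer_last_name": "Lovelace",
    "customer_birth_year": 1815,
}
short_keys = {"fn": "Ada", "ln": "Lovelace", "by": 1815}


def encoded_size(doc):
    """Size in bytes of the compact JSON encoding of doc."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


long_size = encoded_size(long_keys)
short_size = encoded_size(short_keys)
print(long_size, short_size, round(short_size / long_size, 2))
```

The saving shrinks as the values get larger relative to the names, so the "half or less" figure should only be expected for documents made of many small fields.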

In many places I have read how both DB and View compactions can free
up as much as 90% of occupied space.  Similarly, I have read how
CouchDB would be struggling on smaller VPS allocations and how a mere
2GB database would struggle with anything less than that much in RAM -
especially when compactions and/or cleanups are running.

Whenever I come across such CouchDB resource-related postings, I keep
thinking about all of those Couches on all of those mobile devices (at
least in all of those presentations and slides) and asking myself
"how do they do that?"

Regards,
teslan

Re: Frugal Erlang vs Resources Hungry CouchDB

Posted by Paul Davis <pa...@gmail.com>.
On Thu, Jun 30, 2011 at 12:29 PM, Jens Alfke <je...@mooseyard.com> wrote:
>
> On Jun 29, 2011, at 7:00 PM, sleepnova wrote:
>
> >> I think what many people are really concerned about is how size grows as
> >> the number of docs increases (space complexity).
> >> (If it grows exponentially then that's not a good sign.)
>
> It’s basically linear, assuming the database gets compacted periodically. The file format is a B-tree, like most other databases, so the extra space for interior nodes is going to be O(log n). Views, like traditional indexes, also occupy B-tree nodes, so depending on how many of those you have, they’ll occupy some extra space, but also probably a lot less than the documents themselves.
>
> It sounds like append-only writing and compaction are confusing to some people. They’re not really very complicated. If you have some familiarity with garbage collection, CouchDB works almost exactly like a copying collector[1]: new objects are allocated simply by bumping a pointer, and collection works by copying the live objects into a new space, then discarding the old one. By contrast, most other databases work like a regular memory allocator: freeing obsolete objects in place, keeping a map of free space, and reallocating that space to new objects later.
>
> —Jens
>
> [1] http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)#Copying_vs._mark-and-sweep_vs._mark-and-don.27t-sweep

Just a heads up that I'm going to be stealing that description. :D

Re: Frugal Erlang vs Resources Hungry CouchDB

Posted by Jens Alfke <je...@mooseyard.com>.
On Jun 29, 2011, at 7:00 PM, sleepnova wrote:

> I think what many people are really concerned about is how size grows as
> the number of docs increases (space complexity).
> (If it grows exponentially then that's not a good sign.)

It’s basically linear, assuming the database gets compacted periodically. The file format is a B-tree, like most other databases, so the extra space for interior nodes is going to be O(log n). Views, like traditional indexes, also occupy B-tree nodes, so depending on how many of those you have, they’ll occupy some extra space, but also probably a lot less than the documents themselves.

It sounds like append-only writing and compaction are confusing to some people. They’re not really very complicated. If you have some familiarity with garbage collection, CouchDB works almost exactly like a copying collector[1]: new objects are allocated simply by bumping a pointer, and collection works by copying the live objects into a new space, then discarding the old one. By contrast, most other databases work like a regular memory allocator: freeing obsolete objects in place, keeping a map of free space, and reallocating that space to new objects later.
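
The copying-collector analogy can be sketched in a few lines of Python. This is a toy model, not CouchDB's code: the class name `AppendOnlyStore` and its methods are made up for illustration, and a flat list stands in for the append-only B-tree file.

```python
class AppendOnlyStore:
    """Toy append-only key/value log with copying compaction."""

    def __init__(self):
        self.log = []    # every write ever made, in order: (key, value)
        self.index = {}  # key -> position of the live record in self.log

    def put(self, key, value):
        # "Bump the pointer": new data always goes at the end of the log.
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        return self.log[self.index[key]][1]

    def compact(self):
        # Copy only the live records into a new space, then drop the old one.
        new_log, new_index = [], {}
        for key, pos in self.index.items():
            new_index[key] = len(new_log)
            new_log.append(self.log[pos])
        self.log, self.index = new_log, new_index


store = AppendOnlyStore()
for revision in range(1000):
    store.put("doc", revision)  # 1000 updates to a single document
store.put("other", "x")

before = len(store.log)  # 1001 records, most of them obsolete
store.compact()
after = len(store.log)   # 2 records, one per live document
print(before, after)
```

Nothing is ever overwritten in place, which is why the log keeps growing between compactions and why readers of the old log are never disturbed by a write.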

—Jens

[1] http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)#Copying_vs._mark-and-sweep_vs._mark-and-don.27t-sweep

Re: Frugal Erlang vs Resources Hungry CouchDB

Posted by sleepnova <wa...@gmail.com>.
I think what many people are really concerned about is how size grows as
the number of docs increases (space complexity).
(If it grows exponentially then that's not a good sign.)

So is there any official or unofficial theoretical analysis or benchmark
showing this characteristic?

2011/6/30 Paul Davis <pa...@gmail.com>
>
> Teslan,
>
> I'm not sure where you were getting the impression that Erlang was
> frugal with disk space. In general, it's true that Erlang is pretty
> good at using a minimal amount of CPU/RAM resources while it runs,
> though as in all things, that usage will scale with load.
>
> As to disk usage, that's a direct trade-off in the design of CouchDB.
> The append-only B+tree is going to cause fragmentation in the database
> files. There are of course games we could play to minimize this to a
> certain extent by doing things like log-structured merge trees with
> more aggressive compaction but then the issue becomes that we end up
> requiring more active file descriptors per database which in turn
> hurts people that are hosting a large number of databases on a single
> node (think hosting, or db per user account).
>
> My guess is that whoever it was on IRC was just speaking with conviction.
> We don't try and hide the fact that CouchDB uses quite a bit more
> space than people would expect at first by any means.
>
> As to the amount of space that can be cleaned up, it really depends on
> the specific load patterns and how aggressive people are at keeping
> the database files compacted. Obviously I could write a single
> document hundreds of thousands of times without compacting, and then
> compact and have a database that is a percent or less of the
> "uncompacted" size.
>
> I'm also not sure about why someone would say that a 2GiB database
> would struggle with less than 2GiB of RAM. RAM usage is more or less
> tied to the number of concurrent clients you have accessing the
> database and the amount and type of view generations you have running.
> It's not really tied to the physical size of the database as we don't
> hold caches to anything. There used to be a silly benchmark floating
> around that showed CouchDB handling a couple thousand requests for a
> small doc and it was only using 9M of RAM. Granted that's a super
> idealized case, but I'd just point out that it's more about access
> patterns rather than disk usage.
>
> As to the mobile stuff, my guess would probably be "don't store a lot
> of data on the device". AFAIK the story for mobile developers revolves
> quite a bit around the fact that replicating data in and out from The
> Cloud™ makes it super easy for them to have bits and pieces of
> a much larger database.
>
> But in the end, the fact that CouchDB has a much larger disk usage
> size than some would expect is the trade-off in the grand
> design. There are features we have like database snapshots, append
> only storage to simplify guarantees on consistency (also, hot backups)
> and hosting a large number of db's in a single Erlang VM that end up
> intersecting in such a way that the price we pay is using more bytes.
>
> Also, I'd like to recommend you keep an eye on development because
> this is an active area of optimization. Filipe has been doing awesome
> work integrating things like snappy compression and other things deep
> down at the storage layer to improve the situation. We may be frank in
> saying we use a non-trivial amount of extra space, but it's not like
> we're not working on improving that situation. :D
>
> That ended up longer than expected. Let us know if you have any other
> questions.
>


-- 
- sleepnova

Re: Frugal Erlang vs Resources Hungry CouchDB

Posted by Paul Davis <pa...@gmail.com>.
On Wed, Jun 29, 2011 at 7:59 PM, Zdravko Gligic <zg...@gmail.com> wrote:
> Hi Folks,
>
> In many places I have read how Erlang runs on small devices and how
> (as a result) it is very frugal with resources.  I think that I have
> read that or at least something to that effect.  However, none of that
> seems to apply to CouchDB.
>
> I believe that I read somewhere that shortening key names can
> significantly reduce disk usage - as in, cutting it in half or
> less.  However, when I asked about it on #couchdb, a very smart person
> stated point blank (with a bit of attitude or maybe just conviction)
> that if I was worried about disk then I should not be using CouchDB.
>
> In many places I have read how both DB and View compactions can free
> up as much as 90% of occupied space.  Similarly, I have read how
> CouchDB would be struggling on smaller VPS allocations and how a mere
> 2GB database would struggle with anything less than that much in RAM -
> especially when compactions and/or cleanups are running.
>
> Whenever I come across such CouchDB resource-related postings, I keep
> thinking about all of those Couches on all of those mobile devices (at
> least in all of those presentations and slides) and asking myself
> "how do they do that?"
>
> Regards,
> teslan
>

Teslan,

I'm not sure where you were getting the impression that Erlang was
frugal with disk space. In general, it's true that Erlang is pretty
good at using a minimal amount of CPU/RAM resources while it runs,
though as in all things, that usage will scale with load.

As to disk usage, that's a direct trade-off in the design of CouchDB.
The append-only B+tree is going to cause fragmentation in the database
files. There are of course games we could play to minimize this to a
certain extent by doing things like log-structured merge trees with
more aggressive compaction but then the issue becomes that we end up
requiring more active file descriptors per database which in turn
hurts people that are hosting a large number of databases on a single
node (think hosting, or db per user account).

My guess is that whoever it was on IRC was just speaking with conviction.
We don't try and hide the fact that CouchDB uses quite a bit more
space than people would expect at first by any means.

As to the amount of space that can be cleaned up, it really depends on
the specific load patterns and how aggressive people are at keeping
the database files compacted. Obviously I could write a single
document hundreds of thousands of times without compacting, and then
compact and have a database that is a percent or less of the
"uncompacted" size.
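
To put a number on that single-document case, here is a back-of-the-envelope sketch. It assumes every revision is the same size and ignores B-tree and header overhead; the function name `compacted_fraction` and the figure of 200,000 writes are illustrative only.

```python
def compacted_fraction(live_docs, total_writes):
    """Fraction of the uncompacted file that survives compaction,
    assuming equal-sized revisions and no structural overhead."""
    return live_docs / total_writes


# One document rewritten 200,000 times without compacting.
fraction = compacted_fraction(live_docs=1, total_writes=200_000)
print(f"{fraction:.4%} of the uncompacted size remains")
```

In practice the compacted file will be somewhat larger than this estimate because of the overhead ignored here, but the order of magnitude is the point.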

I'm also not sure about why someone would say that a 2GiB database
would struggle with less than 2GiB of RAM. RAM usage is more or less
tied to the number of concurrent clients you have accessing the
database and the amount and type of view generations you have running.
It's not really tied to the physical size of the database as we don't
hold caches to anything. There used to be a silly benchmark floating
around that showed CouchDB handling a couple thousand requests for a
small doc and it was only using 9M of RAM. Granted that's a super
idealized case, but I'd just point out that it's more about access
patterns rather than disk usage.

As to the mobile stuff, my guess would probably be "don't store a lot
of data on the device". AFAIK the story for mobile developers revolves
quite a bit around the fact that replicating data in and out from The
Cloud™ makes it super easy for them to have bits and pieces of
a much larger database.
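
One way to get "bits and pieces" onto a device is filtered replication through CouchDB's `POST /_replicate` endpoint. The sketch below only builds the request and does not send it. The helper `build_filtered_replication`, the database names, the filter name `app/by_owner` and its `owner` parameter are all hypothetical, and authentication is left out.

```python
import json
import urllib.request


def build_filtered_replication(server, source, target, filter_name, params):
    """Build (but do not send) a POST /_replicate request that copies only
    the documents accepted by a filter function on the source."""
    body = {
        "source": source,
        "target": target,
        "filter": filter_name,   # "designdoc/filtername" on the source
        "query_params": params,  # made available to the filter function
    }
    return urllib.request.Request(
        f"{server}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_filtered_replication(
    server="http://localhost:5984",
    source="http://example.com:5984/all_tasks",  # hypothetical upstream database
    target="my_tasks",                           # hypothetical on-device database
    filter_name="app/by_owner",                  # hypothetical filter function
    params={"owner": "teslan"},
)
# urllib.request.urlopen(request) would start the replication.
```

The device then holds only the documents the filter lets through, which keeps the local database small regardless of how large the upstream one is.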

But in the end, the fact that CouchDB has a much larger disk usage
size than some would expect is the trade-off in the grand
design. There are features we have like database snapshots, append
only storage to simplify guarantees on consistency (also, hot backups)
and hosting a large number of db's in a single Erlang VM that end up
intersecting in such a way that the price we pay is using more bytes.

Also, I'd like to recommend you keep an eye on development because
this is an active area of optimization. Filipe has been doing awesome
work integrating things like snappy compression and other things deep
down at the storage layer to improve the situation. We may be frank in
saying we use a non-trivial amount of extra space, but it's not like
we're not working on improving that situation. :D

That ended up longer than expected. Let us know if you have any other questions.

Re: Frugal Erlang vs Resources Hungry CouchDB

Posted by Dale Harvey <da...@arandomurl.com>.
I was about to post some of what Paul just did.

On mobile devices the bottlenecks are CPU and flash storage. Erlang (and
therefore Couch) doesn't have a true idle state, but it does very well at
limiting battery usage on devices. The Android version is now down to 4 or 5 MB
of flash storage, which is comparatively small (Adobe Flash takes up 20 MB).

The phones do generally have a lot of storage to spare, which is where Couch
keeps its data, and as both replies have already said, the typical use case
is for a small subset of a user's data.

Cheers
Dale

On 30 June 2011 01:50, Jens Alfke <je...@mooseyard.com> wrote:

>
> On Jun 29, 2011, at 4:59 PM, Zdravko Gligic wrote:
>
> > In many places I have read how Erlang runs on small devices and how
> > (as a result) it is very frugal with resources.  I think that I have
> > read that or at least something to that effect.
>
> I’m not an Erlang expert, but the “typical” use-case Erlang was designed
> for was running Ericsson’s telecom switches, which are not small devices.
> The resource advantage it has is very lightweight parallelism, so you can
> run tens of thousands of ‘processes’ at once without consuming huge amounts
> of RAM in stack space.
>
> > Whenever I come across such CouchDB resource-related postings, I keep
> > thinking about all of those Couches on all of those mobile devices (at
> > least in all of those presentations and slides) and asking myself
> > "how do they do that?"
>
> A mobile device is typically going to use CouchDB to store personal-sized
> data sets, like your to-do list or phone book or bug queue, or your save
> state in a game. And it’s probably going to have one client at a time,
> connected by loopback on localhost, sometimes making a single sync
> connection to an upstream server database.
>
> In other words, it’s not going to be serving a million-document corporate
> database to thousands of clients simultaneously.
>
> For the personal-sized workloads described above, CouchDB runs fine on a
> mobile device, especially if the client compacts the database frequently.
>
> SQLite might be a good comparison — you could use it to manage very large
> databases [although maybe not so well as MySQL or Postgres] and for that
> purpose you’ll want a good amount of RAM and disk space. But for small data
> sets, it fits fine into embedded devices like iPhones and even microwave
> ovens.
>
> —Jens

Re: Frugal Erlang vs Resources Hungry CouchDB

Posted by Jens Alfke <je...@mooseyard.com>.
On Jun 29, 2011, at 4:59 PM, Zdravko Gligic wrote:

> In many places I have read how Erlang runs on small devices and how
> (as a result) it is very frugal with resources.  I think that I have
> read that or at least something to that effect.

I’m not an Erlang expert, but the “typical” use-case Erlang was designed for was running Ericsson’s telecom switches, which are not small devices. The resource advantage it has is very lightweight parallelism, so you can run tens of thousands of ‘processes’ at once without consuming huge amounts of RAM in stack space.

> Whenever I come across such CouchDB resource-related postings, I keep
> thinking about all of those Couches on all of those mobile devices (at
> least in all of those presentations and slides) and asking myself
> "how do they do that?"

A mobile device is typically going to use CouchDB to store personal-sized data sets, like your to-do list or phone book or bug queue, or your save state in a game. And it’s probably going to have one client at a time, connected by loopback on localhost, sometimes making a single sync connection to an upstream server database.

In other words, it’s not going to be serving a million-document corporate database to thousands of clients simultaneously.

For the personal-sized workloads described above, CouchDB runs fine on a mobile device, especially if the client compacts the database frequently.
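
Compaction is triggered over HTTP with `POST /{db}/_compact`, which needs a JSON content type and admin rights. A minimal sketch of how a client might build that request follows. The helper `build_compaction_request` and the database name `todo` are made up for illustration, and credentials are omitted.

```python
import urllib.request


def build_compaction_request(server, db_name):
    """Build (but do not send) the POST /{db}/_compact request."""
    return urllib.request.Request(
        f"{server}/{db_name}/_compact",
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


compact_request = build_compaction_request("http://localhost:5984", "todo")
# urllib.request.urlopen(compact_request) would ask the server to start
# compacting; the work itself runs in the background on the server.
```

A client could issue this after a batch of updates or on a timer, whichever suits the app's write pattern.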

SQLite might be a good comparison — you could use it to manage very large databases [although maybe not so well as MySQL or Postgres] and for that purpose you’ll want a good amount of RAM and disk space. But for small data sets, it fits fine into embedded devices like iPhones and even microwave ovens.

—Jens