Posted to user@couchdb.apache.org by Daniel Gonzalez <go...@gonvaled.com> on 2012/03/15 09:38:48 UTC

Size of couchdb documents

I have the following document in a couchdb database:

{
   "_id": "000013a7-4df6-403b-952c-ed767b61554a",
   "_rev": "1-54dc1794443105e9d16ba71531dd2850",
   "tags": [
       "auto_import"
   ],
   "ZZZZZZZZZZZ": "910111",
   "UUUUUUUUUUUUU": "OOOOOOOOO",
   "RECEIVING_OPERATOR": "073",
   "type": "XXXXXXXXXXXXXXXXXXX",
   "src_file": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}

This JSON file takes exactly 319 bytes if saved in my local
filesystem. My documents are all like this (give or take a couple of
bytes, since some of the fields have varying lengths).

In my database I currently have around 6 million documents, and they
use 15 GB. That gives around 2.5 KBytes/document, which means the
documents are taking 8 times more space in CouchDB than they would on
disk.

Why is that?

Re: Size of couchdb documents

Posted by CGS <cg...@gmail.com>.
The size of the database is not linear in the number of documents: just
double the number of documents and you will see a different size per saved
document. At least that is what I remember noticing when I was testing that
part of CouchDB. I know this is not answering your question, but it may give
you a hint about how to structure your databases to save hard disk space.

CGS


On Thu, Mar 15, 2012 at 11:03 AM, Dave Cottlehuber <da...@muse.net.nz> wrote:

> Might as well keep this in the other thread you have open, perhaps related.
>
> On 15 March 2012 09:38, Daniel Gonzalez <go...@gonvaled.com> wrote:
> > I have the following document in a couchdb database:
> >
> > {
> >   "_id": "000013a7-4df6-403b-952c-ed767b61554a",
> >   "_rev": "1-54dc1794443105e9d16ba71531dd2850",
> >   "tags": [
> >       "auto_import"
> >   ],
> >   "ZZZZZZZZZZZ": "910111",
> >   "UUUUUUUUUUUUU": "OOOOOOOOO",
> >   "RECEIVING_OPERATOR": "073",
> >   "type": "XXXXXXXXXXXXXXXXXXX",
> >   "src_file": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
> > }
> >
> > This JSON file takes exactly 319 bytes if saved in my local
> > filesystem. My documents are all like this (give or take a couple of
> > bytes, since some of the fields have varying lengths).
> >
> > In my database I currently have around 6 million documents, and they
> > use 15 GB. That gives around 2.5 KBytes/document, which means the
> > documents are taking 8 times more space in CouchDB than they would on
> > disk.
> >
> > Why is that?
>

Re: Size of couchdb documents

Posted by Dave Cottlehuber <da...@muse.net.nz>.
Might as well keep this in the other thread you have open, perhaps related.

On 15 March 2012 09:38, Daniel Gonzalez <go...@gonvaled.com> wrote:
> I have the following document in a couchdb database:
>
> {
>   "_id": "000013a7-4df6-403b-952c-ed767b61554a",
>   "_rev": "1-54dc1794443105e9d16ba71531dd2850",
>   "tags": [
>       "auto_import"
>   ],
>   "ZZZZZZZZZZZ": "910111",
>   "UUUUUUUUUUUUU": "OOOOOOOOO",
>   "RECEIVING_OPERATOR": "073",
>   "type": "XXXXXXXXXXXXXXXXXXX",
>   "src_file": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
> }
>
> This JSON file takes exactly 319 bytes if saved in my local
> filesystem. My documents are all like this (give or take a couple of
> bytes, since some of the fields have varying lengths).
>
> In my database I currently have around 6 million documents, and they
> use 15 GB. That gives around 2.5 KBytes/document, which means the
> documents are taking 8 times more space in CouchDB than they would on
> disk.
>
> Why is that?

Re: Size of couchdb documents

Posted by Jason Smith <jh...@iriscouch.com>.
On Thu, Mar 15, 2012 at 2:00 PM, Daniel Gonzalez <go...@gonvaled.com> wrote:
> I understand the overheads that you are referring to, but it still shocks
> me that CouchDB needs 8 times as much space to store the data.

Indeed. It is shocking.

CouchDB stores two indexes: the ID index (documents sorted by _id) and
the sequence index (documents sorted by when they were changed, for
replication).

Understanding the overheads that I mention, how many total bytes
would you say are reasonable for a 319-byte object?

-- 
Iris Couch

Re: Size of couchdb documents

Posted by Jason Smith <jh...@iriscouch.com>.
On Fri, Mar 16, 2012 at 9:10 AM, Daniel Gonzalez <go...@gonvaled.com> wrote:
>>
>> Hi, Daniel. That's great news! Also, I have an update from a CouchDB 1.2.0
>> test.
>>
>> I have a database here with 10 million documents, most several KB of
>> English text. Upgrading to version 1.2 changed the database size from
>> 38GB to 9.2GB, or now 0.94 KB per document.
>>
>
> That is interesting. Is CouchDB reducing the size of your stored data?
> Compression? Or is the average size of your input data smaller than 0.94KB?
> (I am not sure what "most several KB" means)

Well, you busted me. I do not know the average size of the documents
offhand, but I suspect it is much greater than 400 bytes, because many
of the documents are a few KB (maybe 1KB-5KB) of text strings.

But, yes, CouchDB 1.2 stores data compressed on the disk. I am using
the Snappy option for the minimal CPU hit.

http://code.google.com/p/snappy/
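
If I remember correctly, that option lives in local.ini in 1.2; a sketch
(check your build's defaults, in case I am misremembering the names):

[couchdb]
file_compression = snappy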

-- 
Iris Couch

Re: Size of couchdb documents

Posted by Daniel Gonzalez <go...@gonvaled.com>.
>
> Hi, Daniel. That's great news! Also, I have an update from a CouchDB 1.2.0
> test.
>
> I have a database here with 10 million documents, most several KB of
> English text. Upgrading to version 1.2 changed the database size from
> 38GB to 9.2GB, or now 0.94 KB per document.
>

That is interesting. Is CouchDB reducing the size of your stored data?
Compression? Or is the average size of your input data smaller than 0.94KB?
(I am not sure what "most several KB" means)


>
> So you should see an even greater improvement when 1.2.0 comes out
> Real Soon Now.
>
> > I have one more question. Is the alphabet I have shown above "ordered"
> for
> > couchdb?
>
> The sort order may not be quite what you expect, especially if you
> work with Unix or servers a lot.
>
> It is described here:
> http://wiki.apache.org/couchdb/View_collation#Collation_Specification
>
> Basically CouchDB follows (uses!) ICU. The major point is that
> different letter sequences are compared case-insensitively, but
> same-letter strings are case sensitive (lower case first). To me, it
> more or less follows how an English dictionary would do it.
>
> --
> Iris Couch
>

I have now changed my encoding dictionary to:

"-@0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

as suggested by Jamie Talbot. That seems to be ordered in the ICU (or UCA?)
sense.

Regarding the size of documents: having now nearly 20 million documents and
7.4GB, I can definitely say that the situation has indeed improved a lot. I
am now at 400 bytes/doc, down from the original 3KB/doc.
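
For reference, a minimal Python sketch of the counter-to-doc_id scheme I
am using (illustrative, not my actual code):

ALPHABET = "-@0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

def encode_id(n):
    # map a monotonically increasing integer to a short doc_id
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 64)
        digits.append(ALPHABET[r])
    return ''.join(reversed(digits))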

Re: Size of couchdb documents

Posted by Robert Newson <rn...@apache.org>.
You can see how Erlang sorts by using an Erlang terminal, e.g.:

erl
> lists:sort(["a","A","b","B", "0", "9", "+"]).
["+","0","9","A","B","a","b"]

etc.

On 16 March 2012 15:39, Daniel Gonzalez <go...@gonvaled.com> wrote:
> On Fri, Mar 16, 2012 at 4:30 PM, Robert Newson <rn...@apache.org> wrote:
>> Ah, thanks, that's good advice. You can still grab a bunch of the
>> uuids from /_uuids and then use them, thus giving you good uuids and
>> idempotency too.
>>
>> b.
>>
>
> Understood. But this would not solve the doc_id size problem, which in
> my database with 22 million documents has a big effect. So I need to
> generate a doc_id on the client side with few characters. Which brings
> me back to my original question: could somebody produce a base64
> dictionary which is "Erlang term ordered"?

Re: Size of couchdb documents

Posted by Daniel Gonzalez <go...@gonvaled.com>.
On Fri, Mar 16, 2012 at 5:03 PM, Alexander Shorin <kx...@gmail.com> wrote:
> Daniel,
>
> Since you're using Python, have you played with the uuid.uuid1 function?
> It produces semi-sequential host-based uuids. To make them really
> sequential, you would probably want to reverse the uuid value, because its
> "head" changes more often than its "tail". This trick could be cheaper than
> implementing an Erlang-friendly base64 encoding.
>
> --
> ,,,^..^,,,

Thanks Alexander, but that won't do. The main requirement to keep
performance and size low is to have a really short document_id. With
base64 you can number 16 million documents with just 4 characters,
and over a billion documents with 5 characters. The number of
characters in the doc_id is really a critical parameter when dealing
with lots of documents.
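
To make the arithmetic concrete (a quick Python check):

# 6 bits per character, so n characters cover 64**n ids
print(64 ** 4)  # 16777216   -> ~16 million documents in 4 characters
print(64 ** 5)  # 1073741824 -> over a billion in 5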

Re: Size of couchdb documents

Posted by Alexander Shorin <kx...@gmail.com>.
Daniel,

Since you're using Python, have you played with the uuid.uuid1 function?
It produces semi-sequential host-based uuids. To make them really
sequential, you would probably want to reverse the uuid value, because its
"head" changes more often than its "tail". This trick could be cheaper than
implementing an Erlang-friendly base64 encoding.

--
,,,^..^,,,

Re: Size of couchdb documents

Posted by Daniel Gonzalez <go...@gonvaled.com>.
On Fri, Mar 16, 2012 at 4:30 PM, Robert Newson <rn...@apache.org> wrote:
> Ah, thanks, that's good advice. You can still grab a bunch of the
> uuids from /_uuids and then use them, thus giving you good uuids and
> idempotency too.
>
> b.
>

Understood. But this would not solve the doc_id size problem, which in
my database with 22 million documents has a big effect. So I need to
generate a doc_id on the client side with few characters. Which brings
me back to my original question: could somebody produce a base64
dictionary which is "Erlang term ordered"?

Re: Size of couchdb documents

Posted by Robert Newson <rn...@apache.org>.
Ah, thanks, that's good advice. You can still grab a bunch of the
uuids from /_uuids and then use them, thus giving you good uuids and
idempotency too.

b.

On 16 March 2012 15:17, Alexander Shorin <kx...@gmail.com> wrote:
> On Fri, Mar 16, 2012 at 7:09 PM, Robert Newson <rn...@apache.org> wrote:
>> That advice probably dates to when we made a fully random UUID.
>
> That advice is based on a wiki note:
> http://wiki.apache.org/couchdb/HTTP_Document_API#POST
>
> ...and fully explained in method description:
>
>> If doc has no _id then the server will allocate a random ID and a new document will be created. Otherwise the doc’s _id will be used to identify the document to create or update. Trying to update an existing document with an incorrect _rev will raise a ResourceConflict exception.
>
>> Note that it is generally better to avoid saving documents with no _id and instead generate document IDs on the client side. This is due to the fact that the underlying HTTP POST method is not idempotent, and an automatic retry due to a problem somewhere on the networking stack may cause multiple documents being created in the database.
>
> If the docid is specified on the client, a PUT request is used instead of
> a POST. The workaround is already known: use the /_uuids server resource as
> the source of doc ids, but implementing this trick at the library level is
> not a good idea because it forces additional requests behind the scenes.
>
> --
> ,,,^..^,,,

Re: Size of couchdb documents

Posted by Alexander Shorin <kx...@gmail.com>.
On Fri, Mar 16, 2012 at 7:09 PM, Robert Newson <rn...@apache.org> wrote:
> That advice probably dates to when we made a fully random UUID.

That advice is based on a wiki note:
http://wiki.apache.org/couchdb/HTTP_Document_API#POST

...and fully explained in method description:

> If doc has no _id then the server will allocate a random ID and a new document will be created. Otherwise the doc’s _id will be used to identify the document to create or update. Trying to update an existing document with an incorrect _rev will raise a ResourceConflict exception.

> Note that it is generally better to avoid saving documents with no _id and instead generate document IDs on the client side. This is due to the fact that the underlying HTTP POST method is not idempotent, and an automatic retry due to a problem somewhere on the networking stack may cause multiple documents being created in the database.

If the docid is specified on the client, a PUT request is used instead of a
POST. The workaround is already known: use the /_uuids server resource as
the source of doc ids, but implementing this trick at the library level is
not a good idea because it forces additional requests behind the scenes.
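
A rough Python sketch of that workaround (the requests library and the
batch size are illustrative choices here, not part of any library):

import json
import requests

COUCH = 'http://localhost:5984'
_pool = []

def next_uuid():
    # fetch uuids in batches to amortize the extra round trip
    if not _pool:
        r = requests.get(COUCH + '/_uuids', params={'count': 100})
        _pool.extend(r.json()['uuids'])
    return _pool.pop()

# PUT with an explicit _id is idempotent: a retried request cannot
# create a second document
doc = {'tags': ['auto_import']}
requests.put('%s/mydb/%s' % (COUCH, next_uuid()),
             data=json.dumps(doc),
             headers={'Content-Type': 'application/json'})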

--
,,,^..^,,,

Re: Size of couchdb documents

Posted by Robert Newson <rn...@apache.org>.
That advice probably dates to when we made a fully random UUID.

On 16 March 2012 15:02, Daniel Gonzalez <go...@gonvaled.com> wrote:
> On Fri, Mar 16, 2012 at 3:51 PM, Robert Newson <rn...@apache.org> wrote:
>> Any reason you can't use the built-in, default UUID algorithm that
>> produces collision-resistant but sequential values?
>>
>> B.
>>
>
> Two reasons:
>
> 1. According to the couchdb-python documentation, this is not
> recommended: "The save() method creates a document with a random ID
> generated by CouchDB (which is not recommended)."
> http://packages.python.org/CouchDB/client.html. I do not remember the
> reasoning behind this, but I have been sticking to this in my
> libraries.
> 2. Using a standard uuid will probably put me back into my size and
> performance problems. The uuid generated by couchdb is 32 (?)
> characters, and now I am using much less.

Re: Size of couchdb documents

Posted by Daniel Gonzalez <go...@gonvaled.com>.
On Fri, Mar 16, 2012 at 3:51 PM, Robert Newson <rn...@apache.org> wrote:
> Any reason you can't use the built-in, default UUID algorithm that
> produces collision-resistant but sequential values?
>
> B.
>

Two reasons:

1. According to the couchdb-python documentation, this is not
recommended: "The save() method creates a document with a random ID
generated by CouchDB (which is not recommended)."
http://packages.python.org/CouchDB/client.html. I do not remember the
reasoning behind this, but I have been sticking to this in my
libraries.
2. Using a standard uuid will probably put me back into my size and
performance problems. The uuid generated by couchdb is 32 (?)
characters, and now I am using much less.

Re: Size of couchdb documents

Posted by Robert Newson <rn...@apache.org>.
Any reason you can't use the built-in, default UUID algorithm that
produces collision-resistant but sequential values?

B.

On 16 March 2012 14:41, Daniel Gonzalez <go...@gonvaled.com> wrote:
>>
>> If memory serves the database's by_id tree uses Erlang term sorting for collation instead of ICU.  ICU is of course the default collation option for MR views.  Regards,
>>
>> Adam
>
>
> That is interesting. I will try to confirm that, because that would
> mean that the dictionary that I am using now:
>
> "-@0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"
>
> which is ICU ordered, would not be optimal for the doc_ids. Can you
> tell me what would an "Erlang term order" base64 dictionary look like?
>
> Anyway, I am curious: I understand that the size of the doc_id is going to
> have a big impact on the performance and size of the database, since the
> doc_id is going to be present in a lot of internal structures. What I
> do not fully understand is why the *ordering* of doc_ids when inserting
> documents in the database is going to have any effect on insert speed,
> or view generation. In my naive view of couchdb, the documents are
> just written to a big file system file as they are POSTed to couchdb,
> in the order that they arrive. How would the doc_id order affect this
> process?

Re: Size of couchdb documents

Posted by Daniel Gonzalez <go...@gonvaled.com>.
>
> If memory serves the database's by_id tree uses Erlang term sorting for collation instead of ICU.  ICU is of course the default collation option for MR views.  Regards,
>
> Adam


That is interesting. I will try to confirm that, because that would
mean that the dictionary that I am using now:

"-@0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

which is ICU ordered, would not be optimal for the doc_ids. Can you
tell me what would an "Erlang term order" base64 dictionary look like?

Anyway, I am curious: I understand that the size of the doc_id is going to
have a big impact on the performance and size of the database, since the
doc_id is going to be present in a lot of internal structures. What I
do not fully understand is why the *ordering* of doc_ids when inserting
documents in the database is going to have any effect on insert speed,
or view generation. In my naive view of couchdb, the documents are
just written to a big file system file as they are POSTed to couchdb,
in the order that they arrive. How would the doc_id order affect this
process?

Re: Size of couchdb documents

Posted by Adam Kocoloski <ko...@apache.org>.
On Mar 15, 2012, at 7:55 PM, Jason Smith wrote:

> On Thu, Mar 15, 2012 at 10:14 PM, Daniel Gonzalez <go...@gonvaled.com> wrote:
>> Hi Matthieu,
>> 
>> This really seems to help. I am using now a base62 encoded monotonically
>> increasing integer, which means my doc_id goes from "0" onwards, using the
>> alphabet:
>> 
>> ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
>> 
>> I am getting now 3000 docs/s, more or less stable, and the size of my
>> documents has decreased from 3KB to 0.4 KB.
>> I am not sure whether these metrics will worsen when the database grows, but
>> my feeling is that the situation has improved a lot just by changing the
>> doc_id.
> 
> Hi, Daniel. That's great news! Also, I have an update from a CouchDB 1.2.0 test.
> 
> I have a database here with 10 million documents, most several KB of
> English text. Upgrading to version 1.2 changed the database size from
> 38GB to 9.2GB, or now 0.94 KB per document.
> 
> So you should see an even greater improvement when 1.2.0 comes out
> Real Soon Now.
> 
>> I have one more question. Is the alphabet I have shown above "ordered" for
>> couchdb?
> 
> The sort order may not be quite what you expect, especially if you
> work with Unix or servers a lot.
> 
> It is described here:
> http://wiki.apache.org/couchdb/View_collation#Collation_Specification
> 
> Basically CouchDB follows (uses!) ICU. The major point is that
> different letter sequences are compared case-insensitively, but
> same-letter strings are case sensitive (lower case first). To me, it
> more or less follows how an English dictionary would do it.
> 
> -- 
> Iris Couch

If memory serves the database's by_id tree uses Erlang term sorting for collation instead of ICU.  ICU is of course the default collation option for MR views.  Regards,

Adam


Re: Size of couchdb documents

Posted by Jason Smith <jh...@iriscouch.com>.
On Thu, Mar 15, 2012 at 10:14 PM, Daniel Gonzalez <go...@gonvaled.com> wrote:
> Hi Matthieu,
>
> This really seems to help. I am using now a base62 encoded monotonically
> increasing integer, which means my doc_id goes from "0" onwards, using the
> alphabet:
>
> ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
>
> I am getting now 3000 docs/s, more or less stable, and the size of my
> documents has decreased from 3KB to 0.4 KB.
> I am not sure whether these metrics will worsen when the database grows, but
> my feeling is that the situation has improved a lot just by changing the
> doc_id.

Hi, Daniel. That's great news! Also, I have an update from a CouchDB 1.2.0 test.

I have a database here with 10 million documents, most several KB of
English text. Upgrading to version 1.2 changed the database size from
38GB to 9.2GB, or now 0.94 KB per document.

So you should see an even greater improvement when 1.2.0 comes out
Real Soon Now.

> I have one more question. Is the alphabet I have shown above "ordered" for
> couchdb?

The sort order may not be quite what you expect, especially if you
work with Unix or servers a lot.

It is described here:
http://wiki.apache.org/couchdb/View_collation#Collation_Specification

Basically CouchDB follows (uses!) ICU. The major point is that
different letter sequences are compared case-insensitively, but
same-letter strings are case sensitive (lower case first). To me, it
more or less follows how an English dictionary would do it.
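
If you have PyICU installed, you can check the order yourself; a quick
sketch (assuming the root-locale collator is close enough to CouchDB's):

import icu  # the PyICU package

collator = icu.Collator.createInstance(icu.Locale(''))  # root locale
print(sorted(['b', 'aa', 'A', 'a'], key=collator.getSortKey))
# expected: ['a', 'A', 'aa', 'b'] -- note lower case sorting first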

-- 
Iris Couch

Re: Size of couchdb documents

Posted by Jamie Talbot <ja...@jamietalbot.com>.
So, just to understand, it's best to generate IDs in an encoding that
matches the collation order of CouchDB exactly?

I was using this:

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-@'

though, as I now understand it, the following is better:

'-@0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ'

Is that correct?

I decided to go for Base 64 encoding, as you can do fast bitwise
encodes and decodes.  You can see a sample encoding/decoding class
that does this in PHP here:

https://github.com/majelbstoat/Celsus/blob/master/library/Celsus/Encoder.php
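
In Python, an equivalent sketch would look roughly like this (illustrative,
along the lines of that class):

ALPHABET = '-@0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ'
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def encode(n):
    # 6 bits per character: shifts and masks instead of division
    out = []
    while True:
        out.append(ALPHABET[n & 63])
        n >>= 6
        if not n:
            break
    return ''.join(reversed(out))

def decode(s):
    n = 0
    for c in s:
        n = (n << 6) | INDEX[c]
    return n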

Cheers,

Jamie.

On Thu, Mar 15, 2012 at 09:31, Matthieu Rakotojaona
<ma...@gmail.com> wrote:
> On Thu, Mar 15, 2012 at 4:14 PM, Daniel Gonzalez <go...@gonvaled.com> wrote:
>> I have one more question. Is the alphabet I have shown above "ordered" for
>> couchdb?
>
> From the wiki (http://wiki.apache.org/couchdb/View_collation#Collation_Specification),
> your alphabet is not optimal. The link will explain better than me
> what would be the best choice for your alphabet =]
>
> --
> Matthieu RAKOTOJAONA



-- 
---
http://jamietalbot.com

Re: Size of couchdb documents

Posted by Matthieu Rakotojaona <ma...@gmail.com>.
On Thu, Mar 15, 2012 at 4:14 PM, Daniel Gonzalez <go...@gonvaled.com> wrote:
> I have one more question. Is the alphabet I have shown above "ordered" for
> couchdb?

From the wiki (http://wiki.apache.org/couchdb/View_collation#Collation_Specification),
your alphabet is not optimal. The link will explain better than me
what would be the best choice for your alphabet =]

-- 
Matthieu RAKOTOJAONA

Re: Size of couchdb documents

Posted by Daniel Gonzalez <go...@gonvaled.com>.
Hi Matthieu,

This really seems to help. I am using now a base62 encoded monotonically
increasing integer, which means my doc_id goes from "0" onwards, using the
alphabet:

ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz

I am getting now 3000 docs/s, more or less stable, and the size of my
documents has decreased from 3KB to 0.4 KB.
I am not sure whether these metrics will worsen when the database grows, but
my feeling is that the situation has improved a lot just by changing the
doc_id.

I have one more question. Is the alphabet I have shown above "ordered" for
couchdb?

Thanks,
Daniel

On Thu, Mar 15, 2012 at 3:09 PM, Matthieu Rakotojaona <
matthieu.rakotojaona@gmail.com> wrote:

> On Thu, Mar 15, 2012 at 3:00 PM, Daniel Gonzalez <go...@gonvaled.com>
> wrote:
> > I understand the overheads that you are referring to, but it still
> > shocks me that CouchDB needs 8 times as much space to store the data.
> >
> > Are there any guidelines on what to do/avoid in order to get a lower
> > overhead ratio?
>
> I got surprisingly good results when changing the _id design. I advise
> you to follow what is written on this page:
> http://wiki.apache.org/couchdb/Performance#File_size
>
> Basically:
> - use shorter _ids
> - use sequential _ids. If you cannot (e.g. because you have multiple
> disconnected parts that will have to merge often and that would cause
> too many clashes), you can use CouchDB's own semi-sequentially generated
> uuids. Yes, uuids contradict the first point.
>
>
> --
> Matthieu RAKOTOJAONA
>

Re: Size of couchdb documents

Posted by Matthieu Rakotojaona <ma...@gmail.com>.
On Thu, Mar 15, 2012 at 3:00 PM, Daniel Gonzalez <go...@gonvaled.com> wrote:
> I understand the overheads that you are referring to, but it still shocks
> me that CouchDB needs 8 times as much space to store the data.
>
> Are there any guidelines on what to do/avoid in order to get a lower
> overhead ratio?

I got surprisingly good results when changing the _id design. I advise
you to follow what is written on this page:
http://wiki.apache.org/couchdb/Performance#File_size

Basically:
- use shorter _ids
- use sequential _ids. If you cannot (e.g. because you have multiple
disconnected parts that will have to merge often and that would cause
too many clashes), you can use CouchDB's own semi-sequentially generated
uuids (see the sketch below). Yes, uuids contradict the first point.
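
If I remember correctly, the server-side generator is configurable in
local.ini; a sketch (option names from memory, so double-check them):

[uuids]
algorithm = sequential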


-- 
Matthieu RAKOTOJAONA

Re: Size of couchdb documents

Posted by Daniel Gonzalez <go...@gonvaled.com>.
I understand the overheads that you are referring to, but it still shocks
me that CouchDB needs 8 times as much space to store the data.

Are there any guidelines on what to do/avoid in order to get a lower
overhead ratio?

Re: Size of couchdb documents

Posted by Jason Smith <jh...@iriscouch.com>.
On Thu, Mar 15, 2012 at 8:38 AM, Daniel Gonzalez <go...@gonvaled.com> wrote:
> I have the following document in a couchdb database:
>
> {
>   "_id": "000013a7-4df6-403b-952c-ed767b61554a",
>   "_rev": "1-54dc1794443105e9d16ba71531dd2850",
>   "tags": [
>       "auto_import"
>   ],
>   "ZZZZZZZZZZZ": "910111",
>   "UUUUUUUUUUUUU": "OOOOOOOOO",
>   "RECEIVING_OPERATOR": "073",
>   "type": "XXXXXXXXXXXXXXXXXXX",
>   "src_file": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
> }
>
> This JSON file takes exactly 319 bytes if saved in my local
> filesystem. My documents are all like this (give or take a couple of
> bytes, since some of the fields have varying lengths).
>
> In my database I currently have around 6 million documents, and they
> use 15 GB. That gives around 2.5 KBytes/document, which means the
> documents are taking 8 times more space in CouchDB than they would on
> disk.

Hi, Daniel. Excellent question!

Ask yourself, how much space does a 319 byte file *really* consume on a disk?

It must be more than 319 bytes because the operating system must store
file metadata too. And even the file data occupies a 4KB block.

On a Linux ext3 filesystem, there is the superblock (and its copies),
the block group descriptor table, block bitmaps, inode bitmaps,
inodes, and then of course data blocks--usually 4 kilobytes a pop.
Whoops! That exceeds the CouchDB average already. So what is the
storage cost of a 319-byte file?
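
You can check this yourself; a small Python sketch (block accounting
varies by filesystem, so treat the numbers as illustrative):

import os

st = os.stat('doc.json')    # the 319-byte document saved as a file
print(st.st_size)           # 319  -> the logical size
print(st.st_blocks * 512)   # 4096 -> the space actually allocated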

CouchDB is the same. But running on top of the OS, it can't as easily
hide its metadata from the census.

Having said all of that, the CouchDB file format is indeed bloated,
particularly with numbers. The upcoming 1.2 release addresses that,
with several degrees of data compression supported.

I think most people are initially shocked by CouchDB's time and space
performance; however, if you consider its amortized costs in real-world
usage, it is suitable for many scenarios.

-- 
Iris Couch