Posted to user@couchdb.apache.org by Santi Saez <sa...@woop.es> on 2010/02/01 17:27:23 UTC

Best way to store 2^32 IPs in CouchDB

Hi,

I'm doing some initial tests with CouchDB, trying to store 2^32 IP
addresses (approximately 4.3 billion documents).

The documents contain only the required fields, _id and _rev, but I've
noticed that the minimum space occupied by each document is approximately
3.7KB, so I need more than 14TB of disk space just for the basic schema
without any extra fields (using the IP in integer format as the unique
identifier).

Note that with a simple Python script and a binary data file, this data
can be stored in 16GB (4 bytes per IP * 2^32 addresses).
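
For reference, a minimal sketch of the flat-file layout described above. It
assumes each address is packed as a 4-byte big-endian unsigned integer; the
function names are made up for illustration and this is not the original
script:

```python
import io
import ipaddress
import struct

RECORD = struct.Struct("!I")  # one IPv4 address = 4 bytes, network byte order


def write_range(stream, first, last):
    """Append every address from first to last (inclusive) as a 4-byte record."""
    start = int(ipaddress.IPv4Address(first))
    stop = int(ipaddress.IPv4Address(last))
    for value in range(start, stop + 1):
        stream.write(RECORD.pack(value))


def read_record(stream, index):
    """Return the address stored in record number `index` (0-based)."""
    stream.seek(index * RECORD.size)
    (value,) = RECORD.unpack(stream.read(RECORD.size))
    return str(ipaddress.IPv4Address(value))


buf = io.BytesIO()  # stands in for a real file opened in binary mode
write_range(buf, "10.0.0.0", "10.0.0.255")
size_small = len(buf.getvalue())   # 256 records * 4 bytes
size_full = RECORD.size * 2**32    # the whole IPv4 space, in bytes
```

At 4 bytes per record the whole IPv4 space comes to 4 * 2^32 =
17,179,869,184 bytes, i.e. 16GiB.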

Is it possible to optimize the disk space for what I'm trying to do 
using CouchDB? Perhaps disabling "something", compressing, or changing 
_rev field format/size.. thanks!!

I have read the manual for CouchDB performance, but I didn't get it:

http://wiki.apache.org/couchdb/Performance

Regards,

-- 
Santi Saez
http://woop.es

Re: Best way to store 2^32 IPs in CouchDB

Posted by Santi Saez <sa...@woop.es>.
On 01/02/10 17:56, Robert Newson wrote:
> compaction should reduce disk usage even without updates or deletes,
> but that is probably not true for 0.8. odd that you get the exact same
> byte count after compaction...

On another test server with CentOS-5 and "couchdb-0.10.0-1.el5", we have:

# curl http://localhost:5984/ipv4
{"db_name":"ipv4","doc_count":11836124,"doc_del_count":7,"update_seq":11836141,"purge_seq":0,"compact_running":true,"disk_size":97730438359,"instance_start_time":"1264780216812476","disk_format_version":4}

A "Database Compaction" task is still running:

# ls -lh /var/lib/couchdb/ipv4.couch*
92G Feb  1 19:59 /var/lib/couchdb/ipv4.couch
34M Feb  1 19:58 /var/lib/couchdb/ipv4.couch.compact

So I have to wait for the compaction to finish to see whether it saves any
disk space, thanks!!
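
As a rough cross-check of the numbers above, the per-document size can be
derived from the database info response. This is only a sketch: it uses the
pre-compaction file size, and it assumes the size scales linearly with the
document count, which may not hold:

```python
import json

# Response copied from the curl output above
info = json.loads(
    '{"db_name":"ipv4","doc_count":11836124,"doc_del_count":7,'
    '"update_seq":11836141,"purge_seq":0,"compact_running":true,'
    '"disk_size":97730438359,"instance_start_time":"1264780216812476",'
    '"disk_format_version":4}'
)

bytes_per_doc = info["disk_size"] / info["doc_count"]
projected_tib = bytes_per_doc * 2**32 / 2**40  # extrapolate to 2^32 documents
print(f"{bytes_per_doc:.0f} bytes/doc, about {projected_tib:.1f} TiB for 2^32 docs")
```

Before compaction this works out to a little over 8,000 bytes per document.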

Regards,

-- 
Santi Saez
http://woop.es

Re: Best way to store 2^32 IPs in CouchDB

Posted by Robert Newson <ro...@gmail.com>.
compaction should reduce disk usage even without updates or deletes,
but that is probably not true for 0.8. odd that you get the exact same
byte count after compaction...


Re: Best way to store 2^32 IPs in CouchDB

Posted by Santi Saez <sa...@woop.es>.
On 01/02/10 17:31, Robert Newson wrote:

> Try database compaction?

I have tried database compaction on another test server (a Debian Lenny
box) using CouchDB 0.8.0-2, and after compaction the disk size is the
same:

# curl http://localhost:5984/test
{"db_name":"test","doc_count":15999,"doc_del_count":0,"update_seq":15999,"compact_running":false,"disk_size":60330312}

# curl -X POST http://localhost:5984/test/_compact
{"ok":true}

# curl http://localhost:5984/test
{"db_name":"test","doc_count":15999,"doc_del_count":0,"update_seq":15999,"compact_running":false,"disk_size":60330312}

According to the documentation [1]: "Compaction rewrites the database
file, removing outdated document revisions and deleted documents".

So that's normal, because in my test I have not deleted or updated any
documents, only inserted them.

Thanks!

[1] http://wiki.apache.org/couchdb/Compaction

-- 
Santi Saez
http://woop.es

Re: Best way to store 2^32 IPs in CouchDB

Posted by Robert Newson <ro...@gmail.com>.
Try database compaction?

B.


Re: Best way to store 2^32 IPs in CouchDB

Posted by Santi Saez <sa...@woop.es>.
On 01/02/10 18:19, Markus Jelsma wrote:
> Not really, but you could omit about 300 million IP addresses: these are
> multicast and private network addresses. That'd save you about 1.2GiB already.

Thanks for the tip ;-)

Regards,

-- 
Santi Saez
http://woop.es

Re: Best way to store 2^32 IPs in CouchDB

Posted by Markus Jelsma <ma...@buyways.nl>.
Not really, but you could omit about 300 million IP addresses: these are
multicast and private network addresses. That'd save you about 1.2GiB already.
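
To sanity-check that estimate, here is a small sketch using Python's
standard ipaddress module. The list of skipped ranges is an assumption
(multicast plus the RFC 1918 private blocks); other reserved blocks are
ignored:

```python
import ipaddress

# Multicast plus the RFC 1918 private ranges (assumed list, not exhaustive)
SKIPPED = ["224.0.0.0/4", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]

skipped_addresses = sum(ipaddress.ip_network(net).num_addresses for net in SKIPPED)
saved_bytes = skipped_addresses * 4  # 4 bytes per address in a packed format
print(skipped_addresses, round(saved_bytes / 2**30, 2))
```

That gives 286,326,784 addresses, or a bit over 1GiB at 4 bytes each, which
is in line with the rough figures above.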


>Now seriously: any idea to reduce disk space in this test to store 2^32
>documents? thanks!

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Best way to store 2^32 IPs in CouchDB

Posted by Phil Rand <ph...@gmail.com>.
Of course it depends on what you are trying to do, which you haven't
told us.  The easiest way to reduce your storage needs is to not store
data you don't need to.  For example, if I wanted to map IP back to
hostname, I wouldn't use CouchDB at all, since we already have the
DNS.




-- 
Phil Rand
philrand@gmail.com
philrand@pobox.com

Re: Best way to store 2^32 IPs in CouchDB

Posted by Santi Saez <sa...@woop.es>.
On 01/02/10 17:32, Elf wrote:
> Do you plan to handle IPv6 in future versions of your program? :)

It would be another great test.. but using CouchDB, perhaps I will not 
have enough disk space ;-P

Now seriously: any idea to reduce disk space in this test to store 2^32 
documents? thanks!

Regards,

-- 
Santi Saez
http://woop.es

Re: Best way to store 2^32 IPs in CouchDB

Posted by Elf <el...@gmail.com>.
Do you plan to handle IPv6 in future versions of your program? :)




-- 
----------------
Best regards
Elf
mailto:elf2001@gmail.com

Re: Best way to store 2^32 IPs in CouchDB

Posted by Nicholas Orr <ni...@zxgen.net>.
Also have a look at this thread

http://mail-archives.apache.org/mod_mbox/couchdb-dev/201001.mbox/%3Chi57et$19n$1@ger.gmane.org%3E




Re: Best way to store 2^32 IPs in CouchDB

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Feb 1, 2010 at 1:50 PM, Santi Saez <sa...@woop.es> wrote:
> On 01/02/10 17:56, Paul Davis wrote:
>
> Dear Paul,
>
>> Well, 2^32 of anything is 4GiB per byte stored. So, minimum of four
>> bytes and you're at 16GiB. Even with just 1KiB overhead you're at
>> 4TiB.
>>
>> I'm left wondering why you would want to store a list of numbers in
>> the first place.
>
> Imagine a service like Netcraft.
>
> I know that there aren't 2^32 active servers, but I wanted to test it with
> 4.3 billion documents and stress/benchmark CouchDB against other DBs.
>
> Regards,
>
> --
> Santi Saez
> http://woop.es
>

If you're looking for benchmark data I'd also suggest something like
the Enron email dataset. I can't imagine 4.3 billion integer documents
is going to be very informative about real world usage. The Wikipedia
abstracts data set is another candidate as well.

HTH,
Paul Davis

Re: Best way to store 2^32 IPs in CouchDB

Posted by Stephen Day <sj...@gmail.com>.
I thought I'd weigh in on this to illustrate the differences in use cases
between heterogeneous document-based data and homogeneous data, such as
IP address adjacencies. I have a bit of a networking background, so even if
I am way off about your intent, this may at least be an interesting bit of
commentary on a hypothetical CouchDB "router".

My assumption here is that your problem seems similar to a very specialized
problem that is "solved" in routers. Typically, in a network routing table,
a highly specialized mtrie structure is used, with a depth of 4 and 256
children per node (obviously not all populated), that stores adjacency
information at the "leaves". It allows one to look up the adjacency entries
for any IP address with 4 array lookups.
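
To make the four-lookup idea concrete, here is a toy sketch of such a
structure. It is heavily simplified: it stores individual host addresses
only, whereas a real router mtrie stores prefixes, and the class and method
names are invented for illustration:

```python
class MTrie:
    """Toy 8-8-8-8 multiway trie: one 256-slot array per octet level."""

    def __init__(self):
        self.root = [None] * 256

    def insert(self, address, adjacency):
        octets = [int(part) for part in address.split(".")]
        node = self.root
        for octet in octets[:3]:
            if node[octet] is None:
                node[octet] = [None] * 256  # allocate child nodes lazily
            node = node[octet]
        node[octets[3]] = adjacency  # the last-level slot holds the adjacency

    def lookup(self, address):
        node = self.root
        for octet in (int(part) for part in address.split(".")):
            if node is None:
                return None  # branch was never populated
            node = node[octet]  # one array index per level, four in total
        return node


table = MTrie()
table.insert("1.0.0.16", "3.0.0.1")
print(table.lookup("1.0.0.16"))
```

Each key costs no storage of its own here: the address is implied by the
path of array indexes taken to reach the slot.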

CouchDB is much more generalized, so I wouldn't expect it to perform as well
in this case when compared to the mtrie. The idea behind Couch is to
store heterogeneous data, then provide indexes on it. And even though
Couch isn't designed to hold adjacency data, its flexibility allows it to do
something very similar to the mtrie. Let's say we have a database filled with
adjacency documents that are the basis of a simplified routing system on
Couch. Entries might look like this:

"3.0.0.1" -> {"networks": ["1.0.0.0/24", "2.0.0.0/24"]}

Let's say the semantics here are "Networks 1.0.0.0/24 and 2.0.0.0/24 are
available via 3.0.0.1". This would basically be an adjacency entry. Your
view code would produce routing pairs from the adjacency information above
(assume it can generate keys from netmasks):

"1.0.0.1" -> "3.0.0.1"
"1.0.0.2" -> "3.0.0.1"
...
"1.0.0.254" -> "3.0.0.1"

Then, again semantically, you might ask "How do I get to 1.0.0.16?". Your
view would respond with "3.0.0.1". Despite this "working", the storage
required here is orders of magnitude larger than that required for
the homogeneous mtrie, especially because IP addresses now take up to 16
bytes instead of 4, not to mention the storage for the metadata of each
adjacency (rev and id) and the size of the B-tree needed to store and index
it. This is the cost of flexibility. If your data is very homogeneous, in
that every single key can be represented as the same type more efficiently
than as a string (an IP address, for example), then CouchDB may not be the
right tool.
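
A rough sketch of the expansion described above. CouchDB views are normally
written as JavaScript map functions; the Python below is only a stand-in to
show the shape of the emitted pairs, and map_adjacency is a made-up name:

```python
import ipaddress


def map_adjacency(doc_id, doc):
    """Yield (host, next_hop) pairs, mimicking what a view's map step would emit."""
    for network in doc["networks"]:
        for host in ipaddress.ip_network(network).hosts():
            yield str(host), doc_id


doc = {"networks": ["1.0.0.0/24", "2.0.0.0/24"]}
routes = dict(map_adjacency("3.0.0.1", doc))  # stands in for the view index

key_bytes_as_text = sum(len(key) for key in routes)  # keys as dotted strings
key_bytes_packed = 4 * len(routes)                   # keys as 32-bit integers
print(routes["1.0.0.16"], key_bytes_as_text, key_bytes_packed)
```

Even in this tiny case the string keys take about twice the space of packed
integers, before counting any per-row index overhead.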

I hope this helps.

Stephen J Day


Re: Best way to store 2^32 IPs in CouchDB

Posted by Brian Candler <B....@pobox.com>.
On Mon, Feb 01, 2010 at 07:50:00PM +0100, Santi Saez wrote:
> On 01/02/10 17:56, Paul Davis wrote:
> 
> Dear Paul,
> 
> >Well, 2^32 of anything is 4GiB per byte stored. So, minimum of four
> >bytes and you're at 16GiB. Even with just 1KiB overhead you're at
> >4TiB.
> >
> >I'm left wondering why you would want to store a list of numbers in
> >the first place.
> 
> Imagine a service like Netcraft.

Then what you want is HTTP virtual hosts, not IP addresses?

Remember that one IP address can serve tens of thousands of virtual hosts. 
(A couchdb document for one IP address could list multiple HTTP hosts within
the JSON, of course)

But according to Netcraft there are around 200M hosts, which is only about
5% of what you were looking at before.  In other words, this is a "sparse"
dataset; there is no value in storing IP addresses which don't have any
information of interest to you.

Another trick which may compact your data is to group it into /24's.  That
is, one JSON document for all of 0.0.0.0-0.0.0.255, another for
0.0.1.0-0.0.1.255 etc.  As well as reducing overhead, there are other
obvious savings (e.g. if you're sweeping network blocks then you can store a
single timestamp to say when the sweep of that /24 was performed)
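
One way the /24 grouping might look, as a sketch. The document layout
(a "hosts" map keyed by last octet plus a single "swept_at" field), the
function names and the sample data are all assumptions for illustration:

```python
import ipaddress


def block_id(address):
    """Return the base address of the /24 containing `address` (used as _id)."""
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network.network_address)


def record_sweep(docs, results, swept_at):
    """Fold per-address results into one document per /24."""
    for address, data in results.items():
        doc = docs.setdefault(block_id(address), {"hosts": {}})
        doc["swept_at"] = swept_at  # one timestamp for the whole block
        last_octet = address.rsplit(".", 1)[1]
        doc["hosts"][last_octet] = data
    return docs


docs = record_sweep(
    {},
    {
        "192.0.2.10": {"server": "Apache"},
        "192.0.2.77": {"server": "nginx"},
        "198.51.100.5": {"server": "IIS"},
    },
    swept_at=1265053200,  # hypothetical sweep time (Unix seconds)
)
max_documents = 2**32 // 256  # upper bound on the number of /24 documents
```

With this layout the per-document overhead is paid at most 16,777,216 times
instead of 2^32 times, and empty /24s need no document at all.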

HTH,

Brian.

Re: Best way to store 2^32 IPs in CouchDB

Posted by Santi Saez <sa...@woop.es>.
On 01/02/10 17:56, Paul Davis wrote:

Dear Paul,

> Well, 2^32 of anything is 4GiB per byte stored. So, minimum of four
> bytes and you're at 16GiB. Even with just 1KiB overhead you're at
> 4TiB.
>
> I'm left wondering why you would want to store a list of numbers in
> the first place.

Imagine a service like Netcraft.

I know that there aren't 2^32 active servers, but I wanted to test it
with 4.3 billion documents and stress/benchmark CouchDB against other DBs.

Regards,

-- 
Santi Saez
http://woop.es

Re: Best way to store 2^32 IPs in CouchDB

Posted by Paul Davis <pa...@gmail.com>.
Well, 2^32 of anything is 4GiB per byte stored. So, minimum of four
bytes and you're at 16GiB. Even with just 1KiB overhead you're at
4TiB.
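
The arithmetic, spelled out as a short sketch (the 3,700-byte figure is the
approximate per-document size reported at the start of this thread):

```python
DOCS = 2**32
GIB, TIB = 2**30, 2**40


def total_bytes(bytes_per_doc):
    """Total size for 2^32 documents at the given per-document size."""
    return DOCS * bytes_per_doc


per_byte_gib = total_bytes(1) / GIB      # every stored byte costs 4 GiB
four_bytes_gib = total_bytes(4) / GIB    # bare 32-bit integers
one_kib_tib = total_bytes(1024) / TIB    # 1 KiB of overhead per document
observed_tib = total_bytes(3700) / TIB   # roughly 3.7KB per document
print(per_byte_gib, four_bytes_gib, one_kib_tib, round(observed_tib, 1))
```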

I'm left wondering why you would want to store a list of numbers in
the first place.

HTH,
Paul Davis
