Posted to user@couchdb.apache.org by Norman Barker <no...@gmail.com> on 2010/07/26 19:00:26 UTC

couchdb and millions of records

Hi,

I have sampled the Wikipedia TSV collection from Freebase
(http://wiki.freebase.com/wiki/WEX/Documentation#articles). I ran it
through awk to drop the XML field, did a simple conversion to JSON, and
then POSTed the documents to _bulk_docs 150 at a time into CouchDB 0.11.
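
For reference, each batch is a single POST of {"docs": [...]} to
_bulk_docs. A minimal Erlang sketch of one such request (a local CouchDB,
a hypothetical "wex" database, and two made-up documents standing in for
a 150-document batch):

    #!/usr/bin/env escript
    %% Minimal sketch of one _bulk_docs batch. The database name ("wex")
    %% and the two toy documents are placeholders, not the real data.
    main(_) ->
        inets:start(),
        Url = "http://127.0.0.1:5984/wex/_bulk_docs",
        Body = "{\"docs\": ["
               "{\"title\": \"Example article\", \"date\": \"2010-07-26\"},"
               "{\"title\": \"Another article\", \"date\": \"2010-07-25\"}"
               "]}",
        {ok, {{_, Code, _}, _Headers, Resp}} =
            httpc:request(post, {Url, [], "application/json", Body}, [], []),
        io:format("HTTP ~p: ~s~n", [Code, Resp]).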

I wrote a simple view in Erlang that emits the date as a key (I am
actually using this to test free-text search with couchdb-clucene); the
views are fast once computed.
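
For illustration, a date-keyed map of that shape for the native Erlang
query server (assuming the Erlang query server is enabled in local.ini
and that each document carries a top-level "date" field) looks roughly
like this:

    %% Sketch of a date-keyed map for the native Erlang query server
    %% (must be enabled under [native_query_servers] in local.ini).
    %% Assumes each document has a top-level "date" field.
    fun({Doc}) ->
        case proplists:get_value(<<"date">>, Doc) of
            undefined -> ok;                  % skip docs without a date
            Date      -> Emit(Date, null)     % key = date, value = null
        end
    end.

Emitting null rather than the document body keeps the view index small,
which also helps with the disk usage mentioned below.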

The amount of disk storage used by CouchDB is an issue, and the write
times are slow. I changed my view, and the view computation over 2.3
million documents is still running!

        "request_time": {
            "description": "length of a request inside CouchDB without MochiWeb",
            "current": 2253451.122,
            "sum": 2253451.122,
            "mean": 501.212,
            "stddev": 12275.385,
            "min": 0.5,
            "max": 798124.0
        },
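
(The numbers above look like CouchDB's /_stats output; something like
the following, with the default host and port assumed, fetches them from
a local instance. request_time sits under the "couchdb" section of the
response.)

    #!/usr/bin/env escript
    %% Sketch: fetch the runtime statistics quoted above from a local
    %% CouchDB; request_time is in the "couchdb" section of the response.
    main(_) ->
        inets:start(),
        {ok, {{_, 200, _}, _Headers, Body}} =
            httpc:request("http://127.0.0.1:5984/_stats"),
        io:format("~s~n", [Body]).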

For my use case, once the system is up there are only a few updates per
hour, but doing the initial harvest takes a long time.

Does 1.0 make substantial gains on this, and if so, how? Are there any
other areas I should be looking at to improve this? I am happy writing
Erlang code.

thanks,

Norman

Re: couchdb and millions of records

Posted by J Chris Anderson <jc...@gmail.com>.
On Jul 26, 2010, at 10:41 AM, Simon Metson wrote:

> Hi,
> 	We've done things at this scale with CouchDB. The key thing is to do bulk inserts and to trigger view indexing as you go. For instance, our code by default will bulk insert 5000 records, then hit a view, then do the next 5000, then hit the view again, and so on. Of course, the batch size is something you'd want to tune, since it'll depend on your documents and views. It's much quicker to build the view index incrementally than to hit all N million records at once. You might also want to run view and database compaction occasionally, especially if you're also doing bulk deletes.
> Cheers
> Simon
> 

Also, 1.0 should be significantly faster for your use case.

Chris


Re: couchdb and millions of records

Posted by Simon Metson <si...@googlemail.com>.
Hi,
	We've done things at this scale with CouchDB. The key thing is to do
bulk inserts and to trigger view indexing as you go. For instance, our
code by default will bulk insert 5000 records, then hit a view, then do
the next 5000, then hit the view again, and so on. Of course, the batch
size is something you'd want to tune, since it'll depend on your
documents and views. It's much quicker to build the view index
incrementally than to hit all N million records at once. You might also
want to run view and database compaction occasionally, especially if
you're also doing bulk deletes.
Cheers
Simon
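
A rough sketch of the loop described above, for reference: bulk insert a
batch, touch the view so it indexes incrementally, repeat. The database
("wex"), the design document and view ("articles"/"by_date"), the batch
size and the generated documents are all placeholders.

    #!/usr/bin/env escript
    %% Sketch of the batching pattern above: bulk insert a chunk of
    %% documents, then hit the view so it indexes incrementally after
    %% each batch. Database, design doc/view names and the documents
    %% themselves are placeholders.

    main(_) ->
        inets:start(),
        Db = "http://127.0.0.1:5984/wex",
        Docs = [fake_doc(N) || N <- lists:seq(1, 20000)],  % stand-in corpus
        load(Db, Docs, 5000).

    %% Insert Docs in batches of Size, touching the view between batches.
    load(_Db, [], _Size) ->
        ok;
    load(Db, Docs, Size) ->
        {Batch, Rest} = split(Size, Docs),
        Body = "{\"docs\": [" ++ string:join(Batch, ",") ++ "]}",
        {ok, _} = httpc:request(post,
            {Db ++ "/_bulk_docs", [], "application/json", Body}, [], []),
        %% A cheap view request triggers incremental indexing of the new rows.
        {ok, _} = httpc:request(Db ++ "/_design/articles/_view/by_date?limit=0"),
        load(Db, Rest, Size).

    split(N, L) when length(L) =< N -> {L, []};
    split(N, L) -> lists:split(N, L).

    fake_doc(N) ->
        %% Hand-built JSON to avoid pulling in a JSON library.
        "{\"title\": \"doc-" ++ integer_to_list(N) ++ "\", \"date\": \"2010-07-26\"}".

As Simon says, the batch size (5000 here) is the main thing to tune, and
database and view compaction can be run occasionally during the load.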
