Posted to user@couchdb.apache.org by David King <dk...@ketralnis.com> on 2008/07/13 00:52:51 UTC

Practical storage limit

Has anyone hit a practical storage limit, either in terms of data or  
number of records, under couchdb? I have a database of about 70  
million records (over 60 million of them are very small, only three  
integer properties) and maybe eight views that I'd like to try a quick  
port and see how it performs.

Are there any gotchas that I should know about first under the default  
config?

Re: Practical storage limit

Posted by David King <dk...@ketralnis.com>.
(original issue at http://www.mail-archive.com/couchdb-user@incubator.apache.org/msg00792.html)

>> You appear to be hitting the weird mochiweb connection reset bug.
>> It causes test failures too. We are looking into it.
> I updated to r679855, which has the fix for the connection reset  
> bug, but I'm still experiencing the problem. Any ideas?

I reduced the chunk size (in the code at the above URL), which
shortens the gap between couchdb requests, and this seems to have
resolved the issue: I'm now up to 35k documents, 100 at a time,
without a timeout, which is far more requests than ever succeeded
before.

This leads me to believe that it's more of a couchdb-python issue  
(maybe couchdb-python is keeping the socket to couchdb open for the  
whole time, and it's timing out in between?). Maybe couchdb-python can  
do keep-alives in a background thread?
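
In case it helps anyone else hitting this, the workaround is just a
smaller chunk size in the insert loop (a sketch; IteratorChunker and
get_stuff are my own helpers from the code quoted further down the
thread, and 'db' is a couchdb.client.Database):

    chunker = IteratorChunker(get_stuff())

    while not chunker.done:
        chunk = chunker.next_chunk(100)   # was 1000: smaller chunks mean
        if chunk:                         # shorter gaps between requests
            db.update(chunk)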

Re: Practical storage limit

Posted by David King <dk...@ketralnis.com>.
> You appear to be hitting the weird mochiweb connection reset bug.
> It causes test failures too. We are looking into it.

I updated to r679855, which has the fix for the connection reset bug,  
but I'm still experiencing the problem. Any ideas?

Re: Practical storage limit

Posted by Damien Katz <da...@apache.org>.
On Jul 16, 2008, at 6:56 PM, David King wrote:

>> We'd love to hear what you come up with and also to solve any  
>> problems you might encounter on your way. Please let us know.  
>> Please note that CouchDB at this point is not optimised. We are  
>> still in the 'getting it right' phase before we come to the  
>> 'getting it fast'. That said, CouchDB is plenty fast already, but  
>> there is also the potential to greatly speed up things.
>
>
> So I'm trying a smaller version of this first (9 million records),  
> and I've hit a snag. I have some rather simple python code to read  
> from Postgres and write to couchdb (that uses couchdb-python, where  
> 'db' is a couchdb.client.Database object):
>
>    chunker = IteratorChunker(get_stuff())
>
>    while not chunker.done:
>        print "fetching"
>        chunk = chunker.next_chunk(1000)
>        if chunk:
>            print "Adding %d items, starting with %s" % (len(chunk), chunk[0]['_id'])
>            db.update(chunk)
>
> db.update(docs) (see <http://code.google.com/p/couchdb-python/source/browse/trunk/couchdb/client.py>, line 360) uses the bulk API, like:
>
>    data = self.resource.post('_bulk_docs', content={'docs': documents})
>
> At apparently random points throughout this process, but almost  
> always before 15,000 records or so, the process dies with an  
> exception, the tail end of which looks like:
>
>  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 707, in send
>    self.sock.sendall(str)
>  File "<string>", line 1, in sendall
> socket.error: (54, 'Connection reset by peer')
>
> If I have Futon up while it's running, I occasionally get a  
> Javascript error along the lines of "killed" (reproducing it is  
> difficult) at the same time.
>
> I could have it catch the reset connection and re-try, but why would  
> this be happening?
>

You appear to be hitting the weird mochiweb connection reset bug. It
causes test failures too. We are looking into it.


Re: Practical storage limit

Posted by David King <dk...@ketralnis.com>.
> We'd love to hear what you come up with and also to solve any  
> problems you might encounter on your way. Please let us know. Please  
> note that CouchDB at this point is not optimised. We are still in  
> the 'getting it right' phase before we come to the 'getting it  
> fast'. That said, CouchDB is plenty fast already, but there is also  
> the potential to greatly speed up things.


So I'm trying a smaller version of this first (9 million records), and  
I've hit a snag. I have some rather simple python code to read from  
Postgres and write to couchdb (that uses couchdb-python, where 'db' is  
a couchdb.client.Database object):

    chunker = IteratorChunker(get_stuff())

    while not chunker.done:
        print "fetching"
        chunk = chunker.next_chunk(1000)
        if chunk:
            print "Adding %d items, starting with %s" % (len(chunk), chunk[0]['_id'])
            db.update(chunk)

db.update(docs) (see <http://code.google.com/p/couchdb-python/source/browse/trunk/couchdb/client.py>, line 360) uses the bulk API, like:

    data = self.resource.post('_bulk_docs', content={'docs': documents})

At apparently random points throughout this process, but almost always  
before 15,000 records or so, the process dies with an exception, the  
tail end of which looks like:

  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 707, in send
    self.sock.sendall(str)
  File "<string>", line 1, in sendall
socket.error: (54, 'Connection reset by peer')

If I have Futon up while it's running, I occasionally get a Javascript  
error along the lines of "killed" (reproducing it is difficult) at the  
same time.

I could have it catch the reset connection and re-try, but why would  
this be happening?
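
For what it's worth, a catch-and-retry wrapper would look something
like this (an untested sketch; the retry count and delay are guesses):

    import socket
    import time

    def update_with_retry(db, chunk, retries=3):
        # retry bulk updates that die with 'Connection reset by peer';
        # note this papers over the failure rather than explaining it,
        # and a retry can conflict if the first POST actually succeeded
        for attempt in range(retries):
            try:
                return db.update(chunk)
            except socket.error:
                if attempt == retries - 1:
                    raise
                time.sleep(1)  # give the server a moment before retrying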


Re: Practical storage limit

Posted by Jan Lehnardt <ja...@apache.org>.
On Jul 13, 2008, at 20:25, Peter Eddy wrote:

> On Sun, Jul 13, 2008 at 12:46 PM, Jan Lehnardt <ja...@apache.org> wrote:
>> To get on write view update semantics, you can write a little daemon
>> script that runs alongside CouchDB and is specified in couch.ini
>> as DbUpdateNotificationProcesses. This daemon gets sent a
>> notification each time the database is changed and could in turn
>> trigger a view update every N document inserts or every Y seconds,
>> whichever occurs first. The reason not to integrate each doc as
>> it comes in is that it is horribly inefficient and CouchDB is  
>> designed
>> to do view index updates very fast, so batching is a good idea.
>
> Thanks for this information, Jan. I think this solution is a lot
> better than trying to figure out when it's necessary to update views
> in application code. However I'd still argue that people will want
> this as their default behavior most of the time, and I wonder if it
> wouldn't be more efficient to just build the daemon functionality into
> the server, maybe with tunable N and Y values.

I have no numbers to back this up, but in a typical web app reads
outnumber writes by far, and the incremental index update will
be fast enough. Yes, you can always contrive counter-examples,
but those cases can use the proposed solution.

I don't think that an Erlang module to handle this would be much
more efficient than an external daemon. After all, only a few
bytes are exchanged. But as outlined in my last paragraph, it
makes sense to bundle such a system with CouchDB; we just
haven't gotten to it yet. (This is your cue to jump in *hint* :)

Cheers
Jan
--

>
>
> - Peter
>
>> To get a list of all views in a database, you can do a
>> GET /db/_all_docs?startkey=_design/&endkey=_design/ZZZZ
>> (we will have a /db/_all_design_docs view to make the ZZZZ-hack
>> go away).
>>
>> That should solve your problem.
>>
>> Yes, such a daemon should be shipped with CouchDB, but we
>> haven't got around to working on the deployment infrastructure yet.
>> Any contributions to this are very welcome. I think the developer's
>> choice of language for helper scripts is Python, but any will do,
>> whatever suits you best.
>>
>> Cheers
>> Jan
>> --
>> The FaQ entry is at the bottom of this page.
>> http://wiki.apache.org/couchdb/FrequentlyAskedQuestions
>> BTW: Is there a FaQ module or whatever for MoinMoin? It
>> would be nice to get a MediaWiki-like table of contents at
>> the top. Noah?
>>
>


Re: Practical storage limit

Posted by Peter Eddy <pe...@gmail.com>.
On Sun, Jul 13, 2008 at 12:46 PM, Jan Lehnardt <ja...@apache.org> wrote:
> To get on write view update semantics, you can write a little daemon
> script that runs alongside CouchDB and is specified in couch.ini
> as DbUpdateNotificationProcesses. This daemon gets sent a
> notification each time the database is changed and could in turn
> trigger a view update every N document inserts or every Y seconds,
> whichever occurs first. The reason not to integrate each doc as
> it comes in is that it is horribly inefficient and CouchDB is designed
> to do view index updates very fast, so batching is a good idea.

Thanks for this information, Jan. I think this solution is a lot
better than trying to figure out when it's necessary to update views
in application code. However I'd still argue that people will want
this as their default behavior most of the time, and I wonder if it
wouldn't be more efficient to just build the daemon functionality into
the server, maybe with tunable N and Y values.

- Peter

> To get a list of all views in a database, you can do a
> GET /db/_all_docs?startkey=_design/&endkey=_design/ZZZZ
> (we will have a /db/_all_design_docs view to make the ZZZZ-hack
> go away).
>
> That should solve your problem.
>
> Yes, such a daemon should be shipped with CouchDB, but we
> haven't got around to working on the deployment infrastructure yet.
> Any contributions to this are very welcome. I think the developer's
> choice of language for helper scripts is Python, but any will do,
> whatever suits you best.
>
> Cheers
> Jan
> --
> The FaQ entry is at the bottom of this page.
> http://wiki.apache.org/couchdb/FrequentlyAskedQuestions
> BTW: Is there a FaQ module or whatever for MoinMoin? It
> would be nice to get a MediaWiki-like table of contents at
> the top. Noah?
>

Re: Practical storage limit

Posted by Jan Lehnardt <ja...@apache.org>.
On Jul 13, 2008, at 18:05, Peter Eddy wrote:

>> What I'd do is query the view for updating after each bulk insert.
>
> This is what I've been doing, however it seems less than ideal. It
> means that the bulk insert code needs to know about all views that
> have been defined.
>
> What I want to avoid, of course, is a user invoking a view and having
> to wait a long time (minutes in some cases) for that view to be
> updated in order to get their results. I would gladly trade document
> insertion time for automatic view updates.
>
> It could be argued that bulk updates are a special case that happen
> only in exceptional situations. But I can easily imagine normal
> application functionality that might add many documents, and this code
> would then need to think about how many documents were added and if
> the views should be updated in the process in order to avoid long
> user-time delays. The larger the application, the more messy and error
> prone this becomes.
>
> Anyway, I've been meaning to mention this for a long time, so since it
> came up...

I'll make a FaQ out of this :-)

To get on write view update semantics, you can write a little daemon
script that runs alongside CouchDB and is specified in couch.ini
as DbUpdateNotificationProcesses. This daemon gets sent a
notification each time the database is changed and could in turn
trigger a view update every N document inserts or every Y seconds,
whichever occurs first. The reason not to integrate each doc as
it comes in is that it is horribly inefficient and CouchDB is designed
to do view index updates very fast, so batching is a good idea.
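
A minimal sketch of such a daemon in Python (assuming each
notification arrives as one line on stdin naming the changed
database; check the CouchDB source for the exact wire format, and
note that BATCH_SIZE and MAX_WAIT are made-up tunables):

    import sys
    import time

    BATCH_SIZE = 100   # trigger after N notifications ...
    MAX_WAIT = 5.0     # ... or after Y seconds, whichever comes first

    def update_views(dbname):
        # query every view in dbname here (the design-doc listing
        # trick below shows how to find them)
        pass

    pending = {}                  # db name -> notifications seen
    last_run = time.time()

    while True:
        line = sys.stdin.readline()
        if not line:              # CouchDB exited; shut down with it
            break
        dbname = line.strip()     # or parse it, if the format is JSON
        pending[dbname] = pending.get(dbname, 0) + 1
        if (pending[dbname] >= BATCH_SIZE
                or time.time() - last_run >= MAX_WAIT):
            update_views(dbname)
            pending[dbname] = 0
            last_run = time.time()
    # note: a real daemon would select() on stdin with a timeout so the
    # Y-second timer can fire even when no new notification arrives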

To get a list of all views in a database, you can do a
GET /db/_all_docs?startkey=_design/&endkey=_design/ZZZZ
(we will have a /db/_all_design_docs view to make the ZZZZ-hack
go away).
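
In Python that lookup is a couple of lines against the HTTP API (a
sketch; depending on the CouchDB version the keys may need to be sent
as JSON strings, hence the %22-encoded quotes, and the host and port
are the defaults):

    import urllib2
    import simplejson

    url = ('http://localhost:5984/db/_all_docs'
           '?startkey=%22_design/%22&endkey=%22_design/ZZZZ%22')
    rows = simplejson.load(urllib2.urlopen(url))['rows']
    design_docs = [row['id'] for row in rows]   # ['_design/foo', ...]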

That should solve your problem.

Yes, such a daemon should be shipped with CouchDB, but we
haven't got around to working on the deployment infrastructure yet.
Any contributions to this are very welcome. I think the developer's
choice of language for helper scripts is Python, but any will do,
whatever suits you best.

Cheers
Jan
--
The FaQ entry is at the bottom of this page.
http://wiki.apache.org/couchdb/FrequentlyAskedQuestions
BTW: Is there a FaQ module or whatever for MoinMoin? It
would be nice to get a MediaWiki-like table of contents at
the top. Noah?

Re: Practical storage limit

Posted by Peter Eddy <pe...@gmail.com>.
> What I'd do is query the view for updating after each bulk insert.

This is what I've been doing, however it seems less than ideal. It
means that the bulk insert code needs to know about all views that
have been defined.

What I want to avoid, of course, is a user invoking a view and having
to wait a long time (minutes in some cases) for that view to be
updated in order to get their results. I would gladly trade document
insertion time for automatic view updates.

It could be argued that bulk updates are a special case that happen
only in exceptional situations. But I can easily imagine normal
application functionality that might add many documents, and this code
would then need to think about how many documents were added and if
the views should be updated in the process in order to avoid long
user-time delays. The larger the application, the more messy and error
prone this becomes.

Anyway, I've been meaning to mention this for a long time, so since it
came up...

- Peter

Re: Practical storage limit

Posted by Jan Lehnardt <ja...@apache.org>.
On Jul 12, 2008, at 23:52, David King wrote:

> Has anyone hit a practical storage limit, either in terms of data or  
> number of records, under couchdb? I have a database of about 70  
> million records (over 60 million of them are very small, only three  
> integer properties) and maybe eight views that I'd like to try a  
> quick port and see how it performs.

We'd love to hear what you come up with and also to solve any problems  
you might encounter on your way. Please let us know. Please note that  
CouchDB at this point is not optimised. We are still in the 'getting  
it right' phase before we come to the 'getting it fast'. That said,  
CouchDB is plenty fast already, but there is also the potential to  
greatly speed up things.

> Are there any gotchas that I should know about first under the  
> default config?

As Paul said, use bulk inserts where possible. Be aware that view
index creation might take some time if you run it on a very large db
for the first time. What I'd do is query the view for updating after
each bulk insert. Note that CouchDB stores more data on disk than is
in your documents, especially when you don't do bulk inserts (but even
then). Run database compaction when the database size gets too large
for you (compaction will need some extra space since it creates a copy
of your data).
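
In couchdb-python terms, "query the view for updating" is just a view
access after each bulk update (a sketch; the view names are
placeholders, list() forces the request in case the view object is
lazy, and count=0 asks for no rows back; later versions call this
parameter limit):

    VIEWS = ['things/by_date', 'things/by_owner']   # your views here

    db.update(chunk)
    for name in VIEWS:
        # touching the view makes CouchDB fold the new documents into
        # the index now, instead of making the first reader pay for it
        list(db.view(name, count=0))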

Other than that, I'm very interested in your findings :)

Cheers
Jan
--

Re: Practical storage limit

Posted by Paul Davis <pa...@gmail.com>.
Not sure if it's a gotcha, but keep in mind that _bulk_docs is your
friend. Single inserts are extremely slow by comparison.
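
For example, with couchdb-python the difference is just the call
shape: one POST to _bulk_docs for the whole batch instead of one HTTP
request per document (a sketch, not a benchmark; 'db' is a
couchdb.client.Database and docs is a list of dicts with '_id' keys):

    # slow: one HTTP request per document
    for doc in docs:
        db[doc['_id']] = doc

    # fast: a single POST to _bulk_docs for the whole batch
    db.update(docs)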

On Sat, Jul 12, 2008 at 6:52 PM, David King <dk...@ketralnis.com> wrote:
> Has anyone hit a practical storage limit, either in terms of data or number
> of records, under couchdb? I have a database of about 70 million records
> (over 60 million of them are very small, only three integer properties) and
> maybe eight views that I'd like to try a quick port and see how it performs.
>
> Are there any gotchas that I should know about first under the default
> config?
>