Posted to user@couchdb.apache.org by Darrell Rodgers <DR...@immersivetechnologies.com> on 2019/04/15 09:52:33 UTC

RE: Getting an error "all_dbs_active" running a CouchDB 2.3 cluster

Hi Jan,

Thanks for the info. Jake is now off the project, but I've been doing some more work on it in the meantime.

We have q=8, so yes you're right that's more potential open DBs than the limit of 500 - good tip!
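
To sanity-check that per node, something like the following can be run (a rough sketch against the /_all_dbs and per-node config endpoints; the host, credentials and hard-coded q are placeholders to adjust for your cluster):

    # Rough sketch: compare the total shard count against max_dbs_open on one
    # node. BASE, the credentials and Q are placeholders for this example.
    import base64, json, urllib.request

    BASE = "http://127.0.0.1:5984"
    AUTH = "Basic " + base64.b64encode(b"admin:password").decode()
    Q = 8  # our q; read cluster.q from GET /{db} if it varies per database

    def get(path):
        req = urllib.request.Request(BASE + path, headers={"Authorization": AUTH})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    dbs = get("/_all_dbs")
    max_dbs_open = int(get("/_node/_local/_config/couchdb/max_dbs_open"))
    shards = len(dbs) * Q

    print("databases:", len(dbs), " potential open shards:", shards,
          " max_dbs_open:", max_dbs_open)
    if shards > max_dbs_open:
        print("all_dbs_active is possible if every shard is touched at once")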

I've noticed something I can't explain and was wondering if you could shed some light.

With our test CouchDB 2.0 cluster with 720 DB shards (90 databases * q=8; n=2 across 2 nodes), we've noticed the nodes stabilise at the following memory usage:

max_dbs_open | Memory
500          | 2GB
800          | 4GB
1000         | 6GB

This is the equilibrium it reaches while under load from our view-building and conflict-resolution scheduled tasks, but with no replication or other usage.

Running a node with only 4GB of RAM and max_dbs_open = 1000, the CouchDB process's memory usage grew until the kernel killed it; it then built up again under load and was killed again.
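
Something like the following can be used to watch where each node lands (a sketch that assumes Linux /proc, a single beam.smp process and the per-node _stats endpoint; host and credentials are placeholders):

    # Rough monitoring sketch: sample CouchDB's open_databases stat and the
    # Erlang VM's resident memory once a minute. Adjust BASE and credentials.
    import base64, json, subprocess, time, urllib.request

    BASE = "http://127.0.0.1:5984"
    AUTH = "Basic " + base64.b64encode(b"admin:password").decode()

    def get(path):
        req = urllib.request.Request(BASE + path, headers={"Authorization": AUTH})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def beam_rss_mb():
        pid = subprocess.check_output(["pgrep", "-o", "beam.smp"]).split()[0].decode()
        with open("/proc/%s/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) // 1024  # kB -> MB
        return 0

    while True:
        open_dbs = get("/_node/_local/_stats/couchdb/open_databases").get("value")
        print(time.strftime("%H:%M:%S"), "open_databases:", open_dbs,
              "beam.smp RSS (MB):", beam_rss_mb())
        time.sleep(60)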

Questions:
1. Can you explain why there is a difference between max_dbs_open of 800 and 1000, even though we have fewer than 800 total DB shards? (Does it cache more than one copy of a shard?)
2. Is there a reliable way to predict how much memory the server is going to need?
3. Is the process running out of memory and being killed by the kernel expected in that situation, or have I missed a config item somewhere that caps what CouchDB will use?

Thanks in advance!

Darrell Rodgers

-----Original Message-----
From: Jan Lehnardt <ja...@apache.org> 
Sent: Thursday, 14 March 2019 6:16 PM
To: user@couchdb.apache.org
Subject: Re: Getting an error "all_dbs_active" running a CouchDB 2.3 cluster

Heya Jake,

what is your q value for your databases on the source and target clusters?

The default is 8, so 8*90 gives us 720 potential open dbs.

It could just be that under normal operation, your 2.0 cluster never has to open all shards at the same time, but now that you are running a migration, the target cluster has to.

IIRC there were no semantic changes in this handling between 2.0 and 2.3.
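
If the target cluster really does need all shards open at once, the usual workaround is to raise max_dbs_open above databases * q on every node, either in the [couchdb] section of local.ini or via the config API. A rough sketch of the latter (the value, host and credentials are placeholders):

    # Rough sketch: raise max_dbs_open on one node via the config API;
    # repeat for every node in the cluster.
    import base64, urllib.request

    NODE = "http://127.0.0.1:5984"
    AUTH = "Basic " + base64.b64encode(b"admin:password").decode()

    req = urllib.request.Request(
        NODE + "/_node/_local/_config/couchdb/max_dbs_open",
        data=b'"1000"',  # the config API takes a JSON string
        method="PUT",
        headers={"Authorization": AUTH, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print("previous value:", resp.read().decode())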

Best
Jan
-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/


> On 14. Mar 2019, at 11:05, Jake Kroon <JK...@immersivetechnologies.com> wrote:
> 
> Hi,
>  
> I’m trying to migrate a CouchDB cluster from 2.0 to 2.3 by creating a new cluster and replicating the databases over to it; eventually I plan to switch over to the new one. Generally this process is going fine, but I’m getting errors similar to the following when running my applications against the new cluster:
>  
> [error] 2019-02-21T07:04:51.213276Z couchdb@ip <0.17397.4590> f346ddb688 rexi_server: from: couchdb@ip (<0.32026.4592>) mfa: fabric_rpc:map_view/5 error:{badmatch,{error,all_dbs_active}} [{fabric_rpc,map_view,5,[{file,"src/fabric_rpc.erl"},{line,148}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
>  
> I’m using the default max_dbs_open value of 500 (which is preset in the default.ini file). As far as I understand it, this should be plenty, and it’s what I’m successfully using on my current 2.0 cluster with no errors. I may be misunderstanding how this setting works though.
>  
> I have about 90 databases in the cluster, and all I’m currently running is a couple of scripts:
>  
> 	• A “build views” script that runs every hour, that goes through each database and queries each of the views (in series).
> 	• A “conflict resolver” script that runs every 15 minutes, that queries all databases for conflicts and then performs custom logic to deal with conflicts (though there won’t be any conflicts on our new server at this time, so it’s just querying the conflicts view on each database)
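>  
> (For context: a conflicts view of this kind is usually just a map over doc._conflicts. A minimal illustrative sketch, with placeholder host, credentials, database and design doc names:)
>  
>     # Illustrative sketch: create and query a conflicts view on one database.
>     # Host, credentials, database and design doc names are placeholders.
>     import base64, json, urllib.error, urllib.request
>  
>     BASE = "http://127.0.0.1:5984"
>     AUTH = "Basic " + base64.b64encode(b"admin:password").decode()
>  
>     def call(method, path, body=None):
>         req = urllib.request.Request(BASE + path, data=body, method=method,
>                                      headers={"Authorization": AUTH,
>                                               "Content-Type": "application/json"})
>         with urllib.request.urlopen(req) as resp:
>             return json.load(resp)
>  
>     ddoc = {"views": {"conflicts": {"map":
>         "function (doc) { if (doc._conflicts) { emit(doc._id, doc._conflicts); } }"}}}
>     try:
>         call("PUT", "/mydb/_design/maintenance", json.dumps(ddoc).encode())
>     except urllib.error.HTTPError as err:
>         if err.code != 409:  # 409 means the design doc already exists
>             raise
>     rows = call("GET", "/mydb/_design/maintenance/_view/conflicts")["rows"]
>     print(len(rows), "conflicted documents in mydb")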
>  
> I also previously had continuous bidirectional replication set up between the new cluster and the old one, and the “all_dbs_active” error was happening quite often (a couple of times per hour). I’ve cancelled all the replication jobs and the error has reduced to about 1 or 2 instances per day.
>  
> I haven’t yet tried increasing the max_dbs_open value (which seems to be a common suggestion for dealing with the “all_dbs_active” error), because the live 2.0 cluster is working fine with the default value of 500, and has higher load on it than the new 2.3 cluster.
>  
> I was wondering if anyone has any suggestions on what I should look at to try to solve this issue?
>  
> I’m running the cluster on Ubuntu 18.04 LTS.
>  
> Thanks!
> Jake Kroon
> Software Engineer
>  
> D: +61 8 9318 6949
> E: JKroon@immersivetechnologies.com
>  
