Posted to dev@couchdb.apache.org by eric casteleijn <er...@canonical.com> on 2009/11/10 20:10:07 UTC

Running out of erlang processes.

Today our production CouchDB suddenly crashed and refused to restart 
(scrubbed end of the logfile included). After asking on #couchdb, Jan 
kindly helped me understand that this was due to running into the 
default limit of allowed Erlang processes (32K and a bit), and suggested 
increasing the number.

While I have no problem doing that if it solves the problem, I am still 
left with some questions that I hope you all can help me answer.

I'll first sketch the set-up of our system very globally, and tell you 
what we were doing at the time that might have had an impact:

We have a single server node, with a lot of clients replicating to and 
from databases it contains, and a web interface reading from and writing 
to it. The server node contains somewhere between 200K and 300K databases.

I was rather naively collecting some stats from the server by 
running something like the following pseudo code:

all_dbs = GET "/_all_dbs"
for db in all_dbs:
     sleep for a little bit
     GET "/" + db

to get the number of databases and the total number of documents in all 
databases. After some minutes of running this, the crash occurred. This 
might not be related to us running the script, but that would be one 
hell of a coincidence, as we haven't seen this particular error before. 
Also, after aborting the script, CouchDB seemed to slowly recover.
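
(For reference, a minimal Python sketch of the loop above. It assumes the
database info document returned by GET /<db> carries a doc_count field;
collect_stats, fetch_json and the pause parameter are names made up for
this illustration and are not taken from the original script.)

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def collect_stats(base_url, get_json=fetch_json, pause=0.1):
    """Return (number of databases, total doc_count across all of them)."""
    db_names = get_json(base_url + "/_all_dbs")
    total_docs = 0
    for name in db_names:
        time.sleep(pause)  # "sleep for a little bit"
        # Names such as "u/053/2a4/101325/notes" contain slashes, which
        # have to be sent as %2F, as in the request log.
        info = get_json(base_url + "/" + urllib.parse.quote(name, safe=""))
        total_docs += info["doc_count"]
    return len(db_names), total_docs
```

Passing get_json in as a parameter makes it possible to dry-run the loop
against canned responses before pointing it at a production server.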

My questions in declining order of ulcerinducingness:

1. Can anyone explain why the above would cause CouchDB to run out of 
processes?

2. Are there more scenarios like this that can cause such crashes?

3. Is there a way to monitor the number of processes in use? (Ideally I 
would like to have it in _stats, although I don't know how feasible that 
is.)

4. If this is not an outright bug in CouchDB, is there any way to 
degrade a little more gracefully in cases like this? It reminds me of 
the errors CouchDB throws when running out of file descriptors. I would 
prefer things to slow down and maybe time out on connections, rather than 
cause crashes, or (in the case of file descriptors) return server errors.

-- 
- eric casteleijn
https://launchpad.net/~thisfred
http://www.canonical.com

Re: Running out of erlang processes.

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Eric,

On Nov 10, 2009, at 2:10 PM, eric casteleijn wrote:

> Today our production CouchDB suddenly crashed and refused to
> restart (scrubbed end of the logfile included). After asking on
> #couchdb, Jan kindly helped me understand that this was due to
> running into the default limit of allowed Erlang processes (32K and
> a bit), and suggested increasing the number.
>
> While I have no problem doing that if it solves the problem, I am  
> still left with some questions that I hope you all can help me answer.
>
> I'll first sketch the set-up of our system very globally, and tell
> you what we were doing at the time that might have had an impact:
>
> We have a single server node, with a lot of clients replicating to  
> and from databases it contains, and a web interface reading from and  
> writing to it. The server node contains somewhere between 200K and  
> 300K databases.
>
> I was rather naively collecting some stats from the server by
> running something like the following pseudo code:
>
> all_dbs = GET "/_all_dbs"
> for db in all_dbs:
>    sleep for a little bit
>    GET "/" + db
>
> to get the number of databases and the total number of documents in  
> all databases. After some minutes of running this, the crash
> occurred. This might not be related to us running the script, but
> that would be one hell of a coincidence, as we haven't seen this
> particular error before. Also, after aborting the script, CouchDB
> seemed to slowly recover.
>
> My questions in declining order of ulcerinducingness:
>
> 1. Can anyone explain why the above would cause CouchDB to run out  
> of processes?

In my experience, running out of Erlang processes almost always  
indicates a process leak somewhere.  Then again, I have no experience  
running 200k databases on one node.

a) What is your configuration value for max_dbs_open?

b) On average, how many _design docs are actively being queried in  
each of these DBs?

c) Did you change mochiweb's built-in connection limit of 2048?

d) What is your ulimit setting on the server?

e) These replications are initiated by the other servers, right?

> 2. Are there more scenarios like this that can cause such crashes?

I've discovered a few in the past and fixed them as they came up.  I'm  
not aware of any at the moment.

> 3. Is there a way to monitor the number of processes in use?  
> (Ideally I would like to have it in _stats, although I don't know  
> how feasible that is.)

There is a patch in JIRA for this:

https://issues.apache.org/jira/browse/COUCHDB-395

In our production systems we run CouchDB on a named node and monitor  
the VM externally with erl_call and Munin, e.g.

erl_call -c #{cookie} -sname couchdb@'#{Socket.gethostname}' -a 'erlang system_info [process_count]'
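
A rough Python equivalent of that check, as a sketch only. It assumes
erl_call is on the PATH, that the node runs with a short name, and that
erl_call prints the count as a bare integer on stdout. The function names
are hypothetical.

```python
import subprocess


def erl_call_command(cookie, node):
    """Build the erl_call argument list that asks a node for its process count."""
    return [
        "erl_call",
        "-c", cookie,
        "-sname", node,
        "-a", "erlang system_info [process_count]",
    ]


def parse_process_count(output):
    """Parse erl_call's stdout, assumed here to be a bare integer."""
    return int(output.strip())


def process_count(cookie, node):
    """Run erl_call against the node and return its current process count."""
    result = subprocess.run(
        erl_call_command(cookie, node),
        capture_output=True, text=True, check=True,
    )
    return parse_process_count(result.stdout)
```

A monitoring plugin could call process_count() periodically and alert when
the value approaches the configured process limit.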

>
> 4. If this is not an outright bug in CouchDB, is there any way to  
> degrade a little more gracefully in cases like this? It reminds me  
> of the errors CouchDB throws when running out of file descriptors. I  
> would prefer things to slow down and maybe time out on connections, rather
> than cause crashes, or (in the case of file descriptors) return  
> server errors.

Sounds challenging.  If it really isn't a bug you probably want to  
adjust some limits to prevent this from happening.  At Cloudant we are  
tuning the following ordered list of knobs:

connection limit per customer (external to CouchDB)
max_dbs_open
mochiweb server connection limit
ulimit
ERL_MAX_PORTS
Erlang process limit
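
One way to sanity-check such limits against each other is a
back-of-the-envelope budget. The sketch below is an illustration only: the
per-database and per-connection process counts are placeholders that would
have to be measured on a real node, and the process limit and max_dbs_open
values are made up for the arithmetic (2048 is the mochiweb limit mentioned
above).

```python
def process_headroom(process_limit, max_dbs_open, connection_limit,
                     procs_per_db, procs_per_connection, baseline=0):
    """Return how many processes remain if every limit is hit at once.

    A negative result means the limits, taken together, would let the VM
    exceed its process limit.
    """
    worst_case = (baseline
                  + max_dbs_open * procs_per_db
                  + connection_limit * procs_per_connection)
    return process_limit - worst_case


if __name__ == "__main__":
    # Placeholder inputs; measure the real multipliers before relying on this.
    print(process_headroom(process_limit=32768, max_dbs_open=10000,
                           connection_limit=2048, procs_per_db=3,
                           procs_per_connection=1))
```

If the result is negative or uncomfortably small, lowering one of the
earlier knobs in the list or raising the process limit restores the margin.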

You should be able to configure those so that you never run out of  
processes in normal operation, instead refusing new connections or  
closing old DBs.  Best,

Adam

> -- 
> - eric casteleijn
> https://launchpad.net/~thisfred
> http://www.canonical.com
> [Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.23291.1877>]
> 'HEAD' /u%2F053%2F2a4%2F101325%2Fnotes {1,1}
> Headers: [{'Accept',"application/json"},
>          {'Accept-Encoding',"compress, gzip"},
>          {'Authorization',"OAuth realm=\"\", oauth_nonce=\"76187806\",
>            oauth_timestamp=\"1257868744\", oauth_consumer_key=\"ubuntuone\",
>            oauth_signature_method=\"HMAC-SHA1\", oauth_version=\"1.0\",
>            oauth_token=\"*****\", oauth_signature=\"*****\""},
>          {'Connection',"Keep-Alive"},
>          {'Host',"couchdb.one.ubuntu.com"},
>          {'User-Agent',"couchdb-python 0.6"},
>          {'Via',"1.1 couchdb.one.ubuntu.com"},
>          {'X-Forwarded-For',"147.102.133.18"},
>          {"X-Forwarded-Host","couchdb.one.ubuntu.com"},
>          {"X-Forwarded-Server","couchdb.one.ubuntu.com"},
>          {"X-Forwarded-Ssl","on"}]
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.23291.1877>] OAuth  
> Params: [{"oauth_nonce","76187806"},
>               {"oauth_timestamp","1257868744"},
>               {"oauth_consumer_key","ubuntuone"},
>               {"oauth_signature_method","HMAC-SHA1"},
>               {"oauth_version","1.0"},
>               {"oauth_token","*****"},
>               {"oauth_signature","*****"}]
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.23291.1877>]  
> request_group {Pid, Seq} {<0.1222.1877>,105002}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [emulator] Too many processes
>
>
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.11658.1881>]
> 'GET' /u%2F043%2F3d6%2F128956%2Fcontacts {1,1}
> Headers: [{'Accept-Encoding',"identity"},
>          {'Authorization',"Basic *****"},
>          {'Content-Type',"application/json"},
>          {'Host',"*****:9030"},
>          {'User-Agent',"couchdb minimal http interface"}]
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.11658.1881>] OAuth  
> Params: []
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.22836.1877>]
> 'GET' /u%2Fad1%2F288%2F12708%2Fnotes/_local%2F06b93d053bca1fb7402800b657ba29bd {1,1}
> Headers: [{'Accept',"application/json"},
>          {'Accept-Encoding',"gzip"},
>          {'Authorization',"OAuth oauth_signature=\"*****\",
>            oauth_token=\"*****\", oauth_version=\"1.0\", oauth_nonce=\"*****\",
>            oauth_timestamp=\"1257868739\",
>            oauth_signature_method=\"HMAC-SHA1\",
>            oauth_consumer_key=\"ubuntuone\""},
>          {'Connection',"Keep-Alive"},
>          {'Host',"couchdb.one.ubuntu.com:443"},
>          {'User-Agent',"CouchDB/0.10.0"},
>          {'Via',"1.1 couchdb.one.ubuntu.com"},
>          {'X-Forwarded-For',"*****"},
>          {"X-Forwarded-Host","couchdb.one.ubuntu.com:443"},
>          {"X-Forwarded-Server","couchdb.one.ubuntu.com"},
>          {"X-Forwarded-Ssl","on"}]
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.22836.1877>] OAuth  
> Params: [{"oauth_signature","*****"},
>               {"oauth_token","*****"},
>               {"oauth_version","1.0"},
>               {"oauth_nonce","*****"},
>               {"oauth_timestamp","1257868739"},
>               {"oauth_signature_method","HMAC-SHA1"},
>               {"oauth_consumer_key","ubuntuone"}]
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.785.1877>]
> ** Generic server couch_server terminating
> ** Last message in was {open,<<"u/053/2a4/101325/notes">>,
>                             [{user_ctx,{user_ctx,<<"101325">>,[]}}]}
> ** When Server state == {server,"/srv/couchdb/database",
>                            {re_pattern,0,0,
>                                <<69,82,67,80,124,0,0,0,16,0,0,0,1,0,0,0,0,0,
>                                  0,0,0,0,0,0,48,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
>                                  0,0,0,0,0,0,0,0,0,93,0,72,25,77,0,0,0,0,0,0,
>                                  0,0,0,0,0,0,254,255,255,7,0,0,0,0,0,0,0,0,0,
>                                  0,0,0,0,0,0,0,77,0,0,0,0,16,171,255,3,0,0,0,
>                                  128,254,255,255,7,0,0,0,0,0,0,0,0,0,0,0,0,0,
>                                  0,0,0,69,26,84,0,72,0>>},
>                            10000,5426,"Tue, 10 Nov 2009 15:57:02 GMT"}
> ** Reason for termination ==
> ** {system_limit,[{erlang,spawn_opt,
>                          [proc_lib,init_p,
>                           [couch_server,
>                            [couch_primary_services,couch_server_sup,<0.1.0>],
>                            gen,init_it,
>                            [gen_server,<0.785.1877>,<0.785.1877>,couch_db,
>                             {<<"u/053/2a4/101325/notes">>,
>                              "/srv/couchdb/database/u/053/2a4/101325/notes.couch",
>                              <0.29700.1881>,
>                              [{user_ctx,{user_ctx,<<"101325">>,[]}}]},
>                             []]],
>                           [link]]},
>                  {proc_lib,start_link,5},
>                  {couch_db,start_link,3},
>                  {couch_server,handle_call,3},
>                  {gen_server,handle_msg,5},
>                  {proc_lib,init_p_do_apply,3}]}
>
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>    {<0.780.1877>,supervisor_report,
>     [{supervisor,{local,couch_primary_services}},
>      {errorContext,child_terminated},
>      {reason,
>          {system_limit,
>              [{erlang,spawn_opt,
>                   [proc_lib,init_p,
>                    [couch_server,
>                     [couch_primary_services,couch_server_sup,<0.1.0>],
>                     gen,init_it,
>                     [gen_server,<0.785.1877>,<0.785.1877>,couch_db,
>                      {<<"u/053/2a4/101325/notes">>,
>                       "/srv/couchdb/database/u/053/2a4/101325/notes.couch",
>                       <0.29700.1881>,
>                       [{user_ctx,{user_ctx,<<"101325">>,[]}}]},
>                      []]],
>                    [link]]},
>               {proc_lib,start_link,5},
>               {couch_db,start_link,3},
>               {couch_server,handle_call,3},
>               {gen_server,handle_msg,5},
>               {proc_lib,init_p_do_apply,3}]}},
>      {offender,
>          [{pid,<0.785.1877>},
>           {name,couch_server},
>           {mfa,{couch_server,sup_start_link,[]}},
>           {restart_type,permanent},
>           {shutdown,brutal_kill},
>           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,start_error},
>                {reason,{already_started,<0.785.1877>}},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}
>
> [Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>]  
> {error_report,<0.24.0>,
>              {<0.780.1877>,supervisor_report,
>               [{supervisor,{local,couch_primary_services}},
>                {errorContext,shutdown},
>                {reason,reached_max_restart_intensity},
>                {offender,[{pid,<0.785.1877>},
>                           {name,couch_server},
>                           {mfa,{couch_server,sup_start_link,[]}},
>                           {restart_type,permanent},
>                           {shutdown,brutal_kill},
>                           {child_type,supervisor}]}]}}


Re: Running out of erlang processes.

Posted by Paul Davis <pa...@gmail.com>.
> all_dbs = GET "/_all_dbs"
> for db in all_dbs:
>    sleep for a little bit
>    GET "/" + db

Eric,

Another thing to check is whether your client library is reusing
connections. Each open socket to the database creates a new Erlang
process, so if your client isn't reusing connections this could start
chewing up Erlang PIDs unnecessarily. Couple that with a client that
also doesn't close the socket explicitly and a high connection limit,
and you could start having a noticeable impact.
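
As an illustration of connection reuse with an explicit close, here is a
minimal sketch using Python's standard http.client. The function names are
hypothetical, and it assumes the server keeps the connection alive between
requests.

```python
import http.client
import json
import urllib.parse


def fetch_db_infos(conn, db_names):
    """Fetch the info document for each database over one open connection."""
    infos = {}
    for name in db_names:
        # Slashes in database names are sent as %2F.
        conn.request("GET", "/" + urllib.parse.quote(name, safe=""))
        resp = conn.getresponse()
        body = resp.read()  # drain the body so the socket can be reused
        infos[name] = json.loads(body)
    return infos


def fetch_all(host, port, db_names):
    """Open a single connection, use it for every request, then close it."""
    conn = http.client.HTTPConnection(host, port, timeout=30)
    try:
        return fetch_db_infos(conn, db_names)
    finally:
        conn.close()  # close explicitly instead of leaving the socket open
```

With this shape the whole scan occupies one socket on the server rather
than one per request.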

I'm still not the biggest fan of the system info patch in JIRA, though
I'm less of an unfan rereading it now. I would be OK adding a
num_erlang_processes statistic to the stats module, assuming that
erlang:system_info(process_count) isn't too costly.

HTH,
Paul Davis