Posted to notifications@couchdb.apache.org by gi...@git.apache.org on 2017/10/02 17:39:26 UTC

[GitHub] uschtwill opened a new issue #859: Various error messages and breakage

URL: https://github.com/apache/couchdb/issues/859
 
 
   I do realise this is (probably) not a bug report, but I had to write down all of this somewhere, right?
   
   It would also be nice to document the explanations for these error messages and their root cause(s) somewhere (assuming someone can provide them), so that other people who run into these errors in the future have more to go on than I do now.
   
   Please let me know if I can be more precise, add additional information or do anything else to improve this inquiry.
   
   ## Context
   We are using CouchDB to store documents that are read out with Logstash and its [couchdb_changes plugin](https://www.elastic.co/guide/en/logstash/current/plugins-inputs-couchdb_changes.html), which modifies them and pushes them downstream for further processing.
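   
   For anyone not familiar with the plugin, a minimal input block looks roughly like this (hostname, database name and sequence file path are placeholders, not our real values):
   ```
   input {
     couchdb_changes {
       host          => "couchdb"                            # placeholder hostname
       db            => "db"                                 # placeholder database name
       sequence_path => "/usr/share/logstash/.couchdb_seq"   # file that stores the last processed update_sequence
     }
   }
   ```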
   
   We have 3 single-node instances, all of which exhibit the behaviour described below: prod and staging each hold around 10M documents, and a dev instance just a couple of thousand. All instances run as Docker containers on a single Docker host that is also running a lot of other containers (80+).
   
   ## Current Behavior
   We are getting the following errors, which I don't know how to interpret. I have not yet been able to reproduce them reliably, and I am not sure what their consequences are or what the root cause is. So I am pretty much lost.
   
   The error that is concerning me the most is:
   ```
   [error] 2017-10-02T15:14:56.440179Z nonode@nohost <0.23835.5135> 1c9aabe372 req_err(527034634) timeout : nil
     [{rexi,stream2,3,[{file,"src/rexi.erl"},{line,213}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,349}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,783}]},{couch_btree,stream_kp_node,7,[{file,"src/couch_btree.erl"},{line,710}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,754}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,220}]},{couch_db,changes_since,5,[{file,"src/couch_db.erl"},{line,1244}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,83}]}]
   [error] 2017-10-02T15:14:56.440653Z nonode@nohost <0.23835.5135> 1c9aabe372 Response abnormally terminated: {timeout,nil,[{rexi,stream2,3,[{file,"src/rexi.erl"},{line,213}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,349}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,783}]},{couch_btree,stream_kp_node,7,[{file,"src/couch_btree.erl"},{line,710}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,754}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,220}]},{couch_db,changes_since,5,[{file,"src/couch_db.erl"},{line,1244}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,83}]}]}
   ```
   This one makes Logstash reset/lose the file it uses to track which update_sequence it is at. That is actually how I noticed the problem: Logstash jumped back a couple of million updates. Ouch.
   
   Another error that regularly pops up around the former one (in batches of up to 10 or so) is:
   ```
   [error] 2017-10-02T12:58:21.722899Z nonode@nohost <0.18748.5131> 1c9aabe372 rexi_server exit:timeout [{rexi,stream2,3,[{file,"src/rexi.erl"},{line,213}]},{fabric_rpc,changes_enumerator,2,[{file,"src/fabric_rpc.erl"},{line,349}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,783}]},{couch_btree,stream_kp_node,7,[{file,"src/couch_btree.erl"},{line,710}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,754}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,220}]},{couch_db,changes_since,5,[{file,"src/couch_db.erl"},{line,1244}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,83}]}]
   ```
   ... or as a shorter variation:
   ```
   [error] 2017-10-01T09:55:45.098141Z nonode@nohost <0.26324.4873> 1bc0992f85 rexi_server exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,286}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,632}]},{couch_mrview,query_view,5,[{file,"src/couch_mrview.erl"},{line,255}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
   ```
   This one also occurs in batches:
   ```
   [error] 2017-10-02T13:50:21.375009Z nonode@nohost <0.17003.82> -------- OS Process Error <0.26804.77> :: {os_process_error,"OS process timed out."}
   ```
   and so, occasionally, does this one:
   ```
   [error] 2017-10-02T13:48:12.959135Z nonode@nohost emulator -------- Error in process <0.10836.82> with exit value: {{nocatch,{os_process_error,"OS process timed out."}},[{couch_os_process,prompt,2,[{file,"src/couch_os_process.erl"},{line,59}]},{couch_query_servers,map_doc_raw,2,[{file,"src/couch_query_servers.erl"},{line,67}]},{couch_mrview_updater... 
     [<<"couch_mrview_util:get_view/4 L56">>,<<"couch_mrview:query_view/6 L244">>,<<"rexi_server:init_p/3 L139">>]
   ```
   In the logs of the dev env I also found:
   ```
   [error] 2017-10-02T13:49:15.968875Z nonode@nohost <0.11410.82> a95cb34482 req_err(476198327) unknown_error : function_clause
     [<<"couch_stats_httpd:transform_stats/2 L34">>,<<"chttpd_misc:handle_node_req/1 L289">>,<<"chttpd:process_request/1 L293">>,<<"chttpd:handle_request_int/1 L229">>,<<"mochiweb_http:headers/6 L122">>,<<"proc_lib:init_p_do_apply/3 L237">>]
   ```
   and:
   ```
   [error] 2017-10-02T13:48:59.581334Z nonode@nohost <0.11410.82> 3ee8c49d37 req_err(476198327) unknown_error : function_clause
   ```
   or crash reports like this:
   ```
   [error] 2017-10-02T13:48:54.516307Z nonode@nohost <0.11488.82> -------- gen_server couch_compaction_daemon terminated with reason: {compaction_loop_died,{timeout,{gen_server,call,[<0.262.0>,start_compact]}}}
     last msg: {'EXIT',<0.11672.82>,{timeout,{gen_server,call,[<0.262.0>,start_compact]}}}
     state: {state,<0.11672.82>,[<<"shards/80000000-9fffffff/db.1506787408">>]}
   [error] 2017-10-02T13:48:05.838229Z nonode@nohost <0.11650.82> -------- CRASH REPORT Process  (<0.11650.82>) with 0 neighbors exited with reason: {timeout,{gen_server,call,[couch_httpd_vhost,get_state]}} at gen_server:call/2(line:182) <= couch_httpd_vhost:dispatch_host/1(line:102) <= chttpd:handle_request/1(line:142) <= mochiweb_http:headers/6(line:122) <= proc_lib:init_p_do_apply/3(line:237); initial_call: {mochiweb_acceptor,init,['Argument__1','Argument__2',...]}, ancestors: [chttpd,chttpd_sup,<0.355.0>], messages: [{#Ref<0.0.71.43291>,{vhosts_state,[],["_utils","_...",...],...}}], links: [<0.358.0>,#Port<0.64807>], dictionary: [{couch_rewrite_count,0}], trap_exit: false, status: running, heap_size: 1598, stack_size: 27, reductions: 1126
   ```
   
   ## Possible Solution
   When we first got this error (the very top one) a while ago, I set
   ```
   [fabric]
   request_timeout = infinity
   ```
   for all of the instances. The error messages (the first two from above) stopped, and I thought I had solved the problem. But yesterday things started breaking again: the resets happened on staging and prod, and all mayhem broke loose on the environment I was currently developing against.
   
   From what I have read so far, my solution right now would be to significantly increase `os_process_timeout` and `os_process_limit` (they're still at their default values, 5000 and 25), although I am not entirely sure what is happening in the first place. There seem to be so many things breaking at the same time (across environments) that I suspect increasing those settings might make the symptoms go away without addressing the root cause. From my googling I suspect something to do with ulimits/file descriptors/open files, but I am kind of out of my depth there.
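   
   For reference, bumping those two settings in `local.ini` would look roughly like this (the values are illustrative guesses, not tested recommendations):
   ```
   ; local.ini
   [couchdb]
   ; kill view/show/filter couchjs processes after this many ms (default 5000)
   os_process_timeout = 30000
   
   [query_server_config]
   ; maximum number of concurrent couchjs processes (default 25 here)
   os_process_limit = 100
   ```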
   
   And from inside one of the containers everything looks fine:
   ```
   root@e439ab2acd96:/opt/couchdb# ulimit -a
   core file size          (blocks, -c) 0
   data seg size           (kbytes, -d) unlimited
   scheduling priority             (-e) 0
   file size               (blocks, -f) unlimited
   pending signals                 (-i) 128197
   max locked memory       (kbytes, -l) 64
   max memory size         (kbytes, -m) unlimited
   open files                      (-n) 524288
   pipe size            (512 bytes, -p) 8
   POSIX message queues     (bytes, -q) 819200
   real-time priority              (-r) 0
   stack size              (kbytes, -s) 8192
   cpu time               (seconds, -t) unlimited
   max user processes              (-u) 524288
   virtual memory          (kbytes, -v) unlimited
   file locks                      (-x) unlimited
   root@e439ab2acd96:/opt/couchdb# cat /proc/sys/fs/file-nr
   35808   0       3274312
   ```
   
   On the other hand, I just realised that on the host the limit for open files seems ridiculously low, although that hasn't been an issue so far. Maybe that's it already? Anyway, I would like to not dirty the crime scene, so I will leave it as is for now (a sketch of how the limit could be raised for the containers follows below the output).
   ```
   root@vps:~# ulimit -n
   1024
   ```
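   
   If the host limit does turn out to matter, one way to give just the CouchDB containers a higher nofile limit (without touching the host-wide defaults) would be something along these lines; the container name and the numbers are placeholders:
   ```
   # per-container override at start time (placeholder name and limits)
   docker run -d --name couchdb-prod \
     --ulimit nofile=65536:65536 \
     klaemo/couchdb:2.0.0
   ```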
   
   ANY pointers are welcome...
   
   ## Your Environment
   * Version used: Docker klaemo/couchdb:2.0.0
   * Operating System and version (desktop or mobile): Ubuntu 14.04
   * Docker host: 32 GB RAM, 12-core, 500 GB SSD VPS
   
 