Posted to notifications@couchdb.apache.org by "sergey-safarov (via GitHub)" <gi...@apache.org> on 2023/06/04 06:47:55 UTC

[GitHub] [couchdb] sergey-safarov opened a new issue, #4633: maintenance_mode on high CPU usage

sergey-safarov opened a new issue, #4633:
URL: https://github.com/apache/couchdb/issues/4633

   ## Summary
   
   I run CouchDB from RPM packages (no Docker, no k8s) on AWS behind an Application Load Balancer (ALB).
   The ALB health check is configured against the `/_up` endpoint.
   When the CouchDB daemon fails and starts consuming CPU, I want the ALB to be informed about the node failure.
   Could you add a config param like `maintenance_mode_on_cpu_load` with an integer value (for example, 60% per CPU core)? If the load on any one CPU core rises above the configured value, enable `maintenance_mode` and return a `404` error for the `/_up` endpoint.
   
   ## Additional context
   
   This would allow the ALB to detect a CouchDB node failure and reroute traffic to other nodes in the cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva commented on issue #4633: maintenance_mode on high CPU usage

Posted by "nickva (via GitHub)" <gi...@apache.org>.
nickva commented on issue #4633:
URL: https://github.com/apache/couchdb/issues/4633#issuecomment-1593209456

   @sergey-safarov that would work in some cases, but CPU load might be tricky to figure out across various environments (Windows, Linux, k8s, VMs).
   
   Often a better proxy for system overload is internal process message queues filling up and staying relatively high. That can be detected with the `http $DB/_node/$nodename/_system` endpoint. It shows some of the `message_queues` and their lengths.
   
   ```
   http $DB/_node/_local/_system | jq '.message_queues'
   {
     "couch_file": {},
     "couch_db_updater": {},
     "couch_server": 0,
     "index_server": 0,
      ...
   }
   ```
   
   Specifically, `couch_db_updater` is the one to watch during document writes, as it can back up. But that could also indicate a slow disk IO issue, not necessarily a CPU overload issue.
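
As a rough illustration of the check described above, here is a minimal Python sketch that reads the `message_queues` section of a `_system` response and reports queues at or above a threshold. The function names, the threshold of 1000, and the assumption that aggregated entries (shown as `{}` above) may carry a numeric `max` field are all hypothetical; check the actual output of your CouchDB version. Authentication is left out for brevity.

```python
import json
import urllib.request


def queue_length(value):
    """Reduce one message_queues entry to a single number."""
    if isinstance(value, (int, float)):
        return int(value)
    if isinstance(value, dict):
        # Assumption: aggregated entries may expose a numeric "max" field.
        # An empty object (as in the output above) counts as 0.
        longest = value.get("max", 0)
        return int(longest) if isinstance(longest, (int, float)) else 0
    return 0


def backed_up_queues(system_info, threshold=1000):
    """Return {queue name: length} for queues at or above the threshold."""
    queues = system_info.get("message_queues", {})
    lengths = {name: queue_length(value) for name, value in queues.items()}
    return {name: n for name, n in lengths.items() if n >= threshold}


def fetch_system_info(base_url, node="_local"):
    """Fetch /_node/{node}/_system (hypothetical helper, no auth handling)."""
    with urllib.request.urlopen(f"{base_url}/_node/{node}/_system") as resp:
        return json.load(resp)
```

A single high reading is not necessarily a problem; as noted above, the signal is queues that fill up and stay high, so a check like this would need to be sampled repeatedly over time.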
   
   The Erlang VM also does some busy waiting; that is, it keeps schedulers working a bit longer than necessary, trading CPU usage for latency. That can make it seem like it is running out of CPU capacity, since the CPU usage visible to the OS is higher, but it may still be doing fine in that state. You can disable busy waiting with the `+sbwt none`, `+sbwtdcpu none`, and `+sbwtdio none` settings in `vm.args`.
   
   However, in general, it could be dangerous to put nodes in maintenance mode automatically. There is a good chance that whatever is causing the problem on one node will start happening on another node as well, especially since that node now also has to process the API requests that would have gone to the nodes already in maintenance mode. That could lead to a cascading failure until none of the nodes can accept traffic.
   




[GitHub] [couchdb] sergey-safarov commented on issue #4633: maintenance_mode on high CPU usage

Posted by "sergey-safarov (via GitHub)" <gi...@apache.org>.
sergey-safarov commented on issue #4633:
URL: https://github.com/apache/couchdb/issues/4633#issuecomment-1594152433

   Thanks, @nickva, for the clarification.
   I will collect monitoring data (CPU and IO load) when the issue happens and will provide the `/_node/_local/_system` information.
   Our installation uses CouchDB version 2.3.1. If it helps to understand the issue, I can also provide an error log.
   This does not happen often, so it may take some time to reproduce it again.
   

