You are viewing a plain text version of this content. The canonical link for it is here.

Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2020/09/17 23:08:25 UTC

[GitHub] [couchdb] chewbranca opened a new issue #3160: Work around statistics(run_queue) returning incorrect data

chewbranca opened a new issue #3160:
URL: https://github.com/apache/couchdb/issues/3160


   [NOTE]: # ( ^^ Provide a general summary of the issue in the title above. ^^ )
   
   ## Description
   
   In the `/_system` endpoint, particularly in the `chttpd_node:get_stats/0` function, one of the data points returned is an integer reflecting the total sum of the length of all run queues [1]. With the introduction of dirty schedulers in Erlang, the `statistics(run_queue)` statistic now includes information about the depth of the dirty CPU and IO queues, however, the documentation indicates it should not be including those values. At least that's how it appears to me, I've filed a bug on the matter [2] and I'll keep an eye on that.
   
   In the meantime, we can isolate the scheduler run queues from the dirty scheduler run queues, and report those values separately which will ensure we have an accurate value for `run_queue` and also allow us to introduce a new value for the dirty CPU queue.
   
   For more context, CouchDB uses several NIFs, but none of them are dirty NIFs. However, OTP's `rsa_generate_key` function _is_ a dirty NIF, and will flow through the dirty NIF CPU queue. If you're using something like HAProxy for SSL termination, you won't normally encounter this, however, we can still end up making dirty NIF calls for a push replication to a remote instance over https. The end result is that the `run_queue` statistic in the `/_system` endpoint will reflect the amount of crypto work going through the dirty cpu queue and will give false positives on run_queue overload, as elevated run_queues are usually indicative of the Erlang VM having more work to handle that resources available.
   
   [NOTE]: # ( Describe the problem you're encountering. )
   [TIP]:  # ( Do NOT give us access or passwords to your actual CouchDB! )
   
   ## Steps to Reproduce
   
   [NOTE]: # ( Include commands to reproduce, if possible. curl is preferred. )
   
   See the bug report in [2] for more details.
   
   ## Expected Behaviour
   
   [NOTE]: # ( Tell us what you expected to happen. )
   
   ## Your Environment
   
   [TIP]:  # ( Include as many relevant details about your environment as possible. )
   [TIP]:  # ( You can paste the output of curl http://YOUR-COUCHDB:5984/ here. )
   
   * CouchDB version used:
   * Browser name and version:
   * Operating system and version:
   
   ## Additional Context
   
   [TIP]:  # ( Add any other context about the problem here. )
   
   [1] https://github.com/apache/couchdb/blob/master/src/chttpd/src/chttpd_node.erl#L216
   [2] https://bugs.erlang.org/browse/ERL-1355
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [couchdb] chewbranca commented on issue #3160: Work around statistics(run_queue) returning incorrect data

Posted by GitBox <gi...@apache.org>.

chewbranca commented on issue #3160:
URL: https://github.com/apache/couchdb/issues/3160#issuecomment-697759402


   Alright, I've got two PRs out for the 3.x line and the main line:
   
   main: https://github.com/apache/couchdb/pull/3168
   3.x: https://github.com/apache/couchdb/pull/3161


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [couchdb] chewbranca closed issue #3160: Work around statistics(run_queue) returning incorrect data

Posted by GitBox <gi...@apache.org>.

chewbranca closed issue #3160:
URL: https://github.com/apache/couchdb/issues/3160


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [couchdb] chewbranca commented on issue #3160: Work around statistics(run_queue) returning incorrect data

Posted by GitBox <gi...@apache.org>.

chewbranca commented on issue #3160:
URL: https://github.com/apache/couchdb/issues/3160#issuecomment-697973296


   I've merged both PRs, and now the CouchDB run_queues statistic represents the same values as before, along with a new statistic for the dirty cpu queue. Closing this out.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [couchdb] chewbranca commented on issue #3160: Work around statistics(run_queue) returning incorrect data

Posted by GitBox <gi...@apache.org>.

chewbranca commented on issue #3160:
URL: https://github.com/apache/couchdb/issues/3160#issuecomment-695003185


   I got confirmation in the Erlang bug report [1] that there is in fact a discrepancy between the documentation of `statistics(run_queue)` and the output, however, Rickard Green is of the opinion that it all the CPU run_queues should be aggregated together, which certainly has merits. This morning he merged a change to the maintenance branch updating the documentation to indicate `statistics(run_queue)` does _not_ exclude the dirty CPU queue.
   
   That said, I don't think that's appropriate for our use case. I've replied to the ticket demonstrating a scenario where aggregating the normal and dirty CPU queues can obscure the meaning of `statistics(run_queue)`, especially for CouchDB where the scenarios of elevated normal run_queues are fundamentally different than that of an elevated dirty CPU queue (assuming you use a proxy for SSL termination).
   
   Therefore I remain of the opinion that it's important to separate the regular run_queue values from the dirty cpu run_queue. In the PR I linked above, I've already started going down that path using the `statistics(run_queue_lengths)` value, which is also much more efficient than what we've been doing, as `statistics(run_queue)` performs a lock on all run_queues to read from them, whereas the newer APIs do not lock. So we can switch to the new version that doesn't lock _and_ isolate these two data points as separate metrics.
   
   
   [1] https://bugs.erlang.org/browse/ERL-1355


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [couchdb] chewbranca commented on issue #3160: Work around statistics(run_queue) returning incorrect data

Posted by GitBox <gi...@apache.org>.

chewbranca commented on issue #3160:
URL: https://github.com/apache/couchdb/issues/3160#issuecomment-694551709

I've got a proof of concept implementation separating out the scheduler and dirty scheduler run_queues in [1].

That PR uses the relatively new statistic call `statistics(run_queue_lengths)` which was introduced in OTP-18.3. This statistic returns a list of length N + 1 where N is the number of regular schedulers, and +1 is the currently fixed value of the number of dirty CPU schedulers in Erlang. You can see in the PR [1] that I'm using that data to also construct a new stat entry for the dirty CPU queue. We could use the even newer `statistics(run_queue_lengths_all)` statistic which returns a list of N + 2 for the N schedulers and the 2 dirty scheduler queues, however, Erlang does not yet use dirty IO NIFs for file IO, so that value will always be zero, and that function was introduced in OTP-20, and we still kinda sorta maybe tolerate OTP-19. I'm not opposed to requiring OTP-20, but there's not a big rush on this front as we don't use dirty IO schedulers yet.

Btw, the PR is against 3.x, we'll want to have a separate PR for 4.x, which is why I'm adding the details here in the issue.

[1] https://github.com/apache/couchdb/pull/3161
[2] https://erlang.org/doc/man/erlang.html#statistics_run_queue_lengths
[3] https://erlang.org/doc/man/erlang.html#statistics_run_queue_lengths_all

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org