Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2018/07/08 21:28:38 UTC

[GitHub] janl commented on issue #1063: Aborting continuous listening on _global_changes is leaking resources

janl commented on issue #1063: Aborting continuous listening on _global_changes is leaking resources
URL: https://github.com/apache/couchdb/issues/1063#issuecomment-403318555
 
 
   I finally had some time to set up a reproduction. If anyone here wants to play with it, please contact me privately. I’m happy to hand out SSH access.
   
   Big thanks to @redgeoff for providing such a comprehensive test case.
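   
   For anyone who wants to follow along without the droplets, the pattern under test is roughly the following. This is a minimal sketch, not the actual test script; the host, credentials, and timings are placeholders, and it assumes Node 18+ for the global `fetch`/`AbortController`:
   
   ```typescript
   // Sketch of the repro pattern: repeatedly open a continuous _changes
   // feed on _global_changes and abort it mid-stream. Host, credentials,
   // and timings are placeholders, not the values from the real test case.
   
   const COUCH = "http://127.0.0.1:5984"; // placeholder
   const AUTH =
     "Basic " + Buffer.from("admin:password").toString("base64"); // placeholder
   
   async function listenAndAbort(lifetimeMs: number): Promise<void> {
     const controller = new AbortController();
     const timer = setTimeout(() => controller.abort(), lifetimeMs);
   
     try {
       const res = await fetch(
         `${COUCH}/_global_changes/_changes?feed=continuous&heartbeat=5000`,
         { signal: controller.signal, headers: { Authorization: AUTH } }
       );
       // Drain the feed until the abort tears the connection down.
       const reader = res.body!.getReader();
       for (;;) {
         const { done } = await reader.read();
         if (done) break;
       }
     } catch (err) {
       // An abort error is the expected outcome; anything else is interesting.
       if ((err as Error).name !== "AbortError") console.error(err);
     } finally {
       clearTimeout(timer);
     }
   }
   
   // Open and abort feeds in a loop so half-closed sockets can pile up.
   (async () => {
     for (;;) {
       await listenAndAbort(2_000);
     }
   })();
   ```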
   
   My results are preliminary, so no conclusions yet.
   
   Baseline info:
   - I tested 2.1.1 and master as of today with Erlang/OTP 20.
   - I test on DigitalOcean Droplets: the smallest one (1 “virtual” CPU, 1.8GHz, 1GB RAM, SSD) and the second-smallest “dedicated CPU” option (2 dedicated hyperthreads, TBD GHz, 4GB RAM, SSD).
   - I’m running haproxy on a separate droplet.
   - I’m running the node scripts on yet another droplet.
   
   Observations:
   - The differences I see are less in CPU time (things never go to 100%) than in load average (the first digit).
     - On the smaller droplets, one VM shows occasional +1/3 load values.
     - This seems to correlate with a larger number of sockets on port `5984` in `TIME_WAIT` state (a quick way to watch this is sketched after this list).
   - On the bigger machines, the difference is more like 1/4 or 1/5.
   - In either setup, I couldn’t really detect a sticky workhorse node; instead, the load differential swaps back and forth between the two CouchDB nodes. I didn’t detect a pattern, but (see the graphs below) there is a slight drop in CPU % about once an hour.
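   
   The `TIME_WAIT` correlation is easy to spot-check. Something like the following is enough to watch the counts over time; a sketch only, assuming `ss` from iproute2 is available on the box, with port and interval as placeholders:
   
   ```typescript
   // Sketch: every few seconds, count sockets touching port 5984 that sit
   // in TIME_WAIT. Assumes `ss` from iproute2 is installed; the port and
   // sampling interval are placeholders.
   
   import { execSync } from "child_process";
   
   function timeWaitCount(port: number): number {
     const out = execSync(
       `ss -tan state time-wait "( sport = :${port} or dport = :${port} )"`,
       { encoding: "utf8" }
     ).trim();
     // First line is the header; every further line is one socket.
     return Math.max(out.split("\n").length - 1, 0);
   }
   
   setInterval(() => {
     console.log(new Date().toISOString(), "TIME_WAIT on :5984:", timeWaitCount(5984));
   }, 5_000);
   ```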
   
   Graphs from a ~3-hour run on the higher-perf machines; this is CPU %:
   
   Node 1:
   ![screen shot 2018-07-08 at 23 17 06](https://user-images.githubusercontent.com/11321/42424056-49337f00-8305-11e8-8471-bd1976b448cd.png)
   
   Node 2:
   ![screen shot 2018-07-08 at 23 16 44](https://user-images.githubusercontent.com/11321/42424057-51dfa8fe-8305-11e8-9c31-58dc702391b1.png)
   
   Things I wanna try next:
   - run the long poll option to get a feel for what numbers to expect
   - set `option httpclose` in the haproxy config (see the sketch after this list)
   - disable http keepalive on the CouchDB side (should have about the same effect as the previous item).
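   
   For reference, the haproxy item above would look roughly like the snippet below. This is a sketch against a generic frontend/backend layout, not the actual test config; names, addresses, and timeouts are placeholders, and the directive is spelled `option httpclose` in the haproxy docs:
   
   ```
   # Sketch only: generic haproxy config with connection close enabled.
   defaults
       mode http
       option httpclose            # close the connection after each response
       timeout connect 5s
       timeout client  60s
       timeout server  60s
   
   frontend couchdb_in
       bind *:5984
       default_backend couchdb
   
   backend couchdb
       server couch1 10.0.0.11:5984 check
       server couch2 10.0.0.12:5984 check
   ```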
   
   Gut feel: there is some variance in when TCP sockets get out of TIME_WAIT, which accounts for the extra load. Properly closing those HTTP requests, either in the client or in the load balancer, should help; so might reducing the TIME_WAIT timeout, though I’m not sure how well that is tunable across different systems.
   
   This could be sockets being left in TIME_WAIT because continuous changes requests don’t terminate with a TCP FIN and/or leave unread bytes on the socket, plus system- or test-case-induced variance in when those sockets are cleaned up, leading to higher load on the systems.
   
   Might be worth re-reading https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux to refresh our memory.
   
   That’s it for now. I’ll update this as I have more info.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services