Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2017/12/12 00:44:05 UTC

[GitHub] redgeoff opened a new issue #1063: Aborting continuous listening on _global_changes is leaking resources

URL: https://github.com/apache/couchdb/issues/1063
 
 
   Get ready, this is a fun one! :D
   
   ## Overview
   - I suspect that there is a resource leak where CouchDB does not properly release all the associated resources when a continuous feed is aborted. The test scripts below continuously update docs while listening on the _changes feed of the _global_changes database. The _global_changes stream is canceled every second.
   - This might seem like a rather harmless issue in an isolated test like this, but in a production environment with a fair bit of load, this resource leak can stop your database from scaling as your nodes can easily reach 100% CPU usage and become unresponsive, regardless of the number of nodes you have.
   - In my case, I was able to use feed=longpoll instead of feed=continuous to prevent this resource leak, but this alternative may not be viable for all use cases. Moreover, if there is a resource leak, it could be causing problems in other areas of CouchDB.
   
   ## Expected Behavior
   CPU load should be distributed equally across the nodes in the cluster.
   
   ## Current Behavior
   A single node in the cluster is reaching 100% CPU usage and load is not being distributed equally across nodes in the cluster.
   
   ## Possible Solution
   I don't currently have a solution, but if your design permits, you can work around this by using feed=longpoll instead of feed=continuous.
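
   For reference, the key difference between the two test scripts below is the feed option passed to slouch.db.changes and whether the client has to abort the request. Condensed here for comparison (taken from the full scripts that follow):

   ```js
   // Condensed from the two test scripts below; only the changes-feed options differ.
   var Slouch = require('couch-slouch');
   var slouch = new Slouch('http://admin:secret@example.com:5984');
   var lastSeq = undefined;

   // feed=continuous (changes-resource-leak-bad-test.js): the client has to
   // abort the request itself, and that abort is what appears to leak resources.
   var continuousIterator = slouch.db.changes('_global_changes', {
     include_docs: true,
     feed: 'continuous',
     heartbeat: true,
     since: lastSeq,
     limit: 10
   });

   // feed=longpoll (changes-resource-leak-good-test.js): the request completes
   // on its own once changes arrive (or when the timeout expires), so the
   // client never has to abort it.
   var longpollIterator = slouch.db.changes('_global_changes', {
     include_docs: true,
     feed: 'longpoll',
     heartbeat: true,
     since: lastSeq,
     limit: 10,
     timeout: 1000
   });
   ```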
   
   ## Definition of a "workhorse" node
   Let's define the "workhorse node" as the one node in our cluster that is using a lot of CPU--nearly 100% CPU. Once a node becomes the workhorse node, it stays the workhorse node until it is restarted.
   
   ## Hardware:
   - You should be able to replicate the following tests on any hardware, but each CouchDB node MUST be located on a different server. The nodes can be in the same datacenter/Availability Zone, but they cannot share a server. I believe this is because there needs to be some network hop distance between the nodes before the problem becomes visible.
   - I developed these steps using two AWS t2.nano instances (1 CPU core and 0.5GB memory) running Ubuntu 16. If your hardware is beefier, you may need to wait longer or modify the test scripts to pound the CouchDB cluster harder with more concurrency.
   
   ## changes-resource-leak-setup.js
   ```js
   var Slouch = require('couch-slouch');
   var slouch = new Slouch('http://admin:secret@example.com:5984');
   
   var NUM_DOCS = 100;
   
   var dbName = 'test_db';
   
   var createOrUpdateDocs = function (i) {
     var doc = {
       _id: '' + i,
       foo: 'bar',
       updated_at: (new Date()).toISOString()
     };
   
     console.log('creating/updating doc=', doc);
   
     return slouch.doc.createOrUpdate(dbName, doc).then(function () {
       if (i < NUM_DOCS - 1) {
         return createOrUpdateDocs(i + 1);
       }
     });
   };
   
   var createDatabaseIfMissing = function (dbName) {
     return slouch.db.exists(dbName).then(function (exists) {
       if (!exists) {
         console.log('creating database', dbName);
         return slouch.db.create(dbName);
       }
     });
   };
   
   var createDocs = function () {
     return createOrUpdateDocs(0);
   };
   
   createDatabaseIfMissing('_global_changes').then(function () {
     return createDatabaseIfMissing('test_db');
   }).then(function () {
     return createDocs();
   });
   ```
   
   ## changes-resource-leak-bad-test.js
   ```js
   var Slouch = require('couch-slouch');
   var slouch = new Slouch('http://admin:secret@example.com:5984');
   
   var NUM_DOCS = 100;
   
   var dbName = 'test_db';
   
   var lastSeq = undefined;
   
   var createOrUpdateDocs = function (i) {
     var doc = {
       _id: '' + i,
       foo: 'bar',
       updated_at: (new Date()).toISOString()
     };
   
     // console.log('creating/updating doc=', doc);
   
     return slouch.doc.createOrUpdate(dbName, doc).then(function () {
       if (i < NUM_DOCS - 1) {
         return createOrUpdateDocs(i + 1);
       }
     });
   };
   
   var updateDocs = function () {
     return createOrUpdateDocs(0).then(function () {
       // Repeat
       return updateDocs();
     });
   };
   
   var listenToGlobalChanges = function () {
     var first = false;
   
     var iterator = slouch.db.changes('_global_changes', {
       include_docs: true,
       feed: 'continuous',
       heartbeat: true,
       since: lastSeq,
       limit: 10
     });
   
     return iterator.each(function (change) {
       console.log('change=', change);
       lastSeq = change.seq;
   
       // On the first change, schedule an abort of the continuous feed after
       // roughly one second; listenToGlobalChanges() then starts a new feed.
       if (!first) {
         setTimeout(function () {
           iterator.abort();
         }, 1000);
         first = true;
       }
     }).then(function () {
       // Repeat
       return listenToGlobalChanges();
     });
   };
   
   updateDocs();
   listenToGlobalChanges();
   ```
   
   ## changes-resource-leak-good-test.js
   ```js
   var Slouch = require('couch-slouch');
   var slouch = new Slouch('http://admin:secret@example.com:5984');
   
   var NUM_DOCS = 100;
   
   var dbName = 'test_db';
   
   var lastSeq = undefined;
   
   var createOrUpdateDocs = function (i) {
     var doc = {
       _id: '' + i,
       foo: 'bar',
       updated_at: (new Date()).toISOString()
     };
   
     // console.log('creating/updating doc=', doc);
   
     return slouch.doc.createOrUpdate(dbName, doc).then(function () {
       if (i < NUM_DOCS - 1) {
         return createOrUpdateDocs(i + 1);
       }
     });
   };
   
   var updateDocs = function () {
     return createOrUpdateDocs(0).then(function () {
       // Repeat
       return updateDocs();
     });
   };
   
   var listenToGlobalChanges = function () {
     var iterator = slouch.db.changes('_global_changes', {
       include_docs: true,
       feed: 'longpoll',
       heartbeat: true,
       since: lastSeq,
       limit: 10,
       // With longpoll, the request completes on its own (when changes arrive
       // or the timeout expires), so the client never needs to abort it.
       timeout: 1000
     });
   
     return iterator.each(function (change) {
       console.log('change=', change);
       lastSeq = change.seq;
     }).then(function () {
       // Repeat
       return listenToGlobalChanges();
     });
   };
   
   updateDocs();
   listenToGlobalChanges();
   ```
   
   ## Setup:
   - Download the changes-resource-leak-setup.js, changes-resource-leak-bad-test.js and changes-resource-leak-good-test.js scripts above and modify the CouchDB URLs
   - Create a 2-node cluster with the default local.ini config. We create a 2-node cluster as it makes it easier to see the problem (a small membership-check sketch follows this list). You will also need to set up a load balancer in front of the 2 nodes, e.g. haproxy, nginx, etc. You do not need to create any system databases as the test script will create _global_changes if the _global_changes database doesn't exist
   - Install node (https://nodejs.org/en/download/package-manager/)
   - $ npm install couch-slouch
   - $ node ./changes-resource-leak-setup.js
   - Wait about 100 secs for 100 docs to be created
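
   To sanity-check that both nodes actually joined the cluster before running the tests, you can hit the _membership endpoint. A minimal sketch using Node's built-in http module; the URL and credentials are placeholders, so point it at your load balancer or directly at one of the nodes:

   ```js
   // check-membership.js - prints the nodes CouchDB believes are in the cluster.
   var http = require('http');

   http.get('http://admin:secret@example.com:5984/_membership', function (res) {
     var body = '';
     res.on('data', function (chunk) { body += chunk; });
     res.on('end', function () {
       var membership = JSON.parse(body);
       // cluster_nodes: nodes configured in the cluster; all_nodes: nodes this
       // node can currently see. Both lists should contain both nodes.
       console.log('cluster_nodes:', membership.cluster_nodes);
       console.log('all_nodes:', membership.all_nodes);
     });
   }).on('error', function (err) {
     console.error('membership check failed:', err);
   });
   ```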
   
   ## Test 1 - Create a workhorse node:
   - $ node ./changes-resource-leak-bad-test.js # This will run indefinitely
   - While changes-resource-leak-bad-test.js runs, use top on **both** CouchDB nodes to monitor CPU usage (see the optional stats-polling sketch after this list). After about 10 minutes, one of the nodes will reach almost 100% CPU and stay around 100% CPU usage. We'll refer to this node as the workhorse node. The other node will have some CPU usage, but not nearly as much as the workhorse node.
   - Now stop changes-resource-leak-bad-test.js (with Ctrl+C)
   - Wait for the CPU usage on both nodes to go down to about 0%
   - Now start up ./changes-resource-leak-bad-test.js again
   - Within about 20 seconds the workhorse node should again return to almost 100% CPU usage, while the other node will have significantly less usage. This node will remain the workhorse node until you restart your CouchDB nodes.
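
   If you want a rough view of what is accumulating on the workhorse node beyond what top shows, you can poll each node's _system endpoint (available on recent CouchDB 2.x releases) while changes-resource-leak-bad-test.js runs and watch the Erlang process count climb. This is an optional diagnostic sketch, not part of the original repro; the URL is a placeholder and should point directly at one node, not at the load balancer:

   ```js
   // watch-system-stats.js - periodically logs Erlang VM stats for one node.
   var http = require('http');

   var NODE_URL = 'http://admin:secret@example.com:5984/_node/_local/_system';

   setInterval(function () {
     http.get(NODE_URL, function (res) {
       var body = '';
       res.on('data', function (chunk) { body += chunk; });
       res.on('end', function () {
         var stats = JSON.parse(body);
         // process_count and run_queue growing without bound on one node is a
         // hint that processes are being leaked there.
         console.log(new Date().toISOString(),
           'process_count=', stats.process_count,
           'run_queue=', stats.run_queue);
       });
     }).on('error', function (err) {
       console.error('stats request failed:', err);
     });
   }, 5000);
   ```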
   
   ## Test 2 - Use feed=longpoll instead of feed=continuous
   - To illustrate that the problem is most likely due to a resource leak when aborting continuous listening, we'll run changes-resource-leak-good-test.js, which has roughly the same business logic as changes-resource-leak-bad-test.js, but uses feed=longpoll instead of feed=continuous
   - Make sure changes-resource-leak-bad-test.js is not running
   - IMPORTANT: fully stop both nodes. After *both* nodes have been stopped, start them back up. This is needed to "reset" the DB back to where we want it.
   - $ node ./changes-resource-leak-good-test.js # This will run indefinitely
   - Even after many minutes you'll see that both CouchDB nodes consume only a few percent of CPU each! Yippie!
   
   ## Notes:
   - While changes-resource-leak-bad-test.js is running you can use `netstat` on both the server running changes-resource-leak-bad-test.js and on the CouchDB nodes to see that connections are being cleaned up and not persisting. Therefore, we can assume that changes-resource-leak-bad-test.js is not leaving open connections and that the leak is most likely in CouchDB itself.
   - I thoroughly tested my load balancer to make sure that it was not causing the problem. I even used different load balancers.
   
