You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2018/09/06 16:35:33 UTC

[GitHub] davisp opened a new pull request #1596: Fix couch server race condition

davisp opened a new pull request #1596: Fix couch server race condition
URL: https://github.com/apache/couchdb/pull/1596
 
 
   ## Overview
   
   There are two separate bugs here being fixed. The first was in `couch_server:terminate/2`. If there was an active open_async process when couch_server died it would throw a function clause when it tried to shutdown the database that wasn't yet open. We just filter out any `#entry{}` record to avoid the issue. This a simple bug but it prevented couch_server from properly generating a SASL report that ended up covering up the second more complex race condition.
   
   The second bug has to deal with a race between deletions and opens as well as the state of couch_server's mailbox. There's a very specific set of operations that have to happen in a particular order to trigger this bug:
   
   1. An in-flight open or creation request that will ultimately succeed. The `'$gen_call'` message must be in couch_server's message queue before Step 2 finishes.
   2. A deletion request
   3. A second open or creation request that is enqueued before couch_server processes the `'$gen_call'` message from Step 1.
   4. The couch_db_updater pid's `'EXIT'` signal is delayed until after the open_result from Step 3 is delivered to couch_server
   
   The underlying issue here is that the deletion request clears the `couch_dbs` ets table state without removing the possible corresponding state in couch_server's message queue. As things progress couch_server ends up mistaking the `open_result` message from Step 1 as corresponding to the open_async process spawned in Step 3. Currently this results in the client from Step 3 getting an invalid response from couch_server, and then more importantly, couch_server dies while attempting to process the real `open_result` message because of the state in the `couch_dbs` ets table being incorrect (it would throw a function_clause error in a list comprehension because `#entry.waiters` was undefined).
   
   The fix was just to ensure that the opener pid in the `#entry{}` record matches the pid of the caller. Anything that doesn't match is ignored since the deletion already cleaned up the server state. Although we do kill the couch_db_updater pid for extra security, Though technically its duplicating work that the deletion request already handled (via killing the `open_async` pid which is linked to it).
   
   ## Testing recommendations
   
   The second of three commits adds a failing test case that is fixed by the third commit.
   
   `make eunit`
   
   ## Checklist
   
   - [x] Code is written and works correctly;
   - [x] Changes are covered by tests;
   - [x] No documentation to change
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services