
Posted to notifications@couchdb.apache.org by gi...@git.apache.org on 2017/04/14 22:28:16 UTC

[GitHub] davisp opened a new pull request #476: Couchdb 3376 fix mem3 shards

URL: https://github.com/apache/couchdb/pull/476

## Overview

Two issues with mem3_shards were fixed while I was testing the PSE code.
The first issue, found by Jay Doane, is that a database can have its shards inserted into the cache after it's been deleted. This can happen if a client does a rapid CREATE/DELETE/GET cycle on a database. The fix is to track the update sequence from the changes feed listener and only insert shard maps that come from a client that has read an update_seq at least as recent as the one mem3_shards has seen.
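
To illustrate the update_seq check, here is a minimal Python model of the idea. It is not the Erlang code in this PR: the class `ShardCache` and its methods `on_change` and `maybe_insert` are hypothetical names made up for the sketch, and the real mem3_shards is a separate process, not a plain object.

```python
class ShardCache:
    """Toy model of a shard-map cache guarded by an update sequence.

    Hypothetical stand-in for mem3_shards; names are illustrative only.
    """

    def __init__(self):
        self.update_seq = 0   # latest seq seen by the changes feed listener
        self.shards = {}      # database name -> shard map

    def on_change(self, seq, db_name, deleted):
        # Called for each change the listener sees on the shards database.
        self.update_seq = max(self.update_seq, seq)
        if deleted:
            self.shards.pop(db_name, None)

    def maybe_insert(self, db_name, shard_map, client_seq):
        # Reject shard maps read by a client that is behind the cache.
        if client_seq < self.update_seq:
            return False
        self.shards[db_name] = shard_map
        return True


cache = ShardCache()
cache.on_change(1, "db1", deleted=False)   # CREATE
stale_seq = cache.update_seq               # client reads the shard map here
cache.on_change(2, "db1", deleted=True)    # DELETE arrives before the insert

stale_ok = cache.maybe_insert("db1", ["shard-a"], stale_seq)
cached_after_stale = "db1" in cache.shards
fresh_ok = cache.maybe_insert("db1", ["shard-a"], cache.update_seq)
```

In this model the insert carrying the older sequence is refused, so the deleted database does not reappear in the cache, while a client that has caught up to the cache's sequence is still allowed to insert.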

The second issue, found during heavy benchmarking, is that large shard maps (in the Q>=128 range) can quite easily cause mem3_shards to back up when there's a thundering herd attempting to open the database. There's no coordination among workers trying to add a shard map to the cache, so if a bunch of independent clients all send the shard map at once (say, at the beginning of a benchmark), mem3_shards can get overwhelmed. The fix was twofold. First, rather than send the shard map directly to mem3_shards, we copy it into a spawned writer process, and when/if mem3_shards wants to write it, it tells this writer process to do its business. Second, we create an ets table to track these writer processes. Independent clients can then check whether a shard map is already en route to mem3_shards by using ets:insert_new and canceling their writer if that returns false.
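
The de-duplication step can be modeled roughly as follows. This is a Python sketch, not the PR's Erlang: `WriterRegistry` is a hypothetical stand-in for the ets table, its `insert_new` method only imitates the "insert if the key is absent, otherwise return false" behavior of ets:insert_new, and `submit_shard_map` is an invented helper.

```python
import threading


class WriterRegistry:
    """Hypothetical stand-in for the ets table that tracks writers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._writers = {}

    def insert_new(self, db_name, writer_id):
        # True only for the first writer registered under db_name.
        with self._lock:
            if db_name in self._writers:
                return False
            self._writers[db_name] = writer_id
            return True


def submit_shard_map(registry, db_name, writer_id, sent):
    # Only the client that wins insert_new forwards its writer to the cache;
    # every other client cancels its writer instead of adding to the queue.
    if registry.insert_new(db_name, writer_id):
        sent.append(writer_id)
        return True
    return False


registry = WriterRegistry()
sent = []
threads = [
    threading.Thread(target=submit_shard_map, args=(registry, "db1", i, sent))
    for i in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With 50 clients racing on the same database name, only one writer is recorded in `sent`, which is the effect the ets table is meant to have on the mem3_shards message queue.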

## Testing recommendations

Assuming you have the stack available, you should be able to reproduce the test results fairly easily by creating a Q=256 database and then having 200 or so HTTP workers write random docs to it as fast as possible. The trick to triggering this is to make sure that all 200 workers start at once.
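
One way to get the workers to start together is to hold them at a barrier. The sketch below is an assumption about how such a driver could look, not the harness used for the benchmark: `run_herd` is a hypothetical helper, and `work` is a placeholder for whatever function writes a random doc to the Q=256 database over HTTP.

```python
import threading


def run_herd(n_workers, work):
    """Start n_workers threads that all begin `work` at the same moment."""
    barrier = threading.Barrier(n_workers)
    results = [None] * n_workers

    def worker(i):
        barrier.wait()          # block until every worker is ready
        results[i] = work(i)    # placeholder for the HTTP doc write

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stub workload standing in for the real HTTP request.
results = run_herd(8, lambda i: i * 2)
```

For the scenario above, `n_workers` would be around 200 and `work` would issue the document writes in a loop.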

## JIRA issue number

COUCHDB-3376

## Related Pull Requests

N/A

## Checklist

- [x] Code is written and works correctly;
- [x] Changes are covered by tests;
- [ ] Documentation reflects the changes;

No documentation change since it's not covering a public facing API or behavior other than "stuff don't break now".

For tests, nothing was added or removed. The change is implicitly tested by any test that does clustered operations on a database, but there's no existing test suite specifically for the caching behavior.
