You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2022/09/06 06:55:28 UTC

[GitHub] [couchdb] nickva opened a new pull request, #4163: Statistically skip _revs_diff in the replicator

nickva opened a new pull request, #4163:
URL: https://github.com/apache/couchdb/pull/4163

   [WIP, need to figure out better ratios and constant values]
   
   When the revisions are consistently missing from the target, for
   example when replicating to a new target database, it's wasteful to
   keep calling _revs_diff. Reuse the same algorithm used when skipping
   calls to _bulk_get if there are too many attachments, to statistically
   start skipping call to _revs_diff.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva commented on pull request #4163: Statistically skip _revs_diff in the replicator

Posted by GitBox <gi...@apache.org>.
nickva commented on PR #4163:
URL: https://github.com/apache/couchdb/pull/4163#issuecomment-1239504311

   This is the behavior of the 0.4 decay factor
   
   <img width="792" alt="revs_diff_stats" src="https://user-images.githubusercontent.com/211822/188910166-2e76704d-81d1-48b2-85a2-10050ffcdcf0.png">
   
   It's affected less by recent history than the _bulk_get one. The idea being is we can react a bit quicker to changes if we if the proportion dips below the threshold (0.9) so we can start calling _revs_diff again.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva commented on a diff in pull request #4163: Statistically skip _revs_diff in the replicator

Posted by GitBox <gi...@apache.org>.
nickva commented on code in PR #4163:
URL: https://github.com/apache/couchdb/pull/4163#discussion_r970261981


##########
src/couch_replicator/src/couch_replicator_worker.erl:
##########
@@ -31,6 +31,9 @@
 -define(BULK_GET_RATIO_THRESHOLD, 0.5).
 -define(BULK_GET_RATIO_DECAY, 0.25).
 -define(BULK_GET_RETRY_SEC, 37).
+-define(REVS_DIFF_RATIO_THRESHOLD, 0.95).
+-define(REVS_DIFF_RATIO_DECAY, 0.4).
+-define(REVS_DIFF_RETRY_SEC, 37).

Review Comment:
   Updated with a test and updated the description of the coefficients.
   
   The tests set up two scenarios, one with an empty target, and then check that less than the expected number of revs_diff calls are made (more than half a skipped), the other populates the target with the same docs and we check that the expected number (docs_count / replication_batch_size) are made.
   
   Because the batch size and number of revs_diff is not necessarily deterministic, opted to use thresholds (> half, < max etc as opposed to exact comparisons).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva merged pull request #4163: Statistically skip _revs_diff in the replicator

Posted by GitBox <gi...@apache.org>.
nickva merged PR #4163:
URL: https://github.com/apache/couchdb/pull/4163


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva commented on a diff in pull request #4163: Statistically skip _revs_diff in the replicator

Posted by GitBox <gi...@apache.org>.
nickva commented on code in PR #4163:
URL: https://github.com/apache/couchdb/pull/4163#discussion_r970217477


##########
src/couch_replicator/src/couch_replicator_worker.erl:
##########
@@ -31,6 +31,9 @@
 -define(BULK_GET_RATIO_THRESHOLD, 0.5).
 -define(BULK_GET_RATIO_DECAY, 0.25).
 -define(BULK_GET_RETRY_SEC, 37).
+-define(REVS_DIFF_RATIO_THRESHOLD, 0.95).
+-define(REVS_DIFF_RATIO_DECAY, 0.4).
+-define(REVS_DIFF_RETRY_SEC, 37).

Review Comment:
   > An elegant reuse/generalization of an existing algorithm!
   > 
   > I didn't notice any new tests, and assume that's because there are already lots of tests exercising this code path, and also the basic algorithm is already tested by the _bulk_get tests?
   
   The basic algorithm is test in _bulk_get. But it's a good idea to add some more end-to-end tests and count the number of calls to _revs_diff not unlike we did for _bulk_get disable case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] nickva commented on a diff in pull request #4163: Statistically skip _revs_diff in the replicator

Posted by GitBox <gi...@apache.org>.
nickva commented on code in PR #4163:
URL: https://github.com/apache/couchdb/pull/4163#discussion_r970216702


##########
src/couch_replicator/src/couch_replicator_worker.erl:
##########
@@ -31,6 +31,9 @@
 -define(BULK_GET_RATIO_THRESHOLD, 0.5).
 -define(BULK_GET_RATIO_DECAY, 0.25).
 -define(BULK_GET_RETRY_SEC, 37).
+-define(REVS_DIFF_RATIO_THRESHOLD, 0.95).
+-define(REVS_DIFF_RATIO_DECAY, 0.4).
+-define(REVS_DIFF_RETRY_SEC, 37).

Review Comment:
   Good idea. Add some comments. The types "threshold", "decay" and "retry_sec" are similar between both the revs_diff and bulk_get



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] jaydoane commented on a diff in pull request #4163: Statistically skip _revs_diff in the replicator

Posted by GitBox <gi...@apache.org>.
jaydoane commented on code in PR #4163:
URL: https://github.com/apache/couchdb/pull/4163#discussion_r968829859


##########
src/couch_replicator/src/couch_replicator_worker.erl:
##########
@@ -31,6 +31,9 @@
 -define(BULK_GET_RATIO_THRESHOLD, 0.5).
 -define(BULK_GET_RATIO_DECAY, 0.25).
 -define(BULK_GET_RETRY_SEC, 37).
+-define(REVS_DIFF_RATIO_THRESHOLD, 0.95).
+-define(REVS_DIFF_RATIO_DECAY, 0.4).
+-define(REVS_DIFF_RETRY_SEC, 37).

Review Comment:
   I wonder if it would be helpful to add some comments (and/or a link) here describing why these specific numbers, and how changing them might affect a replication job? Otherwise, they seem a little magical and opaque.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org