
Posted to notifications@couchdb.apache.org by gi...@git.apache.org on 2017/09/27 15:02:10 UTC

[GitHub] nickva commented on issue #810: [stats] replicator scheduler crashed counter not inrcementing

nickva commented on issue #810: [stats] replicator scheduler crashed counter not inrcementing
URL: https://github.com/apache/couchdb/issues/810#issuecomment-332551193

Took a look at this one.

Saw the same behavior. There could be 3 reasons for not noticing the `crashed` gauge being bumped.

1) There is a fairly high `retries_per_request` default value of 10 used to retry failed requests. Requests are tried 10 times with exponentially increasing sleep amounts in between, starting at 0.25 seconds. That means there could be up to about 4 minutes of retrying the same failed request before the job fails and the scheduling replicator reports a crashed status for it. I made a PR to reduce the default number of tries to 5, so there would be up to 8 seconds' worth of retries instead. This makes more sense now that the scheduling replicator is used, as it can better handle reporting and backing off when errors occur. Without that change, the status for the replication job might stay in the running state for quite a while (tens of minutes or hours), since it is wasting time retrying that request.

2) Replications will uniformly pick one node in the cluster to run on, which doesn't have to be the node that processed the document update request. To detect the crashed stats update, one would have to know which node to check for changes. I made this mistake, so I am mentioning it here just in case. Perhaps there is a case in general for aggregating stats across all nodes.

3) Crashed status is only reported after the replication job has crashed and is waiting to run next (possibly being penalized if it crashed too many times in a row). However, as soon as it is given a chance to run again, it gets counted as `running`. While in that state it won't bump the crashed gauge. This was done so that the total number is always equal to running + pending + crashing. The effect of this is that the crashing count will periodically go down for a bit while the job is attempting to run, then when it fails it will be bumped back up. Before the PR above, this could take quite a while, but even with it, it might still take up to 15 seconds (8 seconds' worth of retries + stats updates happen with a delay of 5 seconds).
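
As a rough check of the retry figures in point 1, here is a minimal Python sketch of the arithmetic. It assumes the sleep doubles after every try and that each of the tries is followed by one sleep; the doubling factor and the function name `total_retry_sleep` are illustrative assumptions, not taken from the replicator code.

```python
def total_retry_sleep(tries, first_sleep=0.25, factor=2.0):
    """Total seconds spent sleeping across `tries` attempts.

    Assumption: the sleep starts at `first_sleep` seconds and is
    multiplied by `factor` after each attempt (plain doubling here).
    """
    total = 0.0
    sleep = first_sleep
    for _ in range(tries):
        total += sleep
        sleep *= factor
    return total


if __name__ == "__main__":
    print(total_retry_sleep(10))  # 255.75 seconds, a bit over 4 minutes
    print(total_retry_sleep(5))   # 7.75 seconds, roughly 8 seconds
```

Under these assumptions, 10 tries add up to 255.75 seconds and 5 tries to 7.75 seconds, which matches the "up to 4 minutes" and "up to 8 seconds" estimates above.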
