Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2018/01/16 16:07:40 UTC

[GitHub] nickva commented on issue #1081: Replicator infinite failure loop

URL: https://github.com/apache/couchdb/issues/1081#issuecomment-358012819
 
 
   Hi Avaq,
   
   Thanks for your report.
   
    I noticed that in the test script you specified a heartbeat. In 2.x the replicator doesn't use heartbeats; instead it uses timeouts:
   
   https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L486
   
    Notice that it uses a timeout for the changes feed, and the value of that timeout is 1/3 of the `connection_timeout`. The default connection timeout is 30s, so the timeout for the `_changes` feed ends up being 10s.
   
    Try re-running the test script with a timeout parameter specified instead of a heartbeat; a rough sketch of such a request is below.
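
    This is a minimal sketch, assuming a local CouchDB with made-up credentials, database, and filter names, of asking the `_changes` feed for a `timeout` (in milliseconds) instead of a `heartbeat`, mirroring the replicator's own `connection_timeout / 3 = 30000 / 3 = 10000` ms default:

    ```python
    import requests

    SOURCE = "http://admin:password@127.0.0.1:5984/source-db"  # assumed URL and credentials

    with requests.get(
        f"{SOURCE}/_changes",
        params={
            "feed": "continuous",
            "filter": "ddoc/by_type",  # assumed design doc / filter name
            "since": 0,
            "timeout": 10000,          # ms; note: no heartbeat parameter at all
        },
        stream=True,
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Each non-empty line is a change row; after `timeout` ms with no
            # matching changes the server ends the feed with a last_seq line.
            if line:
                print(line.decode())
    ```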
   
    I tested this just a few days ago while investigating a similar issue in 2.1.x and noticed that the server responds quickly with a `results` line and that periodic newlines are sent, keeping the connection alive. In my case I was looking at a continuous changes feed (because the replication was a continuous one as well). I wonder if there is a difference in behavior between a continuous feed and a normal one with respect to filters; the sketch below compares the two.
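
    As a rough comparison sketch (same assumed server, database, and filter names as above), you could issue the same filtered request once with `feed=normal` and once with `feed=continuous` and see whether one of them stalls until the timeout while the other returns promptly:

    ```python
    import time
    import requests

    SOURCE = "http://admin:password@127.0.0.1:5984/source-db"          # assumed URL
    PARAMS = {"filter": "ddoc/by_type", "since": 0, "timeout": 10000}  # assumed filter

    for feed in ("normal", "continuous"):
        started = time.monotonic()
        resp = requests.get(f"{SOURCE}/_changes", params={**PARAMS, "feed": feed}, timeout=60)
        elapsed = time.monotonic() - started
        print(f"feed={feed}: HTTP {resp.status_code}, {len(resp.content)} bytes after {elapsed:.1f}s")
    ```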
   
   Besides the timeout vs heartbeat, and continuous vs normal, a few more questions to get a better idea of what's happening:
   
    * To double check, is the replication itself running on a 2.x cluster? What are the versions of the targets and source? Are they all 2.x as well?
   
    * Are there any proxies or load balancers involved and do you think they could affect the connections?
   
     * How many replication jobs are running? CouchDB 2.x uses a scheduling replicator with a default maximum number of jobs (`max_jobs`) set to 500. If there are more than 500, some jobs will periodically be stopped and others started. In the case of a filtered replication with a large source db and a restrictive filter, like yours, the replication won't checkpoint until it receives a document update that passes the filter. If that takes too long and the job is swapped out by the scheduler, it won't have had a chance to checkpoint before it is stopped. The next time it starts it will again use 0 as the changes feed start sequence, wait, get no documents, get stopped, and so on. In this case you could, for example, increase `max_jobs` to a number high enough to fit all the replication jobs you have (see the sketch below).
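
    As a rough way to check (assuming an admin URL, and that you are on 2.1+ where the scheduling replicator exposes `_scheduler/jobs`), you could compare the number of scheduled jobs against the configured limit:

    ```python
    import requests

    COUCH = "http://admin:password@127.0.0.1:5984"  # assumed admin URL

    jobs = requests.get(f"{COUCH}/_scheduler/jobs", timeout=30).json()
    print(f"{jobs['total_rows']} replication jobs known to the scheduler")

    # To let all jobs run at once, raise the limit in local.ini, e.g.:
    #
    #   [replicator]
    #   max_jobs = 2000   ; assumed value; pick something >= your job count
    ```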
   
   
   
