Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2023/01/18 10:30:51 UTC

[GitHub] [couchdb] tophe opened a new issue, #4385: replication restart from start after server restart

tophe opened a new issue, #4385:
URL: https://github.com/apache/couchdb/issues/4385

   ## Description
   Hello, since the upgrade to v3.3.1, I see that when a server restarts, all replications restart from scratch.
   I don't know if this is new behaviour, but when you have 10 replications with 20M records, the server becomes useless for many hours: the replication processes use all the server I/O, and the server becomes unresponsive.
   
   ## Steps to Reproduce
   Set up a replication with many records, let it complete, then restart the server.
   
   ## Expected Behaviour
   Replications shouldn't restart from the beginning, but from their latest positions, so that the server is available as soon as it restarts.
   Being able to pause, start, and reset a replication through the API would be a great help.
   
   ## Your Environment
   Official Docker image.
   
   * CouchDB version used: 3.3.1
   * Browser name and version: Firefox
   * Operating system and version: Debian bullseye
   
   ## Additional Context
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [couchdb] tophe commented on issue #4385: replication restart from start after server restart

Posted by GitBox <gi...@apache.org>.
tophe commented on issue #4385:
URL: https://github.com/apache/couchdb/issues/4385#issuecomment-1398211068

   Hi,
   
   Sorry about the local.ini. If you had been able to read it, you would have seen that I hadn't set the uuid, as you guessed.
   I set it on each cluster and restarted. Now everything works as expected.
   
   Thank you for your support.
   I have closed the issue.


[GitHub] [couchdb] nickva commented on issue #4385: replication restart from start after server restart

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #4385:
URL: https://github.com/apache/couchdb/issues/4385#issuecomment-1387497652

   @tophe thanks for your report.
   
   I tried setting up a local replication with the 3.3.1 release and couldn't reproduce the issue.
   
   I created databases `a` and `b`, then a continuous replication `a -> b`. I added 3 docs to `a` and they were replicated to `b`. Within about 10-30 seconds the replication checkpointed.
   
   ```
   [notice] 2023-01-18T17:56:53.418762Z couchdb@127.0.0.1 <0.2488.0> 95114e5b09 localhost:5984 127.0.0.1 adm POST /_replicator 201 ok 63
   [notice] 2023-01-18T17:56:53.434113Z couchdb@127.0.0.1 <0.452.0> -------- couch_replicator_scheduler: Job {"0a62088f656e39570a8b89ff61c55a32","+continuous"} started as <0.3587.0>
   [notice] 2023-01-18T17:56:54.160151Z couchdb@127.0.0.1 <0.3587.0> -------- Starting replication 0a62088f656e39570a8b89ff61c55a32+continuous (http://localhost:5984/a/ -> http://localhost:5984/b/) from doc _replicator:a_b worker_procesess:4 worker_batch_size:500 session_id:3835f379fc390834e23283ac1ee96921
   [notice] 2023-01-18T17:56:54.160194Z couchdb@127.0.0.1 <0.3587.0> -------- Document `a_b` triggered replication `0a62088f656e39570a8b89ff61c55a32+continuous`
   [notice] 2023-01-18T17:57:04.154748Z couchdb@127.0.0.1 <0.3508.0> adf93f2899 localhost:5984 127.0.0.1 adm GET /a/_changes?feed=continuous&style=all_docs&since=0&timeout=10000 200 ok 10002
   [notice] 2023-01-18T17:57:10.375115Z couchdb@127.0.0.1 <0.3848.0> f244ad9969 127.0.0.1:5984 127.0.0.1 adm PUT /a/doc1 201 ok 45
   [notice] 2023-01-18T17:57:10.380111Z couchdb@127.0.0.1 <0.3508.0> b9e856c7f2 localhost:5984 127.0.0.1 adm POST /b/_revs_diff 200 ok 3
   [notice] 2023-01-18T17:57:10.383908Z couchdb@127.0.0.1 <0.3508.0> dbd855b8ef localhost:5984 127.0.0.1 adm POST /a/_bulk_get?latest=true&revs=true&attachments=false 200 ok 3
   [notice] 2023-01-18T17:57:10.429121Z couchdb@127.0.0.1 <0.3508.0> 2d0a333f54 localhost:5984 127.0.0.1 adm POST /b/_bulk_docs 201 ok 45
   [notice] 2023-01-18T17:57:13.487160Z couchdb@127.0.0.1 <0.3934.0> a9b147a109 127.0.0.1:5984 127.0.0.1 adm PUT /a/doc2 201 ok 45
   [notice] 2023-01-18T17:57:13.489662Z couchdb@127.0.0.1 <0.3508.0> 63619c5d6c localhost:5984 127.0.0.1 adm POST /b/_revs_diff 200 ok 1
   [notice] 2023-01-18T17:57:13.491367Z couchdb@127.0.0.1 <0.3508.0> c0cacc79b7 localhost:5984 127.0.0.1 adm POST /a/_bulk_get?latest=true&revs=true&attachments=false 200 ok 1
   [notice] 2023-01-18T17:57:13.535906Z couchdb@127.0.0.1 <0.3508.0> 2d41df816b localhost:5984 127.0.0.1 adm POST /b/_bulk_docs 201 ok 44
   [notice] 2023-01-18T17:57:16.258849Z couchdb@127.0.0.1 <0.3996.0> 2d94c3294d 127.0.0.1:5984 127.0.0.1 adm PUT /a/doc3 201 ok 47
   [notice] 2023-01-18T17:57:16.261490Z couchdb@127.0.0.1 <0.3508.0> e31b4295ae localhost:5984 127.0.0.1 adm POST /b/_revs_diff 200 ok 1
   [notice] 2023-01-18T17:57:16.263704Z couchdb@127.0.0.1 <0.3508.0> ff5130158b localhost:5984 127.0.0.1 adm POST /a/_bulk_get?latest=true&revs=true&attachments=false 200 ok 2
   [notice] 2023-01-18T17:57:16.308776Z couchdb@127.0.0.1 <0.3508.0> cca6245826 localhost:5984 127.0.0.1 adm POST /b/_bulk_docs 201 ok 44
   [notice] 2023-01-18T17:57:24.150590Z couchdb@127.0.0.1 <0.3508.0> 28eb6a2108 localhost:5984 127.0.0.1 adm POST /b/_ensure_full_commit 201 ok 1
   [notice] 2023-01-18T17:57:24.153638Z couchdb@127.0.0.1 <0.4101.0> 058625e2ed localhost:5984 127.0.0.1 adm POST /a/_ensure_full_commit 201 ok 1
   [notice] 2023-01-18T17:57:24.153885Z couchdb@127.0.0.1 <0.3587.0> -------- recording a checkpoint for `http://localhost:5984/a/` -> `http://localhost:5984/b/` at source update_seq <<"3-g1AAAACbeJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpgTmXKBAuwpSZYpKYZG2DTgMSaPBUgyNACp_1DTGMGmGSUbpJqlpWLTlwUAQGgo3g">>
   [notice] 2023-01-18T17:57:24.201998Z couchdb@127.0.0.1 <0.4101.0> 755dff440c localhost:5984 127.0.0.1 adm PUT /a/_local/0a62088f656e39570a8b89ff61c55a32 201 ok 44
   [notice] 2023-01-18T17:57:24.247853Z couchdb@127.0.0.1 <0.4101.0> d74241de62 localhost:5984 127.0.0.1 adm PUT /b/_local/0a62088f656e39570a8b89ff61c55a32 201 ok 45
   ```
   
   Then I stopped the server and restarted it.
   
   ```
   [notice] 2023-01-18T17:59:42.941977Z couchdb@127.0.0.1 <0.463.0> -------- couch_replicator_scheduler: Job {"0a62088f656e39570a8b89ff61c55a32","+continuous"} started as <0.604.0>
   [notice] 2023-01-18T17:59:47.455475Z couchdb@127.0.0.1 <0.621.0> 6ac77267da localhost:5984 127.0.0.1 undefined POST /_session 200 ok 22
   [notice] 2023-01-18T17:59:47.486406Z couchdb@127.0.0.1 <0.621.0> cdc7a9e140 localhost:5984 127.0.0.1 adm GET /a/ 200 ok 30
   [notice] 2023-01-18T17:59:47.487452Z couchdb@127.0.0.1 <0.621.0> 67d6faf300 localhost:5984 127.0.0.1 undefined POST /_session 200 ok 1
   [notice] 2023-01-18T17:59:47.490253Z couchdb@127.0.0.1 <0.621.0> f51f7e3d93 localhost:5984 127.0.0.1 adm GET /b/ 200 ok 2
   [notice] 2023-01-18T17:59:47.491681Z couchdb@127.0.0.1 <0.621.0> 3590309197 localhost:5984 127.0.0.1 adm GET /a/ 200 ok 1
   [notice] 2023-01-18T17:59:47.492940Z couchdb@127.0.0.1 <0.621.0> 4e4b45f4cd localhost:5984 127.0.0.1 adm GET /b/ 200 ok 1
   [notice] 2023-01-18T17:59:47.494291Z couchdb@127.0.0.1 <0.621.0> bdecdebada localhost:5984 127.0.0.1 adm GET /a/_local/0a62088f656e39570a8b89ff61c55a32 200 ok 1
   [notice] 2023-01-18T17:59:47.495478Z couchdb@127.0.0.1 <0.621.0> 6fd838bdff localhost:5984 127.0.0.1 adm GET /b/_local/0a62088f656e39570a8b89ff61c55a32 200 ok 1
   [notice] 2023-01-18T17:59:47.507424Z couchdb@127.0.0.1 <0.622.0> 9347dc859c localhost:5984 127.0.0.1 adm GET /a/ 200 ok 1
   [notice] 2023-01-18T17:59:47.507672Z couchdb@127.0.0.1 <0.604.0> -------- Starting replication 0a62088f656e39570a8b89ff61c55a32+continuous (http://localhost:5984/a/ -> http://localhost:5984/b/) from doc _replicator:a_b worker_procesess:4 worker_batch_size:500 session_id:f3d85f800e7e5346b0ec12cd82fe5e76
   [notice] 2023-01-18T17:59:47.507709Z couchdb@127.0.0.1 <0.604.0> -------- Document `a_b` triggered replication `0a62088f656e39570a8b89ff61c55a32+continuous`
   ...
   [notice] 2023-01-18T17:59:57.501970Z couchdb@127.0.0.1 <0.621.0> 0bc5cc2eeb localhost:5984 127.0.0.1 adm GET /a/_changes?feed=continuous&style=all_docs&since=3-g1AAAACbeJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpgTmXKBAuwpSZYpKYZG2DTgMSaPBUgyNACp_1DTGMGmGSUbpJqlpWLTlwUAQGgo3g&timeout=10000 200 ok 10003
   ```
   
   After the restart, it re-read the `_local/...` checkpoints and resumed the `_changes` feed with `since=3-g1AAAAC...`
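
One way to check for the same behaviour in your own logs is to pull the `since` value out of each `_changes` request: `since=0` right after a restart suggests a rewind, while an opaque sequence value suggests a resume from a checkpoint. The following is a minimal Python sketch, assuming access-log lines shaped like the ones quoted above. The helper names `changes_since` and `rewound` are made up for illustration and are not part of CouchDB.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches the request part of an access-log line such as
# "... adm GET /a/_changes?feed=continuous&since=0&timeout=10000 200 ok 10002"
CHANGES_RE = re.compile(r"\bGET (/\S*/_changes\?\S*) (\d{3})\b")


def changes_since(log_line):
    """Return (db_path, since) for a _changes request line, else None."""
    match = CHANGES_RE.search(log_line)
    if match is None:
        return None
    url = urlsplit(match.group(1))
    # parse_qs returns a list per key; take the first value if present.
    since = parse_qs(url.query).get("since", [None])[0]
    db_path = url.path.rsplit("/_changes", 1)[0]
    return db_path, since


def rewound(log_line):
    """True if the line is a _changes request that starts from sequence 0."""
    parsed = changes_since(log_line)
    return parsed is not None and parsed[1] in (None, "0")


if __name__ == "__main__":
    line = ("[notice] 2023-01-18T17:57:04.154748Z couchdb@127.0.0.1 <0.3508.0> "
            "adf93f2899 localhost:5984 127.0.0.1 adm GET "
            "/a/_changes?feed=continuous&style=all_docs&since=0&timeout=10000 "
            "200 ok 10002")
    print(changes_since(line), rewound(line))
```

Note that a brand-new replication also starts from `since=0`, so a rewind is only suspicious for a replication that had already checkpointed.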
   
   
   


[GitHub] [couchdb] tophe closed issue #4385: replication restart from start after server restart

Posted by GitBox <gi...@apache.org>.
tophe closed issue #4385: replication restart from start after server restart
URL: https://github.com/apache/couchdb/issues/4385


[GitHub] [couchdb] tophe commented on issue #4385: replication restart from start after server restart

Posted by GitBox <gi...@apache.org>.
tophe commented on issue #4385:
URL: https://github.com/apache/couchdb/issues/4385#issuecomment-1397041792

   I have just found the problem: it is Docker dependent.
   When you do a docker stop/start, everything is OK with replication.
   
   But when you update the container by running docker run with a new container, the replication restarts from the beginning. (I had made a change in local.ini.)
   
   [start_log.txt](https://github.com/apache/couchdb/files/10457293/start_log.txt)
   [checkpoint_logs.txt](https://github.com/apache/couchdb/files/10457294/checkpoint_logs.txt)
   I have attached some logs to the post. In checkpoint_logs.txt, you can see the last checkpoint for replication lotimages_new
   at g1AAAAMLeJzLYWBg4MhgTmGQSs4vTc5ISXIoLskvStVLLCopKMpMTtVLyszJAaphSmRIkv
   In the start log, you can see that changes are requested from g1AAAACReJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX
   
   In Futon, I can see that the replication restarts from the beginning.
   
   Are the replication checkpoints stored in /opt/couchdb/data, which is a Docker bind-mounted volume, or are they stored in the container? And if so, where are they stored?
   
   I have attached some Docker config in order to reproduce my setup.
   [dock.zip](https://github.com/apache/couchdb/files/10457471/dock.zip)
   
   You can download all the files and then run docker-compose build, docker-compose up to start the couchdb container.
   Set up a replication with some data.
   After checkpointing, change local.ini, then run docker-compose build, docker-compose up to redeploy a new container.
   Then compare the starting point with the checkpoint.


[GitHub] [couchdb] nickva commented on issue #4385: replication restart from start after server restart

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #4385:
URL: https://github.com/apache/couchdb/issues/4385#issuecomment-1397565505

   In dock.zip I noticed local.ini wasn't a text file but some kind of binary.
   
   ```
    % cat local.ini
   �E�c'K�cnO�cuxUT
   ```
   
   One thing to pay attention to is whether the replication ID changes. Based on your start_log, I see that the checkpoint was *not* found for the catalogues endpoint:
   
   ```
   [notice] 2023-01-19T13:30:16.056403Z nonode@nohost <0.1150.0> 34af135719 couch1:5984 192.168.100.227 admin GET /catalogues/ 200 ok 2
   [notice] 2023-01-19T13:30:16.131009Z nonode@nohost <0.1150.0> fb7b32c9cd couch1:5984 192.168.100.227 admin GET /catalogues/_local/475a01ff4762aae18390232479a85acd 404 ok 16
   [notice] 2023-01-19T13:30:16.133231Z nonode@nohost <0.1150.0> ca52dcf06f couch1:5984 192.168.100.227 admin GET /catalogues/_local/21d33a0db3bd438bc2fe58ef3e64a1a7 404 ok 2
   [notice] 2023-01-19T13:30:16.171877Z nonode@nohost <0.1150.0> 9adcacce61 couch1:5984 192.168.100.227 admin GET /catalogues/_local/9dea40c7f506f19799311d927b321449 404 ok 38
   [notice] 2023-01-19T13:30:16.173819Z nonode@nohost <0.1150.0> 0ab11d84ae couch1:5984 192.168.100.227 admin GET /catalogues/_local/294224beca58e21bde9e0a5676df7d05 404 ok 1
   ```
   
   Notice the 404 on the `_local/$replicationid` docs.
   
   Not sure about lotimages_new as the logs start after "Starting replication..." already. 
   
   So what may be happening is that your replication IDs change inadvertently when you update the Docker configs. If replication IDs change, previous checkpoints won't be found and replication will rewind to 0. Now, if your source, target, and other replication parameters stay the same, it's most likely the server uuid `[couchdb] uuid = ...` value that's not consistent. The setting is described [here](https://docs.couchdb.org/en/stable/config/couchdb.html#couchdb/uuid).
   
   Here is the description of the replication ID generation algorithm: https://docs.couchdb.org/en/stable/replication/protocol.html#generate-replication-id
   
   If the uuid is not specified, a random value will be generated, and that would cause your replication IDs to change every time you spin up new Docker containers, unless you persist your config or explicitly set `[couchdb] uuid ...`.
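
To see why a changing uuid leads to a rewind, it can help to think of the replication ID as a hash over the server uuid plus the replication parameters. The Python sketch below is a deliberately simplified toy and is not CouchDB's actual algorithm (the protocol documentation linked above describes that). The function name `toy_replication_id`, the choice of SHA-256 over JSON, and the uuid values are all assumptions made for illustration.

```python
import hashlib
import json


def toy_replication_id(server_uuid, source, target):
    """Toy stand-in for a replication ID: a hash of the uuid and endpoints.

    NOT CouchDB's real algorithm; it only illustrates that any change in
    the inputs, including the server uuid, produces a different ID.
    """
    payload = json.dumps([server_uuid, source, target]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]


if __name__ == "__main__":
    src = "http://localhost:5984/a/"
    tgt = "http://localhost:5984/b/"
    # Same uuid before and after a restart: the ID, and so the name of the
    # _local checkpoint doc, stays the same.
    print(toy_replication_id("my-cluster", src, tgt))
    print(toy_replication_id("my-cluster", src, tgt))
    # A freshly generated uuid in a new container: a different ID, so the
    # old checkpoint doc is never looked up.
    print(toy_replication_id("random-uuid-after-redeploy", src, tgt))
```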
   
   That value doesn't have to be a proper UUID. You could use a hostname or some other identifier that uniquely identifies the same "cluster". In addition, make sure it's set to the same value on all the nodes in the cluster. If you have 3 nodes (couch1, couch2, couch3), ensure the uuid is the same on all of them.
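
Before redeploying, one could sanity-check that every node's config actually carries the same explicit `[couchdb] uuid`. The following is a rough Python sketch using the standard-library `configparser`, assuming the ini text of each node has already been read into a string. The helper names `couchdb_uuid` and `uuids_consistent` and the value `my-cluster` are hypothetical, and `configparser` is only an approximation of how CouchDB itself parses ini files.

```python
import configparser


def couchdb_uuid(ini_text):
    """Return the [couchdb] uuid value from ini text, or None if unset or empty."""
    parser = configparser.ConfigParser(
        interpolation=None,    # treat '%' in values literally
        strict=False,          # tolerate duplicate sections or keys
        allow_no_value=True,   # tolerate keys written without '='
    )
    parser.read_string(ini_text)
    value = parser.get("couchdb", "uuid", fallback=None)
    if value is None or not value.strip():
        return None
    return value.strip()


def uuids_consistent(ini_by_node):
    """True if every node sets a uuid and all nodes agree on its value."""
    values = {couchdb_uuid(text) for text in ini_by_node.values()}
    return None not in values and len(values) == 1


if __name__ == "__main__":
    good = "[couchdb]\nuuid = my-cluster\n"
    unset = "[couchdb]\n; uuid not set, a random one would be generated\n"
    print(uuids_consistent({"couch1": good, "couch2": good, "couch3": good}))
    print(uuids_consistent({"couch1": good, "couch2": unset, "couch3": good}))
```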
   
   Checkpoints are persisted on both the source and target endpoints (databases) in `_local/$base_replication_id` docs. A replication will only resume from a checkpoint if it can find the checkpoint on *both* source and target. So in your logs you could monitor for those 404s and check that the lookups eventually find one (a 200 response). A few 404s are usually expected, as we try to load older versions of the replication ID, since the algorithm to generate it has evolved at least 4 times. Then monitor whether the replication ID values stay the same or change.
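
The monitoring suggested above can be roughly automated by scanning the access log for `GET /<db>/_local/<id>` requests and flagging databases where no lookup returned 200. This is a minimal Python sketch that assumes log lines in the format quoted in this thread; the helper names are made up for illustration. A log from one node only shows the lookups that reached that node, so source and target may need to be checked separately.

```python
import re
from collections import defaultdict

# Matches e.g. "GET /catalogues/_local/475a01ff4762aae18390232479a85acd 404"
LOCAL_GET_RE = re.compile(r"\bGET (/\S+?)/_local/([0-9a-f]+) (\d{3})\b")


def checkpoint_lookups(log_lines):
    """Map each database path to a list of (replication_id, http_status)."""
    found = defaultdict(list)
    for line in log_lines:
        match = LOCAL_GET_RE.search(line)
        if match:
            db_path, rep_id, status = match.groups()
            found[db_path].append((rep_id, int(status)))
    return dict(found)


def databases_without_checkpoint(log_lines):
    """Databases where none of the _local/<id> lookups returned 200."""
    lookups = checkpoint_lookups(log_lines)
    return sorted(
        db_path
        for db_path, hits in lookups.items()
        if all(status != 200 for _, status in hits)
    )
```

A database listed by `databases_without_checkpoint` is a candidate for a replication that rewound to 0. A database with a mix of 404s followed by a 200 matches the expected pattern of older ID versions being tried first.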
   


[GitHub] [couchdb] tophe commented on issue #4385: replication restart from start after server restart

Posted by GitBox <gi...@apache.org>.
tophe commented on issue #4385:
URL: https://github.com/apache/couchdb/issues/4385#issuecomment-1396964894

   Hello,
   
   I have done many tests, stopping and starting all the replication nodes in different orders, and I can't reproduce the problem; everything seems to be working fine. After a restart, replications start from their last checkpoint, start quickly, and run quite fast in 3.3.1.
   
   Perhaps this was caused by the upgrade from 3.2.1 to 3.3.1, which restarts replications from the beginning during the upgrade process? I saw that on many nodes.
   
   Thanks for your help (and for that great piece of code). I am closing the issue.
   


[GitHub] [couchdb] nickva commented on issue #4385: replication restart from start after server restart

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #4385:
URL: https://github.com/apache/couchdb/issues/4385#issuecomment-1387498995

   @tophe would you be able to share more details on how to reproduce the issue, or show some sanitized logs? Check whether there are any errors or exceptions in the logs that may prevent checkpointing.


[GitHub] [couchdb] tophe closed issue #4385: replication restart from start after server restart

Posted by GitBox <gi...@apache.org>.
tophe closed issue #4385: replication restart from start after server restart
URL: https://github.com/apache/couchdb/issues/4385

