Posted to notifications@couchdb.apache.org by "jdai1 (via GitHub)" <gi...@apache.org> on 2023/06/12 15:39:07 UTC

[GitHub] [couchdb] jdai1 opened a new issue, #4639: Ensure full commit timing out on replication of large database

jdai1 opened a new issue, #4639:
URL: https://github.com/apache/couchdb/issues/4639

   
   ## Description
   I am trying to replicate the npm registry through CouchDB. However, I keep getting the following error:
   
   ```
   {"error":"checkpoint_commit_failure","reason":"Failure on source commit: {'EXIT',{http_request_failed,\"POST\",\n                             \"https://skimdb.npmjs.com/registry/_ensure_full_commit\",\n                             {error,{error,req_timedout}}}}"}
   ```
   
   
   ## Steps to Reproduce
   
   I am using version 3.3.2. When I post a request to the `_ensure_full_commit` endpoint, it returns something that looks like this:
   
   ```
   {"ok":true,"instance_start_time":"1686442464"}
   ```
   
   However, based on the documentation, it should be returning an instance start time of 0.
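   
   A minimal curl reproducer for both calls (the local database name and admin credentials are placeholders):
   
   ```
   # POSTing _ensure_full_commit to the remote source from the error above
   # may hang until the replicator's connection timeout (req_timedout)
   curl -X POST -H "Content-Type: application/json" \
        https://skimdb.npmjs.com/registry/_ensure_full_commit
   
   # against the local 3.3.2 instance it answers immediately
   curl -X POST -H "Content-Type: application/json" \
        http://admin:password@127.0.0.1:5984/registry/_ensure_full_commit
   # => {"ok":true,"instance_start_time":"1686442464"}
   ```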
   
   
   ## Expected Behaviour
   
   Replication of the large source database completes without the `_ensure_full_commit` request timing out.
   
   ## Your Environment
   
   
   {"couchdb":"Welcome","version":"3.3.2","git_sha":"11a234070","uuid":"358bddb8b0671d6dc8774db0a626edef","features":["access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
   
   * CouchDB version used: 3.3.2
   
   




[GitHub] [couchdb] nickva commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "nickva (via GitHub)" <gi...@apache.org>.
nickva commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1593384045

   That error means a worker exhausted its number of retries and crashed.
   
   It looks like one of the endpoints, most likely the source (sorry, I am not very familiar with npm's setup), is probably rate limiting its API. Instead of using rather aggressive settings like 20 workers and 40 concurrent HTTP connections, try reduced ones: 1 worker, a batch size of 100, and 5 HTTP connections, or something like that.
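   
   A minimal sketch of such a scaled-down job in httpie style (the document id "npm", the target name, and $DB as the local base URL are placeholders; the source URL is the one from this thread):
   
   ```
   http put $DB/_replicator/npm \
       source=https://replicate.npmjs.com/ \
       target=$DB/npm \
       create_target:='true' \
       worker_processes:='1' \
       worker_batch_size:='100' \
       http_connections:='5'
   ```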
   




[GitHub] [couchdb] jdai1 commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "jdai1 (via GitHub)" <gi...@apache.org>.
jdai1 commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1592276755

   Thanks so much Nick! That is a big help, once again. It turns out the problem was the `use_checkpoints` setting. I am trying to clone the npm registry from https://replicate.npmjs.com/, which is a massive database, so I'm running into quite a few bugs.
   
   [Screenshot: current replication configuration]
   
   This is my current config for replication, which does not give me any checkpoint failures (I guess the CouchDB for the npm registry was blocking any checkpoints). However, now the replication crashes with a new error!
   ```
   {"error":"error","reason":"{worker_died,<0.5879.61>,{process_died,<0.7372.75>,kaboom}}"}
   ```
   Since I disabled checkpointing, each time the replication task fails it restarts from the beginning, so progress is super slow. I did see this thread (https://github.com/apache/couchdb/issues/745), and I was wondering if you have found a workaround for this bug.
   My database currently looks like this:
   ```
   "doc_del_count":165197,"doc_count":1565
   ```
   and the source database looks like this:
   ```
   "doc_count":3279735,"doc_del_count":671697
   ```
   So there is still a ways to go! What do you think would be best?




[GitHub] [couchdb] nickva commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "nickva (via GitHub)" <gi...@apache.org>.
nickva commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1591810915

   @jdai1 no problem at all, thanks for reaching out. 
   
   The `max_backoff` error indicates that some replicator API request was throttled too much and exceeded the max retry interval, at which point the job crashes and restarts.
   
   Those errors are usually emitted when a [429 error code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429) is returned from an HTTP API endpoint. Replication jobs retry individual requests with an exponential backoff, and in the case of 429 errors they use the [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) (additive increase, multiplicative decrease) algorithm, similar to TCP, to try to match the throttling rate. But there is a limit to those failures as well: if the requests keep getting throttled, the whole job crashes and is restarted with a simpler configuration. It will apply these settings to those jobs:
   
   ```
    [
        {checkpoint_interval, 5000},
        {worker_processes, 2},
        {worker_batch_size, 100},
        {http_connections, 5}
    ]
   ```
   
   and re-schedule the job to run again.
   
   Even if the replication jobs keep crashing (they are in the `crashing` state), they should still restart from time to time, trying as hard as they can to make some progress.
   
   So, to summarize, individual replication API requests are retried on failure, and then the whole replication job is continuously restarted as well. See the advanced replication [guide](https://docs.couchdb.org/en/stable/replication/replicator.html#replication-states) for more info about the various job states.
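   
   For reference, the job states can be inspected via the scheduler endpoints; a quick sketch against a local node (admin credentials are placeholders):
   
   ```
   # per-document replication states (initializing, running, pending, crashing, ...)
   curl http://admin:password@127.0.0.1:5984/_scheduler/docs
   # details for currently scheduled jobs, including recent crash history
   curl http://admin:password@127.0.0.1:5984/_scheduler/jobs
   ```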
   
   The suggestion, then, is to use settings closer to those lower-capacity ones when replicating to/from those endpoints (not sure which of your endpoints is throttled: source, target, or both).
   




[GitHub] [couchdb] jdai1 commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "jdai1 (via GitHub)" <gi...@apache.org>.
jdai1 commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1587750283

   Also, on my `_active_tasks` page, it still says I have many `bulk_get_attempts`, even though `use_bulk_get` is set to `false` in my configuration file.
   
   <img width="569" alt="Screenshot 2023-06-12 at 1 20 43 PM" src="https://github.com/apache/couchdb/assets/44037521/9b8607d6-744e-4626-9723-13a767f0a1c2">
   <img width="853" alt="Screenshot 2023-06-12 at 1 21 33 PM" src="https://github.com/apache/couchdb/assets/44037521/624d1ec6-f5c0-47a7-bb0b-6e47119b4906">
   
   




[GitHub] [couchdb] jdai1 commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "jdai1 (via GitHub)" <gi...@apache.org>.
jdai1 commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1587697490

   Hi Nick! Thank you so much for your help. I believe the source database of the replication is an older version of CouchDB. I will definitely try increasing the connection timeout. Another quick question: if I change the replication config in the local.ini file, will the config be applied if I send a new replication HTTP request, even if the new replication task is a continuation of the existing replication?




[GitHub] [couchdb] nickva commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "nickva (via GitHub)" <gi...@apache.org>.
nickva commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1587815799

   Not sure about [use_bulk_get](https://docs.couchdb.org/en/stable/config/replicator.html#replicator/use_bulk_get). It should not attempt to use the `_bulk_get` endpoint if it's set to `false`. Would you be able to create a smaller reproducer script?
   
   I tried it locally and saw no bulk_get attempts made with:
   
   ```
    # httpie against a local dev cluster; $DB is the base URL
    http put $DB/a
   http put $DB/a/doc1 a=b
   http put $DB/a/doc2 a=b
   http put $DB/a/doc3 a=b
   http put $DB/_replicator/ab source=$DB/a target=$DB/c continuous:='true' use_bulk_get:='false' create_target:='true'
   
   http $DB/_scheduler/jobs
   {
       "jobs": [
           {
               "database": "_replicator",
               "doc_id": "ab",
               "history": [
                   {
                       "timestamp": "2023-06-12T17:58:16Z",
                       "type": "started"
                   },
                   {
                       "timestamp": "2023-06-12T17:58:16Z",
                       "type": "added"
                   }
               ],
               "id": "3738529f4ee2354407f06fc91e383d4a+continuous+create_target",
               "info": {
                   "bulk_get_attempts": 0,
                   "bulk_get_docs": 0,
                   "changes_pending": null,
                   "checkpointed_source_seq": "0",
                   "doc_write_failures": 0,
                   "docs_read": 3,
                   "docs_written": 3,
                   "missing_revisions_found": 3,
                   "revisions_checked": 3,
                   "source_seq": "3-g1AAAACTeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ1L9____szKYE5lygQLsqWYGaZaJRpjKcRqRxwIkGRqA1H-oSYxgk8wMzVJNEs0wdWUBAFZWJEc",
                   "through_seq": "3-g1AAAACTeJzLYWBgYMpgTmHgz8tPSTV0MDQy1zMAQsMckEQiQ1L9____szKYE5lygQLsqWYGaZaJRpjKcRqRxwIkGRqA1H-oSYxgk8wMzVJNEs0wdWUBAFZWJEc"
               },
               "node": "node1@127.0.0.1",
               "pid": "<0.2518.0>",
               "source": "http://127.0.0.1:15984/a/",
               "start_time": "2023-06-12T17:58:16Z",
               "target": "http://127.0.0.1:15984/c/",
               "user": null
           }
       ],
       "offset": 0,
       "total_rows": 1
   }
   ```




[GitHub] [couchdb] nickva commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "nickva (via GitHub)" <gi...@apache.org>.
nickva commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1587718519

   If you change the config, a running replication will not pick up the new value. You'd have to restart the replication job (delete the document and re-create it). The configuration may also be applied per replication job, or cluster-wide in the *.ini file.
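   
   For illustration, a sketch of both forms (the document id, source, and target are placeholders; the option names are the documented replicator settings):
   
   ```
   # per-job: options go straight into the replication document
   http put $DB/_replicator/myrepl \
       source=https://example.com/source target=$DB/target \
       connection_timeout:='60000' worker_processes:='1'
   
   # cluster-wide default, in local.ini (applies to newly started jobs):
   #   [replicator]
   #   connection_timeout = 60000
   ```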
   




[GitHub] [couchdb] jdai1 commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "jdai1 (via GitHub)" <gi...@apache.org>.
jdai1 commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1591787968

   Sorry to keep bothering you, but would you happen to know why I also get these errors?
   ```
   {"error":"error","reason":"max_backoff"}
   ```




[GitHub] [couchdb] smulikHakipod commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "smulikHakipod (via GitHub)" <gi...@apache.org>.
smulikHakipod commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1674052322

   Did you guys find any solution? I am trying to clone replicate.npmjs.com and hit the same error.
   I tried using @nickva's parameters without much success :(




[GitHub] [couchdb] nickva commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "nickva (via GitHub)" <gi...@apache.org>.
nickva commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1842199638

   @smulikHakipod the issue is that the source doesn't allow `_ensure_full_commit`, so you'd have to try replicating with the `use_checkpoints = false` setting.
   
   
   @jdai1 The source is probably timing out or throttling your replication job. I'd suggest trying much lower settings, like 1 worker and a batch size of only 100 or so, with a higher retries_per_request and a high timeout.
   
   There is a hidden (undocumented) `"since_seq": "...-......"` replication job setting that allows starting the replication changes feed from a particular sequence (instead of 0). Once you've replicated far enough, save the update sequence, then start using that as the `"since_seq"` in the next replication job, and so on.
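   
   A sketch of that workflow in httpie style (the document id, target, and $SAVED_SEQ are placeholders; $SAVED_SEQ would hold the update sequence recorded from a previous run, e.g. via _scheduler/jobs):
   
   ```
   SAVED_SEQ='<update sequence recorded from the previous run>'
   http put $DB/_replicator/npm-resume \
       source=https://replicate.npmjs.com/ target=$DB/npm \
       use_checkpoints:='false' since_seq="$SAVED_SEQ"
   ```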




[GitHub] [couchdb] jdai1 commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "jdai1 (via GitHub)" <gi...@apache.org>.
jdai1 commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1587757088

   Oh and finally, do you have any suggestions as to how I can replicate from the outdated CouchDB instance when faced with the request timeout on the `_ensure_full_commit` endpoint? Thank you so much!




[GitHub] [couchdb] nickva commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "nickva (via GitHub)" <gi...@apache.org>.
nickva commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1587824167

   > Oh and finally, do you have any suggestions as to how I can replicate from the outdated CouchDB instance when faced with the request timeout on the `_ensure_full_commit` endpoint? Thank you so much!
   
   You can try disabling bulk_get usage, increasing the timeout, and reducing the number of workers a bit. 200 concurrent workers may be too much for the endpoint to handle; try reducing their number to, say, 10 and see what it does. Try reducing your batch size a bit as well.
   
   Replicating in general implies being able to update the `_local` checkpoint document on the source endpoint. If the source endpoint doesn't allow that, depending on how that's implemented, it may manifest as a timeout (if they just close the connection and don't respond). In that case you can try disabling checkpointing altogether with https://docs.couchdb.org/en/stable/config/replicator.html#replicator/use_checkpoints. As the option implies, this disables checkpointing, so on failure the job will re-scan all the docs again the next time it tries to run.
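   
   For example, via the config API (the node name `_local`, admin credentials, and the cluster-wide scope are assumptions; the option can equally be set per replication document):
   
   ```
   curl -X PUT http://admin:password@127.0.0.1:5984/_node/_local/_config/replicator/use_checkpoints \
        -d '"false"'
   ```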




[GitHub] [couchdb] nickva commented on issue #4639: Ensure full commit timing out on replication of large database

Posted by "nickva (via GitHub)" <gi...@apache.org>.
nickva commented on issue #4639:
URL: https://github.com/apache/couchdb/issues/4639#issuecomment-1587671324

   `{error,req_timedout}` indicates a timeout, so try increasing the replication connection timeout:
   
   https://docs.couchdb.org/en/stable/config/replicator.html#replicator/connection_timeout
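   
   For example, via the config API (admin credentials and the chosen value are placeholders; the value is in milliseconds):
   
   ```
   # raise the replicator connection timeout from the 30000 ms default to 5 minutes
   curl -X PUT http://admin:password@127.0.0.1:5984/_node/_local/_config/replicator/connection_timeout \
        -d '"300000"'
   ```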
   
   The `instance_start_time` change, returning a db creation timestamp instead of `0`, happened in 3.3. We did mention it in the [release notes](https://docs.couchdb.org/en/stable/whatsnew/3.3.html#bugfixes) but did not update the documentation: https://github.com/apache/couchdb/issues/3901
   
   In all of the 3.x versions, `_ensure_full_commit` doesn't actually do anything; it's a compatibility no-op. All document updates are committed during the doc update operation itself. The only thing the API endpoint returns is the database instance start time value. So it is a bit strange that `_ensure_full_commit` is timing out, unless the endpoint is not a 3.x Apache CouchDB but an older version or another couch implementation.

