Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2022/10/12 11:03:55 UTC

[GitHub] [couchdb] arahjan opened a new issue, #4204: Replication crashes just on single database from many

arahjan opened a new issue, #4204:
URL: https://github.com/apache/couchdb/issues/4204

   CouchDB 2.3.1 on CentOS 7
   
   I'm replicating more than 20 databases from one server to another. The process works flawlessly apart from a single database.
   The database in question consists of more than 100k small documents; its info looks like this:
   
   ```json
   {
     "db_name": "test",
     "purge_seq": "0-g1AAAAFTeJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMeC5BkeACk_gNBViIDHrVJCUAyqZ6gOoiZCyBm7idG7QGI2vsE7FcA2W9P0P5EhiR5wp5xABkWT6RnGiAOnA9UmwUAtixejg",
     "update_seq": "100785-g1AAAAFreJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____s5IYGAyr8KhLUgCSSfYwpdX4lDqAlMZDlRrswqc0AaS0HmaqPx6leSxAkqEBSAFVzwcr_0FQ-QKI8v1gh_wlqPwARPl9sOnsBJU_gCiHeHN7FgAkvmTM",
     "sizes": {
       "file": 82658990,
       "external": 77019483,
       "active": 82112519
     },
     "other": {
       "data_size": 77019483
     },
     "doc_del_count": 4,
     "doc_count": 100766,
     "disk_size": 82658990,
     "disk_format_version": 7,
     "data_size": 82112519,
     "compact_running": false,
     "cluster": {
       "q": 8,
       "n": 1,
       "w": 1,
       "r": 1
     },
     "instance_start_time": "0"
   }
   ```
   
   After replicating about 79k docs, the replication crashes with output like the following:
   
   ```
   [error] 2022-10-12T10:43:26.563124Z couchdb@127.0.0.1 <0.30702.7> -------- CRASH REPORT Process  (<0.30702.7>) with 5 neighbors exited with reason: {worker_died,<0.30700.7>,{process_died,<0.3280.8>,{{nocatch,missing_doc},[{couch_replicator_api_wrap,open_doc_revs,6,[{file,"src/couch_replicator_api_wrap.erl"},{line,302}]},{couch_replicator_worker,'-spawn_doc_reader/3-fun-1-',4,[{file,"src/couch_replicator_worker.erl"},{line,323}]}]}}} at gen_server:terminate/7(line:812) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_replicator_worker,init,['Argument__1']}, ancestors: [<0.30607.7>,couch_replicator_scheduler_sup,couch_replicator_sup,...], messages: [], links: [<0.3401.8>,<0.3503.8>,<0.3700.8>,<0.3413.8>,<0.30703.7>], dictionary: [{last_stats_report,{1665,571404,580133}}], trap_exit: true, status: running, heap_size: 6772, stack_size: 27, reductions: 77732
   ```
   
   When I copied this database manually to the second server, the problem went away: I can add documents on the main server and they get replicated to the second one.
   
   What could be the culprit of this issue?
   
   




[GitHub] [couchdb] nickva commented on issue #4204: Replication crashes just on single database from many

nickva commented on issue #4204:
URL: https://github.com/apache/couchdb/issues/4204#issuecomment-1281103810

   Try setting `checkpoint_interval` to 5000 (5 seconds), down from the 30-second default, so that if the replication job crashes it doesn't have to backtrack too far. Is there anything different about that particular document compared to the other documents? Does it have more conflicts, or is it updated more often while the replication is happening, perhaps?
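   
   As a hedged sketch, the interval can be set per replication job by adding a `checkpoint_interval` field to the replication document (assuming your version accepts the document-level option; hosts, credentials and paths below are placeholders modeled on the log output):
   
   ```python
   import requests
   
   # Create a continuous replication with a 5-second checkpoint interval.
   # All hosts, credentials and database paths here are placeholders.
   requests.post(
       "http://admin:password@127.0.0.1:5984/_replicator",
       json={
           "_id": "test-replication",
           "source": "http://admin:password@x.x.x.x/cdb2/test/",
           "target": "http://admin:password@x.x.x.y/cdb2/test/",
           "continuous": True,
           "checkpoint_interval": 5000,  # milliseconds; the default is 30000
       },
   )
   ```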
   
   Another thing to try is to check whether this happens on the latest release, 3.2.2. Specifically, check whether it still occurs when both the source and the instance running the replication jobs are upgraded.
   
   
   




[GitHub] [couchdb] arahjan commented on issue #4204: Replication crashes just on single database from many

arahjan commented on issue #4204:
URL: https://github.com/apache/couchdb/issues/4204#issuecomment-1280780947

   After some drilling down and increasing the logging level, I managed to find some errors related to missing revisions:
   
   ```
   [error] 2022-10-17T11:54:12.491238Z couchdb@127.0.0.1 <0.31178.16> -------- Retrying fetch and update of document `ABC` as it is unexpectedly missing. **Missing revisions** are: 9-6ab086bc66baa1fffe312b90654d90e5
   
   [debug] 2022-10-17T11:54:32.173657Z couchdb@127.0.0.1 <0.217.0> -------- New task status for <0.20065.16>: [{changes_pending,null},{checkpoint_interval,30000},{checkpointed_source_seq,0},{continuous,true},{database,<<"shards/40000000-5fffffff/_replicator.1665052660">>},{doc_id,<<"ngraph">>},{doc_write_failures,0},{docs_read,92279},{docs_written,92279},{**missing_revisions_found**,92279},{replication_id,<<"58c7cfcc0f22e9d73693b78ec745e04d+continuous">>},{revisions_checked,406477},{source,<<"http://admin:*****@x.x.x.x/cdb2/test/">>},{source_seq,<<"79173-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxgY1HbkAsXYk41NzUwtE7HpwWNSkgKQTLJHGDYJbJhZkmWqqWUyqYY5gAyLRxi2HGyYcbKpUYoByS5LABlWjzCsGmxYooV5oomJKYmG5bEASYYGIAU0bz7UwHawgYaGxoYGluZkGbgAYuB-qIH7wQYaGJlYpqUkkWXgAYiB96EGHgIbmGpqaJqYaEmWgQ8gBsLC8CLEQAMzCwszC2xaswBFLKSv">>},{started_on,1666007643},{target,<<"http://admin:*****@x.x.x.y/cdb2/test/">>},{through_seq,<<"78031-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxgY1HRygWLsycamZqaWidj04DEpSQFIJtkjDGMDG2aWZJlqaplMqmEOIMPiEYYJgQ0zTjY1SjEg2WUJIMPq4YapfgcblmhhnmhiYkqiYXksQJKhAUgBzZsPNfA32EBDQ2NDA0tzsgxcADFwP9S7-mADDYxMLNNSksgy8ADEwPsoBqaaGpomJlqSZcADiIGwCLGGGGhgZmFhZoFNaxYAUe6iNw">>},{type,replication},{updated_on,1666007672},{user,null}]
   ```
   
   I checked the document this error refers to, and revision 9-6ab086bc66baa1fffe312b90654d90e5 does exist on the document on the replication source.
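   
   For reference, a revision's presence on the source can be checked directly, the same way the replicator fetches it (a minimal sketch; the host and credentials are placeholders, the doc ID and revision are the ones from the log):
   
   ```python
   import requests
   
   # Ask the source for exactly the revision the replicator reported missing.
   resp = requests.get(
       "http://admin:password@x.x.x.x:5984/test/ABC",
       params={
           "revs": "true",
           "latest": "true",
           "open_revs": '["9-6ab086bc66baa1fffe312b90654d90e5"]',
       },
       headers={"Accept": "application/json"},
   )
   # A 200 with the revision body means it is reachable; an entry like
   # {"missing": "9-..."} means the responding shard copy does not have it.
   print(resp.status_code, resp.json())
   ```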
   
   Any ideas are welcome...




[GitHub] [couchdb] arahjan commented on issue #4204: Replication crashes just on single database from many

arahjan commented on issue #4204:
URL: https://github.com/apache/couchdb/issues/4204#issuecomment-1281191886

   > Is there anything different about that particular document compared to the other documents?
   
   Actually, there are more documents being reported in couch.log. I've just added a single one as an example.
   
   > Does it have more conflicts, or is it updated more often while the replication is happening, perhaps?
   
   I haven't noticed anything. At the moment the source is passive, meaning there are no changes to the documents.




[GitHub] [couchdb] nickva commented on issue #4204: Replication crashes just on single database from many

nickva commented on issue #4204:
URL: https://github.com/apache/couchdb/issues/4204#issuecomment-1277752399

   `missing_doc` usually means that a document update seen in the changes feed was not found when the replicator went to fetch all of its revisions. That can happen if the document is deleted in the meantime (and compaction runs), or if the document was just created but somehow wasn't propagated to all the nodes in the cluster.
   
   There is one retry the replicator will do in that case. You can adjust the sleep period before the retry with the setting `[replicator] missing_doc_retry_msec`. The default is 2000 (2 seconds), but you could set it to, say, 10000 (10 seconds). Also check that you have good inter-node (cluster) connectivity.
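   
   A hedged example of changing that setting at runtime through the node config API (host and credentials are placeholders; if your version lacks the `_local` alias, use the full node name such as `couchdb@127.0.0.1`):
   
   ```python
   import requests
   
   # Raise the retry sleep from the 2000 ms default to 10 seconds.
   # Config values are passed as JSON strings.
   requests.put(
       "http://admin:password@127.0.0.1:5984/_node/_local/_config/replicator/missing_doc_retry_msec",
       json="10000",
   )
   ```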




[GitHub] [couchdb] arahjan commented on issue #4204: Replication crashes just on single database from many

arahjan commented on issue #4204:
URL: https://github.com/apache/couchdb/issues/4204#issuecomment-1282593844

   I haven't tried the upgrade yet.
   Anyway, I deleted all "questionable" docs on the source. Even so, I'm still seeing the replication process trying to fetch a doc which had been deleted.
   
   ```
   [notice] 2022-10-18T15:20:39.950023Z couchdb@127.0.0.1 <0.24560.251> -------- Retrying GET to http://admin:*****@a.b.x.d/cdb2/test/doc1?revs=true&open_revs=%5B%222-b3604d99facabc85abc075995edf2d75%22%5D&latest=true in 16.0 seconds due to error {function_clause,[{couch_replicator_api_wrap,'-**open_doc_revs/6-fun-1-',[404**,[{[83,101,114,118,101,114],[110,103,105,110,120,47,49,46,49,54,46,49]}
   ```
   
   What does the 404 error mean in this context? I guess it's a leftover old revision. Do I have to manually purge old revisions on the source (is that viable)? Or maybe manual compaction is enough?




[GitHub] [couchdb] nickva commented on issue #4204: Replication crashes just on single database from many

nickva commented on issue #4204:
URL: https://github.com/apache/couchdb/issues/4204#issuecomment-1279365281

   If you see `max_url_len` in the logs, check whether there are any 414 HTTP error responses from the replication endpoints. It could be that a proxy limits the maximum URL length, or that `max_document_id_length` was set too low for CouchDB.
   
   One case where that could also apply is when there are a lot of conflicted revisions: those end up being passed to the fetch-document request as an `atts_since=...` list, and if the response is 414 the replicator retries with a shorter list of `atts_since=...` values.
   
   Here is where I found a reference to it:
   
   https://github.com/apache/couchdb/blob/21eebad0fb6ea62786d915b29797983be537908a/src/couch_replicator/src/couch_replicator_api_wrap.erl#L346-L363
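   
   As an illustration only, the retry strategy in that function amounts to roughly the following (a Python sketch of the behavior described above, not the actual implementation; the initial length budget here is a placeholder):
   
   ```python
   import json
   
   import requests
   
   def fetch_doc_revs(base_url, doc_id, open_revs, atts_since, max_url_len=8000):
       """Fetch document revisions, shrinking the atts_since list whenever
       the URL grows too long or the server answers 414."""
       while True:
           req = requests.Request(
               "GET",
               f"{base_url}/{doc_id}",
               params={
                   "revs": "true",
                   "latest": "true",
                   "open_revs": json.dumps(open_revs),
                   "atts_since": json.dumps(atts_since),
               },
           ).prepare()
           if len(req.url) > max_url_len and atts_since:
               # URL too long: drop half of the atts_since ancestors and rebuild.
               atts_since = atts_since[: len(atts_since) // 2]
               continue
           resp = requests.Session().send(req)
           if resp.status_code == 414 and atts_since:
               # Server rejected the URL: halve the budget and retry, mirroring
               # "NewMaxLen = get_value(max_url_len, Options, ?MAX_URL_LEN) div 2".
               max_url_len //= 2
               continue
           return resp
   ```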




[GitHub] [couchdb] arahjan commented on issue #4204: Replication crashes just on single database from many

arahjan commented on issue #4204:
URL: https://github.com/apache/couchdb/issues/4204#issuecomment-1278692387

   I added this parameter, but it didn't help.
   I dumped the database from the MAIN server and restored it to the SPARE. Unfortunately, that didn't help either. Even though both servers have the same number of documents, the replication status is "crashed".
   There is info in the crash details pointing to {couch_replicator_api_wrap.erl"},{line,302}]}, which is the line `NewMaxLen = get_value(max_url_len, Options, ?MAX_URL_LEN) div 2,`. Could the max_url_len parameter somehow be related then? What's its default value, and is it possible to increase it?
   Connectivity between the nodes is fine. As I said, only this particular database is problematic.
   




[GitHub] [couchdb] nickva commented on issue #4204: Replication crashes just on single database from many

nickva commented on issue #4204:
URL: https://github.com/apache/couchdb/issues/4204#issuecomment-1282619560

   @arahjan During replication the deleted document tombstones (markers) are also replicated. That's needed because, if we have the same document on the target, we'd want it to be deleted there as well when the source deletes it.
   
   There are a few ways to avoid replicating tombstones, or to remove them later:
   
    *  https://blog.cloudant.com/2021/05/21/Removing-Tombstones.html
    *  https://blog.cloudant.com/2019/12/13/Filtered-Replication.html
   
   Purging could work as well.
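   
   For completeness, a hedged sketch of purging a tombstone through the `_purge` endpoint (available since CouchDB 2.3; the doc ID and revision are the ones from the log above, the host and credentials are placeholders):
   
   ```python
   import requests
   
   # Purge removes the tombstone revision entirely from the source database.
   resp = requests.post(
       "http://admin:password@x.x.x.x:5984/test/_purge",
       json={"doc1": ["2-b3604d99facabc85abc075995edf2d75"]},
   )
   print(resp.status_code, resp.json())
   ```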
   
   

