Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2020/06/12 04:52:29 UTC

[GitHub] [couchdb] markusd opened a new issue #2941: Database compaction stuck in a loop

markusd opened a new issue #2941:
URL: https://github.com/apache/couchdb/issues/2941


   ## Description
   
   3 of 9 nodes in my CouchDB cluster were stuck in a loop during database compaction. These nodes were generating significantly more disk I/O than the unaffected nodes, which is how I noticed the problem in the first place. I do not think the tasks showed up in `_active_tasks`, at least not consistently (they may have appeared and disappeared intermittently).
   
   The compaction files were last modified much earlier than the database file itself:
   ```
   -rw-rw-r--  1 couchdb root 16249164041 Jun 10 10:28 inventories_db.1590763566.couch
   -rw-rw-r--  1 couchdb root  2852633672 Jun  2 06:43 inventories_db.1590763566.couch.compact.data
   -rw-rw-r--  1 couchdb root     9287509 Jun  2 06:43 inventories_db.1590763566.couch.compact.meta
   ```
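
A minimal sketch of how such leftovers could be spotted, assuming shard files live under a directory you pass in and that a compaction file last modified a day or more before its `.couch` file is suspicious. The function name `find_stale_compaction_files`, the threshold, and the directory layout are illustrative assumptions, not part of CouchDB. It only reports candidates and deletes nothing.

```python
from pathlib import Path

# Suffixes taken from the directory listing above.
COMPACT_SUFFIXES = (".compact.data", ".compact.meta")


def find_stale_compaction_files(shard_dir, min_lag_seconds=24 * 3600):
    """Return (compaction_file, lag_seconds) pairs where the compaction file
    was last modified at least min_lag_seconds before its database file."""
    stale = []
    for db_file in sorted(Path(shard_dir).rglob("*.couch")):
        db_mtime = db_file.stat().st_mtime
        for suffix in COMPACT_SUFFIXES:
            candidate = db_file.with_name(db_file.name + suffix)
            if not candidate.exists():
                continue
            # Positive lag means the compaction file is older than the db file.
            lag = db_mtime - candidate.stat().st_mtime
            if lag >= min_lag_seconds:
                stale.append((candidate, lag))
    return stale
```

Calling `find_stale_compaction_files("/path/to/shards")` (the path is a placeholder) returns a list you can review by hand before deciding what to do with each file.
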
   The logs showed a loop of this every few seconds:
   
   ```
   [notice] 2020-06-10T10:54:57.537879Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: adding <<"shards/00000000-0e38e38d/inventories_db.1590763566">> to internal compactor queue with priority 1662730747
   [notice] 2020-06-10T10:54:57.538019Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: Starting compaction for shards/00000000-0e38e38d/inventories_db.1590763566 (priority 1662730747)
   [info] 2020-06-10T10:54:57.538072Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.10254.140> -------- Starting compaction for db "shards/00000000-0e38e38d/inventories_db.1590763566" at 333559
   [notice] 2020-06-10T10:54:57.538354Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: Started compaction for shards/00000000-0e38e38d/inventories_db.1590763566
   [warning] 2020-06-10T10:54:59.521669Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- exit for compaction of ["shards/00000000-0e38e38d/inventories_db.1590763566"]: {function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}
   [error] 2020-06-10T10:54:59.522030Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m emulator -------- Error in process <0.21911.135> on node 'couchdb@c-couchdb-2-m-6.c-couchdb-2-m' with exit value:
   {function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}
   [info] 2020-06-10T10:54:59.522029Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.227.0> -------- db shards/00000000-0e38e38d/inventories_db.1590763566 died with reason {function_clause,[{couch_emsort,set_options,[{ems,<0.7233.140>,undefined,10,100,0,0},{[9052929,8823270,8599979,8373646,8144824,7929093,7702555,7474977,7250283,7024316],7025740}],[{file,"src/couch_emsort.erl"},{line,157}]},{couch_emsort,open,2,[{file,"src/couch_emsort.erl"},{line,154}]},{couch_bt_engine_compactor,bind_emsort,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,634}]},{couch_bt_engine_compactor,open_compaction_files,3,[{file,"src/couch_bt_engine_compactor.erl"},{line,109}]},{couch_bt_engine_compactor,start,4,[{file,"src/couch_bt_engine_compactor.erl"},{line,62}]}]}
   
   [notice] 2020-06-10T10:55:04.524763Z couchdb@c-couchdb-2-m-6.c-couchdb-2-m <0.2700.0> -------- slack_dbs: adding <<"shards/00000000-0e38e38d/inventories_db.1590763566">> to internal compactor queue with priority 1662730747
   ```
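
The repeating failure in the excerpt above could be detected by counting `exit for compaction of` warnings per shard. A rough sketch, assuming the log line format shown here; `count_compaction_exits`, `looping_shards`, and the threshold are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Matches the shard name inside the warning line format shown in the excerpt.
EXIT_RE = re.compile(r'exit for compaction of \["([^"]+)"\]')


def count_compaction_exits(log_lines):
    """Count how many times each shard's compaction exited abnormally."""
    counts = Counter()
    for line in log_lines:
        match = EXIT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looping_shards(log_lines, threshold=3):
    """Return shards whose compaction exited at least `threshold` times."""
    counts = count_compaction_exits(log_lines)
    return sorted(shard for shard, n in counts.items() if n >= threshold)
```

Feeding it the lines of a node's log file, for example `looping_shards(open(path))` with `path` pointing at that log, gives the shards worth inspecting first.
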
   I deleted the `.compact` files manually, which restarted the compaction. It ran to completion on all nodes without issues.
   
   ## Steps to Reproduce
   
   I do not know. I simply replicated a database (6 million docs, 250 GB, q=18, n=3), and the automatic background compaction started and got stuck.
   
   ## Expected Behaviour
   
   Compaction should not get stuck, and it should not consume large amounts of disk I/O without making progress.
   
   ## Your Environment
   
   * CouchDB version used: 3.1.0


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] rnewson commented on issue #2941: Database compaction stuck in a loop

Posted by GitBox <gi...@apache.org>.
rnewson commented on issue #2941:
URL: https://github.com/apache/couchdb/issues/2941#issuecomment-657399911


   Agreed this is a real bug and the cause is the commit you pointed out (123bf82370c21a8b5458299a7e36c477a0fedca4). The state is then passed as an option, and set_options (rightly) crashes if passed an unexpected option.
   





[GitHub] [couchdb] nickva commented on issue #2941: Database compaction stuck in a loop

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2941:
URL: https://github.com/apache/couchdb/issues/2941#issuecomment-656990181


   Thanks for the report @markusd and good analysis, @wohali. It does look like a bug.
   
   If this was a compaction file left over from before the upgrade to 3.1.0, this is what might have happened:
   
   Previously the `emsort:get_state/1` result was just the [root value](https://github.com/apache/couchdb/blob/master/src/couch/src/couch_emsort.erl#L167) `{BB, PrevPos}`; in 3.1.0 it was changed to [`[{root, Root}, ...]`](https://github.com/apache/couchdb/blob/3.x/src/couch/src/couch_emsort.erl#L181-L185). There is an upgrade clause, but it checks for an [integer only](https://github.com/apache/couchdb/blob/3.x/src/couch/src/couch_bt_engine_compactor.erl#L631), and I think we'd want to check for the `{_BB, _PrevPos}` pattern there instead.
   
   @davisp, what do you think, is that about right?
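
To make the shape mismatch concrete, here is a rough Python model of the behaviour described above. It is not CouchDB's Erlang code: `set_options`, `upgrade_integer_only`, and `upgrade_with_tuple` are illustrative stand-ins, and only the `root` option is modelled because it is the only one named in this thread.

```python
# Legacy state shape {BB, PrevPos}, with values shortened from the log above.
LEGACY_STATE = ([9052929, 8823270, 8599979], 7025740)

KNOWN_OPTIONS = {"root"}  # assumption: the real option list is longer


def set_options(ems, options):
    """Strict option handling: reject anything that is not a list of known
    (key, value) pairs, loosely analogous to a function_clause error."""
    if not isinstance(options, list):
        raise TypeError(f"unexpected options: {options!r}")
    for key, value in options:
        if key not in KNOWN_OPTIONS:
            raise TypeError(f"unexpected option: {key!r}")
        ems[key] = value
    return ems


def upgrade_integer_only(state):
    """Models an upgrade clause that only recognises an integer state."""
    if isinstance(state, int):
        return [("root", state)]
    return state  # a legacy (BB, PrevPos) tuple passes through unchanged


def upgrade_with_tuple(state):
    """Models the suggestion: also wrap a two-element (BB, PrevPos) tuple."""
    if isinstance(state, int) or (isinstance(state, tuple) and len(state) == 2):
        return [("root", state)]
    return state
```

With the integer-only check, `set_options({}, upgrade_integer_only(LEGACY_STATE))` raises because the tuple reaches `set_options` unwrapped. With the tuple check, the same state is wrapped as a `root` option and accepted.
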





[GitHub] [couchdb] bdoyle0182 edited a comment on issue #2941: Database compaction stuck in a loop

Posted by GitBox <gi...@apache.org>.
bdoyle0182 edited a comment on issue #2941:
URL: https://github.com/apache/couchdb/issues/2941#issuecomment-742127830


   We're seeing a similar issue with random compactions on shards when upgrading from 2.x to 3.1.1, though it might be completely unrelated. The compaction metadata file blew up to about 500 GB over 24 hours for a shard of about 30 GB that is constantly hitting this error. Similarly, we're also seeing heavy disk I/O on the nodes where this is happening compared with the nodes where it is not.
   
   ```
   <0.5181.0> -------- exit for compaction of ["shards/60000000-7fffffff/core_activations.1589336264"]: {badarith,[{couch_file,get_pread_locnum,3,[{file,"src/couch_file.erl"},{line,730}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{lists,map,2,[{file,"lists.erl"},{line,1239}]},{couch_file,read_multi_raw_iolists_int,2,[{file,"src/couch_file.erl"},{line,719}]},{couch_file,handle_call,3,[{file,"src/couch_file.erl"},{line,507}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
   ```
   
   ```
   -------- CRASH REPORT Process  (<0.32443.1691>) with 3 neighbors crashed with reason: bad arithmetic expression at couch_file:get_pread_locnum/3(line:730) <= lists:map/2(line:1239) <= couch_file:read_multi_raw_iolists_int/2(line:719) <= couch_file:handle_call/3(line:507) <= gen_server:try_handle_call/4(line:636) <= gen_server:handle_msg/6(line:665) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_file,init,['Argument__1']}, ancestors: [<0.3352.1692>], message_queue_len: 0, messages: [], links: [<0.3352.1692>], dictionary: [{couch_file_fd,{{file_descriptor,prim_file,{#Port<0.1924847>,92}},"..."}},...], trap_exit: false, status: running, heap_size: 28690, stack_size: 27, reductions: 13483
   ```
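
A compaction file that has grown far beyond its database file, as in the report above, could be flagged with a simple size comparison. This is a sketch under the assumption that a compaction file larger than its `.couch` file deserves a look; `oversized_compaction_files` and `max_ratio` are hypothetical names, and the right threshold depends on your data.

```python
from pathlib import Path


def oversized_compaction_files(shard_dir, max_ratio=1.0):
    """Return (compaction_file, ratio) pairs where the compaction file is
    more than max_ratio times the size of its database file."""
    flagged = []
    for db_file in sorted(Path(shard_dir).rglob("*.couch")):
        db_size = db_file.stat().st_size
        if db_size == 0:
            continue  # avoid dividing by zero for empty database files
        for suffix in (".compact.data", ".compact.meta"):
            candidate = db_file.with_name(db_file.name + suffix)
            if not candidate.exists():
                continue
            ratio = candidate.stat().st_size / db_size
            if ratio > max_ratio:
                flagged.append((candidate, ratio))
    return flagged
```
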





[GitHub] [couchdb] wohali commented on issue #2941: Database compaction stuck in a loop

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2941:
URL: https://github.com/apache/couchdb/issues/2941#issuecomment-643092066


   Looks like a real bug. Thanks for the info on a workaround!





[GitHub] [couchdb] nickva commented on issue #2941: Database compaction stuck in a loop

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2941:
URL: https://github.com/apache/couchdb/issues/2941#issuecomment-657639527


   Merged the fix to 3.x branch: https://github.com/apache/couchdb/pull/3001
   
   





[GitHub] [couchdb] janl closed issue #2941: Database compaction stuck in a loop

Posted by GitBox <gi...@apache.org>.
janl closed issue #2941:
URL: https://github.com/apache/couchdb/issues/2941


   








[GitHub] [couchdb] kathy1121 commented on issue #2941: Database compaction stuck in a loop

Posted by GitBox <gi...@apache.org>.
kathy1121 commented on issue #2941:
URL: https://github.com/apache/couchdb/issues/2941#issuecomment-657560277


   Thanks Markus, I ran into the same issue and solved it as you suggested. Much appreciated!




