Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2020/01/10 15:56:24 UTC

[GitHub] [couchdb] wohali opened a new issue #2437: Hung beam.smp sitting at 100% CPU

wohali opened a new issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437
 
 
   ## Description
   
   ```irc
   13:56 <+davisp> 1) I've seen some test cases sit idle for 40-50m+. SSH'ing to the node
                   holding that job shows beam.smp spinning at ~100% CPU on the node.
                   Unfortunately using `kill -SIGUSR1 $pid` to try and generate a crash dump
                   does not work.
   ```
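
For context, the Erlang runtime is documented to force a crash dump when it receives SIGUSR1, which is why the quoted `kill -SIGUSR1 $pid` is the usual first step. A minimal Python sketch of that step, with a check for whether a dump actually appeared, might look like the following. The function name `request_crash_dump`, the timeout values, and the assumption that the dump lands at `erl_crash.dump` (or wherever `ERL_CRASH_DUMP` points) are illustrative and not taken from the CouchDB CI scripts.

```python
import os
import signal
import time


def request_crash_dump(pid, dump_path="erl_crash.dump", timeout=30.0, poll=0.5):
    """Send SIGUSR1 to `pid` and wait for a crash dump file to appear.

    Returns True if `dump_path` exists within `timeout` seconds, else False.
    `dump_path` is an assumption: by default the VM writes erl_crash.dump in
    its working directory unless ERL_CRASH_DUMP points elsewhere.
    """
    os.kill(pid, signal.SIGUSR1)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(dump_path):
            return True
        time.sleep(poll)
    # One last check in case the file appeared during the final sleep.
    return os.path.exists(dump_path)
```

A `False` result here would correspond to the situation described above, where the signal is sent but no dump is produced.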
   
   ## Steps to Reproduce
   
   Unsure so far. Perhaps an EUnit soak test would work.
   
   ## Expected Behaviour
   
   beam should not hang.
   
   ## Your Environment
   
   This has occurred on IBM Cloud (x86_64, docker) workers as well as our FreeBSD (12.1, iocage, no docker) workers during various builds.
   
   ## Additional Context
   
   Unclear if this is specific to Jenkins, to the version of Erlang we're using, or if this is genuinely a release-blocking issue.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-579003677
 
 
   Upstream Erlang/OTP bug filed https://bugs.erlang.org/browse/ERL-1152


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
 
 
   With the patch, we've gone into a hung beam again, but this time not 100% CPU.
   
   https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
   
   Erlang 21.3.8.12.


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-575263817
 
 
   And:
   
   https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/jenkins-185-arm64/6/pipeline
   
   Notably this time it's on x86_64.


[GitHub] [couchdb] dch commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
dch commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-577119512
 
 
   Some background reading
   
   http://erlang.org/pipermail/erlang-questions/2017-February/091721.html
   
   "The only "normal" reason that I have found so far for erl_child_setup to
   exit is if the Linux OOM killed decides to terminate it. Besides that all
   terminations should be considered bugs in erl_child_setup."
   
   From the OTP 21 branch https://github.com/erlang/otp/commits/OTP-21.3.8.12
   
   - https://github.com/erlang/otp/commit/849361207b506ad390c2a247e4591a581bddb607
   - https://github.com/erlang/otp/commit/dce979e061d2c55a4fdfc47f1c5c74b948a37182
   
   & from OTP master:
   
   - https://github.com/erlang/otp/commit/1eb6bbf780edfbb64cb74a9be27290f38d26144f
   


[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576929600
 
 
   Another test hang:
   
   https://gist.github.com/wohali/5c903d20c023339c00a92ec322b9d1d2
   
   Last test before hang:
   
   ```
       chttpd_open_revs_error_test:105: should_return_503_error_for_open_revs_post_form...ok
   ```
   
   After attaching gdb to the parent `beam.smp` and retrieving the backtrace, then detaching, the hang self-cleared (!)
   
   The tests then continued and hung again after:
   
   ```
       couch_mrview_collation_tests:183: should_use_collator_for_reduce_grouping...ok
   [os_mon] memory supervisor port (memsup): Erlang has closed
   [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
       [done in 1.117 s]
     [done in 1.272 s]
   module 'couch_mrview_purge_docs_tests'
     Map views
   ```
   
   Gathering stats now.


[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
 
 
   With the patch, we've gone into a hung beam again, but this time not 100% CPU.
   
   https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
   
   Erlang 21.3.8.12.
   
   [ETA: Updated gist with gdb output from all the relevant threads.]


[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
 
 
   With the patch, we've gone into a hung beam again, but this time not 100% CPU.
   
   https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
   
   Erlang 21.3.8.12.
   
   [ETA: Updated gist with gdb output from all the relevant threads.]
   
   [ETA 2: Killed the `couchjs` process and the tests are still stuck. @davisp calls this a "smoking gun."]
   
   [ETA 3: Killed erl_child_setup, and it does not get reaped - zombie:
   
   ```
   jenkins   15258  0.0  0.0      0     0 ?        Zs   21:44   0:00 [erl_child_setup] <defunct>
   ```
   ]


[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-574339581
 
 
   I have a backtrace I had saved
   
   [vm_stuck_bt.txt](https://github.com/apache/couchdb/files/4061151/vm_stuck_bt.txt)
   
   


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576401953
 
 
   Most often I'm seeing this in CI on 22.2 (for PRs) and on FreeBSD (for "full" builds). A [recently merged change](https://github.com/apache/couchdb/commit/7214e506199f41babd09611c7ab3564291d5be06) works around our FreeBSD CI worker dying from this, but it papers over the real problem.


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576929600
 
 
   Another test hang:
   
   https://gist.github.com/wohali/5c903d20c023339c00a92ec322b9d1d2
   
   Last test before hang:
   
   ```
       chttpd_open_revs_error_test:105: should_return_503_error_for_open_revs_post_form...ok
   ```
   
   After attaching gdb to the parent `beam.smp` and retrieving the backtrace, then detaching, the hang self-cleared (!)
   
   The tests then continued and succeeded in the rest of the run.


[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-578443244
 
 
   When the VM was stuck on Erlang 22, I sampled the stack a few times with:
   
   `pidof rebar | xargs -n1 sudo gdb --batch -ex "thread apply 5 bt" -p`
   
   [sample_stuck_vm.txt](https://github.com/apache/couchdb/files/4112786/sample_stuck_vm.txt)
   
   Thread 5 seemed to be the one actively spinning, so I sampled just its stack.
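
Repeating that gdb invocation at a fixed interval makes it easier to see whether the spinning thread is moving through different frames or looping in one place. A rough Python sketch is below; the helper names `gdb_bt_command` and `sample_stack`, the default sample count, and the interval are assumptions for illustration. Only the gdb flags come from the command quoted above, and attaching will usually need the same privileges that `sudo` provided there.

```python
import subprocess
import time


def gdb_bt_command(pid, thread=5):
    """Build the gdb invocation quoted above for a single process id."""
    return ["gdb", "--batch", "-ex", f"thread apply {thread} bt", "-p", str(pid)]


def sample_stack(pid, thread=5, samples=5, interval=1.0):
    """Collect `samples` backtraces of one thread, `interval` seconds apart."""
    traces = []
    for _ in range(samples):
        result = subprocess.run(
            gdb_bt_command(pid, thread),
            capture_output=True,
            text=True,
            check=False,  # keep sampling even if one attach attempt fails
        )
        traces.append(result.stdout)
        time.sleep(interval)
    return traces
```

Comparing consecutive entries of the returned list is a quick way to tell a tight spin from slow progress.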


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-575255673
 
 
   Another failed build with this situation:
   
   https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/jenkins-185-arm64/5/pipeline


[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
 
 
   With the patch, we've gone into a hung beam again, but this time not 100% CPU.
   
   https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
   
   Erlang 21.3.8.12.
   
   [ETA: Updated gist with gdb output from all the relevant threads.]
   
   [ETA 2: Killed the `couchjs` process and the tests are still stuck. @davisp calls this a "smoking gun."]


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-574028139
 
 
   This is 100% repeatable now on FreeBSD when a build is aborted. The Jenkins termination process leaves beam.smp jobs lying around.
   
   This may be related to how Jenkins kills jobs on FreeBSD, or it could be something else.
   
   I can try and force a core or something on the hung `beam.smp` if someone can help diagnose this.


[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576929600
 
 
   Another test hang:
   
   https://gist.github.com/wohali/5c903d20c023339c00a92ec322b9d1d2
   
   Last test before hang:
   
   ```
       chttpd_open_revs_error_test:105: should_return_503_error_for_open_revs_post_form...ok
   ```
   
   After attaching gdb to the parent `beam.smp` and retrieving the backtrace, then detaching, the hang self-cleared (!)
   
   The tests then continued and hung again after:
   
   ```
       couch_mrview_collation_tests:183: should_use_collator_for_reduce_grouping...ok
   [os_mon] memory supervisor port (memsup): Erlang has closed
   [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
       [done in 1.117 s]
     [done in 1.272 s]
   module 'couch_mrview_purge_docs_tests'
     Map views
   ```
   
   Results here: https://gist.github.com/wohali/2b6a8563658a468b5810a35b764f4a19
   
   Detaching `gdb` from `beam.smp` didn't resume this time.


[GitHub] [couchdb] willholley commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
willholley commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-574328422
 
 
   https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FPullRequests/detail/PR-2452/6/pipeline appears to have stalled - I'll restart it in the UK morning in case there's anything useful to be gleaned from it today.


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-575868479
 
 
   Interesting one:
   
   https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FPullRequests/detail/PR-2468/6/pipeline
   
   @davisp this was on 22.2, SM60, x86_64...note the error -11 in test/javascript/tests/view_update_seq.js 


[GitHub] [couchdb] dch commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
dch commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-577200406
 
 
   no failures in ~6h of running `gmake eunit javascript` on latest OTP20, but OTP21 & OTP22 fail reliably.
   
   I've noticed that erl_child_setup isn't actually hung: if you kill off disksup or cpusup, they'll be correctly restarted. So perhaps this is something specific to couchjs' heavy use of stdio?
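
A soak run like the one described above can be approximated by re-running the test command until a time budget is spent, treating any run that exceeds a per-run timeout as a hang. The following Python sketch is one way to do that; the function name `soak` and both time limits are hypothetical, and the command in the usage note is simply the `gmake eunit javascript` mentioned above.

```python
import subprocess
import time


def soak(command, duration, per_run_timeout):
    """Re-run `command` until `duration` seconds pass or a run hangs or fails.

    Returns ("ok", runs), ("failed", runs) or ("hung", runs).
    """
    runs = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        runs += 1
        try:
            result = subprocess.run(command, timeout=per_run_timeout)
        except subprocess.TimeoutExpired:
            # The run did not finish in time; report it as a suspected hang.
            return ("hung", runs)
        if result.returncode != 0:
            return ("failed", runs)
    return ("ok", runs)
```

For example, `soak(["gmake", "eunit", "javascript"], duration=6 * 3600, per_run_timeout=3600)` would mirror a six-hour run. Note that on timeout `subprocess.run` kills only the direct child, so a stuck `beam.smp` further down the process tree may still need to be inspected or cleaned up by hand.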


[GitHub] [couchdb] wohali closed issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali closed issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437
 
 
   


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-577282640
 
 
   @dch In my tests, I was able to kill off disksup or cpusup and erl_child_setup would be hung. In another case I killed off erl_child_setup and beam.smp either hung or segfaulted; check the links above for details.


[GitHub] [couchdb] davisp commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
davisp commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576921440
 
 
   Could be a smoking gun at least. My current theory is that there's some sort of messaging protocol between erl_child_setup and the Erlang VM that's getting confused. beam just sitting stuck waiting on a read of a port seems not good. Will have to read more on erl_child_setup to get a figure on what it might be and hopefully figure out a test case for upstream.


[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
 
 
   With the patch, we've gone into a hung beam again, but this time not 100% CPU.
   
   https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
   
   Erlang 21.3.8.12.
   
   [ETA: Updated gist with gdb output from all the relevant threads.]
   
   [ETA 2: Killed the `couchjs` process and the tests are still stuck. @davisp calls this a "smoking gun."]
   
   [ETA 3: Killed `erl_child_setup`, and it does not get reaped - zombie:
   
   ```
   jenkins   15258  0.0  0.0      0     0 ?        Zs   21:44   0:00 [erl_child_setup] <defunct>
   ```
   @davisp thinks this is enough to show that this is a VM bug.]
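
Spotting an unreaped `erl_child_setup` like the one above can be scripted by scanning `ps aux`-style output for a process state beginning with `Z`. The Python sketch below assumes the eleven-column layout shown in the listing above; the function name `find_zombies` and its default `name` filter are illustrative choices.

```python
def find_zombies(ps_output, name="erl_child_setup"):
    """Return (pid, command) for defunct processes whose command mentions `name`.

    Expects `ps aux`-style lines:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    zombies = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or a line in an unexpected format
        stat, command = fields[7], fields[10]
        if stat.startswith("Z") and name in command:
            zombies.append((int(fields[1]), command))
    return zombies
```

It could be fed the text returned by `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout` on the worker in question.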


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576404587
 
 
   @nickva sure looks like scheduler failure. I wish I knew why :(
   
   I also want to be sure that the 100% hung beam.smp issue is the same as the "error 32" problem we're seeing on failing to launch `couchjs` correctly. They seem related...but can't be sure.
   
   We looked at the source code for 21.x and 22.x for https://github.com/erlang/otp/commits/master/erts/emulator/sys/unix/erl_child_setup.c and didn't see anything in those releases. There are more recent commits, but nothing released yet (per GH tags).
   
   Here's the most recent commit, which landed in OTP-21.0.9. This would be the most likely culprit... https://github.com/erlang/otp/commit/da4c24bf8fc7bb2ee0d0a66d9fcfe6344d7c0c8a#diff-d0f4c5a298602460b21bc50914237807
   
   ```
   erl_child_setup program ignores TERM signals as of ERTS version
   10.0 (cff8dce). This setting was unfortunately inherited by
   port programs. This commit restores handling of TERM signals
   in port programs to the default behavior. That is, terminate the
   process.
   ```
   
   🤔 


[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-577761557
 
 
   ```
   [2020-01-23T15:08:02.163Z]   [done in 0.468 s]
   [2020-01-23T15:08:02.163Z] module 'mem3_reshard_test'
   [2020-01-23T15:08:02.163Z]   mem3 shard split db tests
   [2020-01-23T15:08:03.263Z]     mem3_reshard_test:86: split_one_shard...[0.459 s] ok
   [2020-01-23T15:08:03.963Z]     mem3_reshard_test:146: update_docs_before_topoff1...[0.466 s] ok
   [2020-01-23T15:44:16.999Z] Sending interrupt signal to process
   [2020-01-23T15:44:17.658Z] Sending interrupt signal to process
   [2020-01-23T15:44:26.285Z] Terminated
   [2020-01-23T15:44:26.285Z] make[1]: *** [Makefile:175: eunit] Terminated
   [2020-01-23T15:44:26.285Z] make: *** [Makefile:153: check] Terminated
   [2020-01-23T15:44:26.285Z]     undefined
   [2020-01-23T15:44:26.285Z]     *** context setup failed ***
   [2020-01-23T15:44:26.285Z] **in function mem3_reshard_test:with_proc/3 (test/eunit/mem3_reshard_test.erl, line 685)
   [2020-01-23T15:44:26.285Z] in call from mem3_reshard_test:setup/0 (test/eunit/mem3_reshard_test.erl, line 35)
   [2020-01-23T15:44:26.285Z] **error:{noproc,{gen_server,call,[mem3_nodes,get_nodelist]}}
   [2020-01-23T15:44:26.285Z]
   [2020-01-23T15:44:26.285Z]
   [2020-01-23T15:44:26.285Z]   undefined
   [2020-01-23T15:44:26.285Z]   *** context setup failed ***
   [2020-01-23T15:44:26.285Z] **in function mem3_reshard_test:with_proc/3 (test/eunit/mem3_reshard_test.erl, line 685)
   [2020-01-23T15:44:26.285Z] in call from mem3_reshard_test:setup/0 (test/eunit/mem3_reshard_test.erl, line 35)
   [2020-01-23T15:44:26.285Z] **error:{noproc,{gen_server,call,[mem3_nodes,get_nodelist]}}
   ```
   
   Another case of a stuck scheduler, I think. It looks like one test finished, then the next one was being set up. In the setup code we call `mem3_nodes:get_nodelist()`; that call, however, never returned, and everything was stuck waiting for it.
   
   I clicked either stop or restart on `make check`. It then received a bunch of termination signals, which seem to have killed the mem3_nodes gen_server, and all the stuck waiting calls returned with noproc. Then the whole thing got torn down.


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576814116
 
 
   Nick noticed this commit, which I'm going to try running in a loop locally and see if I can get a repro with the patch:
   
   https://github.com/erlang/otp/commit/0109f91a76ea6873cf9851529d7c60b70f328f86
   
   which has been ported as far back as 20.x, suggesting it's a fairly serious bug.


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576438585
 
 
   If we need to split the issues, that's fine, but both are 3.0.0 blockers IMO. 


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576434017
 
 
   Another hang, and again in 22.2:
   
   https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FPullRequests/detail/PR-2472/2/pipeline


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-574332938
 
 
   ```irc
   14:20 <vatamane> i had managed to lock beam on a mac when it was running eunit tests and I
                    was compiling fdb from scratch, so all cpus were busy and memory usage
                    was high
   14:20 <+Wohali> what version of beam
   14:20 <+Wohali> fbsd is currently on 21.something, i have to check
   14:21 <vatamane> backtraces showed only one scheduler (#1) spinning and most other waiting
                    on a condition variable
   14:21 <vatamane> 21.3.8.4
   14:21 <+Wohali> looks like fbsd is 21.3.8.11,4 so that lines up
   14:22 <+Wohali> could this be another scheduler collapse?
   14:23 <vatamane> since I saw the child_setup_failed which deals with forking ports
                    (processes) like couchjs and cleaning them up and so on
   14:24 <+Wohali> i guess we'll have to go spelunk into recent erlang changes in all that
   14:25 <+davisp> Wohali: Probably that yeah. Specifically I'd go look at the commit history
                   to that child_setup helper
   14:26 <+davisp> https://github.com/erlang/otp/commits/master/erts/emulator/sys/unix/erl_child_setup.c
   14:26 <+Wohali> ok, at least there's a direction. i'll paste this convo into the bug
   14:26 <+Wohali> ty for the pointer
   ```


[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576434680
 
 
   Interestingly, the stream of `error 32` broken pipes seems to be predicated on this test failing:
   
   ```
   ::in function mem3_sync_event_listener:'-should_terminate/1-fun-3-'/0 (src/mem3_sync_event_listener.erl, line 312)
   in call from mem3_sync_event_listener:'-should_terminate/1-fun-5-'/1 (src/mem3_sync_event_listener.erl, line 312)
   **error:{assert,[{module,mem3_sync_event_listener},
   {line,312},
   {expression,"false"},
   {expected,true},
   {value,false}]}
   ```

Re: [GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by Joan Touzet <wo...@apache.org>.
Sorry, disregard...wrong destination.

On 2020-01-20 4:41 p.m., Joan Touzet wrote:
> If we need to split the issues, that's fine, but both are 3.0.0 blockers 
> IMO.

Re: [GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by Joan Touzet <wo...@apache.org>.
If we need to split the issues, that's fine, but both are 3.0.0 blockers 
IMO.

[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576435270
 
 
   @wohali 
   
   > I also want to be sure that the 100% hung beam.smp issue is the same as the "error 32" problem
   
   It might not be. I had only looked at it since I saw it happening before the "freeze". It might be either a coincidence, or the freeze also causes the child spawner process to die, so the causation is the other way around.

[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU

Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-573098465
 
 
   One note: the FreeBSD variant of this was seen during an eunit test of `couchjs`, though the error was:
   
   ```
   couch_js_tests: couch_js_test_...out of memory
   ```
   
   Output of `top` on the affected FreeBSD node:
   
   ```
     PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    1895 jenkins      24  52    0  1185M    71M select   1 663:13  99.73% beam.smp
   27828 jenkins      24  52    0  1173M    63M select   3 521:32  99.57% beam.smp
   31174 jenkins      24  52    0  1156M    68M select   1 503:54  98.30% beam.smp
   ```
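
For anyone checking other workers, here is a rough sketch of filtering `top` output for stuck emulators. The helper name `spinning_processes` and the 90% threshold are assumptions for illustration, and the column layout is assumed to match the FreeBSD listing above.

```python
# Sample copied from the FreeBSD `top` listing in this comment.
TOP_OUTPUT = """\
  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 1895 jenkins      24  52    0  1185M    71M select   1 663:13  99.73% beam.smp
27828 jenkins      24  52    0  1173M    63M select   3 521:32  99.57% beam.smp
31174 jenkins      24  52    0  1156M    68M select   1 503:54  98.30% beam.smp
"""


def spinning_processes(top_text, command="beam.smp", threshold=90.0):
    """Return (pid, time, wcpu) for rows of `command` at or above `threshold` % WCPU.

    Hypothetical helper: assumes the 12-column layout shown above, with
    COMMAND last, WCPU second to last, and TIME third to last.
    """
    hits = []
    for line in top_text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not the command we want.
        if len(fields) < 12 or fields[-1] != command:
            continue
        wcpu = float(fields[-2].rstrip("%"))
        if wcpu >= threshold:
            hits.append((int(fields[0]), fields[-3], wcpu))
    return hits


for pid, cpu_time, wcpu in spinning_processes(TOP_OUTPUT):
    print(f"pid {pid}: TIME {cpu_time}, WCPU {wcpu}%")
```

All three `beam.smp` rows above pass the filter.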
   
   So these are not from the run that ran out of memory itself. [Looking back farther for builds on this node](https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/master/16/pipeline/50), I see we're going out to lunch during eunit pretty badly (pay close attention to the timestamps):
   
   ```
   [2020-01-09T16:20:45.493Z] [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
   [2020-01-09T16:20:45.493Z]     [done in 0.200 s]
   [2020-01-09T16:20:45.493Z]   Check index files cleanup
   [2020-01-09T16:20:45.493Z]     clustered
   [2020-01-09T16:20:45.848Z]       couchdb_mrview_tests:155: should_cleanup_index_files...[0.084 s] ok
   [2020-01-09T16:20:45.848Z]       [done in 0.102 s]
   [2020-01-09T17:38:30.777Z] Sending interrupt signal to process
   [2020-01-09T17:38:50.777Z] After 20s process did not stop
   ```
   
   [Another run that failed similarly](https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/master/18/pipeline/50):
   
   ```
   [2020-01-09T18:38:50.214Z] module 'couch_mrview_purge_docs_tests'
   [2020-01-09T18:38:50.214Z]   Map views
   [2020-01-09T20:00:12.530Z] Sending interrupt signal to process
   [2020-01-09T20:00:32.530Z] After 20s process did not stop
   ```
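
To put a number on the stalls in the two logs above, a minimal sketch that computes the gap between the last test output and the Jenkins interrupt. The function name `log_gap_seconds` is hypothetical; the timestamps are copied from the log excerpts, and the format string assumes the Jenkins prefix always looks like `2020-01-09T16:20:45.848Z`.

```python
from datetime import datetime

# Jenkins timestamp prefix format as seen in the logs above (UTC, millisecond precision).
LOG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def log_gap_seconds(earlier: str, later: str) -> float:
    """Seconds elapsed between two Jenkins log timestamps."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# (last test output, "Sending interrupt signal to process") for each run.
runs = {
    "couchdb_mrview_tests": ("2020-01-09T16:20:45.848Z", "2020-01-09T17:38:30.777Z"),
    "couch_mrview_purge_docs_tests": ("2020-01-09T18:38:50.214Z", "2020-01-09T20:00:12.530Z"),
}

for name, (last_output, interrupt) in runs.items():
    gap = log_gap_seconds(last_output, interrupt)
    print(f"{name}: silent for {gap / 60:.1f} minutes")
```

That works out to roughly 78 and 81 minutes of silence before the interrupt in the two runs.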
   
   Version of Erlang on that node:
   ```
   Erlang/OTP 21 [erts-10.3.5.7] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [dtrace]
   ```
   
   Our FreeBSD workers still have SpiderMonkey 1.8.5, though [SM60 is now available](https://svnweb.freebsd.org/ports/head/lang/spidermonkey60/).
   
   @davisp @jiangphcn I'm worried that all of these failures seem to be mrview related. With the `couchjs` SM60 changes recently, could the bug be in both the 1.8.5 and 60 versions, or how we handle the view process?
