Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2020/01/10 15:56:24 UTC
[GitHub] [couchdb] wohali opened a new issue #2437: Hung beam.smp sitting at
100% CPU
wohali opened a new issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437
## Description
```irc
13:56 <+davisp> 1) I've seen some test cases sit idle for 40-50m+. SSH'ing to the node
holding that job shows beam.smp spinning at ~100% CPU on the node.
Unfortunately using `kill -SIGUSR1 $pid` to try and generate a crash dump
does not work.
```
## Steps to Reproduce
Unsure so far. Perhaps an EUnit soak test would work.
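One possible shape for such a soak run is sketched below. It is only a sketch: the `soak` helper, its parameters, and the idea of treating a timeout as a "hang" are assumptions for illustration, not something the project ships.

```python
import subprocess
from typing import Optional, Sequence, Tuple


def soak(cmd: Sequence[str], iterations: int,
         timeout_s: float) -> Optional[Tuple[int, str]]:
    """Run cmd up to `iterations` times.

    Returns (index, "hang") for the first run that exceeds timeout_s,
    (index, "failed") for the first run that exits non-zero,
    or None if every run finished cleanly.
    """
    for i in range(iterations):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before re-raising.
            return (i, "hang")
        if result.returncode != 0:
            return (i, "failed")
    return None
```

A call such as `soak(["make", "eunit"], iterations=50, timeout_s=3600)` would be the intended use; the command and the numbers are placeholders. Note that the timeout only kills the direct child, so a hung `beam.smp` started further down the process tree would still need to be inspected or cleaned up by hand.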
## Expected Behaviour
beam should not hang.
## Your Environment
This has occurred on IBM Cloud (x86_64, docker) workers as well as our FreeBSD (12.1, iocage, no docker) workers during various builds.
## Additional Context
Unclear if this is specific to Jenkins, to the version of Erlang we're using, or if this is genuinely a release blocking issue.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-579003677
Upstream Erlang/OTP bug filed https://bugs.erlang.org/browse/ERL-1152
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
With the patch, we've gone into a hung beam again, but this time not 100% CPU.
https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
Erlang 21.3.8.12.
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-575263817
And:
https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/jenkins-185-arm64/6/pipeline
Notably this time it's on x86_64.
----------------------------------------------------------------
[GitHub] [couchdb] dch commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
dch commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-577119512
Some background reading
http://erlang.org/pipermail/erlang-questions/2017-February/091721.html
"The only "normal" reason that I have found so far for erl_child_setup to
exit is if the Linux OOM killed decides to terminate it. Besides that all
terminations should be considered bugs in erl_child_setup."
From the OTP 21 branch https://github.com/erlang/otp/commits/OTP-21.3.8.12
- https://github.com/erlang/otp/commit/849361207b506ad390c2a247e4591a581bddb607
- https://github.com/erlang/otp/commit/dce979e061d2c55a4fdfc47f1c5c74b948a37182
& from OTP master:
- https://github.com/erlang/otp/commit/1eb6bbf780edfbb64cb74a9be27290f38d26144f
----------------------------------------------------------------
[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576929600
Another test hang:
https://gist.github.com/wohali/5c903d20c023339c00a92ec322b9d1d2
Last test before hang:
```
chttpd_open_revs_error_test:105: should_return_503_error_for_open_revs_post_form...ok
```
After attaching gdb to the parent `beam.smp` and retrieving the backtrace, then detaching, the hang self-cleared (!)
The tests then continued and hung again after:
```
couch_mrview_collation_tests:183: should_use_collator_for_reduce_grouping...ok
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[done in 1.117 s]
[done in 1.272 s]
module 'couch_mrview_purge_docs_tests'
Map views
```
Gathering stats now.
----------------------------------------------------------------
[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
With the patch, we've gone into a hung beam again, but this time not 100% CPU.
https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
Erlang 21.3.8.12.
[ETA: Updated gist with gdb output from all the relevant threads.]
----------------------------------------------------------------
[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
With the patch, we've gone into a hung beam again, but this time not 100% CPU.
https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
Erlang 21.3.8.12.
[ETA: Updated gist with gdb output from all the relevant threads.]
[ETA 2: Killed the `couchjs` process and the tests are still stuck. @davisp calls this a "smoking gun."]
[ETA 3: Killed erl_child_setup, and it does not get reaped - zombie:
```
jenkins 15258 0.0 0.0 0 0 ? Zs 21:44 0:00 [erl_child_setup] <defunct>
```
]
----------------------------------------------------------------
[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-574339581
I have a backtrace I had saved
[vm_stuck_bt.txt](https://github.com/apache/couchdb/files/4061151/vm_stuck_bt.txt)
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576401953
Most often I'm seeing this in CI on 22.2 (for PRs) and on FreeBSD (for "full" builds). A [recently merged change](https://github.com/apache/couchdb/commit/7214e506199f41babd09611c7ab3564291d5be06) works around our FreeBSD CI worker dying to this, but it wallpapers over the real problem.
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576929600
Another test hang:
https://gist.github.com/wohali/5c903d20c023339c00a92ec322b9d1d2
Last test before hang:
```
chttpd_open_revs_error_test:105: should_return_503_error_for_open_revs_post_form...ok
```
After attaching gdb to the parent `beam.smp` and retrieving the backtrace, then detaching, the hang self-cleared (!)
The tests then continued and succeeded in the rest of the run.
----------------------------------------------------------------
[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-578443244
When the VM was stuck on Erlang 22, I sampled the stack a few times with `pidof rebar | xargs -n1 sudo gdb --batch -ex "thread apply 5 bt" -p`:
[sample_stuck_vm.txt](https://github.com/apache/couchdb/files/4112786/sample_stuck_vm.txt)
Thread 5 seemed to be the one actively spinning, so I sampled just its stack.
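For repeated sampling, the same gdb invocation can be wrapped in a small loop. This is a rough sketch: the helper names, the sample count, and the interval are made up for illustration; only the gdb arguments come from the command above.

```python
import subprocess
import time
from typing import List


def gdb_bt_argv(pid: int, thread: int = 5) -> List[str]:
    """Build the gdb command line that prints one thread's backtrace."""
    return ["sudo", "gdb", "--batch",
            "-ex", f"thread apply {thread} bt",
            "-p", str(pid)]


def sample_stack(pid: int, thread: int = 5, samples: int = 5,
                 interval_s: float = 1.0) -> List[str]:
    """Attach, dump the backtrace, detach; repeat `samples` times."""
    traces = []
    for _ in range(samples):
        proc = subprocess.run(gdb_bt_argv(pid, thread),
                              capture_output=True, text=True)
        traces.append(proc.stdout)
        time.sleep(interval_s)
    return traces
```

If the same frames show up in every sample, that thread is probably the one spinning. Each sample attaches and detaches gdb, which briefly stops the target process.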
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-575255673
Another failed build with this situation:
https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/jenkins-185-arm64/5/pipeline
----------------------------------------------------------------
[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
With the patch, we've gone into a hung beam again, but this time not 100% CPU.
https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
Erlang 21.3.8.12.
[ETA: Updated gist with gdb output from all the relevant threads.]
[ETA 2: Killed the `couchjs` process and the tests are still stuck. @davisp calls this a "smoking gun."]
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-574028139
This is 100% repeatable now on FreeBSD when a build is aborted. The Jenkins termination process leaves beam.smp jobs lying around.
This may be related to how Jenkins kills jobs on FreeBSD, or it could be something else.
I can try and force a core or something on the hung `beam.smp` if someone can help diagnose this.
----------------------------------------------------------------
[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576929600
Another test hang:
https://gist.github.com/wohali/5c903d20c023339c00a92ec322b9d1d2
Last test before hang:
```
chttpd_open_revs_error_test:105: should_return_503_error_for_open_revs_post_form...ok
```
After attaching gdb to the parent `beam.smp` and retrieving the backtrace, then detaching, the hang self-cleared (!)
The tests then continued and hung again after:
```
couch_mrview_collation_tests:183: should_use_collator_for_reduce_grouping...ok
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[done in 1.117 s]
[done in 1.272 s]
module 'couch_mrview_purge_docs_tests'
Map views
```
Results here: https://gist.github.com/wohali/2b6a8563658a468b5810a35b764f4a19
Detaching `gdb` from `beam.smp` didn't resume this time.
----------------------------------------------------------------
[GitHub] [couchdb] willholley commented on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by GitBox <gi...@apache.org>.
willholley commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-574328422
https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FPullRequests/detail/PR-2452/6/pipeline appears to have stalled - I'll restart it in the UK morning in case there's anything useful to be gleaned from it today.
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-575868479
Interesting one:
https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FPullRequests/detail/PR-2468/6/pipeline
@davisp this was on 22.2, SM60, x86_64...note the error -11 in test/javascript/tests/view_update_seq.js
----------------------------------------------------------------
[GitHub] [couchdb] dch commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
dch commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-577200406
no failures in ~6h of running `gmake eunit javascript` on latest OTP20, but OTP21 & OTP22 fail reliably.
I've noticed that erl_child_setup isn't actually hung: if you kill off disksup or cpusup, they'll be correctly restarted. So perhaps this is something specific to couchjs' heavy use of stdio?
----------------------------------------------------------------
[GitHub] [couchdb] wohali closed issue #2437: Hung beam.smp sitting at 100%
CPU
Posted by GitBox <gi...@apache.org>.
wohali closed issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-577282640
@dch In my tests, I was able to kill off disksup or cpusup and erl_child_setup would be hung. In another case I killed off erl_child_setup and beam.smp either hung or segfaulted; check the links above for details.
----------------------------------------------------------------
[GitHub] [couchdb] davisp commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
davisp commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576921440
Could be a smoking gun at least. My current theory is that there's some sort of messaging protocol between erl_child_setup and the Erlang VM that's getting confused. beam just sitting stuck waiting on a read of a port seems not good. Will have to read more on erl_child_setup to get a figure on what it might be and hopefully figure out a test case for upstream.
----------------------------------------------------------------
[GitHub] [couchdb] wohali edited a comment on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by GitBox <gi...@apache.org>.
wohali edited a comment on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576913743
With the patch, we've gone into a hung beam again, but this time not 100% CPU.
https://gist.github.com/wohali/5ef99e1e0ce9d06231f3b05395f4c902
Erlang 21.3.8.12.
[ETA: Updated gist with gdb output from all the relevant threads.]
[ETA 2: Killed the `couchjs` process and the tests are still stuck. @davisp calls this a "smoking gun."]
[ETA 3: Killed `erl_child_setup`, and it does not get reaped - zombie:
```
jenkins 15258 0.0 0.0 0 0 ? Zs 21:44 0:00 [erl_child_setup] <defunct>
```
@davisp thinks this is enough to show that this is a VM bug.]
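To check a worker for unreaped children like the one above, the process list can be filtered on its state column. A minimal sketch, assuming `ps aux`-style columns (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND); the `find_zombies` helper is hypothetical:

```python
from typing import Iterable, List, Tuple


def find_zombies(ps_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, command) for every line whose STAT starts with 'Z'."""
    zombies = []
    for line in ps_lines:
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something we cannot parse
        if fields[7].startswith("Z"):
            zombies.append((int(fields[1]), fields[10].strip()))
    return zombies
```

A zombie entry means the child has exited but its parent has not yet collected the exit status with a `wait()` call, which fits a parent `beam.smp` that is itself stuck.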
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576404587
@nickva sure looks like scheduler failure. I wish I knew why :(
I also want to be sure that the 100% hung beam.smp issue is the same as the "error 32" problem we're seeing on failing to launch `couchjs` correctly. They seem related...but can't be sure.
We looked at the source code history for 21.x and 22.x at https://github.com/erlang/otp/commits/master/erts/emulator/sys/unix/erl_child_setup.c and didn't see anything in those releases. There are more recent commits, but nothing released yet (per GH tags).
Here's the most recent commit, which landed in OTP-21.0.9. This would be the most likely culprit... https://github.com/erlang/otp/commit/da4c24bf8fc7bb2ee0d0a66d9fcfe6344d7c0c8a#diff-d0f4c5a298602460b21bc50914237807
```
erl_child_setup program ignores TERM signals as of ERTS version
10.0 (cff8dce). This setting was unfortunately inherited by
port programs. This commit restores handling of TERM signals
in port programs to the default behavior. That is, terminate the
process.
```
🤔
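The inheritance described in that commit message is general POSIX behaviour: an ignored signal stays ignored across `exec`. The sketch below reproduces it with plain Python processes rather than `erl_child_setup` itself; the function name and the child script are illustrative assumptions.

```python
import signal
import subprocess
import sys

# Child script: report which SIGTERM disposition it started with.
_CHILD = (
    "import signal\n"
    "h = signal.getsignal(signal.SIGTERM)\n"
    "print('SIG_IGN' if h == signal.SIG_IGN else\n"
    "      'SIG_DFL' if h == signal.SIG_DFL else 'other')\n"
)


def _reset_term() -> None:
    # Runs in the child after fork and before exec.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)


def child_term_disposition(restore_default: bool) -> str:
    """Ignore SIGTERM in this process, spawn a child, report what it sees."""
    previous = signal.signal(signal.SIGTERM, signal.SIG_IGN)
    try:
        out = subprocess.run(
            [sys.executable, "-c", _CHILD],
            preexec_fn=_reset_term if restore_default else None,
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    finally:
        # Put the parent's original handler back.
        signal.signal(signal.SIGTERM,
                      previous if previous is not None else signal.SIG_DFL)
```

Without the reset the child reports `SIG_IGN`, so a plain `kill` would not stop it; with the reset it reports `SIG_DFL`. This must be called from the main thread on a POSIX system.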
----------------------------------------------------------------
[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-577761557
```
[2020-01-23T15:08:02.163Z] [done in 0.468 s]
[2020-01-23T15:08:02.163Z] module 'mem3_reshard_test'
[2020-01-23T15:08:02.163Z] mem3 shard split db tests
[2020-01-23T15:08:03.263Z] mem3_reshard_test:86: split_one_shard...[0.459 s] ok
[2020-01-23T15:08:03.963Z] mem3_reshard_test:146: update_docs_before_topoff1...[0.466 s] ok
[2020-01-23T15:44:16.999Z] Sending interrupt signal to process
[2020-01-23T15:44:17.658Z] Sending interrupt signal to process
[2020-01-23T15:44:26.285Z] Terminated
[2020-01-23T15:44:26.285Z] make[1]: *** [Makefile:175: eunit] Terminated
[2020-01-23T15:44:26.285Z] make: *** [Makefile:153: check] Terminated
[2020-01-23T15:44:26.285Z] undefined
[2020-01-23T15:44:26.285Z] *** context setup failed ***
[2020-01-23T15:44:26.285Z] **in function mem3_reshard_test:with_proc/3 (test/eunit/mem3_reshard_test.erl, line 685)
[2020-01-23T15:44:26.285Z] in call from mem3_reshard_test:setup/0 (test/eunit/mem3_reshard_test.erl, line 35)
[2020-01-23T15:44:26.285Z] **error:{noproc,{gen_server,call,[mem3_nodes,get_nodelist]}}
[2020-01-23T15:44:26.285Z]
[2020-01-23T15:44:26.285Z]
[2020-01-23T15:44:26.285Z] undefined
[2020-01-23T15:44:26.285Z] *** context setup failed ***
[2020-01-23T15:44:26.285Z] **in function mem3_reshard_test:with_proc/3 (test/eunit/mem3_reshard_test.erl, line 685)
[2020-01-23T15:44:26.285Z] in call from mem3_reshard_test:setup/0 (test/eunit/mem3_reshard_test.erl, line 35)
[2020-01-23T15:44:26.285Z] **error:{noproc,{gen_server,call,[mem3_nodes,get_nodelist]}}
```
Another case of a stuck scheduler, I think. It looks like one test finished, then the next one was being set up. In the setup code we call `mem3_nodes:get_nodelist()`; that call, however, never returned, and everything was stuck waiting for it.
I clicked either stop or restart on `make check`. It then received a bunch of termination signals, which seem to have killed the mem3_nodes gen_server, and all the stuck waiting calls returned with noproc. Then the whole thing got torn down.
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576814116
Nick noticed this commit, which I'm going to try running in a loop locally and see if I can get a repro with the patch:
https://github.com/erlang/otp/commit/0109f91a76ea6873cf9851529d7c60b70f328f86
which has been ported as far back as 20.x, suggesting it's a fairly serious bug.
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576438585
If we need to split the issues, that's fine, but both are 3.0.0 blockers IMO.
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576434017
Another hang, and again in 22.2:
https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FPullRequests/detail/PR-2472/2/pipeline
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-574332938
```irc
14:20 <vatamane> i had managed to lock beam on a mac when it was running eunit tests and I
was compiling fdb from scratch, so all cpus were busy and memory usage
was high
14:20 <+Wohali> what version of beam
14:20 <+Wohali> fbsd is currently on 21.something, i have to check
14:21 <vatamane> backtraces showed only one scheduler (#1) spinning and most other waiting
on a condition variable
14:21 <vatamane> 21.3.8.4
14:21 <+Wohali> looks like fbsd is 21.3.8.11,4 so that lines up
14:22 <+Wohali> could this be another scheduler collapse?
14:23 <vatamane> since I saw the child_setup_failed which deals with forking ports
(processes) like couchjs and cleaning them up and so on
14:24 <+Wohali> i guess we'll have to go spelunk into recent erlang changes in all that
14:25 <+davisp> Wohali: Probably that yeah. Specifically I'd go look at the commit history
to that child_setup helper
14:26 <+davisp> https://github.com/erlang/otp/commits/master/erts/emulator/sys/unix/erl_child_setup.c
14:26 <+Wohali> ok, at least there's a direction. i'll paste this convo into the bug
14:26 <+Wohali> ty for the pointer
```
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576434680
Interestingly, the stream of `error 32` broken pipes seems to be predicated on this test failing:
```
::in function mem3_sync_event_listener:'-should_terminate/1-fun-3-'/0 (src/mem3_sync_event_listener.erl, line 312)
in call from mem3_sync_event_listener:'-should_terminate/1-fun-5-'/1 (src/mem3_sync_event_listener.erl, line 312)
**error:{assert,[{module,mem3_sync_event_listener},
{line,312},
{expression,"false"},
{expected,true},
{value,false}]}
```
----------------------------------------------------------------
Re: [GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by Joan Touzet <wo...@apache.org>.
Sorry, disregard...wrong destination.
On 2020-01-20 4:41 p.m., Joan Touzet wrote:
> If we need to split the issues, that's fine, but both are 3.0.0 blockers
> IMO.
Re: [GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp
sitting at 100% CPU
Posted by Joan Touzet <wo...@apache.org>.
If we need to split the issues, that's fine, but both are 3.0.0 blockers
IMO.
[GitHub] [couchdb] nickva commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
nickva commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-576435270
@wohali
> I also want to be sure that the 100% hung beam.smp issue is the same as the "error 32" problem
It might not be. I had only looked at it since I saw it happening before the "freeze". It might be either a coincidence, or the freeze also causes the child spawner process to die so the causation is the other way around.
----------------------------------------------------------------
[GitHub] [couchdb] wohali commented on issue #2437: Hung beam.smp sitting at
100% CPU
Posted by GitBox <gi...@apache.org>.
wohali commented on issue #2437: Hung beam.smp sitting at 100% CPU
URL: https://github.com/apache/couchdb/issues/2437#issuecomment-573098465
One note: the FreeBSD variant of this was seen during an eunit test of `couchjs`, though the error was:
```
couch_js_tests: couch_js_test_...out of memory
```
Output of `top` on the affected FreeBSD node:
```
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
1895 jenkins 24 52 0 1185M 71M select 1 663:13 99.73% beam.smp
27828 jenkins 24 52 0 1173M 63M select 3 521:32 99.57% beam.smp
31174 jenkins 24 52 0 1156M 68M select 1 503:54 98.30% beam.smp
```
So, not from the run that ran out of memory itself. [Looking back farther for builds on this node](https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/master/16/pipeline/50), I see we're going out to lunch during eunit pretty badly (pay close attention to the timestamps):
```
[2020-01-09T16:20:45.493Z] [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[2020-01-09T16:20:45.493Z] [done in 0.200 s]
[2020-01-09T16:20:45.493Z] Check index files cleanup
[2020-01-09T16:20:45.493Z] clustered
[2020-01-09T16:20:45.848Z] couchdb_mrview_tests:155: should_cleanup_index_files...[0.084 s] ok
[2020-01-09T16:20:45.848Z] [done in 0.102 s]
[2020-01-09T17:38:30.777Z] Sending interrupt signal to process
[2020-01-09T17:38:50.777Z] After 20s process did not stop
```
[Another run that failed similarly](https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FFullPlatformMatrix/detail/master/18/pipeline/50):
```
[2020-01-09T18:38:50.214Z] module 'couch_mrview_purge_docs_tests'
[2020-01-09T18:38:50.214Z] Map views
[2020-01-09T20:00:12.530Z] Sending interrupt signal to process
[2020-01-09T20:00:32.530Z] After 20s process did not stop
```
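Stalls like the two above can be picked out of a Jenkins console log automatically by looking for the largest jump between consecutive timestamps. A small sketch, assuming every relevant line starts with a `[YYYY-MM-DDTHH:MM:SS.mmmZ]` prefix as in these excerpts; the `largest_gap` helper is hypothetical:

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Matches the bracketed UTC timestamp at the start of a log line.
_TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\]")


def largest_gap(lines: Iterable[str]) -> Optional[Tuple[timedelta, str, str]]:
    """Return (gap, line_before, line_after) for the biggest silence."""
    best = None
    prev_ts = None
    prev_line = None
    for line in lines:
        match = _TS.match(line)
        if not match:
            continue  # untimestamped lines are skipped
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if prev_ts is not None:
            gap = ts - prev_ts
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best
```

On the first excerpt this picks out the jump from `16:20:45.848Z` to `17:38:30.777Z`, and the line before the gap shows the last test output printed before the stall.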
Version of Erlang on that node:
```
Erlang/OTP 21 [erts-10.3.5.7] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [dtrace]
```
Our FreeBSD workers still have SpiderMonkey 1.8.5, though [SM60 is now available](https://svnweb.freebsd.org/ports/head/lang/spidermonkey60/).
@davisp @jiangphcn I'm worried that all of these failures seem to be mrview related. With the `couchjs` SM60 changes recently, could the bug be in both the 1.8.5 and 60 versions, or how we handle the view process?