Posted to notifications@couchdb.apache.org by "Adam Spiers (JIRA)" <ji...@apache.org> on 2015/12/14 18:14:46 UTC

[jira] [Commented] (COUCHDB-1917) Killed silently after "heart-beat time-out"

    [ https://issues.apache.org/jira/browse/COUCHDB-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056303#comment-15056303 ] 

Adam Spiers commented on COUCHDB-1917:
--------------------------------------

I believe I have finally figured out the root cause of this.  It's the {{exec 1>&-}} line in the {{start_couchdb}} function in {{/usr/bin/couchdb}} ({{bin/couchdb.tpl.in}} in the source tree).  This closes STDOUT, so if the Erlang process crashes or is killed via {{couchdb -k}}, the script proceeds to the next bit of code, which attempts to {{cat}} the message {{Apache CouchDB crashed}} to STDOUT and fails because STDOUT is already closed.  At this point {{cat}} attempts to write {{cat: standard output: Bad file descriptor}} to STDERR, and the script then quits altogether due to the {{-e}} option at the top; however, no one ever sees that error because STDERR has also been closed by the {{exec 2>&-}} line.
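
For anyone who wants to see the failure mode in isolation, here is a minimal sketch.  It is not the real {{/usr/bin/couchdb}} script: {{run_daemon_wrapper}} and the marker file are hypothetical stand-ins, and I'm assuming a POSIX {{sh}} with a standard {{cat}}.

```shell
#!/bin/sh
# Minimal reproduction sketch; NOT the real couchdb launcher.
# run_daemon_wrapper is a hypothetical stand-in for start_couchdb.
run_daemon_wrapper() {
    # Run the body in a child shell so that -e only terminates the child.
    sh -e -c '
        exec 1>&-    # close STDOUT, as the couchdb script does
        exec 2>&-    # close STDERR as well
        # (the Erlang process would run and exit here)
        echo "Apache CouchDB crashed" | cat    # fails: STDOUT is closed
        touch "$1"   # never reached: -e aborts after the failed cat
    ' sh "$1"
}

tmp_dir=$(mktemp -d)
marker="$tmp_dir/reached"
status=0
run_daemon_wrapper "$marker" || status=$?

echo "wrapper exit status: $status"
[ -e "$marker" ] || echo "later steps never ran, and no error was shown"
```

The child shell exits non-zero and prints nothing at all, which matches the "silent death" symptom.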

This reinforces my belief that it's very often bad practice to close STDOUT/STDERR or redirect them to /dev/null since this can hide critical issues.

So a simple fix is to change those lines to 

{quote}
      exec 1>>/var/log/couchdb/daemon-stdout.log
      exec 2>>/var/log/couchdb/daemon-stderr.log
{quote}

or similar.  A worse solution would be to ditch the {{cat}} which outputs the crash message, but that sounds like a really bad idea to me. 
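
As a rough sketch of how that behaves (again not the real script), the example below redirects both streams to log files before doing anything that can fail.  The temporary directory stands in for {{/var/log/couchdb}}, the {{ls}} of a nonexistent path is just an arbitrary failing command I picked for illustration, and I've used plain append redirection ({{>>}}) on the assumption that the script runs under a POSIX {{sh}}.

```shell
#!/bin/sh
# Sketch of the proposed fix; the log directory is a temporary stand-in
# for /var/log/couchdb, and the failing ls is purely illustrative.
log_dir=$(mktemp -d)
fix_status=0

sh -e -c '
    exec 1>>"$1/daemon-stdout.log"   # keep STDOUT open, pointed at a log
    exec 2>>"$1/daemon-stderr.log"   # keep STDERR open, pointed at a log
    echo "Apache CouchDB crashed" | cat      # now succeeds
    ls /nonexistent-path-for-demo    # any later failure leaves a trace
' sh "$log_dir" || fix_status=$?

echo "exit status: $fix_status"
echo "stdout log: $(cat "$log_dir/daemon-stdout.log")"
echo "stderr log: $(cat "$log_dir/daemon-stderr.log")"
```

The script still exits on the first error because of {{-e}}, but this time the crash message and the error text both end up somewhere an admin can find them.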

Of course this doesn't solve the underlying problem that Erlang sometimes mysteriously hangs or dies, but at least it ensures the band-aid will work.

BTW whilst increasing the heartbeat timeout will decrease CouchDB's sensitivity to system-level "jitter", it will also mean it takes longer to recover when things really go wrong.

> Killed silently after "heart-beat time-out"
> -------------------------------------------
>
>                 Key: COUCHDB-1917
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1917
>             Project: CouchDB
>          Issue Type: Bug
>            Reporter: Nathan Vander Wilt
>
> Recently CouchDB has been prone to random quiet deaths. There is nothing in the stdout or .log files, only in stderr is a mention made of "no activity for N seconds" followed by "Killed".
> ```
> heart_beat_kill_pid = 32575
> heart_beat_timeout = 11
> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
> Killed
> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> heart_beat_kill_pid = 13781
> heart_beat_timeout = 11
> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
> Killed
> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> heart_beat_kill_pid = 15292
> heart_beat_timeout = 11
> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
> Killed
> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> heart_beat_kill_pid = 29158
> heart_beat_timeout = 11
> heart: Tue Oct 29 19:51:27 2013: heart-beat time-out, no activity for 18 seconds
> Killed
> heart: Tue Oct 29 19:51:31 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> ```
> I'm not the only one who's seen this, e.g. http://mail-archives.apache.org/mod_mbox/couchdb-user/201309.mbox/%3C20130915031459.GF2125@translab.its.uci.edu%3E
> My particular configuration is:
> Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-41-virtual x86_64)
> Erlang R16B01
> CouchDB 1.4.0
> [build-couchdb at 8ddd81c22179667c77146b2ec96d543fb95c804]


