You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@couchdb.apache.org by Nathan Vander Wilt <na...@calftrail.com> on 2013/10/29 17:42:06 UTC

Re: couchdb crashes silently

I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and keep pulling up my sites finding them dead too. Seems to be about the same thing as others are reporting in this old thread…was there any resolution?

This is not an OOM thing, in dmesg I do see some killed processes (node) but never couchdb/beam and NOTHING killed after I added swap some several days ago. CouchDB was dead again this morning.

The only trace of trouble in the logs is in couch.stderr:

```
heart_beat_kill_pid = 32575
heart_beat_timeout = 11
heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
Killed
heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 13781
heart_beat_timeout = 11
heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
Killed
heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 15292
heart_beat_timeout = 11
heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
Killed
heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 29158
heart_beat_timeout = 11
```

1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?

`erlang:display(erlang:system_info(otp_release)).`  says R15B

thanks,
-natevw



On Sep 13, 2013, at 3:20 PM, James Marca <jm...@translab.its.uci.edu> wrote:

> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
> 
> couchdb version 1.4.0 (gentoo ebuild)
> 
> erlang also from gentoo ebuild: 
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
> 
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> B.
> 
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
> 
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
> 
> The log looks like this:
> 
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
> 
> no warning that the server died or why, and nothing in the
> /var/log/messages about anything untoward  happening (no OOM killer
> invoked or anything like that)
> 
> The restart only happened because I manually did a 
> /etc/init.d/couchdb restart
> Usually couchdb restarts itself, but not with this crash.
> 
> 
> 
> I flipped the log to debug level, and still had no warning about the crash:
> 
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
>          {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>          {'Content-Length',"346"},
>          {'Content-Type',"application/json"},
>          {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>          {'User-Agent',"CouchDB/1.4.0"},
>          {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
> 
> And that was it.  CouchDB was down and out.
> 
> I even tried shutting off the data processing (so as to reduce the db
> load) on box B, but that didn't help (all the crashing has put it far
> behind in replicating box A and C).
> 
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.  
> 
> Any clues or suggestions would be appreciated.  I am currently going
> to try compling from source directly, but I don't have much faith that
> it will make a difference.
> 
> Thanks,
> James Marca
> 
> -- 
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
> 


Re: couchdb crashes silently

Posted by Nathan Vander Wilt <na...@calftrail.com>.
Yes, the bootstrap shell script is broken. I filed https://issues.apache.org/jira/browse/COUCHDB-1885 but that has a stupid title and doesn't quite capture how broken it is. Basically, some of the -k/-s logic got borked a while back and so IIRC you can't request a graceful restart of CouchDB via the shell script (you have to kill the beam process *yourself* and then the script will reload it).

That aside, I don't think that is related in this case. At least the last time this instance went down, the Erlang process _was still running_ just not accepting network connections. So from the shell script's perspective, it didn't see the need to restart.

hth,
-natevw


On Oct 31, 2013, at 9:30 PM, Jim Klo <ji...@sri.com> wrote:

> I noticed this myself (the bootstrap shell script not working). I vaguely recall that determining that the watchdog process doesn't correctly monitor the pid file. The logic in general was off - basically there's an edge condition not accounted for. I don't remember if I fixed the script or not - I'd have to hunt through my notes when I get back to a real computer. Something tells me I wrapped in a cron to clean up and restart as I was under a timeline before the world nearly came to an end earlier this month.
> 
> Jim Klo
> Senior Software Engineer
> SRI International
> t: @nsomnac
> 
> On Oct 31, 2013, at 5:19 PM, "Nathan Vander Wilt" <na...@calftrail.com>> wrote:
> 
> Okay, may have figured out why the shell script isn't restarting Couch. It seems it may not actually die all the way. I can't connect to it, but there is a process matching the pidfile:
> 
> 6417 ?        Sl    14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini -s couch -pidfile production_couch/couch.pid -heart
> 
> hth,
> -nvw
> 
> 
> 
> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt <na...@calftrail.com>> wrote:
> 
> Aaaand my Couch commited suicide again today. Unless this is something different, I may have finally gotten lucky and had CouchDB leave a note [eerily unfinished!] in the logs this time:
> https://gist.github.com/natevw/fd509978516499ba128b
> 
> ```
> ** Reason == {badarg,
>                [{io,put_chars,
>                     [<0.93.0>,unicode,
>                      <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
>                     []},
> ```
> 
> So…now what? I have a rebuilt version of CouchDB I'm going to try [once I figure out why *it* isn't starting] but this is still really upsetting — I'm aware I could add my own cronjob or something to check and restart if needed every minute, but a) the shell script is SUPPOSED to be keeping CouchDB and b) it's NOT and c) this is embarrassing and aggravating.
> 
> thanks,
> -natevw
> 
> 
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <na...@calftrail.com>> wrote:
> 
> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and keep pulling up my sites finding them dead too. Seems to be about the same thing as others are reporting in this old thread…was there any resolution?
> 
> This is not an OOM thing, in dmesg I do see some killed processes (node) but never couchdb/beam and NOTHING killed after I added swap some several days ago. CouchDB was dead again this morning.
> 
> The only trace of trouble in the logs is in couch.stderr:
> 
> ```
> heart_beat_kill_pid = 32575
> heart_beat_timeout = 11
> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
> Killed
> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 13781
> heart_beat_timeout = 11
> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
> Killed
> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 15292
> heart_beat_timeout = 11
> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
> Killed
> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 29158
> heart_beat_timeout = 11
> ```
> 
> 1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?
> 
> `erlang:display(erlang:system_info(otp_release)).`  says R15B
> 
> thanks,
> -natevw
> 
> 
> 
> On Sep 13, 2013, at 3:20 PM, James Marca <jm...@translab.its.uci.edu>> wrote:
> 
> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
> 
> couchdb version 1.4.0 (gentoo ebuild)
> 
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
> 
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> B.
> 
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
> 
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
> 
> The log looks like this:
> 
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
> 
> no warning that the server died or why, and nothing in the
> /var/log/messages about anything untoward  happening (no OOM killer
> invoked or anything like that)
> 
> The restart only happened because I manually did a
> /etc/init.d/couchdb restart
> Usually couchdb restarts itself, but not with this crash.
> 
> 
> 
> I flipped the log to debug level, and still had no warning about the crash:
> 
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
>       {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>       {'Content-Length',"346"},
>       {'Content-Type',"application/json"},
>       {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>       {'User-Agent',"CouchDB/1.4.0"},
>       {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
> 
> And that was it.  CouchDB was down and out.
> 
> I even tried shutting off the data processing (so as to reduce the db
> load) on box B, but that didn't help (all the crashing has put it far
> behind in replicating box A and C).
> 
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
> 
> Any clues or suggestions would be appreciated.  I am currently going
> to try compling from source directly, but I don't have much faith that
> it will make a difference.
> 
> Thanks,
> James Marca
> 
> --
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
> 
> 
> 
> 


Re: couchdb crashes silently

Posted by Dave Cottlehuber <dc...@jsonified.com>.
On 11. November 2013 at 23:10:38, Nathan Vander Wilt (nate-lists@calftrail.com) wrote:
>  
> Aaaaand this happened *again* over the weekend. This time I had  
> started CouchDB in a screen session, which was still running.  
> Again, it looked like both the shell script processes and the  
> beam one were both still running, just…no Couch.
>  
> I had debug logs going, the stdout records shows the logger dying  
> again but not with any unicode error type event, just the last  
> log:
> https://gist.github.com/natevw/dcd4a9a973da01270735  
>  
> There is some "heart: Sat Nov 9 08:35:30 2013: heart-beat time-out,  
> no activity for 26 seconds" in the stderr log but I'm not sure it's  
> related or not…there seem to be a few more heart-beat time-outs  
> than actual CouchDB server failures.

when heartbeat times out, the wrapper script kills & restarts BEAM - this is part of Erlang VM stuff.

> Any concrete suggestions…? This sucks. I'm burnt out poking  
> through debug logs on this, I'm embarrassed and angry every time  
> I discover my sites have been down for another day or two because  
> of this, and adding another layer of twine and baling wire in the  
> form of a _second_ shell watchdog script is not at all exciting  
> >:-(
>  
> regards,
> -natevw

Remotely it’s hard to offer useful help (so many possibilities) but:

heart timeouts:
- long-running NIFs can do this by blocking the scheduler; especially do you have any large JSON docs moving in or out? Its likely that reverting to R14B01 or B04 may resolve this, you’d need to rebuild couchdb though.
- possibly HIPE. You should be able to uninstall erlang-base-hipe and install erlang-base assuming its ubuntu (IIRC) to resolve that, rebooting couch reqd of course.
- there has been some mention of a possible resource leak in continuous replication (which you have) but I’d not expect it to hang the BEAM, `just` crash it.
- are there any erlang turds (erl_crash.dump) lying around? they contain some useful debug info.

resources:
You’ve not mentioned what’s the OS doing at this time — anything in /var/log/messages or dmesg or whatever? is this a lean VPS or a “real box” getting timeouts? Personally I’d install collectd or ganglia etc, pump general OS metrics out to graphite for comparing, and also start collecting some erlang vm metrics too, wrt to the possible resource leak.

I’ve not looked at recon for this http://ferd.github.io/recon/ but it could be useful; right now I’d pick https://github.com/jsonified/estatsd and send erlang vm stats to graphite, same as the OS stuff, and see what comes up when these issues occur.

A+
Dave


Re: couchdb crashes silently

Posted by Nathan Vander Wilt <na...@calftrail.com>.
Aaaaand this happened *again* over the weekend. This time I had started CouchDB in a screen session, which was still running. Again, it looked like both the shell script processes and the beam one were both still running, just…no Couch.

I had debug logs going, the stdout records shows the logger dying again but not with any unicode error type event, just the last log:
https://gist.github.com/natevw/dcd4a9a973da01270735

There is some "heart: Sat Nov  9 08:35:30 2013: heart-beat time-out, no activity for 26 seconds" in the stderr log but I'm not sure it's related or not…there seem to be a few more heart-beat time-outs than actual CouchDB server failures.

Any concrete suggestions…? This sucks. I'm burnt out poking through debug logs on this, I'm embarrassed and angry every time I discover my sites have been down for another day or two because of this, and adding another layer of twine and baling wire in the form of a _second_ shell watchdog script is not at all exciting >:-(

regards,
-natevw



On Nov 1, 2013, at 9:17 AM, Nathan Vander Wilt <na...@calftrail.com> wrote:

> 
> On Nov 1, 2013, at 12:10 AM, Dave Cottlehuber <dc...@jsonified.com> wrote:
> 
>>> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt > 
>>> wrote:
>>> 
>>> Aaaand my Couch commited suicide again today. Unless this is  
>>> something different, I may have finally gotten lucky and had  
>>> CouchDB leave a note [eerily unfinished!] in the logs this time:  
>>> https://gist.github.com/natevw/fd509978516499ba128b  
>>> 
>>> ```
>>> ** Reason == {badarg,
>>> [{io,put_chars,
>>> [<0.93.0>,unicode,
>>> <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216  
>>> - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor  
>>> 200\n">>],
>>> []},
>>> ```
>>> 
>>> So…now what? I have a rebuilt version of CouchDB I'm going to try  
>>> [once I figure out why *it* isn't starting] but this is still really  
>>> upsetting — I'm aware I could add my own cronjob or something to  
>>> check and restart if needed every minute, but a) the shell script  
>>> is SUPPOSED to be keeping CouchDB and b) it's NOT and c) this is  
>>> embarrassing and aggravating.
>>> 
>>> thanks,
>>> -natevw
>> 
>> So there’s 2 things here
>> 
>> - why the couch doesn’t get restarted?
>> 
>> Sounds very much like the afore mentioned pid race condition. Wendall do you know any more about this? I thought you had some ideas about it IIRC.
>> 
> 
> 
> I think I figured out the answer to this one, at least in the latest crash. The Erlang process the shell script watches was still running, just not accepting connections. I didn't notice this the previous times, though…I only realized it this time because when I went to restart the shell script acted like it was already running. So maybe there's actually two crashes, one silent heartbeat one and this unicode?
> 
> 
> 
>> - why io:putchars/2 has trouble writing to a boring log file, which obviously works most of the time.
>> 
>> <0.93.0>,unicode, <<"[Thu, 31 Oct 2013 19:48:48 GMT...”>>
>> 
>> io:put_chars(Fd, unicode, <<Binary>>) doesn’t look right — there’s no io:put_chars/3. 
>> 
>> This unicode looks weird and from a quick look I can’t see where it should come from.
>> 
>> Can you get more of the logfile (like hundreds of lines) and stick it somewhere? email is fine.
>> 
>> I’d like to see what happens to <0.93.0> (the process wrapping the log fd), and also if the unicode atom turns up anywhere else prior.
> 
> 
> You want more of the log *up to* the crash? Because I have nothing *beyond* what is in that gist, that's the thing! The end of the log was cut off, I did not snip it. The log as it sits now has these exact lines in it:
> 
> ```
>                             {line,173}]},
>                           {gen_event,ser
> Apache CouchDB 1.4.0 (LogLevel=info) is starting.
> ```
> 
> (The subsequent "starting" is due to my intervention.)
> 
> -nvw


Re: couchdb crashes silently

Posted by Nathan Vander Wilt <na...@calftrail.com>.
On Nov 1, 2013, at 12:10 AM, Dave Cottlehuber <dc...@jsonified.com> wrote:

>> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt > 
>> wrote:
>> 
>> Aaaand my Couch commited suicide again today. Unless this is  
>> something different, I may have finally gotten lucky and had  
>> CouchDB leave a note [eerily unfinished!] in the logs this time:  
>> https://gist.github.com/natevw/fd509978516499ba128b  
>> 
>> ```
>> ** Reason == {badarg,
>> [{io,put_chars,
>> [<0.93.0>,unicode,
>> <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216  
>> - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor  
>> 200\n">>],
>> []},
>> ```
>> 
>> So…now what? I have a rebuilt version of CouchDB I'm going to try  
>> [once I figure out why *it* isn't starting] but this is still really  
>> upsetting — I'm aware I could add my own cronjob or something to  
>> check and restart if needed every minute, but a) the shell script  
>> is SUPPOSED to be keeping CouchDB and b) it's NOT and c) this is  
>> embarrassing and aggravating.
>> 
>> thanks,
>> -natevw
> 
> So there’s 2 things here
> 
> - why the couch doesn’t get restarted?
> 
> Sounds very much like the afore mentioned pid race condition. Wendall do you know any more about this? I thought you had some ideas about it IIRC.
> 


I think I figured out the answer to this one, at least in the latest crash. The Erlang process the shell script watches was still running, just not accepting connections. I didn't notice this the previous times, though…I only realized it this time because when I went to restart the shell script acted like it was already running. So maybe there's actually two crashes, one silent heartbeat one and this unicode?



> - why io:putchars/2 has trouble writing to a boring log file, which obviously works most of the time.
> 
> <0.93.0>,unicode, <<"[Thu, 31 Oct 2013 19:48:48 GMT...”>>
> 
> io:put_chars(Fd, unicode, <<Binary>>) doesn’t look right — there’s no io:put_chars/3. 
> 
> This unicode looks weird and from a quick look I can’t see where it should come from.
> 
> Can you get more of the logfile (like hundreds of lines) and stick it somewhere? email is fine.
> 
> I’d like to see what happens to <0.93.0> (the process wrapping the log fd), and also if the unicode atom turns up anywhere else prior.


You want more of the log *up to* the crash? Because I have nothing *beyond* what is in that gist, that's the thing! The end of the log was cut off, I did not snip it. The log as it sits now has these exact lines in it:

```
                             {line,173}]},
                           {gen_event,ser
Apache CouchDB 1.4.0 (LogLevel=info) is starting.
```

(The subsequent "starting" is due to my intervention.)

-nvw




Re: couchdb crashes silently

Posted by Nathan Vander Wilt <na...@calftrail.com>.
On Nov 1, 2013, at 7:08 AM, Dave Cottlehuber <dc...@jsonified.com> wrote:

> On 01. November 2013 at 08:10:22, Dave Cottlehuber (dch@jsonified.com) wrote:
>> 
>>> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt >
>>> wrote:
>> 
>> So there’s 2 things here
>> 
>> - why the couch doesn’t get restarted?
>> 
>> Sounds very much like the afore mentioned pid race condition.  
>> Wendall do you know any more about this? I thought you had some  
>> ideas about it IIRC.
>> 
>> - why io:putchars/2 has trouble writing to a boring log file,  
>> which obviously works most of the time.
>> 
>> <0.93.0>,unicode, <<"[Thu, 31 Oct 2013 19:48:48 GMT...”>>  
>> 
>> io:put_chars(Fd, unicode, <>) doesn’t look right  
>> — there’s no io:put_chars/3.
>> 
>> This unicode looks weird and from a quick look I can’t see where  
>> it should come from.
> 
> Hey Nate,
> 
> I think from unicode:characters_to_binary actually.
> 
> BTW what erlang are you running here, and what couch version?

This is CouchDB 1.4.0 on Erlang R16B01 (build-couchdb at 8ddd81c22179667c77146b2ec96d543fb95c8043 which seems to still be the latest)
-nvw



Re: couchdb crashes silently

Posted by Dave Cottlehuber <dc...@jsonified.com>.
On 01. November 2013 at 08:10:22, Dave Cottlehuber (dch@jsonified.com) wrote:
>  
> > On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt >
> > wrote:
>
> So there’s 2 things here
>  
> - why the couch doesn’t get restarted?
>  
> Sounds very much like the afore mentioned pid race condition.  
> Wendall do you know any more about this? I thought you had some  
> ideas about it IIRC.
>  
> - why io:putchars/2 has trouble writing to a boring log file,  
> which obviously works most of the time.
>  
> <0.93.0>,unicode, <<"[Thu, 31 Oct 2013 19:48:48 GMT...”>>  
>  
> io:put_chars(Fd, unicode, <>) doesn’t look right  
> — there’s no io:put_chars/3.
>  
> This unicode looks weird and from a quick look I can’t see where  
> it should come from.

Hey Nate,

I think from unicode:characters_to_binary actually.

BTW what erlang are you running here, and what couch version?

> Can you get more of the logfile (like hundreds of lines) and stick  
> it somewhere? email is fine.
>  
> I’d like to see what happens to <0.93.0> (the process wrapping  
> the log fd), and also if the unicode atom turns up anywhere else  
> prior.
>  
> A+
> Dave




Re: couchdb crashes silently

Posted by Dave Cottlehuber <dc...@jsonified.com>.
> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt > 
> wrote:
>  
> Aaaand my Couch commited suicide again today. Unless this is  
> something different, I may have finally gotten lucky and had  
> CouchDB leave a note [eerily unfinished!] in the logs this time:  
> https://gist.github.com/natevw/fd509978516499ba128b  
>  
> ```
> ** Reason == {badarg,
> [{io,put_chars,
> [<0.93.0>,unicode,
> <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216  
> - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor  
> 200\n">>],
> []},
> ```
>  
> So…now what? I have a rebuilt version of CouchDB I'm going to try  
> [once I figure out why *it* isn't starting] but this is still really  
> upsetting — I'm aware I could add my own cronjob or something to  
> check and restart if needed every minute, but a) the shell script  
> is SUPPOSED to be keeping CouchDB and b) it's NOT and c) this is  
> embarrassing and aggravating.
>  
> thanks,
> -natevw

So there’s 2 things here

- why the couch doesn’t get restarted?

Sounds very much like the afore mentioned pid race condition. Wendall do you know any more about this? I thought you had some ideas about it IIRC.

- why io:putchars/2 has trouble writing to a boring log file, which obviously works most of the time.

<0.93.0>,unicode, <<"[Thu, 31 Oct 2013 19:48:48 GMT...”>>

io:put_chars(Fd, unicode, <<Binary>>) doesn’t look right — there’s no io:put_chars/3. 

This unicode looks weird and from a quick look I can’t see where it should come from.

Can you get more of the logfile (like hundreds of lines) and stick it somewhere? email is fine.

I’d like to see what happens to <0.93.0> (the process wrapping the log fd), and also if the unicode atom turns up anywhere else prior.

A+
Dave



Re: couchdb crashes silently

Posted by Jim Klo <ji...@sri.com>.
I noticed this myself (the bootstrap shell script not working). I vaguely recall that determining that the watchdog process doesn't correctly monitor the pid file. The logic in general was off - basically there's an edge condition not accounted for. I don't remember if I fixed the script or not - I'd have to hunt through my notes when I get back to a real computer. Something tells me I wrapped in a cron to clean up and restart as I was under a timeline before the world nearly came to an end earlier this month.

Jim Klo
Senior Software Engineer
SRI International
t: @nsomnac

On Oct 31, 2013, at 5:19 PM, "Nathan Vander Wilt" <na...@calftrail.com>> wrote:

Okay, may have figured out why the shell script isn't restarting Couch. It seems it may not actually die all the way. I can't connect to it, but there is a process matching the pidfile:

6417 ?        Sl    14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini -s couch -pidfile production_couch/couch.pid -heart

hth,
-nvw



On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt <na...@calftrail.com>> wrote:

Aaaand my Couch commited suicide again today. Unless this is something different, I may have finally gotten lucky and had CouchDB leave a note [eerily unfinished!] in the logs this time:
https://gist.github.com/natevw/fd509978516499ba128b

```
** Reason == {badarg,
                [{io,put_chars,
                     [<0.93.0>,unicode,
                      <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
                     []},
```

So…now what? I have a rebuilt version of CouchDB I'm going to try [once I figure out why *it* isn't starting] but this is still really upsetting — I'm aware I could add my own cronjob or something to check and restart if needed every minute, but a) the shell script is SUPPOSED to be keeping CouchDB and b) it's NOT and c) this is embarrassing and aggravating.

thanks,
-natevw


On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <na...@calftrail.com>> wrote:

I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and keep pulling up my sites finding them dead too. Seems to be about the same thing as others are reporting in this old thread…was there any resolution?

This is not an OOM thing, in dmesg I do see some killed processes (node) but never couchdb/beam and NOTHING killed after I added swap some several days ago. CouchDB was dead again this morning.

The only trace of trouble in the logs is in couch.stderr:

```
heart_beat_kill_pid = 32575
heart_beat_timeout = 11
heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
Killed
heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 13781
heart_beat_timeout = 11
heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
Killed
heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 15292
heart_beat_timeout = 11
heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
Killed
heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 29158
heart_beat_timeout = 11
```

1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?

`erlang:display(erlang:system_info(otp_release)).`  says R15B

thanks,
-natevw



On Sep 13, 2013, at 3:20 PM, James Marca <jm...@translab.its.uci.edu>> wrote:

I am seeing a lot of random, silent crashes on just *one* of my
CouchDB servers.

couchdb version 1.4.0 (gentoo ebuild)

erlang also from gentoo ebuild:
Erlang (BEAM) emulator version 5.10.2
Compiled on Fri Sep 13 08:39:20 2013
Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
[async-threads:10] [kernel-poll:false]

I've got 3 servers running couchdb, A, B, C, and only B is crashing.
All of them are replicating a single db between them, with B acting as
the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
B.

All three servers have data crunching jobs running that are reading
and writing to the database that is being replicated around.

The B server, the one in the middle that is push replicating to both A
and C, is the one that is crashing.

The log looks like this:

[Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
[Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).

no warning that the server died or why, and nothing in the
/var/log/messages about anything untoward  happening (no OOM killer
invoked or anything like that)

The restart only happened because I manually did a
/etc/init.d/couchdb restart
Usually couchdb restarts itself, but not with this crash.



I flipped the log to debug level, and still had no warning about the crash:

[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
Headers: [{'Accept',"application/json"},
       {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
       {'Content-Length',"346"},
       {'Content-Type',"application/json"},
       {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
       {'User-Agent',"CouchDB/1.4.0"},
       {"X-Couch-Full-Commit","false"}]
[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes

And that was it.  CouchDB was down and out.

I even tried shutting off the data processing (so as to reduce the db
load) on box B, but that didn't help (all the crashing has put it far
behind in replicating box A and C).

My guess is that the replication load is too big (too many
connections, too much data being pushed in), but I would expect some
sort of warning before the server dies.

Any clues or suggestions would be appreciated.  I am currently going
to try compling from source directly, but I don't have much faith that
it will make a difference.

Thanks,
James Marca

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.





Re: couchdb crashes silently

Posted by Nathan Vander Wilt <na...@calftrail.com>.
Okay, may have figured out why the shell script isn't restarting Couch. It seems it may not actually die all the way. I can't connect to it, but there is a process matching the pidfile:

 6417 ?        Sl    14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini -s couch -pidfile production_couch/couch.pid -heart

hth,
-nvw



On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt <na...@calftrail.com> wrote:

> Aaaand my Couch commited suicide again today. Unless this is something different, I may have finally gotten lucky and had CouchDB leave a note [eerily unfinished!] in the logs this time:
> https://gist.github.com/natevw/fd509978516499ba128b
> 
> ```
> ** Reason == {badarg,
>                  [{io,put_chars,
>                       [<0.93.0>,unicode,
>                        <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
>                       []},
> ```
> 
> So…now what? I have a rebuilt version of CouchDB I'm going to try [once I figure out why *it* isn't starting] but this is still really upsetting — I'm aware I could add my own cronjob or something to check and restart if needed every minute, but a) the shell script is SUPPOSED to be keeping CouchDB and b) it's NOT and c) this is embarrassing and aggravating.
> 
> thanks,
> -natevw
> 
> 
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <na...@calftrail.com> wrote:
> 
>> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and keep pulling up my sites finding them dead too. Seems to be about the same thing as others are reporting in this old thread…was there any resolution?
>> 
>> This is not an OOM thing, in dmesg I do see some killed processes (node) but never couchdb/beam and NOTHING killed after I added swap some several days ago. CouchDB was dead again this morning.
>> 
>> The only trace of trouble in the logs is in couch.stderr:
>> 
>> ```
>> heart_beat_kill_pid = 32575
>> heart_beat_timeout = 11
>> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
>> Killed
>> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>> 
>> heart_beat_kill_pid = 13781
>> heart_beat_timeout = 11
>> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
>> Killed
>> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>> 
>> heart_beat_kill_pid = 15292
>> heart_beat_timeout = 11
>> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
>> Killed
>> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>> 
>> heart_beat_kill_pid = 29158
>> heart_beat_timeout = 11
>> ```
>> 
>> 1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
>> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?
>> 
>> `erlang:display(erlang:system_info(otp_release)).`  says R15B
>> 
>> thanks,
>> -natevw
>> 
>> 
>> 
>> On Sep 13, 2013, at 3:20 PM, James Marca <jm...@translab.its.uci.edu> wrote:
>> 
>>> I am seeing a lot of random, silent crashes on just *one* of my
>>> CouchDB servers.
>>> 
>>> couchdb version 1.4.0 (gentoo ebuild)
>>> 
>>> erlang also from gentoo ebuild: 
>>> Erlang (BEAM) emulator version 5.10.2
>>> Compiled on Fri Sep 13 08:39:20 2013
>>> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
>>> [async-threads:10] [kernel-poll:false]
>>> 
>>> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
>>> All of them are replicating a single db between them, with B acting as
>>> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
>>> B.
>>> 
>>> All three servers have data crunching jobs running that are reading
>>> and writing to the database that is being replicated around.
>>> 
>>> The B server, the one in the middle that is push replicating to both A
>>> and C, is the one that is crashing.
>>> 
>>> The log looks like this:
>>> 
>>> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
>>> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
>>> 
>>> no warning that the server died or why, and nothing in the
>>> /var/log/messages about anything untoward  happening (no OOM killer
>>> invoked or anything like that)
>>> 
>>> The restart only happened because I manually did a 
>>> /etc/init.d/couchdb restart
>>> Usually couchdb restarts itself, but not with this crash.
>>> 
>>> 
>>> 
>>> I flipped the log to debug level, and still had no warning about the crash:
>>> 
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
>>> Headers: [{'Accept',"application/json"},
>>>         {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>>>         {'Content-Length',"346"},
>>>         {'Content-Type',"application/json"},
>>>         {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>>>         {'User-Agent',"CouchDB/1.4.0"},
>>>         {"X-Couch-Full-Commit","false"}]
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
>>> 
>>> And that was it.  CouchDB was down and out.
>>> 
>>> I even tried shutting off the data processing (so as to reduce the db
>>> load) on box B, but that didn't help (all the crashing has put it far
>>> behind in replicating box A and C).
>>> 
>>> My guess is that the replication load is too big (too many
>>> connections, too much data being pushed in), but I would expect some
>>> sort of warning before the server dies.  
>>> 
>>> Any clues or suggestions would be appreciated.  I am currently going
>>> to try compling from source directly, but I don't have much faith that
>>> it will make a difference.
>>> 
>>> Thanks,
>>> James Marca
>>> 
>>> -- 
>>> This message has been scanned for viruses and
>>> dangerous content by MailScanner, and is
>>> believed to be clean.
>>> 
>> 
> 


Re: couchdb crashes silently

Posted by Nathan Vander Wilt <na...@calftrail.com>.
Aaaand my Couch commited suicide again today. Unless this is something different, I may have finally gotten lucky and had CouchDB leave a note [eerily unfinished!] in the logs this time:
https://gist.github.com/natevw/fd509978516499ba128b

```
** Reason == {badarg,
                 [{io,put_chars,
                      [<0.93.0>,unicode,
                       <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
                      []},
```

So…now what? I have a rebuilt version of CouchDB I'm going to try [once I figure out why *it* isn't starting] but this is still really upsetting — I'm aware I could add my own cronjob or something to check and restart if needed every minute, but a) the shell script is SUPPOSED to be keeping CouchDB and b) it's NOT and c) this is embarrassing and aggravating.

thanks,
-natevw


On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <na...@calftrail.com> wrote:

> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and keep pulling up my sites finding them dead too. Seems to be about the same thing as others are reporting in this old thread…was there any resolution?
> 
> This is not an OOM thing, in dmesg I do see some killed processes (node) but never couchdb/beam and NOTHING killed after I added swap some several days ago. CouchDB was dead again this morning.
> 
> The only trace of trouble in the logs is in couch.stderr:
> 
> ```
> heart_beat_kill_pid = 32575
> heart_beat_timeout = 11
> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
> Killed
> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 13781
> heart_beat_timeout = 11
> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
> Killed
> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 15292
> heart_beat_timeout = 11
> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
> Killed
> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 29158
> heart_beat_timeout = 11
> ```
> 
> 1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?
> 
> `erlang:display(erlang:system_info(otp_release)).`  says R15B
> 
> thanks,
> -natevw
> 
> 
> 
> On Sep 13, 2013, at 3:20 PM, James Marca <jm...@translab.its.uci.edu> wrote:
> 
>> I am seeing a lot of random, silent crashes on just *one* of my
>> CouchDB servers.
>> 
>> couchdb version 1.4.0 (gentoo ebuild)
>> 
>> erlang also from gentoo ebuild: 
>> Erlang (BEAM) emulator version 5.10.2
>> Compiled on Fri Sep 13 08:39:20 2013
>> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
>> [async-threads:10] [kernel-poll:false]
>> 
>> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
>> All of them are replicating a single db between them, with B acting as
>> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
>> B.
>> 
>> All three servers have data crunching jobs running that are reading
>> and writing to the database that is being replicated around.
>> 
>> The B server, the one in the middle that is push replicating to both A
>> and C, is the one that is crashing.
>> 
>> The log looks like this:
>> 
>> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
>> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
>> 
>> no warning that the server died or why, and nothing in the
>> /var/log/messages about anything untoward  happening (no OOM killer
>> invoked or anything like that)
>> 
>> The restart only happened because I manually did a 
>> /etc/init.d/couchdb restart
>> Usually couchdb restarts itself, but not with this crash.
>> 
>> 
>> 
>> I flipped the log to debug level, and still had no warning about the crash:
>> 
>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
>> Headers: [{'Accept',"application/json"},
>>         {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>>         {'Content-Length',"346"},
>>         {'Content-Type',"application/json"},
>>         {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>>         {'User-Agent',"CouchDB/1.4.0"},
>>         {"X-Couch-Full-Commit","false"}]
>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
>> 
>> And that was it.  CouchDB was down and out.
>> 
>> I even tried shutting off the data processing (so as to reduce the db
>> load) on box B, but that didn't help (all the crashing has put it far
>> behind in replicating box A and C).
>> 
>> My guess is that the replication load is too big (too many
>> connections, too much data being pushed in), but I would expect some
>> sort of warning before the server dies.  
>> 
>> Any clues or suggestions would be appreciated.  I am currently going
>> to try compling from source directly, but I don't have much faith that
>> it will make a difference.
>> 
>> Thanks,
>> James Marca
>> 
>> -- 
>> This message has been scanned for viruses and
>> dangerous content by MailScanner, and is
>> believed to be clean.
>> 
>