Posted to user@couchdb.apache.org by "[mRg]" <em...@gmail.com> on 2010/07/10 11:03:45 UTC

CouchDB Crashing under high CPU wait

Hi all,

I wonder if any of you can help me with a problem that has been plaguing us
for months.

We have CouchDB running on 3 virtualised (VMware) RHEL5 servers, with each
instance of CouchDB replicating to the others. The disks for these virtual
machines sit on top of an HP EVA / SAN.

We are noticing that sometimes the disk latency on the SAN can get high
(when a lot of snapshots/backups are occurring), which causes high CPU
wait (>4.0) on the machines. At that point CouchDB seems to fail. There is no
error message in the logs, it just stops without warning.

We've turned full debug logging on and there is nothing in any of the logs
showing any kind of couchdb error but we're seeing all 3 machines fail at
the same time which leads us to believe it may be due to the cpu wait / disk
IO latency issue. We also have Solr running on these boxes and it doesn't
seem to be affected by this.

I was wondering if anyone has seen this issue before and if there was
anything else we can try (other than monitoring to see if the process is not
running and restart it). Could some other process on RHEL be killing off the
couch process ?

We have tried with 0.10.1 / 0.10.2 and have now upgraded Erlang and are
running with 0.11 but still having exactly the same issues.

Any help would be appreciated as we go live at the end of the month after a
year in development and this is the only outstanding issue and it
is perplexing to say the least.

Regards

Stephen

Re: CouchDB Crashing under high CPU wait

Posted by "[mRg]" <em...@gmail.com>.

Thanks all .. going to go away and try these various options :)

On 13 July 2010 08:43, Jan Lehnardt <ja...@apache.org> wrote:

>
> On 13 Jul 2010, at 09:30, Dirkjan Ochtman wrote:
>
> > On Mon, Jul 12, 2010 at 23:47, J Chris Anderson <jc...@gmail.com> wrote:
> >> This is usually a symptom that Erlang is unable to allocate memory. When that happens, it just goes away, no logs, no nothing.
> >>
> >> It could be the disk wait is causing mochiweb to continue to accept sockets, and allocate processes to handle the connections, until at some point there is no memory left to allocate.
> >>
> >> One option is to configure the max_connections # to be smaller.
> >>
> >> But I am just stabbing in the dark here as to the cause of the memory over-usage.
> >>
> >> And yes it could be the OS killing Couch for other reasons.
> >>
> >> There is a heartbeat option which ought to be the robust fix for this (it will reboot couch automatically). Someone else on this list will know better than I, how to ensure that it runs.
> >
> > We've also been seeing this with one of our CouchDB servers at work.
> > It seems to die when a (rsync) backup process gets kicked off by cron.
> >
> > I've verified that it's not the Linux OOM killer killing CouchDB, so I
> > would like to hear more about the heartbeat option thing.
>
> read up on `couchdb -r` in `couchdb -h` :)
>
> Cheers
> Jan
> --
>
>

Re: CouchDB Crashing under high CPU wait

Posted by Jan Lehnardt <ja...@apache.org>.

On 13 Jul 2010, at 09:30, Dirkjan Ochtman wrote:

> On Mon, Jul 12, 2010 at 23:47, J Chris Anderson <jc...@gmail.com> wrote:
>> This is usually a symptom that Erlang is unable to allocate memory. When that happens, it just goes away, no logs, no nothing.
>> 
>> It could be the disk wait is causing mochiweb to continue to accept sockets, and allocate processes to handle the connections, until at some point there is no memory left to allocate.
>> 
>> One option is to configure the max_connections # to be smaller.
>> 
>> But I am just stabbing in the dark here as to the cause of the memory over-usage.
>> 
>> And yes it could be the OS killing Couch for other reasons.
>> 
>> There is a heartbeat option which ought to be the robust fix for this (it will reboot couch automatically). Someone else on this list will know better than I, how to ensure that it runs.
> 
> We've also been seeing this with one of our CouchDB servers at work.
> It seems to die when a (rsync) backup process gets kicked off by cron.
> 
> I've verified that it's not the Linux OOM killer killing CouchDB, so I
> would like to hear more about the heartbeat option thing.

read up on `couchdb -r` in `couchdb -h` :)
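
For anyone who wants to see the idea behind a respawn option before digging through `couchdb -h`, here is a minimal Python sketch of a supervisor loop: run a command, wait for it to exit, pause, start it again. It only illustrates the concept and is not how `couchdb -r` is implemented. The function name `supervise` and its parameters are made up for this example.

```python
import subprocess
import sys
import time


def supervise(cmd, respawn_delay=5.0, max_restarts=None, sleep=time.sleep):
    """Run cmd and start it again each time it exits.

    Returns the list of exit codes seen, oldest first. max_restarts=None
    loops forever, which is what a real supervisor would normally do.
    """
    exit_codes = []
    while True:
        # Blocks until the child exits, whether cleanly or by being killed.
        code = subprocess.call(cmd)
        exit_codes.append(code)
        restarts_so_far = len(exit_codes) - 1
        if max_restarts is not None and restarts_so_far >= max_restarts:
            return exit_codes
        # Pause before respawning so a crash loop does not spin the CPU.
        sleep(respawn_delay)
```

A hypothetical call would be `supervise(["couchdb"], respawn_delay=5)`, assuming the launcher stays in the foreground when started that way. A negative exit code from `subprocess.call` means the child was ended by a signal, which is worth logging when the process disappears without writing anything itself.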

Cheers
Jan
--

Re: CouchDB Crashing under high CPU wait

Posted by Dirkjan Ochtman <di...@ochtman.nl>.

On Mon, Jul 12, 2010 at 23:47, J Chris Anderson <jc...@gmail.com> wrote:
> This is usually a symptom that Erlang is unable to allocate memory. When that happens, it just goes away, no logs, no nothing.
>
> It could be the disk wait is causing mochiweb to continue to accept sockets, and allocate processes to handle the connections, until at some point there is no memory left to allocate.
>
> One option is to configure the max_connections # to be smaller.
>
> But I am just stabbing in the dark here as to the cause of the memory over-usage.
>
> And yes it could be the OS killing Couch for other reasons.
>
> There is a heartbeat option which ought to be the robust fix for this (it will reboot couch automatically). Someone else on this list will know better than I, how to ensure that it runs.

We've also been seeing this with one of our CouchDB servers at work.
It seems to die when a (rsync) backup process gets kicked off by cron.

I've verified that it's not the Linux OOM killer killing CouchDB, so I
would like to hear more about the heartbeat option thing.
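
In case it helps others rule the OOM killer in or out, below is a rough Python sketch that scans kernel log lines for OOM kills of the Erlang VM. The message pattern is an assumption: the exact wording differs between kernel versions, so adjust the regex to what your kernel prints. `find_oom_kills` is a name invented for this example, and `beam` is the usual process name of the Erlang VM (`beam.smp` on SMP builds).

```python
import re

# Assumed message shape, e.g. "Out of memory: Killed process 1234 (beam)".
# Older and newer kernels phrase this differently, so treat it as a guess.
OOM_LINE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines, name_prefix="beam"):
    """Return (pid, process_name) for each OOM kill whose name matches."""
    hits = []
    for line in log_lines:
        match = OOM_LINE.search(line)
        if match and match.group(2).startswith(name_prefix):
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

On a RHEL-style box the input would typically be the lines of `/var/log/messages` or the output of `dmesg`. An empty result only shows that the pattern did not match, so it is worth eyeballing the log for "Out of memory" as well.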

Cheers,

Dirkjan

Re: CouchDB Crashing under high CPU wait

Posted by Mikeal Rogers <mi...@gmail.com>.

We saw this a bunch at Mozilla using CentOS on VMware.

If you don't install all the VMware tools there are cases when it will just
kill the CouchDB erlang process for some unknown reason. No logs, no
nothing, and it happens even when it's not under load.

-Mikeal

On Mon, Jul 12, 2010 at 2:47 PM, J Chris Anderson <jc...@gmail.com> wrote:

>
> On Jul 10, 2010, at 2:03 AM, [mRg] wrote:
>
> > Hi all,
> >
> > I wonder if any of you can help me with a problem that has been plaguing us
> > for months.
> >
> > We have CouchDB running on 3 virtualised (VMware) RHEL5 servers, with each
> > instance of CouchDB replicating to the others. The disks for these virtual
> > machines sit on top of an HP EVA / SAN.
> >
> > We are noticing that sometimes the disk latency on the SAN can get high
> > (when a lot of snapshots/backups are occurring), which causes high CPU
> > wait (>4.0) on the machines. At that point CouchDB seems to fail. There is no
> > error message in the logs, it just stops without warning.
> >
>
> This is usually a symptom that Erlang is unable to allocate memory. When
> that happens, it just goes away, no logs, no nothing.
>
> It could be the disk wait is causing mochiweb to continue to accept sockets,
> and allocate processes to handle the connections, until at some point there
> is no memory left to allocate.
>
> One option is to configure the max_connections # to be smaller.
>
> But I am just stabbing in the dark here as to the cause of the memory
> over-usage.
>
> And yes it could be the OS killing Couch for other reasons.
>
> There is a heartbeat option which ought to be the robust fix for this (it
> will reboot couch automatically). Someone else on this list will know better
> than I, how to ensure that it runs.
>
> - Chris
>
> > We've turned full debug logging on and there is nothing in any of the logs
> > showing any kind of couchdb error but we're seeing all 3 machines fail at
> > the same time which leads us to believe it may be due to the cpu wait / disk
> > IO latency issue. We also have Solr running on these boxes and it doesn't
> > seem to be affected by this.
> >
> > I was wondering if anyone has seen this issue before and if there was
> > anything else we can try (other than monitoring to see if the process is not
> > running and restart it). Could some other process on RHEL be killing off the
> > couch process ?
> >
> > We have tried with 0.10.1 / 0.10.2 and have now upgraded Erlang and are
> > running with 0.11 but still having exactly the same issues.
> >
> > Any help would be appreciated as we go live at the end of the month after a
> > year in development and this is the only outstanding issue and it
> > is perplexing to say the least.
> >
> > Regards
> >
> > Stephen
>
>

Re: CouchDB Crashing under high CPU wait

Posted by J Chris Anderson <jc...@gmail.com>.

On Jul 10, 2010, at 2:03 AM, [mRg] wrote:

> Hi all,
> 
> I wonder if any of you can help me with a problem that has been plaguing us
> for months.
> 
> We have CouchDB running on 3 virtualised (VMware) RHEL5 servers, with each
> instance of CouchDB replicating to the others. The disks for these virtual
> machines sit on top of an HP EVA / SAN.
> 
> We are noticing that sometimes the disk latency on the SAN can get high
> (when a lot of snapshots/backups are occurring), which causes high CPU
> wait (>4.0) on the machines. At that point CouchDB seems to fail. There is no
> error message in the logs, it just stops without warning.
> 

This is usually a symptom that Erlang is unable to allocate memory. When that happens, it just goes away, no logs, no nothing.

It could be the disk wait is causing mochiweb to continue to accept sockets, and allocate processes to handle the connections, until at some point there is no memory left to allocate.

One option is to configure the max_connections # to be smaller.

But I am just stabbing in the dark here as to the cause of the memory over-usage.

And yes it could be the OS killing Couch for other reasons.

There is a heartbeat option which ought to be the robust fix for this (it will reboot couch automatically). Someone else on this list will know better than I, how to ensure that it runs.

- Chris
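
To make the max_connections suggestion above a bit more concrete, here is a small Python sketch of what a connection cap does in principle: once a fixed number of handlers are in flight, new work is refused instead of piling up and consuming memory. This is only an illustration of the idea and says nothing about how mochiweb enforces its limit internally. `ConnectionLimiter` is a hypothetical class written for this example.

```python
import threading


class ConnectionLimiter:
    """Refuse new work once max_connections handlers are in flight."""

    def __init__(self, max_connections):
        # BoundedSemaphore raises ValueError if released more often than
        # acquired, which catches bookkeeping mistakes early.
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: returns False instead of queueing when full.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()
```

A server loop would call `try_acquire()` before spawning a handler and `release()` when the handler finishes. With slow disks, handlers finish late, so the cap is reached sooner and further connections are turned away, which keeps memory use bounded.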

> We've turned full debug logging on and there is nothing in any of the logs
> showing any kind of couchdb error but we're seeing all 3 machines fail at
> the same time which leads us to believe it may be due to the cpu wait / disk
> IO latency issue. We also have Solr running on these boxes and it doesn't
> seem to be affected by this.
> 
> I was wondering if anyone has seen this issue before and if there was
> anything else we can try (other than monitoring to see if the process is not
> running and restart it). Could some other process on RHEL be killing off the
> couch process ?
> 
> We have tried with 0.10.1 / 0.10.2 and have now upgraded Erlang and are
> running with 0.11 but still having exactly the same issues.
> 
> Any help would be appreciated as we go live at the end of the month after a
> year in development and this is the only outstanding issue and it
> is perplexing to say the least.
> 
> Regards
> 
> Stephen