Posted to user@couchdb.apache.org by Tayven Bigelow <tb...@mobileaccord.com> on 2017/01/30 23:15:14 UTC

Crashing due to memory use

Hey Guys!


Been using a 12-server CouchDB 2.0 cluster for a while now and have noticed a memory leak that causes beam.smp to crash while populating views.

The q/r/w/n is set up as:

[cluster]
q=12
r=2
w=2
n=3

As far as I know each server should be able to handle the load, as each has 64GB of RAM and a Core i7 6700. We are running Ubuntu 16.04.1.

The Database is 16.5 GB in size.


I've also attempted to run 2.0 with Dreyfus and Clouseau and ran into the same issue with a Database size of 7.8MB.


I've noted that in previous releases some people have run into similar memory issues with beam.smp, and increasing the open file limit was part of the resolution. We've increased the nofile limit for the couchdb user to 4096 (as found here: https://wiki.apache.org/couchdb/Performance ) with no luck.
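
(For reference, a sketch of how the limit was raised, assuming it is applied via pam_limits and /etc/security/limits.conf:)

  # /etc/security/limits.conf
  couchdb  soft  nofile  4096
  couchdb  hard  nofile  4096

  $ ulimit -n    # verify from a fresh login as the couchdb user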


Nothing out of the ordinary is thrown in the logs. The only way to catch it is by watching memory use.


I'm wondering if there's a configuration setting somewhere that I'm missing that could be causing this issue.


Thanks!

Tayven



All information in this message is confidential and may be legally privileged. If you are not the intended recipient, notify the sender immediately and destroy this email.

Re: Crashing due to memory use

Posted by Tayven Bigelow <tb...@mobileaccord.com>.
Hi Joan and Jan,


Since our last chat I've implemented a few more adjustments, which helped, but it still ends up happening on the larger views.

Based on this, we've set up monit to check the pid every few seconds and restart couchdb, to keep things somewhat "stable" across our load-balanced set (a sketch of the rule is below). Unfortunately this means that the views take much longer to build.
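
(A minimal monit sketch; the pidfile and init script paths here are assumed:)

  check process couchdb with pidfile /var/run/couchdb/couchdb.pid
    start program = "/etc/init.d/couchdb start"
    stop program  = "/etc/init.d/couchdb stop"
    # monit restarts the process automatically when the pid disappears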


I've adjusted the +Q and the +A to be more reasonable for our setup. The full command is:

/home/couchdb/couchdb/bin/../erts-7.3/bin/beam.smp -K true -Q 1024 -A 128 -Bd \
  -- -root /home/couchdb/couchdb/bin/.. -progname couchdb -- -home /home/couchdb -- \
  -boot /home/couchdb/couchdb/bin/../releases/2.0.0/couchdb \
  -name <<servername>> -setcookie <<cookie>> \
  -kernel error_logger silent -sasl sasl_error_logger false \
  -kernel inet_dist_listen_min 9000 -kernel inet_dist_listen_max 9500 \
  -noshell -noinput -config /home/couchdb/couchdb/bin/../releases/2.0.0/sys.config

I've also adjusted the oom_score_adj to -700 for beam.smp, which lets it run longer before hitting an OOM.
In addition to that the current .ini config is here: https://gist.github.com/anonymous/67697b63a08a536f1902282674b811fb
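
(The oom_score_adj change itself is just a write to procfs, e.g. the line below; note it has to be re-applied each time monit restarts the process:)

  $ echo -700 | sudo tee /proc/$(pgrep -x beam.smp)/oom_score_adj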

Is there anything else you can think of to check? Based on other users' experiences, it seems like this is a one-off issue.

Thanks!
Tayven

All information in this message is confidential and may be legally privileged. If you are not the intended recipient, notify the sender immediately and destroy this email.

Re: Crashing due to memory use

Posted by Tayven Bigelow <tb...@mobileaccord.com>.
Oh jeeze. I'm so sorry Joan! I saw the J in the name and just ran with it.


Some additional info I've gotten since yesterday. With Joan's recommendation to swap ERL_MAX_PORTS for +Q, I decided to check into other Erlang performance gains that I might have been missing. I've since increased the asynchronous I/O threads from 16 to 600. With this running we are able to process one view without it crashing; however, when I start running more than one view, memory use climbs and it eventually crashes.


> This is unusual. And you don't see similar high memory use for couchjs processes?

Correct. Looking at the tree info via htop, the beam.smp processes are the only ones using a high amount of memory.


> Using netstat, how many active connections do you have open on a server when beam.smp is eating lots of RAM? On which ports? A summary report would be useful.

These were taken while part of the cluster was crashing, and the current server was using all of its RAM and 50% of its swap.
netstat -s results in:
https://gist.github.com/anonymous/e240084267242763caf85ea37704cfe7

netstat -apA inet | grep beam.smp  results in:
https://gist.github.com/anonymous/86e8a4a3943d81ed74f7a6124ce79288

> How many databases do you actually have on the machine? You mention two databases, but it's unclear to me how many are actually resident. This includes any backup copies of your database you may have in place visible to CouchDB.
At the moment, including _global_changes, admin, _users, _replicator, and _metadata, we have 11 databases. We hope to replicate our current Cloudant setup, which would bring that to 185 databases.

> Is this a view you've ever run in production on Cloudant, or something new you're trying only on your local instance? Is this view perhaps using the experimental nodejs view server?
This view is something that we've run for a while. It began in a locally hosted CouchDB 1.6.1 instance before being run in Cloudant. It's not using the nodejs view server; we only included that to try the Clouseau/Dreyfus full-text search, and at the moment it's not being utilized.

> Are you launching couchdb with any special flags being passed to the Erlang erl process besides ERL_MAX_PORTS?
Prior to this response, the only change from stock 2.0.0 was forcing the distribution port range. I've since increased the asynchronous I/O threads to 600.
The full arguments currently being passed from vm.args are:
-name <nodename> , -setcookie <cookie>, -kernel error_logger silent, -sasl sasl_error_logger false, -kernel inet_dist_listen_max 9500, -kernel inet_dist_listen_min 9000, +K true, +Bd -noinput, +A 600, +Q 4096
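
(For reference, the same flags as they would sit in vm.args, one per line:)

  -name <nodename>
  -setcookie <cookie>
  -kernel error_logger silent
  -sasl sasl_error_logger false
  -kernel inet_dist_listen_max 9500
  -kernel inet_dist_listen_min 9000
  +K true
  +Bd
  -noinput
  +A 600
  +Q 4096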

> monitor output of erlang:memory().
-- Note that this is after changing to +Q and increasing +A --
https://gist.github.com/anonymous/720f3e1aa8f444fafe586111bfa81cc4

> Output of ets:i().

At the Start (no high memory use)
 https://gist.github.com/anonymous/20e7fa4387d6d116f2cafc378cf9c9b0

During high memory use:
https://gist.github.com/anonymous/5aa63ec31261402f5750d681d842229b


> and, if you add Fred's recon to your install

> I recommend reducing or disabling swap for best performance.

> Another option is to edit the oom adjuster for beam.smp versus other processes, such as couchjs. This is done through the tunable /proc/<pid>/oom_score_adj, setting the value strongly negative for beam.smp (range is -1000 to +1000) and setting it mildly higher for couchjs.


I'll work on adding these in :)
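
(For the recon part, a minimal sketch of getting it onto the node's code path; the checkout location is assumed:)

  $ git clone https://github.com/ferd/recon
  $ cd recon && mkdir -p ebin && erlc -o ebin src/*.erl

  then from the remsh:

  1> code:add_patha("/home/couchdb/recon/ebin").
  2> recon:proc_count(memory, 3).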


We're currently working on building a DB with non-sensitive material in case you would like to try it.


Thanks for the recommended reading material as well; if it isn't obvious, I haven't had much experience with Erlang.


-Tayven


All information in this message is confidential and may be legally privileged. If you are not the intended recipient, notify the sender immediately and destroy this email.

Re: Crashing due to memory use

Posted by Joan Touzet <wo...@apache.org>.
Hi Tayven,

> Jan,

Joan, actually. Jan is also on this thread. :)

A few things stand out here. I'm going to heavily trim your
emails for clarity.

----- Original Message -----

> At the time of crash the kernel is reporting that beam.smp is
> consuming 62G of memory + 32G of swap.

This is unusual. And you don't see similar high memory use for
couchjs processes?

I recommend reducing or disabling swap for best performance. 

Another option is to edit the oom adjuster for beam.smp versus
other processes, such as couchjs. This is done through the
tunable /proc/<pid>/oom_score_adj, setting the value strongly
negative for beam.smp (range is -1000 to +1000) and setting it
mildly higher for couchjs. Documentation for this is at

https://www.kernel.org/doc/Documentation/filesystems/proc.txt

in section 3.1.

> In Local.ini the changes from the base file are:
[snip]

>  max_connections = 1024
Presuming this is in your [httpd] section, it won't have much
effect, since this only affects the old interface (running on 
port 5986).

Using netstat, how many active connections do you have open on a
server when beam.smp is eating lots of RAM? On which ports? A
summary report would be useful.
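
For example, something along these lines would do (exact flags
vary by netstat version):

  $ netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c
  $ netstat -anp 2>/dev/null | grep beam.smp | awk '{print $4}' | sort | uniq -c

The first counts connections by state; the second counts beam.smp
sockets by local address and port.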

>  max_dbs_open = 500

How many databases do you actually have on the machine? You
mention two databases, but it's unclear to me how many are actually
resident. This includes any backup copies of your database you may
have in place visible to CouchDB.

>  nodejs = /usr/local/bin/node /home/couchdb/couchdb/share/server/main.js

Apache CouchDB considers this view server experimental. You run it
at your own risk. Though, if this was at fault, I'd expect to see
nodejs processes consuming more RAM and CPU resources than beam.smp
itself. Also, you'd have to be declaring your view's language as
nodejs instead of javascript, which you're not doing per your sample
design document.

> The memory leak happens when we kick off a new view.

Is this a view you've ever run in production on Cloudant, or
something new you're trying only on your local instance? Is this
view perhaps using the experimental nodejs view server?

-----

Are you launching couchdb with any special flags being passed to
the Erlang erl process besides ERL_MAX_PORTS?

Note that in recent versions of Erlang, ERL_MAX_PORTS has been 
replaced by the +Q flag. ERL_MAX_PORTS has no effect on these
newer versions. Check the documentation for your specific version
of Erlang.

Recommendation:

If you're going to be running a big cluster on your own, read Fred
Hebert's great free book Stuff Goes Bad: Erlang in Anger.

  http://www.erlang-in-anger.com/

and pay special attention to chapters 4, 5 & 7. Specifically, if
you can get on the node during periods of high memory usage with a
remsh:

$ erl -setcookie <cookie> -name tayven@localhost \
  -remsh couchdb@localhost -hidden

and at least monitor the output of:

  1> erlang:memory().
  2> ets:i().

and, if you add Fred's recon to your install,

  3> recon:proc_count(memory, 3).
  4> recon:proc_count(binary_memory, 3).

we'll know more.

We don't have a smoking gun yet, but hopefully with more data, we
can help you narrow in on one.

-Joan

Re: Crashing due to memory use

Posted by Tayven Bigelow <tb...@mobileaccord.com>.
Jan,

The nodes each have 62G of memory and 32G of swap space.

At the time of crash the kernel is reporting that beam.smp is consuming 62G of memory + 32G of swap.

The OS settings I've changed (which might be related) are:

Soft/Hard nofile limit = 4096

net.core.somaxconn = 1024

net.core.netdev_max_backlog = 2000

net.ipv4.tcp_max_syn_backlog = 2048

ERL_MAX_PORTS = 4096
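
(The sysctl values would typically be persisted like this; file location assumed:)

  # /etc/sysctl.conf
  net.core.somaxconn = 1024
  net.core.netdev_max_backlog = 2000
  net.ipv4.tcp_max_syn_backlog = 2048

  $ sudo sysctl -p    # apply without a reboot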


In default.ini the changes from the base file are:

 q=12
 r=2
 w=2
 n=3

In local.ini the changes from the base file are:
 credentials = true
 max_connections = 1024
 uuid = 712c6af6adce6e9ea43868cd7f78b35f
 max_dbs_open = 500
 allow_jsonp = true
 delayed_commits = false
 enable_cors = true
 nodejs = /usr/local/bin/node /home/couchdb/couchdb/share/server/main.js
 os_process_limit = 900
 require_valid_user = true
 [cors]
 credentials = true
 max_connections = 1024
 [log]
 file = /home/couchdb/couchdb/var/log/couchdb.log
 level = debug
 writer = file
 [compaction]
 _default = [{db_fragmentation, "40%"}, {view_fragmentation, "40%"}, {from, "22:00"}, {to, "06:00"}]

This is consistent across all 12 nodes that we have running.

I've tried running it without the OS changes as well, and it still crashes with an OOM.


-Tayven

All information in this message is confidential and may be legally privileged. If you are not the intended recipient, notify the sender immediately and destroy this email.

Re: Crashing due to memory use

Posted by Joan Touzet <wo...@apache.org>.
Tayven,

Thanks for the info.

How much RAM is in this node? Do you know approximately how much RAM the beam.smp process is consuming when the oom-killer takes action? Have you changed any settings in default.ini/local.ini?

-Joan

----- Original Message -----
> From: "Tayven Bigelow" <tb...@mobileaccord.com>
> To: "Jan Lehnardt" <ja...@apache.org>, user@couchdb.apache.org
> Cc: "Nick Becker" <ni...@mobileaccord.com>
> Sent: Tuesday, January 31, 2017 12:49:11 PM
> Subject: Re: Crashing due to memory use
> 
> Hey Jan!
> 
> 
> You'd be correct on the multiple postings; we weren't sure they were
> going through.
> 
> We currently run this in production on Cloudant and were hoping to
> have a backup utilizing the new CouchDB 2.0. We are able to
> consistently replicate.
> 
> The memory leak happens when we kick off a new view.
> beam.smp terminates on an OOM kill by the kernel.
> 
> Checking /var/log/syslog shows:
> Jan 31 18:32:44 couchdb7 kernel: [594086.565577] Out of memory: Kill
> process 23731 (beam.smp) score 961 or sacrifice child
> Jan 31 18:32:44 couchdb7 kernel: [594086.565622] Killed process 23773
> (memsup) total-vm:4228kB, anon-rss:12kB, file-rss:0kB
> Jan 31 18:32:44 couchdb7 kernel: [594086.569327] Out of memory: Kill
> process 23731 (beam.smp) score 961 or sacrifice child
> Jan 31 18:32:44 couchdb7 kernel: [594086.569392] Killed process 23731
> (beam.smp) total-vm:126594220kB, anon-rss:64708732kB, file-rss:0kB
> Jan 31 18:32:56 couchdb7 monit[9113]: 'couchdb' process is not
> running
> 
> The couchdb.log file at the time of crash contains:
> 
> 1981936-[debug] 2017-01-31T17:16:35.355774Z
> couchdb@couchdb7.geopoll.com <0.9036.262> -------- OS Process
> #Port<0.63437> Input  ::
> ["map_doc",{"_id":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6","_rev":"5-b90c6c87a0a48e647528a1b3c5bfe12b","MetaData":{"PollId":"147402","Car
> rierId":"25504","UserPollStateId":"3362564708"},"UserId":"1002449829201","CreateDate":"2015-11-23T06:42:40.0285675Z","LastModifiedDate":"2015-11-23T06:43:07.5474967Z","SystemSource":"GeoPoll","AttemptCount":1,"BillingIdentifier":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6
> ","CallbackUri":"http://de-geopoll-1:8645/billingcallback","CallbackSent":true,"Activities":[{"MetaData":{},"CreateDate":"2015-11-23T06:42:59.0297329Z","State":"PROCESSING"},{"MetaData":{},"CreateDate":"2015-11-23T06:42:59.0307329Z","State":"SUCCESS"}],"Currency":"US_Dol
> lar_USD","ConsumerIdentifier":"250025308","ToBeBilledIdentifier":"255763398389","BillType":"Carrier","BillProcessingStateAsString":"SUCCESS","Value":0.11,"BillProcessingState":"SUCCESS","BillingProvider":"TRANSFERTO","NextProcessingTime":"0001-01-01T00:00:00","NextProces
> singTimeAsLong":0,"Id":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6","CreatedDate":"2015-11-23T06:42:40.0285675Z","ModifiedDate":"2015-11-23T06:43:07.5474967Z","Type":"Bill"}]
> 1981937-[debug] 2017-01-31T17:16:35.355856Z
> couchdb@couchdb7.geopoll.com <0.11910.262> -------- OS Process
> #Port<0.63508> Output ::
> [[[["GeoPoll","8921801"],null]],[[["77802","PRETUPS"],null]],[[["77802","PRETUPS","SUCCESS","2014","03","05"],null],[["ALL","PRETUPS","SUCC
> ESS","2014","03","05"],null],[["77802","ALL","SUCCESS","2014","03","05"],null],[["77802","PRETUPS","ALL","2014","03","05"],null],[["ALL","ALL","SUCCESS","2014","03","05"],null],[["ALL","PRETUPS","ALL","2014","03","05"],null],[["77802","ALL","ALL","2014","03","05"],null],
> [["ALL","ALL","ALL","2014","03","05"],null]],[[["77802","2014","3","05"],null]],[["254788760292",null]],[[["PRETUPS","25402","2014-03-05T12:48:59.5664722Z"],43]],[[["PRETUPS","2014-03-05T12:48:59.5664722Z"],43]],[[["PRETUPS","SUCCESS","2014-03-05T12:48:59.5664722Z"],null
> ]],[[["PRETUPS","25402","SUCCESS","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS","25402","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS"],null]],[["254788760292",null]],[["1000374925501",null]],[[[2014,3,5,"PRE
> TUPS","SUCCESS"],null]]]
> 1981938-[debug] 2017-01-31T17:16:35.356012Z
> couchdb@couchdb7.geopoll.com <0.9036.262> -------- OS Process
> #Port<0.63437> Output ::
> [[[["147402","TRANSFERTO","SUCCESS"],null]],[[["TRANSFERTO","SUCCESS","2015-11-23T06:43:07.5474967Z"],null]],[[["TRANSFERTO","SUCCESS","0001
> -01-01T00:00:00"],null]]]
> 1981939-[debug] 2017-01-31T17:16:35.356108Z
> couchdb@couchdb7.geopoll.com <0.11910.262> -------- OS Process
> #Port<0.63508> Input  ::
> ["map_doc",{"_id":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","_rev":"3-832e63f45b45d5e3008b7e7bbe2b7392","MetaData":{"PollId":"77802","CarrierId":"25402","UserPollStateId":"3256532401","CarrierName":"Airtel-Kenya","Pretups.Version":"5.1","Pretups.Uri":"https://41.223.56.108:8093/pretups/C2SReceiver","Auth.Login":"pretups","Auth.Password":"0971500a350af5c3d1c0b12221a0558c","Auth.GatewayCode":"EXTGW","Auth.GatewayType":"EXTGW","Auth.ServicePort":"190","Auth.SourceType":"EXT","Cmd.ExtNwCode":"KE","Cmd.Msisdn":"732810086","Cmd.Pin":"2549","Cmd.Login":"","Cmd.Password":"","Cmd.ExtCode":"2468","CountryCode":"254","MobilePhoneLength":"9","TestMobileNumber":"254733621719","Currency":"KES"},"UserId":"1000277123401","CreateDate":"2014-03-05T13:45:49.6889321Z","LastModifiedDate":"2014-03-05T13:46:14.8050931Z","SystemSource":"GeoPoll","AttemptCount":1,"BillingIdentifier":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","CallbackUri":"http://uk-app-3:8645/billingcallback","Activities":[{"CreateDate":"2014-03-05T13:46:14.2902898Z","State":"PROCESSING"},{"MetaData":{"Type":"EXRCTRFRESP","Txnid":"R140305.1648.210003","Txnstatus":"200","Date":"05/03/2014
> 16:48:40","Extrefnum":"","Data":null},"CreateDate":"2014-03-05T13:46:14.2912898Z","State":"SUCCESS"}],"Currency":"Kenyan_Shilling_KES","ConsumerIdentifier":"8963201","ToBeBilledIdentifier":"254735960469","BillType":"Carrier","BillProcessingStateAsString":"SUCCESS","Value":43.0,"BillProcessingState":"SUCCESS","BillingProvider":"PRETUPS","NextProcessingTime":"0001-01-01T00:00:00","NextProcessingTimeAsLong":0,"Id":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","CreatedDate":"2014-03-05T13:45:49.6889321Z","ModifiedDate":"2014-03-05T13:46:14.8050931Z","Type":"Bill"}]
> 1981940:[debug] 2017-01-31T17:32:57.300061Z
> couchdb@couchdb7.geopoll.com <0.111.0> -------- Supervisor
> couch_log_sup started couch_log_monitor:start_link() at pid
> <0.114.0>
> 1981941:[debug] 2017-01-31T17:32:57.301585Z
> couchdb@couchdb7.geopoll.com <0.111.0> -------- Supervisor
> couch_log_sup started config_listener_mon:start_link(couch_log_sup,
> nil) at pid <0.115.0>
> 1981942:[info] 2017-01-31T17:32:57.301605Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application couch_log
> started on node 'couchdb@couchdb7.geopoll.com'
> 1981943:[debug] 2017-01-31T17:32:57.302447Z
> couchdb@couchdb7.geopoll.com <0.119.0> -------- Supervisor
> folsom_sup started folsom_sample_slide_sup:start_link() at pid
> <0.120.0>
> 1981944:[debug] 2017-01-31T17:32:57.303229Z
> couchdb@couchdb7.geopoll.com <0.119.0> -------- Supervisor
> folsom_sup started folsom_meter_timer_server:start_link() at pid
> <0.121.0>
> 1981945:[debug] 2017-01-31T17:32:57.303979Z
> couchdb@couchdb7.geopoll.com <0.119.0> -------- Supervisor
> folsom_sup started folsom_metrics_histogram_ets:start_link() at pid
> <0.122.0>
> 1981946:[info] 2017-01-31T17:32:57.304074Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application folsom
> started on node 'couchdb@couchdb7.geopoll.com'
> 1981947:[debug] 2017-01-31T17:32:57.325716Z
> couchdb@couchdb7.geopoll.com <0.126.0> -------- Supervisor
> couch_stats_sup started couch_stats_aggregator:start_link() at pid
> <0.127.0>
> 1981948:[debug] 2017-01-31T17:32:57.326519Z
> couchdb@couchdb7.geopoll.com <0.126.0> -------- Supervisor
> couch_stats_sup started couch_stats_process_tracker:start_link() at
> pid <0.177.0>
> 1981949:[info] 2017-01-31T17:32:57.326595Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application
> couch_stats started on node 'couchdb@couchdb7.geopoll.com'
> 1981950:[info] 2017-01-31T17:32:57.326673Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application khash
> started on node 'couchdb@couchdb7.geopoll.com'
> 1981951:[debug] 2017-01-31T17:32:57.330327Z
> couchdb@couchdb7.geopoll.com <0.182.0> -------- Supervisor
> couch_event_sup2 started couch_event_server:start_link() at pid
> <0.183.0>
> 1981952:[debug] 2017-01-31T17:32:57.331211Z
> couchdb@couchdb7.geopoll.com <0.185.0> -------- Supervisor
> couch_event_os_sup started
> config_listener_mon:start_link(couch_event_os_sup, nil) at pid
> <0.186.0>
> 1981953:[debug] 2017-01-31T17:32:57.331268Z
> couchdb@couchdb7.geopoll.com <0.182.0> -------- Supervisor
> couch_event_sup2 started couch_event_os_sup:start_link() at pid
> <0.185.0>
> 1981954:[info] 2017-01-31T17:32:57.331367Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application
> couch_event started on node 'couchdb@couchdb7.geopoll.com'
> 1981955:[debug] 2017-01-31T17:32:57.334167Z
> couchdb@couchdb7.geopoll.com <0.190.0> -------- Supervisor
> ibrowse_sup started ibrowse:start_link() at pid <0.191.0>
> 1981956:[info] 2017-01-31T17:32:57.334239Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application ibrowse
> started on node 'couchdb@couchdb7.geopoll.com'
> 1981957:[debug] 2017-01-31T17:32:57.335727Z
> couchdb@couchdb7.geopoll.com <0.196.0> -------- Supervisor ioq_sup
> started config_listener_mon:start_link(ioq_sup, nil) at pid
> <0.197.0>
> 1981958:[debug] 2017-01-31T17:32:57.336685Z
> couchdb@couchdb7.geopoll.com <0.196.0> -------- Supervisor ioq_sup
> started ioq:start_link() at pid <0.198.0>
> 1981959:[info] 2017-01-31T17:32:57.336756Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application ioq
> started on node 'couchdb@couchdb7.geopoll.com'
> 1981960:[info] 2017-01-31T17:32:57.336829Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application mochiweb
> started on node 'couchdb@couchdb7.geopoll.com'
> 1981961:[info] 2017-01-31T17:32:57.336899Z
> couchdb@couchdb7.geopoll.com <0.7.0> -------- Application oauth
> started on node 'couchdb@couchdb7.geopoll.com'
> 1981962:[info] 2017-01-31T17:32:57.340965Z
> couchdb@couchdb7.geopoll.com <0.204.0> -------- Apache CouchDB 2.0.0
> is starting.
> 
> 
> 
> For the large database it would happen when we kicked off 1 out of
> the 39 views; on the smaller database I would have to kick off all 5
> views within the database.
> The large database has 9 design documents, with the smaller database
> having only 1.
> The views are all JS.
> Other than Fail2Ban, UFW, Logwatch, LogRotate, Monit and Zabbix-Agent
> there is nothing else running on the server, except when we build it
> with Dreyfus and Clouseau.
> 
> Example of one of the larger Design documents:
> {
>   "_id": "_design/bills",
>   "_rev": "4-b0ed6cf8f871391add5004f7e67bc3a8",
>   "language": "javascript",
>   "auto_update": true,
>   "views": {
>     "by_bill_date_and_bill_provider": {
>       "map": "function(doc) {\n  if (doc._id.indexOf(\"bill-\") ===
>       0){\n      var date = new
>       Date(doc.CreatedDate?doc.CreatedDate:doc.CreateDate);\n
>            var year = date.getFullYear();\n      var month =
>       (date.getMonth() + 1);\n      var day = date.getDate();\n
>            emit([year, month, day, doc.BillingProvider,
>       doc.BillProcessingState], null);\n  }\n}",
>       "reduce": "_count"
>     },
>     "by_poll_id_and_bill_date": {
>       "map": "function(doc) {\n  if ((doc._id.indexOf(\"bill-\") ===
>       0) && doc.MetaData.PollId){\n    var date = new
>       Date(doc.CreateDate);\n    var year =
>       date.getFullYear().toString();\n    var month =
>       (date.getMonth() + 1).toString();\n    var day =
>       date.getDate().toString();\n    if (day.length == 1){\n
>            day = \"0\" + day;\n    }\n\n
>          emit([doc.MetaData.PollId, year, month, day], null);\n
>        }\n}",
>       "reduce": "_count"
>     }
>   }
> }
> 
> Example of a doc within the larger database:
> {
>   "_id": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
>   "_rev": "5-b40e00a54059c6c79004c0afd584fc60",
>   "MetaData": {
>     "PollId": "1844608",
>     "CarrierId": "2701",
>     "UserPollStateId": "12614468108"
>   },
>   "UserId": "1002196088104",
>   "CreateDate": "2017-01-31T07:20:58",
>   "LastModifiedDate": "2017-01-31T07:21:14.2473555Z",
>   "SystemSource": "GeoPoll",
>   "AttemptCount": 1,
>   "BillingIdentifier": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
>   "CallbackUri": "http://XXXXXXXXXXX:8645/billingcallback",
>   "CallbackSent": true,
>   "Activities": [
>     {
>       "MetaData": {},
>       "CreateDate": "2017-01-31T07:21:11.182049Z",
>       "State": "PROCESSING"
>     },
>     {
>       "MetaData": {
>         "VoucherPin": "",
>         "OrderRef": "113234210",
>         "TicketNumber": "",
>         "BoxNumber": "",
>         "BatchNumber": "",
>         "ProcessingTime": "3064.3064"
>       },
>       "CreateDate": "2017-01-31T07:21:11.1820491Z",
>       "State": "SUCCESS"
>     }
>   ],
>   "Currency": "South_African_Rand_ZAR",
>   "ConsumerIdentifier": "XXXXXXXXXXXX",
>   "ToBeBilledIdentifier": "XXXXXXXXXXXX",
>   "BillType": "Carrier",
>   "BillProcessingStateAsString": "SUCCESS",
>   "Value": 2,
>   "BillProcessingState": "SUCCESS",
>   "BillingProvider": "VODACOMSA",
>   "NextProcessingTime": "0001-01-01T00:00:00",
>   "NextProcessingTimeAsLong": 0,
>   "FinalProcessingTime": 0,
>   "LastSubmittedDate": "0001-01-01T00:00:00",
>   "Id": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
>   "CreatedDate": "2017-01-31T07:20:58",
>   "ModifiedDate": "2017-01-31T07:21:14.2473555Z",
>   "Type": "Bill"
> }
> 
> Docs usually go through 4-5 updates before they are finalized.
> Within the larger database we have 16,201,998 docs totaling 23 GB. No
> attachments.
> 
> No other traffic besides a single user (me), including replication.
> No other patterns that stand out (to me at least). The memory usage
> grows and grows before eventually consuming the swap space and
> running into an OOM kill.
> 
> The other 11 nodes are affected as well.
> 
> Thanks for your assistance!!
> 
> -Tayven
> 
> ________________________________
> From: Jan Lehnardt <ja...@apache.org>
> Sent: Tuesday, January 31, 2017 4:38 AM
> To: user@couchdb.apache.org
> Cc: Tayven Bigelow; Nick Becker
> Subject: Re: Crashing due to memory use
> 
> Heya Nick and Tayven,
> 
> I assume you posted multiple times because your mails didn’t show up
> immediately due to mailing list moderation.
> 
> You are correct that the database size and hardware configuration
> should not cause any issues.
> 
> Can you explain the scenario a little better?
> 
> Is the memory leak happening when building your views for the first
> time?
> 
> Does beam.smp terminate on its own or is it an OOM kill from the
> kernel?
> 
> How many views do you have?
> 
> How many design docs?
> 
> JS views or Erlang views?
> 
> Is there anything else running on these nodes?
> 
> Can you share your view code?
> 
> Can you share your couch.log?
> 
> Can you explain your document structure (total bytes, number of fields,
> attachments, etc.)?
> 
> Can you describe your traffic pattern?
> 
> Can you describe any other pattern that leads up to the memory leak?
> 
> Does this happen on all nodes? If not, is there anything special
> about the affected nodes?
> 
> 
> (shameless plug, if you require professional assistance, my email
> footer has contact information)
> 
> 
> > On 31 Jan 2017, at 00:15, Tayven Bigelow
> > <tb...@mobileaccord.com> wrote:
> >
> > Hey Guys!
> >
> >
> > Been using a CouchDB 2.0 12-server cluster for a while now and have
> > noticed a memory leak that causes beam.smp to crash while
> > populating Views.
> >
> > The q/r/w/n is set up as:
> >
> > [cluster]
> > q=12
> > r=2
> > w=2
> > n=3
> >
> > As far as I know the server should be able to handle the load as it
> > has 64GB RAM with a Core i7 6700. We are running Ubuntu 16.04.1.
> >
> > The Database is 16.5 GB in size.
> >
> >
> > I've also attempted to run 2.0 with Dreyfus and Clouseau and ran
> > into the same issue with a Database size of 7.8MB.
> >
> >
> > I've noted in previous releases some people have run into similar
> > memory issues with beam.smp and increasing the open file limit was
> > part of the resolution. We've increased the nofile limit for the
> > couchdb user to 4096 (as found here:
> > https://wiki.apache.org/couchdb/Performance ) with no luck.
> >
> >
> > Nothing out of the ordinary is thrown in the logs. The only way to
> > catch it is by watching memory use.
> >
> >
> > I'm wondering if there's a configuration/setting somewhere that I am
> > missing that could be causing this issue.
> >
> >
> > Thanks!
> >
> > Tayven
> >
> >
> >
> > All information in this message is confidential and may be legally
> > privileged. If you are not the intended recipient, notify the
> > sender immediately and destroy this email.
> 
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
> Email: couchdb@neighbourhood.ie
> 
> 
> All information in this message is confidential and may be legally
> privileged. If you are not the intended recipient, notify the sender
> immediately and destroy this email.
> 

Re: Crashing due to memory use

Posted by Tayven Bigelow <tb...@mobileaccord.com>.
Hey Jan!


You'd be correct on the multiple postings; we weren't sure they were getting through.

We currently run this in production on Cloudant and were hoping to have a backup utilizing the new CouchDB 2.0. We are able to consistently replicate.

The memory leak happens when we kick off a new view.
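
By "kick off" I just mean requesting the view for the first time so the indexer starts building it, e.g. something like this (database name here is illustrative):

    curl 'http://127.0.0.1:5984/billing/_design/bills/_view/by_bill_date_and_bill_provider?limit=1'
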
beam.smp is terminated by an OOM kill from the kernel.

Checking /var/log/syslog shows:
Jan 31 18:32:44 couchdb7 kernel: [594086.565577] Out of memory: Kill process 23731 (beam.smp) score 961 or sacrifice child
Jan 31 18:32:44 couchdb7 kernel: [594086.565622] Killed process 23773 (memsup) total-vm:4228kB, anon-rss:12kB, file-rss:0kB
Jan 31 18:32:44 couchdb7 kernel: [594086.569327] Out of memory: Kill process 23731 (beam.smp) score 961 or sacrifice child
Jan 31 18:32:44 couchdb7 kernel: [594086.569392] Killed process 23731 (beam.smp) total-vm:126594220kB, anon-rss:64708732kB, file-rss:0kB
Jan 31 18:32:56 couchdb7 monit[9113]: 'couchdb' process is not running
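
We catch it only by watching memory use; nothing CouchDB-specific, just a plain ps loop along the lines of:

    watch -n 5 'ps -o pid,rss,vsz,cmd -C beam.smp'

The resident set climbs steadily from the moment the view build starts until the kernel steps in.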

The couchdb.log file at the time of the crash contains:

1981936-[debug] 2017-01-31T17:16:35.355774Z couchdb@couchdb7.geopoll.com <0.9036.262> -------- OS Process #Port<0.63437> Input  :: ["map_doc",{"_id":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6","_rev":"5-b90c6c87a0a48e647528a1b3c5bfe12b","MetaData":{"PollId":"147402","Car
rierId":"25504","UserPollStateId":"3362564708"},"UserId":"1002449829201","CreateDate":"2015-11-23T06:42:40.0285675Z","LastModifiedDate":"2015-11-23T06:43:07.5474967Z","SystemSource":"GeoPoll","AttemptCount":1,"BillingIdentifier":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6
","CallbackUri":"http://de-geopoll-1:8645/billingcallback","CallbackSent":true,"Activities":[{"MetaData":{},"CreateDate":"2015-11-23T06:42:59.0297329Z","State":"PROCESSING"},{"MetaData":{},"CreateDate":"2015-11-23T06:42:59.0307329Z","State":"SUCCESS"}],"Currency":"US_Dol
lar_USD","ConsumerIdentifier":"250025308","ToBeBilledIdentifier":"255763398389","BillType":"Carrier","BillProcessingStateAsString":"SUCCESS","Value":0.11,"BillProcessingState":"SUCCESS","BillingProvider":"TRANSFERTO","NextProcessingTime":"0001-01-01T00:00:00","NextProces
singTimeAsLong":0,"Id":"bill-4690221d-fc07-4278-abdf-cabf1018ecb6","CreatedDate":"2015-11-23T06:42:40.0285675Z","ModifiedDate":"2015-11-23T06:43:07.5474967Z","Type":"Bill"}]
1981937-[debug] 2017-01-31T17:16:35.355856Z couchdb@couchdb7.geopoll.com <0.11910.262> -------- OS Process #Port<0.63508> Output :: [[[["GeoPoll","8921801"],null]],[[["77802","PRETUPS"],null]],[[["77802","PRETUPS","SUCCESS","2014","03","05"],null],[["ALL","PRETUPS","SUCC
ESS","2014","03","05"],null],[["77802","ALL","SUCCESS","2014","03","05"],null],[["77802","PRETUPS","ALL","2014","03","05"],null],[["ALL","ALL","SUCCESS","2014","03","05"],null],[["ALL","PRETUPS","ALL","2014","03","05"],null],[["77802","ALL","ALL","2014","03","05"],null],
[["ALL","ALL","ALL","2014","03","05"],null]],[[["77802","2014","3","05"],null]],[["254788760292",null]],[[["PRETUPS","25402","2014-03-05T12:48:59.5664722Z"],43]],[[["PRETUPS","2014-03-05T12:48:59.5664722Z"],43]],[[["PRETUPS","SUCCESS","2014-03-05T12:48:59.5664722Z"],null
]],[[["PRETUPS","25402","SUCCESS","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS","25402","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS","2014-03-05T12:48:59.5664722Z"],null]],[[["PRETUPS"],null]],[["254788760292",null]],[["1000374925501",null]],[[[2014,3,5,"PRE
TUPS","SUCCESS"],null]]]
1981938-[debug] 2017-01-31T17:16:35.356012Z couchdb@couchdb7.geopoll.com <0.9036.262> -------- OS Process #Port<0.63437> Output :: [[[["147402","TRANSFERTO","SUCCESS"],null]],[[["TRANSFERTO","SUCCESS","2015-11-23T06:43:07.5474967Z"],null]],[[["TRANSFERTO","SUCCESS","0001
-01-01T00:00:00"],null]]]
1981939-[debug] 2017-01-31T17:16:35.356108Z couchdb@couchdb7.geopoll.com <0.11910.262> -------- OS Process #Port<0.63508> Input  :: ["map_doc",{"_id":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","_rev":"3-832e63f45b45d5e3008b7e7bbe2b7392","MetaData":{"PollId":"77802","CarrierId":"25402","UserPollStateId":"3256532401","CarrierName":"Airtel-Kenya","Pretups.Version":"5.1","Pretups.Uri":"https://41.223.56.108:8093/pretups/C2SReceiver","Auth.Login":"pretups","Auth.Password":"0971500a350af5c3d1c0b12221a0558c","Auth.GatewayCode":"EXTGW","Auth.GatewayType":"EXTGW","Auth.ServicePort":"190","Auth.SourceType":"EXT","Cmd.ExtNwCode":"KE","Cmd.Msisdn":"732810086","Cmd.Pin":"2549","Cmd.Login":"","Cmd.Password":"","Cmd.ExtCode":"2468","CountryCode":"254","MobilePhoneLength":"9","TestMobileNumber":"254733621719","Currency":"KES"},"UserId":"1000277123401","CreateDate":"2014-03-05T13:45:49.6889321Z","LastModifiedDate":"2014-03-05T13:46:14.8050931Z","SystemSource":"GeoPoll","AttemptCount":1,"BillingIdentifier":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","CallbackUri":"http://uk-app-3:8645/billingcallback","Activities":[{"CreateDate":"2014-03-05T13:46:14.2902898Z","State":"PROCESSING"},{"MetaData":{"Type":"EXRCTRFRESP","Txnid":"R140305.1648.210003","Txnstatus":"200","Date":"05/03/2014 16:48:40","Extrefnum":"","Data":null},"CreateDate":"2014-03-05T13:46:14.2912898Z","State":"SUCCESS"}],"Currency":"Kenyan_Shilling_KES","ConsumerIdentifier":"8963201","ToBeBilledIdentifier":"254735960469","BillType":"Carrier","BillProcessingStateAsString":"SUCCESS","Value":43.0,"BillProcessingState":"SUCCESS","BillingProvider":"PRETUPS","NextProcessingTime":"0001-01-01T00:00:00","NextProcessingTimeAsLong":0,"Id":"bill-197d71d3-3091-47ef-9efe-b154161fcbfb","CreatedDate":"2014-03-05T13:45:49.6889321Z","ModifiedDate":"2014-03-05T13:46:14.8050931Z","Type":"Bill"}]
1981940:[debug] 2017-01-31T17:32:57.300061Z couchdb@couchdb7.geopoll.com <0.111.0> -------- Supervisor couch_log_sup started couch_log_monitor:start_link() at pid <0.114.0>
1981941:[debug] 2017-01-31T17:32:57.301585Z couchdb@couchdb7.geopoll.com <0.111.0> -------- Supervisor couch_log_sup started config_listener_mon:start_link(couch_log_sup, nil) at pid <0.115.0>
1981942:[info] 2017-01-31T17:32:57.301605Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application couch_log started on node 'couchdb@couchdb7.geopoll.com'
1981943:[debug] 2017-01-31T17:32:57.302447Z couchdb@couchdb7.geopoll.com <0.119.0> -------- Supervisor folsom_sup started folsom_sample_slide_sup:start_link() at pid <0.120.0>
1981944:[debug] 2017-01-31T17:32:57.303229Z couchdb@couchdb7.geopoll.com <0.119.0> -------- Supervisor folsom_sup started folsom_meter_timer_server:start_link() at pid <0.121.0>
1981945:[debug] 2017-01-31T17:32:57.303979Z couchdb@couchdb7.geopoll.com <0.119.0> -------- Supervisor folsom_sup started folsom_metrics_histogram_ets:start_link() at pid <0.122.0>
1981946:[info] 2017-01-31T17:32:57.304074Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application folsom started on node 'couchdb@couchdb7.geopoll.com'
1981947:[debug] 2017-01-31T17:32:57.325716Z couchdb@couchdb7.geopoll.com <0.126.0> -------- Supervisor couch_stats_sup started couch_stats_aggregator:start_link() at pid <0.127.0>
1981948:[debug] 2017-01-31T17:32:57.326519Z couchdb@couchdb7.geopoll.com <0.126.0> -------- Supervisor couch_stats_sup started couch_stats_process_tracker:start_link() at pid <0.177.0>
1981949:[info] 2017-01-31T17:32:57.326595Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application couch_stats started on node 'couchdb@couchdb7.geopoll.com'
1981950:[info] 2017-01-31T17:32:57.326673Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application khash started on node 'couchdb@couchdb7.geopoll.com'
1981951:[debug] 2017-01-31T17:32:57.330327Z couchdb@couchdb7.geopoll.com <0.182.0> -------- Supervisor couch_event_sup2 started couch_event_server:start_link() at pid <0.183.0>
1981952:[debug] 2017-01-31T17:32:57.331211Z couchdb@couchdb7.geopoll.com <0.185.0> -------- Supervisor couch_event_os_sup started config_listener_mon:start_link(couch_event_os_sup, nil) at pid <0.186.0>
1981953:[debug] 2017-01-31T17:32:57.331268Z couchdb@couchdb7.geopoll.com <0.182.0> -------- Supervisor couch_event_sup2 started couch_event_os_sup:start_link() at pid <0.185.0>
1981954:[info] 2017-01-31T17:32:57.331367Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application couch_event started on node 'couchdb@couchdb7.geopoll.com'
1981955:[debug] 2017-01-31T17:32:57.334167Z couchdb@couchdb7.geopoll.com <0.190.0> -------- Supervisor ibrowse_sup started ibrowse:start_link() at pid <0.191.0>
1981956:[info] 2017-01-31T17:32:57.334239Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application ibrowse started on node 'couchdb@couchdb7.geopoll.com'
1981957:[debug] 2017-01-31T17:32:57.335727Z couchdb@couchdb7.geopoll.com <0.196.0> -------- Supervisor ioq_sup started config_listener_mon:start_link(ioq_sup, nil) at pid <0.197.0>
1981958:[debug] 2017-01-31T17:32:57.336685Z couchdb@couchdb7.geopoll.com <0.196.0> -------- Supervisor ioq_sup started ioq:start_link() at pid <0.198.0>
1981959:[info] 2017-01-31T17:32:57.336756Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application ioq started on node 'couchdb@couchdb7.geopoll.com'
1981960:[info] 2017-01-31T17:32:57.336829Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application mochiweb started on node 'couchdb@couchdb7.geopoll.com'
1981961:[info] 2017-01-31T17:32:57.336899Z couchdb@couchdb7.geopoll.com <0.7.0> -------- Application oauth started on node 'couchdb@couchdb7.geopoll.com'
1981962:[info] 2017-01-31T17:32:57.340965Z couchdb@couchdb7.geopoll.com <0.204.0> -------- Apache CouchDB 2.0.0 is starting.



For the large database it would happen when we kicked off just one of its 39 views; on the smaller database I had to kick off all 5 of its views.
The large database has 9 design documents; the smaller one has just 1.
The views are all JS.
Other than Fail2Ban, UFW, Logwatch, LogRotate, Monit and Zabbix-Agent, nothing else is running on the servers, except when we build with Dreyfus and Clouseau.
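
For reference, the couchjs pool is capped by the [query_server_config] section of local.ini; here is a minimal sketch with what I believe are the stock defaults (worth double-checking, we haven't verified these against our install):

    [query_server_config]
    ; hard limit on the number of couchjs OS processes
    os_process_limit = 100
    ; abort reduce functions whose output doesn't shrink fast enough
    reduce_limit = true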

Example of one of the larger Design documents:
{
  "_id": "_design/bills",
  "_rev": "4-b0ed6cf8f871391add5004f7e67bc3a8",
  "language": "javascript",
  "auto_update": true,
  "views": {
    "by_bill_date_and_bill_provider": {
      "map": "function(doc) {\n  if (doc._id.indexOf(\"bill-\") === 0){\n      var date = new Date(doc.CreatedDate?doc.CreatedDate:doc.CreateDate);\n      var year = date.getFullYear();\n      var month = (date.getMonth() + 1);\n      var day = date.getDate();\n      emit([year, month, day, doc.BillingProvider, doc.BillProcessingState], null);\n  }\n}",
      "reduce": "_count"
    },
    "by_poll_id_and_bill_date": {
      "map": "function(doc) {\n  if ((doc._id.indexOf(\"bill-\") === 0) && doc.MetaData.PollId){\n    var date = new Date(doc.CreateDate);\n    var year = date.getFullYear().toString();\n    var month = (date.getMonth() + 1).toString();\n    var day = date.getDate().toString();\n    if (day.length == 1){\n      day = \"0\" + day;\n    }\n\n    emit([doc.MetaData.PollId, year, month, day], null);\n  }\n}",
      "reduce": "_count"
    }
  }
}
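
Unescaped, for readability, the first of those map functions is just:

    function(doc) {
      if (doc._id.indexOf("bill-") === 0) {
        // docs carry both CreatedDate and CreateDate; prefer the former
        var date = new Date(doc.CreatedDate ? doc.CreatedDate : doc.CreateDate);
        var year = date.getFullYear();
        var month = date.getMonth() + 1;
        var day = date.getDate();
        emit([year, month, day, doc.BillingProvider, doc.BillProcessingState], null);
      }
    }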

Example of a doc within the larger database:
{
  "_id": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
  "_rev": "5-b40e00a54059c6c79004c0afd584fc60",
  "MetaData": {
    "PollId": "1844608",
    "CarrierId": "2701",
    "UserPollStateId": "12614468108"
  },
  "UserId": "1002196088104",
  "CreateDate": "2017-01-31T07:20:58",
  "LastModifiedDate": "2017-01-31T07:21:14.2473555Z",
  "SystemSource": "GeoPoll",
  "AttemptCount": 1,
  "BillingIdentifier": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
  "CallbackUri": "http://XXXXXXXXXXX:8645/billingcallback",
  "CallbackSent": true,
  "Activities": [
    {
      "MetaData": {},
      "CreateDate": "2017-01-31T07:21:11.182049Z",
      "State": "PROCESSING"
    },
    {
      "MetaData": {
        "VoucherPin": "",
        "OrderRef": "113234210",
        "TicketNumber": "",
        "BoxNumber": "",
        "BatchNumber": "",
        "ProcessingTime": "3064.3064"
      },
      "CreateDate": "2017-01-31T07:21:11.1820491Z",
      "State": "SUCCESS"
    }
  ],
  "Currency": "South_African_Rand_ZAR",
  "ConsumerIdentifier": "XXXXXXXXXXXX",
  "ToBeBilledIdentifier": "XXXXXXXXXXXX",
  "BillType": "Carrier",
  "BillProcessingStateAsString": "SUCCESS",
  "Value": 2,
  "BillProcessingState": "SUCCESS",
  "BillingProvider": "VODACOMSA",
  "NextProcessingTime": "0001-01-01T00:00:00",
  "NextProcessingTimeAsLong": 0,
  "FinalProcessingTime": 0,
  "LastSubmittedDate": "0001-01-01T00:00:00",
  "Id": "bill-e2a5a7d1-3d9f-4f9b-b526-13b80b9e6947",
  "CreatedDate": "2017-01-31T07:20:58",
  "ModifiedDate": "2017-01-31T07:21:14.2473555Z",
  "Type": "Bill"
}

Docs usually go through 4-5 updates before they are finalized.
Within the larger database we have 16,201,998 docs totaling 23 GB. No attachments.
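(That works out to roughly 23 GB / 16,201,998 ≈ 1.5 kB per document on average, so the individual docs are small.)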

No other traffic besides a single user (me), and no replication either.
No other patterns that stand out (to me at least). The memory usage grows and grows, eventually consuming the swap space and running into an OOM kill.

The other 11 nodes are affected as well.

Thanks for your assistance!!

-Tayven

________________________________
From: Jan Lehnardt <ja...@apache.org>
Sent: Tuesday, January 31, 2017 4:38 AM
To: user@couchdb.apache.org
Cc: Tayven Bigelow; Nick Becker
Subject: Re: Crashing due to memory use

Heya Nick and Tayven,

I assume you posted multiple times because your mails didn’t show up immediately due to mailing list moderation.

You are correct that the database size and hardware configuration should not cause any issues.

Can you explain the scenario a little better?

Is the memory leak happening when building your views for the first time?

Does beam.smp terminate on its own or is it an OOM kill from the kernel?

How many views do you have?

How many design docs?

JS views or Erlang views?

Is there anything else running on these nodes?

Can you share your view code?

Can you share your couch.log?

Can you explain your document structure (total bytes, number of fields, attachments, etc.)?

Can you describe your traffic pattern?

Can you describe any other pattern that leads up to the memory leak?

Does this happen on all nodes? If not, is there anything special about the affected nodes?


(shameless plug, if you require professional assistance, my email footer has contact information)


> On 31 Jan 2017, at 00:15, Tayven Bigelow <tb...@mobileaccord.com> wrote:
>
> Hey Guys!
>
>
> Been using a CouchDB 2.0 12-server cluster for a while now and have noticed a memory leak that causes beam.smp to crash while populating Views.
>
> The q/r/w/n is set up as:
>
> [cluster]
> q=12
> r=2
> w=2
> n=3
>
> As far as I know the server should be able to handle the load as it has 64GB RAM with a Core i7 6700. We are running Ubuntu 16.04.1.
>
> The Database is 16.5 GB in size.
>
>
> I've also attempted to run 2.0 with Dreyfus and Clouseau and ran into the same issue with a Database size of 7.8MB.
>
>
> I've noted in previous releases some people have run into similar memory issues with beam.smp and increasing the open file limit was part of the resolution. We've increased the nofile limit for the couchdb user to 4096 (as found here: https://wiki.apache.org/couchdb/Performance ) with no luck.
>
>
> Nothing out of the ordinary is thrown in the logs. The only way to catch it is by watching memory use.
>
>
> I'm wondering if there's a configuration/setting somewhere that I am missing that could be causing this issue.
>
>
> Thanks!
>
> Tayven
>
>
>
> All information in this message is confidential and may be legally privileged. If you are not the intended recipient, notify the sender immediately and destroy this email.

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
Email: couchdb@neighbourhood.ie


All information in this message is confidential and may be legally privileged. If you are not the intended recipient, notify the sender immediately and destroy this email.

Re: Crashing due to memory use

Posted by Jan Lehnardt <ja...@apache.org>.
Heya Nick and Tayven,

I assume you posted multiple times because your mails didn’t show up immediately due to mailing list moderation.

You are correct that the database size and hardware configuration should not cause any issues.

Can you explain the scenario a little better? 

Is the memory leak happening when building your views for the first time?

Does beam.smp terminate on its own or is it an OOM kill from the kernel?

How many views do you have?

How many design docs?

JS views or Erlang views?

Is there anything else running on these nodes?

Can you share your view code?

Can you share your couch.log?

Can you explain your document structure (total bytes, number of fields, attachments, etc.)?

Can you describe your traffic pattern?

Can you describe any other pattern that leads up to the memory leak?

Does this happen on all nodes? If not, is there anything special about the affected nodes?


(shameless plug, if you require professional assistance, my email footer has contact information)


> On 31 Jan 2017, at 00:15, Tayven Bigelow <tb...@mobileaccord.com> wrote:
> 
> Hey Guys!
> 
> 
> Been using a CouchDB 2.0 12-server cluster for a while now and have noticed a memory leak that causes beam.smp to crash while populating Views.
> 
> The q/r/w/n is set up as:
> 
> [cluster]
> q=12
> r=2
> w=2
> n=3
> 
> As far as I know the server should be able to handle the load as it has 64GB RAM with a Core i7 6700. We are running Ubuntu 16.04.1.
> 
> The Database is 16.5 GB in size.
> 
> 
> I've also attempted to run 2.0 with Dreyfus and Clouseau and ran into the same issue with a Database size of 7.8MB.
> 
> 
> I've noted in previous releases some people have run into similar memory issues with beam.smp and increasing the open file limit was part of the resolution. We've increased the nofile limit for the couchdb user to 4096 (as found here: https://wiki.apache.org/couchdb/Performance ) with no luck.
> 
> 
> Nothing out of the ordinary is thrown in the logs. The only way to catch it is by watching memory use.
> 
> 
> I'm wondering if there's a configuration/setting somewhere that I am missing that could be causing this issue.
> 
> 
> Thanks!
> 
> Tayven
> 
> 
> 
> All information in this message is confidential and may be legally privileged. If you are not the intended recipient, notify the sender immediately and destroy this email.

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
Email: couchdb@neighbourhood.ie