Posted to dev@couchdb.apache.org by Alexander Shorin <kx...@gmail.com> on 2014/04/24 11:16:23 UTC

[ANN] CouchDB monitoring: you're doing it wro...you can do it better!

Hi everyone again,

Actually, I have one more thing to share with you today: a guide I
wrote about monitoring CouchDB:

http://gws.github.io/munin-plugin-couchdb/guide-to-couchdb-monitoring.html

While it is located in the same repository as a plugin for a specific
monitoring system, it's completely project-neutral (with a small
exception at the end) and aims to cover all the possibilities for
monitoring CouchDB server state. Using only /_stats isn't enough.

If you prefer Zabbix or Nagios or something else, you may also find
some interesting bits there.

As a discussion topic, I'd like to ask everyone: how do you monitor
your CouchDB in production? Which metrics are important for you, and
which ones do you feel are missing? Experience sharing and cool stories
about how monitoring helps you keep CouchDB good and well are welcome!

P.S. English isn't my native language, so if you notice any
misspellings or sentences with incorrect syntax, please don't be shy
about sending me a private email with corrections. Thanks!

--
,,,^..^,,,

Re: [ANN] CouchDB monitoring: you're doing it wro...you can do it better!

Posted by Suraj Kumar <su...@inmobi.com>.

Thanks, Alexander, for starting this thread. Your document seems quite
illustrative and a great starting point :)

Here is the picture of our system, from the point of view of monitoring:

We have two couch servers in multi-master mode. Replication gave us
trouble: the replication task on node#1 would crash if the remote couch
service on node#2 was stopped (say, for maintenance). This may seem like
"of course, what else do you expect?" behaviour, but really, I wish
couchdb could "keep retrying until success" to resume replication, come
what may. Since couchdb does not do this, we wrote a simple replication
watcher (Node.js) whose primary function is to create replication tasks
if a task that _should_ be running is not running. To be fair,
"monitoring" of replication tasks is available through _active_tasks,
but what I'm really looking for is the ability to create 'persisted
continuous replication tasks' without then having to worry about
monitoring the monitor (if it's an OTP "supervisor" restarting it,
that's an entirely different matter).
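
A minimal sketch of what such a watcher might look like, written in
Python rather than Node.js. It is an illustration, not the tool
described above. The function names and the list of desired
(source, target) pairs are hypothetical. It assumes that replication
entries in _active_tasks carry "type", "source" and "target" fields,
and that URLs there may have a trailing slash. If credentials are
embedded in the URLs, couch may mask them in _active_tasks, in which
case the comparison below would need adjusting.

```python
import json
import urllib.request


def _norm(url):
    """Strip trailing slashes so 'http://a/db/' and 'http://a/db' compare equal."""
    return url.rstrip("/")


def missing_replications(desired, active_tasks):
    """Return the desired (source, target) pairs with no running replication task."""
    running = {
        (_norm(t["source"]), _norm(t["target"]))
        for t in active_tasks
        if t.get("type") == "replication"
    }
    return [(s, t) for (s, t) in desired if (_norm(s), _norm(t)) not in running]


def restart_missing(couch_url, desired):
    """Poll _active_tasks once and re-create any missing continuous replication."""
    with urllib.request.urlopen(couch_url + "/_active_tasks") as resp:
        tasks = json.load(resp)
    for source, target in missing_replications(desired, tasks):
        body = json.dumps({"source": source, "target": target, "continuous": True})
        req = urllib.request.Request(
            couch_url + "/_replicate",
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # Supplying a request body makes urllib send a POST.
        urllib.request.urlopen(req).close()
```

Running restart_missing() from cron or in a sleep loop gives the
"create it if it should be running" behaviour, but the watcher itself
still has to be monitored, which is exactly the concern raised above.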

The other issue we kept running into is that some GETs on views would
hang. Observed side effects: lots of couchjs processes, _active_tasks
shows no compaction or indexer tasks, request_time in _stats goes up,
and we don't know what is going on. The surprising fact was that a
view's performance was inversely proportional to the number of views in
its design document (i.e., if a view comes from a design doc with lots
of other view functions, calls to it would perform poorly, with
frequent freezes). We 'worked around' this by moving critical views into
their own design docs. (We use couchdb 1.5.0 with some custom
non-interfering, sequential hacks (e.g., to persist UserCtx and
time-of-modification into the JSON document and to always snapshot every
change (i.e., 2 sequential writes per intended doc write)).)

Anyway, here is our monitoring bucket list of things that aren't readily
available in _stats, so we measure them in indirect ways:

1. number of open tcp sockets to couch.
2. cpu and memory usage of (each) couchjs process.
3. cpu and memory usage of beam.smp process.
4. grep couch.log for crash-like patterns (essentially, internal errors of
couch engine, such as replication task crashes).
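
Item 4 can be sketched as a small log scanner. This is only an
illustration: the two patterns below are assumptions about what
"crash-like" lines look like, not an authoritative list, and should be
adjusted to whatever your couch.log actually contains.

```python
import re

# Hypothetical patterns: tune these against your own couch.log.
CRASH_PATTERNS = [r"\[error\]", r"CRASH REPORT"]


def count_crash_lines(lines, patterns=CRASH_PATTERNS):
    """Count, per pattern, how many log lines match it."""
    regexes = [re.compile(p) for p in patterns]
    counts = {p: 0 for p in patterns}
    for line in lines:
        for pattern, rx in zip(patterns, regexes):
            if rx.search(line):
                counts[pattern] += 1
    return counts
```

It accepts any iterable of lines, so an open file object for the log
can be passed in directly and the resulting counts fed to whatever
monitoring system is in use.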

Here are some specific areas where _only_ couch can provide insight
(i.e., completely internal stuff):
1. fragmentation levels of each DB and each View (TIL from OP: disk_size
and data_size can be used to figure this out, but what about each View?).
2. average/min/max doc size
3. hits per (each) view
4. emits per (each) view (because 'emit' causes disk write)
5. size of each index/view (number of nodes in the tree, used size
(individual index file size?), etc.)
6. count of (inbox) 'queued' messages categorized by each 'functional kind'
of erlang process (i.e., kind == index-update, doc-update, doc-read,
mochiweb, etc.)
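
On item 1, the disk_size/data_size calculation can be sketched as
below. The function names are hypothetical. The sketch assumes the
database info document (GET /db) has top-level disk_size and data_size
fields, and that the design document info (GET /db/_design/ddoc/_info)
reports the same two fields under "view_index", as 1.x releases appear
to do. Since all views of one design document share a single index
file, this gives a figure per design document rather than per
individual view.

```python
def fragmentation_percent(disk_size, data_size):
    """Share of the file (in percent) not occupied by live data."""
    if not disk_size:
        return 0.0
    return (disk_size - data_size) * 100.0 / disk_size


def db_fragmentation(db_info):
    """Fragmentation from a parsed GET /db response (assumed layout)."""
    return fragmentation_percent(db_info["disk_size"], db_info["data_size"])


def view_fragmentation(ddoc_info):
    """Fragmentation from a parsed GET /db/_design/ddoc/_info response (assumed layout)."""
    index = ddoc_info["view_index"]
    return fragmentation_percent(index["disk_size"], index["data_size"])
```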

Perhaps some of these metrics cannot easily be tracked as in-memory
counters in some situations? I wonder, what if there was a way we could
listen for significant events happening inside couch? Couch wouldn't
even have to keep track of everything as in-memory counters. Instead,
some of them could be exposed as events that can be plumbed into a
dedicated event stream processor (like Riemann.io) to do whatever
monitoring/alerting we may want to do.

Regards,

  -Suraj

-- 
An Onion is the Onion skin and the Onion under the skin until the Onion
Skin without any Onion underneath.

Re: [ANN] CouchDB monitoring: you're doing it wro...you can do it better!

Posted by Noah Slater <ns...@apache.org>.

Please only use ANN for project-level announcements, like new releases
or security information. In this instance, I would have used BLOG
perhaps? :)

Also, is this something that would work on our project blog?

-- 
Noah Slater
https://twitter.com/nslater
