Posted to user@couchdb.apache.org by Roberto Iglesias <ri...@voipi.com.ar> on 2022/04/04 16:21:53 UTC

CouchDB 3.1.1 high disk utilization and degradation over time

Hello.

About 1 year ago, we had two CouchDB 2.3.1 instances running inside Docker
containers and pull-replicating from each other. This way, we could read
from and write to either of these servers, although we generally chose one
as the "active" server and wrote to it. The second server acted as a spare
or backup.

At that point (1 year ago) we decided to migrate from CouchDB 2.3.1 to
3.1.1. Instead of upgrading our existing databases, we added two extra
instances and configured pull replications on all of them until we reached
the following scenario:

2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B

where <===> represents two pull replications, one configured on each side,
i.e. 2.3.1-A pulls from 2.3.1-B and vice versa.
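
For completeness, each arrow is a pull replication defined on the pulling
side, e.g. as a document in that node's _replicator database roughly like
this (hostnames, credentials and database name below are placeholders, not
our real ones):

  curl -X PUT 'http://admin:password@localhost:5984/_replicator/pull-from-b' \
    -H 'Content-Type: application/json' \
    -d '{
      "source": "http://admin:password@couchdb-b.example:5984/mydb",
      "target": "http://admin:password@localhost:5984/mydb",
      "continuous": true
    }'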

If a write is made at 2.3.1-A, it has to make it through all servers until
it reaches 3.1.1-B.

All of them have an exclusive HDD which is not shared with any other
service.

We have not had a single problem with 2.3.1.

After pointing our services to 3.1.1-A, read I/O wait times gradually
increased over weeks until they reached peaks of 600 ms (totally
unworkable). So we stopped sending write requests (HTTP POST) to it and
pointed all applications to 3.1.1-B. 3.1.1-A was still receiving writes,
but only via the replication protocol, as explained before.
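
We read these values from Munin; on the host itself the same read-wait
figure shows up as the r_await column of iostat, e.g.:

  # /dev/sdX is a placeholder for the CouchDB data disk
  iostat -dx 5 /dev/sdX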

At the 3.1.1-A server, disk stats decreased to acceptable values, so a few
weeks later we pointed the applications back to it in order to confirm
whether the problem is related to the write requests sent from our
application or not. Read I/O times did not increase this time. Instead,
3.1.1-B (which had handled application traffic for a few weeks) started to
show the same behaviour, even though it was no longer handling requests
from applications.

It feels like some fragmentation is occurring, but the filesystem (ext4)
shows none.
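
For the record, ext4 fragmentation can be checked with e4defrag's report
mode or per file with filefrag; the paths below are placeholders for our
Docker volume:

  # fragmentation score of the filesystem holding the CouchDB volume
  e4defrag -c /path/to/couchdb-volume
  # extent count of a single shard file
  filefrag -v /path/to/couchdb-volume/shards/00000000-7fffffff/mydb.*.couch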

Some changes we've made since the problem started:

   - Upgraded kernel from 4.15.0-55-generic to 5.4.0-88-generic
   - Upgraded Ubuntu from 18.04 to 20.04
   - Deleted _global_changes database from couchdb3.1.1-A
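
For reference, deleting _global_changes boils down to a single DELETE
against the HTTP API (credentials below are placeholders):

  curl -X DELETE 'http://admin:password@localhost:5984/_global_changes'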


More info:

   - CouchDB is using Docker local-persist (
   https://github.com/MatchbookLab/local-persist) volumes.
   - Disks are WD Purple for the 2.3.1 CouchDBs and WD Black for the 3.1.1
   CouchDBs.
   - We have only one database of 88 GiB and two views: one of 22 GB and a
   small one of 30 MB (updated very frequently).
   - docker stats shows that CouchDB 3.1.1 uses a lot more memory than
   2.3.1:
   - 2.5 GiB for couchdb3.1.1-A (not receiving direct write requests)
   - 5.0 GiB for couchdb3.1.1-B (receiving both read and write requests)
   - 900 MiB for 2.3.1-A
   - 800 MiB for 2.3.1-B
   - Database compaction runs at night. The problem only occurs during the
   day, when most of the writes are made (see the check right after this
   list).
   - Most of the config is default.
   - Latency graph from Munin monitoring attached (at the peak there is a
   server outage caused by a kernel upgrade that went wrong).
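
To quantify the CouchDB-internal kind of fragmentation mentioned above (old
revisions and garbage waiting for compaction inside the .couch files), the
database info can be compared before and after the nightly window;
credentials and database name below are placeholders:

  # sizes.file much larger than sizes.active means compaction has fallen behind
  curl -s 'http://admin:password@localhost:5984/mydb' | grep -o '"sizes":{[^}]*}'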


Any help is appreciated.

-- 
--

*Roberto E. Iglesias*

Re: CouchDB 3.1.1 high disk utilization and degradation over time

Posted by Roberto Iglesias <ri...@voipi.com.ar>.
Hi Hoël, thanks for sharing.

To me, it doesn't seem to be the same problem, as we use a single HDD (WD
Black) for this purpose and it is not throttled by anything. In fact, I/O
usage increases over time until it reaches unworkable values. To summarize,
CouchDB is not sharing I/O resources with any other service.

I'd like to provide more information, so here is our CouchDB configuration
(provided by the _node/_local/_config endpoint):

{
  "smoosh.slack_dbs": {
    "from": "00:00",
    "min_priority": "5368709120",
    "strict_window": "true",
    "to": "08:00"
  },
  "uuids": {
    "algorithm": "sequential",
    "max_count": "1000"
  },
  "cluster": {
    "n": "3",
    "q": "2"
  },
  "cors": {
    "credentials": "false"
  },
  "chttpd": {
    "backlog": "512",
    "bind_address": "any",
    "max_db_number_for_dbs_info_req": "100",
    "port": "5984",
    "prefer_minimal": "Cache-Control, Content-Length, Content-Range,
Content-Type, ETag, Server, Transfer-Encoding, Vary",
    "require_valid_user": "false",
    "server_options": "[{recbuf, undefined}]",
    "socket_options": "[{sndbuf, 262144}, {nodelay, true}]"
  },
  "attachments": {
    "compressible_types": "text/*, application/javascript,
application/json, application/xml",
    "compression_level": "8"
  },
  "admins": {
    "admin":
"-pbkdf2-f0aacb520f0bfc4aa90c30045fac25acd431c4ac,7adbc3635ee33dec5e2f38ada9e6bf55,10"
  },
  "query_server_config": {
    "os_process_limit": "100",
    "reduce_limit": "true"
  },
  "vendor": {
    "name": "The Apache Software Foundation"
  },
  "smoosh.ratio_views": {
    "from": "00:00",
    "min_priority": "5.0",
    "strict_window": "true",
    "to": "08:00"
  },
  "feature_flags": {
    "partitioned||*": "true"
  },
  "replicator": {
    "connection_timeout": "30000",
    "http_connections": "20",
    "interval": "60000",
    "max_churn": "20",
    "max_jobs": "500",
    "retries_per_request": "5",
    "socket_options": "[{keepalive, true}, {nodelay, false}]",
    "ssl_certificate_max_depth": "3",
    "startup_jitter": "5000",
    "verify_ssl_certificates": "false",
    "worker_batch_size": "500",
    "worker_processes": "1"
  },
  "ssl": {
    "port": "6984"
  },
  "smoosh.slack_views": {
    "from": "00:00",
    "min_priority": "3758096384",
    "strict_window": "true",
    "to": "08:00"
  },
  "log": {
    "file": "/opt/couchdb/var/log/couch.log",
    "level": "info",
    "write_buffer": "1048576",
    "write_delay": "5000",
    "writer": "file"
  },
  "indexers": {
    "couch_mrview": "true"
  },
  "couch_peruser": {
    "database_prefix": "userdb-",
    "delete_dbs": "false",
    "enable": "false"
  },
  "httpd": {
    "allow_jsonp": "false",
    "authentication_handlers": "{couch_httpd_auth,
cookie_authentication_handler}, {couch_httpd_auth,
default_authentication_handler}",
    "bind_address": "any",
    "enable_cors": "false",
    "enable_xframe_options": "false",
    "max_http_request_size": "4294967296",
    "port": "5986",
    "secure_rewrites": "true",
    "socket_options": "[{sndbuf, 262144}]"
  },
  "ioq.bypass": {
    "compaction": "false",
    "os_process": "true",
    "read": "true",
    "shard_sync": "false",
    "view_update": "true",
    "write": "true"
  },
  "ioq": {
    "concurrency": "10",
    "ratio": "0.01"
  },
  "smoosh.ratio_dbs": {
    "from": "00:00",
    "min_priority": "5.0",
    "strict_window": "true",
    "to": "08:00"
  },
  "csp": {
    "enable": "true"
  },
  "couch_httpd_auth": {
    "allow_persistent_cookies": "true",
    "auth_cache_size": "50",
    "authentication_db": "_users",
    "authentication_redirect": "/_utils/session.html",
    "iterations": "10",
    "require_valid_user": "false",
    "secret": "d6fbdfb5c21b756c94abd8fb0be54d17",
    "timeout": "600"
  },
  "couchdb_engines": {
    "couch": "couch_bt_engine"
  },
  "couchdb": {
    "attachment_stream_buffer_size": "4096",
    "changes_doc_ids_optimization_threshold": "100",
    "database_dir": "./data",
    "default_engine": "couch",
    "default_security": "admin_only",
    "file_compression": "snappy",
    "max_dbs_open": "500",
    "max_document_size": "8000000",
    "os_process_timeout": "5000",
    "users_db_security_editable": "false",
    "uuid": "5b2a3feb760f43fd811abf568eb1f6b2",
    "view_index_dir": "./data"
  }
}
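
For reference, the same endpoint can also be queried per section or per
key, e.g. (credentials are placeholders):

  curl -s 'http://admin:password@localhost:5984/_node/_local/_config/smoosh.ratio_dbs'
  curl -s 'http://admin:password@localhost:5984/_node/_local/_config/ioq.bypass/read'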


Also, I've noticed that attachments are not published in mailing lists, so
here is a link to the image I added in my first email:
https://ibb.co/L9mvZRB

Thanks again!

On Tue, Apr 5, 2022 at 4:36 AM Hoël Iris <ho...@tolteck.com> wrote:

> My configuration is different (a lot of small DBs) but I had disk I/O
> performance issues too when upgrading from CouchDB 2 to CouchDB 3.
> Maybe it's related, maybe it's not.
> I use AWS, the solution for me was to increase AWS disk IOPs.
>
> See the full discussion here:
> https://github.com/apache/couchdb/discussions/3217

Re: CouchDB 3.1.1 high disk utilization and degradation over time

Posted by Hoël Iris <ho...@tolteck.com>.
My configuration is different (a lot of small DBs) but I had disk I/O
performance issues too when upgrading from CouchDB 2 to CouchDB 3.
Maybe it's related, maybe it's not.
I use AWS; the solution for me was to increase AWS disk IOPS.

See the full discussion here:
https://github.com/apache/couchdb/discussions/3217

Re: CouchDB 3.1.1 high disk utilization and degradation over time

Posted by Roberto Iglesias <ri...@voipi.com.ar>.
Hi Jan, thanks for taking the time to read and answer my question.

> Did you account for the completely rewritten compaction daemon (smoosh)
> that has a different configuration from the one in 2.x?


Yes, I've taken this into account and adjusted my config to match what we
expect. This is the relevant part of the smoosh config:

"smoosh.slack_dbs": {
  "from": "00:00",
  "min_priority": "5368709120",
  "strict_window": "true",
  "to": "08:00"
},
"smoosh.slack_views": {
  "from": "00:00",
  "min_priority": "3758096384",
  "strict_window": "true",
  "to": "08:00"
},
"smoosh.ratio_dbs": {
  "from": "00:00",
  "min_priority": "5.0",
  "strict_window": "true",
  "to": "08:00"
},
"smoosh.ratio_views": {
  "from": "00:00",
  "min_priority": "5.0",
  "strict_window": "true",
  "to": "08:00"
}

And it seems to be working as desired. Indeed, at the beginning we had
problems with compaction never finishing, so we ended up with this config.
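
For what it's worth, we can confirm the window because running compactions
show up in _active_tasks as database_compaction / view_compaction entries
with a progress field (credentials below are placeholders):

  curl -s 'http://admin:password@localhost:5984/_active_tasks'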

> And finally: which Erlang version are you running? There are a few odd ones
> out there that might affect what you’re doing.


We are using the couchdb:3.1.1 Docker Hub image, but if it helps, this is
the name of the Erlang runtime directory inside my CouchDB container:

root@mycouchdb:/opt/couchdb# ls -ld erts*
drwxr-xr-x 1 couchdb couchdb 4096 Mar 12  2021 erts-9.3.3.14

If I map the erts version correctly, that corresponds to Erlang/OTP 20.3.

Hope it helps.

Thanks.

On Mon, Apr 18, 2022 at 7:11 AM Jan Lehnardt <ja...@apache.org> wrote:

>
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>
> *24/7 Observation for your CouchDB Instances:
> https://opservatory.app
>
> > On 4. Apr 2022, at 18:21, Roberto Iglesias <ri...@voipi.com.ar>
> wrote:
> >
> > Hello.
> >
> > About 1 year ago, we had two CouchDB 2.3.1 instances running inside
> Docker containers and pull-replicating one each other. This way, we could
> read from and write to any of these servers, although we generally choose
> one as the "active" server and write to it. The second server would act as
> a spare or backup.
> >
> > At this point (1y ago) we decided to migrate from CouchDB version 2.3.1
> to 3.1.1. Instead of upgrading our existing databases, we added two extra
> instances and configure pull replications in all of them until we get the
> following scenario:
> >
> > 2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B
> >
> > where <===> represents two pull replications, one configured on each
> side. i.e: 2.3.1-A pulls from 2.3.1-B and vice versa.
> >
> > If a write is made at 2.3.1-A, it has to make it through all servers
> until it reaches 3.1.1-B.
> >
> > All of them have an exclusive HDD which is not shared with any other
> service.
> >
> > We have not a single problem with 2.3.1.
> >
> > After pointing our services to 3.1.1-A, it gradually started to increase
> Read I/O wait times over weeks until it reached peaks of 600ms (totally
> unworkable). So we stopped making write requests (http POST) to it and
> pointed all applications to 3.1.1-B. 3.1.1-A was still receiving writes but
> only by replication protocol, as I explained before.
> >
> > At 3.1.1-A server, disk stats decreased to acceptable values, so a few
> weeks after we pointed applications back to it in order to confirm whether
> the problem is related to write requests sent from our application or not.
> Read I/O times did not increase this time. Instead, 3.1.1-B (which handled
> application traffic for a few weeks), started to show the same behaviour,
> despite it was not handling requests from applications.
> >
> > It feels like some fragmentation is occurring, but filesystem (ext4)
> shows none.
> >
> > Some changes we've made since problem started:
> >       • Upgraded kernel from 4.15.0-55-generic to 5.4.0-88-generic
> >       • Upgraded ubuntu from 18.04 to 20.04
> >       • Deleted _global_changes database from couchdb3.1.1-A
> >
> > More info:
> >       • Couchdb is using docker local-persist (
> https://github.com/MatchbookLab/local-persist) volumes.
> >       • Disks are WD Purple for 2.3.1 couchdbs and WD Black for 3.1.1
> couchdbs.
> >       • We have only one database of 88GiB and 2 views: one of 22GB and
> a little one of 30MB (highly updated)
> >       • docker stats shows that couchdb3.1.1 uses lot of memory compared
> to 2.3.1:
> >       • 2.5GiB for couchdb3.1.1-A (not receiving direct write requests)
> >       • 5.0GiB for couchdb3.1.1-B (receiving both read and write
> requests)
> >       • 900MiB for 2.3.1-A
> >       • 800MiB for 2.3.1-B
> >       • Database compaction is run at night. Problem only occurs over
> day, when most of the writes are made.
>
> Did you account for the completely rewritten compaction daemon (smoosh)
> that has a different configuration from the one in 2.x?
>
> https://docs.couchdb.org/en/stable/maintenance/compaction.html#compact-auto
>
> Otherwise you might see compaction going on at all times (what we
> recommend, usually), rather than what you expect: just at night.
>
> And in general, at this point, we strongly recommend running on SSDs for
> the obvious speed benefits :)
>
> And finally: which Erlang version are you running? There are a few odd
> ones out there that might affect what you’re doing.
>
> Best
> Jan
> —
> >       • Most of the config is default.
> >       • Latency graph from munin monitoring attached (at the peak, there
> is an outage of the server caused by a kernel upgrade that went wrong)
> >
> > Any help is appreciated.
> >
> > --
> > --
> >
> > Roberto E. Iglesias
>
>

Re: CouchDB 3.1.1 high disk utilization and degradation over time

Posted by Jan Lehnardt <ja...@apache.org>.
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

24/7 Observation for your CouchDB Instances:
https://opservatory.app

> On 4. Apr 2022, at 18:21, Roberto Iglesias <ri...@voipi.com.ar> wrote:
> 
> Hello.
> 
> About 1 year ago, we had two CouchDB 2.3.1 instances running inside Docker containers and pull-replicating one each other. This way, we could read from and write to any of these servers, although we generally choose one as the "active" server and write to it. The second server would act as a spare or backup.
> 
> At this point (1y ago) we decided to migrate from CouchDB version 2.3.1 to 3.1.1. Instead of upgrading our existing databases, we added two extra instances and configure pull replications in all of them until we get the following scenario:
> 
> 2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B
> 
> where <===> represents two pull replications, one configured on each side. i.e: 2.3.1-A pulls from 2.3.1-B and vice versa.
> 
> If a write is made at 2.3.1-A, it has to make it through all servers until it reaches 3.1.1-B.
> 
> All of them have an exclusive HDD which is not shared with any other service.
> 
> We have not a single problem with 2.3.1.
> 
> After pointing our services to 3.1.1-A, it gradually started to increase Read I/O wait times over weeks until it reached peaks of 600ms (totally unworkable). So we stopped making write requests (http POST) to it and pointed all applications to 3.1.1-B. 3.1.1-A was still receiving writes but only by replication protocol, as I explained before.
> 
> At 3.1.1-A server, disk stats decreased to acceptable values, so a few weeks after we pointed applications back to it in order to confirm whether the problem is related to write requests sent from our application or not. Read I/O times did not increase this time. Instead, 3.1.1-B (which handled application traffic for a few weeks), started to show the same behaviour, despite it was not handling requests from applications.
> 
> It feels like some fragmentation is occurring, but filesystem (ext4) shows none.
> 
> Some changes we've made since problem started:
> 	• Upgraded kernel from 4.15.0-55-generic to 5.4.0-88-generic
> 	• Upgraded ubuntu from 18.04 to 20.04
> 	• Deleted _global_changes database from couchdb3.1.1-A
> 
> More info:
> 	• Couchdb is using docker local-persist (https://github.com/MatchbookLab/local-persist) volumes.
> 	• Disks are WD Purple for 2.3.1 couchdbs and WD Black for 3.1.1 couchdbs.
> 	• We have only one database of 88GiB and 2 views: one of 22GB and a little one of 30MB (highly updated)
> 	• docker stats shows that couchdb3.1.1 uses lot of memory compared to 2.3.1:
> 	• 2.5GiB for couchdb3.1.1-A (not receiving direct write requests)
> 	• 5.0GiB for couchdb3.1.1-B (receiving both read and write requests)
> 	• 900MiB for 2.3.1-A
> 	• 800MiB for 2.3.1-B
> 	• Database compaction is run at night. Problem only occurs over day, when most of the writes are made.

Did you account for the completely rewritten compaction daemon (smoosh) that has a different configuration from the one in 2.x?

https://docs.couchdb.org/en/stable/maintenance/compaction.html#compact-auto

Otherwise you might see compaction going on at all times (what we recommend, usually), rather than what you expect: just at night.

And in general, at this point, we strongly recommend running on SSDs for the obvious speed benefits :)

And finally: which Erlang version are you running? There are a few odd ones out there that might affect what you’re doing.

Best
Jan
—
> 	• Most of the config is default.
> 	• Latency graph from munin monitoring attached (at the peak, there is an outage of the server caused by a kernel upgrade that went wrong)
> 
> Any help is appreciated.
> 
> -- 
> --
> 
> Roberto E. Iglesias