Posted to dev@couchdb.apache.org by Jan Lehnardt <ja...@apache.org> on 2015/07/26 14:47:46 UTC

[2.0] Replication Issues

Hey all,

I’m trying to upgrade a database from 1.6.1 to 2.0.0/master/0c579b98 and I’m seeing a number of issues.

Any help is greatly appreciated. Since this is our official upgrade path for 2.0, this has to be rock-solid.

Feel free to break out individual issues into new threads if that helps keep things organised.

Scroll down for detailed information about the database and machine configurations.


## The Scenario

Replication is running on 2.0, pulling from 1.6.1 over the EC2 internal IP address.

## The Issues

1. Repeated log entries for “write quorum for <targetdb> failed”. I’ve seen this in other contexts as well; why is it happening, and should it be?
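
For reference, here’s my mental model of the quorum check, as a sketch only: check_quorum/2 is a made-up name, not actual fabric code, and I’m assuming the default n=3 shard copies with a write quorum of w=2.

```erlang
%% Sketch only, not real fabric code; the names here are made up.
%% Assuming the default n=3 shard copies and write quorum w=2: the
%% coordinator counts acknowledgements from the copies, and if fewer
%% than W arrive before the timeout it logs the quorum failure (the
%% write is still accepted, just with weaker durability guarantees).
check_quorum(Acks, W) when Acks >= W -> ok;
check_quorum(_Acks, _W)              -> write_quorum_failed.
```

If that model is right, I don’t see why a mostly idle three-node dev cluster on one box would ever miss quorum.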


2. I’m getting a lot of “cassim_metadata_cache changes listener died” messages from all nodes, about every 5 seconds. What’s up with these?

 - 2015-07-26 08:30:34.400 [error] Undefined emulator Error in process <0.14633.26> on node 'node3@127.0.0.1' with exit value: {function_clause,[{cassim_metadata_cache,changes_callback,[waiting_for_updates,"0"]},{fabric_view_changes,keep_sending_changes,8},{fabric_view_changes,go,5}]}

 - 2015-07-26 08:30:39.401 [notice] node3@127.0.0.1 <0.314.0> cassim_metadata_cache changes listener died {function_clause,[{cassim_metadata_cache,changes_callback,[waiting_for_updates,"0"]},{fabric_view_changes,keep_sending_changes,8},{fabric_view_changes,go,5}]}
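
If I read the trace right, fabric_view_changes:keep_sending_changes/8 calls the callback with the bare atom waiting_for_updates, and cassim_metadata_cache:changes_callback/2 has no clause for it. A hypothetical sketch of that shape (made-up clauses, not the actual cassim source):

```erlang
%% Hypothetical illustration inferred from the stack trace above, not
%% the real cassim_metadata_cache module. Suppose the callback only
%% handles the message shapes it used to receive:
changes_callback({change, Row}, Acc)  -> {ok, [Row | Acc]};
changes_callback({stop, EndSeq}, Acc) -> {ok, {EndSeq, Acc}}.
%% The fabric changes feed now also invokes it as
%% changes_callback(waiting_for_updates, "0"); with no matching
%% clause, that raises function_clause and the listener dies, which
%% would explain a crash-and-restart roughly every 5 seconds.
```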


3. A number of replicator errors: request PUT to "http://0.0.0.0:15984/<target>/edbef049aae9c8828f336534984e5e4f" failed due to error {error,req_timedout}. This happens for regular docs, local docs, and _bulk_docs. The machine is basically idle (see below for details): the three beam.smp processes hover at 200-250% CPU each, and IO is 98% idle (it’s mostly logs being written).


4. Two issues from couch_replicator_api_wrap.erl:

 - 2015-07-26 08:22:49.849 [error] Undefined <0.3546.0> gen_server <0.3546.0> terminated with reason: no function clause matching couch_replicator_api_wrap:'-update_docs/4-fun-2-'(400, [{"Server","MochiWeb/1.0 (Any of you quaids got a smint?)"},{"Date","Sun, 26 Jul 2015 08:22:49 G..."},...], null, [<<"{\"_id\":\"12345678\",\"_rev\":\"1050-ee6c7d54276b43bc937470e44e0283f2\",...

 - 2015-07-26 08:30:08.514 [notice] node3@127.0.0.1 <0.6360.26> Retrying GET to http://172.31.10.115:5984/generic_db_name/12348765?revs=true&open_revs=%5B%228-b2826209867a286c76e6a2762f10b1e0%22%5D&latest=true in 1.0 seconds due to error {function_clause,[{couch_replicator_api_wrap,run_user_fun,4},{couch_replicator_api_wrap,receive_docs,4},{couch_replicator_api_wrap,receive_docs_loop,6},{couch_replicator_api_wrap,'-open_doc_revs/6-fun-4-',7}]}
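
Both of these look like the same pattern as issue 2: a handler fun that only pattern-matches the inputs it expects. A hypothetical sketch for the first trace (made-up function name, not the real couch_replicator_api_wrap source):

```erlang
%% Hypothetical illustration, not the actual replicator code. The
%% trace shows the fun inside update_docs/4 being applied to a 400
%% response; if its clauses only cover the success codes, e.g.
handle_update_docs_response(201, _Headers, _Body) -> ok;
handle_update_docs_response(202, _Headers, _Body) -> ok.
%% then the 400 Bad Request from the target matches nothing and the
%% gen_server terminates with function_clause, as in the first log
%% line above.
```

The second trace looks like the run_user_fun wrapper hitting the same kind of unmatched input while streaming open_revs results.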



5. Eventually, replication reliably stops with an “invalid_ejson” error, but I don’t yet know if that’s because of the api_wrap issue or something else.



6. Replication stopped numerous times before I got this far. I didn’t have time to look into why, but I have all the logs; they’re 130MB total, so going through them will take a while.


7. When replication ran, it moved about 1000 docs/s, which felt a little slow, though I have no baseline yet. (At that rate, the ~6.8M docs here take roughly 6,808,004 / 1000 ≈ 6,800 seconds, i.e. just under two hours.)


## Source Database Info

{
  "db_name": "generic_db_name",
  "doc_count": 6808004,
  "doc_del_count": 18856,
  "update_seq": 8044450,
  "purge_seq": 0,
  "compact_running": false,
  "disk_size": 16293904519,
  "data_size": 11711402577,
  "instance_start_time": "1437834202967309",
  "disk_format_version": 6,
  "committed_update_seq": 8044450
}

Mostly small-ish docs, no big outliers, no attachments. (data_size / doc_count ≈ 11.7GB / 6.8M ≈ 1.7KB per doc on average.)

Source machine info:

Amazon EC2 m3.xlarge: 4 cores, 64-bit, 16GB RAM, 100GB SSD, 3000 provisioned IOPS. FFM Availability Zone.

Standard EC2 Ubuntu, Erlang R16B03 (I know, but that’s not the problem here; this couch behaves fine).

Target machine info:

Amazon EC2 m4.10xlarge: 40 cores, 64-bit, 160GB RAM, 100GB SSD, 3000 IOPS (not provisioned), 10GigE networking, FFM AZ.

The latency between the two instances is very small, and network throughput is good (copying a file runs at 100-200MB/s).

Standard EC2 Amazon Linux (Red Hat/Fedora derivative), Erlang R14B04. CouchDB 2.0 is running via dev/run.


Thanks!
Jan
-- 


Re: [2.0] Replication Issues

Posted by Robert Newson <rn...@apache.org>.
Noting that the new buffering patch broke continuous and long-poll mode. I fixed both.



On 2 Aug 2015, at 01:27, Adam Kocoloski <ko...@apache.org> wrote:

> [...]
> 
> Has there been any further OOB progress on the issues Jan enumerated here?
> 
> Adam
> 

Re: [2.0] Replication Issues

Posted by Adam Kocoloski <ko...@apache.org>.
> On Jul 27, 2015, at 11:33 AM, Jan Lehnardt <ja...@apache.org> wrote:
> 
> [...]
> 
> Bob’s latest commits fixed the replication issue, but I’d love to hear about the other things I mentioned.
> 
> Best
> Jan
> —

Has there been any further OOB progress on the issues Jan enumerated here?

Adam


Re: [2.0] Replication Issues

Posted by Jan Lehnardt <ja...@apache.org>.
> On 27 Jul 2015, at 13:46, Jan Lehnardt <ja...@apache.org> wrote:
> 
> [...]
> 
> Bob says this should fix it: https://gist.github.com/rnewson/b9efd4f45e1c62315816
> 
> In the meantime, I reverted the changes-optimisation commit on fabric, and now I’m getting this once the replication has caught up with the existing update sequence and starts replicating new documents:
> 
> https://gist.github.com/janl/75804904dad73d17ed0e
> 
> While digging into this, I found out that there *are* a few small attachments in the source database; sorry for the confusion about this earlier.
> 
> I still see function_clause errors after the revert; Bob suggests waiting for Adam to comment.

Bob’s latest commits fixed the replication issue, but I’d love to hear about the other things I mentioned.

Best
Jan
-- 
Professional Support for Apache CouchDB:
http://www.neighbourhood.ie/couchdb-support/


Re: [2.0] Replication Issues

Posted by Jan Lehnardt <ja...@apache.org>.
> On 26 Jul 2015, at 19:03, Jan Lehnardt <ja...@apache.org> wrote:
> 
> [...]
> 
> Alexander pointed to https://github.com/apache/couchdb-fabric/commit/b6659c8344c9a028b5ab451be41a991801c2ab3d#diff-2af86e058b4e7a4a99a7c5a12da6debdR96 which is part of Adam’s recent work on COUCHDB-2724.
> 
> Adam, any insights? :)

Bob says this should fix it: https://gist.github.com/rnewson/b9efd4f45e1c62315816

In the meantime, I reverted the changes-optimisation commit on fabric, and now I’m getting this once the replication has caught up with the existing update sequence and starts replicating new documents:

https://gist.github.com/janl/75804904dad73d17ed0e

While digging into this, I found out that there *are* a few small attachments in the source database; sorry for the confusion about this earlier.

I still see function_clause errors after the revert; Bob suggests waiting for Adam to comment.

Best
Jan
-- 
Professional Support for Apache CouchDB:
http://www.neighbourhood.ie/couchdb-support/


Re: [2.0] Replication Issues

Posted by Jan Lehnardt <ja...@apache.org>.
> On 26 Jul 2015, at 14:47, Jan Lehnardt <ja...@apache.org> wrote:
> 
> [...]
> 
> 2. I’m getting a lot of “cassim_metadata_cache changes listener died” messages from all nodes, about every 5 seconds. What’s up with these?
> 
> - 2015-07-26 08:30:34.400 [error] Undefined emulator Error in process <0.14633.26> on node 'node3@127.0.0.1' with exit value: {function_clause,[{cassim_metadata_cache,changes_callback,[waiting_for_updates,"0"]},{fabric_view_changes,keep_sending_changes,8},{fabric_view_changes,go,5}]}
> 
> - 2015-07-26 08:30:39.401 [notice] node3@127.0.0.1 <0.314.0> cassim_metadata_cache changes listener died {function_clause,[{cassim_metadata_cache,changes_callback,[waiting_for_updates,"0"]},{fabric_view_changes,keep_sending_changes,8},{fabric_view_changes,go,5}]}

Alexander pointed to https://github.com/apache/couchdb-fabric/commit/b6659c8344c9a028b5ab451be41a991801c2ab3d#diff-2af86e058b4e7a4a99a7c5a12da6debdR96 which is part of Adam’s recent work on COUCHDB-2724.

Adam, any insights? :)

Best
Jan
-- 
Professional Support for Apache CouchDB:
http://www.neighbourhood.ie/couchdb-support/