Posted to user@couchdb.apache.org by Ciprian Trusca <CT...@totalsoft.ro> on 2014/12/05 08:51:56 UTC

RE: repeated compaction timeouts cause the server to shut down temporarily when replication is broken

We have turned on debugging for this test and it looks like the cause of this error is the _replicator database.  

After the list of fragmented databases is logged, we see no evidence that compaction for this database is ever started (even though its fragmentation is above the 70% threshold), and then the compaction loop dies after approximately 5 seconds. So I am guessing CouchDB fails to spawn the compaction process.
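
For anyone trying to reproduce this, here is a minimal sketch (Python; the URL and admin credentials are placeholders you would need to adjust) to double-check the fragmentation the daemon should be reacting to and to trigger a compaction of _replicator by hand:

import requests

COUCH = "http://localhost:5984"   # assumption: local node
AUTH = ("admin", "secret")        # assumption: your admin user/password

def fragmentation(db):
    # CouchDB 1.6 reports disk_size and data_size in the database info
    # document; the daemon's fragmentation metric is derived from these two.
    info = requests.get(f"{COUCH}/{db}", auth=AUTH).json()
    disk, data = info["disk_size"], info["data_size"]
    return (disk - data) / disk * 100

def compact(db):
    # Manually request the same database compaction the daemon would start.
    r = requests.post(f"{COUCH}/{db}/_compact", auth=AUTH,
                      headers={"Content-Type": "application/json"})
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    print("_replicator fragmentation: %.1f%%" % fragmentation("_replicator"))
    print(compact("_replicator"))  # expect {'ok': True}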

I forgot to mention in the first post that we are running CouchDB 1.6.1 on a CentOS 6.4 server.

Thanks for your time; any help will be appreciated.

-----Original Message-----
From: Ciprian Trusca [mailto:CTrusca@totalsoft.ro] 
Sent: Thursday, November 27, 2014 10:17 AM
To: user@couchdb.apache.org
Subject: repeated compaction timeouts cause the server to shut down temporarily when replication is broken

Hello all,
we have encountered the following situation during an overnight load test.

We get the following message repeatedly in the couch logs:

** Reason for termination ==
** {compaction_loop_died,
       {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}



At one point we get it three times within an interval of 5 seconds, and I am guessing this is what causes the supervisor to shut down temporarily:


[Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
                       {<0.93.0>,supervisor_report,
                        [{supervisor,{local,couch_secondary_services}},
                         {errorContext,shutdown},
                         {reason,reached_max_restart_intensity},
                         {offender,
                             [{pid,<0.10114.14>},
                              {name,compaction_daemon},
                              {mfargs,{couch_compaction_daemon,start_link,[]}},
                              {restart_type,permanent},
                              {shutdown,brutal_kill},
                              {child_type,worker}]}]}}
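
As I understand the OTP supervisor behaviour, reached_max_restart_intensity means the couch_secondary_services supervisor restarted compaction_daemon too many times within its allowed window and therefore shut itself down, which matches the temporary outage. To check from the outside whether a compaction of _replicator ever actually starts, a small polling script like this sketch can watch _active_tasks (same placeholder URL/credentials as above):

import time
import requests

COUCH = "http://localhost:5984"   # assumption: local node
AUTH = ("admin", "secret")        # assumption

def compaction_tasks():
    # _active_tasks lists running compactions as type "database_compaction".
    tasks = requests.get(f"{COUCH}/_active_tasks", auth=AUTH).json()
    return [t for t in tasks if t.get("type") == "database_compaction"]

while True:
    running = compaction_tasks()
    if running:
        for t in running:
            print(t.get("database"), t.get("progress"), "%")
    else:
        print("no database_compaction task running")
    time.sleep(5)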



In this particular component load test the CouchDB peer is shut down, so replication is broken. This means a lot of background processes try to replicate and die, and we have a thread that removes the failed replications and re-enables them (probably no longer a good idea, since CouchDB now detects on its own when the peer comes back online). I suspect this might be related.
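
Roughly, that re-enable thread does something like the following (a simplified, illustrative sketch, not our production code; it relies on CouchDB 1.6 marking a permanently failed replication document with _replication_state == "error"):

import requests

COUCH = "http://localhost:5984"   # assumption: local node
AUTH = ("admin", "secret")        # assumption

def recreate_failed_replications():
    rows = requests.get(f"{COUCH}/_replicator/_all_docs",
                        params={"include_docs": "true"}, auth=AUTH).json()["rows"]
    for row in rows:
        if row["id"].startswith("_design/"):
            continue
        doc = row["doc"]
        if doc.get("_replication_state") != "error":
            continue
        # Delete the failed replication document ...
        requests.delete(f"{COUCH}/_replicator/{doc['_id']}",
                        params={"rev": doc["_rev"]}, auth=AUTH)
        # ... and put back a clean copy (without _rev and the _replication_*
        # bookkeeping fields) so the replication is picked up again.
        fresh = {k: v for k, v in doc.items()
                 if k != "_rev" and not k.startswith("_replication")}
        requests.put(f"{COUCH}/_replicator/{doc['_id']}", json=fresh, auth=AUTH)

If that loop is still running during the test, the constant delete/re-create cycle on _replicator may also help explain why that database fragments so quickly.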



In the Zenoss graphs we see a very significant spike in I/O reads/writes at that moment.



Thank you very much for your time, and any hint will be appreciated.

RE: repeated compaction timeouts cause the server to shut down temporarily when replication is broken

Posted by Ciprian Trusca <CT...@totalsoft.ro>.
Thanks Jan, I have opened https://issues.apache.org/jira/browse/COUCHDB-2496. Please let me know if extra information needs to be added.


Re: repeated compaction timeouts cause the server to shut down temporarily when replication is broken

Posted by Jan Lehnardt <ja...@apache.org>.
Heya Ciprian,

This sounds like a bug, could you file an issue on https://issues.apache.org/jira/browse/COUCHDB?

Best
Jan
--
