Posted to user@couchdb.apache.org by Simon Eisenmann <si...@struktur.de> on 2009/10/19 15:28:59 UTC

Replication hangs

Hi,

I am currently trying to set up a CouchDB cluster with three nodes, each
replicating changes to the others. Basically this works just fine for a
while, with only a couple of changes per minute on every node.

However, at some point (after a couple of hours) the replication processes
just hang with a status like "W Processed source update #12336" (as shown
in Futon). The only way to get replication running again is to restart
CouchDB. This does not happen on all nodes at once: usually it begins on
one node, and some time later the other nodes also have hanging
replication processes.

So the question is: how stable is replication supposed to be in a
bidirectional scenario, with multiple changes and multiple replications
coming in and going out at the same time? Is there a recommended way to
set up a multi-master replicating environment?

Thanks for any hints.

Best regards
Simon




ps: I am using the 0.10.0 release with Erlang R13B01 on Ubuntu Linux
64-bit.



-- 
Simon Eisenmann

[ mailto:simon@struktur.de ]

[ struktur AG | Kronenstraße 22a | D-70173 Stuttgart ]
[ T. +49.711.896656.0 | F.+49.711.89665610 ]
[ http://www.struktur.de | mailto:info@struktur.de ]

Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Am Montag, den 19.10.2009, 11:00 -0400 schrieb Adam Kocoloski:
> Hi Simon, I'm not sure I follow why chunked responses require new  
> connections, but in any event, I'm fairly certain that responses to  
> _replicate are not chunked.

Hi Adam,

Well, AFAIK chunked connections cannot be reused. At least when I try, I
get a bad status line from the server. This might be another CouchDB
issue, though: the chunked connections keep answering with empty lines
instead of returning a new HTTP status line. I am not deep enough into
the HTTP protocol to know whether that is intentional.

> (Digression more appropriate for dev@) Does the client really have any
> control over whether the responses are chunked?  I guess the "right"
> way to force no chunking from the client side would be to make an
> HTTP/1.0 request (with Connection: keep-alive if you still want the
> connection pool).  We should check at some point to see if that
> actually works.

Ok, I see your point. Though I am using a single connection pool for all
couch interaction. Of course I can use a different one for the
replication stuff.

Simon

> 
> Adam

Re: Replication hangs

Posted by Adam Kocoloski <ko...@apache.org>.
On Oct 19, 2009, at 10:45 AM, Simon Eisenmann wrote:

> Am Montag, den 19.10.2009, 10:37 -0400 schrieb Adam Kocoloski:
>> So, until JIRA comes back online I'll follow up with that here.  I
>> think I could see how repeated pull replications in rapid succession
>> could end up blowing through sockets.  Each pull replication sets up
>> one new connection for the _changes feed, and tears it down at the
>> end
>> (everything else replication-related goes through a connection
>> pool).
>> Do enough of those very short requests and you could end up with
>> lots
>> of connections in TIME_WAIT and eventually run out of sockets.
>> FWIW,
>> the default Erlang limit is slightly less than 1024.  If your
>> update_notification process uses a new connection for every POST to
>> _replicate you'll hit the system limit (also 1024 in Ubuntu IIRC)
>> twice as fast.
>
> I see what you mean. Though I am using a connection pool in the update
> notification as well. I verified that I am not leaking connections on
> the client side. The only reason I have to throw away and open a new
> connection is if the response is returned chunked.
>
> I see the reasons for chunked transfer encoding. Is there a way to
> disable it on connections so I can make sure I have a fixed connection
> pool size on the client?

Hi Simon, I'm not sure I follow why chunked responses require new  
connections, but in any event, I'm fairly certain that responses to  
_replicate are not chunked.

(Digression more appropriate for dev@) Does the client really have any
control over whether the responses are chunked?  I guess the "right"
way to force no chunking from the client side would be to make an
HTTP/1.0 request (with Connection: keep-alive if you still want the
connection pool).  We should check at some point to see if that
actually works.
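A sketch of that HTTP/1.0 idea, building the raw request by hand. The
host names and the replication body below are made up for illustration;
whether CouchDB 0.10 actually honours keep-alive on an HTTP/1.0 request
is exactly the open question above.

```python
import json

def build_replicate_request(host, source, target):
    """Build a raw HTTP/1.0 POST to /_replicate.

    HTTP/1.0 has no chunked transfer encoding, so a compliant server
    must delimit its response with Content-Length; Connection:
    keep-alive asks it to keep the socket open for reuse anyway.
    """
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    head = (
        "POST /_replicate HTTP/1.0\r\n"
        "Host: %s\r\n"
        "Connection: keep-alive\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (host, len(body))
    ).encode("ascii")
    return head + body

request = build_replicate_request(
    "node1:5984", "http://node2:5984/mydb", "mydb")
```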

Adam

Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Am Montag, den 19.10.2009, 10:37 -0400 schrieb Adam Kocoloski:
> So, until JIRA comes back online I'll follow up with that here.  I  
> think I could see how repeated pull replications in rapid succession  
> could end up blowing through sockets.  Each pull replication sets up  
> one new connection for the _changes feed, and tears it down at the
> end  
> (everything else replication-related goes through a connection
> pool).   
> Do enough of those very short requests and you could end up with
> lots  
> of connections in TIME_WAIT and eventually run out of sockets.
> FWIW,  
> the default Erlang limit is slightly less than 1024.  If your  
> update_notification process uses a new connection for every POST to  
> _replicate you'll hit the system limit (also 1024 in Ubuntu IIRC)  
> twice as fast.

I see what you mean. Though I am using a connection pool in the update
notification as well. I verified that I am not leaking connections on
the client side. The only reason I have to throw away and open a new
connection is if the response is returned chunked.

I see the reasons for chunked transfer encoding. Is there a way to
disable it on connections so I can make sure I have a fixed connection
pool size on the client?

> Continuous replication is really our preferred solution for your  
> scenario.  If you can live with interpreting the records in the
> _local  
> document to verify that it's still running you'll end up with a more  
> efficient replication system all around.

Yes, I think that should be possible. I will probably switch to that approach.

> Regarding the hangs, if you do write a test script I'll be more than  
> happy to try it and figure out what's going wrong.  Best,

I will do my best to make one available as soon as possible.

Best regards
Simon


> 
> Adam

Re: Replication hangs

Posted by Paul Davis <pa...@gmail.com>.
> So, until JIRA comes back online I'll follow up with that here.  I think I
> could see how repeated pull replications in rapid succession could end up
> blowing through sockets.  Each pull replication sets up one new connection
> for the _changes feed, and tears it down at the end (everything else
> replication-related goes through a connection pool).  Do enough of those
> very short requests and you could end up with lots of connections in
> TIME_WAIT and eventually run out of sockets.  FWIW, the default Erlang limit
> is slightly less than 1024.  If your update_notification process uses a new
> connection for every POST to _replicate you'll hit the system limit (also
> 1024 in Ubuntu IIRC) twice as fast.

A quick way to test this is to hop on a machine just after it
gets wedged and look at the output of:

$ sudo netstat -tanp

If you see a whole bunch of sockets in TIME_WAIT, then this is probably
your problem.
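If shell access at the right moment is awkward, the same tally can be
done in a few lines. The sample below is fabricated netstat-style
output, and the ~1024 figure is the default Erlang socket limit
mentioned earlier in the thread.

```python
from collections import Counter

def tcp_state_histogram(netstat_output):
    """Tally TCP connection states from `netstat -tan`-style output.

    A TIME_WAIT count approaching the ~1024 default Erlang socket
    limit would point at socket exhaustion as the cause of the hang.
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol name; state is the 6th column.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states

sample = (
    "tcp 0 0 127.0.0.1:5984 127.0.0.1:49152 TIME_WAIT\n"
    "tcp 0 0 127.0.0.1:5984 127.0.0.1:49153 TIME_WAIT\n"
    "tcp 0 0 127.0.0.1:5984 127.0.0.1:49154 ESTABLISHED\n"
)
```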

Paul Davis

Re: Replication hangs

Posted by Adam Kocoloski <ko...@apache.org>.
On Oct 19, 2009, at 10:14 AM, Simon Eisenmann wrote:

> Am Montag, den 19.10.2009, 10:04 -0400 schrieb Adam Kocoloski:
>> On Oct 19, 2009, at 10:00 AM, Simon Eisenmann wrote:
>>
>>> Paul,
>>>
>>> Am Montag, den 19.10.2009, 09:53 -0400 schrieb Paul Davis:
>>>> Hmmm, that sounds most odd. Are there any consistencies on when it
>>>> hangs? Specifically, does it look like its a poison doc that causes
>>>> things to go wonky or some such? Do nodes fail in a specific order?
>>>
>>> The only specificness i see is that somehow the slowest node never
>>> seems
>>> to fail. The other two nodes have roughly the same performance.
>>>
>>>> Also, you might try setting up the continuous replication instead  
>>>> of
>>>> the update notifications as that might be a bit more ironed out.
>>>
>>> I already have considered that, though as long there is no way to
>>> figure
>>> out if a continous replication is still up and running i cannot use
>>> it,
>>> cause i have to restart it when a node fails and comes up again  
>>> later.
>>>
>>>> Another thing to check is if its just the task status that's  
>>>> wonky vs
>>>> actual replication. You can check the _local doc that's created by
>>>> replication to see if its update seq is changing while task  
>>>> statuses
>>>> aren't.
>>>
>>> If only the status would hang, i should be able to start up the
>>> replication again correct? Though this hangs as well.
>>
>> Hi Simon, is this hang related to the accept_failed bug report you
>> just filed[1], or is it separate? Best,
>>
>> Adam
>>
>> [1]: https://issues.apache.org/jira/browse/COUCHDB-536
>
> Hi Adam,
>
> i would consider it separate. The accept_failed issue happens only  
> when
> having lots and lots of changes
>
> (essentially while True { put couple of docs, query views, delete  
> docs})
>
> Simon

So, until JIRA comes back online I'll follow up with that here.  I  
think I could see how repeated pull replications in rapid succession  
could end up blowing through sockets.  Each pull replication sets up  
one new connection for the _changes feed, and tears it down at the end  
(everything else replication-related goes through a connection pool).   
Do enough of those very short requests and you could end up with lots  
of connections in TIME_WAIT and eventually run out of sockets.  FWIW,  
the default Erlang limit is slightly less than 1024.  If your  
update_notification process uses a new connection for every POST to  
_replicate you'll hit the system limit (also 1024 in Ubuntu IIRC)  
twice as fast.

Continuous replication is really our preferred solution for your  
scenario.  If you can live with interpreting the records in the _local  
document to verify that it's still running you'll end up with a more  
efficient replication system all around.

Regarding the hangs, if you do write a test script I'll be more than  
happy to try it and figure out what's going wrong.  Best,

Adam


Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Am Montag, den 19.10.2009, 10:04 -0400 schrieb Adam Kocoloski:
> On Oct 19, 2009, at 10:00 AM, Simon Eisenmann wrote:
> 
> > Paul,
> >
> > Am Montag, den 19.10.2009, 09:53 -0400 schrieb Paul Davis:
> >> Hmmm, that sounds most odd. Are there any consistencies on when it
> >> hangs? Specifically, does it look like its a poison doc that causes
> >> things to go wonky or some such? Do nodes fail in a specific order?
> >
> > The only specificness i see is that somehow the slowest node never  
> > seems
> > to fail. The other two nodes have roughly the same performance.
> >
> >> Also, you might try setting up the continuous replication instead of
> >> the update notifications as that might be a bit more ironed out.
> >
> > I already have considered that, though as long there is no way to  
> > figure
> > out if a continous replication is still up and running i cannot use  
> > it,
> > cause i have to restart it when a node fails and comes up again later.
> >
> >> Another thing to check is if its just the task status that's wonky vs
> >> actual replication. You can check the _local doc that's created by
> >> replication to see if its update seq is changing while task statuses
> >> aren't.
> >
> > If only the status would hang, i should be able to start up the
> > replication again correct? Though this hangs as well.
> 
> Hi Simon, is this hang related to the accept_failed bug report you  
> just filed[1], or is it separate? Best,
> 
> Adam
> 
> [1]: https://issues.apache.org/jira/browse/COUCHDB-536

Hi Adam,

i would consider it separate. The accept_failed issue happens only when
having lots and lots of changes 

(essentially while True { put couple of docs, query views, delete docs})

Simon


Re: Replication hangs

Posted by Adam Kocoloski <ko...@apache.org>.
On Oct 19, 2009, at 10:00 AM, Simon Eisenmann wrote:

> Paul,
>
> Am Montag, den 19.10.2009, 09:53 -0400 schrieb Paul Davis:
>> Hmmm, that sounds most odd. Are there any consistencies on when it
>> hangs? Specifically, does it look like its a poison doc that causes
>> things to go wonky or some such? Do nodes fail in a specific order?
>
> The only specificness i see is that somehow the slowest node never  
> seems
> to fail. The other two nodes have roughly the same performance.
>
>> Also, you might try setting up the continuous replication instead of
>> the update notifications as that might be a bit more ironed out.
>
> I already have considered that, though as long there is no way to  
> figure
> out if a continous replication is still up and running i cannot use  
> it,
> cause i have to restart it when a node fails and comes up again later.
>
>> Another thing to check is if its just the task status that's wonky vs
>> actual replication. You can check the _local doc that's created by
>> replication to see if its update seq is changing while task statuses
>> aren't.
>
> If only the status would hang, i should be able to start up the
> replication again correct? Though this hangs as well.

Hi Simon, is this hang related to the accept_failed bug report you  
just filed[1], or is it separate? Best,

Adam

[1]: https://issues.apache.org/jira/browse/COUCHDB-536

Re: Replication hangs

Posted by Adam Kocoloski <ko...@apache.org>.
On Oct 19, 2009, at 10:04 AM, Paul Davis wrote:

> Simon,
>
>>> Also, you might try setting up the continuous replication instead of
>>> the update notifications as that might be a bit more ironed out.
>>
>> I already have considered that, though as long there is no way to  
>> figure
>> out if a continous replication is still up and running i cannot use  
>> it,
>> cause i have to restart it when a node fails and comes up again  
>> later.
>
> Hmm. Doesn't the _local doc for the continuous replication show if its
> still in progress? Oh, though it might not have a specific flag
> indicating as such.

Well, in a fuzzy sort of way, yes.  It'll tell you the timestamp and  
sequence number of the last recorded checkpoint.  If you have a system  
with constant updates, this would be sufficient to know that the  
replication is running -- checkpoints are saved every 5 seconds.
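That fuzzy check can be made concrete by polling the checkpoint record
twice, a few checkpoint intervals apart. The `last_seq` field name below
is a placeholder; the actual layout of the _local replication document
may differ by CouchDB version.

```python
def replication_alive(prev_snapshot, curr_snapshot):
    """Compare two snapshots of the replication checkpoint record.

    Each snapshot is a dict like {"last_seq": <int>} extracted from
    the _local replication document.  With constant updates and
    checkpoints saved every 5 seconds, a sequence number that has not
    moved between polls means replication has stopped making progress.
    """
    return curr_snapshot["last_seq"] > prev_snapshot["last_seq"]
```

For example, `replication_alive({"last_seq": 12336}, {"last_seq": 12402})`
indicates a healthy replication, while two identical snapshots taken well
apart suggest it is wedged and should be restarted.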

Adam

Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Am Mittwoch, den 21.10.2009, 09:08 -0400 schrieb Adam Kocoloski:
> > Though in the logs i now see lots of
> >
> > [info] [<0.164.0>] A server has restarted sinced replication
> start.  
> > Not
> > recording the new sequence number to ensure the replication is
> redone
> > and documents reexamined.
> >
> > Messages. I posted this in IRC yesterday and was told that this is
> > nothing to worry about. So what exactly does it mean and why it is
> > logged with info level when it can be ignored?
> >
> > If that message is nothing critical i would suggest to log it with  
> > debug
> > level, as it is shown at any replication checkpoint on any node as  
> > soon
> > as one of the other nodes was offline.
> 
> So, what we're trying to do here is avoid skipping updates from the  
> source server.  Consider the following sequence of events:
> 
> 1) Save some docs to the source with delayed_commits=true
> 2) Replicating source -> target
> 3) Restart source before full commit, losing the updates that have  
> replicated
> 4) Save more docs to source, overwriting previously used sequence  
> numbers
> 
> If that happens, we don't want the replicator to skip the new docs  
> that have been saved in step 4.  So if we detect that a server  
> restarted, we play it safe and don't checkpoint, so that the next  
> replication will re-examine the sequence.  An analogous situation  
> could happen with the target losing updates that the replicator had  
> written (but not fully committed).
> 
> Skipping checkpointing altogether for the remainder of the
> replication  
> is an overly conservative position.  In my opinion what we should do  
> when we detect this condition is restart the replication immediately  
> from the last known checkpoint.  Then you'd see one of these [info]  
> level messages telling you that the replicator is going to restart
> to  
> double-check some sequence numbers, and that's it.

Ok, understood. Thanks for the explanation. If that behaviour were only
executed once I would be absolutely fine. But with the current
implementation this is done forever, and replication never seems to
switch back to normal mode.

Best regards
Simon



Re: Replication hangs

Posted by Jan Lehnardt <ja...@apache.org>.
On 21 Oct 2009, at 15:08, Adam Kocoloski wrote:

> On Oct 21, 2009, at 4:23 AM, Simon Eisenmann wrote:
>
>> Hi,
>>
>> Am Montag, den 19.10.2009, 10:04 -0400 schrieb Paul Davis:
>>>>> Also, you might try setting up the continuous replication instead
>>> of
>>>>> the update notifications as that might be a bit more ironed out.
>>>>
>>>> I already have considered that, though as long there is no way to
>>> figure
>>>> out if a continous replication is still up and running i cannot use
>>> it,
>>>> cause i have to restart it when a node fails and comes up again
>>> later.
>>>
>>> Hmm. Doesn't the _local doc for the continuous replication show if  
>>> its
>>> still in progress? Oh, though it might not have a specific flag
>>> indicating as such.
>>
>> I changed the system to use continuous replication and checkin the
>> _local doc to make sure it's still running. That way everything works
>> fine and i cannot reproduce any hangs.
>>
>> Though in the logs i now see lots of
>>
>> [info] [<0.164.0>] A server has restarted sinced replication start.  
>> Not
>> recording the new sequence number to ensure the replication is redone
>> and documents reexamined.
>>
>> Messages. I posted this in IRC yesterday and was told that this is
>> nothing to worry about. So what exactly does it mean and why it is
>> logged with info level when it can be ignored?
>>
>> If that message is nothing critical i would suggest to log it with  
>> debug
>> level, as it is shown at any replication checkpoint on any node as  
>> soon
>> as one of the other nodes was offline.
>
> So, what we're trying to do here is avoid skipping updates from the  
> source server.  Consider the following sequence of events:
>
> 1) Save some docs to the source with delayed_commits=true
> 2) Replicating source -> target
> 3) Restart source before full commit, losing the updates that have  
> replicated
> 4) Save more docs to source, overwriting previously used sequence  
> numbers
>
> If that happens, we don't want the replicator to skip the new docs  
> that have been saved in step 4.  So if we detect that a server  
> restarted, we play it safe and don't checkpoint, so that the next  
> replication will re-examine the sequence.  An analogous situation  
> could happen with the target losing updates that the replicator had  
> written (but not fully committed).
>
> Skipping checkpointing altogether for the remainder of the  
> replication is an overly conservative position.  In my opinion what  
> we should do when we detect this condition is restart the  
> replication immediately from the last known checkpoint.  Then you'd  
> see one of these [info] level messages telling you that the  
> replicator is going to restart to double-check some sequence  
> numbers, and that's it.
>
> Best, Adam

Adam, this mail is great Wiki material. Can you (or anyone) find a  
place for it on the wiki for future reference?

Cheers
Jan
--


Re: Replication hangs

Posted by Adam Kocoloski <ko...@apache.org>.
On Oct 21, 2009, at 4:23 AM, Simon Eisenmann wrote:

> Hi,
>
> Am Montag, den 19.10.2009, 10:04 -0400 schrieb Paul Davis:
>>>> Also, you might try setting up the continuous replication instead
>> of
>>>> the update notifications as that might be a bit more ironed out.
>>>
>>> I already have considered that, though as long there is no way to
>> figure
>>> out if a continous replication is still up and running i cannot use
>> it,
>>> cause i have to restart it when a node fails and comes up again
>> later.
>>
>> Hmm. Doesn't the _local doc for the continuous replication show if  
>> its
>> still in progress? Oh, though it might not have a specific flag
>> indicating as such.
>
> I changed the system to use continuous replication and checkin the
> _local doc to make sure it's still running. That way everything works
> fine and i cannot reproduce any hangs.
>
> Though in the logs i now see lots of
>
> [info] [<0.164.0>] A server has restarted sinced replication start.  
> Not
> recording the new sequence number to ensure the replication is redone
> and documents reexamined.
>
> Messages. I posted this in IRC yesterday and was told that this is
> nothing to worry about. So what exactly does it mean and why it is
> logged with info level when it can be ignored?
>
> If that message is nothing critical i would suggest to log it with  
> debug
> level, as it is shown at any replication checkpoint on any node as  
> soon
> as one of the other nodes was offline.

So, what we're trying to do here is avoid skipping updates from the  
source server.  Consider the following sequence of events:

1) Save some docs to the source with delayed_commits=true
2) Replicating source -> target
3) Restart source before full commit, losing the updates that have  
replicated
4) Save more docs to source, overwriting previously used sequence  
numbers

If that happens, we don't want the replicator to skip the new docs  
that have been saved in step 4.  So if we detect that a server  
restarted, we play it safe and don't checkpoint, so that the next  
replication will re-examine the sequence.  An analogous situation  
could happen with the target losing updates that the replicator had  
written (but not fully committed).

Skipping checkpointing altogether for the remainder of the replication  
is an overly conservative position.  In my opinion what we should do  
when we detect this condition is restart the replication immediately  
from the last known checkpoint.  Then you'd see one of these [info]  
level messages telling you that the replicator is going to restart to  
double-check some sequence numbers, and that's it.
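A toy model of why the stale checkpoint would lose documents (doc ids
and sequence numbers invented for illustration; this is not CouchDB's
actual bookkeeping):

```python
def skipped_docs(source_after_restart, checkpointed_seq, on_target):
    """Docs a replicator would never transfer if it blindly resumed
    from a checkpoint recorded before the source restarted.

    source_after_restart maps sequence number -> doc id.  Docs written
    after the restart (step 4) can reuse sequence numbers at or below
    the stale checkpoint, so resuming past it skips them forever.
    """
    return [doc for seq, doc in sorted(source_after_restart.items())
            if seq <= checkpointed_seq and doc not in on_target]

# Steps 1-2: docs a, b, c replicated at seqs 1-3, checkpoint at seq 3.
# Step 3: source restarts and loses the uncommitted b and c.
# Step 4: new docs x and y reuse seqs 2 and 3.
source = {1: "a", 2: "x", 3: "y"}
print(skipped_docs(source, checkpointed_seq=3, on_target={"a", "b", "c"}))
# -> ['x', 'y']: discarding the checkpoint forces these to be re-examined
```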

Best, Adam

Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Hi,

Am Montag, den 19.10.2009, 10:04 -0400 schrieb Paul Davis:
> >> Also, you might try setting up the continuous replication instead
> of
> >> the update notifications as that might be a bit more ironed out.
> >
> > I already have considered that, though as long there is no way to
> figure
> > out if a continous replication is still up and running i cannot use
> it,
> > cause i have to restart it when a node fails and comes up again
> later.
> 
> Hmm. Doesn't the _local doc for the continuous replication show if its
> still in progress? Oh, though it might not have a specific flag
> indicating as such.

I changed the system to use continuous replication and check the
_local doc to make sure it's still running. That way everything works
fine and I cannot reproduce any hangs.

In the logs, though, I now see lots of these messages:

[info] [<0.164.0>] A server has restarted sinced replication start. Not
recording the new sequence number to ensure the replication is redone
and documents reexamined.

I posted this on IRC yesterday and was told that it is nothing to worry
about. So what exactly does it mean, and why is it logged at info level
if it can be ignored?

If that message is nothing critical, I would suggest logging it at debug
level, as it is shown at every replication checkpoint on any node as soon
as one of the other nodes has been offline.


Best regards
Simon







Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Am Montag, den 19.10.2009, 10:19 -0400 schrieb Paul Davis:
> >> Touché. Can you perhaps increase the logging level to see if it prints
> >> something useful as to what its doing before getting wedged?
> >
> > I am currently running in debug level. Is there any more detailed log
> > level?
> 
> Noper, that's as good as it gets. Is it possible to reproduce this on
> empty databases with some test scripts?
> 
> Paul

Basically, to reproduce this, an update notification process is required
which tells all the other nodes to do a pull replication from the node
which has just changed. To have constant replication I simply write once
per minute to each of the databases (from a thread of the update
notification process).
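The setup described above amounts to something like the following; the
node addresses and database name are hypothetical, and the real script
would POST each body to the listed URL:

```python
import json

NODES = ["node1:5984", "node2:5984", "node3:5984"]  # hypothetical cluster

def pull_replication_jobs(changed_node, db, nodes=NODES):
    """When `db` changes on `changed_node`, every other node should
    POST the returned body to its own /_replicate endpoint, pulling
    from the node that just changed."""
    body = json.dumps(
        {"source": "http://%s/%s" % (changed_node, db), "target": db})
    return [("http://%s/_replicate" % node, body)
            for node in nodes if node != changed_node]

jobs = pull_replication_jobs("node1:5984", "mydb")
```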

Let me see if I can provide an as-simple-as-possible test update
notification script to reproduce it.

I will also check whether it happens with continuous replication as well.


Simon


Re: Replication hangs

Posted by Paul Davis <pa...@gmail.com>.
Simon,

>> Touché. Can you perhaps increase the logging level to see if it prints
>> something useful as to what its doing before getting wedged?
>
> I am currently running in debug level. Is there any more detailed log
> level?

Nope, that's as good as it gets. Is it possible to reproduce this on
empty databases with some test scripts?

Paul

Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Am Montag, den 19.10.2009, 10:04 -0400 schrieb Paul Davis:
> Simon,
> 
> >> Also, you might try setting up the continuous replication instead of
> >> the update notifications as that might be a bit more ironed out.
> >
> > I already have considered that, though as long there is no way to figure
> > out if a continous replication is still up and running i cannot use it,
> > cause i have to restart it when a node fails and comes up again later.
> 
> Hmm. Doesn't the _local doc for the continuous replication show if its
> still in progress? Oh, though it might not have a specific flag
> indicating as such.

Ok, I will dig into that. Thanks for the suggestion.

> >> Another thing to check is if its just the task status that's wonky vs
> >> actual replication. You can check the _local doc that's created by
> >> replication to see if its update seq is changing while task statuses
> >> aren't.
> >
> > If only the status would hang, i should be able to start up the
> > replication again correct? Though this hangs as well.
> >
> 
> Touché. Can you perhaps increase the logging level to see if it prints
> something useful as to what its doing before getting wedged?

I am currently running at debug level. Is there any more detailed log
level?


Best regards
Simon

> 
> Paul Davis

Re: Replication hangs

Posted by Paul Davis <pa...@gmail.com>.
Simon,

>> Also, you might try setting up the continuous replication instead of
>> the update notifications as that might be a bit more ironed out.
>
> I already have considered that, though as long there is no way to figure
> out if a continous replication is still up and running i cannot use it,
> cause i have to restart it when a node fails and comes up again later.

Hmm. Doesn't the _local doc for the continuous replication show whether
it's still in progress? Oh, though it might not have a specific flag
indicating as much.

>> Another thing to check is if its just the task status that's wonky vs
>> actual replication. You can check the _local doc that's created by
>> replication to see if its update seq is changing while task statuses
>> aren't.
>
> If only the status would hang, i should be able to start up the
> replication again correct? Though this hangs as well.
>

Touché. Can you perhaps increase the logging level to see if it prints
something useful about what it's doing before getting wedged?

Paul Davis

Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Paul,

Am Montag, den 19.10.2009, 09:53 -0400 schrieb Paul Davis:
> Hmmm, that sounds most odd. Are there any consistencies on when it
> hangs? Specifically, does it look like its a poison doc that causes
> things to go wonky or some such? Do nodes fail in a specific order?

The only pattern I see is that somehow the slowest node never seems
to fail. The other two nodes have roughly the same performance.

> Also, you might try setting up the continuous replication instead of
> the update notifications as that might be a bit more ironed out.

I have already considered that, though as long as there is no way to
figure out whether a continuous replication is still up and running I
cannot use it, because I have to restart it when a node fails and comes
up again later.

> Another thing to check is if its just the task status that's wonky vs
> actual replication. You can check the _local doc that's created by
> replication to see if its update seq is changing while task statuses
> aren't.

If only the status were hanging, I should be able to start up the
replication again, correct? Though this hangs as well.



Re: Replication hangs

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Oct 19, 2009 at 9:48 AM, Simon Eisenmann <si...@struktur.de> wrote:
> Hi Paul,
>
> thanks for your feedback!
>
> Am Montag, den 19.10.2009, 09:40 -0400 schrieb Paul Davis:
>> Are there any tracebacks in the logs that you can paste? I don't think
>> I've heard of replication getting wedged without some sort of
>> feedback.
>
> Unfortunately there was no error in the logs or on stderr. Also any
> further replication request does hang as well (never completes). The
> last entry is always "recording a checkpoint at source update_seq ...".
>
> Please note that this is reproduceable, means it happens all the time
> though the time frame varies.
>
>> Also, are you using continuous replication then? I do know that just
>> before the 0.10.0 release that Adam Kocoloski and Robert Newson spent
>> a good amount of time getting star (all nodes replicate continuously
> to all others) kinks ironed out. Or maybe it was a ring. I dunno, but
>> there was work on something like that.
>
> I am not using continuous replication but an update notification process
> triggering pull replication on the other nodes from the database which
> was changed. Your point regarding rings is interesting. In general that
> would explain it. Though in the case of a ring I would have multiple
> hanging replications at the same time, correct? It always starts with one
> direction hanging. The other direction usually works just fine until it
> hangs some time (hours) later.
>
> Also i have tested this with a couple of SVN revisions before the 10.0
> release and things improved a lot since the first tests. Though now i
> have much more data database update sequence in millions range.
>
> Best regards
> Simon
>
>
>
>>
>> Paul Davis
> --
> Simon Eisenmann
>
> [ mailto:simon@struktur.de ]
>
> [ struktur AG | Kronenstraße 22a | D-70173 Stuttgart ]
> [ T. +49.711.896656.68 | F.+49.711.89665610 ]
> [ http://www.struktur.de | mailto:info@struktur.de ]
>

Simon,

Hmmm, that sounds most odd. Is there any consistency in when it hangs?
Specifically, does it look like it's a poison doc that causes things to
go wonky, or some such? Do nodes fail in a specific order?

Also, you might try setting up the continuous replication instead of
the update notifications as that might be a bit more ironed out.
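For reference, continuous replication is started with a single POST to
`/_replicate`; the node and database names below are just placeholders. A
minimal sketch using only the Python standard library:

```python
import json
import urllib.request

def continuous_replication_request(source, target):
    """Build the JSON body for a continuous replication via POST /_replicate."""
    return json.dumps({"source": source, "target": target, "continuous": True})

def start_replication(couch_url, source, target):
    """POST the replication request to a CouchDB node."""
    body = continuous_replication_request(source, target).encode("utf-8")
    req = urllib.request.Request(
        couch_url + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# Example: have node1 pull changes for "mydb" from node2 continuously
# start_replication("http://node1:5984",
#                   "http://node2:5984/mydb", "mydb")
```

CouchDB keeps the replication running and picks up new changes as they
arrive, which avoids firing a fresh replication per update notification.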

Another thing to check is whether it's just the task status that's
wonky vs actual replication. You can check the _local doc that's
created by replication to see if its update seq is changing while task
statuses aren't.
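A rough way to automate that check — note the checkpoint doc id is a hash
of the replication spec that you would need to look up, and the
`source_last_seq` field name is an assumption here:

```python
import json
import time
import urllib.request

def checkpoint_url(couch_url, db, rep_id):
    """URL of the _local checkpoint doc ('%2F' escapes the slash in '_local/<id>')."""
    return f"{couch_url}/{db}/_local%2F{rep_id}"

def checkpoint_seq(doc):
    """Source sequence recorded in a checkpoint doc (field name assumed)."""
    return doc.get("source_last_seq")

def is_replication_advancing(couch_url, db, rep_id, wait=60):
    """Poll the checkpoint twice; if the recorded seq moved, replication is
    making progress even when the Futon task status looks stuck."""
    def fetch():
        with urllib.request.urlopen(checkpoint_url(couch_url, db, rep_id)) as resp:
            return checkpoint_seq(json.loads(resp.read()))
    first = fetch()
    time.sleep(wait)
    return fetch() != first
```

If the sequence keeps advancing, only the task display is stale; if it
stops, the replication process itself is wedged.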

Paul Davis

Re: Replication hangs

Posted by Simon Eisenmann <si...@struktur.de>.
Hi Paul,

thanks for your feedback!

On Monday, 2009-10-19, at 09:40 -0400, Paul Davis wrote:
> Are there any tracebacks in the logs that you can paste? I don't think
> I've heard of replication getting wedged without some sort of
> feedback.

Unfortunately there was no error in the logs or on stderr. Any further
replication request hangs as well (it never completes). The last log
entry is always "recording a checkpoint at source update_seq ...".

Please note that this is reproducible; it happens every time, though
the time frame varies.

> Also, are you using continuous replication then? I do know that just
> before the 0.10.0 release that Adam Kocoloski and Robert Newson spent
> a good amount of time getting star (all nodes replicate continuously
> to all others) kinks ironed out. Or maybe it was a ring. I dunno, but
> there was work on something like that.

I am not using continuous replication, but an update notification
process that triggers pull replication on the other nodes from the
database that was changed. Your point about rings is interesting; in
general that would explain it. But in the case of a ring I would have
multiple hanging replications at the same time, correct? It always
starts with one direction hanging; the other direction usually works
just fine until it hangs some time (hours) later.

I have also tested this with a couple of SVN revisions before the
0.10.0 release, and things have improved a lot since those first tests.
But I now have much more data; the database update sequence is in the
millions.
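The update-notification setup described above can be sketched roughly as
follows; the node URLs are placeholders, and the wiring assumes CouchDB's
update_notification mechanism, which feeds the configured process one
JSON event per line on stdin:

```python
import json
import sys
import urllib.request

SELF = "http://node1:5984"                          # this node, as peers reach it
PEERS = ["http://node2:5984", "http://node3:5984"]  # the other cluster nodes

def pull_request_body(self_url, db):
    """JSON body asking a peer to pull-replicate `db` from this node."""
    return json.dumps({"source": f"{self_url}/{db}", "target": db})

def trigger_pull(peer, db):
    """POST to the peer's /_replicate so it pulls the changed database."""
    req = urllib.request.Request(
        peer + "/_replicate",
        data=pull_request_body(SELF, db).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def main():
    # CouchDB writes one JSON object per line to the notifier's stdin,
    # e.g. {"db": "mydb", "type": "updated"}
    for line in sys.stdin:
        event = json.loads(line)
        if event.get("type") == "updated":
            for peer in PEERS:
                trigger_pull(peer, event["db"])

# CouchDB runs this script itself and feeds events on stdin; call main() there.
```

Note that each notification here launches a fresh `/_replicate` call,
so many small changes mean many overlapping replication requests.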

Best regards
Simon



> 
> Paul Davis
-- 
Simon Eisenmann

[ mailto:simon@struktur.de ]

[ struktur AG | Kronenstraße 22a | D-70173 Stuttgart ]
[ T. +49.711.896656.68 | F.+49.711.89665610 ]
[ http://www.struktur.de | mailto:info@struktur.de ]

Re: Replication hangs

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Oct 19, 2009 at 9:28 AM, Simon Eisenmann <si...@struktur.de> wrote:
> Hi,
>
> I am currently trying to set up a CouchDB cluster with three nodes,
> each replicating changes to the others. Basically this works just fine
> for a while, with only a couple of changes per minute on every node.
>
> However, at some point (after a couple of hours) the replication
> processes just hang with a status like "W Processed source update
> #12336" (as shown in Futon). The only way to get replication running
> again is to restart CouchDB. This does not happen on all nodes at
> once: usually it begins on one node, and some time later the other
> nodes also have hanging replication processes.
>
> So the question is: how stable is replication supposed to be in a
> bidirectional scenario, with multiple changes and multiple
> replications coming in and going out at the same time? Is there a
> recommended way to set up a multi-master replicating environment?
>
> Thanks for any hints.
>
> Best regards
> Simon
>
>
>
>
> PS: I am using the 0.10.0 release with Erlang R13B01 on Ubuntu Linux
> 64-bit.
>
>
>
> --
> Simon Eisenmann
>
> [ mailto:simon@struktur.de ]
>
> [ struktur AG | Kronenstraße 22a | D-70173 Stuttgart ]
> [ T. +49.711.896656.0 | F.+49.711.89665610 ]
> [ http://www.struktur.de | mailto:info@struktur.de ]
>

Simon,

Are there any tracebacks in the logs that you can paste? I don't think
I've heard of replication getting wedged without some sort of
feedback.

Also, are you using continuous replication then? I do know that just
before the 0.10.0 release that Adam Kocoloski and Robert Newson spent
a good amount of time getting star (all nodes replicate continuously
to all others) kinks ironed out. Or maybe it was a ring. I dunno, but
there was work on something like that.

Paul Davis