Posted to dev@couchdb.apache.org by Dirkjan Ochtman <di...@ochtman.nl> on 2011/02/04 15:26:28 UTC

1.0.2 regression in continuous replication

Hi everyone,

I asked about this a bit on IRC, but we decided that this might be
better on the mailing list.

At work, we use CouchDB continuous replication for some important
streams, from a server in a data center somewhere to the servers in
our office, over OpenVPN. This was partly done because the
consumer-grade internet in our office (hey, it's a startup) is a
little flaky at times. This used to work fine; we had a script to
restart the replication if it failed, but that only happened once
every week or few weeks.

However, since installing 1.0.2 on Monday, we see our continuous
replication failing much more often, around 3-5 times per day. It
seems to fail with a timeout error; here's a trace:
http://dirkjan.ochtman.nl/files/repl-fail.log.

Is anyone seeing similar behavior? Are there any changes in CouchDB
between 1.0.1 and 1.0.2 that could have caused this? We're running
fairly standard Linux boxes, not much else changed on Monday.

Cheers,

Dirkjan

Re: 1.0.2 regression in continuous replication

Posted by Dirkjan Ochtman <di...@ochtman.nl>.

On Thu, Feb 17, 2011 at 12:36, Filipe David Manana <fd...@apache.org> wrote:
> Yes I looked at them. I can't see anything obviously wrong so far. Are
> network failures or congestion completely excluded?

It worked on 1.0.1 and nothing else has changed. It literally started
failing (so often) the day I upgraded.

Cheers,

Dirkjan

Re: 1.0.2 regression in continuous replication

Posted by Filipe David Manana <fd...@apache.org>.

On Thu, Feb 17, 2011 at 7:53 AM, Dirkjan Ochtman <di...@ochtman.nl> wrote:
> On Mon, Feb 14, 2011 at 10:26, Dirkjan Ochtman <di...@ochtman.nl> wrote:
>> Here's a more complete log:
>> http://dirkjan.ochtman.nl/files/repl-fail.tar.gz. I've tried to filter
>> out the view query strings because they contain potentially sensitive
>> information.
>
> Hi Filipe,
>
> Did you have any time to look at our logs?

Yes, I looked at them. I can't see anything obviously wrong so far.
Have network failures or congestion been completely ruled out?

>
> Cheers,
>
> Dirkjan
>

-- 
Filipe David Manana,
fdmanana@gmail.com, fdmanana@apache.org

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

Re: 1.0.2 regression in continuous replication

Posted by Dirkjan Ochtman <di...@ochtman.nl>.

On Mon, Feb 14, 2011 at 10:26, Dirkjan Ochtman <di...@ochtman.nl> wrote:
> Here's a more complete log:
> http://dirkjan.ochtman.nl/files/repl-fail.tar.gz. I've tried to filter
> out the view query strings because they contain potentially sensitive
> information.

Hi Filipe,

Did you have any time to look at our logs?

Cheers,

Dirkjan

Re: 1.0.2 regression in continuous replication

Posted by Dirkjan Ochtman <di...@ochtman.nl>.

On Sat, Feb 12, 2011 at 23:39, Filipe David Manana <fd...@apache.org> wrote:
> How much is "much more"? :) Minutes, hours?
> Can you send a log which contains that log message I mentioned before?

Here are some times that it got restarted, in the past 24 hours:

21:20 (12 hours ago)
21:30 (12 hours ago)
21:50 (12 hours ago)
07:30 (2 hours ago)
08:10 (1 hour ago)
08:50 (1 hour ago)

Here's a more complete log:
http://dirkjan.ochtman.nl/files/repl-fail.tar.gz. I've tried to filter
out the view query strings because they contain potentially sensitive
information.

Cheers,

Dirkjan

Re: 1.0.2 regression in continuous replication

Posted by Filipe David Manana <fd...@apache.org>.

On Sat, Feb 12, 2011 at 2:30 PM, Dirkjan Ochtman <di...@ochtman.nl> wrote:
> It's much more than 30s. It seems to be failing about three or four
> times a day (we check every 10 minutes if it's still alive, restart it
> if not). Let me know if you want more accurate timeframes.

How much is "much more"? :) Minutes, hours?
Can you send a log which contains that log message I mentioned before?

regards,

>
> Cheers,
>
> Dirkjan
>


Re: 1.0.2 regression in continuous replication

Posted by Dirkjan Ochtman <di...@ochtman.nl>.

On Sat, Feb 12, 2011 at 23:22, Filipe David Manana <fd...@apache.org> wrote:
> What's the time delta between the moment the replication started and
> the moment you get the changes_timeout error? Is it a delta of about
> 30 seconds?
>
> You can check when the replication started by inspecting the log file
> for a line like this:
>
> [Sat, 12 Feb 2011 22:16:10 GMT] [info] [<0.108.0>] starting new
> replication "bae147260ef23b1d062188e6838d68d7" at <0.192.0>
>
> The relevant error message in your log is the third one:
>
> "[Tue, 01 Feb 2011 06:23:54 GMT] [error] [<0.3701.3>] changes loop
> timeout, no data received from http://10.8.0.1:5984/efp/"
>
> I want to know the timestamp difference between both.

It's much more than 30s. It seems to be failing about three or four
times a day (we check every 10 minutes if it's still alive, restart it
if not). Let me know if you want more accurate timeframes.
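The check-and-restart job described above could look roughly like the
following. This is only a sketch, not the script actually in use: the
local CouchDB address, the target database name "efp", and the exact
shape of the `_active_tasks` entries (a "type" of "Replication" and a
"task" string that contains the source URL) are assumptions. Only the
source URL is taken from the log message quoted in this thread.

```python
import json
import urllib.request

COUCH = "http://127.0.0.1:5984"        # assumption: local CouchDB, no admin auth
SOURCE = "http://10.8.0.1:5984/efp/"   # remote source, as seen in the quoted log line
TARGET = "efp"                         # assumption: local target database name


def replication_body(source, target):
    """JSON body for POST /_replicate that requests a continuous replication."""
    return {"source": source, "target": target, "continuous": True}


def is_running(active_tasks, source):
    """True if any active task looks like a replication pulling from `source`.

    Assumes each task is a dict with "type" and "task" keys.
    """
    return any(
        t.get("type") == "Replication" and source in t.get("task", "")
        for t in active_tasks
    )


def request_json(url, body=None):
    """GET `url`, or POST `body` as JSON when a body is given; return parsed JSON."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def check_and_restart():
    """Restart the replication if it is not listed; return True if restarted."""
    tasks = request_json(COUCH + "/_active_tasks")
    if is_running(tasks, SOURCE):
        return False
    request_json(COUCH + "/_replicate", replication_body(SOURCE, TARGET))
    return True
```

Calling `check_and_restart()` from a cron job every 10 minutes would
match the interval mentioned above, and logging the time whenever it
returns True gives the restart times to within that interval.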

Cheers,

Dirkjan

Re: 1.0.2 regression in continuous replication

Posted by Filipe David Manana <fd...@apache.org>.

Dirkjan,

What's the time delta between the moment the replication started and
the moment you get the changes_timeout error? Is it a delta of about
30 seconds?

You can check when the replication started by inspecting the log file
for a line like this:

[Sat, 12 Feb 2011 22:16:10 GMT] [info] [<0.108.0>] starting new
replication "bae147260ef23b1d062188e6838d68d7" at <0.192.0>

The relevant error message in your log is the third one:

"[Tue, 01 Feb 2011 06:23:54 GMT] [error] [<0.3701.3>] changes loop
timeout, no data received from http://10.8.0.1:5984/efp/"

I want to know the difference between those two timestamps.
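If it helps, the comparison can be scripted. The sketch below assumes
that each log entry sits on a single line in the actual log file (the
examples above are wrapped by the mail client) and that every entry
starts with a timestamp in the format shown. The function names are
made up for illustration, and the second sample line uses an invented
timestamp purely to demonstrate the calculation.

```python
import re
from datetime import datetime

# Log lines start with a timestamp such as "[Sat, 12 Feb 2011 22:16:10 GMT]".
TIMESTAMP_RE = re.compile(r"^\[(\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}) GMT\]")


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    match = TIMESTAMP_RE.match(line)
    if match is None:
        return None
    # %a and %b are locale-dependent; this assumes English day and month names.
    return datetime.strptime(match.group(1), "%a, %d %b %Y %H:%M:%S")


def replication_lifetimes(lines):
    """Yield (started, failed, delta) for each changes loop timeout.

    Each timeout is paired with the most recent preceding
    "starting new replication" line.
    """
    started = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue
        if "starting new replication" in line:
            started = ts
        elif "changes loop timeout" in line and started is not None:
            yield started, ts, ts - started
            started = None


# Sample input modeled on the two lines above; the error timestamp is invented.
sample = [
    '[Sat, 12 Feb 2011 22:16:10 GMT] [info] [<0.108.0>] starting new '
    'replication "bae147260ef23b1d062188e6838d68d7" at <0.192.0>',
    '[Sat, 12 Feb 2011 22:16:40 GMT] [error] [<0.3701.3>] changes loop '
    'timeout, no data received from http://10.8.0.1:5984/efp/',
]

for started, failed, delta in replication_lifetimes(sample):
    print(started, failed, delta)  # delta prints as 0:00:30 for this sample
```

To run it against a real log, pass the open file object to
`replication_lifetimes` instead of `sample`.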

regards,

On Fri, Feb 4, 2011 at 6:26 AM, Dirkjan Ochtman <di...@ochtman.nl> wrote:
> Hi everyone,
>
> I asked about this a bit on IRC, but we decided that this might be
> better on the mailing list.
>
> At work, we use CouchDB continuous replication for some important
> streams, from a server in a data center somewhere to the servers in
> our office, over OpenVPN. This was partly done because the
> consumer-grade internet in our office (hey, it's a startup) is a
> little flaky at times. This used to work fine; we had a script to
> restart the replication if it failed, but that only happened once in a
> week or a few weeks.
>
> However, since installing 1.0.2 on Monday, we see our continuous
> replication failing much more often, like 3-5 times per day. It seems
> to fail complaining about some timeout, here's a trace:
> http://dirkjan.ochtman.nl/files/repl-fail.log.
>
> Is anyone seeing similar behavior? Are there any changes in CouchDB
> between 1.0.1 and 1.0.2 that could have caused this? We're running
> fairly standard Linux boxes, not much else changed on Monday.
>
> Cheers,
>
> Dirkjan
>


