Posted to user@couchdb.apache.org by kowsik <ko...@gmail.com> on 2011/09/01 15:28:54 UTC

CouchDB 1.1 issue

Ran into this twice so far in production CouchDB in the last two days.
We are running CouchDB 1.1 on an EC2 AMI with multi-master replication
across two regions. I notice that every now and then CouchDB will
simply suck up 100% CPU and 50% of the total memory and not respond at
all. So far the logs only show sporadic replication errors. One of the
stack traces (failed to replicate after 10 times) is about 500,000
lines long. We are using the _replicator database.

Anyone else running into this? Since 1.1 doesn't have the
try-until-infinity-and-beyond mode, we have a worker task that watches
the _replication_state and kicks the replicator as soon as it errors
out. Are there any settings in terms of replicator memory usage, etc. that
could help us?
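(For the curious, a watcher like that can be sketched in a few lines of Python. This is a hedged sketch, not our exact code: the server URL and doc id are placeholders, and it assumes that re-saving the doc without the server-managed `_replication_*` fields makes CouchDB 1.1 schedule the replication again.)

```python
# Sketch of a _replicator watcher for CouchDB 1.1. Assumptions: local
# server at COUCH, a replication doc id of your choosing, and that
# stripping the _replication_* fields and re-PUTting retriggers the job.
import json
import urllib.request

COUCH = "http://127.0.0.1:5984"  # assumed local CouchDB


def needs_kick(doc):
    """A replication doc needs restarting once CouchDB marks it 'error'."""
    return doc.get("_replication_state") == "error"


def reset_doc(doc):
    """Drop the server-managed _replication_* fields so that re-saving
    the doc makes CouchDB schedule the replication again."""
    return {k: v for k, v in doc.items() if not k.startswith("_replication_")}


def kick(doc_id):
    """Fetch the replication doc and re-PUT it if it errored out."""
    with urllib.request.urlopen(f"{COUCH}/_replicator/{doc_id}") as resp:
        doc = json.load(resp)
    if needs_kick(doc):
        req = urllib.request.Request(
            f"{COUCH}/_replicator/{doc_id}",
            data=json.dumps(reset_doc(doc)).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)


if __name__ == "__main__":
    kick("prod-to-backup")  # hypothetical doc id
```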

Thanks!

K.
---
http://blog.mudynamics.com
http://blitz.io
@pcapr

Re: CouchDB 1.1 issue

Posted by Paul Davis <pa...@gmail.com>.
I don't have any immediate thoughts, no. I'm not "recite details from
memory" familiar with this part of the code base. AFAIK it could be
anything from a networking blip to a pathological log formatting
issue.

On Thu, Sep 1, 2011 at 9:11 PM, kowsik <ko...@gmail.com> wrote:
> Wow, I'm shocked by the eerie silence on this. So I take it, there are
> no clues in my prior emails to figure out why the replicator is
> backing up and then dumping a 500,000 line stack trace?
>
> Dunno if it helps, but here's what we see. The number of documents
> between the two clusters will start to differ (meaning things are not
> replicating fast enough) and then we'll see 100% CPU utilization one
> of the them at the same time watching the memory utilization grow.
> Could it be the geo-latency that's causing the problem?
>
> Just to see if it makes a difference, we are moving our CouchDB
> cluster to an m2.2xlarge instance (big honking instance with fast IO)
> as well as using instance storage instead of EBS. Will report back on
> what we see. But we definitely could use some help here.
>
> Thanks,
>
> K.
> ---
> http://blitz.io
> @pcapr
>
> On Thu, Sep 1, 2011 at 7:29 AM, kowsik <ko...@gmail.com> wrote:
>> One more observation. It seems the memory goes up dramatically while
>> the replicator task is writing all the failed-to-replicate-docs to the
>> log (ends with this)
>>
>> ** Reason for termination ==
>> ** {http_request_failed,<<"failed to replicate http://host/db">>}
>>
>> Is there a way to disable logging for the replicator? Interestingly
>> enough, as soon as we restart, the replicator simply catches up and
>> pretends there were no problems.
>>
>> K.
>> ---
>> http://blog.mudynamics.com
>> http://blitz.io
>> @pcapr
>>
>> On Thu, Sep 1, 2011 at 7:18 AM, kowsik <ko...@gmail.com> wrote:
>>> Right before I sent this email we restarted CouchDB and now it's at
>>> 14% memory usage and climbing. Is there anything we can look at
>>> stats-wise and see where the pressure in the system is? I realize task
>>> stats are being added to trunk, but on 1.1, anything?
>>>
>>> Thanks,
>>>
>>> K.
>>> ---
>>> http://blog.mudynamics.com
>>> http://blitz.io
>>> @pcapr
>>>
>>> On Thu, Sep 1, 2011 at 6:35 AM, Scott Feinberg <fe...@gmail.com> wrote:
>>>> I haven't had that issue-though I'm not using using 1.1 in a
>>>> production environment, just using it to replicate like crazy (millions of
>>>> docs in each of my 20+ databases).  I was running a server with 1 GB of
>>>> memory and didn't have an issue, it handled it fine.
>>>>
>>>> However... from http://docs.couchbase.org/couchdb-release-1.1/index.html
>>>>
>>>> When you PUT/POST a document to the _replicator database, CouchDB will
>>>> attempt to start the replication up to 10 times (configurable under
>>>> [replicator], parameter max_replication_retry_count).
>>>>
>>>> Not sure if that helps.
>>>>
>>>> --Scott
>>>>
>>>> On Thu, Sep 1, 2011 at 9:28 AM, kowsik <ko...@gmail.com> wrote:
>>>>
>>>>> Ran into this twice so far in production CouchDB in the last two days.
>>>>> We are running CouchDB 1.1 on an EC2 AMI with multi-master replication
>>>>> across two regions. I notice that every now and then CouchDB will
>>>>> simply suck up 100% CPU 50% of the total memory and not respond at
>>>>> all. So far the logs only show sporadic replication errors. One of the
>>>>> stack traces (failed to replicate after 10 times) is about 500,000
>>>>> lines long. We are using the _replicator database.
>>>>>
>>>>> Anyone else running into this? Since 1.1 doesn't have the
>>>>> try-until-infinity-and-beyond mode, we have a worker task that watches
>>>>> the _replication_state and kicks the replicator as soon as it errors
>>>>> out. Are there any settings in terms replicator memory usage, etc that
>>>>> could help us?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> K.
>>>>> ---
>>>>> http://blog.mudynamics.com
>>>>> http://blitz.io
>>>>> @pcapr
>>>>>
>>>>
>>>
>>
>

Re: CouchDB 1.1 issue

Posted by kowsik <ko...@gmail.com>.
Wow, I'm shocked by the eerie silence on this. So I take it, there are
no clues in my prior emails to figure out why the replicator is
backing up and then dumping a 500,000 line stack trace?

Dunno if it helps, but here's what we see. The number of documents
between the two clusters will start to differ (meaning things are not
> replicating fast enough) and then we'll see 100% CPU utilization on one
> of them while watching the memory utilization grow.
Could it be the geo-latency that's causing the problem?

Just to see if it makes a difference, we are moving our CouchDB
cluster to an m2.2xlarge instance (big honking instance with fast IO)
as well as using instance storage instead of EBS. Will report back on
what we see. But we definitely could use some help here.

Thanks,

K.
---
http://blitz.io
@pcapr

On Thu, Sep 1, 2011 at 7:29 AM, kowsik <ko...@gmail.com> wrote:
> One more observation. It seems the memory goes up dramatically while
> the replicator task is writing all the failed-to-replicate-docs to the
> log (ends with this)
>
> ** Reason for termination ==
> ** {http_request_failed,<<"failed to replicate http://host/db">>}
>
> Is there a way to disable logging for the replicator? Interestingly
> enough, as soon as we restart, the replicator simply catches up and
> pretends there were no problems.
>
> K.
> ---
> http://blog.mudynamics.com
> http://blitz.io
> @pcapr
>
> On Thu, Sep 1, 2011 at 7:18 AM, kowsik <ko...@gmail.com> wrote:
>> Right before I sent this email we restarted CouchDB and now it's at
>> 14% memory usage and climbing. Is there anything we can look at
>> stats-wise and see where the pressure in the system is? I realize task
>> stats are being added to trunk, but on 1.1, anything?
>>
>> Thanks,
>>
>> K.
>> ---
>> http://blog.mudynamics.com
>> http://blitz.io
>> @pcapr
>>
>> On Thu, Sep 1, 2011 at 6:35 AM, Scott Feinberg <fe...@gmail.com> wrote:
>>> I haven't had that issue-though I'm not using using 1.1 in a
>>> production environment, just using it to replicate like crazy (millions of
>>> docs in each of my 20+ databases).  I was running a server with 1 GB of
>>> memory and didn't have an issue, it handled it fine.
>>>
>>> However... from http://docs.couchbase.org/couchdb-release-1.1/index.html
>>>
>>> When you PUT/POST a document to the _replicator database, CouchDB will
>>> attempt to start the replication up to 10 times (configurable under
>>> [replicator], parameter max_replication_retry_count).
>>>
>>> Not sure if that helps.
>>>
>>> --Scott
>>>
>>> On Thu, Sep 1, 2011 at 9:28 AM, kowsik <ko...@gmail.com> wrote:
>>>
>>>> Ran into this twice so far in production CouchDB in the last two days.
>>>> We are running CouchDB 1.1 on an EC2 AMI with multi-master replication
>>>> across two regions. I notice that every now and then CouchDB will
>>>> simply suck up 100% CPU 50% of the total memory and not respond at
>>>> all. So far the logs only show sporadic replication errors. One of the
>>>> stack traces (failed to replicate after 10 times) is about 500,000
>>>> lines long. We are using the _replicator database.
>>>>
>>>> Anyone else running into this? Since 1.1 doesn't have the
>>>> try-until-infinity-and-beyond mode, we have a worker task that watches
>>>> the _replication_state and kicks the replicator as soon as it errors
>>>> out. Are there any settings in terms replicator memory usage, etc that
>>>> could help us?
>>>>
>>>> Thanks!
>>>>
>>>> K.
>>>> ---
>>>> http://blog.mudynamics.com
>>>> http://blitz.io
>>>> @pcapr
>>>>
>>>
>>
>

Re: CouchDB 1.1 issue

Posted by Dave Cottlehuber <da...@muse.net.nz>.
On 2 September 2011 04:38, kowsik <ko...@gmail.com> wrote:
> Some follow up questions that I'm hoping the dev's can answer. I don't
> grok Erlang so if these are dumb questions, humor me.
>
> The couchdb script launches erlang with the following parameters:
>
> -Bd - disable breaks
> -K true - what's this for?
> -A 4 - number of async threads - does concurrency improve if this is increased?
>
> There also seems to be a number of options to set the stack size, heap
> size, etc. Anyone played around with these settings to get additional
> concurrency/performance boosts?
>
> Thanks,
>
> K.

Hi Kowsik

IIRC janl@ fiddled with these parameters a while back. K is for kernel
polling; more info at www.erlang.org/doc/man/erl.html

If you want to poke around more inside Erlang, bigwig
(http://www.metabrew.com/article/bigwig-erlang-webtool-spawnfest,
https://github.com/beamspirit/bigwig) is new & nice.

A+
Dave

Re: CouchDB 1.1 issue

Posted by kowsik <ko...@gmail.com>.
Some follow-up questions that I'm hoping the devs can answer. I don't
grok Erlang, so if these are dumb questions, humor me.

The couchdb script launches Erlang with the following parameters:

-Bd - disable breaks
-K true - what's this for?
-A 4 - number of async threads - does concurrency improve if this is increased?

There also seems to be a number of options to set the stack size, heap
size, etc. Anyone played around with these settings to get additional
concurrency/performance boosts?
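(Editorially: those flags come from a single variable in the couchdb wrapper script, so experimenting is a one-line change. The excerpt below is paraphrased from memory; treat the exact variable name as an assumption and check your own copy of the script.)

```shell
# Excerpt (paraphrased) from the couchdb wrapper script. Raising -A here
# (e.g. to 16) grows the Erlang async I/O thread pool; -K true enables
# kernel polling; -Bd disables the break handler.
ERL_START_OPTIONS="-Bd -K true -A 4"
```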

Thanks,

K.
---
http://blog.mudynamics.com
http://blitz.io
@pcapr

On Thu, Sep 1, 2011 at 7:29 AM, kowsik <ko...@gmail.com> wrote:
> One more observation. It seems the memory goes up dramatically while
> the replicator task is writing all the failed-to-replicate-docs to the
> log (ends with this)
>
> ** Reason for termination ==
> ** {http_request_failed,<<"failed to replicate http://host/db">>}
>
> Is there a way to disable logging for the replicator? Interestingly
> enough, as soon as we restart, the replicator simply catches up and
> pretends there were no problems.
>
> K.
> ---
> http://blog.mudynamics.com
> http://blitz.io
> @pcapr
>
> On Thu, Sep 1, 2011 at 7:18 AM, kowsik <ko...@gmail.com> wrote:
>> Right before I sent this email we restarted CouchDB and now it's at
>> 14% memory usage and climbing. Is there anything we can look at
>> stats-wise and see where the pressure in the system is? I realize task
>> stats are being added to trunk, but on 1.1, anything?
>>
>> Thanks,
>>
>> K.
>> ---
>> http://blog.mudynamics.com
>> http://blitz.io
>> @pcapr
>>
>> On Thu, Sep 1, 2011 at 6:35 AM, Scott Feinberg <fe...@gmail.com> wrote:
>>> I haven't had that issue-though I'm not using using 1.1 in a
>>> production environment, just using it to replicate like crazy (millions of
>>> docs in each of my 20+ databases).  I was running a server with 1 GB of
>>> memory and didn't have an issue, it handled it fine.
>>>
>>> However... from http://docs.couchbase.org/couchdb-release-1.1/index.html
>>>
>>> When you PUT/POST a document to the _replicator database, CouchDB will
>>> attempt to start the replication up to 10 times (configurable under
>>> [replicator], parameter max_replication_retry_count).
>>>
>>> Not sure if that helps.
>>>
>>> --Scott
>>>
>>> On Thu, Sep 1, 2011 at 9:28 AM, kowsik <ko...@gmail.com> wrote:
>>>
>>>> Ran into this twice so far in production CouchDB in the last two days.
>>>> We are running CouchDB 1.1 on an EC2 AMI with multi-master replication
>>>> across two regions. I notice that every now and then CouchDB will
>>>> simply suck up 100% CPU 50% of the total memory and not respond at
>>>> all. So far the logs only show sporadic replication errors. One of the
>>>> stack traces (failed to replicate after 10 times) is about 500,000
>>>> lines long. We are using the _replicator database.
>>>>
>>>> Anyone else running into this? Since 1.1 doesn't have the
>>>> try-until-infinity-and-beyond mode, we have a worker task that watches
>>>> the _replication_state and kicks the replicator as soon as it errors
>>>> out. Are there any settings in terms replicator memory usage, etc that
>>>> could help us?
>>>>
>>>> Thanks!
>>>>
>>>> K.
>>>> ---
>>>> http://blog.mudynamics.com
>>>> http://blitz.io
>>>> @pcapr
>>>>
>>>
>>
>

Re: CouchDB 1.1 issue

Posted by kowsik <ko...@gmail.com>.
One more observation. It seems the memory goes up dramatically while
the replicator task is writing all the failed-to-replicate docs to the
log (it ends with this):

** Reason for termination ==
** {http_request_failed,<<"failed to replicate http://host/db">>}

Is there a way to disable logging for the replicator? Interestingly
enough, as soon as we restart, the replicator simply catches up and
pretends there were no problems.
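(As far as I know there is no per-replicator log switch in 1.1, but the global verbosity can be lowered in local.ini. Note the caveat in the comment: this is a sketch, it applies to the whole server, and crash reports are emitted at error level so the big dumps will still appear.)

```ini
; local.ini -- global logging verbosity. This applies to all of CouchDB,
; not just the replicator, and crash reports are logged at error level,
; so they will still be written.
[log]
level = error
```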

K.
---
http://blog.mudynamics.com
http://blitz.io
@pcapr

On Thu, Sep 1, 2011 at 7:18 AM, kowsik <ko...@gmail.com> wrote:
> Right before I sent this email we restarted CouchDB and now it's at
> 14% memory usage and climbing. Is there anything we can look at
> stats-wise and see where the pressure in the system is? I realize task
> stats are being added to trunk, but on 1.1, anything?
>
> Thanks,
>
> K.
> ---
> http://blog.mudynamics.com
> http://blitz.io
> @pcapr
>
> On Thu, Sep 1, 2011 at 6:35 AM, Scott Feinberg <fe...@gmail.com> wrote:
>> I haven't had that issue-though I'm not using using 1.1 in a
>> production environment, just using it to replicate like crazy (millions of
>> docs in each of my 20+ databases).  I was running a server with 1 GB of
>> memory and didn't have an issue, it handled it fine.
>>
>> However... from http://docs.couchbase.org/couchdb-release-1.1/index.html
>>
>> When you PUT/POST a document to the _replicator database, CouchDB will
>> attempt to start the replication up to 10 times (configurable under
>> [replicator], parameter max_replication_retry_count).
>>
>> Not sure if that helps.
>>
>> --Scott
>>
>> On Thu, Sep 1, 2011 at 9:28 AM, kowsik <ko...@gmail.com> wrote:
>>
>>> Ran into this twice so far in production CouchDB in the last two days.
>>> We are running CouchDB 1.1 on an EC2 AMI with multi-master replication
>>> across two regions. I notice that every now and then CouchDB will
>>> simply suck up 100% CPU 50% of the total memory and not respond at
>>> all. So far the logs only show sporadic replication errors. One of the
>>> stack traces (failed to replicate after 10 times) is about 500,000
>>> lines long. We are using the _replicator database.
>>>
>>> Anyone else running into this? Since 1.1 doesn't have the
>>> try-until-infinity-and-beyond mode, we have a worker task that watches
>>> the _replication_state and kicks the replicator as soon as it errors
>>> out. Are there any settings in terms replicator memory usage, etc that
>>> could help us?
>>>
>>> Thanks!
>>>
>>> K.
>>> ---
>>> http://blog.mudynamics.com
>>> http://blitz.io
>>> @pcapr
>>>
>>
>

Re: CouchDB 1.1 issue

Posted by kowsik <ko...@gmail.com>.
Right before I sent this email we restarted CouchDB and now it's at
14% memory usage and climbing. Is there anything we can look at
stats-wise and see where the pressure in the system is? I realize task
stats are being added to trunk, but on 1.1, anything?
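(On 1.1 the closest things are the /_stats and /_active_tasks endpoints. /_stats returns nested JSON counter groups; a small helper like the one below can surface where the pressure is. The response shape in the docstring is an assumption about 1.1's format, so verify against your server.)

```python
def top_counters(stats, n=3):
    """Flatten a CouchDB /_stats response -- assumed shape
    {group: {counter: {"current": value, ...}}} -- and return the n
    largest current values as (name, value) pairs, highest first."""
    flat = []
    for group, counters in stats.items():
        for name, fields in counters.items():
            # "current" may be null for counters that haven't fired yet.
            flat.append((f"{group}/{name}", fields.get("current") or 0))
    return sorted(flat, key=lambda kv: kv[1], reverse=True)[:n]

# Usage: feed it json.loads() of GET /_stats and eyeball the leaders.
```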

Thanks,

K.
---
http://blog.mudynamics.com
http://blitz.io
@pcapr

On Thu, Sep 1, 2011 at 6:35 AM, Scott Feinberg <fe...@gmail.com> wrote:
> I haven't had that issue-though I'm not using using 1.1 in a
> production environment, just using it to replicate like crazy (millions of
> docs in each of my 20+ databases).  I was running a server with 1 GB of
> memory and didn't have an issue, it handled it fine.
>
> However... from http://docs.couchbase.org/couchdb-release-1.1/index.html
>
> When you PUT/POST a document to the _replicator database, CouchDB will
> attempt to start the replication up to 10 times (configurable under
> [replicator], parameter max_replication_retry_count).
>
> Not sure if that helps.
>
> --Scott
>
> On Thu, Sep 1, 2011 at 9:28 AM, kowsik <ko...@gmail.com> wrote:
>
>> Ran into this twice so far in production CouchDB in the last two days.
>> We are running CouchDB 1.1 on an EC2 AMI with multi-master replication
>> across two regions. I notice that every now and then CouchDB will
>> simply suck up 100% CPU 50% of the total memory and not respond at
>> all. So far the logs only show sporadic replication errors. One of the
>> stack traces (failed to replicate after 10 times) is about 500,000
>> lines long. We are using the _replicator database.
>>
>> Anyone else running into this? Since 1.1 doesn't have the
>> try-until-infinity-and-beyond mode, we have a worker task that watches
>> the _replication_state and kicks the replicator as soon as it errors
>> out. Are there any settings in terms replicator memory usage, etc that
>> could help us?
>>
>> Thanks!
>>
>> K.
>> ---
>> http://blog.mudynamics.com
>> http://blitz.io
>> @pcapr
>>
>

Re: CouchDB 1.1 issue

Posted by Scott Feinberg <fe...@gmail.com>.
I haven't had that issue, though I'm not using 1.1 in a
production environment, just using it to replicate like crazy (millions of
docs in each of my 20+ databases). I was running a server with 1 GB of
memory and didn't have an issue; it handled it fine.

However... from http://docs.couchbase.org/couchdb-release-1.1/index.html

When you PUT/POST a document to the _replicator database, CouchDB will
attempt to start the replication up to 10 times (configurable under
[replicator], parameter max_replication_retry_count).

Not sure if that helps.
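(For reference, that setting lives in the [replicator] section of local.ini, so bumping it is a one-liner. The value below is just an example, not a recommendation.)

```ini
; local.ini -- raise the per-document replication retry limit
; (the 1.1 default is 10, per the release notes quoted above)
[replicator]
max_replication_retry_count = 50
```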

--Scott

On Thu, Sep 1, 2011 at 9:28 AM, kowsik <ko...@gmail.com> wrote:

> Ran into this twice so far in production CouchDB in the last two days.
> We are running CouchDB 1.1 on an EC2 AMI with multi-master replication
> across two regions. I notice that every now and then CouchDB will
> simply suck up 100% CPU 50% of the total memory and not respond at
> all. So far the logs only show sporadic replication errors. One of the
> stack traces (failed to replicate after 10 times) is about 500,000
> lines long. We are using the _replicator database.
>
> Anyone else running into this? Since 1.1 doesn't have the
> try-until-infinity-and-beyond mode, we have a worker task that watches
> the _replication_state and kicks the replicator as soon as it errors
> out. Are there any settings in terms replicator memory usage, etc that
> could help us?
>
> Thanks!
>
> K.
> ---
> http://blog.mudynamics.com
> http://blitz.io
> @pcapr
>