Posted to dev@couchdb.apache.org by Chris Stockton <ch...@gmail.com> on 2011/10/11 01:03:30 UTC

CouchDB Replication lacking resilience for many databases

Hello,

I will try to keep this as short as possible. The condensed version is
that I am looking to the developers and experienced users of CouchDB
for some guidance, so I can plan how to proceed in solving our current
growth and capacity issues with regard to replication.

I have been using CouchDB for about a year now in a production
environment, and nearly two years developing with it in total. As our
customer base has grown to thousands of users, CouchDB has kept up
with reads, writes and general performance pretty well. There has been
one constant pain point that has consumed a lot of systems operations
and developer time, and that is replication.

Our system is broken into pods. Each pod has 3 web servers, a CouchDB
master server (all writes and reads come from here), and a CouchDB
failover server which pulls from the master, ONLY for redundancy, not
reads. We have over 5000 databases, and an application that watches
the status on the failover machine to make sure newly added and
existing databases are running a "continuous" replication task. This
means 5000 replication tasks running at all times, along with the tax
and resources those tasks require. I have had trouble planning out how
to scale and tune our systems so that CouchDB can push the large
machines it runs on to their full potential. Thankfully I was able to
find some general information on Erlang VM tuning, which has allowed
us to raise our limits and get "new errors", but we still seem unable
to get everything we want out of these machines; we never peg CPU,
memory, etc., we just hit those darn errors in [1]. Since no real-world
studies, white papers or general resources for tackling
enterprise-level capacity issues exist for CouchDB, I feel a bit lost.
This is something I want to change and am very willing to take part in.
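
To make the setup above concrete, the watcher I mentioned boils down to
something like the sketch below (Python, standard library only; the host
URLs, the omission of authentication and the crude _active_tasks matching
are illustrative assumptions rather than our actual code):

    import json
    import urllib.parse
    import urllib.request

    MASTER = "http://master:5984"      # assumption: master URL, auth omitted
    FAILOVER = "http://failover:5984"  # assumption: failover URL, auth omitted

    def get_json(url):
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def ensure_continuous_replications():
        # List every database, skip the ones that already show up in the
        # running tasks, and start a continuous pull for the rest.
        dbs = get_json(FAILOVER + "/_all_dbs")
        tasks_dump = json.dumps(get_json(FAILOVER + "/_active_tasks"))
        for db in dbs:
            if db.startswith("_") or db in tasks_dump:
                continue
            body = json.dumps({
                "source": MASTER + "/" + urllib.parse.quote(db, safe=""),
                "target": db,
                "continuous": True,
            }).encode("utf-8")
            req = urllib.request.Request(
                FAILOVER + "/_replicate", data=body,
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req).close()

    if __name__ == "__main__":
        ensure_continuous_replications()

The substring match against _active_tasks is obviously crude; matching on
whatever task fields your CouchDB version actually reports would be more
robust. The point is simply that every one of the 5000 databases ends up
with its own continuous pull replication.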

A couple of posts about my struggles thus far are located here [1] and
here [2]. I have heard from these threads that an individual has been
working on major changes to replication, to use a small pool of TCP
connections instead of a 1-to-1 ratio. I think this would improve our
situation drastically, but I was unable to find said work anywhere in
SVN or on the web to test or look into its design concepts. If said
work is not actually under development yet, I am willing to put in
some development time on the weekends to learn Erlang and improve this
portion of CouchDB, but I do not want to volunteer this time if the
work has already been completed.

Basically, at this point the 3 am NOC calls and daily restarts of
CouchDB, caused by the error in [1] making databases stop replicating,
are really impacting me. I need to find a solution for this, start
coding an Erlang patch of my own, or look at much less appealing
options... CouchDB is a great product and I am glad we chose it, I
just need a little help here.

The things I have thought of or considered:
  1) Create / implement 'server-wide' replication within Erlang; this
seems like a place where such a task could be optimized and made much
lighter.
  2) I am doing it wrong!!! I should not have an application ensuring
all 5000 dbs have a continuous replication task, but should instead
keep a small pool of continuous replications for the "active"
databases or top / busy users; the _changes feed or other things might
be of use here (see the rough sketch after this list). The only
problem with this is that it isn't entirely fair to the customers, if
the master dies and we fail over, for the busy / active databases to
take precedence over the little guy.
  3) Give up or change our storage architecture because CouchDB is
unable to provide replication stability for our use patterns.
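
For option 2, the activity check I have in mind would be roughly the
following (again only a Python sketch; the host URL and the idea of
comparing update_seq between polls are assumptions about how it could
work, not code we actually run):

    import json
    import urllib.parse
    import urllib.request

    MASTER = "http://master:5984"  # assumption: master URL, auth omitted
    last_seen = {}

    def recently_active_databases():
        # Return the databases whose update_seq moved since the last poll;
        # GET /dbname includes update_seq in its response body.
        with urllib.request.urlopen(MASTER + "/_all_dbs") as resp:
            dbs = json.load(resp)
        active = []
        for db in dbs:
            if db.startswith("_"):
                continue
            url = MASTER + "/" + urllib.parse.quote(db, safe="")
            with urllib.request.urlopen(url) as resp:
                seq = json.load(resp).get("update_seq")
            if last_seen.get(db) != seq:
                active.append(db)
            last_seen[db] = seq
        return active

A pool of continuous replications could then be kept only for whatever
that returns, rotating the quieter databases through periodic one-shot
replications so they still catch up eventually.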

Thank you very much for any suggestions, thoughts or resources you can
point me to,

-Chris

[1] http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/browser
[2] http://mail-archives.apache.org/mod_mbox/couchdb-user/201105.mbox/%3CBANLkTimetq8nvwj4J356cMo0FcWMN=TUhw@mail.gmail.com%3E

Re: CouchDB Replication lacking resilience for many databases

Posted by Randall Leeds <ra...@gmail.com>.
On Mon, Oct 10, 2011 at 17:02, Chris Stockton <ch...@gmail.com> wrote:

> Hello,
>
> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
> <fd...@apache.org> wrote:
> > On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
> > <ch...@gmail.com> wrote:
> > Chris,
> >
> > That said work is in the '1.2.x' branch (and master).
> > CouchDB recently migrated from SVN to GIT, see:
> > http://couchdb.apache.org/community/code.html
> >
>
> Thank you very much for the response Filipe, do you possibly have any
> documentation or more detailed summary on what these changes include
> and possible benefits of them? I would love to hear about any tweaking
> or replication tips you may have for our growth issues, perhaps you
> could answer a basic question if nothing else: Do the changes in this
> branch minimize the performance impact of continuous replication on
> many databases?
>

The primary change, as I understand it, is that CouchDB explicitly manages a
pool of HTTP connections per replication.

Previously, the pool was handled by the HTTP client module CouchDB uses,
ibrowse, which pools connections per host/port. Therefore, the pool is
shared between all the replications pulling from a given server.

The config file has settings for changing how these pools behave, under the
replicator heading:
max_http_sessions
max_http_pipeline_size

The first refers to the size of the per-host pool. The second refers to the
number of requests that can be queued on each connection. Unfortunately, one
more setting is not exposed: the maximum number of requests to attempt at
once per replication (side note to devs: should we provide a quick patch for
this in 1.1.1?), which is fixed at 100. CouchDB does not elegantly handle the
case where the pool is completely utilized. This is not a problem in the new
replication code in 1.2.

For the time being though, the following formula should hold true or you
will experience problems:

max_http_sessions * max_http_pipeline_size >= 100 * N

where N is the maximum number of concurrent replications triggered to pull
or push from a single host.
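
For example, if you expect at most 20 replications to pull from a single
host at once, you need max_http_sessions * max_http_pipeline_size >= 2000,
which something like this in local.ini would satisfy (the numbers are
purely illustrative; pick values that match your own N):

    [replicator]
    max_http_sessions = 200
    max_http_pipeline_size = 10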

I believe this is still the case in 1.1; I'm pretty sure it was at some
point in earlier versions as well.

-Randall



>
> Regardless I plan on getting a build of that branch and doing some
> testing of my own very soon.
>
> Thank you!
>
> -Chris
>

Re: CouchDB Replication lacking resilience for many databases

Posted by Mark Hahn <ma...@boutiquing.com>.
cool.  Thanks.

On Tue, Oct 11, 2011 at 7:03 AM, Jan Lehnardt <ja...@apache.org> wrote:

>
> On Oct 11, 2011, at 14:20 , Mark Hahn wrote:
>
> > It would be nice to have a control panel that displays things like this
> > message queue depth, connection counts, memory consumed, cpu consumed,
> > reads/writes per second, view rebuilds/sec, avg response times, etc.  I'm
> > sure someone could come up with many more pertinent vars.
> >
> > For extra credit the values could be plotted against time.  When someone
> has
> > a problem they could post the log here.
>
> See /_stats :)
>
> It doesn't have all the things you ask for, but adding new stats isn't
> hard:
>
>  http://wiki.apache.org/couchdb/Adding_Runtime_Statistics
>
> Cheers
> Jan
> --
>
>
>
> >
> > On Mon, Oct 10, 2011 at 10:15 PM, Paul Davis <
> paul.joseph.davis@gmail.com>wrote:
> >
> >> On Mon, Oct 10, 2011 at 11:03 PM, Chris Stockton
> >> <ch...@gmail.com> wrote:
> >>> Hello,
> >>>
> >>> On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski <ko...@apache.org>
> >> wrote:
> >>>> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
> >>>>> <fd...@apache.org> wrote:
> >>>>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
> >>>>>> <ch...@gmail.com> wrote:
> >>>>>> Chris,
> >>>>>>
> >>>>>> That said work is in the '1.2.x' branch (and master).
> >>>>>> CouchDB recently migrated from SVN to GIT, see:
> >>>>>> http://couchdb.apache.org/community/code.html
> >>>>>>
> >>>>>
> >>>>> Thank you very much for the response Filipe, do you possibly have any
> >>>>> documentation or more detailed summary on what these changes include
> >>>>> and possible benefits of them? I would love to hear about any
> tweaking
> >>>>> or replication tips you may have for our growth issues, perhaps you
> >>>>> could answer a basic question if nothing else: Do the changes in this
> >>>>> branch minimize the performance impact of continuous replication on
> >>>>> many databases?
> >>>>>
> >>>>> Regardless I plan on getting a build of that branch and doing some
> >>>>> testing of my own very soon.
> >>>>>
> >>>>> Thank you!
> >>>>>
> >>>>> -Chris
> >>>>
> >>>> I'm pretty sure that even in 1.2.x and master each replication with a
> >> remote source still requires one dedicated TCP connection to consume the
> >> _changes feed.  Replications with a local source have always been able
> to
> >> use a connection pool per host:port combination.  That's not to downplay
> the
> >> significance of the rewrite of the replicator in 1.2.x; Filipe put quite
> a
> >> lot of time into it.
> >>>>
> >>>> The link to "those darn errors" just pointed to the mbox browser for
> >> September 2011.  Do you have a more specific link?  Regards,
> >>>>
> >>>> Adam
> >>>
> >>> Well I will remain optimistic that the rewrite could hopefully have
> >>> solved several of my issues regardless I hope. I guess the idle TCP
> >>> connections by themselves are not too bad, when they all start to work
> >>> simultaneously I think is what becomes the issue =)
> >>>
> >>> Sorry Adam, here is a better link
> >>>
> >>
> http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E
> >> ,
> >>> the actual text was:
> >>>
> >>> ---------------
> >>>
> >>> It seems that randomly I am getting errors about crashes as our
> >>> replicator runs, all this replicator does is make sure that all
> >>> databases on the master server replicate to our failover by checking
> >>> status.
> >>>
> >>> Details:
> >>> - I notice the below error in the logs, anywhere from 0 to 30 at a
> time.
> >>> - It seems that a database might start replicating okay then stop.
> >>> - These errors [1] are on the failover pulling from master
> >>> - No errors are displayed on the master server
> >>> - The databases inside the URL in the db_not_found portion of the
> >>> error, are always available from curl from the failover machine, which
> >>> makes the error strange, somehow it thinks it can't find the database
> >>> - Master seems healthy at all times, all database are available, no
> >>> errors in log
> >>>
> >>> [1] --
> >>> [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
> >>> {error_report,<0.30.0>,
> >>>                         {<0.22466.5305>,crash_report,
> >>>
> >> [[{initial_call,{couch_rep,init,['Argument__1']}},
> >>>                            {pid,<0.22466.5305>},
> >>>                            {registered_name,[]},
> >>>                            {error_info,
> >>>                             {exit,
> >>>                              {db_not_found,
> >>>                               <<"http://user:pass@server
> >> :5984/db_10944/">>},
> >>>                              [{gen_server,init_it,6},
> >>>                               {proc_lib,init_p_do_apply,3}]}},
> >>>                            {ancestors,
> >>>                             [couch_rep_sup,couch_primary_services,
> >>>                              couch_server_sup,<0.31.0>]},
> >>>                            {messages,[]},
> >>>                            {links,[<0.81.0>]},
> >>>                            {dictionary,[]},
> >>>                            {trap_exit,true},
> >>>                            {status,running},
> >>>                            {heap_size,2584},
> >>>                            {stack_size,24},
> >>>                            {reductions,794}],
> >>>                           []]}}
> >>>
> >>
> >> One place I've seen this error pop up when it looks like it shouldn't
> >> is if couch_server gets backed up. If you remsh into one of those db's
> >> you could try the following:
> >>
> >>> process_info(whereis(couch_server), message_queue_len).
> >>
> >> And if that number keeps growing, that could be the issue.
> >>
>
>

Re: CouchDB Replication lacking resilience for many databases

Posted by Jan Lehnardt <ja...@apache.org>.
On Oct 11, 2011, at 14:20, Mark Hahn wrote:

> It would be nice to have a control panel that displays things like this
> message queue depth, connection counts, memory consumed, cpu consumed,
> reads/writes per second, view rebuilds/sec, avg response times, etc.  I'm
> sure someone could come up with many more pertinent vars.
> 
> For extra credit the values could be plotted against time.  When someone has
> a problem they could post the log here.

See /_stats :)

It doesn't have all the things you ask for, but adding new stats isn't hard: 

  http://wiki.apache.org/couchdb/Adding_Runtime_Statistics
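
For example, a quick way to eyeball the numbers (a Python sketch only;
localhost and the exact layout of the output are assumptions, so check
what your own /_stats actually returns):

    import json
    import urllib.request

    # localhost:5984 is an assumption; point this at the node you care about.
    with urllib.request.urlopen("http://localhost:5984/_stats") as resp:
        stats = json.load(resp)

    # Each section (couchdb, httpd, ...) holds metrics with aggregate
    # fields such as current, mean, min and max.
    for section, metrics in sorted(stats.items()):
        for name, values in sorted(metrics.items()):
            print(section, name, values)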

Cheers
Jan
-- 



> 
> On Mon, Oct 10, 2011 at 10:15 PM, Paul Davis <pa...@gmail.com>wrote:
> 
>> On Mon, Oct 10, 2011 at 11:03 PM, Chris Stockton
>> <ch...@gmail.com> wrote:
>>> Hello,
>>> 
>>> On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski <ko...@apache.org>
>> wrote:
>>>> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
>>>>> <fd...@apache.org> wrote:
>>>>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
>>>>>> <ch...@gmail.com> wrote:
>>>>>> Chris,
>>>>>> 
>>>>>> That said work is in the '1.2.x' branch (and master).
>>>>>> CouchDB recently migrated from SVN to GIT, see:
>>>>>> http://couchdb.apache.org/community/code.html
>>>>>> 
>>>>> 
>>>>> Thank you very much for the response Filipe, do you possibly have any
>>>>> documentation or more detailed summary on what these changes include
>>>>> and possible benefits of them? I would love to hear about any tweaking
>>>>> or replication tips you may have for our growth issues, perhaps you
>>>>> could answer a basic question if nothing else: Do the changes in this
>>>>> branch minimize the performance impact of continuous replication on
>>>>> many databases?
>>>>> 
>>>>> Regardless I plan on getting a build of that branch and doing some
>>>>> testing of my own very soon.
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> -Chris
>>>> 
>>>> I'm pretty sure that even in 1.2.x and master each replication with a
>> remote source still requires one dedicated TCP connection to consume the
>> _changes feed.  Replications with a local source have always been able to
>> use a connection pool per host:port combination.  That's not to downplay the
>> significance of the rewrite of the replicator in 1.2.x; Filipe put quite a
>> lot of time into it.
>>>> 
>>>> The link to "those darn errors" just pointed to the mbox browser for
>> September 2011.  Do you have a more specific link?  Regards,
>>>> 
>>>> Adam
>>> 
>>> Well I will remain optimistic that the rewrite could hopefully have
>>> solved several of my issues regardless I hope. I guess the idle TCP
>>> connections by themselves are not too bad, when they all start to work
>>> simultaneously I think is what becomes the issue =)
>>> 
>>> Sorry Adam, here is a better link
>>> 
>> http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E
>> ,
>>> the actual text was:
>>> 
>>> ---------------
>>> 
>>> It seems that randomly I am getting errors about crashes as our
>>> replicator runs, all this replicator does is make sure that all
>>> databases on the master server replicate to our failover by checking
>>> status.
>>> 
>>> Details:
>>> - I notice the below error in the logs, anywhere from 0 to 30 at a time.
>>> - It seems that a database might start replicating okay then stop.
>>> - These errors [1] are on the failover pulling from master
>>> - No errors are displayed on the master server
>>> - The databases inside the URL in the db_not_found portion of the
>>> error, are always available from curl from the failover machine, which
>>> makes the error strange, somehow it thinks it can't find the database
>>> - Master seems healthy at all times, all database are available, no
>>> errors in log
>>> 
>>> [1] --
>>> [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
>>> {error_report,<0.30.0>,
>>>                         {<0.22466.5305>,crash_report,
>>> 
>> [[{initial_call,{couch_rep,init,['Argument__1']}},
>>>                            {pid,<0.22466.5305>},
>>>                            {registered_name,[]},
>>>                            {error_info,
>>>                             {exit,
>>>                              {db_not_found,
>>>                               <<"http://user:pass@server
>> :5984/db_10944/">>},
>>>                              [{gen_server,init_it,6},
>>>                               {proc_lib,init_p_do_apply,3}]}},
>>>                            {ancestors,
>>>                             [couch_rep_sup,couch_primary_services,
>>>                              couch_server_sup,<0.31.0>]},
>>>                            {messages,[]},
>>>                            {links,[<0.81.0>]},
>>>                            {dictionary,[]},
>>>                            {trap_exit,true},
>>>                            {status,running},
>>>                            {heap_size,2584},
>>>                            {stack_size,24},
>>>                            {reductions,794}],
>>>                           []]}}
>>> 
>> 
>> One place I've seen this error pop up when it looks like it shouldn't
>> is if couch_server gets backed up. If you remsh into one of those db's
>> you could try the following:
>> 
>>> process_info(whereis(couch_server), message_queue_len).
>> 
>> And if that number keeps growing, that could be the issue.
>> 


Re: CouchDB Replication lacking resilience for many databases

Posted by Mark Hahn <ma...@boutiquing.com>.
It would be nice to have a control panel that displays things like this:
message queue depth, connection counts, memory consumed, CPU consumed,
reads/writes per second, view rebuilds/sec, average response times, etc. I'm
sure someone could come up with many more pertinent vars.

For extra credit the values could be plotted against time. When someone has
a problem they could post the log here.

On Mon, Oct 10, 2011 at 10:15 PM, Paul Davis <pa...@gmail.com> wrote:

> On Mon, Oct 10, 2011 at 11:03 PM, Chris Stockton
> <ch...@gmail.com> wrote:
> > Hello,
> >
> > On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski <ko...@apache.org>
> wrote:
> >> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
> >>
> >>> Hello,
> >>>
> >>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
> >>> <fd...@apache.org> wrote:
> >>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
> >>>> <ch...@gmail.com> wrote:
> >>>> Chris,
> >>>>
> >>>> That said work is in the '1.2.x' branch (and master).
> >>>> CouchDB recently migrated from SVN to GIT, see:
> >>>> http://couchdb.apache.org/community/code.html
> >>>>
> >>>
> >>> Thank you very much for the response Filipe, do you possibly have any
> >>> documentation or more detailed summary on what these changes include
> >>> and possible benefits of them? I would love to hear about any tweaking
> >>> or replication tips you may have for our growth issues, perhaps you
> >>> could answer a basic question if nothing else: Do the changes in this
> >>> branch minimize the performance impact of continuous replication on
> >>> many databases?
> >>>
> >>> Regardless I plan on getting a build of that branch and doing some
> >>> testing of my own very soon.
> >>>
> >>> Thank you!
> >>>
> >>> -Chris
> >>
> >> I'm pretty sure that even in 1.2.x and master each replication with a
> remote source still requires one dedicated TCP connection to consume the
> _changes feed.  Replications with a local source have always been able to
> use a connection pool per host:port combination.  That's not to downplay the
> significance of the rewrite of the replicator in 1.2.x; Filipe put quite a
> lot of time into it.
> >>
> >> The link to "those darn errors" just pointed to the mbox browser for
> September 2011.  Do you have a more specific link?  Regards,
> >>
> >> Adam
> >
> > Well I will remain optimistic that the rewrite could hopefully have
> > solved several of my issues regardless I hope. I guess the idle TCP
> > connections by themselves are not too bad, when they all start to work
> > simultaneously I think is what becomes the issue =)
> >
> > Sorry Adam, here is a better link
> >
> http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E
> ,
> > the actual text was:
> >
> > ---------------
> >
> > It seems that randomly I am getting errors about crashes as our
> > replicator runs, all this replicator does is make sure that all
> > databases on the master server replicate to our failover by checking
> > status.
> >
> > Details:
> >  - I notice the below error in the logs, anywhere from 0 to 30 at a time.
> >  - It seems that a database might start replicating okay then stop.
> >  - These errors [1] are on the failover pulling from master
> >  - No errors are displayed on the master server
> >  - The databases inside the URL in the db_not_found portion of the
> > error, are always available from curl from the failover machine, which
> > makes the error strange, somehow it thinks it can't find the database
> >  - Master seems healthy at all times, all database are available, no
> > errors in log
> >
> > [1] --
> >  [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
> > {error_report,<0.30.0>,
> >                          {<0.22466.5305>,crash_report,
> >
> [[{initial_call,{couch_rep,init,['Argument__1']}},
> >                             {pid,<0.22466.5305>},
> >                             {registered_name,[]},
> >                             {error_info,
> >                              {exit,
> >                               {db_not_found,
> >                                <<"http://user:pass@server
> :5984/db_10944/">>},
> >                               [{gen_server,init_it,6},
> >                                {proc_lib,init_p_do_apply,3}]}},
> >                             {ancestors,
> >                              [couch_rep_sup,couch_primary_services,
> >                               couch_server_sup,<0.31.0>]},
> >                             {messages,[]},
> >                             {links,[<0.81.0>]},
> >                             {dictionary,[]},
> >                             {trap_exit,true},
> >                             {status,running},
> >                             {heap_size,2584},
> >                             {stack_size,24},
> >                             {reductions,794}],
> >                            []]}}
> >
>
> One place I've seen this error pop up when it looks like it shouldn't
> is if couch_server gets backed up. If you remsh into one of those db's
> you could try the following:
>
>    > process_info(whereis(couch_server), message_queue_len).
>
> And if that number keeps growing, that could be the issue.
>

Re: CouchDB Replication lacking resilience for many databases

Posted by Paul Davis <pa...@gmail.com>.
On Mon, Oct 10, 2011 at 11:03 PM, Chris Stockton
<ch...@gmail.com> wrote:
> Hello,
>
> On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski <ko...@apache.org> wrote:
>> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
>>
>>> Hello,
>>>
>>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
>>> <fd...@apache.org> wrote:
>>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
>>>> <ch...@gmail.com> wrote:
>>>> Chris,
>>>>
>>>> That said work is in the '1.2.x' branch (and master).
>>>> CouchDB recently migrated from SVN to GIT, see:
>>>> http://couchdb.apache.org/community/code.html
>>>>
>>>
>>> Thank you very much for the response Filipe, do you possibly have any
>>> documentation or more detailed summary on what these changes include
>>> and possible benefits of them? I would love to hear about any tweaking
>>> or replication tips you may have for our growth issues, perhaps you
>>> could answer a basic question if nothing else: Do the changes in this
>>> branch minimize the performance impact of continuous replication on
>>> many databases?
>>>
>>> Regardless I plan on getting a build of that branch and doing some
>>> testing of my own very soon.
>>>
>>> Thank you!
>>>
>>> -Chris
>>
>> I'm pretty sure that even in 1.2.x and master each replication with a remote source still requires one dedicated TCP connection to consume the _changes feed.  Replications with a local source have always been able to use a connection pool per host:port combination.  That's not to downplay the significance of the rewrite of the replicator in 1.2.x; Filipe put quite a lot of time into it.
>>
>> The link to "those darn errors" just pointed to the mbox browser for September 2011.  Do you have a more specific link?  Regards,
>>
>> Adam
>
> Well I will remain optimistic that the rewrite could hopefully have
> solved several of my issues regardless I hope. I guess the idle TCP
> connections by themselves are not too bad, when they all start to work
> simultaneously I think is what becomes the issue =)
>
> Sorry Adam, here is a better link
> http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E,
> the actual text was:
>
> ---------------
>
> It seems that randomly I am getting errors about crashes as our
> replicator runs, all this replicator does is make sure that all
> databases on the master server replicate to our failover by checking
> status.
>
> Details:
>  - I notice the below error in the logs, anywhere from 0 to 30 at a time.
>  - It seems that a database might start replicating okay then stop.
>  - These errors [1] are on the failover pulling from master
>  - No errors are displayed on the master server
>  - The databases inside the URL in the db_not_found portion of the
> error, are always available from curl from the failover machine, which
> makes the error strange, somehow it thinks it can't find the database
>  - Master seems healthy at all times, all database are available, no
> errors in log
>
> [1] --
>  [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
> {error_report,<0.30.0>,
>                          {<0.22466.5305>,crash_report,
>                           [[{initial_call,{couch_rep,init,['Argument__1']}},
>                             {pid,<0.22466.5305>},
>                             {registered_name,[]},
>                             {error_info,
>                              {exit,
>                               {db_not_found,
>                                <<"http://user:pass@server:5984/db_10944/">>},
>                               [{gen_server,init_it,6},
>                                {proc_lib,init_p_do_apply,3}]}},
>                             {ancestors,
>                              [couch_rep_sup,couch_primary_services,
>                               couch_server_sup,<0.31.0>]},
>                             {messages,[]},
>                             {links,[<0.81.0>]},
>                             {dictionary,[]},
>                             {trap_exit,true},
>                             {status,running},
>                             {heap_size,2584},
>                             {stack_size,24},
>                             {reductions,794}],
>                            []]}}
>

One place I've seen this error pop up when it looks like it shouldn't
is when couch_server gets backed up. If you remsh into one of those nodes
you could try the following:

    > process_info(whereis(couch_server), message_queue_len).

And if that number keeps growing, that could be the issue.

Re: CouchDB Replication lacking resilience for many databases

Posted by kowsik <ko...@gmail.com>.
Chris,
You might want to read this:

http://blog.mudynamics.com/2011/09/05/help-couchdb-break-the-c10k-barrier/

Make sure that your default 'ulimit -n' is pretty high. Under heavy
load, I've seen the replicator get "backed up" and start consuming
precious RAM until it gets completely wedged. With 1.1, you also have
the replicator-giving-up syndrome, which has now been fixed in trunk
(with infinite retries). We have background workers on blitz.io that
monitor the replicator task status and kick the replications when they
go into an error state. A kludgy hack, but one that works pretty well
in production.

You might also want to add this to your local.ini:

socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true}]

which helps quite a bit with the _changes feed.
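
(If it saves anyone a lookup: as far as I remember that line goes under
the [httpd] section of local.ini, i.e.

    [httpd]
    socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true}]

but do double-check the section name against your own install.)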

K.
---
http://blog.mudynamics.com
http://blitz.io
@pcapr

On Mon, Oct 10, 2011 at 9:03 PM, Chris Stockton
<ch...@gmail.com> wrote:
> Hello,
>
> On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski <ko...@apache.org> wrote:
>> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
>>
>>> Hello,
>>>
>>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
>>> <fd...@apache.org> wrote:
>>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
>>>> <ch...@gmail.com> wrote:
>>>> Chris,
>>>>
>>>> That said work is in the '1.2.x' branch (and master).
>>>> CouchDB recently migrated from SVN to GIT, see:
>>>> http://couchdb.apache.org/community/code.html
>>>>
>>>
>>> Thank you very much for the response Filipe, do you possibly have any
>>> documentation or more detailed summary on what these changes include
>>> and possible benefits of them? I would love to hear about any tweaking
>>> or replication tips you may have for our growth issues, perhaps you
>>> could answer a basic question if nothing else: Do the changes in this
>>> branch minimize the performance impact of continuous replication on
>>> many databases?
>>>
>>> Regardless I plan on getting a build of that branch and doing some
>>> testing of my own very soon.
>>>
>>> Thank you!
>>>
>>> -Chris
>>
>> I'm pretty sure that even in 1.2.x and master each replication with a remote source still requires one dedicated TCP connection to consume the _changes feed.  Replications with a local source have always been able to use a connection pool per host:port combination.  That's not to downplay the significance of the rewrite of the replicator in 1.2.x; Filipe put quite a lot of time into it.
>>
>> The link to "those darn errors" just pointed to the mbox browser for September 2011.  Do you have a more specific link?  Regards,
>>
>> Adam
>
> Well I will remain optimistic that the rewrite could hopefully have
> solved several of my issues regardless I hope. I guess the idle TCP
> connections by themselves are not too bad, when they all start to work
> simultaneously I think is what becomes the issue =)
>
> Sorry Adam, here is a better link
> http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E,
> the actual text was:
>
> ---------------
>
> It seems that randomly I am getting errors about crashes as our
> replicator runs, all this replicator does is make sure that all
> databases on the master server replicate to our failover by checking
> status.
>
> Details:
>  - I notice the below error in the logs, anywhere from 0 to 30 at a time.
>  - It seems that a database might start replicating okay then stop.
>  - These errors [1] are on the failover pulling from master
>  - No errors are displayed on the master server
>  - The databases inside the URL in the db_not_found portion of the
> error, are always available from curl from the failover machine, which
> makes the error strange, somehow it thinks it can't find the database
>  - Master seems healthy at all times, all database are available, no
> errors in log
>
> [1] --
>  [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
> {error_report,<0.30.0>,
>                          {<0.22466.5305>,crash_report,
>                           [[{initial_call,{couch_rep,init,['Argument__1']}},
>                             {pid,<0.22466.5305>},
>                             {registered_name,[]},
>                             {error_info,
>                              {exit,
>                               {db_not_found,
>                                <<"http://user:pass@server:5984/db_10944/">>},
>                               [{gen_server,init_it,6},
>                                {proc_lib,init_p_do_apply,3}]}},
>                             {ancestors,
>                              [couch_rep_sup,couch_primary_services,
>                               couch_server_sup,<0.31.0>]},
>                             {messages,[]},
>                             {links,[<0.81.0>]},
>                             {dictionary,[]},
>                             {trap_exit,true},
>                             {status,running},
>                             {heap_size,2584},
>                             {stack_size,24},
>                             {reductions,794}],
>                            []]}}
>

Re: CouchDB Replication lacking resilience for many databases

Posted by Chris Stockton <ch...@gmail.com>.
Hello,

On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski <ko...@apache.org> wrote:
> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
>
>> Hello,
>>
>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
>> <fd...@apache.org> wrote:
>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
>>> <ch...@gmail.com> wrote:
>>> Chris,
>>>
>>> That said work is in the '1.2.x' branch (and master).
>>> CouchDB recently migrated from SVN to GIT, see:
>>> http://couchdb.apache.org/community/code.html
>>>
>>
>> Thank you very much for the response Filipe, do you possibly have any
>> documentation or more detailed summary on what these changes include
>> and possible benefits of them? I would love to hear about any tweaking
>> or replication tips you may have for our growth issues, perhaps you
>> could answer a basic question if nothing else: Do the changes in this
>> branch minimize the performance impact of continuous replication on
>> many databases?
>>
>> Regardless I plan on getting a build of that branch and doing some
>> testing of my own very soon.
>>
>> Thank you!
>>
>> -Chris
>
> I'm pretty sure that even in 1.2.x and master each replication with a remote source still requires one dedicated TCP connection to consume the _changes feed.  Replications with a local source have always been able to use a connection pool per host:port combination.  That's not to downplay the significance of the rewrite of the replicator in 1.2.x; Filipe put quite a lot of time into it.
>
> The link to "those darn errors" just pointed to the mbox browser for September 2011.  Do you have a more specific link?  Regards,
>
> Adam

Well, I will remain optimistic that the rewrite may have solved
several of my issues regardless. I guess the idle TCP connections by
themselves are not too bad; it's when they all start to work
simultaneously that it becomes an issue =)

Sorry Adam, here is a better link
http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E,
the actual text was:

---------------

It seems that I am randomly getting crash errors as our replicator
runs; all this replicator does is check status to make sure that every
database on the master server replicates to our failover.

Details:
  - I notice the below error in the logs, anywhere from 0 to 30 at a time.
  - It seems that a database might start replicating okay and then stop.
  - These errors [1] are on the failover pulling from master.
  - No errors are displayed on the master server.
  - The databases inside the URL in the db_not_found portion of the
error are always available via curl from the failover machine, which
makes the error strange; somehow it thinks it can't find the database.
  - Master seems healthy at all times, all databases are available, no
errors in the log.

[1] --
  [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
{error_report,<0.30.0>,
                          {<0.22466.5305>,crash_report,
                           [[{initial_call,{couch_rep,init,['Argument__1']}},
                             {pid,<0.22466.5305>},
                             {registered_name,[]},
                             {error_info,
                              {exit,
                               {db_not_found,
                                <<"http://user:pass@server:5984/db_10944/">>},
                               [{gen_server,init_it,6},
                                {proc_lib,init_p_do_apply,3}]}},
                             {ancestors,
                              [couch_rep_sup,couch_primary_services,
                               couch_server_sup,<0.31.0>]},
                             {messages,[]},
                             {links,[<0.81.0>]},
                             {dictionary,[]},
                             {trap_exit,true},
                             {status,running},
                             {heap_size,2584},
                             {stack_size,24},
                             {reductions,794}],
                            []]}}

Re: CouchDB Replication lacking resilience for many database

Posted by CGS <cg...@gmail.com>.
Hi,

I am no expert, but I do have one or two design questions and maybe one or
two suggestions (5000 continuous replications will overload your system for
sure):
1. Why don't you use more storage elements and break your DBs into shards?
That way you can take some pressure off your system and spread it across
other storage elements.
2. Why don't you use external triggers instead of continuous replication?
You can hook external processes up to the _changes feed; they can buffer
updates and flush them via _bulk / parallel operations or on different
ports (I don't suppose that would solve the problem if the bandwidth
between the two servers is broad enough). A rough sketch of this idea
follows below.
Just keep in mind when you design the solution to your problem that your
bottleneck is not in CPU/RAM/connections, but in the HDD (I was surprised
to see my HDD being almost too slow for a 2.4 MB/s download, but it's MS
Windows here :D ).
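
To make suggestion 2 a bit more concrete, here is a rough, hypothetical
sketch (Python 3, standard library only) of an external process that
follows a database's _changes feed, buffers the changed documents, and
flushes them in batches via _bulk_docs with new_edits=false. The host
names, database name and batch size are made up, authentication is left
out, and it ignores conflicts, attachments and revision history, so treat
it as an illustration of the buffering idea rather than a replacement for
the replicator:

    import json
    import urllib.request

    SOURCE = "http://master:5984/db_10944"    # hypothetical source database
    TARGET = "http://failover:5984/db_10944"  # hypothetical target database
    BATCH = 100                               # flush after this many docs

    def flush(buf):
        # Write the buffered docs to the target in one _bulk_docs call,
        # keeping their existing revisions (new_edits=false).
        if not buf:
            return
        body = json.dumps({"docs": buf, "new_edits": False}).encode()
        req = urllib.request.Request(
            TARGET + "/_bulk_docs", data=body,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
        buf.clear()

    # Follow the continuous _changes feed and buffer each changed document.
    feed = urllib.request.urlopen(
        SOURCE + "/_changes?feed=continuous&include_docs=true&heartbeat=30000")
    buf = []
    for line in feed:
        line = line.strip()
        if not line:  # heartbeat newline; it only keeps the socket alive
            continue
        change = json.loads(line.decode("utf-8"))
        if "doc" in change:
            buf.append(change["doc"])
        if len(buf) >= BATCH:
            flush(buf)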

Cheers,
CGS



On Tue, Oct 11, 2011 at 3:18 AM, Adam Kocoloski <ko...@apache.org> wrote:

> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
>
> > Hello,
> >
> > On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
> > <fd...@apache.org> wrote:
> >> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
> >> <ch...@gmail.com> wrote:
> >> Chris,
> >>
> >> That said work is in the '1.2.x' branch (and master).
> >> CouchDB recently migrated from SVN to GIT, see:
> >> http://couchdb.apache.org/community/code.html
> >>
> >
> > Thank you very much for the response Filipe, do you possibly have any
> > documentation or more detailed summary on what these changes include
> > and possible benefits of them? I would love to hear about any tweaking
> > or replication tips you may have for our growth issues, perhaps you
> > could answer a basic question if nothing else: Do the changes in this
> > branch minimize the performance impact of continuous replication on
> > many databases?
> >
> > Regardless I plan on getting a build of that branch and doing some
> > testing of my own very soon.
> >
> > Thank you!
> >
> > -Chris
>
> I'm pretty sure that even in 1.2.x and master each replication with a
> remote source still requires one dedicated TCP connection to consume the
> _changes feed.  Replications with a local source have always been able to
> use a connection pool per host:port combination.  That's not to downplay the
> significance of the rewrite of the replicator in 1.2.x; Filipe put quite a
> lot of time into it.
>
> The link to "those darn errors" just pointed to the mbox browser for
> September 2011.  Do you have a more specific link?  Regards,
>
> Adam

Re: CouchDB Replication lacking resilience for many database

Posted by Adam Kocoloski <ko...@apache.org>.
On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:

> Hello,
> 
> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
> <fd...@apache.org> wrote:
>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
>> <ch...@gmail.com> wrote:
>> Chris,
>> 
>> That said work is in the '1.2.x' branch (and master).
>> CouchDB recently migrated from SVN to GIT, see:
>> http://couchdb.apache.org/community/code.html
>> 
> 
> Thank you very much for the response Filipe, do you possibly have any
> documentation or more detailed summary on what these changes include
> and possible benefits of them? I would love to hear about any tweaking
> or replication tips you may have for our growth issues, perhaps you
> could answer a basic question if nothing else: Do the changes in this
> branch minimize the performance impact of continuous replication on
> many databases?
> 
> Regardless I plan on getting a build of that branch and doing some
> testing of my own very soon.
> 
> Thank you!
> 
> -Chris

I'm pretty sure that even in 1.2.x and master each replication with a remote source still requires one dedicated TCP connection to consume the _changes feed.  Replications with a local source have always been able to use a connection pool per host:port combination.  That's not to downplay the significance of the rewrite of the replicator in 1.2.x; Filipe put quite a lot of time into it.
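
For reference, the replications being discussed here are all of the
"remote source, local target" kind; each one is presumably started by
POSTing a document along these lines (hypothetical host and database
names) to the failover's /_replicate endpoint:

    {"source": "http://master:5984/db_10944",
     "target": "db_10944",
     "continuous": true}

With roughly 5000 such replications active, it is that one _changes
connection per replication back to the master that adds up.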

The link to "those darn errors" just pointed to the mbox browser for September 2011.  Do you have a more specific link?  Regards,

Adam

Re: CouchDB Replication lacking resilience for many database

Posted by Randall Leeds <ra...@gmail.com>.
On Mon, Oct 10, 2011 at 17:02, Chris Stockton <ch...@gmail.com>wrote:

> Hello,
>
> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
> <fd...@apache.org> wrote:
> > On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
> > <ch...@gmail.com> wrote:
> > Chris,
> >
> > That said work is in the '1.2.x' branch (and master).
> > CouchDB recently migrated from SVN to GIT, see:
> > http://couchdb.apache.org/community/code.html
> >
>
> Thank you very much for the response Filipe, do you possibly have any
> documentation or more detailed summary on what these changes include
> and possible benefits of them? I would love to hear about any tweaking
> or replication tips you may have for our growth issues, perhaps you
> could answer a basic question if nothing else: Do the changes in this
> branch minimize the performance impact of continuous replication on
> many databases?
>

The primary change, as I understand it, is that CouchDB explicitly manages a
pool of HTTP connections per replication.

Previously, the pool was handled by the HTTP client module CouchDB uses,
ibrowse, which pools connections per host/port. Therefore, the pool is
shared between all the replications pulling from a given server.

The config file has settings for changing how these pools behave, under the
replicator heading:
max_http_sessions
max_http_pipeline_size

The first refers to the size of the per-host pool. The second refers to the
number of requests that can be queued for each. Unfortunately, there is one
more setting which is not exposed, which is the maximum number of requests
to try at once per replication (side note to devs, should we provide a quick
patch for this for 1.1.1?), and it is fixed at 100. CouchDB does not
elegantly handle the case when the pool is completely utilized. This is not
a problem in the new replication code in 1.2.

For the time being though, the following formula should hold true or you
will experience problems:

max_http_sessions * max_http_pipeline_size >= 100 * N

where N is the maximum number of concurrent replications triggered to pull
or push from a single host.

I believe this to still be the case in 1.1. I'm pretty sure it was at one
point earlier.
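
As a purely illustrative example (the numbers are made up, not
recommendations): with, say, N = 20 concurrent pull replications from a
single host, the formula calls for max_http_sessions * max_http_pipeline_size
to be at least 2000, which a [replicator] section along these lines would
satisfy:

    [replicator]
    max_http_sessions = 40
    max_http_pipeline_size = 50

For the ~5000 continuous replications described at the start of this thread
the right-hand side works out to 500,000, which gives some sense of why a
per-host pool shared by all replications becomes the limiting factor.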

-Randall



>
> Regardless I plan on getting a build of that branch and doing some
> testing of my own very soon.
>
> Thank you!
>
> -Chris
>

Re: CouchDB Replication lacking resilience for many database

Posted by Chris Stockton <ch...@gmail.com>.
Hello,

On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
<fd...@apache.org> wrote:
> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
> <ch...@gmail.com> wrote:
> Chris,
>
> That said work is in the '1.2.x' branch (and master).
> CouchDB recently migrated from SVN to GIT, see:
> http://couchdb.apache.org/community/code.html
>

Thank you very much for the response Filipe. Do you possibly have any
documentation or a more detailed summary of what these changes include
and their possible benefits? I would love to hear about any tweaking or
replication tips you may have for our growth issues; perhaps you could
answer a basic question if nothing else: do the changes in this branch
minimize the performance impact of continuous replication on many
databases?

Regardless I plan on getting a build of that branch and doing some
testing of my own very soon.

Thank you!

-Chris

Re: CouchDB Replication lacking resilience for many database

Posted by Filipe David Manana <fd...@apache.org>.
On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
<ch...@gmail.com> wrote:
> A couple posts about my struggles thus far are located here [1] and
> here [2], I have heard from these threads that a individual has been
> working on major changes to replication, to use a small pool of TCP
> connections instead of a 1 to 1 ration. I think this would improve our
> situation drastically, but I was unable to find said work anywhere in
> SVN or on the web to test or look into it's design concepts. If said
> work is not actually under development yet, I am willing to put in
> some development time on the weekends to learn erlang and improve this
> portion of couchdb, but I do not want to volunteer this time if the
> work has already been complete.

Chris,

That said work is in the '1.2.x' branch (and master).
CouchDB recently migrated from SVN to GIT, see:
http://couchdb.apache.org/community/code.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
