Posted to user@couchdb.apache.org by Nicholas Westlake <ni...@gmail.com> on 2013/10/10 09:18:44 UTC

Limits on Continuous Replication

I was advised in #couchdb that my question is better suited to the mailing list.

I'm looking at an app design that uses quite a few replications. I'm wondering what limits I'll run into from couch. Ideas for how to work around those limits would be bonus. :)

I have 3 kinds of database:

1) user # One of these for each user (readable and writeable only by the user who owns it)
2) project-public # One of these for each project (writeable only by the replicator, readable by everyone)
3) project-admin # One of these for each project (writeable only by the replicator, readable only by users granted permission)

When a user is added to a project, filtered continuous replication starts from their "user" database to both the "project-public" and "project-admin" databases. The filter functions check simple conditions like "doc.committed === true". The replications will be stored in the _replicator database so that they survive a restart.
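To make that concrete, the filter and _replicator document might look like the sketch below (the design-doc name "sync", the ids, and the database names are illustrative, not decided yet):

```javascript
// Hypothetical filter function, stored as "committed" in a design
// document _design/sync inside each "user" database. CouchDB calls it
// with (doc, req) and replicates only documents for which it returns true.
var committedFilter = function (doc, req) {
  return doc.committed === true;
};

// A matching _replicator document for one user/project pair, created by
// the external node.js process when the user joins the project:
var replicationDoc = {
  _id: "user42-to-projectX-public", // illustrative id
  source: "user42",
  target: "projectX-public",
  filter: "sync/committed",
  continuous: true
};
```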

The important behaviors:

- When a user changes data, it should propagate to the project databases in under 30 seconds.
- Project data must be access controlled: public data separate from admin-only data. I believe separate databases with _security set accordingly is the only way to achieve this.
- Restarting couch shouldn't break anything.
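As a sketch of the _security part (all names hypothetical): in CouchDB the members section controls who may read a database, but by default members can also write, so "writeable only by the replicator" needs a validate_doc_update function as well:

```javascript
// Hypothetical _security object for one "project-admin" database.
// Read access is granted via a per-project role.
var security = {
  admins:  { names: ["replicator"], roles: [] },
  members: { names: [], roles: ["projectX-admin"] }
};

// Companion validate_doc_update (stored in a design document), rejecting
// writes from anyone but the replicator account or a server admin:
var validateDocUpdate = function (newDoc, oldDoc, userCtx) {
  if (userCtx.name !== "replicator" && userCtx.roles.indexOf("_admin") === -1) {
    throw({ forbidden: "only the replicator may write to this database" });
  }
};
```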

Some realistic maximums. It's unlikely that:

- any "project-admin" or "project-public" databases will exceed 8 megabytes, compacted.
- any "user" database will exceed 100 megabytes, compacted.
- the user count and project count will ever exceed 100,000 and 40,000 respectively.
- more than 4,000 users will work on any single project.
- more than 2,000 users will be making changes at any given moment.

Just in case it matters:

- The external process that will be creating databases and entries in the _replicator database is node.js.

The continuous replication count (at most) would be 320 million (40,000 projects * 4,000 users/project * 2 replications/user). This seems like it would break something. :)

Should this work? Or: what's the canonical (or a good) alternative to this design with couch?

Re: Limits on Continuous Replication

Posted by Nicholas Westlake <ni...@gmail.com>.
Is "serious punishment for restarting couch" the only problem with this design?

-NRW

On 10 Oct 2013, at 2:31 AM, Tibor Gemes <ti...@gmail.com> wrote:

> I see one problem: the startup case.
> 
> Let's suppose CouchDB is shut down properly. On startup it will scan all
> _replicator documents and start a filtered replication for each one. Each
> filtered replication will scan the full _changes feed and run the filter
> on every item in the feed; if an item passes, it is checked against the
> target db and replicated only if it is newer there. In the normal case
> the item won't be newer, so no replication happens; however, for each
> replicator document the whole _changes feed is still read and filtered.
> This will put huge stress on the system.
> 
> Hth,
> Tib


Re: Limits on Continuous Replication

Posted by Tibor Gemes <ti...@gmail.com>.
I see one problem: the startup case.

Let's suppose CouchDB is shut down properly. On startup it will scan all
_replicator documents and start a filtered replication for each one. Each
filtered replication will scan the full _changes feed and run the filter
on every item in the feed; if an item passes, it is checked against the
target db and replicated only if it is newer there. In the normal case
the item won't be newer, so no replication happens; however, for each
replicator document the whole _changes feed is still read and filtered.
This will put huge stress on the system.
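To put rough numbers on it, using the figures from the original post (the per-database change count below is an assumption for illustration, not from the thread):

```javascript
// Worst-case figures from the original post:
var projects = 40000;
var usersPerProject = 4000;
var replicationsPerMembership = 2; // user -> project-public, user -> project-admin

var replications = projects * usersPerProject * replicationsPerMembership;
// 320,000,000 continuous replications, matching the post's own estimate

// Assume each "user" database carries 10,000 entries in its _changes feed
// (a guess for illustration). Since every replication rescans its source
// feed independently, a cold start performs on the order of:
var changesPerUserDb = 10000;
var filterInvocations = replications * changesPerUserDb; // 3.2e12 filter calls
```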

Hth,
Tib
