Posted to user@couchdb.apache.org by Stefan du Fresne <st...@medicmobile.org> on 2016/05/25 08:34:52 UTC

The state of filtered replication

Hello all,

I work on an app that involves a large amount of CouchDB filtered replication (every user has a filtered subset of the DB locally via PouchDB). Currently filtered replication is our number 1 performance bottleneck for rolling out to more users, and I'm trying to work out where we can go from here.

Our current setup is one CouchDB database and N PouchDB installations, which all two-way replicate, with the CouchDB->PouchDB replication being filtered based on user permissions / relevance [1].
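For concreteness, the PouchDB side of that setup looks roughly like the sketch below; the design doc name ('app/docs_by_place') and the place id parameter are hypothetical stand-ins for our real ones.

    // Pull is filtered server-side; push is unfiltered.
    var local = new PouchDB('app');
    var remote = new PouchDB('https://example.org/main');
    var userPlaceId = 'place-id-for-this-user'; // hypothetical

    local.replicate.from(remote, {
      live: true,
      retry: true,
      filter: 'app/docs_by_place',       // filter function runs on the server
      query_params: { id: userPlaceId }  // available to the filter as req.query
    });
    local.replicate.to(remote, { live: true, retry: true });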

Our issue is that as we add users (a) total document creation velocity increases, and (b) the proportion of documents that are relevant to any particular user decreases. These two effects cause replication -- both initial onboarding and ongoing sync -- to take longer and longer.

At this stage we are being forced to manually limit the number of users we onboard at any particular time to half a dozen or so, or risk CouchDB becoming unresponsive [2]. As we'd want to be onboarding 50-100 at a time given how we're rolling out, you can imagine that this is pretty painful.

I have already re-written the filter in Erlang, halving its execution time -- awesome!
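For reference, the JavaScript shape of a hierarchy filter like the one described in [1] is roughly the sketch below (all field names are hypothetical); the Erlang version is the same logic ported to the native query server.

    // Sketch: pass a doc if the requesting user's place id appears
    // anywhere in the doc's contact/place chain.
    function(doc, req) {
      var place = doc.contact;
      while (place) {
        if (place._id === req.query.id) {
          return true;
        }
        place = place.parent;
      }
      return false;
    }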

I also attempted to simplify the filter to increase performance. However, filter speed seems to depend more on the physical size of the filter's source than on what code actually executes, which makes writing a simple filter that can fall back to a complicated one not terribly useful (see: https://issues.apache.org/jira/browse/COUCHDB-3021).

If the above-linked ticket is fixed (if it can be), this would make our filter 3-4x faster again. However, this still wouldn't address the fundamental issue that filtered replication is very CPU-intensive, and so, as noted above, doesn't seem to scale terribly well.

Ideally then, I would like to remove filtered replication completely, but there does not seem to be a good alternative right now.

Looking through the archives there was talk of adding view replication, see: https://mail-archives.apache.org/mod_mbox/couchdb-user/201307.mbox/%3CCAJNb-9pK4CVRHNwr83_DXCn%2B2_CZXgwDzbK3m_G2pdfWjSsFMA%40mail.gmail.com%3E , but it doesn't look like this ever got resolved.

There is also often talk of a database per user being a good scaling strategy, but we're basically doing that already (with PouchDB), and for us documents aren't owned / viewed by just one person, so this does not get us away from filtered replication (e.g. a supervisor replicates her own documents as well as N sub-users' documents). There are potentially wild and crazy schemes involving many different databases, where the equivalent of filtering is expressed in replication relationships, but this would add a massive amount of complexity to our app, and I'm not even convinced it would work, as there are lots of edge cases to consider.

Does anyone know of anything else I can try to increase replication performance? Or to safeguard against many replicators unacceptably degrading CouchDB performance? Does CouchDB 2.0 address any of these concerns?

Thanks in advance,
- Stefan du Fresne

[1] security is handled by not exposing CouchDB directly and going through a wrapper service that validates Couch requests; relevance is hierarchy-based (i.e. documents that you or your subordinates authored are replicated to you)
[2] there are also administrators / configurers who access CouchDB directly

Re: The state of filtered replication

Posted by Pedro Narciso García Revington <p....@gmail.com>.
@Stefan

It can be as simple as putting the replicas behind a load balancer and
pointing all your clients at it, but it depends on how your application
works. This also helps you scale the solution, since you can add more
replicas to the load balancer without the clients knowing. Things get
more complicated if you need to scale writes too.

2016-05-25 12:39 GMT+02:00 Stefan du Fresne <st...@medicmobile.org>:

> [ ... ]

Re: The state of filtered replication

Posted by Stefan du Fresne <st...@medicmobile.org>.
Hi Pedro,

Thanks for your advice.

This is definitely something that is in the back of our minds, along with looking into CouchDB clustering. Another similar option we're considering is having filtered replication between those replicas and having them represent regions (our data permission structure is basically report <- person <- family <- region <- larger region <- still larger region). This would still involve filtered replication, but it would cut down on the irrelevant documents users have to filter through. We're still at the stage of trying to get the most out of one server, however.

On your example though, to be clear: assigning users to replicas is something that I have to manage myself, correct? Do you know if a particular user needs to stay on the same replica, or if I could just dumbly direct them to any existing node? Naively I'd think I could do the latter, but I've noticed one-way replication seems to involve passing some metadata back to the server (Pouch does this, though I've never really looked into what it's sending or what Couch does with it), so it's not clear how stateful this kind of thing is.

Cheers,
Stefan

> On 25 May 2016, at 09:51, Pedro Narciso García Revington <p....@gmail.com> wrote:
> 
> [ ... ]

Re: The state of filtered replication

Posted by Pedro Narciso García Revington <p....@gmail.com>.
Because CouchDB supports master-master replication, you can alter your
topology to:

master couchdb → couchdb replica 1 → some clients
               → couchdb replica 2 → some other clients

So you can distribute the load between the replicas.
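With the _replicator database, each arrow in that diagram is just a
continuous replication document you PUT into _replicator, something
like this (hosts and db names are placeholders):

    {
      "_id": "master-to-replica-1",
      "source": "http://master:5984/mydb",
      "target": "http://replica1:5984/mydb",
      "continuous": true
    }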

2016-05-25 10:34 GMT+02:00 Stefan du Fresne <st...@medicmobile.org>:

> [ ... ]

Re: The state of filtered replication

Posted by Will Holley <wi...@gmail.com>.
Hi Stefan,

CouchDB 2.0 allows Mango selectors to be used as replication filters
(https://issues.apache.org/jira/browse/COUCHDB-2988), which offers
improvements over JavaScript-based filters because the selector is
evaluated natively, with no JavaScript round trip. There are also many
improvements in the works around management of server-side replications,
which should help if you end up with an architecture where many per-user
databases sync server-side with a "master" database.
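As a sketch, a one-off selector-based replication looks roughly like
this (db names and the selector field are placeholders):

    POST /_replicate
    {
      "source": "http://localhost:5984/main",
      "target": "http://localhost:5984/user-a",
      "selector": { "owner": "user-a" }
    }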

I've seen a number of Cloudant customers struggle with this problem,
and it pretty much comes down to "avoid filtering over large sets of
documents". The two common strategies seem to be:

1. Manually sharding data (db per user / device)
2. Driving replication with something other than _changes

(1) can sometimes be done with a temporary database. For instance, you
may be able to create a database containing, say, "today's documents"
using a naming convention and have devices sync from that.

(2) might involve bootstrapping the replication's since parameter (i.e.
capture a sequence number daily and start replication from that point,
because you know there can be no documents for user A before that
date), using a query to grab a set of documents and using them to
filter the replication, or writing your own replicator which uses those
doc _ids instead of the _changes feed; a sketch of the bootstrapping
idea follows below.
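For example, against the plain HTTP API (db names, the sequence value,
and the filter name are placeholders):

    POST /_replicate
    {
      "source": "http://localhost:5984/main",
      "target": "http://localhost:5984/user-a",
      "since_seq": 123456,
      "filter": "app/docs_by_place",
      "query_params": { "id": "user-a-place" }
    }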

As Steve mentioned, both approaches have interesting edge cases around
dealing with documents that no longer pass a filter / match a query,
and you may need a strategy to remove documents from devices (PouchDB
doesn't support purging yet and you don't necessarily want deletions
to sync back to the server).

Cheers,

Will


On 25 May 2016 at 15:17, Steve Genoud <st...@gmail.com> wrote:

> [ ... ]

Re: The state of filtered replication

Posted by Steve Genoud <st...@gmail.com>.
Hi Stefan,

There is a fork of CouchDB called Barrel that has support for view-based
replication: https://docs.barrel-db.org/docs/using-the-view-changes

This seems to be mostly what you are looking for - a way to have a filter
that is "cached" in some way. Note that it introduces some "interesting"
edge cases (what happens when a document is removed from the view, for
instance) that you need to be aware of.

Best,
Steve

On Wed, 25 May 2016 at 15:12 Varun Sikka <si...@gmail.com> wrote:

> [ ... ]

Re: The state of filtered replication

Posted by Varun Sikka <si...@gmail.com>.
Hi Stefan,

I ran into a similar issue, and I had to give up on filtered
replication. I modified replication to use the doc_ids key and manually
send the doc_ids that I want to replicate, as sketched below.
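That is, something along these lines, where the id list comes from a
query of your own beforehand (db names and ids are placeholders):

    POST /_replicate
    {
      "source": "http://localhost:5984/main",
      "target": "http://localhost:5984/user-a",
      "doc_ids": ["report:001", "report:002", "person:42"]
    }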

I am planning to move to the master - master - client DB model, but on
limited infrastructure I am not very confident it can be managed. I am
really looking forward to a more robust solution to this replication
problem.

Regards
Varun

On Wed, May 25, 2016 at 3:25 PM, Stefan Klein <st...@gmail.com> wrote:

> [ ... ]

-- 
Regards,
Varun

Re: The state of filtered replication

Posted by Robert Newson <rn...@apache.org>.
All replications should checkpoint periodically too, not just at the end; the log will show this as a PUT to a _local URL.
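So you can confirm checkpointing by fetching the checkpoint document the log mentions (the id below is a placeholder for the one in your log; response shape abbreviated):

    GET /mydb/_local/b3e44b920ee2951cb2e123b63044427f

    {
      "_id": "_local/b3e44b920ee2951cb2e123b63044427f",
      "source_last_seq": 12345,
      "history": [ ... ]
    }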

Sent from my iPhone

> On 26 May 2016, at 14:04, Paul Okstad <po...@gmail.com> wrote:
> 
> [ ... ]


Re: The state of filtered replication

Posted by Paul Okstad <po...@gmail.com>.
I'll double check my situation, since I have not thoroughly verified it. This particular issue occurs between restarts of the server, where I make no changes to the continuous replications in the _replicator DB, but it may also be related to the issue of too many continuous replications causing replications to stall out from lack of resources. It's possible that I assumed they were starting over from seq 1 when in fact they were never able to complete a full replication in the first place.

-- 
Paul Okstad

> On May 26, 2016, at 2:51 AM, Robert Newson <rn...@apache.org> wrote:
> 
> [ ... ]

Re: The state of filtered replication

Posted by Robert Newson <rn...@apache.org>.
There must be something else wrong. Filtered replications definitely make and resume from checkpoints, same as unfiltered.

We mix the filter code and parameters into the replication checkpoint id to ensure we start from 0 for a potentially different filter. Perhaps you are changing those? Or maybe supplying since_seq as well (which overrides the checkpoint)?

Sent from my iPhone

> On 25 May 2016, at 16:39, Paul Okstad <po...@gmail.com> wrote:
> 
> [ ... ]


Re: The state of filtered replication

Posted by Paul Okstad <po...@gmail.com>.
This isn't just a problem with filtered replication, it's a major issue with the database-per-user strategy in general (at least in the v1.6.1 I'm using). I'm also using a database-per-user design with thousands of users and a single global database. If even a small fraction of the users (hundreds) have continuously ongoing replications from their user DBs to the global DB, it causes extremely high CPU utilization. This is without any JavaScript replication filter at all.

Another huge issue with filtered replications is that they lose their place when replications are restarted. In other words, they don't keep track of the sequence ID between restarts of the server, or between stopping and starting the same replication. So, for example, if I want to perform filtered replication of public documents from the global DB to the public DB, and I have a ton of documents in global, then each time I restart the filtered replication it begins from sequence #1. I'm guessing this is because CouchDB does not know whether the filter function has been modified between replications, but this behavior is still very disappointing.

— 
Paul Okstad
http://pokstad.com



> On May 25, 2016, at 4:25 AM, Stefan Klein <st...@gmail.com> wrote:
> 
> [ ... ]

Re: The state of filtered replication

Posted by Stefan Klein <st...@gmail.com>.
2016-05-25 12:48 GMT+02:00 Stefan du Fresne <st...@medicmobile.org>:



> So to be clear, this is effectively replacing replication — where the
> client negotiates with the server for a collection of changes to
> download — with a daemon that builds up a collection of documents that
> each client should get (and also presumably delete), which clients can
> then query for when they're able?
>

Sorry, I didn't describe it well enough.

On the server side we have one big database containing all documents,
and one DB for each user. The clients always replicate to and from
their individual user DB, unfiltered, so the DB for a user is a 1:1
copy of their pouchdb/... on their client.

Initially we set up a filtered replication for each user, from the
server's main database to the server-side copy of the user's database.
With this we ran into performance problems, and sooner or later we
probably would have run into issues with open file descriptors.

So what we do instead is listen to the changes feed of the main
database and distribute the documents to the server-side user DBs,
which are then synced with the clients.
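In rough terms, the forward path looks like the sketch below (using
PouchDB as the HTTP client just for illustration; usersFor() is a
stand-in for our routing rules):

    // Minimal sketch: one changes listener, short doc_ids replications.
    var PouchDB = require('pouchdb');
    var main = new PouchDB('http://localhost:5984/main');

    main.changes({ since: 'now', live: true, include_docs: true })
      .on('change', function (change) {
        usersFor(change.doc).forEach(function (user) {
          // one-shot replication of just this doc into the user's db
          PouchDB.replicate(main, 'http://localhost:5984/userdb-' + user, {
            doc_ids: [change.id]
          });
        });
      });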

Note: this is only for documents the users actually work with (as in,
possibly modify); for queries on the data we query views on the main
database.

For the way back, we listen to _db_updates, so we get an event for
changes on the user DBs, fetch the change from the user's DB, and
determine what to do with it. We do not replicate users' changes back
to the main database, but rather have an internal API to evaluate all
kinds of constraints on user input. If you do not have to check user
input, you could certainly listen to _db_updates and "blindly" one-shot
replicate from the changed DB to your main DB.

-- 
Stefan

Re: The state of filtered replication

Posted by Stefan du Fresne <st...@medicmobile.org>.
Hey Stefan,

That sounds really interesting!

So to be clear, this is effectively replacing replication— where the client negotiates with the server for a collection of changes to download— with a daemon that builds up a collection of documents that each client should get (and also presumably delete), which clients can then query for when they’re able?

This is sort of like ‘compiling’ the replication result so you know the answer when the client asks, as opposed to working it out live each time?

Cheers,
Stefan

> On 25 May 2016, at 11:34, Stefan Klein <st...@gmail.com> wrote:
> 
> [ ... ]

Re: The state of filtered replication

Posted by Stefan Klein <st...@gmail.com>.
Hi Stefan,

2016-05-25 10:34 GMT+02:00 Stefan du Fresne <st...@medicmobile.org>:

> [ ... ]

We ran into similar issues with one DB per user and filtered
replication. I solved it with a daemon that listens to the main DB's
changes feed, determines which user(s) each changed document should be
replicated to, and uses single-shot replications by document ID to
actually replicate it. So there is just one listener on the (in our
case unfiltered) changes feed, plus short replications by document ID.

For our case this works very well so far.

regards,
Stefan

Re: The state of filtered replication

Posted by Stefan du Fresne <st...@medicmobile.org>.
Hi Sinan,

It’s good to hear we are not the only people to have this problem, and that we’re following in the footsteps of others :-)

More databases is definitely an option, but it's one we're trying to avoid, partially to keep the budget under control and partially because we'd be constantly scaling up and down (most of our performance concerns arise only while we're onboarding), so it's a lot of extra complexity.

Unfortunately (2) isn't an option for us either: our PouchDB clients are on slow phones with slow, flaky and expensive network connections (think health workers in remote parts of Uganda), so reducing what we send them to the bare minimum is very important. We also shouldn't really send them other people's data -- even to have Pouch then filter it out -- for privacy reasons.

Stefan
> On 25 May 2016, at 09:55, Sinan Gabel <si...@gmail.com> wrote:
> 
> [ ... ]

Re: The state of filtered replication

Posted by Sinan Gabel <si...@gmail.com>.
Hi Stefan,

I recognise your description and problem: I also gave up on the
server-side filtering performance. With CouchDB 1.6.1 I only saw two
immediate options:

(1) More databases on the server side, to reduce the number of docs
per database.
(2) Simply do the filtering on the client side in PouchDB; this is
actually quite fast and robust. Experiment to find the best settings
for the *batch_size* and *timeout* options (see the sketch below).

For (2), possibly combine with https://github.com/nolanlawson/worker-pouch
if there are a lot of documents.
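For example (option values are just starting points to tune):

    // Pull everything unfiltered, then filter/ignore locally.
    var local = new PouchDB('app');
    var remote = new PouchDB('https://example.org/main');

    local.replicate.from(remote, {
      live: true,
      retry: true,
      batch_size: 100,  // changes processed per batch
      timeout: 60000    // request timeout in ms
    });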


... however, it would be best to have a much faster, production-grade
server-side filtering option in CouchDB 2.x.


Br,
Sinan

On 25 May 2016 at 10:34, Stefan du Fresne <st...@medicmobile.org> wrote:

> [ ... ]