Posted to user@couchdb.apache.org by svilen <az...@svilendobrev.com> on 2012/09/26 11:50:09 UTC

Re: Selective Replication

wild guess: do you have loops in the "graph", or is it a pure tree? i.e.
can the "named replication" repeat some documents if they are
referenced many times?
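
to make the question concrete, here is a minimal sketch of collecting the ids of one logical asset so that shared or cyclic references are listed only once. everything in it is an assumption for illustration: the in-memory `docs` dict, the `refs` field name and the helper `collect_asset_ids` do not come from the original post.

```python
from collections import deque


def collect_asset_ids(docs, root_id, ref_field="refs"):
    """Return every id reachable from root_id, each listed exactly once.

    `docs` maps a document id to its body; `ref_field` is a hypothetical
    attribute holding the ids of the documents it references.
    """
    seen = {root_id}
    order = [root_id]
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        # Documents that are missing or have no references contribute nothing.
        for ref in docs.get(current, {}).get(ref_field, []):
            if ref not in seen:  # skip ids already collected (loops, shared refs)
                seen.add(ref)
                order.append(ref)
                queue.append(ref)
    return order


# Hypothetical asset whose partial documents reference each other in a loop.
docs = {
    "asset": {"refs": ["image", "meta"]},
    "image": {"refs": ["meta"]},
    "meta": {"refs": ["asset"]},
}
print(collect_asset_ids(docs, "asset"))
```

if the id list handed to the replicator is built without such a "seen" check, a document referenced from several places could end up in the list several times.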

apart from that, i guess one stream/replication of N docs would be
faster than N replications of 1 doc each, but i wonder why it
decelerates as N grows. btw the all-in-one would be starting
on an empty target, while the singles would be going to a growing
target each time.
can you try a full all-in-one replication of 100000 docs into a target
that already holds another 100000 docs, and see how that measures?
maybe that gives a hint..

svil

On Wed, 26 Sep 2012 17:34:50 +0200
Frank Wunderlich <fr...@kreuzwerker.de> wrote:

> Hi *,
> 
> I am currently trying to figure out, how one could realize something
> like "selective replication" in CouchDB.
> 
> In our scenario we have got around 10 physically distributed CouchDB
> instances running. There will probably be more than 1 million
> documents in our "master" instance. Only a subset of those documents
> shall be replicated to each of the "slave" instances. Users shall be
> able to explicitly control, which documents shall be synchronized to
> which destination.
> 
> So far I stumbled over the following 2 concepts:
> 1. Filtered Replication
> 2. Named Document Replication
> 
> At first glance, replication filters seemed to be the way to go.
> But unfortunately we have got a quite "relational" document model.
> One logical "asset" consists of several CouchDB documents,
> referencing each other.
> 
> The filter functions can only access data that is part of the
> document that is passed in as a parameter. Because of this limitation,
> each partial document must contain all the information necessary to
> determine whether it shall be replicated or not.
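
A minimal sketch of what that limitation looks like in practice, written as Python that builds the JSON involved. The design document name `sync`, the filter name `by_target` and the per-document `targets` attribute are assumptions made up for illustration; only the `filters`, `filter` and `query_params` keys are CouchDB's own.

```python
import json

# JavaScript source of a CouchDB filter function. It receives only `doc` and
# `req`, so the hypothetical `targets` list must be stored on every partial
# document that should be replicated.
FILTER_SOURCE = (
    "function(doc, req) {"
    " return !!doc.targets && doc.targets.indexOf(req.query.target) !== -1;"
    " }"
)


def filter_design_doc():
    """Design document carrying the filter (the names are assumptions)."""
    return {"_id": "_design/sync", "filters": {"by_target": FILTER_SOURCE}}


def filtered_replication_body(source, target, target_name):
    """Request body for a filtered replication using the filter above."""
    return {
        "source": source,
        "target": target,
        "filter": "sync/by_target",
        "query_params": {"target": target_name},
    }


def passes_filter(doc, target_name):
    """Python mirror of the JavaScript filter, for illustration only."""
    return target_name in doc.get("targets", [])


print(json.dumps(filtered_replication_body("master", "slave_berlin", "berlin")))
print(passes_filter({"_id": "part-1", "targets": ["berlin"]}, "berlin"))
print(passes_filter({"_id": "part-2"}, "berlin"))
```

The second partial document is skipped simply because it lacks the `targets` attribute, even if it belongs to the same logical asset as the first one.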
> 
> This leads to redundancy and to potential inconsistencies if a
> "transaction" fails. Inconsistent asset aggregates might get
> "partially" transferred to other CouchDB instances. And in my eyes,
> it will be hard to recognize and track down the cause of such
> inconsistencies.
> 
> Furthermore our content documents get "polluted" by pure technical
> attributes.
> 
> 
> That's why we took a look at the second option: Named Document
> Replication.
> 
> It seemed to be a good idea to separate the two concerns of
> persistence and synchronization. First we would like to persist any
> "logical asset" in our local CouchDB. When we know that this step
> succeeded and all partial documents got stored in the database, then
> we would "register" the "logical asset" for synchronization. This
> step would happen on the application layer, that is built on top of
> our CouchDB.
> 
> The registration process would look up all partial documents that
> make up the "logical asset". Then any running replication job would
> get canceled (assuming we are using continuous replication). Finally
> we would restart those replication jobs by adding the identified
> document IDs to the JSON that gets posted to the replicate URL.
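
The cancel-and-restart step could be sketched roughly as follows. This only builds the request bodies and sends nothing. The helper names are hypothetical, and the sketch assumes that the cancel request repeats the fields of the request that started the job (here produced by the same helper) with `"cancel": true` added.

```python
import json


def replication_body(source, target, doc_ids, cancel=False):
    """Body for a continuous named-document replication request."""
    body = {
        "source": source,
        "target": target,
        "continuous": True,
        # Sorted so that the start and cancel bodies list the ids identically.
        "doc_ids": sorted(doc_ids),
    }
    if cancel:
        body["cancel"] = True
    return body


def reregister(source, target, current_ids, new_ids):
    """Return the bodies to POST to the replicate URL, in order.

    First the running job (started with current_ids) is canceled, then a new
    job is started with the union of the old and the newly registered ids.
    """
    merged = set(current_ids) | set(new_ids)
    return [
        replication_body(source, target, current_ids, cancel=True),
        replication_body(source, target, merged),
    ]


# Hypothetical database names and document ids.
for body in reregister("master", "slave_1", ["a1", "a2"], ["b1", "a2"]):
    print(json.dumps(body))
```

Note that the id list in the second body grows with every registered asset, so each restart posts the complete list again.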
> 
> The first attempts seemed promising.
> But when experimenting with larger sets of documents, we noticed a
> significant performance degradation during replication. With 100,000
> documents to be replicated, the "Named Document Replication" was 4
> times slower than the complete and unconditional replication of the
> whole database. With 200,000 documents, the selective approach was
> even 7 times slower. With 1,000,000 documents, the factor was > 20.
> 
> So this approach is not scaling well...
> 
> What are your thoughts about this?
> Is there anyone who has faced similar architectural questions? 
> 
> Any hint will be appreciated.
> Best regards,
> Frank
> 
> 
> 
> --
> kreuzwerker GmbH - we touch running systems
> fon  +49 177 8780280  | fax +49 30  6098388-99 
> Ritterstraße 12-14, 10969 Berlin | frank.wunderlich@kreuzwerker.de
> HR B 129427 | Amtsgericht Charlottenburg  |  Geschäftsführer: Tilmann
> Eing  
>