Posted to user@couchdb.apache.org by Tim Hankins <ti...@gmail.com> on 2013/01/17 13:08:01 UTC

Refactoring a CouchDB.

Hi,

I'm a student programmer at the IT University of Copenhagen and have
inherited a CouchDB application that I'm having trouble scaling. I
believe it may need to be refactored.

Specifically, the problem seems to be coming from the use of Filtered
Replication. (All user documents are stored in the same database, and
replicating from server to client requires filtered replication.)

I'm in the process of reading Chapter 23 of O'Reilly's "CouchDB: The
Definitive Guide", which deals with high performance, and O'Reilly's
"Scaling CouchDB". Any other suggestions about the following would be
greatly appreciated!

Some background...

The system is part of a clinical trial undertaken by the ITU and the Danish
State Hospital. It aims to help bipolar patients manage their disease. It
is composed of
    1). 100+ Android phones running a client application and Couchbase
Mobile.
    2). A web server backed by CouchDB.

Each day, the Android client application collects two kinds of data:
subjective and objective. Subjective data are entered manually by patients;
objective data are gathered from the phone's sensors.

Subjective and Objective data are stored in their own couch documents, and
have IDs that include the user's ID, the document type, and the date in a
"DD_MM_YYYY" format. They are replicated once a day by placing replication
docs in the "_replicator" database.
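
For reference, a replication doc of the kind we place in "_replicator"
looks roughly like this (a sketch only; the database names, filter name,
and user ID are placeholders, not our real ones):

    {
        "_id": "rep_user42_17_01_2013",
        "source": "http://server.example.com:5984/maindb",
        "target": "patientdb",
        "filter": "app/by_user",
        "query_params": { "userId": "user42" }
    }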

Once replicated to the server, these documents are...
    1). Used as input to a data mining algorithm.
    2). Displayed on a web page. (Users can see their own data, and
clinicians can see the data for all users.)

The data mining algorithm produces a new CouchDB document for each user
every day, which we call an "Impact Factor" document. (It looks at each
user's historical objective and subjective data, and looks for
correlations.)

Replication: Replication takes place from client to server, and from server
to client.
    1). Client to server: This seems to be working fine.
    2). Server to client: This is what's broken.

Two things have to be replicated from server to client.
    1). Each user's subjective data for the past 14 days.
    2). Each user's Impact Factor document for the current day.

Since all user documents are stored in the same database, we use filtered
replication to send the right docs to the right users.
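
For reference, the filter is a JavaScript function in a design document,
along these lines (a simplified sketch rather than our exact code; the
field names userId, type and date are illustrative):

    function (doc, req) {
        // Only ship documents that belong to the requesting user.
        if (!doc.userId || doc.userId !== req.query.userId) return false;
        if (!doc.date || !doc.type) return false;

        // Parse the "DD_MM_YYYY" string on every single call.
        var p = doc.date.split("_");
        var when = new Date(p[2], p[1] - 1, p[0]).getTime();
        var age = new Date().getTime() - when;

        // Impact Factor docs: current day. Subjective docs: last 14 days.
        if (doc.type === "impactfactor") return age < 86400000;
        if (doc.type === "subjective") return age < 14 * 86400000;
        return false;
    }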

The problem is that this filter function takes too long (>10 minutes).
    1). To test whether the filter function was crashing, I replicated the
entire DB to another, unloaded machine, and it seems to run just fine.
(Well, it takes about 2.5 minutes, but it doesn't crash.)
    2). I've tried re-writing the filter function in Erlang, but haven't
managed to get it working.

And besides, I suspect that the way the DB is structured is just not suited
to the job.

So, to summarize...
    - Android client phones produce new CouchDB docs and replicate them to
the server.
    - One central CouchDB holds all users.
    - Both individual and group data are served to web pages.
    - A data mining algorithm processes this data on a per-user basis.
    - Subjective data and Impact Factor data documents are replicated from
the server to each client phone.

Is there a way to structure the DB so that users can replicate without the
need for filters, but which preserves the ability of clinicians to see an
overview of all users? (It's my understanding that views can't be run
*across* databases.)

Well, as before, any suggestions or pointers would be much appreciated.

Cheers,
Tim.

Re: Refactoring a CouchDB.

Posted by Alexander Shorin <kx...@gmail.com>.
Hi Tim!

Glad to see that CouchDB goes into medicine (:

On Thu, Jan 17, 2013 at 4:08 PM, Tim Hankins <ti...@gmail.com> wrote:
>
> The problem is that this filter function takes too long (>10 minutes).
>     1). To test whether the filter function was crashing, I replicated the
> entire DB to another, unloaded machine, and it seems to run just fine.
> (Well, it takes about 2.5 minutes, but it doesn't crash.)
>     2). I've tried re-writing the filter function in Erlang, but haven't
> managed to get it working.

1) Integration tests are never fast. We use regular unit tests for
every design function, running them directly within the query server
rather than talking to it as a foreign process. All the tests are
stored within the design document, so you can run them on any instance
at any time to prove that things work as expected. This saves a lot of
time and helps with managing individual databases.
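
As a rough illustration of the idea (the layout and the test harness are
our own convention, not a CouchDB feature), the design document simply
carries its fixtures next to the function under test:

    {
        "_id": "_design/app",
        "filters": {
            "by_user": "function (doc, req) { return doc.userId === req.query.userId; }"
        },
        "tests": {
            "by_user": [
                {"doc": {"userId": "a"}, "req": {"query": {"userId": "a"}}, "expect": true},
                {"doc": {"userId": "b"}, "req": {"query": {"userId": "a"}}, "expect": false}
            ]
        }
    }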

2) Erlang filters are extremely fast since they bypass a lot of
intermediate operations such as encoding/decoding data to/from JSON,
stdio communication, and more. If you want to improve replication
speed, they are a good solution. But you should be aware that there is
no sandbox for them, so be careful about running untrusted code.

--
,,,^..^,,,

Re: Refactoring a CouchDB.

Posted by "Michael Zedeler." <mi...@zedeler.dk>.
Hi Dave.

I find this discussion very interesting - I have a few questions 
regarding complexity.

On 2013-01-17 14:32, Dave Cottlehuber wrote:
> The main constraint is that replication filters need to be run per
> document, per replication. So N replications require N passes through
> all the documents, i.e. N^2. And in your case most of the documents will
> not be replicated to a given user.
To be more precise: given N documents, M replication filters and K
replication runs, the complexity is N*M*K, right?

If N and M are of the same order of magnitude, you can safely assume N*M
is N^2, but if there is just one replication filter, the complexity is
just N (which shouldn't be surprising)?

So the issue is that if you want to replicate to a large number of 
users, you get a large value of K?
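
(To put numbers on it: with, say, N = 100,000 documents and K = 100
users each replicating once a day, that is 10^7 filter invocations per
day, each one a round trip through couchjs?)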

Regards,

Michael.


Re: Refactoring a CouchDB.

Posted by Tim Hankins <ti...@gmail.com>.
Thanks all! Your suggestions are extremely valuable. I really appreciate
the help!

If I find that I need a little more clarification, I'll be sure to ask, but
for now I'm going to roll up my sleeves, and get to work.

Cheers,
Tim.



Re: Refactoring a CouchDB.

Posted by Robert Newson <rn...@apache.org>.
Just a reminder that filtered replication still benefits from
checkpointing (i.e., it's incremental as usual). The performance
difference between unfiltered and filtered replication is the cost of
evaluating the filter through couchjs.



Re: Refactoring a CouchDB.

Posted by Dave Cottlehuber <dc...@jsonified.com>.
On 17 January 2013 13:08, Tim Hankins <ti...@gmail.com> wrote:
> Hi,
>
> I'm a student programmer at the IT University of Copenhagen, and have
> inherited a CouchDB application, which I'm having trouble scaling. I
> believe it may need to be refactored.
>
> Specifically, the problem seems to be coming from the use of Filtered
> Replication. (All user documents are stored in the same database, and
> replicating from server to client requires filtered replication.)

Yes, this seems very likely.

The main constraint is that replication filters need to be run per
document, per replication. So N replications require N passes through
all the documents, i.e. N^2. And in your case most of the documents will
not be replicated to a given user.

There are 2 main approaches, and a possible 3rd one:

1. Keep using the same replication approach, but create an additional
server-side per-user DB. Move all endpoints to access their private DB
only, and non-filtered replication is now possible. Use a master DB
into which every private DB replicates all its docs, giving you views
across all users, and implement an additional replication into the
server-side private DBs to deliver their data back to them.
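
As a sketch of the wiring (all names here are placeholders):

    // Phone <-> its own private server-side DB: no filter needed.
    { "_id": "pull_user42",
      "source": "http://server.example.com:5984/userdb-42",
      "target": "localdb" }

    // Server-side fan-in: every private DB replicates into the master
    // DB, so your views across all users keep working.
    { "_id": "fanin_user42",
      "source": "userdb-42",
      "target": "masterdb",
      "continuous": true }

Any remaining filtering (e.g. routing each user's Impact Factor docs
from the master DB back into their private DB) then happens in
server-side replications, rather than once per phone over the wire.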

2. Use named document replication to transfer only the required
documents from the master DB to the endpoint DB. To identify which
documents need to be transferred per user, create a view that exposes
only the user's name. You can therefore avoid the N^2 filter pass
above, as the work is done once within the view.

http://wiki.apache.org/couchdb/Replication#Named_Document_Replication
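
A sketch (field names illustrative; you'd query the view with
key="<user>" to collect the ids before each run):

    // Map function: index each doc by its owner, computed once,
    // server-side. Each view row's id is the doc's _id.
    function (doc) {
        if (doc.userId) emit(doc.userId, null);
    }

    // Then POST to /_replicate with only that user's documents:
    { "source": "masterdb",
      "target": "http://phone.example.com:5984/localdb",
      "doc_ids": ["user42_subjective_17_01_2013",
                  "user42_impactfactor_17_01_2013"] }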

You can probably use the update_seq (available in the json properties
of http://couch:5984/dbname/ ) as a checkpoint of where you are / were
up to, but I can think of a few corner cases where this might come
back and haunt you later. Anybody else want to comment?

3. Do something (handwavey) with the _changes feed to ensure each
document only needs to be processed/filtered once, and use this to
avoid an on-disk view. You'll then need to keep a list somewhere of the
documents that have been sent to a specific client, so I'm not sure
that 1 or 2 aren't already better. But maybe your specific use case can
work with this more easily.

> Each day, the Android client application collects two kinds of data:
> subjective and objective. Subjective data are entered manually by patients;
> objective data are gathered from the phone's sensors.
>
> Subjective and Objective data are stored in their own couch documents, and
> have IDs that include the user's ID, the document type, and the date in a
> "DD_MM_YYYY" format. They are replicated once a day by placing replication
> docs in the "_replicator" database.

That sounds like overkill for a single document id. Ideally you keep
your doc ids short (as they're used everywhere as, well, ids) and put
the extra info into separate fields within the document. You can
easily create a view to reconstruct that same data format from the
document if absolutely required.
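
For example (a sketch, with illustrative field names):

    // Short opaque _id; the metadata lives in ordinary fields.
    { "_id": "8f72c1", "userId": "user42", "type": "subjective",
      "date": "17_01_2013", "mood": 3 }

    // A view can rebuild the old composite key if anything needs it:
    function (doc) {
        emit(doc.userId + "_" + doc.type + "_" + doc.date, null);
    }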

> Once replicated to the server, these documents are...
>     1). Used as input to a data mining algorithm.
>     2). Displayed on a web page. (Users can see their own data, and
> clinicians can see the data for all users.)
>
> The data mining algorithm produces a new CouchDB document for each user
> every day, which we call an "Impact Factor" document. (It looks at each
> user's historical objective and subjective data, and looks for
> correlations.)

Cool! This sounds impressive.

> Replication: Replication takes place from client to server, and from server
> to client.
>     1). Client to server: This seems to be working fine.
>     2). Server to client: This is what's broken.
>
> Two things have to be replicated from server to client.
>     1). Each user's subjective data for the past 14 days.
>     2). Each user's Impact Factor document for the current day.
>
> Since all user documents are stored in the same database, we use filtered
> replication to send the right docs to the right users.

From a privacy perspective, I would always use a per-user server-side
DB as the replication endpoint.

A Wise Man once said "always plan that all your data gets replicated,
everywhere". It only takes one slipup to share confidential
information across patients/customers.

> The problem is that this filter function takes too long (>10 minutes).
>     1). To test whether the filter function was crashing, I replicated the
> entire DB to another, unloaded machine, and it seems to run just fine.
> (Well, it takes about 2.5 minutes, but it doesn't crash.)
>     2). I've tried re-writing the filter function in Erlang, but haven't
> managed to get it working.
>
> And besides, I suspect that the way the DB is structured is just not suited
> to the job.
>
> So, to summarize...
>     - Android client phones produce new CouchDB docs and replicate them to
> the server.
>     - One central CouchDB holds all users.
>     - Both individual and group data are served to web pages.
>     - A data mining algorithm processes this data on a per-user basis.
>     - Subjective data and Impact Factor data documents are replicated from
> the server to each client phone.
>
> Is there a way to structure the DB so that users can replicate without the
> need for filters, but which preserves the ability of clinicians to see an
> overview of all users? (It's my understanding that views can't be run
> *across* databases.)

In summary, either turn the N^2 filter problem into an O(N)
pre-calculated view, or use a per-user DB.

And ideally do both, if disk & other constraints are feasible.

> Well, as before, any suggestions or pointers would be much appreciated.
>
> Cheers,
> Tim.

A+
Dave

Re: Refactoring a CouchDB.

Posted by Jens Alfke <je...@couchbase.com>.
On Jan 17, 2013, at 4:08 AM, Tim Hankins <ti...@gmail.com> wrote:

> Two things have to be replicated from server to client.
>    1). Each user's subjective data for the past 14 days.
>    2). Each user's Impact Factor document for the current day.
...
> The problem is that this filter function takes too long (>10 minutes).

Sounds like you might be bottlenecked on date parsing; this is surprisingly expensive. If you’re storing dates as strings, the filter function to implement (1) is going to have to parse each one and compare it against the current time. You might be able to speed up the filter a lot by storing dates in the database as numeric timestamps, in which case the filter’s test becomes a simple subtraction.
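
A sketch of what the filter's test could shrink to, assuming the docs
carried a numeric Unix-epoch field (the name "ts" is made up here):

    function (doc, req) {
        // 14 days in seconds; no per-document string parsing.
        var cutoff = new Date().getTime() / 1000 - 14 * 86400;
        return doc.ts >= cutoff;
    }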

Or you could just skip the time-based filtering entirely. On the initial replication you'd end up downloading all the user’s subjective data and Impact Factor documents, but thereafter only new ones would be downloaded. This might be worthwhile if the surplus data isn’t too large. (After the first sync your mobile app can purge — not delete — the older docs from its local database to save room.)

—Jens

Re: Refactoring a CouchDB.

Posted by svilen <az...@svilendobrev.com>.
You could have an additional intermediate database per user, replicate it
whole to/from the phone, and filter-replicate it to/from the whole database.
(Thus the filtered replication happens on the server.)
Then eventually you could swap the roles of which database is the
main/leading one: the whole DB or the per-user pieces.

This might be the fix with the least possible change/impact.
ciao
svilen
