Posted to user@couchdb.apache.org by Henri van den Bulk <hv...@gmail.com> on 2011/11/02 17:20:07 UTC

Reconciling Data

I'm in the process of writing an external program in Java that reconciles data in CouchDB to a source system. One of the basic parts is to determine what data needs to be removed from CouchDB. The good thing is that the Ids in CouchDB are the same as the Ids in the source system. However, some initial tests suggest that the process is very slow in determining what needs to be removed.

Basically, here are the steps that I'm using:
1. Get all ids from the source system
2. Get all ids from CouchDB using _all_docs and paging with the fast paging approach (e.g. startkey_docid and limit)
3. Loop through the ids from Couch to see if they are not in the source id list
4. Use modify_docs to delete
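
One thing worth checking in step 3: if the source ids are held in a list and each Couch id is looked up with a linear scan, the comparison is quadratic and could account for much of the slowness. A minimal sketch of the same NOT IN check against a HashSet follows, assuming all source ids fit in memory. The class and method names (StaleIdFinder, findStaleIds) are made up for illustration and are not part of any CouchDB client library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of CouchDB or any client library.
class StaleIdFinder {

    /**
     * Returns the ids present in CouchDB but absent from the source system.
     * couchIds can be fed one page of _all_docs results at a time.
     */
    static List<String> findStaleIds(Set<String> sourceIds, Iterable<String> couchIds) {
        List<String> stale = new ArrayList<>();
        for (String id : couchIds) {
            // HashSet.contains is constant time on average,
            // so the whole scan stays linear in the number of Couch ids.
            if (!sourceIds.contains(id)) {
                stale.add(id);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Set<String> sourceIds = new HashSet<>(Arrays.asList("a", "b", "d"));
        List<String> couchIds = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(findStaleIds(sourceIds, couchIds)); // prints [c, e]
    }
}
```

Because the method takes an Iterable, it can be called once per page, so only the source id set needs to stay in memory.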

The basic logic is a NOT IN, as in SQL. However, I'm trying to determine whether there is a faster way of doing this directly in CouchDB. For example, how might we use the MapReduce (view) capability to perform this delete? Or any other thoughts on syncing data with CouchDB in the fastest way possible?

Oh, NOTE: we cannot delete the whole db first, as we have mobile clients that use _changes and are bandwidth constrained.

Thanks,

Henri

Re: Reconciling Data

Posted by Henri van den Bulk <hv...@gmail.com>.
Thanks for the tip - we'll give it a shot.

On Nov 2, 2011, at 4:29 PM, Jens Alfke wrote:

> 
> On Nov 2, 2011, at 2:07 PM, Henri van den Bulk wrote:
> 
>> Unfortunately the source system does not keep track of changes, so there is no way of knowing what the changes were. We've toyed with putting the data in an intermediate database, on which we could then run SQL queries such as NOT IN and hash comparisons. However, this seems to defeat the purpose of having Couch.
> 
> If you get the doc IDs from the source system, sort them and write them one per line to a simple text file, you can then efficiently compare two versions of that file (by reading them line by line in parallel) to find which IDs have been added or removed in the interim. You could practically do this with a shell script (using the sort and diff tools).
> 
> —Jens


Re: Reconciling Data

Posted by Jens Alfke <je...@couchbase.com>.
On Nov 2, 2011, at 2:07 PM, Henri van den Bulk wrote:

> Unfortunately the source system does not keep track of changes, so there is no way of knowing what the changes were. We've toyed with putting the data in an intermediate database, on which we could then run SQL queries such as NOT IN and hash comparisons. However, this seems to defeat the purpose of having Couch.

If you get the doc IDs from the source system, sort them and write them one per line to a simple text file, you can then efficiently compare two versions of that file (by reading them line by line in parallel) to find which IDs have been added or removed in the interim. You could practically do this with a shell script (using the sort and diff tools).

—Jens
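
The line-by-line comparison of two sorted id files described above can also be done from Java rather than a shell script. The following is a rough sketch, assuming both inputs hold one id per line and were sorted with the same ordering as String.compareTo (a locale-aware sort would not match). The class name SortedIdDiff and its method are illustrative only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: walks two sorted id listings in parallel.
class SortedIdDiff {

    /**
     * Reads both sorted inputs once, collecting ids that appear only in
     * 'previous' (removed) or only in 'current' (added).
     */
    static void diff(BufferedReader previous, BufferedReader current,
                     List<String> removed, List<String> added) throws IOException {
        String p = previous.readLine();
        String c = current.readLine();
        while (p != null || c != null) {
            // An exhausted input sorts after everything in the other one.
            int cmp = (p == null) ? 1 : (c == null) ? -1 : p.compareTo(c);
            if (cmp == 0) {            // id in both snapshots: unchanged
                p = previous.readLine();
                c = current.readLine();
            } else if (cmp < 0) {      // id only in the older snapshot
                removed.add(p);
                p = previous.readLine();
            } else {                   // id only in the newer snapshot
                added.add(c);
                c = current.readLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> removed = new ArrayList<>();
        List<String> added = new ArrayList<>();
        diff(new BufferedReader(new StringReader("a\nb\nc\n")),
             new BufferedReader(new StringReader("b\nc\nd\n")),
             removed, added);
        System.out.println("removed=" + removed + " added=" + added);
        // prints removed=[a] added=[d]
    }
}
```

In practice the two readers would wrap files (for example via FileReader), so neither id list has to be loaded into memory in full.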

Re: Reconciling Data

Posted by Henri van den Bulk <hv...@gmail.com>.
Jens - 

Unfortunately the source system does not keep track of changes, so there is no way of knowing what the changes were. We've toyed with putting the data in an intermediate database, on which we could then run SQL queries such as NOT IN and hash comparisons. However, this seems to defeat the purpose of having Couch.

Would it be possible to somehow post the ids to couch, like with filters using the request, and then compare them?

Thanks,

Henri
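
On the question of posting ids to Couch: _all_docs does accept a POST whose JSON body contains a "keys" array, and it returns one row per requested key, with an error marker on keys that have no document. Note that this answers the opposite question (which source ids are missing from CouchDB) rather than which Couch docs to delete. The sketch below only builds the request body. The class name AllDocsKeysBody is invented for illustration, and a real program would more likely use a JSON library than hand-rolled escaping.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the JSON body for POST /{db}/_all_docs.
class AllDocsKeysBody {

    /** Builds {"keys":["id1","id2",...]} from a batch of ids. */
    static String build(List<String> ids) {
        StringBuilder sb = new StringBuilder("{\"keys\":[");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(escape(ids.get(i))).append('"');
        }
        return sb.append("]}").toString();
    }

    /** Minimal JSON string escaping: quotes, backslashes, control chars. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            if (ch == '"' || ch == '\\') {
                sb.append('\\').append(ch);
            } else if (ch < 0x20) {
                sb.append(String.format("\\u%04x", (int) ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("doc-1", "doc-2")));
        // prints {"keys":["doc-1","doc-2"]}
    }
}
```

Sending the ids in batches rather than all at once would keep each request body to a manageable size.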

On Nov 2, 2011, at 1:49 PM, Jens Alfke wrote:

> 
> On Nov 2, 2011, at 9:20 AM, Henri van den Bulk wrote:
> 
>> I'm in the process of writing an external program in Java that reconciles data in CouchDB to a source system. One of the basic parts is to determine what data needs to be removed from CouchDB. The good thing is that the Ids in CouchDB are the same as the Ids in the source system. However, some initial tests suggest that the process is very slow in determining what needs to be removed.
> 
> It’s inevitably going to be inefficient to have to fetch _all_ the doc IDs from CouchDB and _all_ the IDs from the other system and then scan them looking for differences.
> 
> Instead, is there a way you can query the other system to find out which docs have been removed since the last time you synced?
> 
> —Jens


Re: Reconciling Data

Posted by Ryan Ramage <ry...@gmail.com>.
You might want to make your other data source more sleep-y (no joke):

http://syncable.org/





On Wed, Nov 2, 2011 at 1:49 PM, Jens Alfke <je...@couchbase.com> wrote:
>
> On Nov 2, 2011, at 9:20 AM, Henri van den Bulk wrote:
>
>> I'm in the process of writing an external program in Java that reconciles data in CouchDB to a source system. One of the basic parts is to determine what data needs to be removed from CouchDB. The good thing is that the Ids in CouchDB are the same as the Ids in the source system. However, some initial tests suggest that the process is very slow in determining what needs to be removed.
>
> It’s inevitably going to be inefficient to have to fetch _all_ the doc IDs from CouchDB and _all_ the IDs from the other system and then scan them looking for differences.
>
> Instead, is there a way you can query the other system to find out which docs have been removed since the last time you synced?
>
> —Jens



-- 
Twitter: @eckoit

Re: Reconciling Data

Posted by Jens Alfke <je...@couchbase.com>.
On Nov 2, 2011, at 9:20 AM, Henri van den Bulk wrote:

> I'm in the process of writing an external program in Java that reconciles data in CouchDB to a source system. One of the basic parts is to determine what data needs to be removed from CouchDB. The good thing is that the Ids in CouchDB are the same as the Ids in the source system. However, some initial tests suggest that the process is very slow in determining what needs to be removed.

It’s inevitably going to be inefficient to have to fetch _all_ the doc IDs from CouchDB and _all_ the IDs from the other system and then scan them looking for differences.

Instead, is there a way you can query the other system to find out which docs have been removed since the last time you synced?

—Jens