Posted to dev@couchdb.apache.org by Chris Anderson <jc...@apache.org> on 2009/03/11 00:06:08 UTC

Re: rep_security merge to trunk

On Tue, Mar 10, 2009 at 3:44 PM, Damien Katz <da...@apache.org> wrote:
>
> This patch breaks the file format and replication API, so replication with
> earlier versions is not possible.

The rev format has changed. Does this mean that migrating existing
data will involve getting each doc from oldDB, stripping the _rev, and
loading it into newDB?

It should be pretty straightforward to write a Python or Ruby script
that does this in bulk to transfer docs. It's essentially a version of
the python dump / load tools that doesn't require putting the whole db
on disk as an intermediary.

I'll volunteer but I wonder how I should handle docs with conflicts in
the oldDB?

Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: rep_security merge to trunk

Posted by Jan Lehnardt <ja...@apache.org>.
On 15 Mar 2009, at 04:35, Chris Anderson wrote:

> On Wed, Mar 11, 2009 at 4:51 PM, Damien Katz <da...@apache.org> wrote:
>> For importing existing docs, I think you could just use the
>> all_or_nothing:true option and save the multiple copies of the same
>> documents and they'll all be saved, and you don't have to worry about the
>> _revisions stuff.
>>
>
> I've posted a script that copies between two running CouchDB
> instances. I'm using the all_or_nothing option. It does attachments
> inline using base64 encoding because it mostly works. I think if you
> have attachments so big that they can't be buffered, you probably want
> to avoid bulk docs anyway. If anyone desperately needs such a script
> you might be able to convince me to modify what I've written.
>
> Blog post with script and instructions here:
>
> http://jchrisa.net/drl/_design/sofa/_show/post/Upgrading%20CouchDB%20databases%20to%20trunk


http://wiki.apache.org/couchdb/BreakingChangesUpdateTrunkTo0Dot9

Please help expand the page.

Cheers
Jan
--


Re: rep_security merge to trunk

Posted by Jan Lehnardt <ja...@apache.org>.
On 15 Mar 2009, at 04:35, Chris Anderson wrote:

> On Wed, Mar 11, 2009 at 4:51 PM, Damien Katz <da...@apache.org> wrote:
>> For importing existing docs, I think you could just use the
>> all_or_nothing:true option and save the multiple copies of the same
>> documents and they'll all be saved, and you don't have to worry about the
>> _revisions stuff.
>>
>
> I've posted a script that copies between two running CouchDB
> instances. I'm using the all_or_nothing option. It does attachments
> inline using base64 encoding because it mostly works. I think if you
> have attachments so big that they can't be buffered, you probably want
> to avoid bulk docs anyway. If anyone desperately needs such a script
> you might be able to convince me to modify what I've written.
>
> Blog post with script and instructions here:
>
> http://jchrisa.net/drl/_design/sofa/_show/post/Upgrading%20CouchDB%20databases%20to%20trunk
>

Hi Chris, great work, thanks! Would it make sense to add the blog post
& script to the CouchDB wiki? I'd like to add a few notes and your blog
is still read-only :)

Cheers
Jan
--



Re: rep_security merge to trunk

Posted by Chris Anderson <jc...@apache.org>.
On Sun, Mar 15, 2009 at 7:41 AM, Chris Anderson <jc...@apache.org> wrote:
> On Sun, Mar 15, 2009 at 6:56 AM, Jeff Hinrichs - DM&T
> <du...@gmail.com> wrote:
>> On Sat, Mar 14, 2009 at 10:35 PM, Chris Anderson <jc...@apache.org> wrote:
>>> On Wed, Mar 11, 2009 at 4:51 PM, Damien Katz <da...@apache.org> wrote:
>>>> For importing existing docs, I think you could just use the
>>>> all_or_nothing:true option and save the multiple copies of the same
>>>> documents and they'll all be saved, and you don't have to worry about the
>>>> _revisions stuff.
>>>>
>>>
>>> I've posted a script that copies between two running CouchDB
>>> instances. I'm using the all_or_nothing option. It does attachments
>>> inline using base64 encoding because it mostly works. I think if you
>>> have attachments so big that they can't be buffered, you probably want
>>> to avoid bulk docs anyway. If anyone desperately needs such a script
>>> you might be able to convince me to modify what I've written.
>>>
>>> Blog post with script and instructions here:
>>>
>>> http://jchrisa.net/drl/_design/sofa/_show/post/Upgrading%20CouchDB%20databases%20to%20trunk
>>>
>>
>> Chris,
>> Does this migrate conflicted documents or does it ignore them?
>>
>
> Yes, it migrates conflicts. It requests each document with
>
> GET /db/docid?open_revs=all&attachments=true
>
> which gives a copy of each doc rev leaf node (that is the head rev and
> any conflict revs).
>
> Once I figured out that the same request works for conflicted and
> normal docs, the script got much simpler.
>

I forgot to mention that it just strips the _rev from the original
documents, so in the case of conflicts the winning rev could change.

If this is unacceptable for someone's application it should be possible to fix.
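
A minimal sketch of the stripping step, assuming each leaf revision has already been fetched and parsed into a Python dict. The helper name strip_rev and the sample documents are hypothetical and are not taken from the posted script.

```python
def strip_rev(doc):
    """Return a copy of doc without its _rev, leaving the original untouched."""
    return {key: value for key, value in doc.items() if key != "_rev"}


# Two conflicting leaf revisions of the same document (made-up values).
leaves = [
    {"_id": "foo", "_rev": "123", "title": "a"},
    {"_id": "foo", "_rev": "456", "title": "b"},
]
stripped = [strip_rev(doc) for doc in leaves]
```

Because the old revs are discarded, the target database assigns fresh ones, which is why the winner among conflicting revisions may differ from the source.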

-- 
Chris Anderson
http://jchris.mfdz.com

Re: rep_security merge to trunk

Posted by Chris Anderson <jc...@apache.org>.
On Sun, Mar 15, 2009 at 6:56 AM, Jeff Hinrichs - DM&T
<du...@gmail.com> wrote:
> On Sat, Mar 14, 2009 at 10:35 PM, Chris Anderson <jc...@apache.org> wrote:
>> On Wed, Mar 11, 2009 at 4:51 PM, Damien Katz <da...@apache.org> wrote:
>>> For importing existing docs, I think you could just use the
>>> all_or_nothing:true option and save the multiple copies of the same
>>> documents and they'll all be saved, and you don't have to worry about the
>>> _revisions stuff.
>>>
>>
>> I've posted a script that copies between two running CouchDB
>> instances. I'm using the all_or_nothing option. It does attachments
>> inline using base64 encoding because it mostly works. I think if you
>> have attachments so big that they can't be buffered, you probably want
>> to avoid bulk docs anyway. If anyone desperately needs such a script
>> you might be able to convince me to modify what I've written.
>>
>> Blog post with script and instructions here:
>>
>> http://jchrisa.net/drl/_design/sofa/_show/post/Upgrading%20CouchDB%20databases%20to%20trunk
>>
>
> Chris,
> Does this migrate conflicted documents or does it ignore them?
>

Yes, it migrates conflicts. It requests each document with

GET /db/docid?open_revs=all&attachments=true

which gives a copy of each doc rev leaf node (that is the head rev and
any conflict revs).

Once I figured out that the same request works for conflicted and
normal docs, the script got much simpler.
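
As a rough illustration of that request, here is a sketch that builds the URL and unpacks a response. The base URL, the helper names, and the sample body are assumptions; in particular it assumes the server answers with a JSON array of {"ok": doc} entries, which may depend on the CouchDB version and the Accept header sent.

```python
import json
from urllib.parse import quote, urlencode


def open_revs_url(base, db, docid):
    """Build the URL that asks for every leaf revision with attachments inline."""
    query = urlencode({"open_revs": "all", "attachments": "true"})
    return f"{base}/{quote(db, safe='')}/{quote(docid, safe='')}?{query}"


def leaf_docs(body):
    """Pull the documents out of an open_revs response body.

    Assumes a JSON array whose entries look like {"ok": {...doc...}};
    entries without an "ok" key are skipped.
    """
    return [entry["ok"] for entry in json.loads(body) if "ok" in entry]


url = open_revs_url("http://localhost:5984", "db", "docid")

# Hypothetical response for a document with one conflict.
sample = '[{"ok": {"_id": "docid", "_rev": "1-a"}}, {"ok": {"_id": "docid", "_rev": "1-b"}}]'
docs = leaf_docs(sample)
```

A normal document simply yields a one-element list, so the calling code does not need a separate path for conflicts.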

Jan, what about blog comments? ;)

-- 
Chris Anderson
http://jchris.mfdz.com

Re: rep_security merge to trunk

Posted by Jeff Hinrichs - DM&T <du...@gmail.com>.
On Sat, Mar 14, 2009 at 10:35 PM, Chris Anderson <jc...@apache.org> wrote:
> On Wed, Mar 11, 2009 at 4:51 PM, Damien Katz <da...@apache.org> wrote:
>> For importing existing docs, I think you could just use the
>> all_or_nothing:true option and save the multiple copies of the same
>> documents and they'll all be saved, and you don't have to worry about the
>> _revisions stuff.
>>
>
> I've posted a script that copies between two running CouchDB
> instances. I'm using the all_or_nothing option. It does attachments
> inline using base64 encoding because it mostly works. I think if you
> have attachments so big that they can't be buffered, you probably want
> to avoid bulk docs anyway. If anyone desperately needs such a script
> you might be able to convince me to modify what I've written.
>
> Blog post with script and instructions here:
>
> http://jchrisa.net/drl/_design/sofa/_show/post/Upgrading%20CouchDB%20databases%20to%20trunk
>

Chris,
Does this migrate conflicted documents or does it ignore them?

Regards,

Jeff Hinrichs

Re: rep_security merge to trunk

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Mar 11, 2009 at 4:51 PM, Damien Katz <da...@apache.org> wrote:
> For importing existing docs, I think you could just use the
> all_or_nothing:true option and save the multiple copies of the same
> documents and they'll all be saved, and you don't have to worry about the
> _revisions stuff.
>

I've posted a script that copies between two running CouchDB
instances. I'm using the all_or_nothing option. It does attachments
inline using base64 encoding because it mostly works. I think if you
have attachments so big that they can't be buffered, you probably want
to avoid bulk docs anyway. If anyone desperately needs such a script
you might be able to convince me to modify what I've written.

Blog post with script and instructions here:

http://jchrisa.net/drl/_design/sofa/_show/post/Upgrading%20CouchDB%20databases%20to%20trunk

Chris

-- 
Chris Anderson
http://jchris.mfdz.com

Re: rep_security merge to trunk

Posted by Damien Katz <da...@apache.org>.
On Mar 11, 2009, at 7:07 PM, Chris Anderson wrote:

> On Wed, Mar 11, 2009 at 8:34 AM, Damien Katz <da...@apache.org> wrote:
>>
>> On Mar 10, 2009, at 7:06 PM, Chris Anderson wrote:
>>
>>> On Tue, Mar 10, 2009 at 3:44 PM, Damien Katz <da...@apache.org> wrote:
>>>>
>>>> This patch breaks the file format and replication API, so replication
>>>> with earlier versions is not possible.
>>>
>>> The rev format has changed. Does this mean that migrating existing
>>> data will involve getting each doc from oldDB, stripping the _rev, and
>>> loading it into newDB?
>>
>> Yes, but it should be possible to convert the revs to the new format too.
>> But why?
>>
>>>
>>> It should be pretty straightforward to write a Python or Ruby script
>>> that does this in bulk to transfer docs. It's essentially a version of
>>> the python dump / load tools that doesn't require putting the whole db
>>> on disk as an intermediary.
>>>
>>> I'll volunteer but I wonder how I should handle docs with conflicts in
>>> the oldDB?
>>
>> Oh that's why. Using the replicator API would work for that.
>>
>
> A little confused as to the plan here. Let me try to articulate:
>
> Write a script that pulls all_docs_by_seq from the old version of
> CouchDB in batches of 1000, and for each doc loads the head rev (and
> any conflict revs) into memory.
>
> Then it creates a bulk_docs POST for those docs, by stripping the rev
> from any docs that don't have conflicts, and any docs that have
> conflicts, creating a series of revs like this (pretend there are 199
> conflict revs)
>
> 1-sdfjhgsaf
> 2-asdfkjsad
> ..
> 199-asdf7tsfd
>
> and applying the revs to each doc in the conflict set. Does the rev
> ordering matter? Assuming I don't reuse the prefix number, does the
> format/length of the second rev part matter?
>
> Then using a normal POST of an object like {"docs":[...array of
> docs...]} to the /db/_bulk_docs URL (with no special query option),
> the new docs (and conflict revs) will get stored in the new DB?
>
> Or do I need to assign well-formed made up revs to the non-conflicting
> docs (they'd all get "1-foobar") and use the ?new_edits=false option
> on the bulk_docs POST ?



To use new_edits=false, you have to specify a rev history in a doc
_revisions property, like this:
{"new_edits": false,
  "docs": [
     {"_id": "foo", "_revisions": {"start": 2, "ids": ["133457546", "475133454"]}}
     ]}

The ids are the rev ids without the leading offset; they are sent this
way for efficiency. Converting to regular revs, they would look like
"2-133457546" and "1-475133454".
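
The conversion to regular revs can be sketched as follows. The function name expand_revisions is hypothetical; the input is the _revisions value from the example above.

```python
def expand_revisions(revisions):
    """Turn a _revisions object into full rev strings, newest first."""
    start = revisions["start"]
    return [f"{start - i}-{rev_id}" for i, rev_id in enumerate(revisions["ids"])]


revs = expand_revisions({"start": 2, "ids": ["133457546", "475133454"]})
print(revs)  # ['2-133457546', '1-475133454']
```

The offset counts down from start because the ids are listed newest first.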

For importing existing docs, I think you could just use the  
all_or_nothing:true option and save the multiple copies of the same  
documents and they'll all be saved, and you don't have to worry about  
the _revisions stuff.
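
A minimal sketch of such a request body, assuming the option is sent as an all_or_nothing field next to docs in the JSON that gets POSTed to /db/_bulk_docs. The helper name and the sample documents are hypothetical.

```python
import json


def bulk_docs_body(docs):
    """Serialize docs as a _bulk_docs request body with all_or_nothing set."""
    return json.dumps({"all_or_nothing": True, "docs": docs})


# Two copies of the same document, as they might come from a conflicted source.
body = bulk_docs_body([
    {"_id": "foo", "title": "a"},
    {"_id": "foo", "title": "b"},
])
```

Neither copy carries a _rev or a _revisions property, which is the simplification suggested here.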

-Damien

>
> I think getting this clear on the list will help everyone's
> understanding of the new bulk_docs semantics. (I don't plan to include
> in my migrator the ability to transfer any docs which would be lost on
> the source DB during compaction... only the HEAD rev and any conflicts
> will be transferred.)
>
> Chris
>
> ps I tagged trunk as bulk_transactions (maybe coulda picked a better
> name) so we have a record of the last point of 0.9 development that
> had the old semantics. Please don't use this tag.
>
> -- 
> Chris Anderson
> http://jchris.mfdz.com


Re: rep_security merge to trunk

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Mar 11, 2009 at 8:34 AM, Damien Katz <da...@apache.org> wrote:
>
> On Mar 10, 2009, at 7:06 PM, Chris Anderson wrote:
>
>> On Tue, Mar 10, 2009 at 3:44 PM, Damien Katz <da...@apache.org> wrote:
>>>
>>> This patch breaks the file format and replication API, so replication
>>> with earlier versions is not possible.
>>
>> The rev format has changed. Does this mean that migrating existing
>> data will involve getting each doc from oldDB, stripping the _rev, and
>> loading it into newDB?
>
> Yes, but it should be possible to convert the revs to the new format too.
> But why?
>
>>
>> It should be pretty straightforward to write a Python or Ruby script
>> that does this in bulk to transfer docs. It's essentially a version of
>> the python dump / load tools that doesn't require putting the whole db
>> on disk as an intermediary.
>>
>> I'll volunteer but I wonder how I should handle docs with conflicts in
>> the oldDB?
>
> Oh that's why. Using the replicator API would work for that.
>

A little confused as to the plan here. Let me try to articulate:

Write a script that pulls all_docs_by_seq from the old version of
CouchDB in batches of 1000, and for each doc loads the head rev (and
any conflict revs) into memory.
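
The batching part of this plan could look roughly like the sketch below. It is independent of CouchDB: the helper name batches is hypothetical, and range(2500) stands in for the rows read from the old database.

```python
from itertools import islice


def batches(items, size=1000):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


sizes = [len(chunk) for chunk in batches(range(2500))]
print(sizes)  # [1000, 1000, 500]
```

Working from an iterator keeps only one batch in memory at a time, which matters if the source database is large.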

Then it creates a bulk_docs POST for those docs, by stripping the rev
from any docs that don't have conflicts, and any docs that have
conflicts, creating a series of revs like this (pretend there are 199
conflict revs)

1-sdfjhgsaf
2-asdfkjsad
..
199-asdf7tsfd

and applying the revs to each doc in the conflict set. Does the rev
ordering matter? Assuming I don't reuse the prefix number, does the
format/length of the second rev part matter?

Then using a normal POST of an object like {"docs":[...array of
docs...]} to the /db/_bulk_docs URL (with no special query option),
the new docs (and conflict revs) will get stored in the new DB?

Or do I need to assign well-formed made up revs to the non-conflicting
docs (they'd all get "1-foobar") and use the ?new_edits=false option
on the bulk_docs POST ?

I think getting this clear on the list will help everyone's
understanding of the new bulk_docs semantics. (I don't plan to include
in my migrator the ability to transfer any docs which would be lost on
the source DB during compaction... only the HEAD rev and any conflicts
will be transferred.)

Chris

ps I tagged trunk as bulk_transactions (maybe coulda picked a better
name) so we have a record of the last point of 0.9 development that
had the old semantics. Please don't use this tag.

-- 
Chris Anderson
http://jchris.mfdz.com

Re: rep_security merge to trunk

Posted by Damien Katz <da...@apache.org>.
On Mar 10, 2009, at 7:06 PM, Chris Anderson wrote:

> On Tue, Mar 10, 2009 at 3:44 PM, Damien Katz <da...@apache.org> wrote:
>>
>> This patch breaks the file format and replication API, so replication with
>> earlier versions is not possible.
>
> The rev format has changed. Does this mean that migrating existing
> data will involve getting each doc from oldDB, stripping the _rev, and
> loading it into newDB?

Yes, but it should be possible to convert the revs to the new format  
too. But why?

>
> It should be pretty straightforward to write a Python or Ruby script
> that does this in bulk to transfer docs. It's essentially a version of
> the python dump / load tools that doesn't require putting the whole db
> on disk as an intermediary.
>
> I'll volunteer but I wonder how I should handle docs with conflicts in
> the oldDB?

Oh that's why. Using the replicator API would work for that.

-Damien