Posted to user@couchdb.apache.org by Joshua Bronson <ja...@gmail.com> on 2009/06/17 18:51:24 UTC

Re: Multipart MIME in dump tool

To revive an old thread:
I needed a script to dump a large (>30G) CouchDB database nightly for
backup purposes, while CouchDB is running, and noticed couchdb-python
issue 58 <http://code.google.com/p/couchdb-python/issues/detail?id=58>:
couchdb-dump fails if the dump file is larger than memory. I looked at the
scripts attached in the comments and realized that my needs were similar
but different: I just needed to stream the response to
_all_docs_by_seq?include_docs=true directly to stdout, without decoding
and re-encoding the JSON along the way. (The complementary load does have
to decode the JSON, for instance to check the "deleted":true flag, but
this is fine.) I thought I would share the script with the community in
case it's helpful to anyone else, as well as to solicit feedback:

https://svn.openplans.org/melk/util/streamcouch.py



Here is the usage:

Usage: ./streamcouch.py [dump | load] DBURL

"dump" requests "_all_docs_by_seq?include_docs=true" from DBURL
and streams the response to stdout.

"load" creates a database at DBURL with documents read from stdin
in the format output by "_all_docs_by_seq?include_docs=true".
Requires couchdb-python <http://code.google.com/p/couchdb-python/>.

Ex.
  ./streamcouch.py dump http://localhost:5984/backedup > dump
  ./streamcouch.py load http://localhost:5984/restored < dump
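
In case it clarifies what "streaming" means here, the dump side boils down
to roughly the following -- a minimal sketch, not the actual
streamcouch.py, and it assumes Python 2's httplib to match the tooling of
the time:

    import httplib
    import sys
    import urlparse

    def dump(dburl, chunk_size=64 * 1024):
        # Stream the response body straight to stdout in fixed-size
        # chunks; the JSON is never decoded, so memory use stays constant
        # no matter how large the database is.
        parts = urlparse.urlsplit(dburl)
        conn = httplib.HTTPConnection(parts.netloc)
        conn.request('GET', parts.path + '/_all_docs_by_seq?include_docs=true')
        resp = conn.getresponse()
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            sys.stdout.write(chunk)
        conn.close()

    if __name__ == '__main__':
        dump(sys.argv[1])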



I have tested a dump/load round trip locally and it worked. I should note
that none of our documents have inline attachments. There was talk earlier
in this thread that documents with attachments cannot be posted with a
_rev. It would be trivial to remove the _rev during the load; I just
haven't needed to.
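
If you did need it: the load side has to decode each row anyway (to check
the deleted flag), so stripping the _rev would only take a couple of
lines. A hypothetical sketch -- prepare_row is an invented name, not part
of streamcouch.py, and the exact location of the deleted flag in the row
is an assumption:

    import json  # simplejson on Python <= 2.5

    def prepare_row(line):
        # One row of an _all_docs_by_seq response; CouchDB emits one row
        # per line between the header and trailer lines, which the caller
        # is assumed to have skipped already.
        row = json.loads(line.rstrip().rstrip(','))
        if row.get('value', {}).get('deleted'):
            return None  # a tombstone; nothing to post
        doc = row['doc']
        if '_attachments' in doc:
            # CouchDB rejects a new document carrying both a _rev and
            # inline attachments, so drop the _rev for those.
            doc.pop('_rev', None)
        return doc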

I should also note that my next version of the script will allow you to
perform incremental backups by passing a startkey to _all_docs_by_seq. The
complementary load will be able to accept multiple responses to
_all_docs_by_seq on stdin.
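
Roughly, the incremental part is just a matter of building the query
string differently. A sketch only -- the parameter plumbing below is an
assumption about the next version, not its actual interface:

    import urllib

    def dump_path(dbpath, since=None):
        # Resume from the last update_seq a previous dump saw; with no
        # saved sequence this degenerates to a full dump.
        query = {'include_docs': 'true'}
        if since is not None:
            query['startkey'] = str(since)
        return '%s/_all_docs_by_seq?%s' % (dbpath, urllib.urlencode(query))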

Looking forward to your responses.


On Tue, Apr 7, 2009 at 8:49 AM, Matt Goodall <ma...@gmail.com> wrote:

>
> 2009/4/7 Jeff Hinrichs - DM&T <du...@gmail.com>:
> >
> >
> > On Tue, Apr 7, 2009 at 4:37 AM, Matt Goodall <ma...@gmail.com>
> wrote:
> >>
> >> 2009/4/7 Jeff Hinrichs - DM&T <du...@gmail.com>:
> >> >
> >> >
> >> > On Mon, Apr 6, 2009 at 2:03 PM, Matt Goodall <ma...@gmail.com>
> >> > wrote:
> >> >>
> >> >> 2009/4/6 Matt Goodall <ma...@gmail.com>:
> >> >> > 2009/4/5 Jeff Hinrichs - DM&T <du...@gmail.com>:
> >> >> >>> So personally, for now, I would write the dump/load tools using
> >> >> >>> plain old httplib from the standard library. It's more than
> >> >> >>> capable. The only bit that might involve decoding from JSON to
> >> >> >>> Python is removing the _rev; everything else is a matter of
> >> >> >>> streaming data from HTTP to disk and vice-versa.
> >> >> >>
> >> >> >> Although that is done currently, I don't believe that it is
> >> >> >> strictly required by couchdb. In fact I just tested and you can
> >> >> >> insert (while loading to an empty database) with the _rev property
> >> >> >> containing information. The _rev is then updated and the resulting
> >> >> >> _rev returned. So there is no need to remove the _rev when loading
> >> >> >> a dump file.
> >> >> >
> >> >> > Unfortunately, you do still have to remove the _rev for those
> >> >> > documents with inline attachments, otherwise you get a conflict
> >> >> > error from couchdb. I don't know if it's a bug to create a document
> >> >> > with a _rev or if the _rev + inline attachments is just an
> >> >> > inconsistency. I'll ask on the couchdb list later and post back
> >> >> > here.
> >> >>
> >> >> Accepting a new document containing a _rev appears to be a bug in
> >> >> CouchDB. So, it definitely needs removing which is arguably the
> >> >> correct thing to do anyway.
> >> >
> >> >
> >> >>
> >> >> - Matt
> >> >
> >> > Matt,
> >> > Saw your post and then reviewed my test scripts, only to realize that
> >> > you are correct about couchdb freaking out with _rev+attachment.
> >> > However, I don't agree with Damien about which is the bug. I see his
> >> > point of view -- he is seeing this as standard couchdb operation.
> >> >
> >> > Quite frankly, to be a proper dump/load mechanism you should be able
> >> > to dump dbA, then create a dbB, then load the dump from dbA into dbB,
> >> > and when you replicate from one to the other they should appear to be
> >> > already synchronized (no replication events occur). If a dump/load
> >> > cycle causes dbA to transform into dbA' then it is not a dump/load --
> >> > it's a fetch and insert.
> >>
> >> Ah, I think that's a different sort of dump/load than couchdb-python
> >> provides. I see couchdb-python's dump/load as a sort of
> >> snapshot/bootstrap tool. You're talking about a disconnected
> >> replication process. Both probably have their uses.
> >
> >
> > Now that you put it that way, yes. I was using replication to
> > demonstrate that a database reloaded from a dump should equal the
> > originally dumped database. Replication is inherent w/ couchdb, so if a
> > reloaded database will respond differently to replication than the
> > original, something is lost.
> >
> > Couchdb essentially uses the _id + _rev as a unique index to the data.
> > The inability to recreate that unique index is the problem that, I
> > think, needs to be corrected.
> >
> >>
> >> >
> >> > couchdb proper really needs to correct this situation. Dump and load
> >> > need to put couchdb into a different mode so that this can be
> >> > accomplished, i.e. a /database/_dump which would dump json documents
> >> > out, and then a /database/_load where you would post the contents of
> >> > the _dump to reload a database. And I mean load -- not insert.
> >>
> >> I'm sure this can already be achieved using HTTP but, AFAICT, it's not
> >> fully documented yet.
> >
> > There is a thread on couchdb-dev,
> > http://mail-archives.apache.org/mod_mbox/couchdb-dev/200904.mbox/%3C9AF648C2-0503-48E8-A335-13BAEB9B1B3D@apache.org%3E
> > that appears to be talking about such a method. However, it would
> > require the _rev as a parameter instead of part of the json document.
> >
> >>
> >> And yes, it might be nice for CouchDB to provide support for a
> >> disconnected replication mode (i.e. replicate to disk) that assumes
> >> the data will be loaded into an empty database (or at least a database
> >> that has not seen those documents before). However, wouldn't that dump
> >> basically be a compacted database?
> >
> > It would be, but currently none of the scripts (haven't looked at yours
> > yet) request conflicted revisions. I haven't written the test for this
> > yet, but my analysis is that only the current "winning" _rev of the data
> > is dumped. A dump/load cycle currently loses that information, if
> > present. It has not bitten me yet, but I can see it being a problem.
>
> My version is as lossy as the original couchdb-dump/load ... just a
> couple of orders of magnitude more efficient ;-).
>
> >
> >>
> >> >
> >> > There are times when you need to be able to dump/load a database.
> >> > Sometimes for error recovery, sometimes for debugging, and sometimes
> >> > for legal reasons, but without a proper couchdb api for it we are
> >> > whistling in the wind.
> >>
> >> That already works - just copy the .couch file for your database.
> >> CouchDB's append-only model should mean you never get a copy of a
> >> partly written database.
> >
> > You do have a point. However, that .couch file is dependent on the
> > running version of couchdb - so that would need to be backed up too.
> > Maybe I still need to shake off the dust of current RDBMSes, but all of
> > them support the idea of dump/load. Something just feels very 80s-ish
> > about backing up the actual data file.
>
> You're right, there should be a version-independent format and it
> should be part of the CouchDB distribution.
>
> Every time a .couch file format-breaking change is made you have to
> replicate to a new CouchDB server. For the 0.9 release, where the .couch
> file format *and* the replication stream format changed, not even that
> was possible, and someone (jchris, I think) wrote a script to help.
>
> >
> >>
> >> Note: you can actually perform a disconnected replication using a
> >> copy: copy a .couch file to a new CouchDB instance giving it a
> >> temporary database name, replicate from the temporary database to the
> >> real database, delete the temporary database. Not especially nice, but
> >> not too bad either. All CouchDB needs to do to improve that is to
> >> provide a way to replicate from a file on disk rather than a locally
> >> installed database.
> >
> > Agreed,  and then this whole thread becomes a bike shed. ;)
>
> Still some use, I think: the ability to bulk load a bunch of documents
> from a simple file is useful. For instance, when setting up a new
> database it's not uncommon to need to bootstrap the data in it before
> pointing a web server at it.
>
> >
> >
> > Regards,
> >
> > Jeff Hinrichs
> >>
> >> - Matt
> >>
> >>
> >
> >
> >
> > --
> > Jeff Hinrichs
> > Dundee Media & Technology, Inc
> > jeffh@dundeemt.com
> > 402.218.1473
> > web: www.dundeemt.com
> > blog: inre.dundeemt.com
> >
> > >
> >
>
>

Re: Multipart MIME in dump tool

Posted by Joshua Bronson <ja...@gmail.com>.
On Thu, Jun 18, 2009 at 4:47 AM, Nils Breunese <n....@vpro.nl> wrote:

> Joshua Bronson wrote:
>
>  I needed a script to dump a large (>30G) couchdb database on a nightly
>> basis
>> for backup purposes, to be performed while couchdb is running, (...)
>>
>
> Did you know that you can just use tools like cp to safely backup live
> CouchDB databases? Using rsync will give you an instant incremental backup
> tool.
>
> Nils Breunese.
>


Thanks for bringing this up. I was actually doing exactly that -- rsyncing
the .couch file -- before switching to JSON dumps. Here are the reasons I
switched:

  - The format of the .couch files can change from one version of couchdb
to another, so if you ever upgrade couchdb (which you probably will!),
you'll no longer be able to swap in the .couch files.

  - If the .couch file ever somehow gets corrupted, the corruption will
propagate to your backups. Nobody wants to suffer the fate of ma.gnolia
<http://corvusconsulting.ca/2009/02/ma-gnolias-bad-day/>!

  - JSON is human-readable.

  - It takes up less space, and can be further compressed to take up much
less. My 30G .couch file produced a 17G _all_docs_by_seq dump, which then
bzip2-compressed to 2.6G. (A sketch of compressing the stream on the fly
follows below.)


And now with the latest version of streamcouch.py
<https://svn.openplans.org/melk/util/streamcouch.py>, along with something
like my new wrapper script
<https://svn.openplans.org/melk/util/backupcouch>,

  - It does incremental backups too.
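
Since the dump is just a byte stream, the compression can even happen on
the fly rather than after the fact. A minimal sketch using the stdlib bz2
module -- the filter-script framing is illustrative, not part of
streamcouch.py or backupcouch:

    import bz2
    import sys

    def compress_stream(infile, outfile, chunk_size=64 * 1024):
        # Incremental bzip2 compression: never holds more than one chunk
        # of the dump in memory at a time.
        compressor = bz2.BZ2Compressor()
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break
            outfile.write(compressor.compress(chunk))
        outfile.write(compressor.flush())

    if __name__ == '__main__':
        compress_stream(sys.stdin, sys.stdout)

You'd pipe the dump through it, e.g. (bzfilter.py being whatever you save
the sketch as):

  ./streamcouch.py dump http://localhost:5984/backedup | python bzfilter.py > dump.bz2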


So far I like doing it this way a lot better. If anyone's had a chance to
give it a whirl, I'd love to hear about your experiences with it.
