Posted to dev@couchdb.apache.org by "Henrik Hofmeister (JIRA)" <ji...@apache.org> on 2011/08/08 23:48:27 UTC

[jira] [Created] (COUCHDB-1243) Compact and copy feature that resets changes

Compact and copy feature that resets changes
--------------------------------------------

                 Key: COUCHDB-1243
                 URL: https://issues.apache.org/jira/browse/COUCHDB-1243
             Project: CouchDB
          Issue Type: New Feature
          Components: Database Core
    Affects Versions: 1.1, 1.0.1
         Environment: Ubuntu, but not important
            Reporter: Henrik Hofmeister


After running db and view compaction on a 7K doc db with 6+ million changes, it takes up 0.8 GB. When copying the same documents to a new db (get and bulk insert), the same data with 7K changes takes up 40 MB. That is a huge difference. It has been verified on 2 dbs that the difference is more than 65 times the size of the data.

A "Compact and copy" feature that copies only documents and resets the changes for a db would be very nice to try to limit the disk usage a little bit. (Our current test environment takes up nearly 100 GB...)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Henrik Hofmeister (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrik Hofmeister updated COUCHDB-1243:
---------------------------------------

    Description: 
After running db and view compaction on a 70K doc db with 6+ million changes, it takes up 0.8 GB. When copying the same documents to a new db (get and bulk insert), the same data with 70K changes (only the inserts) takes up 40 MB. That is a huge difference. It has been verified on 2 dbs that the difference is more than 65 times the size of the data.

A "Compact and copy" feature that copies only documents and resets the changes for a db would be very nice to try to limit the disk usage a little bit. (Our current test environment takes up nearly 100 GB...)

I've attached the dump load php script for your convenience.

  was:
After running db and view compaction on a 7K doc db with 6+ million changes, it takes up 0.8 GB. When copying the same documents to a new db (get and bulk insert), the same data with 7K changes takes up 40 MB. That is a huge difference. It has been verified on 2 dbs that the difference is more than 65 times the size of the data.

A "Compact and copy" feature that copies only documents and resets the changes for a db would be very nice to try to limit the disk usage a little bit. (Our current test environment takes up nearly 100 GB...)

I've attached the dump load php script for your convenience.


> Compact and copy feature that resets changes
> --------------------------------------------
>
>                 Key: COUCHDB-1243
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1243
>             Project: CouchDB
>          Issue Type: New Feature
>          Components: Database Core
>    Affects Versions: 1.0.1, 1.1
>         Environment: Ubuntu, but not important
>            Reporter: Henrik Hofmeister
>              Labels: cleanup, compaction
>         Attachments: dump_load.php
>
>
> After running db and view compaction on a 70K doc db with 6+ million changes, it takes up 0.8 GB. When copying the same documents to a new db (get and bulk insert), the same data with 70K changes (only the inserts) takes up 40 MB. That is a huge difference. It has been verified on 2 dbs that the difference is more than 65 times the size of the data.
> A "Compact and copy" feature that copies only documents and resets the changes for a db would be very nice to try to limit the disk usage a little bit. (Our current test environment takes up nearly 100 GB...)
> I've attached the dump load php script for your convenience.


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Robert Newson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081235#comment-13081235 ] 

Robert Newson commented on COUCHDB-1243:
----------------------------------------

The difference is the removal of all the _deleted stubs which are critical to the correct operation of CouchDB replication. As such, I'm -1 on the idea.

That said, I'd be +1 on adding /db/_export and /db/import entry points that make backup/restore trivial.



        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Paul Joseph Davis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081527#comment-13081527 ] 

Paul Joseph Davis commented on COUCHDB-1243:
--------------------------------------------

The oops scenario is important, but the motivating use case as I always heard it was if you wanted to rebalance doc information across shards in a cluster.


        

[jira] [Updated] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Henrik Hofmeister (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrik Hofmeister updated COUCHDB-1243:
---------------------------------------

    Description: 
After running db and view compaction on a 7K doc db with 6+ million changes, it takes up 0.8 GB. When copying the same documents to a new db (get and bulk insert), the same data with 7K changes takes up 40 MB. That is a huge difference. It has been verified on 2 dbs that the difference is more than 65 times the size of the data.

A "Compact and copy" feature that copies only documents and resets the changes for a db would be very nice to try to limit the disk usage a little bit. (Our current test environment takes up nearly 100 GB...)

I've attached the dump load php script for your convenience.

  was:
After running db and view compaction on a 7K doc db with 6+ million changes, it takes up 0.8 GB. When copying the same documents to a new db (get and bulk insert), the same data with 7K changes takes up 40 MB. That is a huge difference. It has been verified on 2 dbs that the difference is more than 65 times the size of the data.

A "Compact and copy" feature that copies only documents and resets the changes for a db would be very nice to try to limit the disk usage a little bit. (Our current test environment takes up nearly 100 GB...)



        

[jira] [Updated] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Henrik Hofmeister (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrik Hofmeister updated COUCHDB-1243:
---------------------------------------

    Attachment: dump_load.php

Dump/load script; requires PHP and curl to be installed.

Takes 2 arguments: the host (including http, port, and trailing /) and the source db name.
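
The attached dump_load.php is not reproduced in this thread. As a rough illustration of the "get and bulk insert" copy described in the issue, here is a minimal Python sketch. It is not the attached script: the function names are hypothetical, the target database is assumed to exist already, and attachments are not handled. It relies only on the standard `_all_docs?include_docs=true` and `_bulk_docs` endpoints.

```python
import json
import urllib.request


def docs_for_bulk_insert(all_docs_response):
    """Turn an _all_docs?include_docs=true response into a _bulk_docs body.

    Dropping _rev makes the target database treat every document as a
    fresh insert, so no revision history is carried over.
    """
    docs = []
    for row in all_docs_response.get("rows", []):
        doc = dict(row["doc"])  # copy so the caller's data is not mutated
        doc.pop("_rev", None)
        docs.append(doc)
    return {"docs": docs}


def copy_database(host, source_db, target_db):
    """Hypothetical driver: read all docs from source_db, write to target_db.

    host is expected to include scheme, port and trailing slash,
    for example "http://localhost:5984/". target_db must already exist.
    """
    url = f"{host}{source_db}/_all_docs?include_docs=true"
    with urllib.request.urlopen(url) as resp:
        body = docs_for_bulk_insert(json.load(resp))
    request = urllib.request.Request(
        f"{host}{target_db}/_bulk_docs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Because the copy starts every document at a first revision, the new database shares no history with the old one, which is the replication concern raised elsewhere in this thread.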


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Henrik Hofmeister (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081284#comment-13081284 ] 

Henrik Hofmeister commented on COUCHDB-1243:
--------------------------------------------

Export/import would do the trick as well, or at least make it easier. However, we are using CouchDB intensively for both moderate and huge dbs, and this forever-growing changes size will cause us to switch away from Couch eventually, as we are rapidly growing into SAN-size requirements, which makes CouchDB a very expensive db :( Also, making view changes and running compaction is getting to a point where it has to be done on weekends to allow time for it to update. Our main db has 2 changes for every document... with 7 million documents, we are facing a staggering 15 million changes :)

I'd at least consider that CouchDB is, to my understanding, built for web scale, and we are nowhere near our expected size and already growing out of it?


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Henrik Hofmeister (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081575#comment-13081575 ] 

Henrik Hofmeister commented on COUCHDB-1243:
--------------------------------------------

We update the documents, not delete and create, or at least not all the time. It's not temp data, it's just forever-growing data :) But anyway, good points. I'm starting to get the fact that CouchDB's main point is master/master replication, which we are also using it for on the more moderately sized dbs. It could be all right, though, to allow Couch to disable replication on certain dbs in favor of stuff like this?


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Robert Newson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081530#comment-13081530 ] 

Robert Newson commented on COUCHDB-1243:
----------------------------------------

Seriously? Bleh. We can surely do shard splitting without the horrors of _purge.


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Robert Newson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081584#comment-13081584 ] 

Robert Newson commented on COUCHDB-1243:
----------------------------------------

You could reduce _revs_limit on those databases, which will remove much of the overhead, with the caveat that replication could be impaired if no common ancestor can be found (not a problem if you never replicate).

curl -X PUT -d "number goes here" http://localhost:5984/dbname/_revs_limit
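
For scripting the same call across many databases, a Python equivalent of the curl command might look like the sketch below. The function name and the value 50 in the usage comment are illustrative assumptions, not recommendations from this thread. The request is only built here, not sent.

```python
import urllib.request


def revs_limit_request(base_url, db_name, limit):
    """Build (but do not send) the PUT request that sets _revs_limit."""
    if int(limit) < 1:
        raise ValueError("limit must be a positive integer")
    return urllib.request.Request(
        f"{base_url}/{db_name}/_revs_limit",
        data=str(int(limit)).encode("ascii"),  # the body is a bare integer
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Sending it (assumes a CouchDB instance is reachable at base_url):
#   urllib.request.urlopen(revs_limit_request("http://localhost:5984", "dbname", 50))
```

The lower limit only shows up on disk after the database is compacted again.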



        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Robert Newson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081508#comment-13081508 ] 

Robert Newson commented on COUCHDB-1243:
----------------------------------------

_purge is really for the "oops, I just put my admin password in a document" scenario. It's not well tested, has known and unresolved bugs, and obviously ruins eventual consistency. I'd rather see it removed than encouraged, but I think it's important for the narrow use case I just mentioned.

We only remember the _revs for the last 1000 updates to a document, so there is a cap (albeit a generous one) on how much is retained. When you say '6+ million changes', are these updates to existing documents, or are you deleting documents and making new ones?

If the latter, then you could consider the temporal database idea, which is often suggested when using CouchDB as a message queue: use a database per time interval (say, weekly). When the database is empty (i.e., only has deleted documents), you can delete the db entirely.

I'll finish by saying that CouchDB's retention of information about deleted documents and old revisions is central to CouchDB. If it's working so strongly against you, then I don't think it's the right database solution for your problem.
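
The database-per-time-interval idea can be sketched roughly as follows. The naming scheme and both function names are assumptions made for illustration. The only CouchDB detail relied on is that the database info document returned by `GET /dbname` carries a `doc_count` field counting non-deleted documents.

```python
from datetime import date


def weekly_db_name(prefix, day):
    """Return a per-week database name such as 'queue_2011_w32'.

    Uses the ISO year and week so names stay consistent across New Year.
    """
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}_{iso_year}_w{iso_week:02d}"


def is_droppable(db_info):
    """True when the info document from GET /dbname shows no live documents.

    Deleted stubs do not count towards doc_count, so a database holding
    only deleted documents is reported as empty.
    """
    return db_info.get("doc_count", 0) == 0
```

Writers pick the database for the current week, and a periodic job deletes any older weekly database for which `is_droppable` returns true.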




        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Henrik Hofmeister (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081594#comment-13081594 ] 

Henrik Hofmeister commented on COUCHDB-1243:
--------------------------------------------

Already done that though... but thanks


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082685#comment-13082685 ] 

Randall Leeds commented on COUCHDB-1243:
----------------------------------------

Also, if it wasn't already clear, this is bat country. Proceed with caution.


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Paul Joseph Davis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081352#comment-13081352 ] 

Paul Joseph Davis commented on COUCHDB-1243:
--------------------------------------------

There's a caveat and a note on purge, though. First, if you purge twice in a row without updating a view, you have to rebuild the view from scratch. For heavy users of views this becomes a problem. This is just an implementation detail at the moment and could be fixed at some point in the future.

And a note: there was another report of a bug this morning that looks as though it's triggered in the purge code and specifically affects compaction. There's been some speculation that it's the purge code, but I don't think anyone has sat down to comb through it yet to try to reproduce it.
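
One way to stay clear of the purge-twice caveat is to read a view between purge requests, so the index catches up after each one. The sketch below only plans the order of the HTTP calls. The function name, the design document name, and the view name are hypothetical, and the idea that one view read per purge is enough is an assumption drawn from the description above, not a tested guarantee.

```python
def purge_plan(db_name, batches, design_doc, view_name):
    """Return ordered (method, path, body) steps: purge, refresh view, repeat.

    Each batch is a _purge body mapping document ids to lists of revisions.
    The GET with limit=0 is there only to trigger an index update.
    """
    view_path = f"/{db_name}/_design/{design_doc}/_view/{view_name}?limit=0"
    steps = []
    for batch in batches:
        steps.append(("POST", f"/{db_name}/_purge", batch))
        steps.append(("GET", view_path, None))
    return steps
```

A database with several design documents would need one such read per design document after every purge.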


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Damien Katz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081324#comment-13081324 ] 

Damien Katz commented on COUCHDB-1243:
--------------------------------------

I mostly agree with Robert Newson that what you are asking for is a dangerous thing for CouchDB replication. However, there is the purge option, which "forgets" documents, deleted or otherwise, completely removing them from the internal indexes. Once documents are purged, compaction will completely remove them from the file forever. Unfortunately, I couldn't find actual documentation on the purge functionality, so the best place to figure out how to use purge is to look at the purge test in the browser test suite, which can be found here:

http://svn.apache.org/viewvc/couchdb/trunk/share/www/script/test/purge.js?view=co&revision=1086241&content-type=text%2Fplain

I've often thought it would be useful to purge docs during compaction, by providing a user-defined function to signal which unwanted docs/stubs to remove. But no such thing exists; in the meantime you can accomplish it with a purge + compaction.
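
The purge + compaction sequence could be scripted along these lines. This is a sketch under assumptions: the function name is hypothetical, the caller already knows which document ids and revisions to purge, and the two requests are only built, not sent. The `_purge` body maps each document id to a list of revisions, and `_compact` takes an empty POST with a JSON content type.

```python
import json
import urllib.request


def purge_then_compact_requests(base_url, db_name, revs_by_doc_id):
    """Build the two requests, in order: POST _purge, then POST _compact."""
    headers = {"Content-Type": "application/json"}
    purge = urllib.request.Request(
        f"{base_url}/{db_name}/_purge",
        data=json.dumps(revs_by_doc_id).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    compact = urllib.request.Request(
        f"{base_url}/{db_name}/_compact",
        data=b"",  # compaction takes no body
        headers=headers,
        method="POST",
    )
    return [purge, compact]


# Sending them in order (assumes a reachable CouchDB and admin rights):
#   for req in purge_then_compact_requests("http://localhost:5984", "dbname",
#                                          {"some_doc_id": ["2-abc"]}):
#       urllib.request.urlopen(req)
```

Compaction runs in the background, so the file does not shrink as soon as the second request returns.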


        

[jira] [Commented] (COUCHDB-1243) Compact and copy feature that resets changes

Posted by "Randall Leeds (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/COUCHDB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082657#comment-13082657 ] 

Randall Leeds commented on COUCHDB-1243:
----------------------------------------

If a smaller _revs_limit doesn't fix your problem then it sounds like you have some documents that are in conflict. The best way I can think to automate purging the conflicts would be to consume the /_changes feed with ?style=all_docs. Each entry in the feed will include an array of revisions in the 'changes' property. The first of these is the winning conflict revision. Then use /_purge to remove all but this winning revision and you'll be left with only the history of the winning version. If you only consume the _changes feed up to a sequence number before the stable replication checkpoints you won't be destroying revisions that haven't replicated yet and replication should continue to function. Additionally, documents that haven't been in conflict much but have received many updates will still have history back to _revs_limit and should replicate safely, without introducing new conflicts, so long as they haven't received a number of divergent updates.

Paul's caveats about _purge and view indexes apply.
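
The procedure above can be sketched as a small function that turns entries from a `?style=all_docs` changes feed into a `_purge` request body. The function name and the `max_seq` parameter are hypothetical. The code takes the first revision in each `changes` array to be the winner, as described above, and assumes integer sequence numbers as in CouchDB 1.x.

```python
def losing_revisions(changes_results, max_seq=None):
    """Map doc id -> non-winning leaf revisions from a style=all_docs feed.

    changes_results is the "results" array of a _changes response.
    Entries with seq greater than max_seq are skipped, so revisions that
    may not have replicated yet are left alone.
    """
    to_purge = {}
    for entry in changes_results:
        if max_seq is not None and entry["seq"] > max_seq:
            continue
        revs = [change["rev"] for change in entry["changes"]]
        losers = revs[1:]  # everything after the winning revision
        if losers:
            to_purge[entry["id"]] = losers
    return to_purge
```

The returned dictionary has the shape that `POST /dbname/_purge` expects, and documents without conflicts are left out entirely.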
