Posted to dev@couchdb.apache.org by Damien Katz <da...@apache.org> on 2008/09/11 22:12:28 UTC

new purge functionality

I just checked in the new document purge functionality, which removes
all information about a document's existence from a database. New
tests can be found in the test suite.

Purge is not to be confused with deletion. A deletion is like an edit
to a document, and it's replicated the same as a document edit. A
purge, however, is not a new document edit; it is the elimination of
the document and its meta-data from that instance of the database,
whereas deletions still preserve the meta-data. After a purge, the
same documents on other database replica instances are unaffected.

Purge exists for two reasons: to completely remove documents you no
longer care about (deletions from long ago), and because it's
necessary for database partitioning, when the number of partitions is
resized and documents need to be moved between partitions. Purging
documents is generally not something application code should worry
about.

Because we eliminate the record of the document from the database,
things that index the database, like views and full text search, must
take special steps to ensure their indexes no longer include the
purged document. One way to accomplish this is to completely rebuild
the indexes from scratch whenever something is purged. But that's very
expensive if you only purge a handful of documents, because you must
reexamine every document in the database.

To avoid this penalty, CouchDB keeps track of only the most recently
purged documents. The next time it purges more documents, it forgets
the previously purged ones. When the indexer notices the purge seq has
changed and it's only 1 seq number behind the database's purge seq, it
can retrieve the list of the most recently purged documents, remove
them from the index, update the index's purge seq, and then proceed to
update the index normally. If the database's purge seq is 2 or more
ahead of the last one the index recorded, the index is automatically
discarded and rebuilt from scratch.
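
To make the purge seq rule concrete, here is a minimal sketch in
Python. It is a toy model, not CouchDB's actual code: the Database and
Index classes, their fields, and the catch_up method are all
hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Toy database that remembers only its most recent purge."""
    docs: dict = field(default_factory=dict)
    purge_seq: int = 0
    last_purged_ids: list = field(default_factory=list)

    def purge(self, doc_ids):
        # Each purge bumps the purge seq and overwrites the record
        # of the previously purged documents.
        for doc_id in doc_ids:
            self.docs.pop(doc_id, None)
        self.purge_seq += 1
        self.last_purged_ids = list(doc_ids)


@dataclass
class Index:
    """Toy index mapping doc id -> indexed value."""
    entries: dict = field(default_factory=dict)
    purge_seq: int = 0

    def catch_up(self, db):
        """Bring the index up to date with db; return the action taken."""
        behind = db.purge_seq - self.purge_seq
        if behind == 0:
            action = "unchanged"
        elif behind == 1:
            # Only one purge was missed, so the database still knows
            # which documents to drop from the index.
            for doc_id in db.last_purged_ids:
                self.entries.pop(doc_id, None)
            action = "removed_purged"
        else:
            # Two or more purges were missed: the earlier purge lists
            # are gone, so discard the index and rebuild it.
            self.entries.clear()
            action = "rebuilt"
        self.purge_seq = db.purge_seq
        # Stand-in for the normal index update: add any documents
        # the index does not have yet.
        for doc_id, value in db.docs.items():
            self.entries.setdefault(doc_id, value)
        return action
```

In this model, an index that is refreshed after every purge only ever
takes the cheap "removed_purged" path; two purges in a row without a
refresh in between force the rebuild.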

This is already implemented by the view engine, but the full text
engine will still need to be modified to work with purge as well.

When purging, you must specify the doc Id and the revision(s) to
purge. If there is already a later revision of a document, that
document isn't purged. Any document revision that doesn't exist is
ignored. An additional limitation is that purge cannot happen during a
compaction; the client will get an error.

The typical operations to efficiently and completely purge documents
would be:
1. Purge the document(s).
2. Cause the view indexes to be refreshed (for each design doc, open a
view with count=0; this causes all of the design doc's view indexes to
be updated).
3. (Optionally) purge 0 more documents, which causes the record of our
purged documents to be dropped.
4. Compact the database (until this is done, remnants of the purged
documents can still be found in the db file when dumped raw).
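
The four steps above can be sketched as an ordered list of HTTP
requests. This Python snippet only builds the requests and does not
send them. The message does not spell out the HTTP interface, so the
_purge and _compact paths, the view path layout, the request bodies,
and the purge_plan helper itself are all assumptions for illustration
and may not match the checked-in code; only count=0 comes from the
message.

```python
import json


def purge_plan(db, revs_by_id, view_by_design_doc):
    """Return the ordered (method, path, body) requests for a full purge.

    revs_by_id maps each doc id to the list of revisions to purge.
    view_by_design_doc maps each design doc name to one of its views.
    All paths and bodies are assumed, not taken from the message.
    """
    steps = []
    # 1. Purge the given document revisions.
    steps.append(("POST", f"/{db}/_purge", json.dumps(revs_by_id)))
    # 2. Open one view per design doc with count=0 so that all of
    #    that design doc's view indexes are refreshed.
    for ddoc, view in view_by_design_doc.items():
        path = f"/{db}/_design/{ddoc}/_view/{view}?count=0"
        steps.append(("GET", path, None))
    # 3. (Optional) purge zero documents so the record of the
    #    previously purged documents is dropped.
    steps.append(("POST", f"/{db}/_purge", json.dumps({})))
    # 4. Compact the database to remove remnants from the db file.
    steps.append(("POST", f"/{db}/_compact", None))
    return steps
```

For example, purge_plan("mydb", {"doc1": ["rev-a"]}, {"app": "by_date"})
yields four requests, where "mydb", "doc1", "rev-a", "app" and
"by_date" are placeholder names. Step 2 has to come before step 3;
otherwise the indexes fall 2 purge seqs behind and get rebuilt from
scratch.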

Re: new purge functionality

Posted by Damien Katz <da...@apache.org>.

Oh yeah, I forgot to mention, this breaks the file format. Sorry.  
You'll need to dump, upgrade and import your databases.

-Damien

