Posted to user@couchdb.apache.org by Chris Stockton <ch...@gmail.com> on 2011/12/22 22:01:11 UTC

Database size seems off even after compaction runs.

We have a customer whose database takes 11MB of disk space. That is not
much, but the user actually has more like 100KB of data. One thing we
have noticed is that the user's doc_del_count is very high, indicating
that he likely had 11MB of data but later deleted it.

{"db_name":"...snip...","doc_count":49,"doc_del_count":42981,"update_seq":86097,"purge_seq":0,"compact_running":false,"disk_size":11427940,"instance_start_time":"1324102884349705","disk_format_version":5,"committed_update_seq":86097}

After compaction the size does not change; I have verified that the
physical file on the file system is 11MB.
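
As a rough sanity check on the numbers above, here is a small Python
sketch (not part of the original post). It takes the figures from the
database info shown earlier and works out the average on-disk size per
document, counting deleted documents too. The function names are made up
for illustration, and an average says nothing about how the space is
really distributed inside the file.

```python
import json

# Figures copied from the database info above (db_name is elided there).
DB_INFO = json.loads(
    '{"doc_count":49,"doc_del_count":42981,"update_seq":86097,'
    '"purge_seq":0,"compact_running":false,"disk_size":11427940}'
)

def bytes_per_doc(info):
    """Average on-disk bytes per document, live and deleted together."""
    total_docs = info["doc_count"] + info["doc_del_count"]
    return info["disk_size"] / total_docs

def deleted_fraction(info):
    """Share of all documents in the file that are marked as deleted."""
    total_docs = info["doc_count"] + info["doc_del_count"]
    return info["doc_del_count"] / total_docs

if __name__ == "__main__":
    print(f"{bytes_per_doc(DB_INFO):.0f} bytes per document on average")
    print(f"{deleted_fraction(DB_INFO):.1%} of documents are deleted")
```

With these figures the average comes out at roughly 266 bytes per
document, and about 99.9% of the documents in the file are deleted ones.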

Viewing _all_docs shows a page and a half of data, and Futon shows
that the user really is using around 100KB at most.

Any guesses here?

-Chris

Re: Database size seems off even after compaction runs.

Posted by Rogutės Sparnuotos <ro...@googlemail.com>.

Chris Stockton (2011-12-23 13:56):
> Hello,
> 
> On Fri, Dec 23, 2011 at 5:48 AM, CGS <cg...@gmail.com> wrote:
> > Hi,
> >
> > Sorry to interfere with such a question, but why don't you work with a
> > buffer database? I mean, make a replica to another database which filters
> > out the deleted documents. In such way you can clean all your databases and
> > you use temporary some extra-space (only during the "cleaning" process).
> > Another idea would be to use two databases: one active and one inactive at
> > the given time. That means, you move the data from one to the other,
> > filtering out the deleted documents, and when it's over, you switch to the
> > newly constructed database, while the other gets emptied (deleted and
> > re-created). Just my 2c opinions.
> >
> > CGS
> >
> 
> Thanks everyone for the various feedback. Now the information I have
> gathered is the disk utilization we are seeing is simply from the
> deleted documents.
> 
> The question I have yet to see answered (perhaps because it simply
> isn't possible) is how to reclaim this space?

Again, you can't reclaim space taken by empty deleted documents (the ones
that have only the _id, _rev, _deleted fields), but you can reclaim space
if your documents were deleted with a body (more than the 3 fields above).

Do you know the ids of deleted documents? If you do, you can look at them:
$ curl '<db>/<doc_id>?rev=<rev>'
You can delete them again, this time without leaving any fields:
$ curl -X PUT '<db>/<doc_id>?rev=<rev>' -d '{"_deleted":true}'
I don't know of any way to find the ids of deleted documents, except
by watching the _changes feed beforehand (as Robert Newson suggested).
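
For what it is worth, rows for deleted documents in a _changes response
carry "deleted": true, so the ids and revisions can be picked out of an
already fetched response. The sketch below (Python) only does the
filtering; fetching the feed is left out, and the sample data and the
function name are invented for illustration.

```python
def deleted_ids(changes):
    """Return (id, rev) pairs for rows flagged as deleted.

    `changes` is the parsed JSON body of a _changes response.
    """
    pairs = []
    for row in changes.get("results", []):
        if row.get("deleted"):
            for change in row.get("changes", []):
                pairs.append((row["id"], change["rev"]))
    return pairs

# Hypothetical sample in the shape of a _changes response.
SAMPLE = {
    "results": [
        {"seq": 1, "id": "doc-a", "changes": [{"rev": "1-aaa"}]},
        {"seq": 3, "id": "doc-b", "changes": [{"rev": "2-bbb"}],
         "deleted": True},
    ],
    "last_seq": 3,
}

if __name__ == "__main__":
    for doc_id, rev in deleted_ids(SAMPLE):
        print(doc_id, rev)
```

Each pair could then be fed into the two curl commands above.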

-- 
--  Rogutės Sparnuotos

Re: Database size seems off even after compaction runs.

Posted by Rogutės Sparnuotos <ro...@googlemail.com>.

Mark Hahn (2011-12-24 16:39):
> Some say nothing but minimal id/rev data is retained after deleting a doc ...
> 
> Henrik Lundgren said ...
> > understand that the body of an empty deleted document is only this:
> >  {"_id": <id>, "_rev":<revision>, "_deleted": true}
> 
> Jens Alfke said ...
> > Those should be pretty small, though, since they’re just trees of revision IDs.
> 
> Rogutės Sparnuotos said ...
> > you can't reclaim space taken by empty deleted documents (the ones
> > that have only the _id, _rev, _deleted fields),
> 
> --------------------
> But some say the entire document is retained ...
> 
> Robert Newson said
> > It's worth saying again that compaction does *not* remove "deleted
> > documents’ contents".
> 
> CGS said ...
> > deleting a document just makes it unavailable
> 
> --------------------
> 
> So, who is right?

Sorry, but I can't see anything contradictory in the quotes you posted.

How would you define "deleting a doc"?

-- 
--  Rogutės Sparnuotos

Re: Database size seems off even after compaction runs.

Posted by Jens Alfke <je...@couchbase.com>.

On Dec 24, 2011, at 4:39 PM, Mark Hahn wrote:

Some say nothing but minimal id/rev data is retained after deleting a doc …
...
But some say the entire document is retained ...

Deleting a doc just adds a new revision (a “tombstone”) that’s marked as deleted.
Compacting removes the space occupied by the contents of non-current revisions.
Therefore, after deleting a doc and compacting, all that remains of it is the tombstone revision (and the revision tree, which is tiny).

The source of confusion, I think, is that the tombstone may or may not contain just “minimal id/rev data”. Normally it will, if you used the DELETE method. But if you try to delete a document just by adding a “_deleted” property to it, you’ve literally(!) written your own tombstone, and that tombstone contains all the data of the previous revision because that’s what was in your PUT.

This whole thread and the resulting confusion is all about that weird edge case of deleting by adding “_deleted”. The moral of the story is: don’t do that. I don’t know if that behavior is actively deprecated, but it seems unlikely to be what one would want to happen.
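
To make the difference concrete, here is a small Python sketch (an
illustration only, with an invented sample document and invented
function names). It builds the two request bodies discussed above: a
minimal tombstone, and the "fat" tombstone you get by adding "_deleted"
to the existing document and PUTting it back.

```python
import json

def minimal_tombstone(doc):
    """Keep only the fields a deletion stub needs."""
    return {"_id": doc["_id"], "_rev": doc["_rev"], "_deleted": True}

def fat_tombstone(doc):
    """Copy the whole document and just add the _deleted flag."""
    stub = dict(doc)
    stub["_deleted"] = True
    return stub

# Hypothetical document with a 1000-character body.
DOC = {"_id": "msg-1", "_rev": "1-abc", "subject": "hello", "body": "x" * 1000}

if __name__ == "__main__":
    for build in (minimal_tombstone, fat_tombstone):
        size = len(json.dumps(build(DOC)))
        print(f"{build.__name__}: {size} bytes of JSON")
```

The second body still carries "subject" and "body", and since the
tombstone is the current revision, that is the data that stays on disk
after compaction.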

—Jens

Re: Database size seems off even after compaction runs.

Posted by Mark Hahn <ma...@hahnca.com>.

Some say nothing but minimal id/rev data is retained after deleting a doc ...

Henrik Lundgren said ...
> understand that the body of an empty deleted document is only this:
>  {"_id": <id>, "_rev":<revision>, "_deleted": true}

Jens Alfke said ...
> Those should be pretty small, though, since they’re just trees of revision IDs.

Rogutės Sparnuotos said ...
> you can't reclaim space taken by empty deleted documents (the ones
> that have only the _id, _rev, _deleted fields),

--------------------
But some say the entire document is retained ...

Robert Newson said
> It's worth saying again that compaction does *not* remove "deleted
> documents’ contents".

CGS said ...
> deleting a document just makes it unavailable

--------------------

So, who is right?

Re: Database size seems off even after compaction runs.

Posted by Rogutės Sparnuotos <ro...@googlemail.com>.

Henrik Lundgren (2011-12-23 13:20):
> Ok, so how do I prevent the database from consuming all diskspace in
> the long run?
> 
> I'm developing an application that is quite insert heavy ( about 6 Gb
> / day ), the database is essentially a message inbox.
> 
> I plan to delete obsolete messages in a housekeeping job, but if
> CouchDB will retain the latest revision of all documents I might have
> to reconsider using CouchDB, which is a pity :-(

Yes, deleted documents will stay even after your housekeeping job. But do
understand that the body of an empty deleted document is only this:
  {"_id": <id>, "_rev":<revision>, "_deleted": true}
So the right question would be how much space does an empty deleted
document need?

After a quick test, I would say you need:
~100KB / 1000 properly deleted documents with 1 revision
~200KB / 1000 with 10 revisions
  ~1MB / 1000 with 100 revisions

But I have seen some inconsistencies (bugs?) with higher numbers; don't
have time to look deeper right now.
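
The quick-test figures above can be turned into a back-of-the-envelope
estimate. The Python sketch below is only that: it assumes 1KB = 1000
bytes, assumes the cost scales linearly with the number of deleted
documents, and only knows the three measured revision counts. The names
are invented for illustration.

```python
# Approximate bytes per properly deleted document, from the quick test:
# ~100KB, ~200KB and ~1MB per 1000 documents at 1, 10 and 100 revisions.
BYTES_PER_TOMBSTONE = {1: 100, 10: 200, 100: 1000}

def estimate_tombstone_bytes(deleted_docs, revisions):
    """Estimate disk space used by `deleted_docs` empty tombstones."""
    if revisions not in BYTES_PER_TOMBSTONE:
        raise ValueError("no measurement for %d revisions" % revisions)
    return deleted_docs * BYTES_PER_TOMBSTONE[revisions]

if __name__ == "__main__":
    for revs in (1, 10, 100):
        print(revs, "revisions:", estimate_tombstone_bytes(42981, revs), "bytes")
```

For the 42981 deleted documents reported at the start of this thread,
that gives about 4.3MB at 1 revision and about 8.6MB at 10 revisions,
which is at least the same order of magnitude as the 11MB file.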

-- 
--  Rogutės Sparnuotos

Re: Database size seems off even after compaction runs.

Posted by CGS <cg...@gmail.com>.

Well, at least it's a nice introduction to the philosophy of CouchDB,
isn't it? :)

CGS




On 12/27/2011 11:10 PM, Jens Alfke wrote:
> On Dec 27, 2011, at 1:57 PM, CGS wrote:
>
>> http://guide.couchdb.org/draft/
> Oh, OK. I was thinking more of reference documentation, not a tutorial. CouchDB badly needs a solid reference.
>
> —Jens

Re: Database size seems off even after compaction runs.

Posted by Jens Alfke <je...@couchbase.com>.

On Dec 27, 2011, at 1:57 PM, CGS wrote:

> http://guide.couchdb.org/draft/

Oh, OK. I was thinking more of reference documentation, not a tutorial. CouchDB badly needs a solid reference.

—Jens

Re: Database size seems off even after compaction runs.

Posted by CGS <cg...@gmail.com>.

I was referring to this:

http://guide.couchdb.org/draft/

CGS

On 12/27/2011 09:55 PM, Jens Alfke wrote:
> On Dec 27, 2011, at 1:55 AM, CGS wrote:
>
>> I understand your confusion about documentation, but I can say, this
>> behavior is documented. If you read carefully the official documentation
> Which is the 'official documentation'? So far the most comprehensive thing I've found is the wiki, especially the "Complete HTTP API Reference" and the pages it links to, but I've never gotten the feeling that this is a definitive source of documentation, just a grab-bag that gets updated when people feel like it (and which contains a lot of obsolete/inaccurate information). CouchDB doesn't seem to have anything like, say, the MySQL reference manual.
>
> —Jens

Re: Database size seems off even after compaction runs.

Posted by Jens Alfke <je...@couchbase.com>.

On Dec 27, 2011, at 1:55 AM, CGS wrote:

> I understand your confusion about documentation, but I can say, this 
> behavior is documented. If you read carefully the official documentation

Which is the 'official documentation'? So far the most comprehensive thing I've found is the wiki, especially the "Complete HTTP API Reference" and the pages it links to, but I've never gotten the feeling that this is a definitive source of documentation, just a grab-bag that gets updated when people feel like it (and which contains a lot of obsolete/inaccurate information). CouchDB doesn't seem to have anything like, say, the MySQL reference manual.

—Jens

Re: Database size seems off even after compaction runs.

Posted by CGS <cg...@gmail.com>.

I understand your confusion about the documentation, but I can say that
this behavior is documented. If you read the official documentation
carefully, you will see the bank example, in which a transaction that
does not succeed is not erased but updated. The same applies here:
deleting a document is just an update of the old document (i.e., it
gets a new revision that is marked as deleted). In other words,
"nothing is lost, everything is transformed." There is no reason to be
concerned about security here, because if you use HTTP GET on the
document, CouchDB will return JSON of the form
{"error":"not_found","reason":"deleted"} (whereas a non-existing
document is reported with its own reason, "document does not exist" or
so - I don't remember the exact message, but I know it's quite
meaningful). Even if somebody breaks into your server and obtains the
file or the admin password, after compaction all the previous versions
of the document are no longer there, so no data can be extracted from
them.
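
A client can tell the two cases apart by looking at the "reason" field
of the error body. The Python sketch below shows the idea on already
received response bodies. The "deleted" body is the one quoted above;
the "missing" reason for a document that never existed is what I
believe CouchDB returns, so treat that string as an assumption. The
function names are invented for illustration.

```python
import json

def not_found_reason(body):
    """Return the 'reason' of a not_found error body, or None otherwise."""
    parsed = json.loads(body)
    if parsed.get("error") == "not_found":
        return parsed.get("reason")
    return None

def was_deleted(body):
    """True if the body says the document existed but has been deleted."""
    return not_found_reason(body) == "deleted"

if __name__ == "__main__":
    print(was_deleted('{"error":"not_found","reason":"deleted"}'))
    # Assumed reason string for a document that never existed:
    print(was_deleted('{"error":"not_found","reason":"missing"}'))
```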

Nevertheless, as was said here, there are a few distinct cases:
1. Using the HTTP DELETE method. That ensures the minimum amount of
data is written to the hard disk.
2. Using "_deleted":true in combination with HTTP PUT/POST. If no other
data is included in the document when sending the request, it has the
same effect as the first option.
3. Emptying the document. This will reduce the document size even more,
but it will not allow you to reuse the document unless you provide the
correct revision of the document (with the other options, no revision
is required).

My point in enumerating these options is related to their usage. If you
can afford one HTTP request at a time, then using DELETE is probably
the best option. But in many cases that is a luxury you cannot afford
because of hard disk write speed limitations. In most cases you would
like to use bulk operations, which means buffering your data. At that
point, option 1 is no longer available.
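
For the bulk case, option 2 can still produce minimal tombstones if the
buffered request contains nothing but the three required fields. The
Python sketch below builds such a body for POSTing to <db>/_bulk_docs.
It only constructs the JSON; sending it is left out, and the ids,
revisions and function name are invented for illustration.

```python
import json

def bulk_delete_payload(id_rev_pairs):
    """Build a _bulk_docs body that deletes each (id, rev) pair
    with a minimal tombstone, i.e. without any leftover fields."""
    return {
        "docs": [
            {"_id": doc_id, "_rev": rev, "_deleted": True}
            for doc_id, rev in id_rev_pairs
        ]
    }

if __name__ == "__main__":
    # Hypothetical buffered deletions.
    pending = [("msg-1", "1-abc"), ("msg-2", "3-def")]
    print(json.dumps(bulk_delete_payload(pending)))
```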

As you can see, each of the options has its own advantages, but also 
disadvantages/limitations. But that is another story already.

This choice of behavior has two major pros:
1. History. If you delete a document that you need later on, the
deletion can easily be undone by reverting the document to the previous
revision (provided that no compaction was triggered between the two
actions).
2. Hard disk write speed optimization. If you delete a document and
want to reuse its name later, and the pointer to the document were
simply removed, then you would have to trigger a compaction to avoid a
document name conflict. That is a much slower process than just
updating a document.

The only way to delete a document completely is to re-create the
physical file containing the database. But if that is more annoying
than a few extra bytes per document, then leave the "tombstone" there.
If neither of these options is convenient for your project, then
CouchDB may not be what you need (I am not discouraging people from
using CouchDB, only stating the fact that there is no gain without
pain, and using CouchDB is quite a gain in my opinion). Nevertheless,
keep in mind that there is a way to reclaim the physical space held by
the deleted documents.

And two more things I would like to clarify from my previous messages:
1. "making the document unavailable" meant that HTTP GET will return
"error" when trying to access a deleted document;
2. when I was speaking about my design for the given case, I stated
that there are limitations in the specified design (e.g., the race
condition and how often you can trigger such a switch), so one can
invent another design based on the information (as I said before) that
deleting a document completely can be done only by re-creating the
database while filtering out the deleted documents (e.g., no "crazy
storage blowouts" if you use a round-robin on all your databases, just
the temporary inconvenience of adding some extra space to your server
system - PC, cluster... - while you perform the space reclaiming
procedure).
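
A rough sketch of that "re-create while filtering" idea, in Python. It
only builds the two JSON bodies involved: a design document with a
replication filter that skips deleted documents, and a request body for
POSTing to /_replicate. The design document name "cleanup", the filter
name "no_deleted" and the database names are all invented, and nothing
is sent to a server here.

```python
import json

# Design document to store in the source database. The filter itself is
# JavaScript, kept here as a string; it rejects deleted documents.
FILTER_DDOC = {
    "_id": "_design/cleanup",
    "filters": {
        "no_deleted": "function(doc, req) { return !doc._deleted; }"
    },
}

def replication_request(source, target, filter_name="cleanup/no_deleted"):
    """Build a /_replicate body that copies only non-deleted documents."""
    return {
        "source": source,
        "target": target,
        "create_target": True,
        "filter": filter_name,
    }

if __name__ == "__main__":
    print(json.dumps(FILTER_DDOC))
    print(json.dumps(replication_request("inbox_old", "inbox_new")))
```

Keep in mind that a database built this way has no tombstones at all,
so if it is later replicated with a peer that still holds the old
documents, those documents may come back.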

CGS

On 12/25/2011 01:10 AM, Daniel Bryan wrote:
> I understand if this is necessary for eventual consistency, but shouldn't
> this be better-documented? I generally expected that if I delete sensitive
> or unwanted data, or that a user requests that their personal or private
> data be deleted, it'll be deleted in a way that's more solid than basically
> hiding it. Sure, CouchDB won't let you get at that document, but it's
> certainly still there on the disk, and presumably detectable if you
> inspected the data structure that holds individual documents. Not a very
> good situation vis a vis security. I know that normal unix "deletion"
> leaves files technically on disk, but there are ways to allow for that and
> prevent it from being an issue.
>
> Even setting data security aside, I've been using CouchDB as a kind of
> staging environment for large amounts of data which should ultimately be
> elsewhere (different flavours relational databases, databases belonging to
> different organisations, etc.) because it's really easy to implement as an
> interface and let people just throw whatever they want into it with a POST.
> It's really the perfect tool for that, but pretty soon there'll be tens of
> gigabytes a day of data flowing through the system, and most of it just
> needs to be indexed for a while before our scheduled scripts pull it all
> out, shove it elsewhere and delete it. In this use case, if I'm
> understanding this correctly, we'll get crazy storage blowouts unless we
> implement a bunch of hacks to switch to new databases after performing
> deletions (as well as scripts that make our HTTP reverse proxy
> transparently and intelligently route data to the new database - absolutely
> not a trivial task in any complex system with many moving parts).
>
> But you know, this all comes with the territory. If the devs say there's a
> good reason for documents to stick around after deletion, I believe them,
> but I think that's a pretty huge point and I don't know how I've missed it.
>
> What's the way to delete a document if I actually want to really delete the
> data? Changing it to a blank document before deleting, and then compacting?
>
> On Sat, Dec 24, 2011 at 2:37 PM, Jens Alfke<je...@couchbase.com>  wrote:
>
>> On Dec 23, 2011, at 4:09 PM, Mark Hahn wrote:
>>
>>> 1) How exactly could you make this switch without interrupting service?
>> Replicate database to new db, then atomically switch your proxy or
>> whatever to the new db from the old one.
>> Depending on how long the replication takes, there’s a race condition here
>> where changes made to the old db during the replication won’t be propagated
>> to the new one; you could either repeat the process incrementally until
>> this doesn’t happen, or else put the db into read-only mode while you’re
>> doing the copy.
>>
>> This might also be helpful: http://tinyurl.com/89lr3fl
>>
>>> 2) Wouldn't this procedure create the exact same eventual consistency
>>> problems that deleting documents in a db would?
>> No; what’s necessary is the revision tree, and the replication will
>> preserve that. You’re just losing the contents of the deleted revisions
>> that accidentally got left behind because of the weird way the documents
>> were deleted.
>>
>> —Jens
>>
>>

Re: Database size seems off even after compaction runs.

Posted by Robert Newson <ro...@gmail.com>.

Yes, you can create a new doc where the deleted doc was.

Sent from my iPhone

On 25 Dec 2011, at 12:10, George Burt <im...@gmail.com> wrote:

> So, can I re-use the deleted document?  My _id is part of the data and has
> meaning.  If I delete the old _id, am I not allowed to have that same
> meaning again by reclaiming the _id?  _id="block_1_house_1"  then a
> hurricane and so we delete it.  Then we rebuild it (maybe) and so I need
> _id="block_1_house_1" again.
>
> George

Re: Database size seems off even after compaction runs.

Posted by George Burt <im...@gmail.com>.

So, can I re-use the deleted document?  My _id is part of the data and has
meaning.  If I delete the old _id, am I not allowed to have that same
meaning again by reclaiming the _id?  _id="block_1_house_1"  then a
hurricane and so we delete it.  Then we rebuild it (maybe) and so I need
_id="block_1_house_1" again.

George


-- 
George Burt
President
TrueShot Enterprises, LLC.
(386) 208-1309
Fax (213) 477-2195
www.TrueShot.com
12756 92nd Ter
Live Oak, FL 32060

Re: Database size seems off even after compaction runs.

Posted by Robert Newson <rn...@apache.org>.

Mark,

Using the DELETE method simply updates the document to

  {"_id":"foo","_rev":"newrev","_deleted":true}

If you did the same via PUT or POST, you'd get exactly the same effect
as DELETE.

Daniel,

You have a valid point, that this should be better documented. It is
unknown how many phantom documents are out there, those that were
deleted by adding _deleted:true on the assumption that this cleans out
the document. In fact, when I first noticed this effect I created a
JIRA ticket and applied a patch to fix it, before Damien pointed out
that this behavior is intentional (indeed, necessary).

To answer your final question, CouchDB preserves what you ask it to,
it does not alter the contents of documents itself. So, if you save
{"_id":"foo","_rev":"newrev","_deleted":true, "password to my bank
account":"foobar"}, it will do so. Use either the DELETE HTTP method
or POST/PUT only the document you wish to be stored (minimum is, as
noted above, _id, _rev and _deleted).

B.



Re: Database size seems off even after compaction runs.

Posted by Jens Alfke <je...@couchbase.com>.

No. If you delete a document properly (using DELETE, not just setting a _deleted property) you won't have this problem. The old revision with the data will be gone after compaction, leaving only an empty "tombstone".
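
For clients that delete through _bulk_docs instead of the DELETE method, the same effect can be approximated by sending only a stub rather than the whole document with "_deleted" added. A minimal Python sketch, assuming the documents have already been fetched and parsed; the function names are made up for illustration and nothing here talks to a server:

```python
import json


def tombstone(doc):
    """Return a minimal deletion stub for a document fetched from CouchDB.

    Only _id, _rev and _deleted are kept, so no body travels with the delete.
    """
    return {"_id": doc["_id"], "_rev": doc["_rev"], "_deleted": True}


def bulk_delete_body(docs):
    """Build a JSON body for POST /{db}/_bulk_docs that deletes every doc in `docs`."""
    return json.dumps({"docs": [tombstone(d) for d in docs]})


if __name__ == "__main__":
    # Hypothetical document; the large body must not end up in the tombstone.
    doc = {"_id": "msg-1", "_rev": "1-abc", "subject": "hello", "body": "x" * 1000}
    print(bulk_delete_body([doc]))
```

The point of the design is that the deleted revision is built from scratch instead of being a copy of the last revision, so there is no body left to survive compaction.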

--Jens     [via iPhone]

On Dec 24, 2011, at 4:10 PM, "Daniel Bryan" <da...@gmail.com> wrote:

> I understand if this is necessary for eventual consistency, but shouldn't
> this be better-documented? I generally expected that if I delete sensitive
> or unwanted data, or that a user requests that their personal or private
> data be deleted, it'll be deleted in a way that's more solid than basically
> hiding it. Sure, CouchDB won't let you get at that document, but it's
> certainly still there on the disk, and presumably detectable if you
> inspected the data structure that holds individual documents. Not a very
> good situation vis a vis security. I know that normal unix "deletion"
> leaves files technically on disk, but there are ways to allow for that and
> prevent it from being an issue.
> 
> Even setting data security aside, I've been using CouchDB as a kind of
> staging environment for large amounts of data which should ultimately be
> elsewhere (different flavours relational databases, databases belonging to
> different organisations, etc.) because it's really easy to implement as an
> interface and let people just throw whatever they want into it with a POST.
> It's really the perfect tool for that, but pretty soon there'll be tens of
> gigabytes a day of data flowing through the system, and most of it just
> needs to be indexed for a while before our scheduled scripts pull it all
> out, shove it elsewhere and delete it. In this use case, if I'm
> understanding this correctly, we'll get crazy storage blowouts unless we
> implement a bunch of hacks to switch to new databases after performing
> deletions (as well as scripts that make our HTTP reverse proxy
> transparently and intelligently route data to the new database - absolutely
> not a trivial task in any complex system with many moving parts).
> 
> But you know, this all comes with the territory. If the devs say there's a
> good reason for documents to stick around after deletion, I believe them,
> but I think that's a pretty huge point and I don't know how I've missed it.
> 
> What's the way to delete a document if I actually want to really delete the
> data? Changing it to a blank document before deleting, and then compacting?
> 
> On Sat, Dec 24, 2011 at 2:37 PM, Jens Alfke <je...@couchbase.com> wrote:
> 
>> 
>> On Dec 23, 2011, at 4:09 PM, Mark Hahn wrote:
>> 
>>> 1) How exactly could you make this switch without interrupting service?
>> 
>> Replicate database to new db, then atomically switch your proxy or
>> whatever to the new db from the old one.
>> Depending on how long the replication takes, there’s a race condition here
>> where changes made to the old db during the replication won’t be propagated
>> to the new one; you could either repeat the process incrementally until
>> this doesn’t happen, or else put the db into read-only mode while you’re
>> doing the copy.
>> 
>> This might also be helpful: http://tinyurl.com/89lr3fl
>> 
>>> 2) Wouldn't this procedure create the exact same eventual consistency
>>> problems that deleting documents in a db would?
>> 
>> No; what’s necessary is the revision tree, and the replication will
>> preserve that. You’re just losing the contents of the deleted revisions
>> that accidentally got left behind because of the weird way the documents
>> were deleted.
>> 
>> —Jens
>> 
>>

Re: Database size seems off even after compaction runs.

Posted by Daniel Bryan <da...@gmail.com>.

I understand if this is necessary for eventual consistency, but shouldn't
this be better-documented? I generally expected that if I delete sensitive
or unwanted data, or that a user requests that their personal or private
data be deleted, it'll be deleted in a way that's more solid than basically
hiding it. Sure, CouchDB won't let you get at that document, but it's
certainly still there on the disk, and presumably detectable if you
inspected the data structure that holds individual documents. Not a very
good situation vis a vis security. I know that normal unix "deletion"
leaves files technically on disk, but there are ways to allow for that and
prevent it from being an issue.

Even setting data security aside, I've been using CouchDB as a kind of
staging environment for large amounts of data which should ultimately be
elsewhere (different flavours of relational databases, databases belonging to
different organisations, etc.) because it's really easy to implement as an
interface and let people just throw whatever they want into it with a POST.
It's really the perfect tool for that, but pretty soon there'll be tens of
gigabytes a day of data flowing through the system, and most of it just
needs to be indexed for a while before our scheduled scripts pull it all
out, shove it elsewhere and delete it. In this use case, if I'm
understanding this correctly, we'll get crazy storage blowouts unless we
implement a bunch of hacks to switch to new databases after performing
deletions (as well as scripts that make our HTTP reverse proxy
transparently and intelligently route data to the new database - absolutely
not a trivial task in any complex system with many moving parts).

But you know, this all comes with the territory. If the devs say there's a
good reason for documents to stick around after deletion, I believe them,
but I think that's a pretty huge point and I don't know how I've missed it.

What's the way to delete a document if I actually want to really delete the
data? Changing it to a blank document before deleting, and then compacting?

On Sat, Dec 24, 2011 at 2:37 PM, Jens Alfke <je...@couchbase.com> wrote:

>
> On Dec 23, 2011, at 4:09 PM, Mark Hahn wrote:
>
> > 1) How exactly could you make this switch without interrupting service?
>
> Replicate database to new db, then atomically switch your proxy or
> whatever to the new db from the old one.
> Depending on how long the replication takes, there’s a race condition here
> where changes made to the old db during the replication won’t be propagated
> to the new one; you could either repeat the process incrementally until
> this doesn’t happen, or else put the db into read-only mode while you’re
> doing the copy.
>
> This might also be helpful: http://tinyurl.com/89lr3fl
>
> > 2) Wouldn't this procedure create the exact same eventual consistency
> > problems that deleting documents in a db would?
>
> No; what’s necessary is the revision tree, and the replication will
> preserve that. You’re just losing the contents of the deleted revisions
> that accidentally got left behind because of the weird way the documents
> were deleted.
>
> —Jens
>
>

Re: Database size seems off even after compaction runs.

Posted by Jens Alfke <je...@couchbase.com>.

On Dec 23, 2011, at 4:09 PM, Mark Hahn wrote:

> 1) How exactly could you make this switch without interrupting service?

Replicate database to new db, then atomically switch your proxy or whatever to the new db from the old one.
Depending on how long the replication takes, there’s a race condition here where changes made to the old db during the replication won’t be propagated to the new one; you could either repeat the process incrementally until this doesn’t happen, or else put the db into read-only mode while you’re doing the copy.
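
The "repeat the process incrementally" idea could be sketched like this in Python. It assumes the response to POST /_replicate carries a "no_changes" flag when nothing was left to copy; `replicate_once` is a hypothetical callable standing in for whatever HTTP client you use, and the function names are made up:

```python
def replication_request(source, target):
    """JSON body for POST /_replicate; source and target are db names or URLs."""
    return {"source": source, "target": target}


def catch_up(replicate_once, max_rounds=10):
    """Call replicate_once() until it reports nothing left to copy.

    replicate_once is a hypothetical callable that triggers one replication
    and returns the parsed JSON response. Returns the number of rounds run,
    or None if the source never went quiet within max_rounds.
    """
    for round_number in range(1, max_rounds + 1):
        response = replicate_once()
        if response.get("no_changes"):
            return round_number
    return None
```

Even when a round reports no changes, a write can still land on the old db before the proxy is switched, so this only narrows the race; read-only mode is what closes it.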

This might also be helpful: http://tinyurl.com/89lr3fl

> 2) Wouldn't this procedure create the exact same eventual consistency
> problems that deleting documents in a db would?

No; what’s necessary is the revision tree, and the replication will preserve that. You’re just losing the contents of the deleted revisions that accidentally got left behind because of the weird way the documents were deleted.

—Jens

Re: Database size seems off even after compaction runs.

Posted by CGS <cg...@gmail.com>.

1. Think of the service as a quantized stream of data, not a continuous
one. Switching from one db to another just means diverting the flow
between two data transmission sequences. The actual implementation
depends on your project. I don't know your project, but for the sake of
argument, consider two databases: B (at the back end) and F (at the
front end). Also, say F is connected to another HTTP server (maybe it's
just me, but I don't rely on CouchDB alone to respond to all HTTP
requests).

Let's reclaim the space from B first. I create a database BT and start
transferring all the available documents (a delete event just makes a
document unavailable). Once that finishes, I "cut the pipe" between B
and F (stopping replication, or whatever mechanism connects B with F)
and "redirect the pipe" toward BT (starting replication or whatever
other mechanism you use; for replication I would add a filter, but
that's another story). You can also do it in the reverse order
(redirect first, then cut). Once the data flow is redirected, delete B
and re-create it. That deletes the file from the hard disk and creates
a new one.

Second, to reclaim F, follow the same procedure, except that the switch
is handled by the HTTP server (a redirect can be done even with a
simple JavaScript command; all you need to do is switch the old page to
a temporary new one). If programmed correctly, the user won't notice
anything except a slight delay in loading the page (the redirect).

Maybe I've worked too much with YAWS and Erlang, but I usually create a
simple application which checks the correctness of the data before
injecting it into the database. The delay is negligible (I use bulk
operations, which peak higher than the volume of documents YAWS can
send), and the switch can be done by a simple command sent to the TCP
server within the Erlang application. That covers the back-end
database. For the front end, the redirect is just a matter of replacing
the web page (no service interruption for YAWS, though it's a bit more
complex when a file cache is in use). That would be my design for this
particular example.

2. Would it? If you transfer only the available documents from B to BT
or from F to FT (from the example above), BT/FT uses only the space of
the documents you want to keep (the transfer is done not through
CouchDB replication but with a bit of manual work, or maybe with
filtered replication, but I am not sure about that). Once B/F is
deleted, the file containing the database is removed from the hard disk
(the physical space it occupied is freed, meaning the OS can reuse
it), so no history is kept when the database is created again. That
reclaims the space for sure.
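
The "manual work" variant could look roughly like this in Python: read the live documents from the old database and write them to the new one in bulk. This is only a sketch under assumptions. It takes an already parsed response from GET /{db}/_all_docs?include_docs=true (which lists only non-deleted docs) and produces a body for POST /{db}/_bulk_docs; `bulk_copy_body` is a made-up name, and attachments are not handled:

```python
def bulk_copy_body(all_docs_response):
    """Turn an _all_docs?include_docs=true response into a _bulk_docs body.

    _rev is dropped so the target database assigns fresh revisions; this
    discards the revision history, which is the point of the workaround.
    """
    docs = []
    for row in all_docs_response["rows"]:
        doc = dict(row["doc"])  # shallow copy so the input is not mutated
        doc.pop("_rev", None)
        docs.append(doc)
    return {"docs": docs}
```

Because the copies get new revisions, the new database no longer shares history with any other replica of the old one, which is the limitation Mark's second question is pointing at.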

Of course, even for this example, there are limitations in using such a
design. But it can be a starting point for designing your project.
If you want something simpler, then maybe you should ask the developers
to add a "no history" option to CouchDB (it wouldn't be a bad idea, and I
am not being ironic here).

But, as I mentioned before, the design depends on your project only and 
there is no general solution.

I hope this opinion will help you in your project.

CGS

On 12/24/2011 01:09 AM, Mark Hahn wrote:
>>   That means, you move the data from one to the other, filtering out the
> deleted documents, and when it's over, you switch to the newly constructed
> database, while the other gets emptied (deleted and re-created).
>
> 1) How exactly could you make this switch without interrupting service?
>
> 2) Wouldn't this procedure create the exact same eventual consistency
> problems that deleting documents in a db would?
>

Re: Database size seems off even after compaction runs.

Posted by Mark Hahn <ma...@hahnca.com>.

>  That means, you move the data from one to the other, filtering out the
deleted documents, and when it's over, you switch to the newly constructed
database, while the other gets emptied (deleted and re-created).

1) How exactly could you make this switch without interrupting service?

2) Wouldn't this procedure create the exact same eventual consistency
problems that deleting documents in a db would?

Re: Database size seems off even after compaction runs.

Posted by CGS <cg...@gmail.com>.

I think the two options I gave you answer your question, because deleting
a database is not the same as deleting a document (deleting a
document just makes it unavailable, while deleting the database
physically erases the file from the hard disk). I know it's more of a
workaround and it requires a bit of work, but it reclaims all the space
used by the deleted documents.

CGS





On 12/23/2011 09:56 PM, Chris Stockton wrote:
> Hello,
>
> On Fri, Dec 23, 2011 at 5:48 AM, CGS<cg...@gmail.com>  wrote:
>> Hi,
>>
>> Sorry to interfere with such a question, but why don't you work with a
>> buffer database? I mean, make a replica to another database which filters
>> out the deleted documents. In such way you can clean all your databases and
>> you use temporary some extra-space (only during the "cleaning" process).
>> Another idea would be to use two databases: one active and one inactive at
>> the given time. That means, you move the data from one to the other,
>> filtering out the deleted documents, and when it's over, you switch to the
>> newly constructed database, while the other gets emptied (deleted and
>> re-created). Just my 2c opinions.
>>
>> CGS
>>
> Thanks everyone for the various feedback. Now the information I have
> gathered is the disk utilization we are seeing is simply from the
> deleted documents.
>
> The question I have yet to see answered (perhaps because it simply
> isn't possible) is how to reclaim this space?
>
> -Chris

Re: Database size seems off even after compaction runs.

Posted by Nathan Vander Wilt <na...@calftrail.com>.

On Dec 23, 2011, at 2:56 PM, Chris Stockton wrote:
> Hello,
> 
> On Fri, Dec 23, 2011 at 5:48 AM, CGS <cg...@gmail.com> wrote:
>> Hi,
>> 
>> Sorry to interfere with such a question, but why don't you work with a
>> buffer database? I mean, make a replica to another database which filters
>> out the deleted documents. In such way you can clean all your databases and
>> you use temporary some extra-space (only during the "cleaning" process).
>> Another idea would be to use two databases: one active and one inactive at
>> the given time. That means, you move the data from one to the other,
>> filtering out the deleted documents, and when it's over, you switch to the
>> newly constructed database, while the other gets emptied (deleted and
>> re-created). Just my 2c opinions.
>> 
>> CGS
>> 
> 
> Thanks everyone for the various feedback. Now the information I have
> gathered is the disk utilization we are seeing is simply from the
> deleted documents.
> 
> The question I have yet to see answered (perhaps because it simply
> isn't possible) is how to reclaim this space?


One potential option, not sure if it's ideal for your use case, could be using the purge command on the deleted documents, followed by a compaction.

What this does is *completely forget* the document in the btree (no "tombstones"):

curl -X POST localhost:5984/db/_purge -H 'Content-Type: application/json' -d '{"id1":["rev1"], "id2":["rev2"]}'

This will thwart replication, i.e. this deletion will not propagate and the document may get pushed back into existence if another database has it.
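
The purge body maps each doc id to a list of revisions, so it can be assembled from the _changes feed instead of by hand. A minimal Python sketch, assuming the feed has already been fetched and parsed into a dict; `purge_body` is a made-up helper name and nothing here talks to a server:

```python
def purge_body(changes_response):
    """Build a POST /{db}/_purge body covering every deleted doc in a _changes response."""
    body = {}
    for row in changes_response["results"]:
        # Only rows flagged "deleted" are tombstones; live docs are skipped.
        if row.get("deleted"):
            body[row["id"]] = [change["rev"] for change in row["changes"]]
    return body
```

With tens of thousands of deleted docs it is probably wise to send the result in batches rather than as one request, and the replication caveat applies to every id in it.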

hth,
-natevw

Re: Database size seems off even after compaction runs.

Posted by Chris Stockton <ch...@gmail.com>.

Hello,

On Fri, Dec 23, 2011 at 5:48 AM, CGS <cg...@gmail.com> wrote:
> Hi,
>
> Sorry to interfere with such a question, but why don't you work with a
> buffer database? I mean, make a replica to another database which filters
> out the deleted documents. In such way you can clean all your databases and
> you use temporary some extra-space (only during the "cleaning" process).
> Another idea would be to use two databases: one active and one inactive at
> the given time. That means, you move the data from one to the other,
> filtering out the deleted documents, and when it's over, you switch to the
> newly constructed database, while the other gets emptied (deleted and
> re-created). Just my 2c opinions.
>
> CGS
>

Thanks everyone for the various feedback. Now the information I have
gathered is the disk utilization we are seeing is simply from the
deleted documents.

The question I have yet to see answered (perhaps because it simply
isn't possible) is how to reclaim this space?

-Chris

Re: Database size seems off even after compaction runs.

Posted by CGS <cg...@gmail.com>.

Hi,

Sorry to interfere with such a question, but why don't you work with a
buffer database? I mean, make a replica to another database which
filters out the deleted documents. That way you can clean all your
databases, and you only use some extra space temporarily (during the
"cleaning" process). Another idea would be to use two databases: one
active and one inactive at any given time. That means you move the data
from one to the other, filtering out the deleted documents, and when
it's over, you switch to the newly constructed database, while the other
gets emptied (deleted and re-created). Just my 2c opinions.

CGS





On 12/23/2011 01:20 PM, Henrik Lundgren wrote:
> Ok, so how do I prevent the database from consuming all diskspace in
> the long run?
>
> I'm developing an application that is quite insert heavy ( about 6 Gb
> / day ), the database is essentially a message inbox.
>
> I plan to delete obsolete messages in a houskeeping job, but if
> CouchDB will retain the latest revision of all documents I might have
> to reconsider using CouchDB, which is a pity :-(
>
> Henrik
>
> On Fri, Dec 23, 2011 at 12:36 PM, Marcello Nuccio
> <ma...@gmail.com>  wrote:
>> OK, I've added the replies from Robert and Paul to
>> http://wiki.apache.org/couchdb/FUQ
>>
>> Then it is right to say that there are informations that can't be
>> deleted from a database, for example the _id of documents?
>>
>> Thanks for the clarifications, since this behaviour was totally non
>> obvious to me.
>>
>> Marcello
>>
>> 2011/12/23 Robert Newson<rn...@apache.org>:
>>> An update to the wiki would be be very helpful.
>>>
>>> It's worth saying again that compaction does *not* remove "deleted
>>> documents’ contents". We keep the latest revision of every document
>>> ever seen, even if that revision has _deleted:true in it. This is so
>>> that replication can ensure eventual consistency between replicas. Not
>>> only will all replicas agree on which documents are present and which
>>> are not, but also the contents of both.
>>>
>>> B.
>>>
>>> On 23 December 2011 08:11, Marcello Nuccio<ma...@gmail.com>  wrote:
>>>> 2011/12/23 Paul Davis<pa...@gmail.com>:
>>>>> On Thu, Dec 22, 2011 at 7:00 PM, Jens Alfke<je...@couchbase.com>  wrote:
>>>>>> On Dec 22, 2011, at 1:44 PM, Chris Stockton wrote:
>>>>>>
>>>>>> Okay, so this catches me a bit off guard, always thought compaction
>>>>>> cleaned those up.
>>>>>>
>>>>>> Compaction removes old revisions’ and deleted documents’ contents, but their revision histories are still there. Those should be pretty small, though, since they’re just trees of revision IDs.
>>>>>>
>>>>>> (Unless you did delete the docs by just setting a “_deleted” attribute? I don’t know what the behavior of that would be; sounds like it doesn’t actually delete the document from the database, in which case maybe the last revision data does get left behind.)
>>>>>>
>>>>>> —Jens
>>>>> Deleted documents specifically allow for a body to be set in the
>>>>> deleted revision. The intention for this is to have a "who deleted
>>>>> this" type of meta data for the doc. Some client libraries delete docs
>>>>> by grabbing the current object blob, adding a '"_deleted": true'
>>>>> member, and then sending it back which inadvertently (in most cases)
>>>>> keeps the last doc body around after compaction.
>>>> Can I write these informations in the wiki?
>>>> I think it would be very useful in
>>>> http://wiki.apache.org/couchdb/Compaction
>>>> and in http://wiki.apache.org/couchdb/FUQ
>>>>
>>>> Marcello

Re: Database size seems off even after compaction runs.

Posted by Henrik Lundgren <ca...@gmail.com>.

Ok, so how do I prevent the database from consuming all diskspace in
the long run?

I'm developing an application that is quite insert heavy ( about 6 Gb
/ day ), the database is essentially a message inbox.

I plan to delete obsolete messages in a housekeeping job, but if
CouchDB will retain the latest revision of all documents I might have
to reconsider using CouchDB, which is a pity :-(

Henrik

On Fri, Dec 23, 2011 at 12:36 PM, Marcello Nuccio
<ma...@gmail.com> wrote:
> OK, I've added the replies from Robert and Paul to
> http://wiki.apache.org/couchdb/FUQ
>
> Then it is right to say that there are informations that can't be
> deleted from a database, for example the _id of documents?
>
> Thanks for the clarifications, since this behaviour was totally non
> obvious to me.
>
> Marcello
>
> 2011/12/23 Robert Newson <rn...@apache.org>:
>> An update to the wiki would be be very helpful.
>>
>> It's worth saying again that compaction does *not* remove "deleted
>> documents’ contents". We keep the latest revision of every document
>> ever seen, even if that revision has _deleted:true in it. This is so
>> that replication can ensure eventual consistency between replicas. Not
>> only will all replicas agree on which documents are present and which
>> are not, but also the contents of both.
>>
>> B.
>>
>> On 23 December 2011 08:11, Marcello Nuccio <ma...@gmail.com> wrote:
>>> 2011/12/23 Paul Davis <pa...@gmail.com>:
>>>> On Thu, Dec 22, 2011 at 7:00 PM, Jens Alfke <je...@couchbase.com> wrote:
>>>>>
>>>>> On Dec 22, 2011, at 1:44 PM, Chris Stockton wrote:
>>>>>
>>>>> Okay, so this catches me a bit off guard, always thought compaction
>>>>> cleaned those up.
>>>>>
>>>>> Compaction removes old revisions’ and deleted documents’ contents, but their revision histories are still there. Those should be pretty small, though, since they’re just trees of revision IDs.
>>>>>
>>>>> (Unless you did delete the docs by just setting a “_deleted” attribute? I don’t know what the behavior of that would be; sounds like it doesn’t actually delete the document from the database, in which case maybe the last revision data does get left behind.)
>>>>>
>>>>> —Jens
>>>>
>>>> Deleted documents specifically allow for a body to be set in the
>>>> deleted revision. The intention for this is to have a "who deleted
>>>> this" type of meta data for the doc. Some client libraries delete docs
>>>> by grabbing the current object blob, adding a '"_deleted": true'
>>>> member, and then sending it back which inadvertently (in most cases)
>>>> keeps the last doc body around after compaction.
>>>
>>> Can I write these informations in the wiki?
>>> I think it would be very useful in
>>> http://wiki.apache.org/couchdb/Compaction
>>> and in http://wiki.apache.org/couchdb/FUQ
>>>
>>> Marcello

Re: Database size seems off even after compaction runs.

Posted by Marcello Nuccio <ma...@gmail.com>.

OK, I've added the replies from Robert and Paul to
http://wiki.apache.org/couchdb/FUQ

Then it is right to say that there is information that can't be
deleted from a database, for example the _id of documents?

Thanks for the clarifications, since this behaviour was totally non
obvious to me.

Marcello

2011/12/23 Robert Newson <rn...@apache.org>:
> An update to the wiki would be be very helpful.
>
> It's worth saying again that compaction does *not* remove "deleted
> documents’ contents". We keep the latest revision of every document
> ever seen, even if that revision has _deleted:true in it. This is so
> that replication can ensure eventual consistency between replicas. Not
> only will all replicas agree on which documents are present and which
> are not, but also the contents of both.
>
> B.
>
> On 23 December 2011 08:11, Marcello Nuccio <ma...@gmail.com> wrote:
>> 2011/12/23 Paul Davis <pa...@gmail.com>:
>>> On Thu, Dec 22, 2011 at 7:00 PM, Jens Alfke <je...@couchbase.com> wrote:
>>>>
>>>> On Dec 22, 2011, at 1:44 PM, Chris Stockton wrote:
>>>>
>>>> Okay, so this catches me a bit off guard, always thought compaction
>>>> cleaned those up.
>>>>
>>>> Compaction removes old revisions’ and deleted documents’ contents, but their revision histories are still there. Those should be pretty small, though, since they’re just trees of revision IDs.
>>>>
>>>> (Unless you did delete the docs by just setting a “_deleted” attribute? I don’t know what the behavior of that would be; sounds like it doesn’t actually delete the document from the database, in which case maybe the last revision data does get left behind.)
>>>>
>>>> —Jens
>>>
>>> Deleted documents specifically allow for a body to be set in the
>>> deleted revision. The intention for this is to have a "who deleted
>>> this" type of meta data for the doc. Some client libraries delete docs
>>> by grabbing the current object blob, adding a '"_deleted": true'
>>> member, and then sending it back which inadvertently (in most cases)
>>> keeps the last doc body around after compaction.
>>
>> Can I write these informations in the wiki?
>> I think it would be very useful in
>> http://wiki.apache.org/couchdb/Compaction
>> and in http://wiki.apache.org/couchdb/FUQ
>>
>> Marcello

Re: Database size seems off even after compaction runs.

Posted by Robert Newson <rn...@apache.org>.

An update to the wiki would be very helpful.

It's worth saying again that compaction does *not* remove "deleted
documents’ contents". We keep the latest revision of every document
ever seen, even if that revision has _deleted:true in it. This is so
that replication can ensure eventual consistency between replicas. Not
only will all replicas agree on which documents are present and which
are not, but also the contents of both.

B.

On 23 December 2011 08:11, Marcello Nuccio <ma...@gmail.com> wrote:
> 2011/12/23 Paul Davis <pa...@gmail.com>:
>> On Thu, Dec 22, 2011 at 7:00 PM, Jens Alfke <je...@couchbase.com> wrote:
>>>
>>> On Dec 22, 2011, at 1:44 PM, Chris Stockton wrote:
>>>
>>> Okay, so this catches me a bit off guard, always thought compaction
>>> cleaned those up.
>>>
>>> Compaction removes old revisions’ and deleted documents’ contents, but their revision histories are still there. Those should be pretty small, though, since they’re just trees of revision IDs.
>>>
>>> (Unless you did delete the docs by just setting a “_deleted” attribute? I don’t know what the behavior of that would be; sounds like it doesn’t actually delete the document from the database, in which case maybe the last revision data does get left behind.)
>>>
>>> —Jens
>>
>> Deleted documents specifically allow for a body to be set in the
>> deleted revision. The intention for this is to have a "who deleted
>> this" type of meta data for the doc. Some client libraries delete docs
>> by grabbing the current object blob, adding a '"_deleted": true'
>> member, and then sending it back which inadvertently (in most cases)
>> keeps the last doc body around after compaction.
>
> Can I write these informations in the wiki?
> I think it would be very useful in
> http://wiki.apache.org/couchdb/Compaction
> and in http://wiki.apache.org/couchdb/FUQ
>
> Marcello

Re: Database size seems off even after compaction runs.

Posted by Marcello Nuccio <ma...@gmail.com>.

2011/12/23 Paul Davis <pa...@gmail.com>:
> On Thu, Dec 22, 2011 at 7:00 PM, Jens Alfke <je...@couchbase.com> wrote:
>>
>> On Dec 22, 2011, at 1:44 PM, Chris Stockton wrote:
>>
>> Okay, so this catches me a bit off guard, always thought compaction
>> cleaned those up.
>>
>> Compaction removes old revisions’ and deleted documents’ contents, but their revision histories are still there. Those should be pretty small, though, since they’re just trees of revision IDs.
>>
>> (Unless you did delete the docs by just setting a “_deleted” attribute? I don’t know what the behavior of that would be; sounds like it doesn’t actually delete the document from the database, in which case maybe the last revision data does get left behind.)
>>
>> —Jens
>
> Deleted documents specifically allow for a body to be set in the
> deleted revision. The intention for this is to have a "who deleted
> this" type of meta data for the doc. Some client libraries delete docs
> by grabbing the current object blob, adding a '"_deleted": true'
> member, and then sending it back which inadvertently (in most cases)
> keeps the last doc body around after compaction.

Can I write this information in the wiki?
I think it would be very useful in
http://wiki.apache.org/couchdb/Compaction
and in http://wiki.apache.org/couchdb/FUQ

Marcello

Re: Database size seems off even after compaction runs.

Posted by Paul Davis <pa...@gmail.com>.

On Thu, Dec 22, 2011 at 7:00 PM, Jens Alfke <je...@couchbase.com> wrote:
>
> On Dec 22, 2011, at 1:44 PM, Chris Stockton wrote:
>
> Okay, so this catches me a bit off guard, always thought compaction
> cleaned those up.
>
> Compaction removes old revisions’ and deleted documents’ contents, but their revision histories are still there. Those should be pretty small, though, since they’re just trees of revision IDs.
>
> (Unless you did delete the docs by just setting a “_deleted” attribute? I don’t know what the behavior of that would be; sounds like it doesn’t actually delete the document from the database, in which case maybe the last revision data does get left behind.)
>
> —Jens

Deleted documents specifically allow for a body to be set in the
deleted revision. The intention for this is to have a "who deleted
this" type of meta data for the doc. Some client libraries delete docs
by grabbing the current object blob, adding a '"_deleted": true'
member, and then sending it back which inadvertently (in most cases)
keeps the last doc body around after compaction.

Re: Database size seems off even after compaction runs.

Posted by Jens Alfke <je...@couchbase.com>.

On Dec 22, 2011, at 1:44 PM, Chris Stockton wrote:

> Okay, so this catches me a bit off guard, always thought compaction
> cleaned those up.

Compaction removes old revisions’ and deleted documents’ contents, but their revision histories are still there. Those should be pretty small, though, since they’re just trees of revision IDs.

(Unless you did delete the docs by just setting a “_deleted” attribute? I don’t know what the behavior of that would be; sounds like it doesn’t actually delete the document from the database, in which case maybe the last revision data does get left behind.)

—Jens

Re: Database size seems off even after compaction runs.

Posted by Chris Stockton <ch...@gmail.com>.

Hello,

On Thu, Dec 22, 2011 at 2:36 PM, Robert Newson <rn...@apache.org> wrote:
> Deleted docs are preserved forever (necessary for eventual consistency).
>
> If you know the doc id of a deleted doc, try GET
> host:port/dbname/docname?open_revs=all
>
> You should be able to find your deleted docs via the _changes feed, e.g.:
>
> ~ $ curl localhost:5984/foo/_changes
> {"results":[
> {"seq":2,"id":"bar","changes":[{"rev":"2-91257658886b5692a98d053bb9990b47"}],"deleted":true}
> ],
> "last_seq":2}
>
> B.
>

Robert, I thought I was becoming a couchdb expert, totally tossed out
that theory for me :p

Okay, so this catches me a bit off guard, always thought compaction
cleaned those up. Is there a good way to bulk remove all deleted docs?
Or will I need to make DELETE calls on each individual document?

Thanks a ton, I appreciate it!

-Chris

Re: Database size seems off even after compaction runs.

Posted by Robert Newson <rn...@apache.org>.

Deleted docs are preserved forever (necessary for eventual consistency).

If you know the doc id of a deleted doc, try GET
host:port/dbname/docname?open_revs=all

You should be able to find your deleted docs via the _changes feed, e.g.:

~ $ curl localhost:5984/foo/_changes
{"results":[
{"seq":2,"id":"bar","changes":[{"rev":"2-91257658886b5692a98d053bb9990b47"}],"deleted":true}
],
"last_seq":2}

B.
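
For a long feed, the deleted rows can be picked out programmatically. A small Python sketch, assuming the response has the shape shown in the curl example above; the function name deleted_ids is made up for illustration, and fetching the feed over HTTP is left out.

```python
import json


def deleted_ids(changes):
    """Return the ids of rows flagged "deleted" in a parsed _changes response."""
    return [row["id"] for row in changes["results"] if row.get("deleted")]


# The sample response from the curl example, as text.
sample = """{"results":[
{"seq":2,"id":"bar","changes":[{"rev":"2-91257658886b5692a98d053bb9990b47"}],"deleted":true}
],
"last_seq":2}"""

print(deleted_ids(json.loads(sample)))
```

Rows without a "deleted" key are treated as live documents, which is why the sketch uses row.get rather than indexing.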

On 22 December 2011 21:27, Chris Stockton <ch...@gmail.com> wrote:
> Hello Robert,
>
> On Thu, Dec 22, 2011 at 2:20 PM, Robert Newson <rn...@apache.org> wrote:
>> Deleted docs take space. If you used the DELETE http method, then it's
>> minimal, but if you just added _deleted:true to your document and
>> saved, then it contains all the data of the previous revision,
>> including all attachments.
>>
>
> Thanks for the response, every single doc listed in _all_docs (only 25
> docs) is on the first revision. Is what you are saying that the user
> could have deleted 49K docs and although they don't exist in _all_docs
> anymore, they are physically on the disk and COMPACTION is not the way
> to get rid of them?
>
> Do you know how I can discard these old documents, or even query to see them?

Re: Database size seems off even after compaction runs.

Posted by Chris Stockton <ch...@gmail.com>.

Hello Robert,

On Thu, Dec 22, 2011 at 2:20 PM, Robert Newson <rn...@apache.org> wrote:
> Deleted docs take space. If you used the DELETE http method, then it's
> minimal, but if you just added _deleted:true to your document and
> saved, then it contains all the data of the previous revision,
> including all attachments.
>

Thanks for the response, every single doc listed in _all_docs (only 25
docs) is on the first revision. Is what you are saying that the user
could have deleted 49K docs and although they don't exist in _all_docs
anymore, they are physically on the disk and COMPACTION is not the way
to get rid of them?

Do you know how I can discard these old documents, or even query to see them?

Re: Database size seems off even after compaction runs.

Posted by Robert Newson <rn...@apache.org>.

Deleted docs take space. If you used the DELETE http method, then it's
minimal, but if you just added _deleted:true to your document and
saved, then it contains all the data of the previous revision,
including all attachments.

B
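
As a rough sanity check on the numbers reported in this thread, the Python sketch below works out what share of the documents are deleted and how many bytes of file there are per document. It uses only the doc_count, doc_del_count and disk_size values from the database info posted by Chris; the per-document figure is an average over the whole file, which also holds index and header overhead, so treat it as an estimate only.

```python
import json

# Values copied from the database info posted in this thread.
info = json.loads('{"doc_count":49,"doc_del_count":42981,"disk_size":11427940}')

total = info["doc_count"] + info["doc_del_count"]
deleted_share = info["doc_del_count"] / total
bytes_per_doc = info["disk_size"] / total

print(f"{total} documents in total, {deleted_share:.1%} of them deleted")
print(f"roughly {bytes_per_doc:.0f} bytes of file per document, live or deleted")
```

With almost all documents being deleted ones, even a few hundred bytes kept per deleted document is enough to account for a file of this size.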

On 22 December 2011 21:01, Chris Stockton <ch...@gmail.com> wrote:
> We have a customer using 11MB of disk space, not much but it so
> happens the user is using more like 100KB of data. One thing we have
> noticed is that the user's doc_del_count is very high, indicating he
> likely had 11MB of data but later deleted it.
>
> {"db_name":"...snip...","doc_count":49,"doc_del_count":42981,"update_seq":86097,"purge_seq":0,"compact_running":false,"disk_size":11427940,"instance_start_time":"1324102884349705","disk_format_version":5,"committed_update_seq":86097}
>
> After compaction the size does not change, the physical file on the
> file system is verified 11MB.
>
> Viewing _all_docs shows a page and a half of data, viewing futon shows
> the user really is using around 100kb max.
>
> Any guesses here?
>
> -Chris