Posted to user@couchdb.apache.org by Gabriel de Oliveira Barbosa <ma...@gmail.com> on 2014/03/07 05:07:15 UTC

Bulk deletes and disk size

Hello,

In our production system we do a "bulk delete" by updating docs with
"_deleted:true".
This is done in an on-demand aggregation process, so it deletes a lot of
documents per minute and updates the daily aggregated doc in the database.
At the end of the day we should have 1 aggregated doc with 7k _revs
(compacted periodically) and 7k deleted documents.

But our disk usage is growing much too fast; it looks like the deleted
documents are still there.

After my "bulk delete" process (which lets the deletes replicate to other
instances) I could run a purge operation to really remove the deleted docs
from the disk.

I'm afraid that a purge operation would increase my query time (reindexing)
and, as a consequence, overload my server resources.

What should I do in this case?

Thanks

Re: Bulk deletes and disk size

Posted by Gabriel de Oliveira Barbosa <ma...@gmail.com>.
Looking at the CouchDB docs, there are only examples using POST for bulk
updates, so are real bulk deletes actually possible?
http://docs.couchdb.org/en/latest/api/database/bulk-api.html?highlight=bulk#post--db-_bulk_docs



Re: Bulk deletes and disk size

Posted by Manobi <ma...@gmail.com>.
Thanks Robert,

I understand what you are saying, but I'm now sure that no additional fields are being preserved on my side.
Before deleting, I query for the docs to delete, and the view returns only the _id and _rev needed to perform the update.
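
A minimal sketch of that step, assuming the view emits the revision as its value (something like emit(doc._id, doc._rev)). The helper name rowsToTombstones and the sample rows are hypothetical and are not taken from the gist or from cradle:

```javascript
// Hypothetical helper: turn view rows into minimal deletion stubs.
// Assumes each row looks like { id, key, value } with the revision in `value`,
// i.e. the view's map function did something like emit(doc._id, doc._rev).
function rowsToTombstones(rows) {
  return rows.map(function (row) {
    return { _id: row.id, _rev: row.value, _deleted: true };
  });
}

// Sample rows in the { id, key, value } shape of a view result.
var rows = [
  { id: 'foo', key: 'foo', value: '1-abc' },
  { id: 'bar', key: 'bar', value: '3-def' }
];
console.log(JSON.stringify(rowsToTombstones(rows)));
```

Because the stubs are built from scratch rather than by modifying fetched documents, nothing from the original bodies can leak into them at this stage.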

The only explanation I can see for this disk size is that the cradle library is doing a document merge instead of a replace in its save method.

I'll keep investigating and appreciate your help.


Re: Bulk deletes and disk size

Posted by Robert Samuel Newson <rn...@apache.org>.
Gabriel,

In that example you’re making the right change, but if those objects in the docs array had more properties, they would be preserved (forever).

"I though that compactation process would remove the body from the document
marked with "_deleted"." - This is not true.

Deleting docs via _bulk_docs can be exactly the same as deleting via -X DELETE, but only if you keep just the three fields you need: _id, _rev and _deleted.

I forked your gist to illustrate: https://gist.github.com/rnewson/9427862

Here I map any document to a doc with the minimal state necessary to delete the document. The difference is clear, I hope.
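
A minimal sketch of such a mapping follows. It is not the contents of the gist; the helper name toTombstone and the sample document fields are made up for illustration:

```javascript
// Hypothetical helper: keep only the three fields a deletion needs.
function toTombstone(doc) {
  return { _id: doc._id, _rev: doc._rev, _deleted: true };
}

// A document that still carries its body (field names are made up).
var full = { _id: 'foo', _rev: '2-bar', type: 'event', payload: 'x'.repeat(1000) };

// Setting _deleted on the full doc keeps the body fields in the object you send;
// mapping it first leaves nothing extra to preserve.
var kept = Object.assign({}, full, { _deleted: true });
var minimal = toTombstone(full);

console.log(Object.keys(kept));    // includes type and payload
console.log(Object.keys(minimal)); // only _id, _rev, _deleted
```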

B.



Re: Bulk deletes and disk size

Posted by Gabriel de Oliveira Barbosa <ma...@gmail.com>.
Thanks Jens, but I think I'm already doing this.
Please check my gist which shows my logic:
https://gist.github.com/manobi/9425208#file-bulk_delete-js

Do you think the problem is with the method used to update the docs using
the CouchDB Bulk API?

Re: Bulk deletes and disk size

Posted by Jens Alfke <je...@couchbase.com>.
On Mar 7, 2014, at 6:19 PM, Gabriel de Oliveira Barbosa <ma...@gmail.com> wrote:

> I though that compactation process would remove the body from the document
> marked with "_deleted".

Nope. It’s sometimes useful to preserve all or part of the body along with a deleted document (for example, the reason for the deletion or the user who did it.)

If you want the attachments to get garbage collected, you’ll at least have to remove the _attachments dictionary from the document when you add the _deleted property. But for maximum space savings, update the doc to contain nothing but _deleted:true.
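
As a rough sketch of both options, assuming a bulk update where _id and _rev still have to accompany _deleted: the helper name tombstoneWithNote and the deleted_note field are illustrative only, not a CouchDB convention.

```javascript
// Hypothetical helper: build a tombstone that optionally keeps a short note
// but drops everything else from the document, including _attachments.
function tombstoneWithNote(doc, note) {
  var stub = { _id: doc._id, _rev: doc._rev, _deleted: true };
  if (note !== undefined) {
    stub.deleted_note = note; // `deleted_note` is an illustrative field name
  }
  return stub;
}

// Sample document with a body and an inline attachment (values are made up).
var doc = {
  _id: 'foo',
  _rev: '1-bar',
  body: 'some large body',
  _attachments: { 'photo.png': { content_type: 'image/png', data: 'aGVsbG8=' } }
};

console.log(tombstoneWithNote(doc, 'expired')); // keeps only the note
console.log(tombstoneWithNote(doc));            // nothing but _id, _rev and _deleted
```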

—Jens

Re: Bulk deletes and disk size

Posted by Gabriel de Oliveira Barbosa <ma...@gmail.com>.
I thought that the compaction process would remove the body from documents
marked with "_deleted".
These deleted docs (marked with _deleted) do not have previous revisions or
attachments, so is the disk size the result of millions of documents with
their bodies left behind in the database?

So to be effective, does my "bulk delete" have to be a PUT request? I'm
already sending {"_id":"foo", "_rev":"bar", "_deleted":true}, but through
node.js cradle's db.save([doc1,doc2,doc3]), and I'm not sure whether it uses
PUT or POST behind the scenes.



Re: Bulk deletes and disk size

Posted by Robert Samuel Newson <rn...@apache.org>.
The user says he just added _deleted:true to his documents, which marks the document as deleted but will forever preserve all values in the document, including attachments. You’re right that compaction will remove bodies and attachments from non-leaf revisions, however.

B.



Re: Bulk deletes and disk size

Posted by Jens Alfke <je...@couchbase.com>.
On Mar 7, 2014, at 1:47 AM, Robert Samuel Newson <rn...@apache.org> wrote:

> Adding _deleted:true marks the document as deleted only, it does not remove the body or the attachments. This is why your disk usage has not reduced; you haven’t reduced the size of your documents.

Gabriel did say the database is “compacted periodically”. So that should be getting rid of the old bodies and attachments.

—Jens

Re: Bulk deletes and disk size

Posted by Robert Samuel Newson <rn...@apache.org>.
Hi,

Adding _deleted:true only marks the document as deleted; it does not remove the body or the attachments. This is why your disk usage has not gone down: you haven’t reduced the size of your documents.

When you delete a document with the DELETE HTTP method you are really doing a PUT with just {"_id":"foo", "_rev":"bar", "_deleted":true}, so your "bulk delete" operation should do the same if you don’t intend to keep the data.
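
A minimal sketch of what the equivalent bulk request could look like, assuming the documented POST /{db}/_bulk_docs endpoint with a {"docs": [...]} body. The helper name bulkDeleteRequest, the returned object shape and the database name are hypothetical, and nothing is actually sent:

```javascript
// Hypothetical helper: describe the HTTP request for a bulk delete.
// Each doc is reduced to _id, _rev and _deleted before it goes into the body.
function bulkDeleteRequest(dbName, docs) {
  var body = JSON.stringify({
    docs: docs.map(function (doc) {
      return { _id: doc._id, _rev: doc._rev, _deleted: true };
    })
  });
  return {
    method: 'POST',
    path: '/' + encodeURIComponent(dbName) + '/_bulk_docs',
    headers: { 'Content-Type': 'application/json' },
    body: body
  };
}

// The extra `stale` field is dropped from the request body.
var req = bulkDeleteRequest('mydb', [{ _id: 'foo', _rev: 'bar', stale: 'data' }]);
console.log(req.method, req.path);
console.log(req.body);
```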

Definitely avoid purging until you’ve discovered whether really deleting your content addresses your disk usage.

B.
