Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2015/09/24 15:47:04 UTC

[jira] [Commented] (OAK-2808) Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

    [ https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906356#comment-14906356 ] 

Thomas Mueller commented on OAK-2808:
-------------------------------------

http://svn.apache.org/r1705054 (trunk), partial implementation. Remarks:

* Disabled by default (DEFAULT_ACTIVE_DELETE = -1). 
* Deletion of entries is not yet implemented; I guess this would have to be done in oak-core.
* To enable, set the property "activeDelete" to, for example, 3600 (1 hour), for an index of type "lucene".
* For manual testing, I removed the "async" = "async" flag from the index; this creates garbage quickly.
* A new child node ":trash" is created, with child nodes "run_1", "run_2",... and a property "index" for the next id (see the sketch below).
* For each file, a new property "uniqueKey" (16 bytes) is created, and that key is appended to the file (it is ignored when reading).
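
To make the ":trash" bookkeeping above a bit more concrete, here is a minimal sketch of how a batch of deleted files could be recorded with the NodeBuilder API. The class name, the method, and the "time" property are illustrative assumptions, not the committed code:

{code:java}
import java.util.List;

import org.apache.jackrabbit.oak.api.Type;
import org.apache.jackrabbit.oak.spi.state.NodeBuilder;

// Illustrative sketch only; the actual layout written by r1705054 may differ.
public class TrashRecordSketch {

    /** Records one batch of deleted index files under the ":trash" node. */
    static void recordDeletedFiles(NodeBuilder indexNode, List<String> fileNames) {
        NodeBuilder trash = indexNode.child(":trash");
        // "index" holds the id to use for the next run_<n> child
        long nextId = trash.hasProperty("index")
                ? trash.getProperty("index").getValue(Type.LONG)
                : 1;
        NodeBuilder run = trash.child("run_" + nextId);
        // assumed: remember when the run was created, so a later cleanup
        // can decide whether the binaries are old enough to delete
        run.setProperty("time", System.currentTimeMillis());
        for (String name : fileNames) {
            // one child per deleted file; a reference to the binary
            // (or its uniqueKey) would be stored here as well
            run.child(name);
        }
        trash.setProperty("index", nextId + 1);
    }
}
{code}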

The block size per binary increased by 16 bytes. I wonder if, for MongoDB, it would be better to use 1024 bytes less, as we do for the MongoBlobStore, because MongoDB rounds up the space allocated for a record to the next power of two (there is an overhead per record; let's assume it is at most 1 KB).
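
For illustration, a rough calculation of that rounding effect; the 2 MB block size and the exact per-record overhead are assumptions for the example, not measured values:

{code:java}
// Rough illustration of MongoDB's power-of-two record allocation.
// The 2 MB block size and 1 KB overhead are assumptions, not measured values.
public class RecordPaddingExample {
    public static void main(String[] args) {
        long blockSize = 2L * 1024 * 1024; // assumed block size: 2 MB
        long overhead = 1024;              // assumed: unique key + record overhead, at most 1 KB

        long record = blockSize + overhead;                    // just over 2 MB
        long allocated = Long.highestOneBit(record - 1) << 1;  // next power of two
        System.out.println(allocated);                         // 4 MB: almost half is padding

        // storing 1024 bytes less per block keeps the record within 2 MB
        long smallerRecord = (blockSize - 1024) + overhead;
        long smallerAllocated = Long.highestOneBit(smallerRecord - 1) << 1;
        System.out.println(smallerAllocated);                  // 2 MB: padding avoided
    }
}
{code}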

The missing "delete trash" feature would need to periodically (and asynchronously) read the first "run_.." entries, and delete the binaries if needed. It would probably have to maintain a "deleteIndex", similar to the "index" property used to create new entries.

> Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
> ----------------------------------------------------------------------------------------------------
>
>                 Key: OAK-2808
>                 URL: https://issues.apache.org/jira/browse/OAK-2808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Thomas Mueller
>              Labels: datastore, performance
>             Fix For: 1.3.7
>
>         Attachments: OAK-2808-1.patch, copyonread-stats.png
>
>
> With Lucene index files now being stored in the DataStore, our usage
> pattern of the DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application driven, i.e. if the
> application stored a pdf/image file then that file would end up in the
> DataStore; JR2 itself would by default not write anything to the
> DataStore. Further, in deployments where a large amount of binary
> content is present, systems tend to share the DataStore to avoid
> duplicating storage. In such cases running Blob GC is a non-trivial
> task, as it involves a manual step and coordination across multiple
> deployments. Due to this, systems tend to run GC less frequently.
> Now with Oak, apart from the application, the Oak system itself
> *actively* uses the DataStore to store the Lucene index files, and
> there the churn can be much higher, i.e. index files are created and
> deleted far more often. This accelerates the rate of garbage
> generation and thus puts a lot more pressure on the DataStore storage
> requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)