Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2015/05/07 15:14:00 UTC

[jira] [Commented] (OAK-2808) Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC

    [ https://issues.apache.org/jira/browse/OAK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532576#comment-14532576 ] 

Thomas Mueller commented on OAK-2808:
-------------------------------------

Because Lucene is not a component we develop ourselves, we can't really rely on Lucene files to be "globally" unique, unless we make them unique ourselves. Otherwise we might delete data that is used by another component (another Lucene index for example, that by chance uses files with the same content). You can't rely on timing data (last modified), as you don't know when files go into the data store.

What we could do is: for each index, compute a random UUID and store it with the index (just once, when creating the index). When writing and when reading, XOR the first few bytes of each block of each index file with that UUID (a block being, for example, 4 KB). That would make each block of each index file unique. It would work with the FileDataStore and the BlobStore, without increasing the file size and without measurably impacting performance. The disadvantage is that if you lose the UUID, the Lucene file is corrupt. I'm not sure if XOR is really secure, but it might be OK for what we need.
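The XOR idea above could be sketched roughly like this. This is a hypothetical helper, not actual Oak code; the 4 KB block size and the 16-byte mask (the raw UUID bytes) are just the example values from the comment:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical sketch: XOR the first 16 bytes of every 4 KB block with a
// per-index UUID, so that two indexes containing byte-identical files still
// produce distinct blocks in the DataStore.
public class UuidBlockScrambler {

    static final int BLOCK_SIZE = 4096; // example block size from the comment

    private final byte[] mask; // the 16 raw bytes of the index UUID

    public UuidBlockScrambler(UUID uuid) {
        ByteBuffer b = ByteBuffer.allocate(16);
        b.putLong(uuid.getMostSignificantBits());
        b.putLong(uuid.getLeastSignificantBits());
        this.mask = b.array();
    }

    // XOR is its own inverse, so the same method serves both the write path
    // (scramble before storing) and the read path (unscramble after loading).
    public void transform(byte[] data) {
        for (int blockStart = 0; blockStart < data.length; blockStart += BLOCK_SIZE) {
            int n = Math.min(mask.length, data.length - blockStart);
            for (int i = 0; i < n; i++) {
                data[blockStart + i] ^= mask[i];
            }
        }
    }
}
```

Because the transform is an involution, losing the stored UUID makes the scrambled file unreadable, which is exactly the corruption risk the comment mentions.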

> 	Active deletion of 'deleted' Lucene index files from DataStore without relying on full scale Blob GC
> -----------------------------------------------------------------------------------------------------
>
>                 Key: OAK-2808
>                 URL: https://issues.apache.org/jira/browse/OAK-2808
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: lucene
>            Reporter: Chetan Mehrotra
>              Labels: datastore, performance
>             Fix For: 1.3.0
>
>         Attachments: copyonread-stats.png
>
>
> With the storing of Lucene index files in the DataStore, our usage
> pattern of the DataStore has changed between JR2 and Oak.
> With JR2 the writes were mostly application driven, i.e. if the
> application stored a pdf/image file then that would be stored in the
> DataStore; JR2 by default would not write anything else to the
> DataStore. Further, in deployments where a large amount of binary
> content is present, systems tend to share the DataStore to avoid
> duplication of storage. In such cases running Blob GC is a non-trivial
> task, as it involves a manual step and coordination across multiple
> deployments. Due to this, systems tend to run GC less frequently.
> Now with Oak, apart from the application, the Oak system itself
> *actively* uses the DataStore to store the Lucene index files, and
> there the churn might be much higher, i.e. the frequency of creation
> and deletion of index files is a lot higher. This would accelerate the
> rate of garbage generation and thus put a lot more pressure on the
> DataStore storage requirements.
> Discussion thread http://markmail.org/thread/iybd3eq2bh372zrl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)