Posted to oak-issues@jackrabbit.apache.org by "Thomas Mueller (JIRA)" <ji...@apache.org> on 2014/06/05 11:28:02 UTC
[jira] [Comment Edited] (OAK-1849) DataStore GC support for heterogeneous deployments using a shared datastore

    [ https://issues.apache.org/jira/browse/OAK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018616#comment-14018616 ] 

Thomas Mueller edited comment on OAK-1849 at 6/5/14 9:26 AM:
-------------------------------------------------------------

What you describe above is the approach we used to share data stores in Jackrabbit 2.x. For the FileDataStore, we used the lastModified field of each file. For large data stores, updating this field takes quite a long time, because the metadata of every file needs to be changed. In the past, this turned out to be a performance problem.

To speed up garbage collection, I suggest we use a slightly different mechanism (except in cases where we share a datastore with a Jackrabbit 2.x repository):

# We use {{collectGarbage(boolean markOnly)}}, the same as what you described above. If the flag is {{true}}, the list of used blob ids is written to a flat file in the root directory of the data store (using a random file name) during or at the end of the {{mark}} phase.
# If {{markOnly}} is {{false}}, the {{sweep()}} method additionally needs to check the root directory of the data store and process all flat files stored there, combining the lists if there are multiple. Blobs whose ids appear in the list(s) must not be deleted. At the end of the sweep phase, the processed files may be removed.
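
The two steps above could be sketched roughly as follows. This is only an illustration of the idea, not the Oak implementation: the class name {{SharedGcSketch}}, the {{gc-refs-}} file prefix, and the assumption that each blob is a plain file in the data store root named after its blob id are all hypothetical and chosen to keep the example small.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

/** Sketch only: blobs are assumed to be files in the root, named by blob id. */
class SharedGcSketch {

    // Hypothetical naming convention for the flat reference files.
    static final String REFS_PREFIX = "gc-refs-";

    /** Mark phase: write the blob ids this instance still uses to a randomly named file. */
    static Path mark(Path root, Collection<String> usedIds) throws IOException {
        Path refs = root.resolve(REFS_PREFIX + UUID.randomUUID() + ".txt");
        return Files.write(refs, usedIds); // one blob id per line
    }

    /** Sweep phase: combine all reference files, delete blobs listed in none of them. */
    static Set<String> sweep(Path root) throws IOException {
        Set<String> used = new HashSet<>();
        List<Path> refFiles = new ArrayList<>();
        List<Path> blobs = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path p : entries) {
                if (p.getFileName().toString().startsWith(REFS_PREFIX)) {
                    refFiles.add(p);
                    used.addAll(Files.readAllLines(p)); // union of all lists
                } else if (Files.isRegularFile(p)) {
                    blobs.add(p);
                }
            }
        }
        Set<String> deleted = new TreeSet<>();
        for (Path blob : blobs) {
            String id = blob.getFileName().toString();
            if (!used.contains(id)) { // referenced blobs must not be deleted
                Files.delete(blob);
                deleted.add(id);
            }
        }
        // The processed reference files may be removed once the sweep is done.
        for (Path f : refFiles) {
            Files.delete(f);
        }
        return deleted;
    }

    /** Always marks; sweeps only when markOnly is false. Returns the deleted blob ids. */
    static Set<String> collectGarbage(Path root, Collection<String> usedIds,
                                      boolean markOnly) throws IOException {
        mark(root, usedIds);
        return markOnly ? new TreeSet<>() : sweep(root);
    }
}
```

In this sketch, every instance sharing the data store except one would call {{collectGarbage(..., true)}}, and the remaining instance would call {{collectGarbage(..., false)}} afterwards. The sketch ignores blobs added between mark and sweep, which a real implementation would have to protect.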



was (Author: tmueller):
What you describe above is the approach we used to share data stores in Jackrabbit 2.x. For the FileDataStore, we used the lastModified field of each file. For large data stores, updating this field takes quite a long time, because the metadata of every file needs to be changed. In the past, this turned out to be a performance problem.

To speed up garbage collection, I suggest we use a slightly different mechanism (except in cases where we share a datastore with a Jackrabbit 2.x repository):

# We use {{collectGarbage(boolean markOnly)}}, the same as what you described above.
# If the flag is {{true}}, the list of used blob ids is written to a flat file in the root directory of the data store (using a random file name).
# If {{markOnly}} is {{false}}, the {{sweep()}} method additionally needs to check the root directory of the data store and process all flat files stored there, combining the lists if there are multiple. Blobs whose ids appear in the list(s) must not be deleted. At the end of the sweep phase, the processed files may be removed.


> DataStore GC support for heterogeneous deployments using a shared datastore
> ---------------------------------------------------------------------------
>
>                 Key: OAK-1849
>                 URL: https://issues.apache.org/jira/browse/OAK-1849
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>            Reporter: Amit Jain
>
> If the deployment is such that there are 2 or more different instances with a shared datastore, triggering Datastore GC from one instance will result in blobs used by another instance getting deleted, causing data loss.


