Posted to oak-issues@jackrabbit.apache.org by "Davide Giannella (JIRA)" <ji...@apache.org> on 2016/06/08 13:30:21 UTC

[jira] [Updated] (OAK-4200) [BlobGC] Improve collection times of blobs available

     [ https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davide Giannella updated OAK-4200:
----------------------------------
    Fix Version/s: 1.6

> [BlobGC] Improve collection times of blobs available
> ----------------------------------------------------
>
>                 Key: OAK-4200
>                 URL: https://issues.apache.org/jira/browse/OAK-4200
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>            Reporter: Amit Jain
>            Assignee: Amit Jain
>             Fix For: 1.6, 1.5.4
>
>
> The blob collection phase (identifying all the blobs available in the data store) is quite an expensive part of the whole GC process, sometimes taking a few hours on large repositories, because it iterates over the sub-folders in the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up that this phase can be faster if
> * Blob ids are tracked when the blobs are added, e.g. in a simple file in the data store per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered more frequently. It would be OK to miss blob id additions to this file during a crash etc., as these blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered occasionally.
> We may also be able to track other metadata along with the blob ids, such as paths and timestamps, for auditing/analytics, in conjunction with OAK-3140.
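
The proposal in the description can be illustrated with a minimal sketch. It is not Oak's actual implementation: the class name BlobIdTrackerSketch, the "blobids-<node>.txt" file naming, and all method names are hypothetical, chosen only to show the three steps (append ids per cluster node, consolidate the per-node files, subtract the referenced ids to get GC candidates).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the idea in this issue; none of these names are Oak APIs.
final class BlobIdTrackerSketch {
    private final Path nodeFile;

    // Assumption: each cluster node writes to its own file in a shared directory.
    BlobIdTrackerSketch(Path trackingDir, String clusterNodeId) {
        this.nodeFile = trackingDir.resolve("blobids-" + clusterNodeId + ".txt");
    }

    // Called when a blob is added: append its id as one line to this node's file.
    synchronized void track(String blobId) throws IOException {
        Files.write(nodeFile,
                Collections.singletonList(blobId),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // GC side: merge the files of all cluster nodes into one sorted, de-duplicated set.
    static Set<String> consolidate(Path trackingDir) throws IOException {
        Set<String> all = new TreeSet<>();
        try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(trackingDir, "blobids-*.txt")) {
            for (Path file : files) {
                all.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        }
        return all;
    }

    // GC candidates are the tracked ids that the mark phase did not find referenced.
    static Set<String> gcCandidates(Set<String> tracked, Set<String> referenced) {
        Set<String> candidates = new TreeSet<>(tracked);
        candidates.removeAll(referenced);
        return candidates;
    }
}
```

In this sketch an id that never reaches the file (for example because of a crash) simply never becomes a candidate here, which matches the description's point that such blobs are left for the regular MarkSweepGC cycle.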



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)