Posted to dev@jackrabbit.apache.org by uv <vl...@gmail.com> on 2014/09/16 13:50:01 UTC

Jackrabbit GC on huge MySQL Database

Hi, 

our system uses Jackrabbit 2.6.5 with a MySQL database datastore. The
Jackrabbit DB schema is about 300 GB, most of it in the datastore. When we
run the Jackrabbit garbage collector, it takes almost 3 days, and running
it has a significant impact on application performance. 
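
For reference, we invoke it roughly as in the sketch below (only a sketch:
the DataStoreGarbageCollector instance is assumed to come from the
repository's management API, and the sleep value is just an example):

    import org.apache.jackrabbit.api.management.DataStoreGarbageCollector;

    public class DataStoreGc {

        // Runs a full mark/sweep cycle on the data store.
        public static void runGc(DataStoreGarbageCollector gc) throws Exception {
            try {
                // Sleep between scanned nodes to reduce load on the running
                // application (0 = scan as fast as possible).
                gc.setSleepBetweenNodes(10);

                // Mark phase: traverses the repository and touches every
                // binary that is still referenced.
                gc.mark();

                // Sweep phase: deletes data store records not touched above.
                int deleted = gc.sweep();
                System.out.println("Deleted " + deleted + " data store records");
            } finally {
                gc.close();
            }
        }
    }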

Could you please advise what options we have? 

Could we somehow split the GC so that it does not have to iterate through
the whole datastore in one run? When the GC does not finish completely, we
cannot run the datastore cleanup, because we cannot be sure what has been
scanned and what has not. 

Or is there any other GC implementation? 


Thank you very much. 

Vlastimil




Re: Jackrabbit GC on huge MySQL Database

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

In Jackrabbit Oak, we have a different (much, much faster) approach to
garbage collection, but there is no plan to backport it to Jackrabbit 2.x.
The approach is: scan the repository (not a node traversal, but a low-level
scan of the persistent storage) for blob ids. Then get the list of blobs
from the data store, and delete those that are not in the list of blob ids
in use.
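
Sketched in Java, the sweep step would look something like this (just a
sketch; the two inputs are placeholders for whatever your persistence
manager and data store can produce, they are not existing Jackrabbit APIs):

    import java.util.Iterator;
    import java.util.Set;

    // Delete every blob in the data store whose id is not referenced by
    // any stored node. Both inputs can be produced by sequential scans.
    public class BlobIdGarbageCollector {

        // placeholder for the data store delete operation
        public interface BlobDeleter {
            void delete(String blobId);
        }

        public static int sweep(Set<String> referencedBlobIds,
                                Iterator<String> allBlobIdsInDataStore,
                                BlobDeleter deleter) {
            int deleted = 0;
            while (allBlobIdsInDataStore.hasNext()) {
                String id = allBlobIdsInDataStore.next();
                if (!referencedBlobIds.contains(id)) {
                    deleter.delete(id);
                    deleted++;
                }
            }
            return deleted;
        }
    }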

This is much faster mainly for two reasons: first (and most importantly),
it avoids random-access reads (the primary key for nodes in Jackrabbit 2.x
is randomly distributed; this is no longer the case for the default storage
engines in Jackrabbit Oak). Second, it avoids marking all binaries that are
still in use.

You could implement this for Jackrabbit 2.x, or you could switch to
Jackrabbit Oak.

Regards,
Thomas

