Posted to users@jackrabbit.apache.org by Paco Avila <mo...@gmail.com> on 2009/03/09 17:36:36 UTC

Datastore and garbage collection

I've been reading http://wiki.apache.org/jackrabbit/DataStore and there is a
reference to "Running data store garbage collection". Do I actually need to
perform this task, or is it already implemented in the core?

-- 
Paco Avila
GIT Consultors
tel: +34 971 498310
fax: +34 971496189
e-mail: pavila@git.es
http://www.git.es

Re: Datastore and garbage collection

Posted by Paco Avila <mo...@gmail.com>.
This sounds interesting as a recommendation.

On Mon, Mar 9, 2009 at 7:55 PM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> If your application is for personal use and you produce less than 10GB
> of data per month, then don't worry about garbage collection.
>
> If your application is for enterprise use and your customer produces
> less than 100GB-1TB of data per month (depending on the size of the
> enterprise), then don't worry about garbage collection.
>
> BR,
>
> Jukka Zitting
>




Re: Datastore and garbage collection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Mar 9, 2009 at 6:38 PM, Paco Avila <mo...@gmail.com> wrote:
> Do you mean that GC only makes sense if I delete documents from the
> repository?

Yes. I would even say that GC only makes sense if 1) you delete
significant amounts of documents from the repository and 2) you add
documents at an *exponential* rate that exceeds the growth in storage
capacity.

> I don't think that never running GC and keeping all the documents (deleted
> ones included) is a good alternative for repositories that are several GB
> in size and hold big documents.

It depends... For example, I currently shoot about 10GB of digital
photos per month. Roughly 20% of the shots are so bad (blurry, poor
composition, overexposed, etc.) that I discard them immediately. It
would take just a few mouse clicks or a simple cron script to free up
the disk space that those discarded images take. But the extra effort
simply isn't worth it, since I will most likely have at least doubled
my storage capacity before my current 500GB hard drive is even close
to being filled up. Even the fact that I will probably only ever
publish about 10% of my photos doesn't make much of a difference,
since it costs so little to never delete anything. And I never need to
worry about accidentally removing something.
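
To put rough numbers on this example, here is a minimal sketch. It assumes a constant 10GB monthly intake and a fixed 500GB disk, and it deliberately ignores the growth in storage capacity mentioned above. The class and method names are made up for illustration.

```java
/** Back-of-the-envelope check of the photo example: how long until the disk is full? */
final class StorageHorizon {

    /**
     * Months until a disk of capacityGb is full, given a monthly intake and the
     * fraction of that intake that would be reclaimed if garbage were collected.
     */
    static double monthsUntilFull(double capacityGb, double intakeGbPerMonth, double reclaimedFraction) {
        double netGrowthGbPerMonth = intakeGbPerMonth * (1.0 - reclaimedFraction);
        return capacityGb / netGrowthGbPerMonth;
    }

    public static void main(String[] args) {
        // Figures from the example: 500GB disk, 10GB per month, 20% discarded.
        double keepEverything = monthsUntilFull(500, 10, 0.0);
        double reclaimDiscarded = monthsUntilFull(500, 10, 0.2);
        System.out.println("never delete anything: " + keepEverything + " months");
        System.out.println("reclaim the discarded 20%: " + reclaimDiscarded + " months");
    }
}
```

Under these assumptions the disk lasts about 50 months if nothing is ever deleted and about 62 months if the discarded 20% is reclaimed, so collecting garbage only postpones the next disk upgrade by roughly a year.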

If your application is for personal use and you produce less than 10GB
of data per month, then don't worry about garbage collection.

If your application is for enterprise use and your customer produces
less than 100GB-1TB of data per month (depending on the size of the
enterprise), then don't worry about garbage collection.

BR,

Jukka Zitting

Re: Datastore and garbage collection

Posted by Alexander Klimetschek <ak...@day.com>.
On Mon, Mar 9, 2009 at 7:10 PM, Paco Avila <mo...@gmail.com> wrote:
> And after running
>
> GarbageCollector gc;
> SessionImpl si = (SessionImpl)session;
> gc = si.createDataStoreGarbageCollector();
>
> // optional (if you want to implement a progress bar / output):
> // gc.setScanEventListener(this);
> gc.scan();
> gc.stopScan();
>
> // delete old data
> gc.deleteUnused();
>
> are all deleted documents supposed to be removed from the repository/datastore?

Yes.
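
As a mental model for what the scan and delete steps in the quoted snippet do, here is a toy in-memory sketch. It is not Jackrabbit's implementation: the class ToyDataStoreGc, its record ids, and the set of referenced ids passed to scan() are all hypothetical, and the real collector works out which binaries are still referenced by itself.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of a scan-then-delete collector over a map of binary records. */
final class ToyDataStoreGc {
    // Record id -> binary content held by the (hypothetical) data store.
    private final Map<String, byte[]> records = new HashMap<>();
    // Ids that the last scan found to be still referenced.
    private final Set<String> marked = new HashSet<>();

    void put(String id, byte[] data) {
        records.put(id, data);
    }

    /** Scan phase: remember every record id that is still referenced. */
    void scan(Set<String> referencedIds) {
        marked.clear();
        marked.addAll(referencedIds);
    }

    /** Delete phase: drop every record the scan did not mark; returns the number removed. */
    int deleteUnused() {
        int before = records.size();
        records.keySet().retainAll(marked);
        return before - records.size();
    }

    int size() {
        return records.size();
    }

    public static void main(String[] args) {
        ToyDataStoreGc store = new ToyDataStoreGc();
        store.put("a", new byte[] {1});
        store.put("b", new byte[] {2});
        store.put("c", new byte[] {3});
        // Pretend the document holding "b" was deleted from the repository.
        store.scan(new HashSet<>(Arrays.asList("a", "c")));
        System.out.println("removed " + store.deleteUnused() + ", kept " + store.size());
    }
}
```

The point of the two phases is that nothing is removed until the whole scan has finished, so a record is only dropped if no reference to it was seen anywhere.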


> On Mon, Mar 9, 2009 at 6:38 PM, Paco Avila <mo...@gmail.com> wrote:
>> Do you mean that GC only makes sense if I delete documents from the
>> repository? I don't think that never running GC and keeping all the documents
>> (deleted ones included) is a good alternative for repositories that are
>> several GB in size and hold big documents.

TB is the new GB ;-)

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: Datastore and garbage collection

Posted by Paco Avila <mo...@gmail.com>.
And after running

GarbageCollector gc;
SessionImpl si = (SessionImpl)session;
gc = si.createDataStoreGarbageCollector();

// optional (if you want to implement a progress bar / output):
// gc.setScanEventListener(this);
gc.scan();
gc.stopScan();

// delete old data
gc.deleteUnused();

are all deleted documents supposed to be removed from the repository/datastore?


Re: Datastore and garbage collection

Posted by Paco Avila <mo...@gmail.com>.
Do you mean that GC only makes sense if I delete documents from the
repository? I don't think that never running GC and keeping all the documents
(deleted ones included) is a good alternative for repositories that are
several GB in size and hold big documents.

On Mon, Mar 9, 2009 at 6:20 PM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> Well, it depends. If your usage patterns permit, you could also just
> ignore garbage collection entirely.
>
> If you don't have lots of short-lived files (or binary properties) in
> the repository, then the cost of keeping some extra unused binaries in
> the data store may well be smaller than the cost of getting rid of
> them.
>
> It's worth estimating the rate at which you remove binary data from
> the repository, and using the result to calculate the best garbage
> collection intervals. The low (and declining) cost of storage and the
> typical usage patterns of many content applications (especially ones
> with versioning) may well suggest that the most economic alternative
> is to never run the garbage collector.
>
> BR,
>
> Jukka Zitting
>




Re: Datastore and garbage collection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Mar 9, 2009 at 6:02 PM, Paco Avila <mo...@gmail.com> wrote:
> The real question should be "Do I need to call the garbage collection in my
> app" ? :P
>
> And the answer seems to be "YES"!

Well, it depends. If your usage patterns permit, you could also just
ignore garbage collection entirely.

If you don't have lots of short-lived files (or binary properties) in
the repository, then the cost of keeping some extra unused binaries in
the data store may well be smaller than the cost of getting rid of
them.

It's worth estimating the rate at which you remove binary data from
the repository, and using the result to calculate the best garbage
collection intervals. The low (and declining) cost of storage and the
typical usage patterns of many content applications (especially ones
with versioning) may well suggest that the most economic alternative
is to never run the garbage collector.
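
One simple way to turn such an estimate into an interval is sketched below. It assumes you pick a budget of unreferenced data that you are willing to tolerate in the data store; that budget and all names in the code are hypothetical, not something Jackrabbit prescribes.

```java
/** Rough estimate of how often the garbage collector would need to run. */
final class GcInterval {

    /**
     * Days between collector runs so that unreferenced data stays within the
     * tolerated budget. Returns infinity when nothing is ever removed, which
     * corresponds to never running the collector at all.
     */
    static double daysBetweenRuns(double removedGbPerDay, double toleratedGarbageGb) {
        if (removedGbPerDay <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return toleratedGarbageGb / removedGbPerDay;
    }

    public static void main(String[] args) {
        // Illustrative figures only: 0.5GB removed per day, 50GB of garbage tolerated.
        System.out.println("run GC every " + daysBetweenRuns(0.5, 50) + " days");
    }
}
```

The cheaper storage gets, the larger the tolerated budget can be, and the interval stretches toward "never".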

BR,

Jukka Zitting

Re: Datastore and garbage collection

Posted by Paco Avila <mo...@gmail.com>.
The real question should be "Do I need to call the garbage collection in my
app" ? :P

And the answer seems to be "YES"!

On Mon, Mar 9, 2009 at 5:40 PM, Thomas Müller <th...@day.com> wrote:

> Hi,
>
> Both :-) The garbage collection is implemented, but you need to call
> it from within your application.
>
> For a code example see http://wiki.apache.org/jackrabbit/DataStore
> "Running data store garbage collection"
>
> Regards,
> Thomas
>




Re: Datastore and garbage collection

Posted by Thomas Müller <th...@day.com>.
Hi,

On Mon, Mar 9, 2009 at 5:36 PM, Paco Avila <mo...@gmail.com> wrote:
> I've been reading http://wiki.apache.org/jackrabbit/DataStore and there is a
> reference to "Running data store garbage collection". Do I actually need to
> perform this task, or is it already implemented in the core?

Both :-) The garbage collection is implemented, but you need to call
it from within your application.
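
Since the call has to come from the application, one option is to schedule it. The following is a minimal sketch using only java.util.concurrent. The gcTask Runnable is a placeholder for wherever an application would invoke the collector, and the schedule() helper and its parameters are illustrative assumptions rather than part of Jackrabbit.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Runs a garbage collection task periodically on a background thread. */
final class GcScheduler {

    /**
     * Runs gcTask repeatedly with a fixed delay between runs. The caller owns
     * the returned executor and should shut it down when the application stops.
     */
    static ScheduledExecutorService schedule(Runnable gcTask, long delay, TimeUnit unit) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(() -> {
            try {
                gcTask.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all further runs, so report it and carry on.
                System.err.println("GC run failed: " + e);
            }
        }, delay, delay, unit);
        return executor;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder task; a real application would call the collector here.
        Runnable gcTask = () -> System.out.println("running data store GC");
        ScheduledExecutorService executor = schedule(gcTask, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        executor.shutdownNow();
    }
}
```

A fixed delay (rather than a fixed rate) means a slow collection run never overlaps with the next one.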

For a code example see http://wiki.apache.org/jackrabbit/DataStore
"Running data store garbage collection"

Regards,
Thomas