Posted to user@cassandra.apache.org by Alexander Shutyaev <sh...@gmail.com> on 2012/09/03 07:02:57 UTC

Re: force gc?

Hi Jeffrey,

I think I described the problem badly :) I don't want to trigger Java's
memory GC. I want Cassandra's GC, that is, I want to "really" remove deleted
rows from a column family and get my disc space back.

2012/8/31 Jeffrey Kesselman <je...@gmail.com>

> Cassandra at least used to do disc cleanup as a side effect of
> garbage collection through finalizers.  (This is a mistake for the
> reason outlined below.)
>
> It is important to understand that you can *never* *force* a GC in Java.
> Even calling System.gc() is merely a hint to the VM. What you are doing is
> telling the VM that you are *willing* to give up some processor time right
> now to GC; how much it chooses to actually collect or not collect is totally
> up to the VM.
>
> The *only* garbage collection guarantee in java is that it will make a
> "best effort" to collect what it can to avoid an out of memory exception at
> the time that it runs out of memory.  You are not guaranteed when, *if
> ever*, a given object will actually be collected.  Since finalizers happen
> when an object is collected, and not when it becomes a candidate for
> collection, the same is true of the finalizer.  You are
> not guaranteed when, if ever, it will run.
>
>
> On Fri, Aug 31, 2012 at 9:03 AM, Alexander Shutyaev <sh...@gmail.com> wrote:
>
>> Hi All!
>>
>> I have a problem with using Cassandra. Our application does a lot of
>> overwrites and deletes. If I understand correctly, Cassandra does not
>> actually delete these objects until gc_grace seconds have passed. I tried
>> to "force" gc by setting gc_grace to 0 on an existing column family and
>> running major compaction afterwards. However, I did not get disk space back,
>> although I'm pretty sure that my column family should occupy many
>> times less space. We also have a PostgreSQL db and we duplicate each
>> operation with data in both dbs. And the PostgreSQL table is much
>> smaller than the corresponding Cassandra column family. Does anyone have
>> any suggestions on how I can analyze my problem? Or maybe I'm doing
>> something wrong and there is another way to force gc on an existing column
>> family.
>>
>> Thanks in advance,
>> Alexander
>>
>
>
>
> --
> It's always darkest just before you are eaten by a grue.
>

Re: force gc?

Posted by Peter Schuller <pe...@infidyne.com>.
> Maybe there is some tool to analyze it? It would be great if I could somehow
> export each row of a column family into a separate file - so I could see
> their count and sizes. Is there any such tool? Or maybe you have some better
> thoughts...

Use something like pycassa to non-obnoxiously iterate over all rows:

    for row_id, row in your_column_family.get_range():
        pass  # process each row here (e.g. count it, measure its size)

https://github.com/pycassa/pycassa
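
As a rough sketch of how that loop could answer the question about row
counts and sizes: the helper below (summarize_rows is a hypothetical name,
not part of pycassa) takes any iterable of (row_key, columns) pairs, which
is what get_range() yields, and adds up the lengths of column names and
values. It assumes names and values are plain strings; the keyspace, host,
and column family names in the commented usage are placeholders.

```python
def summarize_rows(rows):
    """Return (row_count, total_bytes, sizes) for (row_key, columns) pairs.

    `columns` is a mapping of column name -> value. Sizes count only the
    lengths of names and values, not any on-disk metadata.
    """
    sizes = {}
    for row_key, columns in rows:
        sizes[row_key] = sum(len(name) + len(value)
                             for name, value in columns.items())
    return len(sizes), sum(sizes.values()), sizes


# Hypothetical usage against a live cluster (all names are placeholders):
#
#   from pycassa.pool import ConnectionPool
#   from pycassa.columnfamily import ColumnFamily
#
#   pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
#   cf = ColumnFamily(pool, 'MyColumnFamily')
#   count, total, sizes = summarize_rows(cf.get_range())
```

Note that get_range() limits the number of columns returned per row by
default (the column_count argument), so wide rows may need a larger value
to be measured fully. The totals are payload bytes only and will not match
the on-disk SSTable size exactly.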

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: force gc?

Posted by Alexander Shutyaev <sh...@gmail.com>.
Hi Peter,

I'm not comparing it with the PostgreSQL size, I'm just making some estimates.
This table / column family stores some XML documents with an average raw size
of 2 MB each and a total size of about 5 GB. However, the space Cassandra
occupies on disc is 70 GB (after gc_grace was set to 0 and major compaction
was run).

Maybe there is some tool to analyze it? It would be great if I could
somehow export each row of a column family into a separate file - so I
could see their count and sizes. Is there any such tool? Or maybe you have
some better thoughts...

2012/9/3 Peter Schuller <pe...@infidyne.com>

> > I think that was clear from your post. I don't see a problem with your
> > process. Setting gc grace to 0 and forcing compaction should indeed
> > return you to the smallest possible on-disk size.
>
> (But may be unsafe as documented; can cause deleted data to pop back up,
> etc.)
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>

Re: force gc?

Posted by Peter Schuller <pe...@infidyne.com>.
> I think that was clear from your post. I don't see a problem with your
> process. Setting gc grace to 0 and forcing compaction should indeed
> return you to the smallest possible on-disk size.

(But may be unsafe as documented; can cause deleted data to pop back up, etc.)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: force gc?

Posted by Peter Schuller <pe...@infidyne.com>.
> I think I described the problem badly :) I don't want to trigger Java's
> memory GC. I want Cassandra's GC, that is, I want to "really" remove deleted
> rows from a column family and get my disc space back.

I think that was clear from your post. I don't see a problem with your
process. Setting gc grace to 0 and forcing compaction should indeed
return you to the smallest possible on-disk size.

Did you really not see a *decrease*, or are you just comparing the
final size with that of PostgreSQL? Keep in mind that in many cases
(especially if not using compression) the Cassandra on-disk format is
not as compact as PostgreSQL. For example, column names are duplicated
in each row, and the row key is stored twice (once in index, once
in data).
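
To get a feel for how much that duplication can add, here is a
back-of-the-envelope sketch. The function name estimate_row_bytes and the
per_column_overhead default of 15 bytes are assumptions made for
illustration (a stand-in for per-column metadata such as a timestamp and
length fields), not figures taken from this thread or a precise model of
the SSTable format.

```python
def estimate_row_bytes(row_key, columns, per_column_overhead=15):
    """Rough size estimate for one row, in bytes.

    Counts the row key twice (once for the index, once for the data file)
    and, for every column, its name, its value and an assumed fixed
    metadata overhead.
    """
    key_bytes = 2 * len(row_key)
    column_bytes = sum(len(name) + len(value) + per_column_overhead
                       for name, value in columns.items())
    return key_bytes + column_bytes


# One row with a 4-byte key and a single 100-byte value in column "body":
# 2*4 for the key, plus 4 + 100 + 15 for the column.
print(estimate_row_bytes("doc1", {"body": "x" * 100}))
```

For rows holding a few large values this overhead is tiny relative to the
payload; it matters mostly when a row consists of many small columns.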

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)