Posted to user@cassandra.apache.org by Alexander Shutyaev <sh...@gmail.com> on 2012/08/31 15:03:04 UTC

force gc?

Hi All!

I have a problem with using Cassandra. Our application does a lot of
overwrites and deletes. If I understand correctly, Cassandra does not
actually delete these objects until gc_grace seconds have passed. I tried
to "force" gc by setting gc_grace to 0 on an existing column family and
running a major compaction afterwards. However, I did not get disk space
back, although I'm pretty sure that my column family should occupy many
times less space. We also have a PostgreSQL db and we duplicate each
operation with data in both dbs. And the PostgreSQL table is much
smaller than the corresponding Cassandra column family. Does anyone have
any suggestions on how I can analyze my problem? Or maybe I'm doing
something wrong and there is another way to force gc on an existing column
family.

Thanks in advance,
Alexander

Re: force gc?

Posted by aaron morton <aa...@thelastpickle.com>.
What version are you on?

Check the result of your major compaction by looking for log lines such as "Compacted to…". They will say how much smaller the new file is.

After a major compaction there should be a single SSTable (the ks-cf-he-1234 part), with multiple components such as -Data.db. How many files do you have for the CF on disk? Are you also using secondary indexes?

Reducing gc_grace and forcing a major compaction should result in all purgeable space being removed. There are times when minor compaction cannot purge expired tombstones because the rows are spread out over multiple size tiers.

Expanding from 5GB to 70GB is outside normal expectations, I would say. You may want to check if the db contains what you expect it to.
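
A minimal sketch of the two checks above: list the -Data.db components for one column family with their sizes, and pull the "Compacted to" lines out of a log file. The helper names, the directory layout and the log location are assumptions for illustration, not something confirmed in this thread; adjust them to your install.

```python
import os


def sstable_data_files(cf_dir, keyspace, column_family):
    """Return [(file name, size in bytes)] for the -Data.db files of one CF.

    Assumes file names start with '<keyspace>-<column_family>-', following
    the ks-cf-he-1234 naming mentioned above. Files for other column
    families and other components (-Index.db etc.) are skipped.
    """
    prefix = "%s-%s-" % (keyspace, column_family)
    found = []
    for name in sorted(os.listdir(cf_dir)):
        if name.startswith(prefix) and name.endswith("-Data.db"):
            found.append((name, os.path.getsize(os.path.join(cf_dir, name))))
    return found


def compaction_log_lines(log_path):
    """Return every log line that mentions 'Compacted to'."""
    with open(log_path) as log:
        return [line.rstrip("\n") for line in log if "Compacted to" in line]


# Hypothetical usage; both paths are placeholders:
#   for name, size in sstable_data_files("/path/to/data/ks", "ks", "cf"):
#       print(name, size)
#   for line in compaction_log_lines("/path/to/system.log"):
#       print(line)
```

If more than one -Data.db file is listed after a major compaction, or the sizes add up to far more than expected, that narrows down where to look next.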

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/09/2012, at 5:32 PM, Alexander Shutyaev <sh...@gmail.com> wrote:

> Hi Derek,
> 
> I'm using size-tiered compaction.
> 
> 2012/9/3 Derek Williams <de...@fyrie.net>
> On Fri, Aug 31, 2012 at 7:03 AM, Alexander Shutyaev <sh...@gmail.com> wrote:
> Does anyone have any suggestions on how I can analyze my problem? Or maybe I'm doing something wrong and there is another way to force gc on an existing column family.
> 
> Are you using leveled compaction? I haven't looked into it too much, but I think forcing a major compaction when using leveled strategy doesn't have the same effect as with size tiered. 
> 
> -- 
> Derek Williams
> 
> 


Re: force gc?

Posted by Alexander Shutyaev <sh...@gmail.com>.
Hi Derek,

I'm using size-tiered compaction.

2012/9/3 Derek Williams <de...@fyrie.net>

> On Fri, Aug 31, 2012 at 7:03 AM, Alexander Shutyaev <sh...@gmail.com> wrote:
>
>> Does anyone have any suggestions on how I can analyze my problem? Or
>> maybe I'm doing something wrong and there is another way to force gc on an
>> existing column family.
>>
>
> Are you using leveled compaction? I haven't looked into it too much, but I
> think forcing a major compaction when using leveled strategy doesn't have
> the same effect as with size tiered.
>
> --
> Derek Williams
>
>

Re: force gc?

Posted by Derek Williams <de...@fyrie.net>.
On Fri, Aug 31, 2012 at 7:03 AM, Alexander Shutyaev <sh...@gmail.com> wrote:

> Does anyone have any suggestions on how I can analyze my problem? Or maybe
> I'm doing something wrong and there is another way to force gc on an
> existing column family.
>

Are you using leveled compaction? I haven't looked into it too much, but I
think forcing a major compaction when using leveled strategy doesn't have
the same effect as with size tiered.

-- 
Derek Williams

Re: force gc?

Posted by Peter Schuller <pe...@infidyne.com>.
> Maybe there is some tool to analyze it? It would be great if I could somehow
> export each row of a column family into a separate file - so I could see
> their count and sizes. Is there any such tool? Or maybe you have some better
> thoughts...

Use something like pycassa to non-obnoxiously iterate over all rows:

 for row_id, row in your_column_family.get_range():
    ....

https://github.com/pycassa/pycassa
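
A slightly fuller sketch of that idea, reporting the row count and per-row sizes instead of exporting each row to a file. The helper names (`row_sizes`, `summarize`) and the keyspace, column family and server in the comments are placeholders. The sizes cover only the column names and values the client sees, not Cassandra's on-disk overhead.

```python
def _nbytes(value):
    """Length in bytes of a column name or value as returned by the client."""
    if isinstance(value, bytes):
        return len(value)
    return len(str(value).encode("utf-8"))


def row_sizes(rows):
    """Yield (row_key, column_count, payload_bytes) for each (row_key, columns) pair."""
    for row_key, columns in rows:
        payload = sum(_nbytes(n) + _nbytes(v) for n, v in columns.items())
        yield row_key, len(columns), payload


def summarize(rows):
    """Return (row_count, total_payload_bytes, (largest_key, largest_bytes))."""
    count, total, largest = 0, 0, None
    for row_key, _, payload in row_sizes(rows):
        count += 1
        total += payload
        if largest is None or payload > largest[1]:
            largest = (row_key, payload)
    return count, total, largest


# Hypothetical wiring with pycassa (names and server are placeholders):
#   import pycassa
#   pool = pycassa.ConnectionPool("your_keyspace", server_list=["localhost:9160"])
#   your_column_family = pycassa.ColumnFamily(pool, "your_column_family")
#   print(summarize(your_column_family.get_range()))
```

One caveat, as far as I recall: get_range() returns only a limited number of columns per row unless column_count is raised, so the figures for wide rows may be underestimates.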

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: force gc?

Posted by Alexander Shutyaev <sh...@gmail.com>.
Hi Peter,

I'm not comparing it with the PostgreSQL size, I'm just making some
estimates. This table / column family stores some XML documents with an
average raw size of 2MB each and a total size of about 5GB. However, the
space Cassandra occupies on disk is 70GB (after gc_grace was set to 0 and a
major compaction was run).

Maybe there is some tool to analyze it? It would be great if I could
somehow export each row of a column family into a separate file - so I
could see their count and sizes. Is there any such tool? Or maybe you have
some better thoughts...

2012/9/3 Peter Schuller <pe...@infidyne.com>

> > I think that was clear from your post. I don't see a problem with your
> > process. Setting gc grace to 0 and forcing compaction should indeed
> > return you to the smallest possible on-disk size.
>
> (But may be unsafe as documented; can cause deleted data to pop back up,
> etc.)
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>

Re: force gc?

Posted by Peter Schuller <pe...@infidyne.com>.
> I think that was clear from your post. I don't see a problem with your
> process. Setting gc grace to 0 and forcing compaction should indeed
> return you to the smallest possible on-disk size.

(But may be unsafe as documented; can cause deleted data to pop back up, etc.)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: force gc?

Posted by Peter Schuller <pe...@infidyne.com>.
> I think I described the problem wrong :) I don't want to do Java's memory
> GC. I want to do cassandra's GC - that is I want to "really" remove deleted
> rows from a column family and get my disc space back.

I think that was clear from your post. I don't see a problem with your
process. Setting gc grace to 0 and forcing compaction should indeed
return you to the smallest possible on-disk size.

Did you really not see a *decrease*, or are you just comparing the
final size with that of PostgreSQL? Keep in mind that in many cases
(especially if not using compression) the Cassandra on-disk format is
not as compact as PostgreSQL. For example column names are duplicated
in each row, and the row key is duplicated twice (once in index, once
in data).
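
A back-of-the-envelope sketch of that overhead. The default of 15 bytes per column (length prefixes, a flags byte and a timestamp) and the two copies of the row key are assumptions for illustration, not figures taken from this thread. The function ignores bloom filters, index entries beyond the key itself, and compression, so treat the result as an order-of-magnitude estimate only.

```python
def estimated_row_bytes(row_key, columns, per_column_overhead=15, key_copies=2):
    """Rough uncompressed size of one row.

    row_key: bytes
    columns: iterable of (name_bytes, value_bytes) pairs
    per_column_overhead: ASSUMED fixed cost per column, in bytes
    key_copies: ASSUMED number of times the row key is stored
    """
    key_bytes = len(row_key) * key_copies
    column_bytes = sum(
        len(name) + len(value) + per_column_overhead for name, value in columns
    )
    return key_bytes + column_bytes


# One tiny column: the payload is 5 bytes, the estimate is 28.
small = estimated_row_bytes(b"key1", [(b"name", b"v")])

# One large value: the same fixed overhead is negligible next to 2,000,000 bytes.
large = estimated_row_bytes(b"key1", [(b"xml", b"x" * 2000000)])
```

Under these assumptions the overhead matters for rows made of many small columns and is negligible for rows holding a few large values.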

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: force gc?

Posted by Alexander Shutyaev <sh...@gmail.com>.
Hi Jeffrey,

I think I described the problem wrong :) I don't want to do Java's memory
GC. I want to do cassandra's GC - that is I want to "really" remove deleted
rows from a column family and get my disc space back.

2012/8/31 Jeffrey Kesselman <je...@gmail.com>

> Cassandra at least used to do disc cleanup as a side effect of
> garbage collection through finalizers.  (This is a mistake for the
> reason outlined below.)
>
> It is important to understand that you can *never* "force" a gc in Java.
> Even calling System.gc() is merely a hint to the VM. What you are doing is
> telling the VM that you are *willing* to give up some processor time right
> now to gc; how much it chooses to actually collect or not collect is totally
> up to the VM.
>
> The *only* garbage collection guarantee in Java is that it will make a
> "best effort" to collect what it can to avoid an out of memory exception at
> the time that it runs out of memory.  You are not guaranteed when, *if
> ever*, a given object will actually be collected.  Since finalizers happen
> when an object is collected, and not when it becomes a candidate for
> collection, the same is true of the finalizer.  You are
> not guaranteed when, if ever, it will run.
>
>
> On Fri, Aug 31, 2012 at 9:03 AM, Alexander Shutyaev <sh...@gmail.com> wrote:
>
>> Hi All!
>>
>> I have a problem with using Cassandra. Our application does a lot of
>> overwrites and deletes. If I understand correctly, Cassandra does not
>> actually delete these objects until gc_grace seconds have passed. I tried
>> to "force" gc by setting gc_grace to 0 on an existing column family and
>> running a major compaction afterwards. However, I did not get disk space
>> back, although I'm pretty sure that my column family should occupy many
>> times less space. We also have a PostgreSQL db and we duplicate each
>> operation with data in both dbs. And the PostgreSQL table is much
>> smaller than the corresponding Cassandra column family. Does anyone have
>> any suggestions on how I can analyze my problem? Or maybe I'm doing
>> something wrong and there is another way to force gc on an existing column
>> family.
>>
>> Thanks in advance,
>> Alexander
>>
>
>
>
> --
> It's always darkest just before you are eaten by a grue.
>

Re: force gc?

Posted by Jeffrey Kesselman <je...@gmail.com>.
Cassandra at least used to do disc cleanup as a side effect of
garbage collection through finalizers.  (This is a mistake for the
reason outlined below.)

It is important to understand that you can *never* "force" a gc in Java.
Even calling System.gc() is merely a hint to the VM. What you are doing is
telling the VM that you are *willing* to give up some processor time right
now to gc; how much it chooses to actually collect or not collect is totally
up to the VM.

The *only* garbage collection guarantee in Java is that it will make a
"best effort" to collect what it can to avoid an out of memory exception at
the time that it runs out of memory.  You are not guaranteed when, *if
ever*, a given object will actually be collected.  Since finalizers happen
when an object is collected, and not when it becomes a candidate for
collection, the same is true of the finalizer.  You are
not guaranteed when, if ever, it will run.

On Fri, Aug 31, 2012 at 9:03 AM, Alexander Shutyaev <sh...@gmail.com> wrote:

> Hi All!
>
> I have a problem with using Cassandra. Our application does a lot of
> overwrites and deletes. If I understand correctly, Cassandra does not
> actually delete these objects until gc_grace seconds have passed. I tried
> to "force" gc by setting gc_grace to 0 on an existing column family and
> running a major compaction afterwards. However, I did not get disk space
> back, although I'm pretty sure that my column family should occupy many
> times less space. We also have a PostgreSQL db and we duplicate each
> operation with data in both dbs. And the PostgreSQL table is much
> smaller than the corresponding Cassandra column family. Does anyone have
> any suggestions on how I can analyze my problem? Or maybe I'm doing
> something wrong and there is another way to force gc on an existing column
> family.
>
> Thanks in advance,
> Alexander
>



-- 
It's always darkest just before you are eaten by a grue.