Posted to user@cassandra.apache.org by ruslan usifov <ru...@gmail.com> on 2011/03/05 22:34:21 UTC

Nodes frozen in GC

Hello

I have a cluster of 3 Cassandra 0.7.3 nodes, and sometimes the nodes freeze. As
I understand it, this is because of GC; at that moment I see

INFO [ScheduledTasks:1] 2011-03-05 15:21:23,524 GCInspector.java (line 128)
GC for ConcurrentMarkSweep: 18052 ms, -997761672 reclaimed leaving
5796586088

How can I reduce this? I installed JNA (and installed it correctly: the
"CLibrary.java (line 106) JNA mlockall successful" log record is present in
system.log on all nodes) and set vm.swappiness = 0, but this doesn't help. Any
suggestions?

PS: Nodes have 12GB of memory with 6GB HeapSize

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
> Also, why is there so much garbage collection to begin with?  Memcache
> uses a slab allocator to reuse blocks to prevent allocation/deallocation
> of blocks from consuming all the cpu time.  Are there any plans to reuse
> blocks so the garbage collector doesn't have to work so hard?

And to address this btw, although it has nothing to do with the
problem being investigated in this thread: It's not about how *much*
time is spent on memory management. That is of course relevant, but
the issue here is to avoid long stop-the-world pauses. Even if you're
avoiding doing allocation, as long as you're not doing *zero*
allocation you need to collect what you *do* allocate in such a way as
to avoid long stop-the-world pauses.

How this is accomplished in terms of garbage collection
implementations is beyond the scope of this E-Mail, but a major point
is: Yes, there are workloads under which a given garbage collector,
like CMS, will fail to avoid stop-the-world full GCs in perpetuity.
However, the most common problems are dead simple ones like "oops, row
cache too big" where the *SYMPTOM* is one of prolonged GC pauses, but
the actual root cause is not that of a broken or inadequate GC. It is
sub-optimal that this is so (it would clearly be more user-friendly if
one were just told that there is too much live data), but for various
reasons it is non-trivial, from a JVM perspective, to provide that
information to the operator in a way that is useful and won't trigger
incorrectly (false positives).

Also, all memory management techniques have trade-offs. If you believe
memcached is invulnerable, try this: Populate a memcached with a bunch
of data of varying size but with a given average size. Wait until it's
entirely full. Then, adjust your data so that the distribution looks
the same but is displaced significantly (e.g., maybe move from a 150 byte
average blob size to a 1000 byte average). Unless you were lucky in
exactly how memcached ended up sizing its slabs and what your data
sizes happen to be, you can then watch how memcached crashes and burns
your application (which relied on the cache having a good hit ratio)
as you suddenly start seeing data evicted within seconds after
insertion. This can happen because memcached makes trade-offs in how
it does memory management in order to achieve its performance. One of
those trade-offs involves not being able to re-allocate memory to
slabs for different object sizes when the object size distribution
changes.

-- 
/ Peter Schuller

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
> Add:
>
> JVM_OPTS="$JVM_OPTS -XX:+PrintGC"
> JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
> JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
>
> And you will see significantly more detail in the GC log.

Maybe you want to add  -XX:+PrintGCApplicationConcurrentTime while you're at it.

But the key is to see what leads up to the long stop-the-world pause.

-- 
/ Peter Schuller

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
> JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
> JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

Add:

JVM_OPTS="$JVM_OPTS -XX:+PrintGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"

And you will see significantly more detail in the GC log.


-- 
/ Peter Schuller

RE: Nodes frozen in GC

Posted by Gregory Szorc <gr...@xobni.com>.
> With the large new-gen, you were actually seeing fallbacks to full GC?
> You weren't just still experiencing problems because at 10 gig, the new-gen
> will be so slow to compact as to be effectively similar to a full GC in terms
> of affecting latency?

Yes, we were seeing fallbacks to full GC with a large young generation.

Surprisingly, the young generation was still collecting quickly (under 0.25s).

Perhaps that means the young gen was full of fewer, larger objects?

Doing some experimentation now: with a 250MB young gen, ParNew collections are around 0.01-0.02 seconds of wall time. With a 2GB young gen, they are around 0.02-0.07s. Finally, with ~10GB new (9.5GB young), they are around 0.05-0.15s, with most being <0.1s. If no compactions have occurred, we occasionally see a 0.2s ParNew collection.

109.126: [GC 109.126: [ParNew
Desired survivor size 298254336 bytes, new threshold 3 (max 3)
- age   1:   34897008 bytes,   34897008 total
- age   2:   17763192 bytes,   52660200 total
- age   3:  135342264 bytes,  188002464 total
: 9524607K->204678K(9903232K), 0.0565180 secs] 9527767K->229803K(15146112K), 0.0567610 secs] [Times: user=0.31 sys=0.01, real=0.06 secs]
128.997: [GC 128.998: [ParNew
Desired survivor size 298254336 bytes, new threshold 2 (max 3)
- age   1:  294294360 bytes,  294294360 total
- age   2:   16201608 bytes,  310495968 total
- age   3:   17755920 bytes,  328251888 total
: 9500071K->349162K(9903232K), 0.1070140 secs] 9525196K->400440K(15146112K), 0.1072430 secs] [Times: user=0.60 sys=0.00, real=0.11 secs]
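For anyone not used to ParNew output (and assuming I'm reading the HotSpot format correctly), the first entry above breaks down roughly as:

109.126: [GC ... [ParNew                        # seconds since JVM start
Desired survivor size ... threshold 3 (max 3)   # tenuring threshold in effect for this collection
- age 1/2/3: ... bytes                          # bytes currently surviving 1, 2 and 3 minor collections
: 9524607K->204678K(9903232K), 0.0565 secs      # young gen before->after (capacity), and the ParNew pause
9527767K->229803K(15146112K)                    # whole heap before->after (capacity)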

These entries were taken during compaction of a non-troublesome CF. We are still seeing new generation allocations in the 1 GB/s range. And process read I/O reported from /proc/<pid>/io is about 135MB/s. A 7.5x ratio of memory allocations to read I/O does seem pretty high. Granted, the host was servicing some thrift requests at the time and these obviously contribute to object allocations.

Without any compactions, we blow through the ~9.5GB young generation in 50-60s on average, which means our baseline allocation rate (from servicing thrift requests and other misc background work) is a (more reasonable) 200MB/s. Assuming there isn't much else going on in Cassandra, that means that ~80% of the allocated space during compactions is coming from the compaction and that the memory allocation overhead to process read bytes for compactions is pretty high (around 6:1).
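Spelling the arithmetic out (rounding freely):

~9.5 GB young gen / 50-60 s                ~= 160-190 MB/s   baseline allocation rate (the ~200MB/s above)
~1 GB/s allocation vs ~135 MB/s read I/O   ~= 7.5 : 1        allocation-to-read ratio during compaction
(1 GB/s - 200 MB/s baseline) / 135 MB/s    ~= 6 : 1          compaction-only allocation per byte read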

I understand there will be an overhead for the programming language, but a ratio in the 6:1 range of JVM allocations to I/O reads seems a bit high. Since we are talking about many gigabytes of memory, I would expect the JVM allocation size to be dominated by the column values. This leads me to believe that the column value buffers are being excessively copied or the sstables aren't being read as efficiently as possible. Whatever the root cause, there definitely seems to be room for improvement. But, I'm not a Java expert and don't know the compaction algorithm too well, so maybe a ratio of 6:1 is pretty good.

> If there is any suspicion that the above is happening, maybe try decreasing
> in_memory_compaction_limit_in_mb (preparing to see lots of stuff logged
> to console, assuming that's still happening in the 0.6.
> version you're running).

I don't believe an in-memory compaction limit config option exists in 0.6.
 
> I"m not sure on what you're basing that, but unless I have fatally failed to
> grok something fundamental about the interaction between new-gen and
> old-gen with CMS, object's aren't being allocated *period* while the "young
> generation is being collected" as that is a stop-the-world pause. (This is also
> why I said before that at 10 gig new-gen size, the observed behavior on
> young gen collections may be similar to fallback-to-full-gc cases, but not quite
> since it would be parallel rather than serial)

The grok fail is probably on my end. I couldn't find any documentation to back either of our claims, so I'll defer to your experience.

Greg


Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
Sorry about the delay,

> I do believe there is a fundamental issue with compactions allocating too much memory and incurring too many garbage collections (at least with 0.6.12).

[snip a lot of good info]

You certainly seem to have a real issue, though I don't get the feel
it's the same as the OP.

I don't think I can offer a silver bullet. I was going to suggest that
you're seeing rows that are large enough that you're taking young-gen
GCs prior to the completion of individual rows, so that the per-row
working set is promoted to old-gen, yet small enough (row) to be below
in_memory_compaction_limit_in_mb. But this seems inconsistent with the
fact that you report problems even with huge new-gen (10 gig).

With the large new-gen, you were actually seeing fallbacks to full GC?
You weren't just still experiencing problems because at 10 gig, the
new-gen will be so slow to compact as to be effectively similar to a full
GC in terms of affecting latency?

If there is any suspicion that the above is happening, maybe try
decreasing in_memory_compaction_limit_in_mb (preparing to see lots of
stuff logged to console, assuming that's still happening in the 0.6.
version you're running).

Also, you did mention taking into account tenuring into old-gen, so
maybe your observations there are inconsistent with the above
hypothesis too. But just one correction/note regarding this: You said
that:

   "However, when the young generation is being collected (which
happens VERY often during compactions b/c allocation rate is so high),
objects are allocated directly into the tenured generation."

I"m not sure on what you're basing that, but unless I have fatally
failed to grok something fundamental about the interaction between
new-gen and old-gen with CMS, object's aren't being allocated *period*
while the "young generation is being collected" as that is a
stop-the-world pause. (This is also why I said before that at 10 gig
new-gen size, the observed behavior on young gen collections may be
similar to fallback-to-full-gc cases, but not quite since it would be
parallel rather than serial)

Anyways, I sympathize with your issues and the fact that you don't
have time to start attaching with profilers etc. Unfortunately I don't
know what to suggest that is simpler than that.

-- 
/ Peter Schuller

RE: Nodes frozen in GC

Posted by Gregory Szorc <gr...@xobni.com>.
I do believe there is a fundamental issue with compactions allocating too much memory and incurring too many garbage collections (at least with 0.6.12).

On nearly every Cassandra node I operate, garbage collections simply get out of control during compactions of any reasonably sized CF (>1GB). I can reproduce it on CFs with many wide rows (1000s of columns) consisting of smaller columns (10s-100s of bytes), on CFs with thinner rows (<20 columns) with larger columns (10s of MBs), and on everything in between.

From the GC logs, I can infer that Cassandra is allocating upwards of 4GB/s. I once gave the JVM 30GB of heap and saw it run through the entire heap in a few seconds while doing a compaction! It would continuously blow through the heap, incur a stop-the-world collection, and repeat. Meanwhile, the listed compacted bytes from the JMX interface was never increasing and the tmp sstable wasn't growing in size.

My current/relevant JVM args are as follows (running on Sun 1.6.0_24 w/ JNA 3.2.7):

-Xms9G -Xmx9G -Xmn256M -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram -XX:+PrintTenuringDistribution -Xloggc:/var/log/cassandra/gc.log -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=3 -XX:CMSInitiatingOccupancyFraction=40 -XX:+HeapDumpOnOutOfMemoryError -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSFullGCsBeforeCompaction=1 -XX:ParallelGCThreads=6

I've tweaked nearly every setting imaginable (http://www.md.pp.ru/~eu/jdk6options.html is a great resource, BTW) and can't control the problem. No matter what I do, nothing can solve the problem of Cassandra allocating objects faster than the GC can clean them up. And, when we're talking about >1GB/s of allocations, I don't think you can blame the GC for not keeping up.

Since there is no way to prevent these frequent stop-the-world collections, we get frequent client timeouts and an occasional unavailable response if we're unfortunate to have a couple of nodes compacting large CFs at the same time (which happens more than I'd like).

For the past two weeks, we had N=<replication factor> adjacent nodes in our cluster that failed to perform their daily major compaction on a particular column family. All N would spew GCInspector logs and the GC logs revealed heavy memory allocation rate. The only resolution was to restart Cassandra to abort the compaction. I isolated one node from network connectivity and restarted it in a cluster of 1 with no caching, memtables, or any operations. Under these ideal compacting conditions, I still ran into issues. I experimented with extremely large young generations (up to 10GB), very low CMSInitiatingOccupancyFraction, etc, but Cassandra would always allocate faster than JVM could collect, eventually leading to stop-the-world.

Recently, we rolled out a change to the application accessing the cluster which effectively resaved every column in every row. When this was mostly done, our daily major compaction for the troublesome CF that had refused to compact for two weeks suddenly completed! Most interesting. (Although it still went through memory to no end.)

One of my observations is that memory allocations during compaction seem to be mostly short-lived objects. The young generation is almost never promoting objects to the tenured generation (we changed our MaxTenuringThreshold to 3, from Cassandra's default of 1, to discourage early promotion; a default of 1 seems rather silly to me). However, when the young generation is being collected (which happens VERY often during compactions b/c allocation rate is so high), objects are allocated directly into the tenured generation. Even with relatively short ParNew collections (often <0.05s, almost always <0.1s wall time), these tenured allocations quickly accumulate, initiating CMS and eventually stop-the-world.
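For reference, the flags in the argument list above that control this behaviour are, roughly (my reading of the standard HotSpot documentation, not a recommendation):

-XX:SurvivorRatio=8              # each survivor space is 1/(8+2) = 1/10th of the young generation
-XX:MaxTenuringThreshold=3       # objects surviving ~3 minor collections get promoted to the tenured generation
-XX:+PrintTenuringDistribution   # print per-age survivor occupancy at each minor collection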

Anyway, not sure how much additional writing is going to help resolve this issue. I have gobs of GC logs and supplementary metrics data to back up my claims if those will help. But, I have a feeling that if you just create a CF of a few GB and incur a compaction with the JVM under a profiler, it will be pretty easy to identify the culprit. I've started down this path and will let you know if I find anything. But, I'm no Java expert and am quite busy with other tasks, so don't expect anything useful from me anytime soon.

I hope this information helps. If you need anything else, just ask, and I'll see what I can do.

Gregory Szorc
gregory.szorc@xobni.com

> -----Original Message-----
> From: scode@scode.org [mailto:scode@scode.org] On Behalf Of Peter
> Schuller
> Sent: Thursday, March 10, 2011 10:36 AM
> To: ruslan usifov
> Cc: user@cassandra.apache.org
> Subject: Re: Nodes frozen in GC
> 
> I think it would be very useful to get to the bottom of this, but without
> further details (like the asked-for GC logs) I'm not sure what to do/suggest.
> 
> It's clear that a single CF with a 64 MB memtable flush threshold, no key
> cache or row cache, and some bulk insertion should not, in general, be
> causing the problems you are seeing. Especially not with a >5 GB heap size.
> I think it is highly likely that there is some little detail/mistake going
> on here rather than a fundamental issue. But regardless, it would be nice
> to discover what.
> 
> --
> / Peter Schuller

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
I think it would be very useful to get to the bottom of this, but
without further details (like the asked-for GC logs) I'm not sure what
to do/suggest.

It's clear that a single CF with a 64 MB memtable flush threshold, no
key cache or row cache, and some bulk insertion should not, in general,
be causing the problems you are seeing. Especially not with a >5 GB
heap size. I think it is highly likely that there is some little
detail/mistake going on here rather than a fundamental issue. But
regardless, it would be nice to discover what.

-- 
/ Peter Schuller

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/8 Peter Schuller <pe...@infidyne.com>

> >                 $client->batch_mutate($mutations,
> > cassandra_ConsistencyLevel::QUORUM);
>
> Btw, what are the mutations? Are you doing something like inserting
> both very small values and very large ones?
>
I have a big XML file (5 GB, a MySQL dump in XML format) and read data from it
with a SAX XML parser. All records in that file look like this:

        <row>
                <field name="uid">5</field>
                <field name="aid">3619780:1</field>
                <field name="cleanness">0</field>
                <field name="counter">7</field>
                <field name="gcount">0</field>
                <field name="lastchange">1291053619</field>
                <field name="disaster">0</field>
                <field name="tdisaster">0</field>
        </row>


The mutations in that case are 10 similar records (the following code
fragment describes the situation):

    $l_supercolumn = new cassandra_SuperColumn(array("name" =>
$l_row["aid"], "columns" => $l_columns));
    $l_c_or_sc = new cassandra_ColumnOrSuperColumn(array("super_column" =>
$l_supercolumn));
    $l_mutation = new cassandra_Mutation(array("column_or_supercolumn" =>
$l_c_or_sc));

    if(array_key_exists($l_key, $mutations))
    {
        array_push($mutations[$l_key]['aquarium_friend'], $l_mutation);
    }
    else
    {
        $mutations[$l_key] = array('aquarium_friend' => array($l_mutation));
    };

    if(!($l_i % 10))
    {
        make_mutation($client, $mutations, $g_loger, $g_rloger);
        $mutations = array();

        if(!($l_i % 1000))
        {
            $g_loger->info(sprintf("inserted: %s", $l_i));
        };
    };



> That's why I asked about the frequency. If you're doing a long-term
> stress test and seeing a 30 second pause once per week, that's a lot
> more likely to be "normal" for your workload than if you're seeing it
> happen once every three minutes. The issue is that if you want to fix
> your problem, one must first figure out what the problem *is*. Based
> on past mailing list traffic, it seems most people's problems that are
> seemingly "due to GC" end up being because of a too high live set size
> or the CMS phase triggering too late. These are fixable issues if you are
> running into them.
>
>
In my case this happens from time to time. For example, inserting the whole 5GB XML
took about 30-40 minutes, and nodes froze about 5-10 times in that period
(average freeze time 15 secs).



> If all you have is a single column family with a 64 mb flush threshold
> and doing a bunch of insertions, and have a heap size of 5 (was it?)
> gig, you should not be having these issues. But stating that helps no
> one, which is why I'm asking for more information. (Widely
> extrapolating and suggesting that all Cassandra nodes will always
> freeze for 30 seconds every now and then is also helping no one, other
> than not being true.)
>
At the initial state the heap was 6GB. When I increased the heap size to 7GB, nodes
froze only one time, but it took much longer (40 secs).

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
>                 $client->batch_mutate($mutations,
> cassandra_ConsistencyLevel::QUORUM);

Btw, what are the mutations? Are you doing something like inserting
both very small values and very large ones?

In any case: My main reason to butt back into this thread is that
under normal circumstances you shouldn't be seeing very frequent
long-term pauses due to GC. If your workload is not doing something
particularly unusual, even if you do end up triggering a GC fallback
to full stop-the-world GC, it should happen comparatively infrequently
and you should be seeing many CMS cycles in between each such long
pause.

That's why I asked about the frequency. If you're doing a long-term
stress test and seeing a 30 second pause once per week, that's a lot
more likely to be "normal" for your workload than if you're seeing it
happen once every three minutes. The issue is that if you want to fix
your problem, one must first figure out what the problem *is*. Based
on past mailing list traffic, it seems most people's problems that are
seemingly "due to GC" end up being because of a too high live set size
or the CMS phase triggering too late. These are fixable issues if you are
running into them.

If all you have is a single column family with a 64 mb flush threshold
and doing a bunch of insertions, and have a heap size of 5 (was it?)
gig, you should not be having these issues. But stating that helps no
one, which is why I'm asking for more information. (Widely
extrapolating and suggesting that all Cassandra nodes will always
freeze for 30 seconds every now and then is also helping no one, other
than not being true.)

-- 
/ Peter Schuller

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/8 Peter Schuller <pe...@infidyne.com>

>
> (1) I cannot stress this one enough: Run with -XX:+PrintGC
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps and collect the output.
> (2) Attach to your process with jconsole or some similar tool.
> (3) Observe the behavior of the heap over time. Preferably post
> screenshots so others can look at them.
>
>
I'm not sure you fully understood me, sorry.

I launch Cassandra with the following GC logging options (I didn't mention
this before because this document,
http://www.datastax.com/docs/0.7/troubleshooting/index#nodes-seem-to-freeze-after-some-period-of-time,
doesn't mention gc.log at all):

JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"


And I detect that the nodes are frozen from log entries like the following:

Total time for which application threads were stopped: 30.0000957 seconds

And so on. Also, when I think the nodes are frozen I get
UnavailableException and TimedOutException, about 20-30 times (I make a few
attempts (300, with a 1 sec sleep) before finally failing); the following code
fragment illustrates what I do:

        for(; $l_i < 300; ++$l_i)
        {
            try
            {
                $client->batch_mutate($mutations,
cassandra_ConsistencyLevel::QUORUM);
                $retval = true;

                break;
            }
            catch(cassandra_UnavailableException $e)
            {
                array_push($l_exceptions, get_class($e));
                sleep(1);
            }
            catch(cassandra_TimedOutException $e)
            {
                array_push($l_exceptions, get_class($e));
                sleep(1);
            }
            catch(Exception $e)
            {
                $loger->err(get_class($e).': '.$e->getMessage());
                $loger->err($mutations);

                break;
            };
        };

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
Also:

* What is the frequency of the pauses? Are we talking every few
seconds, minutes, hours, days....
* If you, say, decrease the load down to 25%: are you seeing the same
effect but at 1/4th the frequency, does it remain unchanged, or
does the problem go away completely?

-- 
/ Peter Schuller

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
> (1) I cannot stress this one enough: Run with -XX:+PrintGC
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps and collect the output.

Actually, I wonder if it's worth someone getting this enabled by
default, with the obvious problems associated with getting the log
output placed appropriately and rotated. But it really seems to me
like almost every single case of people having problems on the MLs
would be helped by having instant access to the GC log to rule things
out very quickly and expediently.

I'm not aware of a good simple way to do so though, since the JVM
emits it on stdout (not even stderr) by default, or else to a fixed
file and supports no rotation. Making the file a fifo is possible, but
then one has to have someone reading that file and then that thing
must be monitored, and you introduce the possibility of the log reader
causing JVM hiccups, etc.... does anyone have a good practice for
running production code with GC logging in a way that doesn't interfere
with the normal user who's not actively using it/caring about it?
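For concreteness, the fifo variant would be something along these lines (a sketch only; svlogd here is runit's rotating log daemon, any stdin-reading logger would do, and it has exactly the monitoring problem described above):

mkdir -p /var/log/cassandra/gc-logs
mkfifo /var/log/cassandra/gc.pipe
svlogd -tt /var/log/cassandra/gc-logs < /var/log/cassandra/gc.pipe &
# and in cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.pipe"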

A related potential mode of backfire is that it might cause blocking
writes to stdout on systems bottlenecking on disk I/O. Maybe it's just
not realistic.

-- 
/ Peter Schuller

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
> So, are you saying this is normal and expected from Cassandra?  So,
> under load, we can expect java garbage collection to stop the Cassandra
> process on that server from time to time, essentially taking out the
> node for short periods of time while it does garbage collection?

This thread is getting out of hand and out into la-la-land. Original
poster: If you want to skip some rantings of mine, skip to the end and
do (1) and get the results to the list.

First of all, re-checking the history, it seems the only concrete
information is:

INFO [ScheduledTasks:1] 2011-03-05 15:21:23,524 GCInspector.java (line
128) GC for ConcurrentMarkSweep: 18052 ms, -997761672 reclaimed
leaving 5796586088

This indicates that a concurrent mark/sweep GC took 18 seconds. That
may or may not be a bit high for the heap size, but regardless, the
CMS is not a stop-the-world pause. It involves some stop-the-world
pauses, but those 18 seconds aren't it.

I still don't see anything that tells what's actually going on in the
OP's case. But the fact that the heap grew rather than shrunk as a
result of the GC cycle suggests that something is indeed wrong.
Probably the heap is too full, as has already been suggested, and the
question is just why. Probably *something* is tweaked incorrectly for
the heap size, be that row cache, memtable flush thresholds, etc. Or
there's a bug. But there seems to be a distinct lack of information
and a distinct non-lack of random speculating and GC blaming.

CASSANDRA-2252 as was linked to is *not* a magic fix for this.

A lot can be said about garbage collection techniques, but the point
is that the point of the CMS collector is to avoid the need for long
stop-the-world pauses. Some are still required, but they are supposed
to normally be short. For some workloads, you eventually reach a point
where fragmentation in the old generation causes the need do to a full
stop-the-world pause while the entire heap is compacted. This *does*
result in a long uninterrupted pause, if/when it happens.

Usually, it happens because you actually have too much live data on
the heap. That is entirely different from having a reasonable workload
that is still not handled by the GC in a sensible fashion.

Is it possible that everything is configured correctly and the OP is
triggering a bug or just triggering a sufficiently poor behavior of
CMS such that the freezes are due to unavoidable periodic compactions?
Yes. Do we have nearly enough information to know? No. Should we
assume it is the GC/JVM:s fault before having such information, given
that lots of people run Cassandra without triggering this to the
extent this seems to imply? No.

I would suggest to the OP:

(1) I cannot stress this one enough: Run with -XX:+PrintGC
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps and collect the output.
(2) Attach to your process with jconsole or some similar tool.
(3) Observe the behavior of the heap over time. Preferably post
screenshots so others can look at them.

(1) in particular is very important. It's completely useless to be
speculating about details and making sweeping statements when all
indications so far indicate that there is too much live data on the
heap, when there is not even the results of (1) to go by.

(1) will give you output which shows when different stages of GC
trigger and information about heap sizes etc. It will also print the
reason for fallback to full GC, such as a promotion failure. One can
usually observe fairly well what lead up to such a fallback and draw
conclusions. It will also show which stages took what amount of time
(and not all of them are stop-the-world).
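If attaching jconsole is awkward, something as simple as jstat gives a rough view for (2)/(3), and the log from (1) can be grepped for the fallback reason directly (a sketch; adjust the pid lookup and log path to your setup):

# heap occupancy and GC counts, sampled every 5 seconds
jstat -gcutil $(pgrep -f CassandraDaemon) 5s
# why CMS fell back to a full, stop-the-world collection
grep -E "promotion failed|concurrent mode failure" /var/log/cassandra/gc.log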

-- 
/ Peter Schuller

Re: Nodes frozen in GC

Posted by Paul Pak <pp...@yellowseo.com>.
So, are you saying this is normal and expected from Cassandra?  So,
under load, we can expect java garbage collection to stop the Cassandra
process on that server from time to time, essentially taking out the
node for short periods of time while it does garbage collection?

Also, why is there so much garbage collection to begin with?  Memcache
uses a slab allocator to reuse blocks to prevent allocation/deallocation
of blocks from consuming all the cpu time.  Are there any plans to reuse
blocks so the garbage collector doesn't have to work so hard?

Paul

On 3/7/2011 4:35 PM, Jonathan Ellis wrote:
> It sounds like you're complaining that the JVM sometimes does stop-the-world GC.
>
> You can mitigate this but not (for most workloads) eliminate it with
> GC option tuning.  That's simply the state of the art for Java garbage
> collection right now.
>
> On Sun, Mar 6, 2011 at 2:18 AM, ruslan usifov <ru...@gmail.com> wrote:
>>
>> 2011/3/6 aaron morton <aa...@thelastpickle.com>
>>> Your node is under memory pressure, after the GC there is still 5.7GB in
>>> use. In fact it looks like memory usage went up during the GC process.
>>>
>>> Can you reduce the memtable size, caches or the number of CF's or increase
>>> the JVM size? Also is this happening under heavy load ?
>>>
>> Yes I do bulk load to cluster
>>
>>
>> I reduced the memtable to 64MB and the situation became better, but all the
>> same a very rare GC of more than 15 sec happens. Can reducing
>> flush_largest_memtables_at help? Or maybe some GC options?
>>
>
>


Re: Nodes frozen in GC

Posted by David Boxenhorn <da...@daotown.com>.
If RF=2 and CL=QUORUM, you're getting no availability benefit from replication:
QUORUM needs both replicas, so when a node is in GC it stops everything. Set
RF=3, so that when one node is busy the cluster will still work.
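The arithmetic being:

quorum = floor(RF / 2) + 1
RF=2: quorum = 2 of 2 replicas -> one node pausing in GC blocks the request
RF=3: quorum = 2 of 3 replicas -> one node may pause without blocking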

On Tue, Mar 8, 2011 at 11:46 AM, ruslan usifov <ru...@gmail.com>wrote:

>
>
> 2011/3/8 Chris Goffinet <cg...@chrisgoffinet.com>
>
>> How large are your SSTables on disk? My thought was that because you have so
>> many on disk, we have to store the bloom filter + every 128th key from the
>> index in memory.
>>
>>
> 0.5GB
> But as I understand it, storing in memory happens only when reads happen, and
> I do only inserts. And I don't think memory is the problem, because heap usage
> looks like a saw (at the peak, heap usage gets to about 5.5 GB, then it drops
> to 2GB)
>
>
> Also, when I increase the heap size to 7GB the situation gets much better, but
> node freezes still happen, and in gc.log I still see:
>
> Total time for which application threads were stopped: 20.0686307 seconds
>
> lines (though not as often as before)
>

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/8 Chris Goffinet <cg...@chrisgoffinet.com>

> How large are your SSTables on disk? My thought was that because you have so
> many on disk, we have to store the bloom filter + every 128th key from the
> index in memory.
>
>
0.5GB
But as I understand it, storing in memory happens only when reads happen, and I
do only inserts. And I don't think memory is the problem, because heap usage
looks like a saw (at the peak, heap usage gets to about 5.5 GB, then it drops
to 2GB)


Also, when I increase the heap size to 7GB the situation gets much better, but
node freezes still happen, and in gc.log I still see:

Total time for which application threads were stopped: 20.0686307 seconds

lines (though not as often as before)

Re: Nodes frozen in GC

Posted by Chris Goffinet <cg...@chrisgoffinet.com>.
How large are your SSTables on disk? My thought was that because you have so many
on disk, we have to store the bloom filter + every 128th key from the index in
memory.

On Mon, Mar 7, 2011 at 4:35 PM, ruslan usifov <ru...@gmail.com>wrote:

>
>
> 2011/3/8 Chris Goffinet <cg...@chrisgoffinet.com>
>
>> The rows you are inserting, what is your update ratio to those rows?
>>
> I don't update them, only insert, at a speed of 16000 per second
>

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/8 Chris Goffinet <cg...@chrisgoffinet.com>

> The rows you are inserting, what is your update ratio to those rows?
>
I don't update them, only insert, at a speed of 16000 per second

Re: Nodes frozen in GC

Posted by Chris Goffinet <cg...@chrisgoffinet.com>.
The rows you are inserting, what is your update ratio to those rows?


On Mon, Mar 7, 2011 at 4:03 PM, ruslan usifov <ru...@gmail.com>wrote:

>
>
> 2011/3/8 Chris Goffinet <cg...@chrisgoffinet.com>
>
>> Can you tell me how many SSTables on disk when you see GC pauses? In your 3
>> node cluster, what's the RF factor?
>>
>>
> About 30-40, and I use RF=2 and insert rows at QUORUM consistency level
>

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/8 Chris Goffinet <cg...@chrisgoffinet.com>

> Can you tell me how many SSTables on disk when you see GC pauses? In your 3
> node cluster, what's the RF factor?
>
>
About 30-40, and I use RF=2 and insert rows at QUORUM consistency level

Re: Nodes frozen in GC

Posted by Chris Goffinet <cg...@chrisgoffinet.com>.
Can you tell me how many SSTables on disk when you see GC pauses? In your 3
node cluster, what's the RF factor?

On Mon, Mar 7, 2011 at 1:50 PM, ruslan usifov <ru...@gmail.com>wrote:

>
>
> 2011/3/8 Jonathan Ellis <jb...@gmail.com>
>
> It sounds like you're complaining that the JVM sometimes does
>> stop-the-world GC.
>>
>> You can mitigate this but not (for most workloads) eliminate it with
>> GC option tuning.  That's simply the state of the art for Java garbage
>> collection right now.
>>
>>
> Hm, but what to do in these cases? In these moments the throughput of the
> cluster degrades, and I don't understand what workaround I should apply to
> prevent these situations.
>

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/8 Paul Pak <pp...@yellowseo.com>

>  Hi Ruslan,
>
> Is it possible for you to tell us the details on what you have done which
> measurably helped your situation, so we can start a "best practices" doc on
> growing cassandra systems?
>
> So far, I see that under load, cassandra is rarely "ready" to take heavy
> load in its default configuration and a number of steps need to be done
> with the configuration of cassandra for proper sizing of memtables,
> flushing, jvm.  Unfortunately, it's very difficult to gauge what the proper
> or appropriate settings are for a given workload.
>
> It would be helpful if you could share what happened in the default
> config, what steps you took that helped the situation, and how much each step
> helped your situation.  That way we can start a checklist of things to
> address as we grow in load.
>
>
It would be great if you could provide the options that need tuning for best
throughput; I know only 3:

in cassandra.yaml

binary_memtable_throughput_in_mb

And jvm options:

-Xms and -Xmx - for the heap size
-Xmn - for the young generation size (minor GC)
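For reference, in 0.7 these are normally set in conf/cassandra-env.sh and conf/cassandra.yaml, roughly like this (variable names as in the stock 0.7 files; the values are just the ones mentioned in this thread):

# conf/cassandra-env.sh
MAX_HEAP_SIZE="6G"      # becomes -Xms/-Xmx
HEAP_NEWSIZE="800M"     # becomes -Xmn

# conf/cassandra.yaml
binary_memtable_throughput_in_mb: 64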

Re: Nodes frozen in GC

Posted by Paul Pak <pp...@yellowseo.com>.
Hi Ruslan,

Is it possible for you to tell us the details on what you have done
which measurably helped your situation, so we can start a "best
practices" doc on growing cassandra systems? 

So far, I see that under load, cassandra is rarely "ready" to take heavy
load in its default configuration and a number of steps need to be done
with the configuration of cassandra for proper sizing of memtables,
flushing, jvm.  Unfortunately, it's very difficult to gauge what the
proper or appropriate settings are for a given workload.

It would be helpful if you could share what happened in the default
config, what steps you took that helped the situation, and how much each
step helped your situation.  That way we can start a checklist of things
to address as we grow in load.

Paul

On 3/8/2011 5:15 AM, ruslan usifov wrote:
>
>
> 2011/3/8 Paul Pak <ppak@yellowseo.com <ma...@yellowseo.com>>
>
>     Hi Ruslan,
>
>     It looks like Jonathan and Stu have already been working to reduce
>     garbage collection in 0.8. The ticket is at
>     https://issues.apache.org/jira/browse/CASSANDRA-2252
>
>     Jonathan, is there any way to apply the patch to .73 and have
>     ruslan test it to see if it fixes his issue with Garbage
>     collection?  Given the fact that he's already got a test case, it
>     seems like a reasonable thing to do (with the caveat that he is a
>     guinea pig for new code which may not be fully tested yet :) .
>
>
> Thanks Paul for the information, it's very informative. But how does this
> work for others? I see that the situation gets better when I increase the
> heap size to 7GB, but nodes still freeze, though not as frequently as before.
> Also, though I haven't tried this, I think that setting -Xmn to something
> bigger than 800M may give a benefit
> (http://java.sun.com/performance/reference/whitepapers/tuning.html)


Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/8 Paul Pak <pp...@yellowseo.com>

>  Hi Ruslan,
>
> It looks like Jonathan and Stu have already been working to reduce garbage
> collection in 0.8. The ticket is at
> https://issues.apache.org/jira/browse/CASSANDRA-2252
>
> Jonathan, is there any way to apply the patch to .73 and have ruslan test
> it to see if it fixes his issue with Garbage collection?  Given the fact
> that he's already got a test case, it seems like a reasonable thing to do
> (with the caveat that he is a guinea pig for new code which may not be fully
> tested yet :) .
>

Thanks Paul for the information, it's very informative. But how does this work
for others? I see that the situation gets better when I increase the heap size
to 7GB, but nodes still freeze, though not as frequently as before. Also, though
I haven't tried this, I think that setting -Xmn to something bigger than 800M
may give a benefit (
http://java.sun.com/performance/reference/whitepapers/tuning.html)

Re: Nodes frozen in GC

Posted by Paul Pak <pp...@yellowseo.com>.
Hi Ruslan,

It looks like Jonathan and Stu have already been working to reduce
garbage collection in 0.8. The ticket is at
https://issues.apache.org/jira/browse/CASSANDRA-2252

Jonathan, is there any way to apply the patch to .73 and have ruslan
test it to see if it fixes his issue with Garbage collection?  Given the
fact that he's already got a test case, it seems like a reasonable thing
to do (with the caveat that he is a guinea pig for new code which may
not be fully tested yet :) .

Paul

On 3/7/2011 4:50 PM, ruslan usifov wrote:
>
>
> 2011/3/8 Jonathan Ellis <jbellis@gmail.com <ma...@gmail.com>>
>
>     It sounds like you're complaining that the JVM sometimes does
>     stop-the-world GC.
>
>     You can mitigate this but not (for most workloads) eliminate it with
>     GC option tuning.  That's simply the state of the art for Java garbage
>     collection right now.
>
>
> Hm, but what to do in these cases? In these moments the throughput of the
> cluster degrades, and I don't understand what workaround I should apply to
> prevent these situations.


Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/8 Jonathan Ellis <jb...@gmail.com>

> It sounds like you're complaining that the JVM sometimes does
> stop-the-world GC.
>
> You can mitigate this but not (for most workloads) eliminate it with
> GC option tuning.  That's simply the state of the art for Java garbage
> collection right now.
>
>
Hm, but what to do in these cases? In these moments the throughput of the
cluster degrades, and I don't understand what workaround I should apply to
prevent these situations.

Re: Nodes frozen in GC

Posted by Jonathan Ellis <jb...@gmail.com>.
It sounds like you're complaining that the JVM sometimes does stop-the-world GC.

You can mitigate this but not (for most workloads) eliminate it with
GC option tuning.  That's simply the state of the art for Java garbage
collection right now.

On Sun, Mar 6, 2011 at 2:18 AM, ruslan usifov <ru...@gmail.com> wrote:
>
>
> 2011/3/6 aaron morton <aa...@thelastpickle.com>
>>
>> Your node is under memory pressure, after the GC there is still 5.7GB in
>> use. In fact it looks like memory usage went up during the GC process.
>>
>> Can you reduce the memtable size, caches or the number of CF's or increase
>> the JVM size? Also is this happening under heavy load ?
>>
> Yes I do bulk load to cluster
>
>
> I reduced the memtable to 64MB and the situation became better, but all the
> same a very rare GC of more than 15 sec happens. Can reducing
> flush_largest_memtables_at help? Or maybe some GC options?
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Nodes frozen in GC

Posted by Peter Schuller <pe...@infidyne.com>.
Do you have row cache enabled? Disable it. If it fixes it and you want
it, re-enable it, but consider row sizes and the cap on the cache size.
-- 
/ Peter Schuller

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/6 aaron morton <aa...@thelastpickle.com>

> Your node is under memory pressure, after the GC there is still 5.7GB in
> use. In fact it looks like memory usage went up during the GC process.
>
> Can you reduce the memtable size, caches or the number of CF's or increase
> the JVM size? Also is this happening under heavy load ?
>
Yes I do bulk load to cluster


I reduced the memtable to 64MB and the situation became better, but all the
same a very rare GC of more than 15 sec happens. Can reducing
flush_largest_memtables_at help? Or maybe some GC options?

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
Hello, thanks for the reply.

2011/3/7 Aaron Morton <aa...@thelastpickle.com>

> It's always possible to run out of memory. Can you provide...
>
> - number cf's and their Memtable settings
>
1 CF with a 64MB memtable; other settings at their defaults



> - any row or key cache settings
>
They stay at the defaults, e.g. 200000, but I don't do any reads, only writes



> - any other buffer or memory settings you may have changed in
> Cassandra.yaml.
>
I changed only the binary_memtable_throughput_in_mb setting, and set it to 64MB




> - what load you are putting on the cluster, e.g. Inserting x rows/columns
> per second with avg size y
>
>
I do a bulk load of real data into the cluster. In my case this is done in 16 threads
(8 processes per machine, on two machines), with an average insert speed of 1000 per
second per thread, so the total is 16000 rows per second, with an average row size of
573 bytes.
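Back-of-the-envelope, using the RF=2 figure from elsewhere in the thread and the 64MB memtable threshold above (serialized sizes only; the in-heap size of a memtable is a multiple of this):

16,000 rows/s x 573 bytes            ~= 9 MB/s of row data cluster-wide
x RF=2, spread over 3 nodes          ~= 6 MB/s written per node
64 MB memtable threshold / 6 MB/s    ~= a memtable flush roughly every 10 seconds per node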

Re: Nodes frozen in GC

Posted by Aaron Morton <aa...@thelastpickle.com>.
It's always possible to run out of memory. Can you provide...

- number cf's and their Memtable settings
- any row or key cache settings
- any other buffer or memory settings you may have changed in Cassandra.yaml.
- what load you are putting on the cluster, e.g. Inserting x rows/columns per second with avg size y

Aaron

On 8/03/2011, at 2:39 AM, ruslan usifov <ru...@gmail.com> wrote:

> 
> 
> 2011/3/6 aaron morton <aa...@thelastpickle.com>
> Your node is under memory pressure, after the GC there is still 5.7GB in use. In fact it looks like memory usage went up during the GC process.
> 
> Can you reduce the memtable size, caches or the number of CF's or increase the JVM size? Also is this happening under heavy load ?
> 
> 
> I have memtable size, and insert data into one CF, with biggest rowsize 1K, how it is possible that after GC all memory is load? Meybe this is memory leak in cassandra 0.7.3?
> 

Re: Nodes frozen in GC

Posted by ruslan usifov <ru...@gmail.com>.
2011/3/6 aaron morton <aa...@thelastpickle.com>

> Your node is under memory pressure, after the GC there is still 5.7GB in
> use. In fact it looks like memory usage went up during the GC process.
>
> Can you reduce the memtable size, caches or the number of CF's or increase
> the JVM size? Also is this happening under heavy load ?
>
>
I have memtable size, and insert data into one CF with a biggest row size of 1K;
how is it possible that after GC all the memory is in use? Maybe this is a
memory leak in cassandra 0.7.3?

Re: Nodes frozen in GC

Posted by aaron morton <aa...@thelastpickle.com>.
Your node is under memory pressure, after the GC there is still 5.7GB in use. In fact it looks like memory usage went up during the GC process. 

Can you reduce the memtable size, caches or the number of CF's or increase the JVM size? Also is this happening under heavy load ? 

Aaron

On 6/03/2011, at 10:34 AM, ruslan usifov wrote:

> Hello
> 
> I have a cluster of 3 Cassandra 0.7.3 nodes, and sometimes the nodes freeze. As I understand it, this is because of GC; at that moment I see 
> 
> INFO [ScheduledTasks:1] 2011-03-05 15:21:23,524 GCInspector.java (line 128) GC for ConcurrentMarkSweep: 18052 ms, -997761672 reclaimed leaving 5796586088
> 
> How can I reduce this? I installed JNA (and installed it correctly: the "CLibrary.java (line 106) JNA mlockall successful" log record is present in system.log on all nodes) and set vm.swappiness = 0, but this doesn't help. Any suggestions?
> 
> PS: Nodes have 12GB of memory with 6GB HeapSize