You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Tom Evans <te...@googlemail.com> on 2015/11/13 16:00:35 UTC

Best way to track cumulative GC pauses in Solr

Hi all

We have some issues with our Solr servers spending too much time
paused doing GC. From turning on gc debug, and extracting numbers from
the GC log, we're getting an idea of just how much of a problem.

I'm currently doing this in a hacky, inefficient way:

grep -h 'Total time for which application threads were stopped:' solr_gc* \
    | awk '($11 > 0.3) { print $1, $11 }' \
    | sed 's#:.*:##' \
    | sort -n \
    | sum_by_date.py

(Yes, I really am using sed, grep and awk all in one line. Just wrong :)

The "sum_by_date.py" program simply adds up all the values with the
same first column, and remembers the largest value seen. This is
giving me the cumulative GC time for extended pauses (over 0.5s), and
the maximum pause seen in a given time period (hourly), eg:

2015-11-13T11 119.124037 2.203569
2015-11-13T12 184.683309 3.156565
2015-11-13T13 65.934526 1.978202
2015-11-13T14 63.970378 1.411700


This is fine for seeing that we have a problem. However, really I need
to get this in to our monitoring systems - we use munin. I'm
struggling to work out the best way to extract this information for
our monitoring systems, and I think this might be my naivety about
Java, and working out what should be logged.

I've turned on JMX debugging, and looking at the different beans
available using jconsole, but I'm drowning in information. What would
be the best thing to monitor?

Ideally, like the stats above, I'd like to know the cumulative time
spent paused in GC since the last poll, and the longest GC pause that
we see. munin polls every 5 minutes, are there suitable counters
exposed by JMX that it could extract?

Thanks in advance

Tom

Re: Best way to track cumulative GC pauses in Solr

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Tom,

SPM for SOLR should be helpful here. See http://sematext.com/spm

Otis

 

> On Nov 13, 2015, at 10:00, Tom Evans <te...@googlemail.com> wrote:
> 
> Hi all
> 
> We have some issues with our Solr servers spending too much time
> paused doing GC. From turning on gc debug, and extracting numbers from
> the GC log, we're getting an idea of just how much of a problem.
> 
> I'm currently doing this in a hacky, inefficient way:
> 
> grep -h 'Total time for which application threads were stopped:' solr_gc* \
>    | awk '($11 > 0.3) { print $1, $11 }' \
>    | sed 's#:.*:##' \
>    | sort -n \
>    | sum_by_date.py
> 
> (Yes, I really am using sed, grep and awk all in one line. Just wrong :)
> 
> The "sum_by_date.py" program simply adds up all the values with the
> same first column, and remembers the largest value seen. This is
> giving me the cumulative GC time for extended pauses (over 0.5s), and
> the maximum pause seen in a given time period (hourly), eg:
> 
> 2015-11-13T11 119.124037 2.203569
> 2015-11-13T12 184.683309 3.156565
> 2015-11-13T13 65.934526 1.978202
> 2015-11-13T14 63.970378 1.411700
> 
> 
> This is fine for seeing that we have a problem. However, really I need
> to get this in to our monitoring systems - we use munin. I'm
> struggling to work out the best way to extract this information for
> our monitoring systems, and I think this might be my naivety about
> Java, and working out what should be logged.
> 
> I've turned on JMX debugging, and looking at the different beans
> available using jconsole, but I'm drowning in information. What would
> be the best thing to monitor?
> 
> Ideally, like the stats above, I'd like to know the cumulative time
> spent paused in GC since the last poll, and the longest GC pause that
> we see. munin polls every 5 minutes, are there suitable counters
> exposed by JMX that it could extract?
> 
> Thanks in advance
> 
> Tom

Re: Best way to track cumulative GC pauses in Solr

Posted by Tom Evans <te...@googlemail.com>.
On Fri, Nov 13, 2015 at 4:50 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> Also, what GC settings are you using? We may be able to make some suggestions.
>
> Cumulative GC pauses aren’t very interesting to me. I’m more interested in the longest ones, 90th percentile, 95th, etc.
>

Any advice would be great, but what I'm primarily interested in is how
people are monitoring these statistics in real time, for all time, on
production servers. Eg, for looking at the disk or RAM usage of one of
my servers, I can look at the historical usage in the last week, last
month, last year and so on.

I need to get these stats in to the same monitoring tools as we use
for monitoring every other vital aspect of our servers. Looking at log
files can be useful, but I don't want to keep arbitrarily large log
files on our servers, nor extract data from them, I want to record it
for posterity in one system that understands sampling.

We already use and maintain our own munin systems, so I'm not
interested in paid-for equivalents of munin - regardless of how simple
to set up they are, they don't integrate with our other performance
monitoring stats, and I would never get budget anyway.

So really:

1) Is it OK to turn JMX monitoring on on production systems? The
comments in solr.in.sh suggest not.

2) What JMX beans and attributes should I be using to monitor GC
pauses, particularly maximum length of a single pause in a period, and
the total length of pauses in that period?

Cheers

Tom

Re: Best way to track cumulative GC pauses in Solr

Posted by Walter Underwood <wu...@wunderwood.org>.
Also, what GC settings are you using? We may be able to make some suggestions.

Cumulative GC pauses aren’t very interesting to me. I’m more interested in the longest ones, 90th percentile, 95th, etc.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 13, 2015, at 8:32 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> 
> On 11/13/2015 8:00 AM, Tom Evans wrote:
>> We have some issues with our Solr servers spending too much time
>> paused doing GC. From turning on gc debug, and extracting numbers from
>> the GC log, we're getting an idea of just how much of a problem.
> 
> Try loading your gc log into gcviewer.
> 
> https://github.com/chewiebug/GCViewer/releases
> 
> Here's a screenshot of this in action with a gc log from Solr loaded:
> 
> https://www.dropbox.com/s/orwt0fcmii5691l/solr-gc-gcviewer-1.35-snapshot.png?dl=0
> 
> This screenshot is from a snapshot build including a feature request
> that I made:
> 
> https://github.com/chewiebug/GCViewer/issues/139
> 
> If you use the 1.34.1 version, you will not see some of the numbers
> shown in my screenshot, but the info you asked for, accumulated GC
> pauses, IS included in that version.
> 
> Thanks,
> Shawn
> 


Re: Best way to track cumulative GC pauses in Solr

Posted by Shawn Heisey <ap...@elyograg.org>.
On 11/13/2015 8:00 AM, Tom Evans wrote:
> We have some issues with our Solr servers spending too much time
> paused doing GC. From turning on gc debug, and extracting numbers from
> the GC log, we're getting an idea of just how much of a problem.

Try loading your gc log into gcviewer.

https://github.com/chewiebug/GCViewer/releases

Here's a screenshot of this in action with a gc log from Solr loaded:

https://www.dropbox.com/s/orwt0fcmii5691l/solr-gc-gcviewer-1.35-snapshot.png?dl=0

This screenshot is from a snapshot build including a feature request
that I made:

https://github.com/chewiebug/GCViewer/issues/139

If you use the 1.34.1 version, you will not see some of the numbers
shown in my screenshot, but the info you asked for, accumulated GC
pauses, IS included in that version.

Thanks,
Shawn