Posted to user@hbase.apache.org by Tom Brown <to...@gmail.com> on 2014/06/10 20:05:31 UTC

Is this a long GC pause, or something else?

Last night a regionserver in my cluster stopped responding in a timely
manner for about 20 minutes. I know that stop-the-world GC can cause this
type of behavior, but 20 minutes seems excessive.

The server is a 2-core VM with 16GB of RAM (HBase max heap is 12GB). We
are using the latest Java 7 from Oracle. HDFS is provided by an Isilon
cluster.

The server workload is read/write: the writing process reads all rows it is
about to write, updates them if they exist, and then writes all the rows
(replacing ones that were updated).

The last messages before the pause were regarding an HLog roll:

DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support
getDefaultReplication
INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support
getDefaultBlockSize

During the next 20 minutes there were a handful of sporadic LruBlockCache
stats messages but nothing else. After 20 minutes, normal operation resumed.

Is 20 minutes for a GC pause expected given the operational load and
machine specs? Could a GC pause include periodic log messages? If it wasn't
a GC pause, what else could it be?

--Tom

Re: Is this a long GC pause, or something else?

Posted by Tom Brown <to...@gmail.com>.

I do not believe GC logging is enabled. I will look into that for the
future.
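
Until GC logging is enabled (on Java 7 HotSpot this is typically done with
JVM flags such as -verbose:gc and -XX:+PrintGCDetails), the JVM's own GC
counters can give a rough answer. The sketch below is an illustration only,
not something posted in this thread: it reads the cumulative collection time
from the standard GarbageCollectorMXBean API and compares it with wall-clock
time over a window. The class name GcTimeSampler and the one-second window
are assumptions, and the reported time is whatever the JVM attributes to each
collector, which is not necessarily all stop-the-world time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: samples cumulative GC time inside the current JVM. */
public class GcTimeSampler {

    /** Sum of cumulative collection time (ms) across all collectors. */
    public static long totalGcMillis() {
        long total = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of a wall-clock window attributed to GC, clamped to [0, 1]. */
    public static double gcFraction(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        double ratio = (double) gcMillisDelta / wallMillisDelta;
        return Math.min(1.0, Math.max(0.0, ratio));
    }

    public static void main(String[] args) throws InterruptedException {
        long gcBefore = totalGcMillis();
        long wallBefore = System.nanoTime();

        Thread.sleep(1000); // stand-in for the observation window (assumption)

        long gcDelta = totalGcMillis() - gcBefore;
        long wallDelta = (System.nanoTime() - wallBefore) / 1_000_000L;
        System.out.printf("GC: %d ms of %d ms wall clock (%.1f%%)%n",
                gcDelta, wallDelta, 100.0 * gcFraction(gcDelta, wallDelta));
    }
}
```

If a 20-minute stall were one long collection, the GC time delta measured
across it would be expected to be on the order of the stall itself; a small
delta would point away from GC.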

The cluster is 6 machines all with the same spec. I have not seen any
evidence that any other server in the cluster had any problems at the same
time. There are/were no dead nodes. The master did not seem to notice
anything during this time.

The issue was detected because requests to a particular RS would
consistently time out during the 20 minutes in question.

--Tom


On Tue, Jun 10, 2014 at 12:49 PM, Vladimir Rodionov <vrodionov@carrieriq.com
> wrote:

> 1. Do you have GC logging enabled on your cluster? It does not look like
> a GC pause to me, but for future troubleshooting it is better
> to enable GC logging.
>
> 2. How large is your cluster? Did you check NN and DN logs as well? Are
> all your nodes (RS and DN) up and running? No dead nodes?
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: Tom Brown [tombrown52@gmail.com]
> Sent: Tuesday, June 10, 2014 11:13 AM
> To: user@hbase.apache.org
> Subject: Re: Is this a long GC pause, or something else?
>
> We are still using 0.94.10. We are looking at upgrading soon, but have not
> done so yet.
>
> --Tom
>
>
> On Tue, Jun 10, 2014 at 12:10 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Which release are you using?
> >
> > In 0.98+, there is JvmPauseMonitor.
> >
> > Cheers
> >
>
> Confidentiality Notice:  The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited.  If you have received this message in error, please
> immediately notify the sender and/or Notifications@carrieriq.com and
> delete or destroy any copy of this message and its attachments.
>

RE: Is this a long GC pause, or something else?

Posted by Vladimir Rodionov <vr...@carrieriq.com>.

1. Do you have GC logging enabled on your cluster? It does not look like a GC pause to me, but for future troubleshooting it is better
to enable GC logging.

2. How large is your cluster? Did you check NN and DN logs as well? Are all your nodes (RS and DN) up and running? No dead nodes?

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Tom Brown [tombrown52@gmail.com]
Sent: Tuesday, June 10, 2014 11:13 AM
To: user@hbase.apache.org
Subject: Re: Is this a long GC pause, or something else?

We are still using 0.94.10. We are looking at upgrading soon, but have not
done so yet.

--Tom

On Tue, Jun 10, 2014 at 12:10 PM, Ted Yu <yu...@gmail.com> wrote:

> Which release are you using?
>
> In 0.98+, there is JvmPauseMonitor.
>
> Cheers
>

Re: Is this a long GC pause, or something else?

Posted by Tom Brown <to...@gmail.com>.

We are still using 0.94.10. We are looking at upgrading soon, but have not
done so yet.

--Tom


On Tue, Jun 10, 2014 at 12:10 PM, Ted Yu <yu...@gmail.com> wrote:

> Which release are you using?
>
> In 0.98+, there is JvmPauseMonitor.
>
> Cheers
>

Re: Is this a long GC pause, or something else?

Posted by Ted Yu <yu...@gmail.com>.

Which release are you using?

In 0.98+, there is JvmPauseMonitor.

Cheers
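
For releases that predate JvmPauseMonitor, the general idea behind such a
monitor can be approximated by hand: a thread sleeps for a short fixed
interval and checks how late it wakes up. The sketch below is a simplified
illustration of that sleep-overshoot technique, not HBase's actual
implementation; the class name PauseDetector and the 500 ms / 1000 ms values
are assumptions. A large overshoot means the whole JVM (or the VM underneath
it) was stalled, whatever the cause.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sleep-overshoot pause detector (illustration only). */
public class PauseDetector implements Runnable {
    private final long sleepMillis; // requested length of each probe sleep
    private final long warnMillis;  // overshoot that is reported as a pause

    public PauseDetector(long sleepMillis, long warnMillis) {
        this.sleepMillis = sleepMillis;
        this.warnMillis = warnMillis;
    }

    /** Time spent beyond the requested sleep, never negative. */
    public static long overshootMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                return;
            }
            long extra = overshootMillis(start, System.nanoTime(), sleepMillis);
            if (extra >= warnMillis) {
                System.err.println("Detected pause of about " + extra
                        + " ms (GC, swapping, or a stalled host)");
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread detector = new Thread(new PauseDetector(500L, 1000L), "pause-detector");
        detector.setDaemon(true); // must not keep the JVM alive on its own
        detector.start();
        Thread.sleep(2000); // stand-in for the real workload (assumption)
    }
}
```

Because a stop-the-world pause halts every Java thread, a detector like this
logs nothing during the stall and reports the overshoot once the JVM resumes.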



Re: Is this a long GC pause, or something else?

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi Tom,

Aha.  Our pauses keep happening. :(

We use SPM - see http://sematext.com/spm/ - it has support for HBase and
Hadoop metrics, among other things.  As a matter of fact, for
troubleshooting an issue like this one you may also want to ship your logs
into Logsene <http://sematext.com/logsene/>.  Doing that will let you
correlate your pause with messages in the logs, which could help you figure
out what's going on next time something like this happens.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jun 10, 2014 at 7:52 PM, Tom Brown <to...@gmail.com> wrote:

> Otis,
>
> I'm not sure our issue is the same (although they could turn out to be
> related). As far as I have been able to determine, we have only had a
> single long pause.
>
> However, we don't have much experience micromanaging our JVMs. How did you
> generate those graphs?
>
> --Tom
>
>
> On Tue, Jun 10, 2014 at 4:52 PM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
> > No, I don't think so.  We had it until this morning and didn't see this
> > problem.  We'll probably switch to it tomorrow morning before we change
> EC2
> > instances and see if that removes the problem.
> >
> > Tom - do your pauses look like the ones in our SPM graphs?
> >
> > Otis
> > --
> > Performance Monitoring * Log Analytics * Search Analytics
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> > On Tue, Jun 10, 2014 at 6:38 PM, Vladimir Rodionov <
> > vrodionov@carrieriq.com>
> > wrote:
> >
> > > Unbelievable. Do you see the same with the latest OpenJDK?
> > >
> > > Best regards,
> > > Vladimir Rodionov
> > > Principal Platform Engineer
> > > Carrier IQ, www.carrieriq.com
> > > e-mail: vrodionov@carrieriq.com
> > >
> > > ________________________________________
> > > From: Otis Gospodnetic [otis.gospodnetic@gmail.com]
> > > Sent: Tuesday, June 10, 2014 2:43 PM
> > > To: user@hbase.apache.org
> > > Subject: Re: Is this a long GC pause, or something else?
> > >
> > > Does it repeat?
> > > We are seeing this with u60 oracle JVM too!  SPM shows the whole JVM
> > > blocking for about 16 minutes every M minutes.
> > >
> > > Otis
> >
>

Re: Is this a long GC pause, or something else?

Posted by Tom Brown <to...@gmail.com>.

Otis,

I'm not sure our issue is the same (although they could turn out to be
related). As far as I have been able to determine, we have only had a
single long pause.

However, we don't have much experience micromanaging our JVMs. How did you
generate those graphs?

--Tom


On Tue, Jun 10, 2014 at 4:52 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> No, I don't think so.  We had it until this morning and didn't see this
> problem.  We'll probably switch to it tomorrow morning before we change EC2
> instances and see if that removes the problem.
>
> Tom - do your pauses look like the ones in our SPM graphs?
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Jun 10, 2014 at 6:38 PM, Vladimir Rodionov <
> vrodionov@carrieriq.com>
> wrote:
>
> > Unbelievable. Do you see the same with the latest OpenJDK?
> >
> > Best regards,
> > Vladimir Rodionov
> > Principal Platform Engineer
> > Carrier IQ, www.carrieriq.com
> > e-mail: vrodionov@carrieriq.com
> >
> > ________________________________________
> > From: Otis Gospodnetic [otis.gospodnetic@gmail.com]
> > Sent: Tuesday, June 10, 2014 2:43 PM
> > To: user@hbase.apache.org
> > Subject: Re: Is this a long GC pause, or something else?
> >
> > Does it repeat?
> > We are seeing this with u60 oracle JVM too!  SPM shows the whole JVM
> > blocking for about 16 minutes every M minutes.
> >
> > Otis
> >
>

Re: Is this a long GC pause, or something else?

Posted by Otis Gospodnetic <ot...@gmail.com>.

No, I don't think so.  We had it until this morning and didn't see this
problem.  We'll probably switch to it tomorrow morning before we change EC2
instances and see if that removes the problem.

Tom - do your pauses look like the ones in our SPM graphs?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jun 10, 2014 at 6:38 PM, Vladimir Rodionov <vr...@carrieriq.com>
wrote:

> Unbelievable. Do you see the same with the latest OpenJDK?
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: Otis Gospodnetic [otis.gospodnetic@gmail.com]
> Sent: Tuesday, June 10, 2014 2:43 PM
> To: user@hbase.apache.org
> Subject: Re: Is this a long GC pause, or something else?
>
> Does it repeat?
> We are seeing this with u60 oracle JVM too!  SPM shows the whole JVM
> blocking for about 16 minutes every M minutes.
>
> Otis
>

RE: Is this a long GC pause, or something else?

Posted by Vladimir Rodionov <vr...@carrieriq.com>.

Unbelievable. Do you see the same with the latest OpenJDK?

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Otis Gospodnetic [otis.gospodnetic@gmail.com]
Sent: Tuesday, June 10, 2014 2:43 PM
To: user@hbase.apache.org
Subject: Re: Is this a long GC pause, or something else?

Does it repeat?
We are seeing this with u60 oracle JVM too!  SPM shows the whole JVM blocking for about 16 minutes every M minutes.

Otis

Re: Is this a long GC pause, or something else?

Posted by Otis Gospodnetic <ot...@gmail.com>.

Here are some graphs:
JVM GC: https://apps.sematext.com/spm-reports/s/mYKcXNXBMl
JVM threads: https://apps.sematext.com/spm-reports/s/eJAVT8TUoB (so you can
see threads just "disappear" for blocks of time)

Meanwhile, the server's not dead - here's the CPU showing it's not dead and
it's not 100% idle OR 100% busy:
https://apps.sematext.com/spm-reports/s/Ess9S9JnYF

We just noticed this today when we switched from OpenJDK to Oracle JVM
update 60.
This is actually from a cluster running on R3 instances on EC2.

These lockups come and go, as you can see, and appear on all nodes in the
cluster, just not at the same time.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jun 10, 2014 at 5:43 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Does it repeat?
> We are seeing this with u60 oracle JVM too!  SPM shows the whole JVM
> blocking for about 16 minutes every M minutes.
>
> Otis
>

Re: Is this a long GC pause, or something else?

Posted by Otis Gospodnetic <ot...@gmail.com>.

Does it repeat?
We are seeing this with the u60 Oracle JVM too!  SPM shows the whole JVM blocking for about 16 minutes every M minutes.

Otis

 
