Posted to user@hbase.apache.org by "Liu, Ming (HPIT-GADSC)" <mi...@hp.com> on 2014/12/02 06:22:26 UTC

how to tell there is a OOM in regionserver

Hi, all,

Recently, one of our HBase 0.98.5 instances ran into a problem: when we run a specific workload, all region servers suddenly shut down at the same time, while the master keeps running. In the master log I can see messages like
2014-12-01 08:28:11,072 DEBUG [main-EventThread] master.ServerManager: Added=n008.cluster,60020,1417413986550 to dead servers, submitted shutdown handler to be executed meta=false
On n008, the regionserver log file contains no ERROR message; the last log entry looks very much like a ZooKeeper startup message. The log simply stops at that entry, and the RegionServer process was gone when we checked with 'jps'.
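
A minimal sketch of how such master log lines could be collected when many servers die at once. It assumes the line keeps the "Added=<host>,<port>,<startcode> to dead servers" shape quoted above; the names `DeadServer` and `parse_dead_server` are hypothetical and not part of HBase.

```python
import re
from typing import NamedTuple, Optional


class DeadServer(NamedTuple):
    host: str
    port: int
    startcode: int  # taken verbatim from the log; it looks like a start time in epoch milliseconds


# Assumption: the master log line contains the fragment
# "Added=<host>,<port>,<startcode> to dead servers", as in the quoted example.
DEAD_SERVER_RE = re.compile(r"Added=([^,\s]+),(\d+),(\d+) to dead servers")


def parse_dead_server(line: str) -> Optional[DeadServer]:
    """Return the server named in a 'dead servers' log line, or None if the line does not match."""
    match = DEAD_SERVER_RE.search(line)
    if match is None:
        return None
    return DeadServer(match.group(1), int(match.group(2)), int(match.group(3)))
```

Running this over every line of the master log gives the list of hosts whose '.out' files and kernel logs are worth checking first.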

We then increased the regionserver heap size, and it now works fine: the RegionServers no longer disappear. So we suspect there was an out-of-memory (OOM) condition and the region server processes were killed because of it. My questions are:

1.       What log message indicates an OOM? Since the region server is killed with 'kill -9', I think no message can tell us this.

2.       If there is no typical log message about OOM, how can an admin confirm that a region server OOM happened? We are only guessing and cannot be sure. We hope there is a way to tell for certain that an OOM occurred.

3.       Does the ZooKeeper message appear every time a RegionServer hits an OOM (if it is an OOM), or is it just a coincidence on our system?

So in sum, I want to know the typical clue that lets people confirm an OOM in an HBase region server.

Thank you,
Ming

Re: how to tell there is a OOM in regionserver

Posted by Otis Gospodnetic <ot...@gmail.com>.
It could also have been the Linux kernel's so-called OOM Killer that killed your RS.  See
http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
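
One way to check for this is to search the kernel log for a kill record naming the Java process. The sketch below is only an illustration: it assumes the kernel logged a line containing "Killed process <pid> (<name>)", a wording that varies between kernel versions, and the function name `find_oom_kills` is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: the kernel records an OOM kill with text like
# "Killed process 1234 (java) ..."; adjust the pattern if your kernel differs.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_lines: Iterable[str],
                   process_name: str = "java") -> List[Tuple[int, str]]:
    """Return (pid, line) for every kernel log line reporting a kill of process_name."""
    hits = []
    for line in kernel_log_lines:
        match = KILLED_RE.search(line)
        if match and match.group(2) == process_name:
            hits.append((int(match.group(1)), line.rstrip("\n")))
    return hits
```

The input could be the lines of `dmesg` output or of the system log file (often /var/log/messages, depending on the distribution). An empty result does not rule out a JVM-level OutOfMemoryError, which the kernel does not see.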

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



RE: how to tell there is a OOM in regionserver

Posted by "Liu, Ming (HPIT-GADSC)" <mi...@hp.com>.
Thank you both!

Yes, I can see the '.out' file, with clear proof that the process was killed. So we can prove this issue now!
And it is also true that we must rely on the JVM itself for proof that the kill was due to an OOM.
Thank you both, I learned a lot from this.

Thanks,
Ming


Re: how to tell there is a OOM in regionserver

Posted by Bharath Vissapragada <bh...@cloudera.com>.
I agree with Otis' response. To add a few more details: there is a ".out"
file in the logs/ directory that holds the stdout of each of these daemons,
and in case of an OOM crash it prints something like this:

# java.lang.OutOfMemoryError: Java heap space

# -XX:OnOutOfMemoryError="kill -9 %p"

#   Executing /bin/sh -c "kill -9 <pid>"...
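
A small sketch of how an admin might look for these lines across all daemons on a host. It assumes the stdout files sit directly in the given log directory and end in ".out" (the glob also picks up rotated copies such as ".out.1", if any exist); the function name `scan_out_files` and the marker list are hypothetical choices based on the sample output above.

```python
from pathlib import Path
from typing import Dict, List

# Substrings taken from the sample ".out" content quoted above.
MARKERS = ("java.lang.OutOfMemoryError", "-XX:OnOutOfMemoryError")


def scan_out_files(log_dir: str) -> Dict[str, List[str]]:
    """Map each *.out* file name under log_dir to the lines that mention an OOM marker."""
    findings: Dict[str, List[str]] = {}
    for path in sorted(Path(log_dir).glob("*.out*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if the file holds undecodable bytes.
        with path.open(errors="replace") as handle:
            hits = [line.rstrip("\n") for line in handle
                    if any(marker in line for marker in MARKERS)]
        if hits:
            findings[path.name] = hits
    return findings
```

Calling `scan_out_files` on the HBase logs/ directory returns an empty dict when no daemon has recorded an OutOfMemoryError in its stdout file.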






-- 
Bharath Vissapragada
<http://www.cloudera.com>

Re: how to tell there is a OOM in regionserver

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi Ming,

1) There typically is an OOM message from the JVM itself.

2) I would monitor the server instead of relying on log messages mentioning
OOMs.  For example, in SPM <http://sematext.com/spm/> we have "heartbeat
alerts" that tell us when we stop hearing from RegionServers and other
types of servers.  This also helps when servers simply die for reasons other
than OOM.

3) You could (should?) monitor individual memory pools and possibly set
alerts or anomaly detection on those.  With that in place, if there was an
OOM, you will typically see one of the memory pools approach 100%
utilization.  I personally really like this report in SPM because it gives
a bit more insight than just "heap size/utilization".  So I'd point the
admin to this sort of monitoring report.

4) High GC counts/times, or a jump in those metrics, typically followed by a
jump in CPU usage, are what often precede OOMs.
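
Points 3) and 4) can be roughed out without a monitoring product, for example from samples in the style of the JDK's `jstat -gcutil <pid>` output. The sketch below is an assumption-laden illustration, not how SPM works: it assumes a header row with column names such as O (old generation), P (permanent generation on older JVMs) and FGC (full GC count), and the thresholds and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def parse_gcutil(header: str, row: str) -> Dict[str, float]:
    """Pair column names with values from one jstat -gcutil style sample."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    sample: Dict[str, float] = {}
    for name, value in zip(names, values):
        try:
            sample[name] = float(value)
        except ValueError:
            continue  # skip non-numeric placeholders such as "-"
    return sample


def pools_near_full(sample: Dict[str, float],
                    pools: Sequence[str] = ("O", "P"),
                    threshold: float = 95.0) -> List[str]:
    """Return the watched pools whose utilization percentage is at or above threshold."""
    return [pool for pool in pools if sample.get(pool, 0.0) >= threshold]


def full_gc_jump(previous: Dict[str, float], current: Dict[str, float],
                 max_new_full_gcs: int = 3) -> bool:
    """True if the full GC count grew by more than max_new_full_gcs between two samples."""
    return current.get("FGC", 0.0) - previous.get("FGC", 0.0) > max_new_full_gcs
```

Eden is left out of the default pool list because it fills up routinely between young collections; a persistently full old generation combined with a jump in full GCs is the pattern described in points 3) and 4).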

Otis

