Posted to common-dev@hadoop.apache.org by "stack (JIRA)" <ji...@apache.org> on 2007/09/14 01:24:32 UTC
[jira] Updated: (HADOOP-1813) [hbase] OOME makes zombie of region server
[ https://issues.apache.org/jira/browse/HADOOP-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HADOOP-1813:
--------------------------
Attachment: oome.patch
Here is a first cut at a patch to clean up how servers handle the unexpected (OOMEs, etc.).
{code}
HADOOP-1813 OOME makes zombie of region server
All threads (and servers) now log unhandled exceptions and try to abort as best they can.
Make all but Leases and Worker inherit from new Chore Thread. Make all but the servers
be daemon threads. Give all threads useful names. Set startup flag that will get us a heap
dump on OOME on 1.6 JVMs.
M src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestCleanRegionServerExit.java
Add logging of significant region server transition.
A src/contrib/hbase/src/test/org/apache/hadoop/hbase/OOMEHMaster.java
A src/contrib/hbase/src/test/org/apache/hadoop/hbase/OOMERegionServer.java
Classes to test server behavior at OOME extremes.
M src/contrib/hbase/src/test/org/apache/hadoop/hbase/MiniHBaseCluster.java
Give regionservers a useful thread name.
(startRegionServer, waitOnRegionServer): Changed void return to
name of pertinent regionserver.
M src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestDFSAbort.java
(main): Added.
M src/contrib/hbase/src/test/org/apache/hadoop/hbase/MultiRegionTable.java
Added license.
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/RemoteExceptionHandler.java
Fixed a couple of eclipse complaints
(checkIOException): Added.
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/Leases.java
Made its internal thread a daemon thread. Log any unhandled exceptions.
(setName): Added.
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/HRegionServer.java
Replaced IOException checking code with call to
RemoteExceptionHandler.checkIOException. Report unhandled exceptions
and run abort if we get one. Do this for Worker thread too. Tried to
make the big long main loops smaller by moving bits of code that cohere
off into methods: e.g. reportForDuty, a method that contains the code
for telling master we are up.
(stopRequested): Made it an AtomicBoolean so Chore threads could be
passed the stop flag by reference rather than have to know about
hosting class.
(SplitOrCompactChecker, Flusher, LogRoller): Refactored to inherit
from Chore. Removed corresponding Runner data member, one for each
Chore thread.
(getRegionsToCheck, setDaemonThreadRunning, reportForDuty, doMain): Added.
A src/contrib/hbase/src/java/org/apache/hadoop/hbase/Chore.java
Abstract base Thread used for running chores on a period.
M src/contrib/hbase/src/java/org/apache/hadoop/hbase/HMaster.java
Changed closed from boolean to AtomicBoolean. Use new Sleeper
class for sleepytime. Have Meta scanners inherit from Chore.
Use new RemoteExceptionHandler.checkIOException everywhere.
Removed keeping around Runner data member for each meta scanner.
Report unhandled exceptions and try to close up afterward.
(doMain): Added.
A src/contrib/hbase/src/java/org/apache/hadoop/hbase/util/Sleeper.java
Sleeper that keeps its eye on the passed AtomicBoolean, exiting if it
is set.
M src/contrib/hbase/bin/hbase
Set -XX:+HeapDumpOnOutOfMemoryError so we get a heap dump on OOME.
{code}
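To make the Chore, Sleeper, and stop-flag notes above concrete, here is a minimal sketch of how the pieces might fit together. It is not the code in oome.patch: the class names ChoreSketch and SleeperSketch, the constructor arguments, the 10 ms polling slice, and the choice to set the stop flag when a chore throws are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sleeper: waits out a period but returns early if the flag is set. */
class SleeperSketch {
  private final long periodMs;
  private final AtomicBoolean stop;

  SleeperSketch(long periodMs, AtomicBoolean stop) {
    this.periodMs = periodMs;
    this.stop = stop;
  }

  /** Sleep in short slices so that a set stop flag is noticed promptly. */
  void sleep() {
    long deadline = System.currentTimeMillis() + periodMs;
    while (!stop.get()) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        return;
      }
      try {
        Thread.sleep(Math.min(remaining, 10));
      } catch (InterruptedException e) {
        // Treat an interrupt as a wake-up; the loop re-checks the flag.
      }
    }
  }
}

/** Hypothetical base class: a named daemon thread that runs chore() on a period. */
abstract class ChoreSketch extends Thread {
  private final SleeperSketch sleeper;
  // Shared by reference, so the chore need not know about its hosting server.
  protected final AtomicBoolean stop;

  ChoreSketch(String name, long periodMs, AtomicBoolean stop) {
    super(name);
    setDaemon(true);
    this.stop = stop;
    this.sleeper = new SleeperSketch(periodMs, stop);
  }

  /** The periodic task; subclasses supply the body. */
  protected abstract void chore();

  @Override
  public void run() {
    try {
      while (!stop.get()) {
        chore();
        sleeper.sleep();
      }
    } catch (Throwable t) {
      // Assumption: log anything unhandled and ask the host to stop.
      System.err.println(getName() + ": unhandled exception, requesting stop: " + t);
      stop.set(true);
    }
  }
}

class ChoreDemo {
  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean stop = new AtomicBoolean(false);
    AtomicInteger runs = new AtomicInteger();
    ChoreSketch chore = new ChoreSketch("demo.chore", 20, stop) {
      @Override
      protected void chore() {
        runs.incrementAndGet();
      }
    };
    chore.start();
    Thread.sleep(100);
    stop.set(true); // the same flag the hosting server would set
    chore.join(1000);
    System.out.println("runs=" + runs.get() + " alive=" + chore.isAlive());
  }
}
```

Passing the AtomicBoolean in, rather than a reference to the server, is what lets one base class serve both the region server and the master in this sketch.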
> [hbase] OOME makes zombie of region server
> ------------------------------------------
>
> Key: HADOOP-1813
> URL: https://issues.apache.org/jira/browse/HADOOP-1813
> Project: Hadoop
> Issue Type: Bug
> Components: contrib/hbase
> Reporter: stack
> Assignee: stack
> Priority: Minor
> Attachments: failed_compaction.log, oome.patch
>
>
> We need to catch Errors in the main regionserver and master run methods. During a cluster load, an OOME made the main thread exit, but the server stayed up doing a zombie impersonation. For OOMEs, we could add a best-effort handler. Service threads also need to be 'daemon'-ified. Here's an extract from the thread dump of the hung region server:
> {code}
> Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):
> ...
> "Lease.monitor" prio=10 tid=0x082ec800 nid=0x41ca waiting on condition [0xb122d000..0xb122df30]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.hbase.Leases$LeaseMonitor.run(Leases.java:226)
> at java.lang.Thread.run(Thread.java:619)
> "Thread-10.logRoller" prio=10 tid=0x082eb400 nid=0x41c9 waiting on condition [0xb127e000..0xb127eeb0]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.hbase.HRegionServer$LogRoller.run(HRegionServer.java:379)
> at java.lang.Thread.run(Thread.java:619)
> "Thread-10.cacheFlusher" prio=10 tid=0x082ea000 nid=0x41c7 waiting on condition [0xb1320000..0xb1320fb0]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.hbase.HRegionServer$Flusher.run(HRegionServer.java:325)
> at java.lang.Thread.run(Thread.java:619)
> "Thread-10.worker" prio=10 tid=0x082e9400 nid=0x41c6 waiting on condition [0xb1371000..0xb1372130]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0xb649a288> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1927)
> at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:395)
> at org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:859)
> at java.lang.Thread.run(Thread.java:619)
> ...
> {code}
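The dump above shows non-daemon service threads still sleeping after the main thread has gone, which is what keeps the JVM alive. A rough sketch of the two remedies the issue asks for, catching Throwable in the main run method and marking service threads as daemons, could look like the following. ZombieGuardSketch, startService, abort, and the simulated OutOfMemoryError are hypothetical names and choices for illustration, not HRegionServer code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical server skeleton: catches Errors in run() and uses daemon helpers. */
class ZombieGuardSketch implements Runnable {
  final AtomicBoolean stopRequested = new AtomicBoolean(false);
  final AtomicBoolean aborted = new AtomicBoolean(false);
  private final Runnable work;

  ZombieGuardSketch(Runnable work) {
    this.work = work;
  }

  /** Start a named daemon helper that idles until a stop is requested. */
  Thread startService(String name) {
    Thread t = new Thread(() -> {
      while (!stopRequested.get()) {
        try {
          Thread.sleep(10);
        } catch (InterruptedException e) {
          return;
        }
      }
    }, name);
    // Daemon threads do not keep the JVM alive once the main thread exits.
    t.setDaemon(true);
    t.start();
    return t;
  }

  @Override
  public void run() {
    try {
      work.run();
    } catch (Throwable t) {
      // Throwable covers Errors such as OutOfMemoryError, not just Exceptions.
      System.err.println(Thread.currentThread().getName()
          + ": unhandled " + t + ", aborting");
      abort();
    } finally {
      stopRequested.set(true);
    }
  }

  /** Best-effort abort: record it and tell helpers to stop. */
  void abort() {
    aborted.set(true);
    stopRequested.set(true);
  }
}

class ZombieGuardDemo {
  public static void main(String[] args) throws InterruptedException {
    ZombieGuardSketch server = new ZombieGuardSketch(() -> {
      throw new OutOfMemoryError("simulated for the sketch");
    });
    Thread roller = server.startService("sketch.logRoller");
    server.run();
    roller.join(1000);
    System.out.println("aborted=" + server.aborted.get()
        + " helperAlive=" + roller.isAlive());
  }
}
```

A real OutOfMemoryError may leave too little heap for even this handler to run reliably, so the handler here is only an attempt at cleanup; the daemon flag is what lets the process exit if the handler never gets that far.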