Posted to common-dev@hadoop.apache.org by "stack (JIRA)" <ji...@apache.org> on 2007/09/14 01:24:32 UTC

[jira] Updated: (HADOOP-1813) [hbase] OOME makes zombie of region server

     [ https://issues.apache.org/jira/browse/HADOOP-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HADOOP-1813:
--------------------------

    Attachment: oome.patch

Here is a first cut at a patch to clean up how servers handle the unexpected (OOMEs, etc.).
{code}
HADOOP-1813 OOME makes zombie of region server
All threads (and servers) now log unhandled exceptions and try to abort as best they can.
Make all but Leases and Worker inherit from new Chore Thread.  Make all but the servers
be daemon threads. Give all threads useful names.  Set startup flag that will get us a heap
dump on OOME on 1.6 JVMs.
M  src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestCleanRegionServerExit.java
    Add logging of significant region server transition.
A  src/contrib/hbase/src/test/org/apache/hadoop/hbase/OOMEHMaster.java
A  src/contrib/hbase/src/test/org/apache/hadoop/hbase/OOMERegionServer.java
    Classes to test server behavior at OOME extremes.
M  src/contrib/hbase/src/test/org/apache/hadoop/hbase/MiniHBaseCluster.java
    Give regionservers a useful thread name.
    (startRegionServer, waitOnRegionServer): Changed void return to
    name of pertinent regionserver.
M  src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestDFSAbort.java
    (main): Added.
M  src/contrib/hbase/src/test/org/apache/hadoop/hbase/MultiRegionTable.java
    Added license.
M  src/contrib/hbase/src/java/org/apache/hadoop/hbase/RemoteExceptionHandler.java
    Fixed a couple of eclipse complaints
    (checkIOException): Added.
M  src/contrib/hbase/src/java/org/apache/hadoop/hbase/Leases.java
    Made its internal thread a daemon thread. Log any unhandled exceptions.
    (setName): Added.
M  src/contrib/hbase/src/java/org/apache/hadoop/hbase/HRegionServer.java
    Replaced IOException checking code with call to
    RemoteExceptionHandler.checkIOException.  Report unhandled exceptions
    and run abort if we get one.  Do this for Worker thread too. Tried to
    make the big long main loops smaller by moving bits of code that cohere
    off into methods: e.g.  reportForDuty, a method that contains the code
    for telling master we are up.
    (stopRequested): Made it an AtomicBoolean so Chore threads could be
    passed the stop flag by reference rather than have to know about
    hosting class.
    (SplitOrCompactChecker, Flusher, LogRoller): Refactored to inherit
    from Chore.  Removed corresponding Runner data member, one for each
    Chore thread.
    (getRegionsToCheck, setDaemonThreadRunning, reportForDuty, doMain): Added.
A  src/contrib/hbase/src/java/org/apache/hadoop/hbase/Chore.java
    Abstract base Thread used for running chores on a period.
M  src/contrib/hbase/src/java/org/apache/hadoop/hbase/HMaster.java
    Changed closed from boolean to AtomicBoolean.  Use new Sleeper
    class for sleepytime.  Have Meta scanners inherit from Chore.
    Use new RemoteExceptionHandler.checkIOException everywhere.
    Removed keeping around Runner data member for each meta scanner.
    Report unhandled exceptions and try to close up afterward.
    (doMain): Added.
A  src/contrib/hbase/src/java/org/apache/hadoop/hbase/util/Sleeper.java
    Sleeper that keeps its eye on the passed AtomicBoolean, exiting if
    it is set.
M  src/contrib/hbase/bin/hbase
    Set -XX:+HeapDumpOnOutOfMemoryError so we get a heap dump on OOME.
{code}
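
To illustrate the Chore and Sleeper idea described in the change notes above, here is a minimal sketch. It is not the code in oome.patch: the class names SleeperSketch, ChoreSketch and ChoreSketchDemo, the constructor signatures, and the 10 ms polling slice are all assumptions made for the example. It shows a named daemon thread that runs a task on a period, is handed the stop flag as an AtomicBoolean rather than a reference to its hosting class, and requests a stop if anything unhandled escapes the task.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sleeper: waits out a period but returns early once the stop flag is set. */
class SleeperSketch {
  private static final long SLICE_MS = 10; // how often the flag is re-checked (assumed value)
  private final long periodMs;
  private final AtomicBoolean stop;

  SleeperSketch(long periodMs, AtomicBoolean stop) {
    this.periodMs = periodMs;
    this.stop = stop;
  }

  void sleep() {
    long deadline = System.currentTimeMillis() + periodMs;
    while (!stop.get()) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        return;
      }
      try {
        Thread.sleep(Math.min(remaining, SLICE_MS));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return;
      }
    }
  }
}

/** Hypothetical chore: a named daemon thread that runs chore() once per period. */
abstract class ChoreSketch extends Thread {
  private final AtomicBoolean stop;
  private final SleeperSketch sleeper;

  ChoreSketch(String name, long periodMs, AtomicBoolean stop) {
    super(name);
    setDaemon(true); // a daemon thread does not keep the JVM alive on its own
    this.stop = stop;
    this.sleeper = new SleeperSketch(periodMs, stop);
  }

  @Override
  public void run() {
    try {
      while (!stop.get() && !isInterrupted()) {
        chore();
        sleeper.sleep();
      }
    } catch (Throwable t) {
      // Catches Errors (e.g. OutOfMemoryError) as well as Exceptions.
      System.err.println(getName() + ": unhandled exception, requesting stop: " + t);
      stop.set(true);
    }
  }

  protected abstract void chore();
}

class ChoreSketchDemo {
  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean stop = new AtomicBoolean(false);
    AtomicInteger runs = new AtomicInteger();
    ChoreSketch roller = new ChoreSketch("demo.logRoller", 20, stop) {
      @Override
      protected void chore() {
        runs.incrementAndGet();
      }
    };
    roller.start();
    Thread.sleep(100);
    stop.set(true); // the shared flag stops the chore without interrupting it
    roller.join(1000);
    System.out.println("chore ran " + runs.get() + " times; alive=" + roller.isAlive());
  }
}
```

Polling the flag in short slices keeps the sketch simple; a wait/notify or Condition-based sleeper would react faster but needs more code.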

> [hbase] OOME makes zombie of region server
> ------------------------------------------
>
>                 Key: HADOOP-1813
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1813
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Assignee: stack
>            Priority: Minor
>         Attachments: failed_compaction.log, oome.patch
>
>
> We need to catch Errors in the main regionserver and master run methods.  During a cluster load, an OOME made the main thread exit, but the server stayed up doing a zombie impersonation.  For OOMEs we could add an attempted handler.  Service threads also need to be 'daemon'-ified.  Here's an extract from the thread dump of the hung region server:
> {code}
> Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):
> ...
> "Lease.monitor" prio=10 tid=0x082ec800 nid=0x41ca waiting on condition [0xb122d000..0xb122df30]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.Leases$LeaseMonitor.run(Leases.java:226)
>         at java.lang.Thread.run(Thread.java:619)
> "Thread-10.logRoller" prio=10 tid=0x082eb400 nid=0x41c9 waiting on condition [0xb127e000..0xb127eeb0]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.HRegionServer$LogRoller.run(HRegionServer.java:379)
>         at java.lang.Thread.run(Thread.java:619)
> "Thread-10.cacheFlusher" prio=10 tid=0x082ea000 nid=0x41c7 waiting on condition [0xb1320000..0xb1320fb0]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.HRegionServer$Flusher.run(HRegionServer.java:325)
>         at java.lang.Thread.run(Thread.java:619)
> "Thread-10.worker" prio=10 tid=0x082e9400 nid=0x41c6 waiting on condition [0xb1371000..0xb1372130]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0xb649a288> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1927)
>         at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:395)
>         at org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer.java:859)
>         at java.lang.Thread.run(Thread.java:619)
> ...
> {code}
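
The two remedies named in the description above (catch Errors in the main run method, and make service threads daemons) can be sketched as follows. This is only an illustration under stated assumptions: ServerSketch, startServiceThread and the demo thread name are hypothetical and do not come from HRegionServer or HMaster, and the OutOfMemoryError is thrown by hand rather than produced by real heap exhaustion.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical server shell; the real servers do far more than this. */
class ServerSketch implements Runnable {
  private final AtomicBoolean stopRequested = new AtomicBoolean(false);
  private final AtomicBoolean aborted = new AtomicBoolean(false);
  private final Runnable mainLoop;

  ServerSketch(Runnable mainLoop) {
    this.mainLoop = mainLoop;
  }

  @Override
  public void run() {
    try {
      mainLoop.run();
    } catch (Throwable t) {
      // Throwable, not Exception: OutOfMemoryError is an Error and
      // would slip past a catch (Exception e) clause.
      System.err.println("Unhandled exception in main loop, aborting: " + t);
      abort();
    } finally {
      stopRequested.set(true); // service threads watching the flag can wind down
    }
  }

  void abort() {
    aborted.set(true);
    stopRequested.set(true);
  }

  /** Starts a named daemon thread, so it cannot keep the JVM alive by itself. */
  Thread startServiceThread(String name, Runnable body) {
    Thread t = new Thread(body, name);
    t.setDaemon(true);
    t.start();
    return t;
  }

  boolean isAborted() {
    return aborted.get();
  }

  boolean isStopRequested() {
    return stopRequested.get();
  }
}

class ServerSketchDemo {
  public static void main(String[] args) {
    ServerSketch server = new ServerSketch(() -> {
      throw new OutOfMemoryError("simulated for the demo");
    });
    Thread monitor = server.startServiceThread("demo.leaseMonitor", () -> {
      while (!server.isStopRequested()) {
        try {
          Thread.sleep(10);
        } catch (InterruptedException e) {
          return;
        }
      }
    });
    server.run();
    System.out.println("aborted=" + server.isAborted() + ", monitor daemon=" + monitor.isDaemon());
  }
}
```

The JVM exits once only daemon threads remain, which is why non-daemon sleepers like the ones in the dump above can hold a server process open after its main thread has died.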
