Posted to issues@hbase.apache.org by "Xiaolin Ha (Jira)" <ji...@apache.org> on 2021/05/24 08:12:00 UTC
[jira] [Created] (HBASE-25905) Limit the shutdown time of WAL
Xiaolin Ha created HBASE-25905:
----------------------------------
Summary: Limit the shutdown time of WAL
Key: HBASE-25905
URL: https://issues.apache.org/jira/browse/HBASE-25905
Project: HBase
Issue Type: Improvement
Components: regionserver, wal
Affects Versions: 2.0.0, 3.0.0-alpha-1
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha
Attachments: rs-jstack1, rs-jstack2
We use the fan-out HDFS OutputStream and AsyncFSWAL on our clusters, but hit a problem where the RS could not exit completely for several hours until manual intervention.
We found that the incomplete RS exit is related to WAL stuck problems. When blockOnSync times out while flushing the cache, the RS kills itself, but only stream-broken problems can make sync fail and break the WAL; there is no socket timeout in the fan-out OutputStream, so the output flush can stall for an unlimited time.
The two jstacks below show that the regionserver thread can wait indefinitely in both
AsyncFSWAL#waitForSafePoint()
{code:java}
"regionserver/gh-data-hbase-prophet25.gh.sankuai.com/10.22.129.16:16020" #28 prio=5 os_prio=0 tid=0x00007fd7f4716000 nid=0x83a2 waiting on condition [0x00007fd7ec687000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007fddc4a5ac68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:709)
at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:793)
at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
at java.lang.Thread.run(Thread.java:745)
{code}
and AbstractFSWAL.rollWriterLock (when the logRoller is stuck in AsyncFSWAL#waitForSafePoint())
{code:java}
"regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020" #28 prio=5 os_prio=0 tid=0x00007fafc68e9000 nid=0x2091 waiting on condition [0x00007f9d661ef000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007fa03d52c298> (a java.util.concurrent.locks.ReentrantLock$FairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:791)
at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
at java.lang.Thread.run(Thread.java:745)
...
"regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020.logRoller" #253 daemon prio=5 os_prio=0 tid=0x00007fafa90a1000 nid=0x20a4 waiting on condition [0x00007f9d649ba000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007fa03d2f18f8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:681)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:124)
at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:685)
at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:746)
at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:188)
at java.lang.Thread.run(Thread.java:745)
{code}
HBASE-14790 noted that choosing a timeout is a hard problem. But we can limit the shutdown time of the WAL, which is also beneficial for MTTR.
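The jstacks show shutdown parking forever in awaitUninterruptibly(). A minimal sketch of the proposed fix is to bound that wait with Condition.awaitNanos() so shutdown can give up and abort the stuck writer. This is illustrative only: the class, field, and method names (BoundedSafePointWait, waitForSafePointWithTimeout, the timeout value) are hypothetical, not the actual AsyncFSWAL internals.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: replace the unbounded awaitUninterruptibly() in
// waitForSafePoint() with a timed wait, so WAL shutdown cannot hang forever.
public class BoundedSafePointWait {
  private final long timeoutMs; // assumed configurable, e.g. hbase.wal.shutdown.wait.timeout.ms
  private final ReentrantLock consumeLock = new ReentrantLock();
  private final Condition readyForRollingCond = consumeLock.newCondition();
  private boolean readyForRolling;

  public BoundedSafePointWait(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Waits for the safe point; returns false if the timeout elapses first. */
  public boolean waitForSafePointWithTimeout() {
    consumeLock.lock();
    try {
      long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      while (!readyForRolling) {
        if (remainingNanos <= 0) {
          // Timed out: the caller can now abort the stuck writer instead
          // of parking forever, bounding the RS shutdown time.
          return false;
        }
        try {
          remainingNanos = readyForRollingCond.awaitNanos(remainingNanos);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
      return true;
    } finally {
      consumeLock.unlock();
    }
  }

  /** Called when the consumer reaches the safe point. */
  public void signalSafePoint() {
    consumeLock.lock();
    try {
      readyForRolling = true;
      readyForRollingCond.signalAll();
    } finally {
      consumeLock.unlock();
    }
  }
}
{code}
The key difference from the current code is that awaitNanos() returns the remaining wait time, so the loop tolerates spurious wakeups while still enforcing an overall deadline.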
--
This message was sent by Atlassian Jira
(v8.3.4#803005)