Posted to issues@hbase.apache.org by "Xiaolin Ha (Jira)" <ji...@apache.org> on 2021/11/01 07:23:00 UTC

[jira] [Updated] (HBASE-25905) Limit the shutdown time of WAL

     [ https://issues.apache.org/jira/browse/HBASE-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaolin Ha updated HBASE-25905:
-------------------------------
    Issue Type: Bug  (was: Improvement)

> Limit the shutdown time of WAL
> ------------------------------
>
>                 Key: HBASE-25905
>                 URL: https://issues.apache.org/jira/browse/HBASE-25905
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 3.0.0-alpha-1, 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>         Attachments: rs-jstack1, rs-jstack2, wal-stuck-error-logs.png
>
>
> We use the fan-out HDFS OutputStream and AsyncFSWAL on our clusters, but hit a problem where the RS could not exit completely for several hours until manual intervention.
> We found that the problem of the RS not exiting completely is related to the WAL stuck problem mentioned in HBASE-25631. When blockOnSync times out while flushing the cache, the RS will kill itself, but only a broken stream can make sync fail and break the WAL. There is no socket timeout in the fan-out OutputStream, so the output flush can be stuck for an unlimited time.
> The two jstacks below show that the regionserver thread can wait indefinitely in both
> AsyncFSWAL#waitForSafePoint()
> {code:java}
> "regionserver/gh-data-hbase-prophet25.gh.sankuai.com/10.22.129.16:16020" #28 prio=5 os_prio=0 tid=0x00007fd7f4716000 nid=0x83a2 waiting on condition [0x00007fd7ec687000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007fddc4a5ac68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:709)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:793)
>         at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
>         at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
>         at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> and AbstractFSWAL.rollWriterLock (when the logRoller is stuck in AsyncFSWAL#waitForSafePoint())
> {code:java}
> "regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020" #28 prio=5 os_prio=0 tid=0x00007fafc68e9000 nid=0x2091 waiting on condition [0x00007f9d661ef000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007fa03d52c298> (a java.util.concurrent.locks.ReentrantLock$FairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:791)
>         at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:165)
>         at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:220)
>         at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:249)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1406)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1148)
>         at java.lang.Thread.run(Thread.java:745)
> ...
> "regionserver/rz-data-hbase-yarnlog04.rz.sankuai.com/10.16.196.37:16020.logRoller" #253 daemon prio=5 os_prio=0 tid=0x00007fafa90a1000 nid=0x20a4 waiting on condition [0x00007f9d649ba000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00007fa03d2f18f8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:652)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:681)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:124)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:685)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:746)
>         at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:188)
>         at java.lang.Thread.run(Thread.java:745){code}
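> In both jstacks the wait bottoms out in ConditionObject.awaitUninterruptibly(), which has no deadline and ignores interrupts, so nothing short of the missing signal can unpark the thread. A minimal, self-contained sketch of that wait shape and a time-bounded alternative (the class and field names here are illustrative only, not the actual AsyncFSWAL code):
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.locks.Condition;
> import java.util.concurrent.locks.ReentrantLock;
>
> public class SafePointWaitSketch {
>   private final ReentrantLock lock = new ReentrantLock();
>   private final Condition safePointReached = lock.newCondition();
>   private volatile boolean atSafePoint = false;
>
>   // Unbounded wait, the shape the jstacks show: if nothing ever signals the
>   // condition (e.g. the fan-out stream is hung with no socket timeout),
>   // this call never returns and cannot be interrupted.
>   void waitForSafePointUnbounded() {
>     lock.lock();
>     try {
>       while (!atSafePoint) {
>         safePointReached.awaitUninterruptibly();
>       }
>     } finally {
>       lock.unlock();
>     }
>   }
>
>   // Bounded variant: give up after a deadline so shutdown can move on.
>   boolean waitForSafePointBounded(long timeoutMs) throws InterruptedException {
>     long nanosLeft = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
>     lock.lock();
>     try {
>       while (!atSafePoint) {
>         if (nanosLeft <= 0) {
>           return false; // timed out; the caller decides how to clean up
>         }
>         nanosLeft = safePointReached.awaitNanos(nanosLeft);
>       }
>       return true;
>     } finally {
>       lock.unlock();
>     }
>   }
> }
> {code}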
>  WAL stuck error logs:
> !wal-stuck-error-logs.png|width=924,height=225!
> HBASE-14790 noted that choosing a proper timeout is a hard problem, but we can at least limit the shutdown time of the WAL, which is also reasonable for MTTR.
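> One possible shape for such a limit, sketched here only as an illustration and not as the actual HBase change (the helper method and the 2-second timeout below are assumptions): run the potentially stuck shutdown work on its own thread and bound the wait on its Future, so a hung sync can no longer block the RS from exiting.
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> public class BoundedWalShutdownSketch {
>
>   // Runs a (possibly stuck) shutdown task with an upper time bound.
>   // Returns true if it finished in time, false if we gave up waiting.
>   static boolean shutdownWithTimeout(Runnable shutdownTask, long timeoutMs) {
>     ExecutorService executor = Executors.newSingleThreadExecutor();
>     Future<?> future = executor.submit(shutdownTask);
>     try {
>       future.get(timeoutMs, TimeUnit.MILLISECONDS);
>       return true;
>     } catch (TimeoutException e) {
>       // The WAL is stuck (e.g. fan-out stream hung with no socket timeout);
>       // stop waiting so the RS process can finish exiting.
>       future.cancel(true);
>       return false;
>     } catch (Exception e) {
>       return false;
>     } finally {
>       executor.shutdownNow();
>     }
>   }
>
>   public static void main(String[] args) {
>     // Simulate a WAL shutdown that never completes.
>     boolean done = shutdownWithTimeout(() -> {
>       try {
>         Thread.sleep(Long.MAX_VALUE);
>       } catch (InterruptedException ie) {
>         Thread.currentThread().interrupt();
>       }
>     }, 2000);
>     System.out.println("WAL shutdown completed in time: " + done);
>   }
> }
> {code}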
>  


