You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2021/06/30 04:17:00 UTC

[jira] [Commented] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

    [ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371796#comment-17371796 ] 

Michael Stack commented on HBASE-26042:
---------------------------------------

Here is what it looked like when I tried to repro the hang-up in test env:
{code:java}
2021-06-29 20:33:35,183 WARN  [AsyncFSWAL-0-hdfs://localhost.localdomain:37680/user/mstack/test-data/329a3111-d35d-18ae-9107-065eee2a4e62] wal.AsyncFSWAL(299): sync failed
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: Injected(..) failed: Connection reset by peer
2021-06-29 20:33:35,184 DEBUG [LogRoller] wal.AbstractWALRoller(170): WAL  wal:(num 1625024015181) roll requested {code}
See how we immediately follow the error with 'WAL wal:(num 1625024015181) roll requested '.

Looking at the thread dump, four threads are trying to write a flush marker to the WAL... three have made it inside mvcc#begin and are trying to stamp sequenceid.... (see thread dump above) while one other is here:
{code:java}
Thread 9466: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=836 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, int) @bci=67, line=870 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) @bci=17, line=1199 (Compiled frame)
 - java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock() @bci=5, line=943 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(org.apache.hadoop.hbase.wal.WAL, long, java.util.Collection, org.apache.hadoop.hbase.monitoring.MonitoredTask, boolean, org.apache.hadoop.hbase.regionserver.FlushLifeCycleTracker) @bci=306, line=2641 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(org.apache.hadoop.hbase.wal.WAL, long, java.util.Collection, org.apache.hadoop.hbase.monitoring.MonitoredTask, boolean, org.apache.hadoop.hbase.regionserver.FlushLifeCycleTracker) @bci=11, line=2573 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(java.util.Collection, org.apache.hadoop.hbase.monitoring.MonitoredTask, boolean, org.apache.hadoop.hbase.regionserver.FlushLifeCycleTracker) @bci=13, line=2547 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.HRegion.flushcache(boolean, boolean, org.apache.hadoop.hbase.regionserver.FlushLifeCycleTracker) @bci=558, line=2436 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(org.apache.hadoop.hbase.regionserver.HRegion, boolean, boolean, org.apache.hadoop.hbase.regionserver.FlushLifeCycleTracker) @bci=86, line=610 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(org.apache.hadoop.hbase.regionserver.FlushType) @bci=738, line=288 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$700(org.apache.hadoop.hbase.regionserver.MemStoreFlusher, org.apache.hadoop.hbase.regionserver.FlushType) @bci=2, line=67 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run() @bci=128, line=344 (Compiled frame) {code}
Then there is the logroller BLOCKED as per above. Looks like a deadlock but I can't see it – yet anyways.

> WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-26042
>                 URL: https://issues.apache.org/jira/browse/HBASE-26042
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.3.5
>            Reporter: Michael Stack
>            Priority: Major
>
> Making note of issue seen in production cluster.
> Node had been struggling under load for a few days with slow syncs up to 10 seconds, a few STUCK MVCCs from which it recovered and some java pauses up to three seconds in length.
> Then the below happened:
> {code:java}
> 2021-06-27 13:41:27,604 WARN  [AsyncFSWAL-0-hdfs://XXXX:8020/hbase] wal.AsyncFSWAL: sync failedorg.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer {code}
> ... and WAL turned dead in the water. Scanners start expiring. RPC prints text versions of requests complaining requestsTooSlow. Then we start to see these:
> {code:java}
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=552128301, WAL system stuck? {code}
> Whats supposed to happen when other side goes away like this is that we will roll the WAL – go set up a new one. You can see it happening if you run
> {code:java}
> mvn test -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter {code}
> I tried hacking the test to repro the above hang by throwing same exception in above test (on linux because need epoll to repro) but all just worked.
> Thread dumps of the hungup WAL subsystem are a little odd. The log roller is stuck w/o timeout trying to write a long on the WAL header:
>  
> {code:java}
> Thread 9464: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
>  - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, line=1707 (Compiled frame)
>  - java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) @bci=119, line=3323 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, line=1742 (Compiled frame)
>  - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) @bci=16, line=189 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) @bci=9, line=202 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, long) @bci=107, line=170 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) @bci=61, line=113 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) @bci=22, line=651 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) @bci=2, line=128 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) @bci=101, line=797 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) @bci=18, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 (Compiled frame) {code}
>  
> Other threads are BLOCKED trying to append the WAL w/ flush markers etc. unable to add the ringbuffer:
>  
> {code:java}
> Thread 9465: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 (Compiled frame)
>  - com.lmax.disruptor.MultiProducerSequencer.next() @bci=2, line=105 (Interpreted frame)
>  - com.lmax.disruptor.RingBuffer.next() @bci=4, line=263 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$stampSequenceIdAndPublishToRingBuffer$1(org.apache.commons.lang3.mutable.MutableLong, com.lmax.disruptor.RingBuffer) @bci=2, line=1031 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$$Lambda$270.run() @bci=8 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl.begin(java.lang.Runnable) @bci=36, line=140 (Interpreted frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.stampSequenceIdAndPublishToRingBuffer(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean, com.lmax.disruptor.RingBuffer) @bci=62, line=1030 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.append(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit, boolean) @bci=10, line=589 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.appendMarker(org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALKeyImpl, org.apache.hadoop.hbase.wal.WALEdit) @bci=5, line=1081 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(org.apache.hadoop.hbase.wal.WAL, java.util.NavigableMap, org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.wal.WALEdit, org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl, java.util.Map, boolean) @bci=39, line=161 (Compiled frame)
>  - org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(org.apache.hadoop.hbase.wal.WAL, java.util.NavigableMap, org.apache.hadoop.hbase.client.RegionInfo, org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$FlushDescriptor, boolean, org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl) @bci=13, line=89 (Compiled frame)
>  {code}
>  
> Interesting is how more than one thread is able to be inside the synchronize block in mvcc#begin seemingly....
> In



--
This message was sent by Atlassian Jira
(v8.3.4#803005)