Posted to issues@hbase.apache.org by "Elliott Clark (JIRA)" <ji...@apache.org> on 2015/07/03 08:49:05 UTC

[jira] [Updated] (HBASE-13971) Flushes stuck since 6 hours on a regionserver.

     [ https://issues.apache.org/jira/browse/HBASE-13971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliott Clark updated HBASE-13971:
----------------------------------
    Attachment: jstack.3
                jstack.2
                jstack.1
                jstack.5
                jstack.4

I've now seen this a few times. I had seen it with multiwal, so I assumed it was related to that. (I took the jstacks from a regionserver with multiwal.)

However, I just saw it again on a server without multiwal.

Error messages before everything just stopped:

{code}
15/07/02 18:23:47 ERROR wal.FSHLog: Error syncing, request close of wal 
java.io.IOException: All datanodes 10.8.69.37:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)
{code}
 

> Flushes stuck since 6 hours on a regionserver.
> ----------------------------------------------
>
>                 Key: HBASE-13971
>                 URL: https://issues.apache.org/jira/browse/HBASE-13971
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.3.0
>         Environment: Caused while running IntegrationTestLoadAndVerify for 20 M rows on cluster with 32 region servers each with max heap size of 24GBs.
>            Reporter: Abhilash
>            Priority: Critical
>         Attachments: jstack.1, jstack.2, jstack.3, jstack.4, jstack.5, rsDebugDump.txt, screenshot-1.png
>
>
> One region server is stuck while flushing (possible deadlock). It has been trying to flush two regions for the last 6 hours (see the screenshot).
> This happened while running IntegrationTestLoadAndVerify for 20 M rows with 600 mapper jobs and 100 back references. There have been ~37 million writes on each regionserver so far, but no writes have happened on any regionserver for the past 6 hours, and their memstore size is zero (I don't know if this is related). This particular regionserver, however, has had a memstore size of 9 GB for the past 6 hours.
> Relevant snippets from the debug dump:
> Tasks:
> ===========================================================
> Task: Flushing IntegrationTestLoadAndVerify,R\x9B\x1B\xBF\xAE\x08\xD1\xA2,1435179555993.8e2d075f94ce7699f416ec4ced9873cd.
> Status: RUNNING:Preparing to flush by snapshotting stores in 8e2d075f94ce7699f416ec4ced9873cd
> Running for 22034s
> Task: Flushing IntegrationTestLoadAndVerify,\x93\xA385\x81Z\x11\xE6,1435179555993.9f8d0e01a40405b835bf6e5a22a86390.
> Status: RUNNING:Preparing to flush by snapshotting stores in 9f8d0e01a40405b835bf6e5a22a86390
> Running for 22033s
> Executors:
> ===========================================================
> ...
> Thread 139 (MemStoreFlusher.1):
>   State: WAITING
>   Blocked count: 139711
>   Waited count: 239212
>   Waiting on java.util.concurrent.CountDownLatch$Sync@b9c094a
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>     java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>     org.apache.hadoop.hbase.wal.WALKey.getSequenceId(WALKey.java:305)
>     org.apache.hadoop.hbase.regionserver.HRegion.getNextSequenceId(HRegion.java:2422)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2168)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2047)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2011)
>     org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1902)
>     org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1828)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>     java.lang.Thread.run(Thread.java:745)
> Thread 137 (MemStoreFlusher.0):
>   State: WAITING
>   Blocked count: 138931
>   Waited count: 237448
>   Waiting on java.util.concurrent.CountDownLatch$Sync@53f41f76
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>     java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>     org.apache.hadoop.hbase.wal.WALKey.getSequenceId(WALKey.java:305)
>     org.apache.hadoop.hbase.regionserver.HRegion.getNextSequenceId(HRegion.java:2422)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2168)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2047)
>     org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2011)
>     org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1902)
>     org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1828)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:510)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
>     org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>     java.lang.Thread.run(Thread.java:745)
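
Both MemStoreFlusher stacks above are parked in CountDownLatch.await(), reached from WALKey.getSequenceId. As a generic illustration of how such a wait can last indefinitely, and not a claim about the root cause here or about HBase's actual code, here is a minimal sketch. The class SequenceIdLatchSketch and its methods are hypothetical. A waiter blocks on a latch until another thread counts it down; if that thread fails first and never counts down, an unbounded await() never returns. The sketch uses a timed await so that it terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; this is not HBase's WALKey. */
public class SequenceIdLatchSketch {
    // Stands in for a sequence id that a separate writer thread assigns later.
    private final CountDownLatch assigned = new CountDownLatch(1);
    private volatile long sequenceId = -1L;

    /** Called by the writer once the id is known; releases all waiters. */
    public void assign(long id) {
        sequenceId = id;
        assigned.countDown();
    }

    /**
     * Bounded wait: returns the id, or -1 if it was not assigned in time.
     * An unbounded assigned.await() here would park forever if assign()
     * is never called.
     */
    public long getSequenceId(long timeoutMillis) {
        try {
            if (assigned.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return sequenceId;
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report "not assigned".
            Thread.currentThread().interrupt();
        }
        return -1L;
    }

    public static void main(String[] args) {
        // The writer never calls assign(): the waiter times out instead of hanging.
        SequenceIdLatchSketch neverAssigned = new SequenceIdLatchSketch();
        System.out.println(neverAssigned.getSequenceId(100)); // prints -1

        // The writer assigns before the waiter asks: the id is returned at once.
        SequenceIdLatchSketch ok = new SequenceIdLatchSketch();
        ok.assign(42L);
        System.out.println(ok.getSequenceId(100)); // prints 42
    }
}
```

Replacing the timed await with a plain await() in the first case would leave the calling thread in the WAITING state, which is the state both flusher threads show in the dump.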



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)