Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2015/04/23 07:44:38 UTC

[jira] [Commented] (HBASE-13535) Regions go unassigned when meta-carrying RS is killed

    [ https://issues.apache.org/jira/browse/HBASE-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508498#comment-14508498 ] 

stack commented on HBASE-13535:
-------------------------------

Having trouble figuring out what is going on. Out on the crashed server, it looks like the region had been successfully closed minutes before the kill... but somehow up in zk, the region is in transition in state OPEN.  Let me see.  Also odd: the am.processServerShutdown handler should return the region for reassignment because it's in 'transition'. The logs seem to match except for this one....

143328 2015-04-21 17:05:07,576 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Reassigning 0 region(s) that c2024.halxg.cloudera.com,16020,1429660802192 was carrying (and 0 regions(s) that were opening on this server)

the 'and 0 regions'... should be 'and 6 regions'.... 
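For reference, a minimal sketch of where those two counts come from; this is my approximation of the ServerShutdownHandler accounting, not the actual code:

{code}
// Minimal sketch (approximate names, not the real ServerShutdownHandler code):
// the first number in that log line is the count of regions the dead server was
// carrying per RegionStates, the second is what am.processServerShutdown() hands
// back as regions-in-transition that were opening on it. Both come out 0 here,
// even though six regions had just been transitioned OPEN -> OFFLINE (see the
// SSH excerpt in the description below).
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class ShutdownAccountingSketch {
  public static void main(String[] args) {
    String deadServer = "c2024.halxg.cloudera.com,16020,1429660802192";
    Set<String> carried = Collections.emptySet();                // stand-in for the "was carrying" set
    List<String> openingInTransition = Collections.emptyList();  // stand-in for processServerShutdown(); expected 6
    System.out.println("Reassigning " + carried.size() + " region(s) that " + deadServer
        + " was carrying (and " + openingInTransition.size()
        + " regions(s) that were opening on this server)");
  }
}
{code}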

Looking...

> Regions go unassigned when meta-carrying RS is killed
> -----------------------------------------------------
>
>                 Key: HBASE-13535
>                 URL: https://issues.apache.org/jira/browse/HBASE-13535
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>
> hbase-1.1 will be the first release with DLR on by default. I've been running ITBLLs on a cluster trying to find issues with DLR. My first few runs ran nicely... but the current run failed complaining that regions are not online, and indeed recovery is stuck, making no progress.
> Upon examination, it looks to be an assignment rather than a DLR issue. A server carrying meta has its meta log replayed first, but we are seemingly failing to assign regions after meta is back online.
> Meantime, my regionserver logs are filling with complaints that regions are not online (we should dampen our logging of a region not being online...) and then the split log workers are stuck:
> {code}
> Thread 13206 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-2):
>   State: TIMED_WAITING
>   Blocked count: 45
>   Waited count: 59
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> Thread 13205 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-1):
>   State: TIMED_WAITING
>   Blocked count: 45
>   Waited count: 59
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> Thread 13204 (RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0):
>   State: TIMED_WAITING
>   Blocked count: 50
>   Waited count: 63
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.waitUntilRegionOnline(WALSplitter.java:1959)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.locateRegionAndRefreshLastFlushedSequenceId(WALSplitter.java:1857)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.groupEditsByServer(WALSplitter.java:1761)
>     org.apache.hadoop.hbase.wal.WALSplitter$LogReplayOutputSink.append(WALSplitter.java:1674)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1104)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1096)
>     org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1066)
> {code}
> ...complaining that:
> 2015-04-22 21:28:02,746 DEBUG [RS_LOG_REPLAY_OPS-c2021:16020-1] wal.WALSplitter: Used 134248328 bytes of buffered edits, waiting for IO threads...
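> The stacks above are just the writer threads parked in waitUntilRegionOnline. A minimal sketch of that wait pattern, reconstructed from the stack trace rather than taken from the actual WALSplitter code: if the region never comes back online, every writer sleeps and retries here until the replay gives up.
> {code}
> import java.util.concurrent.TimeUnit;
>
> // Rough shape of the wait implied by the dump above (hypothetical names,
> // not the real WALSplitter internals): re-probe the region's location and
> // sleep between attempts until it is online or a deadline passes.
> final class WaitUntilRegionOnlineSketch {
>   interface RegionLocator { boolean isOnline(String encodedRegionName); }
>
>   static boolean waitUntilRegionOnline(RegionLocator locator, String encodedRegionName,
>       long timeoutMs) throws InterruptedException {
>     long deadline = System.currentTimeMillis() + timeoutMs;
>     long pauseMs = 100;
>     while (System.currentTimeMillis() < deadline) {
>       if (locator.isOnline(encodedRegionName)) {
>         return true;                            // region came back, replay proceeds
>       }
>       TimeUnit.MILLISECONDS.sleep(pauseMs);     // <-- the Thread.sleep in the dump
>       pauseMs = Math.min(pauseMs * 2, 5000);    // back off a little between probes
>     }
>     return false;                               // timed out; caller gives up
>   }
> }
> {code}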
> The accounting seems off around here in SSH (ServerShutdownHandler), where it is moving the regions that were on the dead server to OFFLINE but reporting no regions to assign:
> {code}
> 143320 2015-04-21 17:05:07,571 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Mark regions in recovery for crashed server c2024.halxg.cloudera.com,16020,1429660802192 before assignment; regions=[]
> 143321 2015-04-21 17:05:07,572 DEBUG [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Adding to processed servers c2024.halxg.cloudera.com,16020,1429660802192
> 143322 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {8d63312bc39a39727afea627bb20fee4 state=OPEN, ts=1429660996054, server=c2024.halxg.cloudera.com,16020,1429660802192} to {8d63312bc39a39727afea627bb20fee4 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143323 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {64c97bf39441e09977332c02e628a8c2 state=OPEN, ts=1429660996045, server=c2024.halxg.cloudera.com,16020,1429660802192} to {64c97bf39441e09977332c02e628a8c2 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143324 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3f4ea5ea14653cee6006f13c7d06d10b state=OPEN, ts=1429660996066, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3f4ea5ea14653cee6006f13c7d06d10b state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143325 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {6eaf51e55c9c23356a697f286f473db8 state=OPEN, ts=1429660996051, server=c2024.halxg.cloudera.com,16020,1429660802192} to {6eaf51e55c9c23356a697f286f473db8 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143326 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {1cbe63cd29c7709209adbf4305ebc746 state=OPEN, ts=1429660996062, server=c2024.halxg.cloudera.com,16020,1429660802192} to {1cbe63cd29c7709209adbf4305ebc746 state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143327 2015-04-21 17:05:07,575 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] master.RegionStates: Transition {3a49d17e189b2a26eb73e1d43cf2d0ac state=OPEN, ts=1429660996053, server=c2024.halxg.cloudera.com,16020,1429660802192} to {3a49d17e189b2a26eb73e1d43cf2d0ac state=OFFLINE, ts=1429661107575, server=c2024.halxg.cloudera.com,16020,1429660802192}
> 143328 2015-04-21 17:05:07,576 INFO  [MASTER_SERVER_OPERATIONS-c2020:16000-0] handler.ServerShutdownHandler: Reassigning 0 region(s) that c2024.halxg.cloudera.com,16020,1429660802192 was carrying (and 0 regions(s) that were opening on this server)
> {code}
> ... and indeed these regions are never assigned.  I see DLR on occasion timing out like this....
> 2015-04-22 21:27:31,327 ERROR [RS_LOG_REPLAY_OPS-c2021:16020-1-Writer-0] wal.WALSplitter: Exiting thread
> ....
> Caused by:
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region IntegrationTestBigLinkedList,\xB3333333(,1429660019160.8d63312bc39a39727afea627bb20fee4. is not online on c2024.halxg.cloudera.com,16020,1429661908290
> ... which is one of the regions in the set that went OFFLINE but was never assigned.
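> A quick way to keep an eye on whether those regions ever get (re)assigned is to dump the master's regions-in-transition view from a client. A sketch against the 1.x Admin/ClusterStatus API (assuming hbase-site.xml for this cluster is on the classpath):
> {code}
> import java.util.Map;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.ClusterStatus;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Admin;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.master.RegionState;
>
> // Print the master's regions-in-transition map so stuck regions such as
> // 8d63312bc39a39727afea627bb20fee4 can be watched over time.
> public class DumpRegionsInTransition {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     try (Connection conn = ConnectionFactory.createConnection(conf);
>          Admin admin = conn.getAdmin()) {
>       ClusterStatus status = admin.getClusterStatus();
>       Map<String, RegionState> rit = status.getRegionsInTransition();
>       System.out.println(rit.size() + " region(s) in transition:");
>       for (Map.Entry<String, RegionState> e : rit.entrySet()) {
>         System.out.println("  " + e.getKey() + " -> " + e.getValue());
>       }
>     }
>   }
> }
> {code}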
> Digging....


