Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/06/30 00:29:33 UTC

[jira] [Resolved] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3984.
---------------------------------------

      Resolution: Fixed
    Release Note: 
In trunk:
All HRegionInterface methods will now throw a RegionServerStoppedException if the region server is stopped, whereas we used to check this for only a few methods.
SingleServerBulkAssigner no longer kills the Master when it gets IOEs; instead it just logs an error and the TimeoutMonitor takes care of picking up the pieces.

In 0.90:
Only a couple of checkOpen calls were added, in order to change as little code as possible while still fixing the issue.
    Hadoop Flags: [Reviewed]

Committed the 0.90 patch to branch and the other patch to trunk, including the fix that Ted pointed to. Thanks guys for the reviews.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured out that the region servers weren't able to talk to the new .META. location because the old one was still alive but on its way down after an OOME.
> This translates into exceptions like "Server not running" when trying to edit .META. Digging into the code, I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation, since we force the refresh. In this method we check whether the RS is good by calling getRegionInfo, which *does not* check whether the region server is trying to close.
> What this means is that a cluster can't recover from a .META.-serving RS failure until that RS has fully shut down, since every time an RS tries to open a region (like right after the log splitting) or split one, it fails editing .META.
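
To make the gap described above concrete, here is a minimal, self-contained Java sketch. It is not the actual HBase code: ToyRegionServer, its guardAllCalls flag and the locally defined RegionServerStoppedException are hypothetical stand-ins, and the simplified verifyRegionLocation only mimics the idea of probing a server through getRegionInfo. The only things taken from the issue are the names and the idea of a checkOpen-style guard.

```java
import java.io.IOException;

public class StoppedCheckSketch {

    /** Local stand-in for the exception named in the release note (assumption). */
    static class RegionServerStoppedException extends IOException {
        RegionServerStoppedException(String msg) {
            super(msg);
        }
    }

    /** Toy region server; a hypothetical model, not the real HRegionServer. */
    static class ToyRegionServer {
        private volatile boolean stopped = false;
        // false models the old behavior: only some methods check the stopped flag.
        private final boolean guardAllCalls;

        ToyRegionServer(boolean guardAllCalls) {
            this.guardAllCalls = guardAllCalls;
        }

        void stop() {
            stopped = true;
        }

        private void checkOpen() throws IOException {
            if (stopped) {
                throw new RegionServerStoppedException("Server not running");
            }
        }

        /** Read-only call used by the liveness probe; guarded only if guardAllCalls is set. */
        String getRegionInfo(String regionName) throws IOException {
            if (guardAllCalls) {
                checkOpen();
            }
            return "info:" + regionName;
        }

        /** An edit; guarded in both the old and the new behavior. */
        void put(String row) throws IOException {
            checkOpen();
        }
    }

    /** Simplified probe: the server is considered good if getRegionInfo succeeds. */
    static boolean verifyRegionLocation(ToyRegionServer server, String regionName) {
        try {
            server.getRegionInfo(regionName);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ToyRegionServer unguarded = new ToyRegionServer(false);
        unguarded.stop();
        System.out.println("unguarded probe says good: "
                + verifyRegionLocation(unguarded, ".META."));

        ToyRegionServer guarded = new ToyRegionServer(true);
        guarded.stop();
        System.out.println("guarded probe says good: "
                + verifyRegionLocation(guarded, ".META."));
    }
}
```

In this sketch, a stopping server without the guard still passes the probe even though every put against it fails, which is the situation the issue describes. With the guard on every call, the probe rejects the stopping server, so callers have a chance to look up the new location instead.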

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira