Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/06/22 01:56:47 UTC
[jira] [Updated] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
[ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans updated HBASE-3984:
--------------------------------------
Attachment: HBASE-3984-trunk.patch
HBASE-3984-0.90.patch
These patches for the 0.90 branch and trunk fix the issue by adding the checkOpen call to every method exposed by HRegionInterface. The one exception is in the branch, where I would have needed to add an IOException to one method's signature, so that change is only in trunk.
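To illustrate the idea behind the fix, here is a minimal, self-contained sketch. It is not the code from the attached patches: the class names, the `stopping` flag, the `beginShutdown` method, and the exception message are assumptions made for illustration, and the real HRegionInterface and HRegionServer have many more methods. It only shows why a guard at the top of getRegionInfo makes a location check fail once the server has started shutting down.

```java
import java.io.IOException;

public class CheckOpenSketch {
    /** Hypothetical stand-in for the RPC interface of a region server. */
    interface RegionInterface {
        String getRegionInfo(String regionName) throws IOException;
    }

    /** Hypothetical region server that tracks whether it is shutting down. */
    static class SketchRegionServer implements RegionInterface {
        private volatile boolean stopping = false;

        /** Marks the server as on its way down (for example after an OOME). */
        void beginShutdown() {
            stopping = true;
        }

        /** Guard: reject calls once the server is stopping. */
        private void checkOpen() throws IOException {
            if (stopping) {
                // Message borrowed from the mailing-list thread; the real
                // exception type and text may differ.
                throw new IOException("Server not running, aborting");
            }
        }

        @Override
        public String getRegionInfo(String regionName) throws IOException {
            checkOpen(); // without this call, a dying server would still answer
            return "info:" + regionName;
        }
    }

    /** Simplified stand-in for verifyRegionLocation: true only if the server answers. */
    static boolean verifyRegionLocation(RegionInterface server, String regionName) {
        try {
            return server.getRegionInfo(regionName) != null;
        } catch (IOException e) {
            return false; // the caller should look for a new location
        }
    }

    public static void main(String[] args) {
        SketchRegionServer rs = new SketchRegionServer();
        System.out.println(verifyRegionLocation(rs, ".META.")); // true
        rs.beginShutdown();
        System.out.println(verifyRegionLocation(rs, ".META.")); // false
    }
}
```

With the guard in place, the check stops reporting a closing server as a valid location, so callers can move on to the new one instead of waiting for the old server to exit completely.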
> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
> Key: HBASE-3984
> URL: https://issues.apache.org/jira/browse/HBASE-3984
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: 0.90.4
>
> Attachments: HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured out that the region servers weren't able to talk to the new .META. location because the old one was still alive but on its way down after an OOME.
> This shows up as exceptions like "Server not running" when trying to edit .META. Digging into the code, I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation, since we force the refresh. In this method we check whether the RS is good by calling getRegionInfo, which *does not* check whether the region server is trying to close.
> This means that a cluster can't recover from the failure of a .META.-serving RS until that RS has fully shut down: every time an RS tries to open a region (for example right after log splitting) or split one, it fails editing .META.