You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/06/13 20:36:51 UTC

[jira] [Created] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
---------------------------------------------------------------------------------

                 Key: HBASE-3984
                 URL: https://issues.apache.org/jira/browse/HBASE-3984
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.3
            Reporter: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.90.4


After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.

It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.

What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056832#comment-13056832 ] 

stack commented on HBASE-3984:
------------------------------

+1

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053647#comment-13053647 ] 

Jean-Daniel Cryans commented on HBASE-3984:
-------------------------------------------

I haven't changed trunk yet after today's discoveries, that's where the meaty part is. I don't want that in 0.90, I'm keeping it more confined in order to change as less code as possible but still fixing the bug.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053638#comment-13053638 ] 

stack commented on HBASE-3984:
------------------------------

+1 again for trunk patch. +1 on 0.90 patch but how do you get away w/o amending tests?  You are not changing everywhere?

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3984:
--------------------------------------

    Attachment: HBASE-3984-trunk.patch
                HBASE-3984-0.90.patch

Those patches for branch and trunk fix the issue by adding the checkOpen call to every method exposed HRegionInterface except in branch were I needed to add a IOException to one method so that change is only in trunk.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053493#comment-13053493 ] 

Jean-Daniel Cryans commented on HBASE-3984:
-------------------------------------------

Now that I'm running the unit tests, I can see that the Master doesn't handle the case where it's assigning a region to a region server that is shutting down. This will create some ripples, I'll try to keep it contained on trunk.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056835#comment-13056835 ] 

Ted Yu commented on HBASE-3984:
-------------------------------

In TestSplitTransactionOnCluster.java:
{code}
+    while (cluster.getMaster().getClusterStatus().getServers().size() == 2) {
{code}
Maybe a constant should be introduced, e.g. INIT_CLUSTER_SIZE so that it can be used by TESTING_UTIL.startMiniCluster() as well.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057587#comment-13057587 ] 

Hudson commented on HBASE-3984:
-------------------------------

Integrated in HBase-TRUNK #1998 (See [https://builds.apache.org/job/HBase-TRUNK/1998/])
    HBASE-3984  CT.verifyRegionLocation isn't doing a very good check,
            can delay cluster recovery

jdcryans : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/TestRegionRebalancing.java
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestHMasterRPCException.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/TestGlobalMemStoreSize.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/RegionServerStoppedException.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/ServerNotRunningYetException.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestRollingRestart.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/ServerNotRunningException.java


> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3984:
--------------------------------------

    Attachment: HBASE-3984-0.90-v2.patch

Less risky version of the patch for 0.90, passes the unit tests.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053029#comment-13053029 ] 

stack commented on HBASE-3984:
------------------------------

+1

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3984:
--------------------------------------

    Attachment: HBASE-3984-trunk-v2.patch

New patch for trunk that introduces a new exception for when a RS is asked to shutdown. It had a few ripple effects since some methods were accessed in the unit tests even after a RS was asked to shutdown (that's why I needed to change the way we wait on RSs to die in one test).

I'm also modifying the behavior of SingleServerBulkAssigner, it will not kill the master anymore when it gets an exception trying to talk to a RS (unless it's interrupted). Instead it will just log the problem and the RIT will get timed out by the TimeoutMonitor. It currently works for TestSplitTransactionOnCluster.

Most of the unit test pass but it's hard to tell with all the others that are already failing if it's caused by this patch.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053640#comment-13053640 ] 

stack commented on HBASE-3984:
------------------------------

Were you going to change the exception that comes up out of checkOpen?

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3984.
---------------------------------------

      Resolution: Fixed
    Release Note: 
In trunk:
All HRegionInferface methods will now throw a RegionServerStoppedException if it's in that state, whereas we used to only check it for a few methods.
SingleServerBulkAssigner will not kill the Master anymore when getting IOEs, instead it will just log an error and the TimeoutMonitor will take care of picking up the pieces.

In 0.90:
Only a couple of checkOpen calls were added in order to change as less code as possible while still fixing the issue.
    Hadoop Flags: [Reviewed]

Commmitted the 0.90 patch to branch and the other patch to trunk including the fix that Ted pointed to. Thanks guys for the reviews.

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90-v2.patch, HBASE-3984-0.90.patch, HBASE-3984-trunk-v2.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans reassigned HBASE-3984:
-----------------------------------------

    Assignee: Jean-Daniel Cryans

> CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-3984
>                 URL: https://issues.apache.org/jira/browse/HBASE-3984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3984-0.90.patch, HBASE-3984-trunk.patch
>
>
> After some extensive debugging in the thread [A sudden msg of "java.io.IOException: Server not running, aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured that the region servers weren't able to talk to the new .META. location because the old one was still alive but on it's way down after a OOME.
> It translates into exceptions like "Server not running" coming from trying to edit .META. and digging in the code I see that CT.waitForMetaServerConnectionDefault -> waitForMeta -> getMetaServerConnection(true) calls verifyRegionLocation since we force the refresh. In this method we check if the RS is good by calling getRegionInfo which *does not* check if the region server is trying to close.
> What this means is that a cluster can't recover a .META.-serving RS failure until it has fully shutdown since every time a RS tries to open a region (like right after the log splitting) or split it fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira