Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/03/17 22:58:29 UTC

[jira] Created: (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

CatalogTracker.waitForMeta can wait forever and totally stall a RS
------------------------------------------------------------------

                 Key: HBASE-3668
                 URL: https://issues.apache.org/jira/browse/HBASE-3668
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.1
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.90.2


We got into a weird situation after hitting HBASE-3664 that left almost half the region servers unable to open regions (we didn't kill the master, but all the RS were restarted).

Here's the relevant jstack from the region servers:

{code}

"regionserver60020-EventThread" daemon prio=10 tid=0x0000000041297800 nid=0x5c5 in Object.wait() [0x00007fbd2b7f6000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00007fbd4e1ac6d0> (a java.util.concurrent.atomic.AtomicBoolean)
	at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:328)
	- locked <0x00007fbd4e1ac6d0> (a java.util.concurrent.atomic.AtomicBoolean)
	at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:363)
	at org.apache.hadoop.hbase.zookeeper.MetaNodeTracker.nodeDeleted(MetaNodeTracker.java:64)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeCreated(ZooKeeperNodeTracker.java:154)
	- locked <0x00007fbd4d7ab060> (a org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDataChanged(ZooKeeperNodeTracker.java:179)
	- locked <0x00007fbd4d7ab060> (a org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
{code}

Every server in that state cannot receive any more ZK events. See how we first go through nodeDataChanged, then nodeCreated, then nodeDeleted. This correlates with what's in the logs:

{code}
2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020-0x12d627b723e081f Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/1028785192
2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
2011-03-17 17:57:03,637 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x12d627b723e081f Set watcher on existing znode /hbase/unassigned/1028785192
2011-03-17 17:57:03,637 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: Detected completed assignment of META, notifying catalog tracker
2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.
{code}

The node is updated, then goes missing, then comes back! The reason we're stuck is that CT.waitForMeta will wait forever:
{code}
if (getMetaServerConnection(true) != null) {
  return metaLocation;
}
while (!stopped && !metaAvailable.get() &&
    (timeout == 0 || System.currentTimeMillis() < stop)) {
  metaAvailable.wait(timeout);
}
{code}

So the first call to getMetaServerConnection didn't work, and since the timeout is 0 it waits indefinitely. Even if it were looping, metaAvailable would never become true, since we're already inside the event thread and are therefore blocking every other event!
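
For reference, Object.wait(0) means wait with no timeout at all, so only a notify or an interrupt can wake the thread. A minimal sketch of the same self-deadlock pattern (hypothetical names, not the actual CatalogTracker code):

{code}
// Hypothetical sketch: a thread that waits on a condition that only it
// can ever signal will park forever, exactly like the EventThread above.
public class SelfDeadlockSketch {
  private final Object metaAvailable = new Object();

  // Runs on the single ZK event thread: it parks waiting for a notify...
  void onNodeDeleted() throws InterruptedException {
    synchronized (metaAvailable) {
      metaAvailable.wait(0);  // wait(0) == wait with no timeout
    }
  }

  // ...but the notify would be delivered by that same event thread,
  // which is currently parked in onNodeDeleted(), so it never runs.
  void onMetaAssigned() {
    synchronized (metaAvailable) {
      metaAvailable.notifyAll();
    }
  }
}
{code}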

This is what happens when the RS then tries to update .META. after opening a region:

{code}
2011-03-17 17:59:06,557 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Interrupting thread Thread[PostOpenDeployTasks:f06b630822a161d0a8fe37481d851d05,5,main]
2011-03-17 17:59:06,557 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception running postOpenDeployTasks; region=f06b630822a161d0a8fe37481d851d05
org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: Interrupted
        at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:365)
        at org.apache.hadoop.hbase.catalog.MetaEditor.updateRegionLocation(MetaEditor.java:142)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1359)

{code}

The master eventually tries to assign the region somewhere else, where it may work, but then the balancer will screw everything up again.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009954#comment-13009954 ] 

Hudson commented on HBASE-3668:
-------------------------------

Integrated in HBase-TRUNK #1803 (See [https://hudson.apache.org/hudson/job/HBase-TRUNK/1803/])
    HBASE-3668  CatalogTracker.waitForMeta can wait forever and totally stall a RS



[jira] Commented: (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008623#comment-13008623 ] 

Ted Yu commented on HBASE-3668:
-------------------------------

In OpenRegionHandler, line 94:
{code}
    region = openRegion();
    if (region == null) return;
{code}
I think we need a mechanism for OpenRegionHandler to convey to the region server that every region failed opening, so the region server can in turn inform the master. A sketch of the idea follows.
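
One possible shape for such a mechanism, sketched with entirely hypothetical names (this is not the actual HBase API, just an illustration of the idea):

{code}
// Hypothetical sketch only -- none of these names exist in HBase.
interface RegionOpenListener {
  void onOpenFailed(String regionName);
}

class OpenRegionHandlerSketch {
  private final RegionOpenListener regionServer;

  OpenRegionHandlerSketch(RegionOpenListener regionServer) {
    this.regionServer = regionServer;
  }

  void process(String regionName) {
    Object region = openRegion(regionName);
    if (region == null) {
      // Instead of returning silently, report the failure so the region
      // server can track it and, in turn, inform the master.
      regionServer.onOpenFailed(regionName);
      return;
    }
    // ... postOpenDeployTasks, etc.
  }

  private Object openRegion(String regionName) {
    return null; // stub: the real handler opens the HRegion here
  }
}
{code}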


[jira] Commented: (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008581#comment-13008581 ] 

Jean-Daniel Cryans commented on HBASE-3668:
-------------------------------------------

@Ted 
Those lines come from ReplicationSink, which uses an HTable, so you see output like that being printed. I should have left that line out as it's irrelevant.


[jira] Commented: (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008619#comment-13008619 ] 

Ted Yu commented on HBASE-3668:
-------------------------------

Were the defective region servers online?
The master would distribute regions to every online server.


[jira] Commented: (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008613#comment-13008613 ] 

Jean-Daniel Cryans commented on HBASE-3668:
-------------------------------------------

bq. The master eventually tries to assign the region somewhere else where it may work, but then the balancer will screw everything up again.

What I meant by that is that the cluster was able to reach a "stable" state once the defective region servers were finally stripped of every region they were assigned, but then the balancer tries to even out the distribution, so the cycle repeats. In essence, the balancer isn't doing anything wrong... it couldn't know that those RS were in a bad state.


[jira] [Resolved] (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3668.
---------------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed to branch and trunk, thanks for looking at it Stack.


[jira] Commented: (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008518#comment-13008518 ] 

Ted Yu commented on HBASE-3668:
-------------------------------

@J-D: Can you post more log after this line?
{code}
2011-03-17 17:57:06,988 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Lookedup root region location, connection=org.
{code}

I also noticed that the refresh flag passed to getMetaServerConnection() is always true. We could remove the refresh parameter.
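
For illustration, a flag that is true at every call site carries no information and can be dropped; a sketch of the refactor with simplified types (the real CatalogTracker signature differs):

{code}
// Sketch with simplified types; not the real method signatures.
class RefreshFlagSketch {
  // Before: every caller passes refresh = true, so the flag is dead weight.
  Object getMetaServerConnection(boolean refresh) {
    if (refresh) {
      // re-read the meta location instead of trusting the cached one
    }
    return null; // stub: verify and return the connection here
  }

  // After: drop the parameter and refresh unconditionally, since that is
  // what every call site asked for anyway.
  Object getMetaServerConnection() {
    // re-read the meta location, verify the connection, return it
    return null; // stub
  }
}
{code}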


[jira] [Commented] (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009863#comment-13009863 ] 

stack commented on HBASE-3668:
------------------------------

Can you do something about the metaAvailable.wait where timeout == 0?


[jira] [Commented] (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009864#comment-13009864 ] 

Jean-Daniel Cryans commented on HBASE-3668:
-------------------------------------------

Will commit with metaAvailable.wait(timeout == 0 ? 50 : timeout);
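
Applied to the loop quoted in the description, that one-liner turns the unbounded wait into a bounded poll:

{code}
while (!stopped && !metaAvailable.get() &&
    (timeout == 0 || System.currentTimeMillis() < stop)) {
  // With timeout == 0, never park indefinitely: wake every 50 ms and
  // re-check the loop conditions instead of waiting for a notify that
  // the blocked event thread can never deliver.
  metaAvailable.wait(timeout == 0 ? 50 : timeout);
}
{code}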


[jira] Commented: (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008624#comment-13008624 ] 

Jean-Daniel Cryans commented on HBASE-3668:
-------------------------------------------

@Ted

I understand that the bug report has a lot of information, but it does answer those questions. Yes, the region servers were online. Yes, they got regions. But, as I showed, they were unable to update .META. and eventually failed with NotAllMetaRegionsOnlineException. The master then saw the RIT timeout and reassigned the regions, which might land on another broken region server or not. Eventually they all get opened, but then the balancer runs again. Lather, rinse, repeat.

[jira] [Updated] (HBASE-3668) CatalogTracker.waitForMeta can wait forever and totally stall a RS

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3668:
--------------------------------------

    Attachment: HBASE-3668.patch

This patch moves the check of the meta location inside the loop instead of leaving it outside; that way, if you're on the thread that comes from ZK, you won't end up waiting on yourself.
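A minimal sketch of the shape of the fix (simplified stand-ins, not the exact patch): the meta-location check moves inside the loop and the wait uses a bounded slice, so the waiter makes progress even when the wake-up notification can never be delivered.

{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of the fixed wait loop (stand-in names, not the exact
// patch). Key change: getMetaServerConnection() is retried inside the loop,
// so progress no longer depends on a notifyAll() that the blocked ZK event
// thread itself would have to deliver.
public class WaitForMetaSketch {
  private final AtomicBoolean metaAvailable = new AtomicBoolean(false);
  private volatile boolean stopped = false;
  private volatile String metaLocation;  // stand-in for the server address

  // Stand-in for CatalogTracker.getMetaServerConnection(boolean).
  private String getMetaServerConnection(boolean refresh) {
    return metaAvailable.get() ? metaLocation : null;
  }

  public String waitForMeta(long timeout) throws InterruptedException {
    long stop = System.currentTimeMillis() + timeout;
    synchronized (metaAvailable) {
      while (!stopped && (timeout == 0 || System.currentTimeMillis() < stop)) {
        // Re-check on every iteration instead of only once before the loop.
        String location = getMetaServerConnection(true);
        if (location != null) {
          return location;
        }
        // Bounded wait slice: the loop condition and the location are
        // re-evaluated even if no notification ever arrives.
        metaAvailable.wait(200);
      }
    }
    return null;
  }
}
{code}

Re-checking inside the loop means that even a thread stuck on the ZK event path finds an already-available location instead of waiting on itself; the bounded wait slice is an extra safety net in this sketch.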
