Posted to dev@whirr.apache.org by "Tom White (JIRA)" <ji...@apache.org> on 2011/05/25 19:06:47 UTC

[jira] [Created] (WHIRR-314) HBase integration test can fail due to Thrift server race

HBase integration test can fail due to Thrift server race
---------------------------------------------------------

                 Key: WHIRR-314
                 URL: https://issues.apache.org/jira/browse/WHIRR-314
             Project: Whirr
          Issue Type: Bug
            Reporter: Tom White
            Assignee: Tom White


There is a race condition where the Thrift server comes up faster than the master, fails to connect (after trying 10 times), then shuts down for good. Both Andrei and I have seen this fail on Rackspace Cloud Servers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (WHIRR-314) HBase integration test can fail due to Thrift server race

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/WHIRR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated WHIRR-314:
----------------------------

    Attachment: WHIRR-314.patch

Updated patch which addresses Andrei's comment. I'm going to commit this now. 



[jira] [Commented] (WHIRR-314) HBase integration test can fail due to Thrift server race

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/WHIRR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039212#comment-13039212 ] 

Tom White commented on WHIRR-314:
---------------------------------

Here's a stack trace from the Thrift server node:

{noformat}
2011-05-25 16:40:19,672 INFO org.apache.hadoop.hbase.client.HConnectionManager$TableServers: getMaster attempt 9 of 10 failed; no more retrying.
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/master
     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:481)
     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readMasterAddressOrThrow(ZooKeeperWrapper.java:377)
     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getMaster(HConnectionManager.java:381)
     at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:78)
     at org.apache.hadoop.hbase.thrift.ThriftServer$HBaseHandler.<init>(ThriftServer.java:191)
     at org.apache.hadoop.hbase.thrift.ThriftServer.doMain(ThriftServer.java:817)
     at org.apache.hadoop.hbase.thrift.ThriftServer.main(ThriftServer.java:874)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/master
     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
     at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.readAddressOrThrow(ZooKeeperWrapper.java:477)
     ... 6 more
2011-05-25 16:40:19,677 INFO org.apache.zookeeper.ZooKeeper: Session: 0x1302806aebc0001 closed
2011-05-25 16:40:19,678 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <173-203-217-78.static.cloud-ips.com:2181:/hbase,org.apache.hadoop.hbase.client.HConnectionManager>Closed connection with ZooKeeper; /hbase/root-region-server
{noformat}



[jira] [Commented] (WHIRR-314) HBase integration test can fail due to Thrift server race

Posted by "Andrei Savu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/WHIRR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039299#comment-13039299 ] 

Andrei Savu commented on WHIRR-314:
-----------------------------------

+1 and we need the same change for CDH HBase in {{services/cdh/src/main/resources/functions/configure_cdh_hbase.sh}}. 

Side note: later we should make sure that tests do not block forever and instead fail after a reasonable amount of time (all the cleanup work is annoying).
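
One way to bound a test that might otherwise hang is to run the blocking step on a worker thread and give up after a deadline. The following is only a sketch of that idea using java.util.concurrent; the class name TimeoutSketch and the method finishesWithin are hypothetical and are not part of Whirr or of the attached patch.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only.
public class TimeoutSketch {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns true if the task finished in time, false if it timed out
     * or the waiting thread was interrupted.
     */
    public static boolean finishesWithin(Runnable task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck task
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The task itself threw; surface that as a failure, not a timeout.
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow(); // never leave the worker thread behind
        }
    }
}
```

A test could wrap its blocking call in finishesWithin and fail when the result is false. Note that cancellation relies on the task responding to interruption, so a task that ignores interrupts would still keep its thread alive.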



[jira] [Updated] (WHIRR-314) HBase integration test can fail due to Thrift server race

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/WHIRR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated WHIRR-314:
----------------------------

    Attachment: WHIRR-314.patch

This patch fixes the problem by increasing the number of retries to 100. I ran the integration test and it passed.
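
To illustrate why a larger retry budget closes the race, here is a minimal, self-contained sketch of a bounded retry loop. It is not the HBase client code and not the contents of the patch: the class RetrySketch, the method connectWithRetries, and the simulated master in masterReadyAfter are all hypothetical. The patch itself is not shown in this thread; presumably it raises a client retry setting such as hbase.client.retries.number, but that is an assumption.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of a bounded retry loop; not HBase code.
public class RetrySketch {

    /**
     * Calls probe up to maxAttempts times, sleeping pauseMillis between
     * failed attempts. Returns the 1-based number of the attempt that
     * succeeded, or -1 if every attempt failed or the thread was interrupted.
     */
    public static int connectWithRetries(BooleanSupplier probe, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return -1;
                }
            }
        }
        return -1; // gave up: the caller would shut down, as the Thrift server does
    }

    /** Simulated master that only becomes visible after readyAfter failed probes. */
    public static BooleanSupplier masterReadyAfter(int readyAfter) {
        int[] calls = {0};
        return () -> ++calls[0] > readyAfter;
    }

    public static void main(String[] args) {
        // With 10 attempts the slow master is never seen.
        System.out.println(connectWithRetries(masterReadyAfter(25), 10, 1));  // prints -1
        // With 100 attempts the same master is found on attempt 26.
        System.out.println(connectWithRetries(masterReadyAfter(25), 100, 1)); // prints 26
    }
}
```

Raising the limit does not remove the race; it only makes the window long enough that the master is expected to register before the client gives up.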



[jira] [Resolved] (WHIRR-314) HBase integration test can fail due to Thrift server race

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/WHIRR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White resolved WHIRR-314.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.5.0

I've just committed this.

