Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2014/11/21 19:49:34 UTC
[jira] [Commented] (HBASE-12554) TestBaseLoadBalancer may timeout due to lengthy rack lookup

    [ https://issues.apache.org/jira/browse/HBASE-12554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221242#comment-14221242 ] 

stack commented on HBASE-12554:
-------------------------------

Looks like the commit adding debugging and timeouts surfaced the root cause of zombie TestBaseLoadBalancer.

The code you have added here has the balancer wait a minute and then just move on.  The 60 seconds should be configurable.  Does the cancel actually interrupt the ongoing lookup, or does it leave it hanging?  The implementation of DNSToSwitchMapping is primitive.  We could do better (use more of CachedDNSToSwitchMapping).  I like the idea of complaining if this lookup goes on too long.  Could punt all this to a new issue.
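
To make the timeout question concrete, here is a rough sketch of a lookup with a configurable wait. It is not the code in the attached patch: TimedRackLookup, its constructor argument, and the UNKNOWN_RACK value are made up for illustration. On the cancel question, my understanding is that Future.cancel(true) only interrupts the worker thread, and a thread blocked in the native InetAddress lookup probably will not react to that, so the lookup may well keep hanging in the background. That is why the sketch uses daemon threads.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not the code in the attached patch.
class TimedRackLookup {
  // Placeholder value for this sketch.
  static final String UNKNOWN_RACK = "Unknown Rack";

  private final ExecutorService pool;
  private final long timeoutMs;

  // timeoutMs would be read from configuration instead of being a fixed 60 seconds.
  TimedRackLookup(long timeoutMs) {
    this.timeoutMs = timeoutMs;
    // Daemon threads, so a lookup stuck in native code cannot keep the JVM alive.
    this.pool = Executors.newCachedThreadPool(r -> {
      Thread t = new Thread(r, "rack-lookup");
      t.setDaemon(true);
      return t;
    });
  }

  // Runs the lookup on the pool and gives up after timeoutMs.
  String getRack(Callable<String> lookup) {
    Future<String> future = pool.submit(lookup);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Requests an interrupt; a lookup blocked in native code may ignore it.
      future.cancel(true);
      return UNKNOWN_RACK;
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt();
      return UNKNOWN_RACK;
    } catch (ExecutionException e) {
      // The lookup itself failed; treat the rack as unknown.
      return UNKNOWN_RACK;
    }
  }
}
```

A caller would pass the real resolve call as the Callable, and the timeout branch is the natural place to log a complaint about a slow lookup.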

To fix the breaking test: RackManager loads an instance of the class specified at hbase.util.ip.to.rack.determiner, which is an implementation of DNSToSwitchMapping.  Why not put a mock of DNSToSwitchMapping in the test that returns immediately?  Who cares about a lookup in a test?  Why wait?
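
Something along these lines is what I have in mind for the mock. To keep the sketch compilable without Hadoop on the classpath, SwitchMapping below is a local stand-in that declares only the resolve method; MockMapping and the "/default-rack" value are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for Hadoop's DNSToSwitchMapping, declaring only resolve(),
// so that this sketch compiles on its own. Not the real interface.
interface SwitchMapping {
  List<String> resolve(List<String> names);
}

// Hypothetical test-only mapping: no DNS, every host lands in the same rack.
class MockMapping implements SwitchMapping {
  static final String RACK = "/default-rack";

  @Override
  public List<String> resolve(List<String> names) {
    // One rack entry per requested host, returned without any network call.
    List<String> racks = new ArrayList<>(names.size());
    for (int i = 0; i < names.size(); i++) {
      racks.add(RACK);
    }
    return racks;
  }
}
```

In the real test the class would implement org.apache.hadoop.net.DNSToSwitchMapping directly (including whatever other methods that interface declares in the Hadoop version in use) and, I assume, be registered with something like conf.setClass("hbase.util.ip.to.rack.determiner", MockMapping.class, DNSToSwitchMapping.class) before the balancer is created.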



> TestBaseLoadBalancer may timeout due to lengthy rack lookup
> -----------------------------------------------------------
>
>                 Key: HBASE-12554
>                 URL: https://issues.apache.org/jira/browse/HBASE-12554
>             Project: HBase
>          Issue Type: Test
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: 12554-v1.txt
>
>
> Here is one of the recent occurrences (https://builds.apache.org/job/PreCommit-HBASE-Build/11778/console):
> {code}
> testImmediateAssignment(org.apache.hadoop.hbase.master.balancer.TestBaseLoadBalancer)  Time elapsed: 30.019 sec  <<< ERROR!
> java.lang.Exception: test timed out after 30000 milliseconds
> 	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
> 	at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
> 	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
> 	at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1162)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1098)
> 	at java.net.InetAddress.getByName(InetAddress.java:1048)
> 	at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:561)
> 	at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:578)
> 	at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
> 	at org.apache.hadoop.hbase.master.RackManager.getRack(RackManager.java:66)
> 	at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer$Cluster.<init>(BaseLoadBalancer.java:273)
> 	at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer.createCluster(BaseLoadBalancer.java:1113)
> 	at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer.randomAssignment(BaseLoadBalancer.java:1175)
> 	at org.apache.hadoop.hbase.master.balancer.BaseLoadBalancer.immediateAssignment(BaseLoadBalancer.java:1145)
> 	at org.apache.hadoop.hbase.master.balancer.TestBaseLoadBalancer.testImmediateAssignment(TestBaseLoadBalancer.java:136)
> {code}
> One possible fix is to submit CachedDNSToSwitchMapping.resolve() to executor pool for execution. RackManager.getRack() can set a timeout beyond which UNKNOWN_RACK is returned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)