You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2009/09/17 21:38:57 UTC

[jira] Created: (HBASE-1851) Broken master failover

Broken master failover
----------------------

                 Key: HBASE-1851
                 URL: https://issues.apache.org/jira/browse/HBASE-1851
             Project: Hadoop HBase
          Issue Type: Bug
            Reporter: stack


Master crashed, SIGSEGV (0xb) at pc=0x00000031a40fea07, pid=14689, tid=1133910336.  Four other masters running ready to take the failover.  I see where we move to new master but there is an error:

{code}
2009-09-13 22:07:02,061 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address XX.XX.XX.251:20000 to ZooKeeper
2009-09-13 22:07:02,064 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to set state node in ZooKeeper
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/shutdown
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:522)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.setClusterState(ZooKeeperWrapper.java:279)
        at org.apache.hadoop.hbase.master.HMaster.writeAddressToZooKeeper(HMaster.java:270)
        at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:255)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.lang.reflect.Constructor.newInstance(Unknown Source)
        at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1200)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1241)
{code}

Is /hbase/shutdown now ephemeral?

Otherwise, the transition went off well it seems.

Except, if I look in zk -- this is a good while after teh event -- I do not see a master.. its empty.  Do we not record in zk on failover?

But then a split comes in:

2009-09-17 05:50:05,070 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2009-09-17 05:50:30,745 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT: enwikibase_dumpurls,,1253145470066: Daughters; enwikibase_dum
purls,,1253166628107, enwikibase_dumpurls,EzAdzwPBtG_o9BLsEqu4bV\x3D\x3D,1253166628107 from aa0-018-6.u.powerset.com,20020,1251458355425; 1 of 3
2009-09-17 05:50:30,745 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.91:20020, startcode: 1251458355425, load: (requests=0, r
egions=3, usedHeap=490, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
2009-09-17 05:50:30,838 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.49:20020, startcode: 1250638276455, load: (requests=3, r
egions=4, usedHeap=134, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
2009-09-17 05:50:31,117 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.128:20020, startcode: 1250638269214, load: (requests=5, 
regions=4, usedHeap=130, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
2009-09-17 05:50:31,119 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.221:20020, startcode: 1250638268709, load: (requests=5, 
regions=4, usedHeap=82, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
2009-09-17 05:50:31,150 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.75:20020, startcode: 1250638276632, load: (requests=9, r
egions=4, usedHeap=284, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
2009-09-17 05:50:31,215 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.180:20020, startcode: 1250638269143, load: (requests=11,
 regions=4, usedHeap=132, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
2009-09-17 05:50:31,265 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.121:20020, startcode: 1250638269297, load: (requests=5, 
regions=4, usedHeap=54, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
...

And we never recover from the above (12 hours and still at it).



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1851) Broken master failover

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756738#action_12756738 ] 

Jean-Daniel Cryans commented on HBASE-1851:
-------------------------------------------

The shutdown node exception is ok, this is just a warn that tells us the cluster is running... it could be clearer tho.

Also we see that the master znode was written just before so it should have stayed there. The region servers at the same moment should be telling in the logs that they found a new master. Something else happened after that?

> Broken master failover
> ----------------------
>
>                 Key: HBASE-1851
>                 URL: https://issues.apache.org/jira/browse/HBASE-1851
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>
> Master crashed, SIGSEGV (0xb) at pc=0x00000031a40fea07, pid=14689, tid=1133910336.  Four other masters running ready to take the failover.  I see where we move to new master but there is an error:
> {code}
> 2009-09-13 22:07:02,061 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address XX.XX.XX.251:20000 to ZooKeeper
> 2009-09-13 22:07:02,064 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to set state node in ZooKeeper
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/shutdown
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:522)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.setClusterState(ZooKeeperWrapper.java:279)
>         at org.apache.hadoop.hbase.master.HMaster.writeAddressToZooKeeper(HMaster.java:270)
>         at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:255)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
>         at java.lang.reflect.Constructor.newInstance(Unknown Source)
>         at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1200)
>         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1241)
> {code}
> Is /hbase/shutdown now ephemeral?
> Otherwise, the transition went off well it seems.
> Except, if I look in zk -- this is a good while after teh event -- I do not see a master.. its empty.  Do we not record in zk on failover?
> But then a split comes in:
> 2009-09-17 05:50:05,070 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2009-09-17 05:50:30,745 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT: enwikibase_dumpurls,,1253145470066: Daughters; enwikibase_dum
> purls,,1253166628107, enwikibase_dumpurls,EzAdzwPBtG_o9BLsEqu4bV\x3D\x3D,1253166628107 from aa0-018-6.u.powerset.com,20020,1251458355425; 1 of 3
> 2009-09-17 05:50:30,745 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.91:20020, startcode: 1251458355425, load: (requests=0, r
> egions=3, usedHeap=490, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:30,838 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.49:20020, startcode: 1250638276455, load: (requests=3, r
> egions=4, usedHeap=134, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,117 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.128:20020, startcode: 1250638269214, load: (requests=5, 
> regions=4, usedHeap=130, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,119 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.221:20020, startcode: 1250638268709, load: (requests=5, 
> regions=4, usedHeap=82, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,150 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.75:20020, startcode: 1250638276632, load: (requests=9, r
> egions=4, usedHeap=284, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,215 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.180:20020, startcode: 1250638269143, load: (requests=11,
>  regions=4, usedHeap=132, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,265 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.121:20020, startcode: 1250638269297, load: (requests=5, 
> regions=4, usedHeap=54, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> ...
> And we never recover from the above (12 hours and still at it).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1851) Broken master failover

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756749#action_12756749 ] 

stack commented on HBASE-1851:
------------------------------

Regionservers seemed to ride over the new master fine.  When split came in, assignment went wacky and the master seemed uncontactable:

{code}
2009-09-17 17:42:55,981 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 3 time(s).
2009-09-17 17:42:57,001 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 4 time(s).
2009-09-17 17:42:58,021 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 5 time(s).
2009-09-17 17:42:59,031 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 6 time(s).
2009-09-17 17:43:00,041 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 7 time(s).
2009-09-17 17:43:01,051 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 8 time(s).
2009-09-17 17:43:02,061 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 9 time(s).
2009-09-17 17:43:02,062 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:307)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
        at $Proxy0.regionServerReport(Unknown Source)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:476)
        at java.lang.Thread.run(Unknown Source)
2009-09-17 17:43:03,071 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 329804129 milliseconds - retrying
2009-09-17 17:43:04,091 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 0 time(s).
2009-09-17 17:43:05,101 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 1 time(s).
2009-09-17 17:43:06,111 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 2 time(s).
2009-09-17 17:43:07,121 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 3 time(s).
2009-09-17 17:43:08,131 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 4 time(s).
2009-09-17 17:43:09,151 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 5 time(s).
2009-09-17 17:43:10,171 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 6 time(s).
2009-09-17 17:43:11,181 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 7 time(s).
2009-09-17 17:43:12,191 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 8 time(s).
2009-09-17 17:43:13,201 INFO org.apache.hadoop.ipc.HBaseClient: Retrying connect to server: /XX.XX.XX.231:20000. Already tried 9 time(s).
2009-09-17 17:43:13,201 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:307)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
        at $Proxy0.regionServerReport(Unknown Source)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:476)
        at java.lang.Thread.run(Unknown Source)
{code}



> Broken master failover
> ----------------------
>
>                 Key: HBASE-1851
>                 URL: https://issues.apache.org/jira/browse/HBASE-1851
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>
> Master crashed, SIGSEGV (0xb) at pc=0x00000031a40fea07, pid=14689, tid=1133910336.  Four other masters running ready to take the failover.  I see where we move to new master but there is an error:
> {code}
> 2009-09-13 22:07:02,061 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Wrote master address XX.XX.XX.251:20000 to ZooKeeper
> 2009-09-13 22:07:02,064 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to set state node in ZooKeeper
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/shutdown
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:522)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.setClusterState(ZooKeeperWrapper.java:279)
>         at org.apache.hadoop.hbase.master.HMaster.writeAddressToZooKeeper(HMaster.java:270)
>         at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:255)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
>         at java.lang.reflect.Constructor.newInstance(Unknown Source)
>         at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1200)
>         at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1241)
> {code}
> Is /hbase/shutdown now ephemeral?
> Otherwise, the transition went off well it seems.
> Except, if I look in zk -- this is a good while after teh event -- I do not see a master.. its empty.  Do we not record in zk on failover?
> But then a split comes in:
> 2009-09-17 05:50:05,070 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2009-09-17 05:50:30,745 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT: enwikibase_dumpurls,,1253145470066: Daughters; enwikibase_dum
> purls,,1253166628107, enwikibase_dumpurls,EzAdzwPBtG_o9BLsEqu4bV\x3D\x3D,1253166628107 from aa0-018-6.u.powerset.com,20020,1251458355425; 1 of 3
> 2009-09-17 05:50:30,745 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.91:20020, startcode: 1251458355425, load: (requests=0, r
> egions=3, usedHeap=490, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:30,838 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.49:20020, startcode: 1250638276455, load: (requests=3, r
> egions=4, usedHeap=134, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,117 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.128:20020, startcode: 1250638269214, load: (requests=5, 
> regions=4, usedHeap=130, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,119 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.221:20020, startcode: 1250638268709, load: (requests=5, 
> regions=4, usedHeap=82, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,150 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.44.75:20020, startcode: 1250638276632, load: (requests=9, r
> egions=4, usedHeap=284, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,215 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.180:20020, startcode: 1250638269143, load: (requests=11,
>  regions=4, usedHeap=132, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> 2009-09-17 05:50:31,265 DEBUG org.apache.hadoop.hbase.master.RegionManager: Assigning for address: XX.XX.45.121:20020, startcode: 1250638269297, load: (requests=5, 
> regions=4, usedHeap=54, maxHeap=2031): total nregions to assign=2, nregions to reach balance=4, isMetaAssign=false
> ...
> And we never recover from the above (12 hours and still at it).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.